Fix chrM / MT normalization #226

Merged
merged 2 commits into from
Sep 5, 2019

@iskandr iskandr commented Sep 5, 2019

In the dark ages of PyEnsembl we needed a quick way to annotate variants from hg19 and GRCh37 using Ensembl reference data. This led to a dirty chromosome normalization hack where we turn, e.g., "chr1" -> "1" and "chrM" -> "MT".

This was always a little questionable (since the mitochondrial sequences aren't actually the same) and, even worse, it is incorrect for references like GRCh38.

So, this PR gets rid of two aspects of contig normalization: the "chr" prefix is now preserved, and we no longer convert "M" -> "MT" for the mitochondrial genome.
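A minimal sketch of the before/after behavior described above. This is not the actual PyEnsembl code: `normalize_old` and `normalize_new` are hypothetical names, and the sketch ignores any other normalization the real function performs.

```python
def normalize_old(name):
    """Hypothetical stand-in for the old hack: strip "chr", rename M to MT."""
    result = str(name)
    # drop the UCSC-style prefix, but leave names like "chrUn_gl000220" alone
    if result.startswith("chr") and "_" not in result:
        result = result[3:]
    # standardize mitochondrial genome to be "MT"
    if result == "M":
        result = "MT"
    return result


def normalize_new(name):
    """Hypothetical stand-in for the new behavior: leave the name as given."""
    return str(name)


for contig in ["chr1", "chrM", "MT"]:
    print(contig, normalize_old(contig), normalize_new(contig))
```

With the old behavior "chrM" silently becomes "MT", which is the mismatch reported for GRCh38; with the new behavior the caller's contig name is passed through unchanged.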

Fixes: #225

To restore use of Ensembl data for hg19 variants we'll have to make a more local contig conversion option in Varcode.
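One possible shape for such a local, opt-in conversion, sketched under assumptions: `HG19_TO_ENSEMBL` and `convert_contig` are hypothetical names, not part of Varcode or PyEnsembl, and the table only covers the primary chromosomes. The mitochondrial contig is left out of the default mapping because, as noted above, the sequences are not actually the same.

```python
# Hypothetical explicit table: hg19 primary chromosome names -> Ensembl-style names
HG19_TO_ENSEMBL = {
    "chr%s" % c: str(c) for c in list(range(1, 23)) + ["X", "Y"]
}


def convert_contig(name, include_mito=False):
    """Map an hg19 contig name to an Ensembl-style name where a mapping exists.

    Names without an entry (alternate haplotypes, unplaced contigs) are
    returned unchanged. "chrM" is only converted when the caller opts in.
    """
    if include_mito and name == "chrM":
        return "MT"
    return HG19_TO_ENSEMBL.get(name, name)
```

Keeping the conversion explicit at the call site means it only applies where a caller has decided that treating hg19 coordinates as GRCh37 is acceptable.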

@coveralls

Coverage Status

Coverage decreased (-0.08%) to 79.923% when pulling f28cf5e on fix-MT-normalization into b48d736 on master.

# standardize mitochondrial genome to be "MT"
if result == "M":
    result = "MT"
if result.startswith("chr") and "_" not in result:
Contributor

Thought for the future: normalize_chromosome could be aware of its PyEnsembl release and normalize to that release's set of contig names?

Contributor Author

I was thinking something similar! It's a little tricky, since the normalization also gets applied to GTF file contigs, which are kind of a lower-level concept, but I suspect we could figure out a sane rewiring.
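A rough sketch of the release-aware idea discussed in this thread, assuming the set of valid contig names for a release is available up front. `normalize_to_known` and `known_contigs` are hypothetical; nothing like this exists in the PR itself.

```python
def normalize_to_known(name, known_contigs):
    """Return the spelling of `name` that the given reference actually uses.

    Tries the name as-is first, then with the "chr" prefix removed or added.
    Deliberately does not treat "M" and "MT" as aliases.
    """
    candidates = [name]
    if name.startswith("chr"):
        candidates.append(name[3:])
    else:
        candidates.append("chr" + name)
    for candidate in candidates:
        if candidate in known_contigs:
            return candidate
    raise ValueError("Contig %r not found in reference" % (name,))


ensembl_style = {"1", "2", "X", "MT"}
ucsc_style = {"chr1", "chr2", "chrX", "chrM"}
print(normalize_to_known("chr1", ensembl_style))
print(normalize_to_known("X", ucsc_style))
```

Failing loudly on an unknown contig, rather than guessing, avoids the silent "chrM" -> "MT" mismatch that motivated this PR.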

@iskandr iskandr merged commit dd3117c into master Sep 5, 2019
Successfully merging this pull request may close these issues.

Incorrect normalization of "chrM" on GRCh38 to "MT"