-
Notifications
You must be signed in to change notification settings - Fork 26
collect effect annotation errors in dictionary, only look up overlapping... #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,5 @@ | ||
| from __future__ import print_function, division, absolute_import | ||
| import logging | ||
|
|
||
| from .common import group_by | ||
| from .core_logic import infer_transcript_effect | ||
|
|
@@ -11,7 +12,7 @@ class VariantAnnotator(object): | |
| def __init__(self, ensembl_release): | ||
| self.ensembl = pyensembl.EnsemblRelease(ensembl_release) | ||
|
|
||
| def describe_variant(self, variant): | ||
| def describe_variant(self, variant, raise_on_error=True): | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The most common use case is
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I figured
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The constructor feels a little clumsier to me since the typical usecase seems to be:
Having to create a new
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 |
||
| """ | ||
| Determine the effects of a variant on any transcripts it overlaps. | ||
| Returns a VariantEffect object. | ||
|
|
@@ -22,32 +23,55 @@ def describe_variant(self, variant): | |
| ref = variant.ref | ||
| alt = variant.alt | ||
|
|
||
| overlapping_genes = self.ensembl.genes_at_locus(contig, pos, end_pos) | ||
| overlapping_transcripts = self.ensembl.transcripts_at_locus( | ||
| contig, pos, end_pos) | ||
|
|
||
| if len(overlapping_genes) == 0: | ||
| return [], {} | ||
| else: | ||
| overlapping_transcripts = self.ensembl.transcripts_at_locus( | ||
| contig, pos, end_pos) | ||
|
|
||
| assert len(overlapping_transcripts) > 0, \ | ||
| "No transcripts found for mutation %s:%d %s>%s" % ( | ||
| contig, pos, ref, alt) | ||
| if len(overlapping_transcripts) == 0: | ||
| # intergenic variant | ||
| return VariantEffect( | ||
| variant=variant, | ||
| genes=[], | ||
| gene_transcript_effects={}, | ||
| errors={}) | ||
|
|
||
| # group transcripts by their gene ID | ||
| overlapping_transcript_groups = group_by( | ||
| overlapping_transcripts, field_name='gene_id') | ||
|
|
||
| # dictionary from gene ID to list of transcript effects | ||
| gene_transcript_effects_groups = {} | ||
|
|
||
| # mapping from Transcript objects to errors encountered | ||
| # while trying to annotated them | ||
| errors = {} | ||
|
|
||
| for (gene_id, transcripts) in overlapping_transcript_groups.items(): | ||
| effects = [] | ||
| for transcript in transcripts: | ||
| effect = infer_transcript_effect(variant, transcript) | ||
| effects.append(effect) | ||
| try: | ||
| effects.append(infer_transcript_effect(variant, transcript)) | ||
| except (AssertionError, ValueError) as error: | ||
| if raise_on_error: | ||
| raise | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No error for the raise?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you mean? (
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't realize it did that :) |
||
| else: | ||
| logging.warn( | ||
| "Encountered error annotating %s for %s: %s", | ||
| variant, | ||
| transcript, | ||
| error) | ||
| errors[transcript] = error | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not tie each error directly to the transcript effect inference that it came from, vs. having a parallel mapping?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure what you mean by "the transcript effect inference that it came from", can you say more?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When you error on an
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any ideas on what these transcript effects would look like? Would I just have a generic
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Aha, it wasn't entirely clear to me that we wouldn't have a TranscriptEffect object because we errored in creating it. Sure,
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, it's not really possible to make a e.g. On Mon, Feb 16, 2015 at 9:35 PM, Tavi Nathanson notifications@github.com
|
||
| gene_transcript_effects_groups[gene_id] = effects | ||
|
|
||
| # construct Gene objects for all the genes containing | ||
| # all the transcripts this variant overlaps | ||
| overlapping_genes = [ | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, in practice, the difference between this result and
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just thought it was inelegant (and a little wasteful) to do two queries to see what's overlapping the mutated locus. So, instead, I just look up the overlapping transcripts and construct the Gene objects from the transcripts.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, so it's the same result
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep, just computed via one overlap query instead of two. |
||
| self.ensembl.gene_by_id(gene_id) | ||
| for gene_id | ||
| in overlapping_transcript_groups.keys() | ||
| ] | ||
|
|
||
| return VariantEffect( | ||
| variant=variant, | ||
| genes=overlapping_genes, | ||
| gene_transcript_effects=gene_transcript_effects_groups) | ||
| gene_transcript_effects=gene_transcript_effects_groups, | ||
| errors=errors) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -82,7 +82,7 @@ def __init__(self, filename, drop_duplicates=True): | |
| elif filename.endswith(".maf"): | ||
| self.raw_reference_name, self.records = _load_maf(filename) | ||
| else: | ||
| raise ValueErrr("Unrecognized file type: %s" % (filename,)) | ||
| raise ValueError("Unrecognized file type: %s" % (filename,)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I prefer
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agrrree! |
||
|
|
||
|
|
||
| self.filename = filename | ||
|
|
@@ -117,13 +117,21 @@ def __str__(self): | |
| def __repr__(self): | ||
| return str(self) | ||
|
|
||
| def variant_effects(self): | ||
| def variant_effects(self, raise_on_error=True): | ||
| """ | ||
| Determine the impact of each variant, return a list of | ||
| VariantEffect objects. | ||
|
|
||
| Parameters | ||
| ---------- | ||
|
|
||
| raise_on_error : bool, optional. | ||
| Raise exception if error is encountered while annotating | ||
| transcripts, otherwise track errors in VariantEffect.errors | ||
| dictionary (default=True). | ||
| """ | ||
| return [ | ||
| self.annot.describe_variant(variant) | ||
| self.annot.describe_variant(variant, raise_on_error=raise_on_error) | ||
| for variant | ||
| in self.records | ||
| ] | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand the context here, but why not apply the same sorting to both (first by CDS length, then by transcript length), vs. filtering in one case and not in the other?
(Probably a silly question. I'm just not clear on the context.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand the question, so I'm not sure if this clarifies the assumptions: (1) the impact on a coding transcript is more interesting than the impact on a non-coding transcript (2) lacking any other info, a longer CDS is more interesting than a shorter one, (3) non-coding transcripts don't have a CDS, so instead sort them by full transcript length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the question is: why not leave the non-coding transcript stuff in there (when transcripts do have complete coding sequence annotations), and just have it lower in the sort order vs. filtering it out? Not saying you should. Just making sure I understand the choices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm being unimaginative: how do you get the CDS length of a transcript without a CDS?
np.inf? Also, is there a case where the result would be different, or are you just saying it would be more elegant to have a single sort key function?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like my comment had two parts!
Part of my thinking was the elegance of a single sort key.
But also, in terms of the difference in results: the results for the
ifclause can either contain both complete and incomplete transcript effects, or just complete transcript effects. Just curious about leaving the incomplete stuff in, but lower in the sort order.Probably not worthy of too much conversationing :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...but why leave in incomplete transcripts if we know for sure we're not going to use them? We can't compare their CDS length, so the key would have to be something like:
I find that less clear than explicitly dropping the incomplete terms.
If we wanted to unify both branches we'd need something like:
Which is more elegant but I think less clear on what's happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wasn't clear how/when we'd use the incomplete transcripts in the
elsecase, and bubbled that lack of clarity to ask about it for theifcase.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the transcripts in the else case e.g.
Incomplete,Noncoding, or something else without a CDS. If we have at least one coding transcript, we should ignore all those without a CDS. Does that make sense?(also, this is an axiom/assumption, not necessarily something well grounded in evidence. I could imagine a long ncRNA being the most important transcript of a gene)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that logic makes sense to me -- I just wanted to hear more about the assumption it was based on