-
Notifications
You must be signed in to change notification settings - Fork 59
Adds support for --minimal option of VEP with related schema chagnes. #131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds support for --minimal option of VEP with related schema chagnes. #131
Conversation
Pull Request Test Coverage Report for Build 468
💛 - Coveralls |
21bcb72 to
28c2913
Compare
|
FYI, submitted changes from the two upstream PRs were merged/resolved. |
b0f1141 to
aba3254
Compare
aba3254 to
dcbc206
Compare
|
@arostamianfar while you are reviewing this I thought I share my latest finding re. gnomAD which is related to this PR: While looking at differences between my runs of VEP on gnomAD and the annotations that were already in gnomAD VCFs, I realized that there is this So with that, we currently support three modes:
I think there is value in keeping all 3 modes but please let me know if you have other thoughts about this. |
|
Re. the last comment and to be more specific: I am thinking whether supporting On the other hand, this adds some complexity to the code and also an extra column in the output BQ tables ( Thoughts? |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thorough investigations and catching all corner cases, Bashir! And nice find on the allele_number field!
My thoughts:
- I think we should support all 3 cases as you mentioned. I'm in favor of being flexible and since you have already implemented this logic, let's use it :)
- The logic in this class is getting very involved and confusing since there are several corner cases. I think creating a new
annotationfolder and pulling out code from this file would help a lot, but now that we know more about all of the corner cases, let's sit down and think about how we can redesign this in a more "easy-to-follow" manner. - I think we should do this refactoring/redesign before adding any new features to this file as it's becoming increasingly more difficult to follow.
- For this PR though, please submit as is after addressing my nits :)
| 'BigQuery table and stored as repeated fields with ' | ||
| 'corresponding alternate alleles. [EXPERIMENTAL]')) | ||
| parser.add_argument( | ||
| '--minimal_VEP_alt_matching', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/VEP/vep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| _COMPLETELY_DELETED_ALT = '-' | ||
|
|
||
| # The field name in the BigQuery table that holds annotation ALT. | ||
| _ANNOTATION_ALT = 'ALT' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 'alt' is a very generic name and it's not clear how it's different from the actual alt. Consider renaming this (e.g. 'matching_alt'?). Also, we have made all of our column names lowercase, so I think it makes sense to follow that convention here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this will appear under an annotation field, e.g., alternate_bases.CSQ.ALT. I agree that "ALT" is generic but I hoped being under the annotation field clarifies it and it is also how "the standard" refers to it. Per your comment, I changed it to allele_string (I think "matching_alt" is confusing as it may imply this is not the exact string that appeared in the VCF file). Please let me know if you prefer something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. allele_string sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, how about just 'allele'? given that it is what the standard has anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| _BREAKEND_ALT_RE = (re.compile( | ||
| r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$')) | ||
|
|
||
| def __init__(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think annotation logic deserves its own file and even folder now (AnnotationProcessor, Annotation class, util libs, etc). Please just add a TODO in this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, and the TODO is in the next PR at a place which definitely needs refactoring. _AnnotationProcessor itself becoming a separate class needs more thinking because it is mutating ProcessedVariant and I prefer ProcessedVariantFactory.create_processed_variant be the only public way of ProcessedVariant mutation. I think it is possible to get most of _AnnotationProcessor out into its own library module/class without exposing ProcessedVariant to it, but I need to think more about the design of it.
| mode=bigquery_util.TableFieldConstants.MODE_REPEATED, | ||
| description='List of {} annotations for this alternate.'.format( | ||
| annot_field)) | ||
| annotation_record.fields.append(bigquery.TableFieldSchema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of making these optional only if --minimal is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea for the next field (i.e., ambiguous_allele_string) and I did that (thanks for suggesting). For the actual ALT string, I see value in keeping it even for non --minimal cases. The reason is that the matching is not always an exact matching and I think it is good to keep this information in an easy to access way in the BQ table (this is an attempt to make sure no information is lost in the import process).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
| return found_alt, is_ambiguous | ||
| else: | ||
| self._annotation_alt_mismatch_counter.inc() | ||
| self._alt_mismatch_counter.inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consider rephrasing this as:
if not found_alt:
counter.inc()
log
return found_alt, is_ambiguous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| """ | ||
| if not alt_bases or not annotation_alt: | ||
| return False | ||
| # Finding common leading and trailing sub-strings of ALT and REF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a library function (I wonder if there is one already?) If not, we should make a generic substring matcher lib/function and use it here. Please fell free leave it as is for now though (we can do this in the more involved refactoring later).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed but the only library function I know is os.path.commonprefix (which acts on a list of strings and is used elsewhere in this file when there is a list of strings). I can use that here (for a list of two strings) and apply it on reversed strings for the common suffix, but I thought it is easier to understand if I directly implement it here. I can certainly move this to a separate library function when we do refactoring of this class.
| raise | ||
|
|
||
|
|
||
| class AnnotationOptions(VariantTransformsOptions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this new class!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack.
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review; PTAL at the review commit.
| _COMPLETELY_DELETED_ALT = '-' | ||
|
|
||
| # The field name in the BigQuery table that holds annotation ALT. | ||
| _ANNOTATION_ALT = 'ALT' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this will appear under an annotation field, e.g., alternate_bases.CSQ.ALT. I agree that "ALT" is generic but I hoped being under the annotation field clarifies it and it is also how "the standard" refers to it. Per your comment, I changed it to allele_string (I think "matching_alt" is confusing as it may imply this is not the exact string that appeared in the VCF file). Please let me know if you prefer something else.
|
|
||
| # The field name in the BigQuery table that indicates whether the annotation ALT | ||
| # matching was ambiguous or not. | ||
| _ANNOTATION_ALT_AMBIGUOUS = 'ambiguous_ALT' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that I also changed this from ambiguous_ALT to ambiguous_allele_string to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this one too.
| mode=bigquery_util.TableFieldConstants.MODE_REPEATED, | ||
| description='List of {} annotations for this alternate.'.format( | ||
| annot_field)) | ||
| annotation_record.fields.append(bigquery.TableFieldSchema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea for the next field (i.e., ambiguous_allele_string) and I did that (thanks for suggesting). For the actual ALT string, I see value in keeping it even for non --minimal cases. The reason is that the matching is not always an exact matching and I think it is good to keep this information in an easy to access way in the BQ table (this is an attempt to make sure no information is lost in the import process).
| _BREAKEND_ALT_RE = (re.compile( | ||
| r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$')) | ||
|
|
||
| def __init__(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, and the TODO is in the next PR at a place which definitely needs refactoring. _AnnotationProcessor itself becoming a separate class needs more thinking because it is mutating ProcessedVariant and I prefer ProcessedVariantFactory.create_processed_variant be the only public way of ProcessedVariant mutation. I think it is possible to get most of _AnnotationProcessor out into its own library module/class without exposing ProcessedVariant to it, but I need to think more about the design of it.
| return found_alt, is_ambiguous | ||
| else: | ||
| self._annotation_alt_mismatch_counter.inc() | ||
| self._alt_mismatch_counter.inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| raise | ||
|
|
||
|
|
||
| class AnnotationOptions(VariantTransformsOptions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack.
| 'BigQuery table and stored as repeated fields with ' | ||
| 'corresponding alternate alleles. [EXPERIMENTAL]')) | ||
| parser.add_argument( | ||
| '--minimal_VEP_alt_matching', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| """ | ||
| if not alt_bases or not annotation_alt: | ||
| return False | ||
| # Finding common leading and trailing sub-strings of ALT and REF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed but the only library function I know is os.path.commonprefix (which acts on a list of strings and is used elsewhere in this file when there is a list of strings). I can use that here (for a list of two strings) and apply it on reversed strings for the common suffix, but I thought it is easier to understand if I directly implement it here. I can certainly move this to a separate library function when we do refactoring of this class.
| """Adds all annotations to the given `proc_var`. | ||
| def _add_ambiguous_fields(self, annotations_list, ambiguous): | ||
| # type: (List[Dict[str, str]], bool) -> None | ||
| if self._minimal_match: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i think it's more readable to move thisif at the top so that we don't call this method if minimal_match is false. In general, methods should perform what they are meant to do and not become a no-op based on external factors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| _COMPLETELY_DELETED_ALT = '-' | ||
|
|
||
| # The field name in the BigQuery table that holds annotation ALT. | ||
| _ANNOTATION_ALT = 'ALT' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. allele_string sounds good!
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just one nit.
|
PTAL, needs reapproval for these changes. |
Please note that this PR is on top of the previous two PRs #125 and #126 so it includes three commits where only the last one is for this PR. Please only review the last commit.
Tested:
Added/updated unit-tests.
Also ran the pipeline on a small subset of gnomAD with --minimal_VEP_alt_matching and checked the output table's schema, counters, and logs for ambiguous ALTs.
Issue #81