Adds support for --minimal option of VEP with related schema changes. #131

bashir2 · 2018-03-07T07:10:40Z

Please note that this PR is on top of the previous two PRs, #125 and #126, so it includes three commits, of which only the last one belongs to this PR. Please only review the last commit.

Tested:
Added/updated unit-tests.
Also ran the pipeline on a small subset of gnomAD with --minimal_VEP_alt_matching and checked the output table's schema, counters, and logs for ambiguous ALTs.

Issue #81

coveralls · 2018-03-07T07:14:12Z

Pull Request Test Coverage Report for Build 468

79 of 87 (90.8%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.02%) to 90.444%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
gcp_variant_transforms/options/variant_transform_options.py	2	4	50.0%
gcp_variant_transforms/libs/processed_variant.py	56	62	90.32%

Totals
Change from base Build 463:	0.02%
Covered Lines:	3426
Relevant Lines:	3788

💛 - Coveralls

bashir2 · 2018-03-09T21:13:50Z

FYI, submitted changes from the two upstream PRs were merged/resolved.

bashir2 · 2018-03-16T03:36:44Z

@arostamianfar while you are reviewing this, I thought I would share my latest finding on gnomAD, which is related to this PR. While looking at differences between my runs of VEP on gnomAD and the annotations that were already in the gnomAD VCFs, I realized that VEP has an --allele_number option and that gnomAD VCFs do in fact have ALLELE_NUM "annotations". This makes finding the corresponding ALT deterministic (so no more ambiguity). I have already implemented support for that and ran it on a sample gnomAD subset.

So with that, we currently support three modes:

1. Map annotation ALTs to ALTs without simulating --minimal.
2. Simulation of --minimal, which can introduce ambiguous cases.
3. Using ALLELE_NUM, which does not trigger either (1) or (2) logic and relies solely on ALLELE_NUM.

I think there is value in keeping all 3 modes but please let me know if you have other thoughts about this.
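
To illustrate why mode (3) is deterministic, here is a minimal sketch. It assumes that each annotation is available as a dict with an `ALLELE_NUM` key and that the value is a 1-based index into the record's ALT list (1 = first ALT). The function name `alt_for_annotation` is hypothetical and is not the code in this PR.

```python
def alt_for_annotation(alternate_bases, annotation):
    """Returns the ALT that an annotation refers to, using only ALLELE_NUM.

    Assumption: ALLELE_NUM is 1-based, i.e. 1 is the first ALT of the record.
    """
    allele_num = int(annotation['ALLELE_NUM'])
    if not 1 <= allele_num <= len(alternate_bases):
        raise ValueError('ALLELE_NUM {} is out of range for {} ALTs'.format(
            allele_num, len(alternate_bases)))
    # No string comparison is needed, so there is nothing to be ambiguous about.
    return alternate_bases[allele_num - 1]


chosen_alt = alt_for_annotation(['CA', 'CTA'], {'ALLELE_NUM': '2'})
```

Because the lookup is by index, two ALTs that would collapse to the same string under --minimal are still told apart.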

bashir2 · 2018-03-16T03:50:00Z

Re. the last comment, and to be more specific: I am wondering whether supporting --minimal without --allele_number (i.e., what this very PR does) is something that we want to keep. The only scenario in which this is helpful is if someone has annotated a lot of files with vep --minimal but no --allele_number and they still want to import their data to BQ (without rerunning vep, for whatever reason).

On the other hand, this adds some complexity to the code and also an extra column in the output BQ tables (ambiguous_ALT booleans). So maybe we should opt for only supporting the deterministic cases (i.e., (1) and (3) from my previous comment) and drop all ambiguous/--minimal related changes.

Thoughts?
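
For context on where the ambiguity comes from, here is a small sketch of simulating --minimal. It assumes that the minimal form is obtained by trimming the bases shared by REF and ALT, leading bases first and then trailing ones, and that an emptied allele is written as `-`. The helper names are hypothetical, and this is not claimed to be VEP's exact algorithm or the implementation in this PR.

```python
def minimal_alt(ref, alt):
    """Trims bases shared by REF and ALT; '-' stands for an emptied allele."""
    # Trim the common leading bases first.
    start = 0
    while start < len(ref) and start < len(alt) and ref[start] == alt[start]:
        start += 1
    ref, alt = ref[start:], alt[start:]
    # Then trim the common trailing bases from what is left.
    end = 0
    while end < len(ref) and end < len(alt) and ref[-1 - end] == alt[-1 - end]:
        end += 1
    if end:
        alt = alt[:-end]
    return alt or '-'


def alts_matching(ref, alternate_bases, annotation_alt):
    """Returns every ALT whose minimal form equals the annotation's ALT."""
    return [alt for alt in alternate_bases
            if minimal_alt(ref, alt) == annotation_alt]


# REF=CT with ALTs CA (substitution) and CTA (insertion): both reduce to 'A'.
matches = alts_matching('CT', ['CA', 'CTA'], 'A')
is_ambiguous = len(matches) > 1
```

With only the trimmed string to go on, both ALTs are valid candidates, which is the case the ambiguous_ALT column is meant to flag.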

arostamianfar

Thanks for the thorough investigations and catching all corner cases, Bashir! And nice find on the allele_number field!
My thoughts:

I think we should support all 3 cases as you mentioned. I'm in favor of being flexible and since you have already implemented this logic, let's use it :)
The logic in this class is getting very involved and confusing since there are several corner cases. I think creating a new annotation folder and pulling out code from this file would help a lot, but now that we know more about all of the corner cases, let's sit down and think about how we can redesign this in a more "easy-to-follow" manner.
I think we should do this refactoring/redesign before adding any new features to this file as it's becoming increasingly more difficult to follow.
For this PR though, please submit as is after addressing my nits :)

arostamianfar · 2018-03-15T17:58:41Z

gcp_variant_transforms/options/variant_transform_options.py

+              'BigQuery table and stored as repeated fields with '
+              'corresponding alternate alleles. [EXPERIMENTAL]'))
+    parser.add_argument(
+        '--minimal_VEP_alt_matching',


nit: s/VEP/vep

arostamianfar · 2018-03-15T21:05:17Z

gcp_variant_transforms/libs/processed_variant.py

+_COMPLETELY_DELETED_ALT = '-'
+
+# The field name in the BigQuery table that holds annotation ALT.
+_ANNOTATION_ALT = 'ALT'


nit: 'alt' is a very generic name and it's not clear how it's different from the actual alt. Consider renaming this (e.g. 'matching_alt'?). Also, we have made all of our column names lowercase, so I think it makes sense to follow that convention here as well.

Hmm, this will appear under an annotation field, e.g., alternate_bases.CSQ.ALT. I agree that "ALT" is generic but I hoped being under the annotation field clarifies it and it is also how "the standard" refers to it. Per your comment, I changed it to allele_string (I think "matching_alt" is confusing as it may imply this is not the exact string that appeared in the VCF file). Please let me know if you prefer something else.

I see. allele_string sounds good!

actually, how about just 'allele'? given that it is what the standard has anyway?

arostamianfar · 2018-03-16T15:10:58Z

gcp_variant_transforms/libs/processed_variant.py

  _BREAKEND_ALT_RE = (re.compile(
      r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$'))

  def __init__(self,


I think annotation logic deserves its own file and even folder now (AnnotationProcessor, Annotation class, util libs, etc). Please just add a TODO in this PR though.

Agreed, and the TODO is in the next PR at a place which definitely needs refactoring. _AnnotationProcessor itself becoming a separate class needs more thinking because it is mutating ProcessedVariant and I prefer ProcessedVariantFactory.create_processed_variant be the only public way of ProcessedVariant mutation. I think it is possible to get most of _AnnotationProcessor out into its own library module/class without exposing ProcessedVariant to it, but I need to think more about the design of it.

arostamianfar · 2018-03-16T15:12:32Z

gcp_variant_transforms/libs/processed_variant.py

          mode=bigquery_util.TableFieldConstants.MODE_REPEATED,
          description='List of {} annotations for this alternate.'.format(
              annot_field))
+      annotation_record.fields.append(bigquery.TableFieldSchema(


What do you think of making these optional only if --minimal is used?

This is a good idea for the next field (i.e., ambiguous_allele_string) and I did that (thanks for suggesting). For the actual ALT string, I see value in keeping it even for non --minimal cases. The reason is that the matching is not always an exact matching and I think it is good to keep this information in an easy to access way in the BQ table (this is an attempt to make sure no information is lost in the import process).

Sounds good!
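
The outcome of this thread can be sketched as follows. Plain dicts stand in for the `bigquery.TableFieldSchema` objects that the real code appends, and the function name, field types, and modes are assumptions made for illustration only.

```python
def annotation_record_fields(annotation_names, minimal_match):
    """Builds the sub-fields of one annotation record (illustrative only)."""
    # The annotation's own ALT string is always kept, because matching is not
    # always exact and the original string should not be lost on import.
    fields = [{'name': 'allele_string', 'type': 'STRING', 'mode': 'NULLABLE'}]
    # The ambiguity flag only makes sense when --minimal is being simulated.
    if minimal_match:
        fields.append({'name': 'ambiguous_allele_string',
                       'type': 'BOOLEAN',
                       'mode': 'NULLABLE'})
    for name in annotation_names:
        fields.append({'name': name, 'type': 'STRING', 'mode': 'NULLABLE'})
    return fields


default_names = [f['name'] for f in annotation_record_fields(['Gene'], False)]
minimal_names = [f['name'] for f in annotation_record_fields(['Gene'], True)]
```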

arostamianfar · 2018-03-16T15:45:05Z

gcp_variant_transforms/libs/processed_variant.py

+      return found_alt, is_ambiguous
    else:
-      self._annotation_alt_mismatch_counter.inc()
+      self._alt_mismatch_counter.inc()


nit: consider rephrasing this as:

if not found_alt:
  counter.inc()
  log
return found_alt, is_ambiguous

arostamianfar · 2018-03-16T15:50:55Z

gcp_variant_transforms/libs/processed_variant.py

+    """
+    if not alt_bases or not annotation_alt:
+      return False
+    # Finding common leading and trailing sub-strings of ALT and REF.


This looks like a library function (I wonder if there is one already?). If not, we should make a generic substring matcher lib/function and use it here. Please feel free to leave it as is for now though (we can do this in the more involved refactoring later).

Agreed but the only library function I know is os.path.commonprefix (which acts on a list of strings and is used elsewhere in this file when there is a list of strings). I can use that here (for a list of two strings) and apply it on reversed strings for the common suffix, but I thought it is easier to understand if I directly implement it here. I can certainly move this to a separate library function when we do refactoring of this class.
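
For reference, the `os.path.commonprefix` alternative described above could look roughly like this. It is a sketch of the idea only; the function name is hypothetical and it is not the code in this PR. The suffix is computed on what remains after removing the prefix, so the two never overlap.

```python
import os


def common_prefix_and_suffix(first, second):
    """Returns the (prefix, suffix) shared by two strings."""
    # os.path.commonprefix compares character by character, so it works on
    # arbitrary strings, not only on file paths.
    prefix = os.path.commonprefix([first, second])
    rest_first = first[len(prefix):]
    rest_second = second[len(prefix):]
    # The common suffix is the common prefix of the reversed remainders.
    reversed_suffix = os.path.commonprefix(
        [rest_first[::-1], rest_second[::-1]])
    return prefix, reversed_suffix[::-1]


prefix, suffix = common_prefix_and_suffix('CTAG', 'CAAG')
```

The reversals are what make this version harder to read than a direct loop, which is the trade-off mentioned in the reply.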

arostamianfar · 2018-03-16T15:52:39Z

gcp_variant_transforms/options/variant_transform_options.py

        raise


+class AnnotationOptions(VariantTransformsOptions):


Thanks for adding this new class!

bashir2

Thanks for the review; PTAL at the review commit.

bashir2 · 2018-03-16T19:03:08Z

gcp_variant_transforms/libs/processed_variant.py

+_COMPLETELY_DELETED_ALT = '-'
+
+# The field name in the BigQuery table that holds annotation ALT.
+_ANNOTATION_ALT = 'ALT'


Hmm, this will appear under an annotation field, e.g., alternate_bases.CSQ.ALT. I agree that "ALT" is generic but I hoped being under the annotation field clarifies it and it is also how "the standard" refers to it. Per your comment, I changed it to allele_string (I think "matching_alt" is confusing as it may imply this is not the exact string that appeared in the VCF file). Please let me know if you prefer something else.

bashir2 · 2018-03-16T19:06:01Z

gcp_variant_transforms/libs/processed_variant.py

+
+# The field name in the BigQuery table that indicates whether the annotation ALT
+# matching was ambiguous or not.
+_ANNOTATION_ALT_AMBIGUOUS = 'ambiguous_ALT'


Please note that I also changed this from ambiguous_ALT to ambiguous_allele_string to be consistent.

Updated this one too.

bashir2 · 2018-03-16T19:06:56Z

gcp_variant_transforms/libs/processed_variant.py

          mode=bigquery_util.TableFieldConstants.MODE_REPEATED,
          description='List of {} annotations for this alternate.'.format(
              annot_field))
+      annotation_record.fields.append(bigquery.TableFieldSchema(


This is a good idea for the next field (i.e., ambiguous_allele_string) and I did that (thanks for suggesting). For the actual ALT string, I see value in keeping it even for non --minimal cases. The reason is that the matching is not always an exact matching and I think it is good to keep this information in an easy to access way in the BQ table (this is an attempt to make sure no information is lost in the import process).

bashir2 · 2018-03-16T19:22:54Z

gcp_variant_transforms/libs/processed_variant.py

  _BREAKEND_ALT_RE = (re.compile(
      r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$'))

  def __init__(self,


Agreed, and the TODO is in the next PR at a place which definitely needs refactoring. _AnnotationProcessor itself becoming a separate class needs more thinking because it is mutating ProcessedVariant and I prefer ProcessedVariantFactory.create_processed_variant be the only public way of ProcessedVariant mutation. I think it is possible to get most of _AnnotationProcessor out into its own library module/class without exposing ProcessedVariant to it, but I need to think more about the design of it.

bashir2 · 2018-03-16T19:24:01Z

gcp_variant_transforms/libs/processed_variant.py

+      return found_alt, is_ambiguous
    else:
-      self._annotation_alt_mismatch_counter.inc()
+      self._alt_mismatch_counter.inc()


bashir2 · 2018-03-16T19:28:50Z

gcp_variant_transforms/options/variant_transform_options.py

        raise


+class AnnotationOptions(VariantTransformsOptions):


bashir2 · 2018-03-16T19:28:56Z

gcp_variant_transforms/options/variant_transform_options.py

+              'BigQuery table and stored as repeated fields with '
+              'corresponding alternate alleles. [EXPERIMENTAL]'))
+    parser.add_argument(
+        '--minimal_VEP_alt_matching',


bashir2 · 2018-03-16T19:40:09Z

gcp_variant_transforms/libs/processed_variant.py

+    """
+    if not alt_bases or not annotation_alt:
+      return False
+    # Finding common leading and trailing sub-strings of ALT and REF.


Agreed but the only library function I know is os.path.commonprefix (which acts on a list of strings and is used elsewhere in this file when there is a list of strings). I can use that here (for a list of two strings) and apply it on reversed strings for the common suffix, but I thought it is easier to understand if I directly implement it here. I can certainly move this to a separate library function when we do refactoring of this class.

arostamianfar · 2018-03-16T20:40:25Z

gcp_variant_transforms/libs/processed_variant.py

-    """Adds all annotations to the given `proc_var`.
+  def _add_ambiguous_fields(self, annotations_list, ambiguous):
+    # type: (List[Dict[str, str]], bool) -> None
+    if self._minimal_match:


nit: I think it's more readable to move this if up to the caller so that we don't call this method if minimal_match is false. In general, methods should perform what they are meant to do and not become a no-op based on external factors.
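
A sketch of the suggested shape, with the check at the call site. The class name, the `add_annotations` caller, and the body of `_add_ambiguous_fields` are assumptions for illustration; only the method name, `_minimal_match`, and the `ambiguous_allele_string` field name come from this PR.

```python
class AnnotationSketch(object):
    """Hypothetical stand-in for the annotation processing class."""

    def __init__(self, minimal_match):
        self._minimal_match = minimal_match

    def _add_ambiguous_fields(self, annotations_list, ambiguous):
        # Always does its job; it no longer checks self._minimal_match.
        for annotation in annotations_list:
            annotation['ambiguous_allele_string'] = ambiguous

    def add_annotations(self, annotations_list, ambiguous):
        # The caller decides whether the ambiguity flag applies at all.
        if self._minimal_match:
            self._add_ambiguous_fields(annotations_list, ambiguous)
        return annotations_list


with_flag = AnnotationSketch(True).add_annotations(
    [{'allele_string': 'A'}], True)
without_flag = AnnotationSketch(False).add_annotations(
    [{'allele_string': 'A'}], True)
```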

arostamianfar · 2018-03-16T20:41:07Z

gcp_variant_transforms/libs/processed_variant.py

+_COMPLETELY_DELETED_ALT = '-'
+
+# The field name in the BigQuery table that holds annotation ALT.
+_ANNOTATION_ALT = 'ALT'


I see. allele_string sounds good!

arostamianfar

Looks good! Just one nit.

bashir2 · 2018-03-16T21:04:59Z

PTAL, needs reapproval for these changes.

bashir2 requested a review from arostamianfar March 7, 2018 07:10

This was referenced Mar 7, 2018

Adds counters to processed_variant and creates a wrapper for Beam counters. #125

Merged

Extends variant annotation alt matching. #126

Merged

bashir2 force-pushed the alt_annotation_fix_minimal_review branch from 21bcb72 to 28c2913 Compare March 9, 2018 21:12

bashir2 force-pushed the alt_annotation_fix_minimal_review branch 5 times, most recently from b0f1141 to aba3254 Compare March 13, 2018 20:14

Adds support for --minimal option of VEP with related schema changes.

dcbc206

bashir2 force-pushed the alt_annotation_fix_minimal_review branch from aba3254 to dcbc206 Compare March 13, 2018 22:13

arostamianfar suggested changes Mar 16, 2018

bashir2 mentioned this pull request Mar 16, 2018

Adds the option for processing ALLELE_NUM. #141

Merged

bashir2 added 2 commits March 16, 2018 13:55

Merge branch 'master' into alt_annotation_fix_minimal_review

ee34cbc

review comments

b357283

bashir2 commented Mar 16, 2018

arostamianfar reviewed Mar 16, 2018

arostamianfar previously approved these changes Mar 16, 2018

review comments round two

e2f7b68

bashir2 dismissed arostamianfar’s stale review via e2f7b68 March 16, 2018 21:04

arostamianfar approved these changes Mar 16, 2018

bashir2 merged commit 9ddddfe into googlegenomics:master Mar 16, 2018

bashir2 deleted the alt_annotation_fix_minimal_review branch March 20, 2018 18:47
