Skip to content

Conversation

@bashir2
Copy link
Member

@bashir2 bashir2 commented Mar 7, 2018

Please note that this PR is on top of the previous two PRs #125 and #126 so it includes three commits where only the last one is for this PR. Please only review the last commit.

Tested:
Added/updated unit-tests.
Also ran the pipeline on a small subset of gnomAD with --minimal_VEP_alt_matching and checked the output table's schema, counters, and logs for ambiguous ALTs.

Issue #81

@bashir2 bashir2 requested a review from arostamianfar March 7, 2018 07:10
@coveralls
Copy link

coveralls commented Mar 7, 2018

Pull Request Test Coverage Report for Build 468

  • 79 of 87 (90.8%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 90.444%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/options/variant_transform_options.py 2 4 50.0%
gcp_variant_transforms/libs/processed_variant.py 56 62 90.32%
Totals Coverage Status
Change from base Build 463: 0.02%
Covered Lines: 3426
Relevant Lines: 3788

💛 - Coveralls

@bashir2
Copy link
Member Author

bashir2 commented Mar 9, 2018

FYI, submitted changes from the two upstream PRs were merged/resolved.

@bashir2 bashir2 force-pushed the alt_annotation_fix_minimal_review branch 5 times, most recently from b0f1141 to aba3254 Compare March 13, 2018 20:14
@bashir2 bashir2 force-pushed the alt_annotation_fix_minimal_review branch from aba3254 to dcbc206 Compare March 13, 2018 22:13
@bashir2
Copy link
Member Author

bashir2 commented Mar 16, 2018

@arostamianfar while you are reviewing this I thought I share my latest finding re. gnomAD which is related to this PR: While looking at differences between my runs of VEP on gnomAD and the annotations that were already in gnomAD VCFs, I realized that there is this --allele_number option in VEP and that in fact gnomAD VCFs do have ALLELE_NUM "annotations". This makes finding the corresponding ALT deterministic (so no more ambiguity). I have already implemented the support for that and ran on a sample gnomAD subset.

So with that, we currently support three modes:

  1. Map annotation ALTs to ALTs without simulating --minimal.
  2. Simulation of --minimal which can introduce ambiguous cases.
  3. Using ALLELE_NUM, which does not trigger either (1) or (2) logic and solely relies in ALLELE_NUM.

I think there is value in keeping all 3 modes but please let me know if you have other thoughts about this.

@bashir2
Copy link
Member Author

bashir2 commented Mar 16, 2018

Re. the last comment and to be more specific: I am thinking whether supporting --minimal without --allele_number (i.e., what this very PR does) is something that we want to keep. The only scenario that this is helpful is if someone annotates a lot of files with vep --minimal but no --allele_number and they still want to import their data to BQ (without rerunning vep for whatever reasons).

On the other hand, this adds some complexity to the code and also an extra column in the output BQ tables (ambiguous_ALT booleans). So maybe we should opt for only supporting the deterministic cases (i.e,. (1) and (3) from my previous comment) and drop all ambiguous/--minimal related changes.

Thoughts?

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough investigations and catching all corner cases, Bashir! And nice find on the allele_number field!
My thoughts:

  • I think we should support all 3 cases as you mentioned. I'm in favor of being flexible and since you have already implemented this logic, let's use it :)
  • The logic in this class is getting very involved and confusing since there are several corner cases. I think creating a new annotation folder and pulling out code from this file would help a lot, but now that we know more about all of the corner cases, let's sit down and think about how we can redesign this in a more "easy-to-follow" manner.
  • I think we should do this refactoring/redesign before adding any new features to this file as it's becoming increasingly more difficult to follow.
  • For this PR though, please submit as is after addressing my nits :)

'BigQuery table and stored as repeated fields with '
'corresponding alternate alleles. [EXPERIMENTAL]'))
parser.add_argument(
'--minimal_VEP_alt_matching',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/VEP/vep

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

_COMPLETELY_DELETED_ALT = '-'

# The field name in the BigQuery table that holds annotation ALT.
_ANNOTATION_ALT = 'ALT'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'alt' is a very generic name and it's not clear how it's different from the actual alt. Consider renaming this (e.g. 'matching_alt'?). Also, we have made all of our column names lowercase, so I think it makes sense to follow that convention here as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this will appear under an annotation field, e.g., alternate_bases.CSQ.ALT. I agree that "ALT" is generic but I hoped being under the annotation field clarifies it and it is also how "the standard" refers to it. Per your comment, I changed it to allele_string (I think "matching_alt" is confusing as it may imply this is not the exact string that appeared in the VCF file). Please let me know if you prefer something else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. allele_string sounds good!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, how about just 'allele'? given that it is what the standard has anyway?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

_BREAKEND_ALT_RE = (re.compile(
r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$'))

def __init__(self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think annotation logic deserves its own file and even folder now (AnnotationProcessor, Annotation class, util libs, etc). Please just add a TODO in this PR though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and the TODO is in the next PR at a place which definitely needs refactoring. _AnnotationProcessor itself becoming a separate class needs more thinking because it is mutating ProcessedVariant and I prefer ProcessedVariantFactory.create_processed_variant be the only public way of ProcessedVariant mutation. I think it is possible to get most of _AnnotationProcessor out into its own library module/class without exposing ProcessedVariant to it, but I need to think more about the design of it.

mode=bigquery_util.TableFieldConstants.MODE_REPEATED,
description='List of {} annotations for this alternate.'.format(
annot_field))
annotation_record.fields.append(bigquery.TableFieldSchema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of making these optional only if --minimal is used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea for the next field (i.e., ambiguous_allele_string) and I did that (thanks for suggesting). For the actual ALT string, I see value in keeping it even for non --minimal cases. The reason is that the matching is not always an exact matching and I think it is good to keep this information in an easy to access way in the BQ table (this is an attempt to make sure no information is lost in the import process).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

return found_alt, is_ambiguous
else:
self._annotation_alt_mismatch_counter.inc()
self._alt_mismatch_counter.inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider rephrasing this as:

if not found_alt:
  counter.inc()
  log
return found_alt, is_ambiguous

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""
if not alt_bases or not annotation_alt:
return False
# Finding common leading and trailing sub-strings of ALT and REF.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a library function (I wonder if there is one already?) If not, we should make a generic substring matcher lib/function and use it here. Please fell free leave it as is for now though (we can do this in the more involved refactoring later).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed but the only library function I know is os.path.commonprefix (which acts on a list of strings and is used elsewhere in this file when there is a list of strings). I can use that here (for a list of two strings) and apply it on reversed strings for the common suffix, but I thought it is easier to understand if I directly implement it here. I can certainly move this to a separate library function when we do refactoring of this class.

raise


class AnnotationOptions(VariantTransformsOptions):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this new class!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review; PTAL at the review commit.

_COMPLETELY_DELETED_ALT = '-'

# The field name in the BigQuery table that holds annotation ALT.
_ANNOTATION_ALT = 'ALT'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this will appear under an annotation field, e.g., alternate_bases.CSQ.ALT. I agree that "ALT" is generic but I hoped being under the annotation field clarifies it and it is also how "the standard" refers to it. Per your comment, I changed it to allele_string (I think "matching_alt" is confusing as it may imply this is not the exact string that appeared in the VCF file). Please let me know if you prefer something else.


# The field name in the BigQuery table that indicates whether the annotation ALT
# matching was ambiguous or not.
_ANNOTATION_ALT_AMBIGUOUS = 'ambiguous_ALT'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that I also changed this from ambiguous_ALT to ambiguous_allele_string to be consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this one too.

mode=bigquery_util.TableFieldConstants.MODE_REPEATED,
description='List of {} annotations for this alternate.'.format(
annot_field))
annotation_record.fields.append(bigquery.TableFieldSchema(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea for the next field (i.e., ambiguous_allele_string) and I did that (thanks for suggesting). For the actual ALT string, I see value in keeping it even for non --minimal cases. The reason is that the matching is not always an exact matching and I think it is good to keep this information in an easy to access way in the BQ table (this is an attempt to make sure no information is lost in the import process).

_BREAKEND_ALT_RE = (re.compile(
r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$'))

def __init__(self,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and the TODO is in the next PR at a place which definitely needs refactoring. _AnnotationProcessor itself becoming a separate class needs more thinking because it is mutating ProcessedVariant and I prefer ProcessedVariantFactory.create_processed_variant be the only public way of ProcessedVariant mutation. I think it is possible to get most of _AnnotationProcessor out into its own library module/class without exposing ProcessedVariant to it, but I need to think more about the design of it.

return found_alt, is_ambiguous
else:
self._annotation_alt_mismatch_counter.inc()
self._alt_mismatch_counter.inc()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

raise


class AnnotationOptions(VariantTransformsOptions):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

'BigQuery table and stored as repeated fields with '
'corresponding alternate alleles. [EXPERIMENTAL]'))
parser.add_argument(
'--minimal_VEP_alt_matching',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""
if not alt_bases or not annotation_alt:
return False
# Finding common leading and trailing sub-strings of ALT and REF.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed but the only library function I know is os.path.commonprefix (which acts on a list of strings and is used elsewhere in this file when there is a list of strings). I can use that here (for a list of two strings) and apply it on reversed strings for the common suffix, but I thought it is easier to understand if I directly implement it here. I can certainly move this to a separate library function when we do refactoring of this class.

"""Adds all annotations to the given `proc_var`.
def _add_ambiguous_fields(self, annotations_list, ambiguous):
# type: (List[Dict[str, str]], bool) -> None
if self._minimal_match:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i think it's more readable to move thisif at the top so that we don't call this method if minimal_match is false. In general, methods should perform what they are meant to do and not become a no-op based on external factors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

_COMPLETELY_DELETED_ALT = '-'

# The field name in the BigQuery table that holds annotation ALT.
_ANNOTATION_ALT = 'ALT'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. allele_string sounds good!

arostamianfar
arostamianfar previously approved these changes Mar 16, 2018
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just one nit.

@bashir2
Copy link
Member Author

bashir2 commented Mar 16, 2018

PTAL, needs reapproval for these changes.

@bashir2 bashir2 merged commit 9ddddfe into googlegenomics:master Mar 16, 2018
@bashir2 bashir2 deleted the alt_annotation_fix_minimal_review branch March 20, 2018 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants