Skip to content

Conversation

@bashir2
Copy link
Member

@bashir2 bashir2 commented Feb 7, 2018

Please note that some more unit-tests need to be added and there are a few TODOs (some for follow-up PRs), but this is in a reasonable shape for the first pass review.

Tested:
Added some unit-tests and one integration test (valid_4_2_VEP.json).
More interesting query: https://bigquery.cloud.google.com/results/gcp-variant-transforms-test:bquijob_7d798216_1616efded18

Issue #81

@coveralls
Copy link

Pull Request Test Coverage Report for Build 294

  • 63 of 80 (78.75%) changed or added relevant lines in 9 files are covered.
  • 6 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.4%) to 90.211%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/beam_io/vcfio.py 7 8 87.5%
gcp_variant_transforms/transforms/variant_to_bigquery.py 1 2 50.0%
gcp_variant_transforms/options/variant_transform_options.py 0 1 0.0%
gcp_variant_transforms/libs/vcf_header_parser.py 6 7 85.71%
gcp_variant_transforms/libs/bigquery_vcf_schema.py 20 33 60.61%
Files with Coverage Reduction New Missed Lines %
gcp_variant_transforms/libs/bigquery_vcf_schema.py 1 90.09%
gcp_variant_transforms/options/variant_transform_options.py 1 55.17%
gcp_variant_transforms/transforms/variant_to_bigquery.py 2 66.67%
gcp_variant_transforms/vcf_to_bq.py 2 55.34%
Totals Coverage Status
Change from base Build 293: -0.4%
Covered Lines: 2820
Relevant Lines: 3126

💛 - Coveralls

@bashir2 bashir2 requested a review from arostamianfar February 7, 2018 06:48
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Bashir! I looked at the high level design and I a few concerns:

  • We should not modify vcfio. We should keep vcfio as a very low level transform (think of it as just a text reader). Any higher level semantics should be done outside.
  • I think we should decouple annotations from the main flow as much as possible so that it's easy to change it later on. Is it possible to add a PTransform for this?
  • The design should be flexible enough to support multiple annotation fields.

Let's discuss this in person when you're here. I think we need a short design doc now that you have the prototype working.

_JSON_CONCATENATION_OVERHEAD_BYTES = 5


# TODO(bashir2): Using type identifiers like ``HeaderFields`` does not seem
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the type identifiers have been on my mind for some time. I inherited it from Beam conventions, but have not actually set up automated documentation generation to verify that it finds the correct class.
I think the 'correct' way is something like :class:`HeaderFields`. In any case, I filed Issue #108 for this. You can remove the TODO from here as it's awkward to single out this particular field.

@bashir2
Copy link
Member Author

bashir2 commented Feb 7, 2018

To document follow-up discussion on the design decisions:

  • The reason that we need to touch vcfio is that it is the only place that we have access to original headers. Original headers are needed to resolve inconsistency between annotation headers, e.g., Conseq|Gene in one file and Gene|Conseq in another (see this TODO for a code pointer).

  • We decided that we will not support resolving these kind of conflicts. With that, we can rely on the merged header and do not attach annotation format to annotation fields of each variant.

  • This also applies to the field_count which is already in VariantInfo.

  • To make the annotation extraction support more decoupled from the rest of the code, we need more refactoring to use the merged header as a side input to a PTransform that processes Variant. With this we will make VCF reading (vcfio) and BigQuery writing (bigquery_vcf_schema) dumb transforms that do mostly reading/parsing and writing, not much more processing.

  • Supporting multiple annotation fields is simple, both in the current form and the future refactored form. We just need to change --annotation_field to be a comma separated list of field names.

@bashir2
Copy link
Member Author

bashir2 commented Feb 16, 2018

This was superseded by PR #109 .

@bashir2 bashir2 closed this Feb 16, 2018
@bashir2 bashir2 deleted the annotation_field_review branch February 16, 2018 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants