Adds an option to split annotation fields. #107

bashir2 · 2018-02-07T06:43:11Z

Please note that some more unit-tests need to be added and there are a few TODOs (some for follow-up PRs), but this is in a reasonable shape for the first pass review.

Tested:
Added some unit-tests and one integration test (valid_4_2_VEP.json).
More interesting query: https://bigquery.cloud.google.com/results/gcp-variant-transforms-test:bquijob_7d798216_1616efded18

Issue #81

coveralls · 2018-02-07T06:48:05Z

Pull Request Test Coverage Report for Build 294

63 of 80 (78.75%) changed or added relevant lines in 9 files are covered.
6 unchanged lines in 4 files lost coverage.
Overall coverage decreased (-0.4%) to 90.211%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
gcp_variant_transforms/beam_io/vcfio.py	7	8	87.5%
gcp_variant_transforms/transforms/variant_to_bigquery.py	1	2	50.0%
gcp_variant_transforms/options/variant_transform_options.py	0	1	0.0%
gcp_variant_transforms/libs/vcf_header_parser.py	6	7	85.71%
gcp_variant_transforms/libs/bigquery_vcf_schema.py	20	33	60.61%

Files with Coverage Reduction	New Missed Lines	%
gcp_variant_transforms/libs/bigquery_vcf_schema.py	1	90.09%
gcp_variant_transforms/options/variant_transform_options.py	1	55.17%
gcp_variant_transforms/transforms/variant_to_bigquery.py	2	66.67%
gcp_variant_transforms/vcf_to_bq.py	2	55.34%

Totals
Change from base Build 293:	-0.4%
Covered Lines:	2820
Relevant Lines:	3126

💛 - Coveralls

arostamianfar

Thanks, Bashir! I looked at the high level design and I a few concerns:

We should not modify vcfio. We should keep vcfio as a very low level transform (think of it as just a text reader). Any higher level semantics should be done outside.
I think we should decouple annotations from the main flow as much as possible so that it's easy to change it later on. Is it possible to add a PTransform for this?
The design should be flexible enough to support multiple annotation fields.

Let's discuss this in person when you're here. I think we need a short design doc now that you have the prototype working.

arostamianfar · 2018-02-07T15:35:01Z

gcp_variant_transforms/libs/bigquery_vcf_schema.py

 _JSON_CONCATENATION_OVERHEAD_BYTES = 5


+# TODO(bashir2): Using type identifiers like ``HeaderFields`` does not seem


yeah, the type identifiers have been on my mind for some time. I inherited it from Beam conventions, but have not actually set up automated documentation generation to verify that it finds the correct class.
I think the 'correct' way is something like :class:`HeaderFields`. In any case, I filed Issue #108 for this. You can remove the TODO from here as it's awkward to single out this particular field.

bashir2 · 2018-02-07T20:43:12Z

To document follow-up discussion on the design decisions:

The reason that we need to touch vcfio is that it is the only place that we have access to original headers. Original headers are needed to resolve inconsistency between annotation headers, e.g., Conseq|Gene in one file and Gene|Conseq in another (see this TODO for a code pointer).
We decided that we will not support resolving these kind of conflicts. With that, we can rely on the merged header and do not attach annotation format to annotation fields of each variant.
This also applies to the field_count which is already in VariantInfo.
To make the annotation extraction support more decoupled from the rest of the code, we need more refactoring to use the merged header as a side input to a PTransform that processes Variant. With this we will make VCF reading (vcfio) and BigQuery writing (bigquery_vcf_schema) dumb transforms that do mostly reading/parsing and writing, not much more processing.
Supporting multiple annotation fields is simple, both in the current form and the future refactored form. We just need to change --annotation_field to be a comma separated list of field names.

bashir2 · 2018-02-16T22:06:34Z

This was superseded by PR #109 .

Adds an option to split annotation fields.

00c61ee

bashir2 requested a review from arostamianfar February 7, 2018 06:48

arostamianfar suggested changes Feb 7, 2018

View reviewed changes

bashir2 mentioned this pull request Feb 9, 2018

Adds annotation field support and some related refactoring. #109

Merged

bashir2 closed this Feb 16, 2018

bashir2 deleted the annotation_field_review branch February 16, 2018 23:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adds an option to split annotation fields. #107

Adds an option to split annotation fields. #107

Uh oh!

bashir2 commented Feb 7, 2018

Uh oh!

coveralls commented Feb 7, 2018

Uh oh!

arostamianfar left a comment

Uh oh!

arostamianfar Feb 7, 2018

Uh oh!

bashir2 commented Feb 7, 2018

Uh oh!

bashir2 commented Feb 16, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		_JSON_CONCATENATION_OVERHEAD_BYTES = 5


		# TODO(bashir2): Using type identifiers like ``HeaderFields`` does not seem

Adds an option to split annotation fields. #107

Adds an option to split annotation fields. #107

Uh oh!

Conversation

bashir2 commented Feb 7, 2018

Uh oh!

coveralls commented Feb 7, 2018

Pull Request Test Coverage Report for Build 294

💛 - Coveralls

Uh oh!

arostamianfar left a comment

Choose a reason for hiding this comment

Uh oh!

arostamianfar Feb 7, 2018

Choose a reason for hiding this comment

Uh oh!

bashir2 commented Feb 7, 2018

Uh oh!

bashir2 commented Feb 16, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants