Moves alternate bases schema mutation to ProcessedVariantFactory. #118

bashir2 · 2018-02-22T05:11:09Z

Tested: Ran integration tests and updated unit-tests. Also ran the code manually for an annotated VCF and confirmed the generated schema is the same as before.

Issue: #59, #81.

coveralls · 2018-02-22T05:15:09Z

Pull Request Test Coverage Report for Build 335

160 of 174 (91.95%) changed or added relevant lines in 10 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+0.2%) to 90.716%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
gcp_variant_transforms/transforms/variant_to_bigquery.py	3	4	75.0%
gcp_variant_transforms/libs/variant_merge/move_to_calls_strategy.py	6	8	75.0%
gcp_variant_transforms/libs/bigquery_util.py	85	88	96.59%
gcp_variant_transforms/libs/processed_variant.py	19	27	70.37%

Files with Coverage Reduction	New Missed Lines	%
gcp_variant_transforms/transforms/variant_to_bigquery.py	1	73.08%

Totals
Change from base Build 333:	0.2%
Covered Lines:	3068
Relevant Lines:	3382

💛 - Coveralls

bashir2 · 2018-02-22T05:18:12Z

Folks, please note that this needs some more unit tests, which I will add tomorrow. No new functionality is added, so unit-test coverage does not decrease, but some functionality that was previously buried as module "private"s is now exposed through "public" interfaces, so it is better to write tests for it in isolation (e.g., all of bigquery_util and the new schema method of ProcessedVariantFactory).

We also probably need a better representation of the schema as an encapsulated class, but that is left for the schema validation work that Nima is going to do, since he can design that class better knowing the current status and what validation needs. So this does not "fix" Issue #59; it just makes things better.

arostamianfar

Thanks! Just a few nits, otherwise LGTM.

arostamianfar · 2018-02-22T15:17:16Z

gcp_variant_transforms/libs/processed_variant.py

+    # type: (str) -> bool
+    if info_field_name not in self._header_fields.infos:
+      raise ValueError('INFO field {} not found'.format(info_field_name))
+    if ((self._split_alternate_allele_info_fields and


nit: you can just do return (...)

bashir2 · 2018-02-22T20:14:00Z

FYI, I changed this a little, because pylint was suggesting to do it like this (check the bottom of the page), which did not seem very readable. So I do it the way it is now instead.
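
For readers following the nit above, the two forms under discussion can be sketched roughly as below. This is only an illustration, not the code in processed_variant.py: the function names and the `split_fields` and `is_per_alt` parameters are hypothetical stand-ins, since the real condition is truncated in the diff hunk.

```python
def is_moved_if_else(split_fields, is_per_alt):
  # type: (bool, bool) -> bool
  """Explicit branch form: test the condition, then return a literal."""
  if split_fields and is_per_alt:
    return True
  return False


def is_moved_direct(split_fields, is_per_alt):
  # type: (bool, bool) -> bool
  """Direct form suggested in the review: return the condition itself."""
  # bool() keeps the return type a strict bool even for truthy non-bool inputs.
  return bool(split_fields and is_per_alt)
```

Both functions return the same value for every combination of inputs; the choice between them is purely about readability.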

arostamianfar · 2018-02-22T15:20:42Z

gcp_variant_transforms/libs/bigquery_vcf_schema.py

-          description=''))
-    alternate_bases_record.fields.append(annotation_record)
-  schema.fields.append(alternate_bases_record)
+  proc_var_factory.add_alt_record_to_schema(schema)


nit: instead of adding alt to a mutable schema, consider returning the alt schema and appending it to the schema object here (e.g. schema.fields.append(proc_var_factory.get_alt_record_schema())). This makes it clearer how the schema is being mutated in this function.
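
The difference between the two designs in this nit can be sketched with plain Python stand-ins. Everything below is hypothetical: `Field`, `Schema`, and `FakeFactory` are simplified placeholders for the BigQuery schema objects and for ProcessedVariantFactory, and the field names are illustrative only.

```python
class Field(object):
  """Stand-in for a schema field that may hold nested sub-fields."""

  def __init__(self, name, fields=None):
    self.name = name
    self.fields = fields or []


class Schema(object):
  """Stand-in for a table schema: just a list of top-level fields."""

  def __init__(self):
    self.fields = []


class FakeFactory(object):
  """Stand-in for the factory that knows the alternate bases layout."""

  def get_alt_record_schema(self):
    # Returning style: build the record and hand it back to the caller.
    return Field('alternate_bases', [Field('alt')])

  def add_alt_record_to_schema(self, schema):
    # Mutating style: the schema is changed inside the factory.
    schema.fields.append(self.get_alt_record_schema())


def generate_schema_mutating(factory):
  schema = Schema()
  factory.add_alt_record_to_schema(schema)  # Mutation is hidden in the call.
  return schema


def generate_schema_returning(factory):
  schema = Schema()
  # The append is visible at the call site, as the reviewer suggests.
  schema.fields.append(factory.get_alt_record_schema())
  return schema
```

Both generators build the same schema; the second simply makes the mutation explicit where the schema is assembled.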

arostamianfar · 2018-02-22T15:24:04Z

gcp_variant_transforms/transforms/variant_to_bigquery.py

+        The latter functionality is what is needed here.
+      append: If true, existing records in output_table will not be
        overwritten. New records will be appended to those that already exist.
      omit_empty_sample_calls (bool): If true, samples that don't have a given


nit: can you also remove bool from here? :)

arostamianfar · 2018-02-22T18:10:47Z

gcp_variant_transforms/transforms/variant_to_bigquery.py


 @beam.typehints.with_input_types(processed_variant.ProcessedVariant)
 class VariantToBigQuery(beam.PTransform):
  """Writes PCollection of ``Variant`` records to BigQuery."""


Please also change this to be ProcessedVariant.

bashir2

PTAL

And please ignore the "Closed PR" comment GitHub injected there :-)

bashir2 · 2018-02-22T21:33:15Z

BTW, if you want to see the changes since last review, check this commit

arostamianfar

LGTM. Please wait for Nima's review as well and add a few unit tests to bigquery_util.py.

bashir2 · 2018-02-23T00:12:17Z

bigquery_util_test added plus a TODO for the new create_alt_record_for_schema method of ProcessedVariantFactory.

nmousavi · 2018-02-22T22:50:45Z

gcp_variant_transforms/libs/processed_variant.py

      annotation_dict[name] = annotations[index + 1]
    return annotation_dict

+  def create_alt_record_for_schema(self):


maybe 'create_alt_bases_schema'?

bashir2 · 2018-02-23T19:04:53Z

Changed it to create_alt_bases_field_schema because I want to make it clear that it is not a full schema.

nmousavi · 2018-02-22T22:50:57Z

gcp_variant_transforms/libs/bigquery_vcf_schema.py

-                                       split_alternate_allele_info_fields=True,
-                                       annotation_fields=None):
+def generate_schema_from_header_fields(
+    header_fields,  # type: vcf_header_parser.HeaderFields


This type annotation is cool. :)
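
For context, the style being praised here is the per-argument type comment form described in PEP 484, which can be used in code that still has to run on Python 2. A minimal sketch, with a hypothetical function and parameter names that are not taken from this PR:

```python
from typing import List  # Only needed so type checkers can resolve List.


def build_field_names(
    header_names,  # type: List[str]
    prefix='',  # type: str
    ):
  # type: (...) -> List[str]
  """Returns the given header names with an optional prefix applied."""
  return [prefix + name for name in header_names]
```

At runtime these are ordinary comments, so the function behaves the same whether or not a type checker is in use.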

nmousavi · 2018-02-22T22:50:59Z

gcp_variant_transforms/libs/bigquery_vcf_schema.py

-                                       annotation_fields=None):
+def generate_schema_from_header_fields(
+    header_fields,  # type: vcf_header_parser.HeaderFields
+    proc_var_factory,  # type: processed_variant.ProcessedVariantFactory


processed_variant_factory?

Both 'proc' and 'var' can be confused with other, more common names.

bashir2 · 2018-02-23T19:05:44Z

Changed it to proc_variant_factory. Agreed that proc is not great, but processed_variant_factory is too long (for example, I would need to break the # type comment onto the next line, which is doable but ugly). There is also a docstring below that clarifies it. Still, if you feel strongly about this, please let me know.

nmousavi · 2018-02-22T22:51:53Z

gcp_variant_transforms/libs/bigquery_vcf_schema.py

+      VCF file(s).
+    proc_var_factory: The factory class that knows how to convert Variant
+      instances to ProcessedVariant. As a side effect it also knows how to
+      modify BigQuery schema based on the ProcessedVariants that it generates.


This line

"... As a side effect it also knows how to modify BigQuery schema based on the ProcessedVariants that it generates."

calls for a refactoring, though in another PR; this one is already big enough to review.

bashir2 · 2018-02-23T19:16:23Z

Agreed this is not great, but I can't think of a much better alternative design that doesn't require too much code change. If you have a specific suggestion, I can add a TODO.

bashir2

Thanks Nima for the review. PTAL.

bashir2 · 2018-02-23T19:24:28Z

BTW, if you want to see only the changes since last time, check this commit.

nmousavi

Thank you for this PR. LGTM!

bashir2 requested review from arostamianfar and nmousavi February 22, 2018 05:18

arostamianfar suggested changes Feb 22, 2018

arostamianfar reviewed Feb 22, 2018

bashir2 closed this Feb 22, 2018

bashir2 force-pushed the refactor_schema_review branch from 651a6fb to bf8448a Compare February 22, 2018 20:53

bashir2 reopened this Feb 22, 2018

bashir2 commented Feb 22, 2018

bashir2 force-pushed the refactor_schema_review branch from c4b6b15 to 1003fd9 Compare February 22, 2018 21:05

arostamianfar reviewed Feb 22, 2018

bashir2 force-pushed the refactor_schema_review branch 2 times, most recently from a102cbf to 5ed7e7e Compare February 23, 2018 00:07

nmousavi reviewed Feb 23, 2018

bashir2 commented Feb 23, 2018

bashir2 force-pushed the refactor_schema_review branch from 5ed7e7e to d65b158 Compare February 23, 2018 19:22

nmousavi previously approved these changes Feb 23, 2018

Moves alternate bases schema mutation to ProcessedVariantFactory.

79694a0

bashir2 dismissed nmousavi’s stale review via 79694a0 February 23, 2018 21:20

bashir2 force-pushed the refactor_schema_review branch from d65b158 to 79694a0 Compare February 23, 2018 21:20

arostamianfar approved these changes Feb 23, 2018

bashir2 merged commit d37ff53 into googlegenomics:master Feb 23, 2018

bashir2 mentioned this pull request Feb 23, 2018

Refactor bigquery_vcf_schema.py to use a class #59

Closed

bashir2 deleted the refactor_schema_review branch March 9, 2018 18:55
