Parse out the reusable logic from infer_headers and variant_to_bigquery #435

tneymanov · 2019-01-24T19:42:30Z

Continue with our effort to simplify PTransform code.

Migrate reusable logic in infer_headers.py into a new module.
Migrate reusable logic in variant_to_bigquery.py into bigquery_util module.

coveralls · 2019-01-24T21:56:16Z

Pull Request Test Coverage Report for Build 1547

355 of 378 (93.92%) changed or added relevant lines in 7 files are covered.
4 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.05%) to 88.975%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
gcp_variant_transforms/transforms/variant_to_bigquery.py	0	1	0.0%
gcp_variant_transforms/libs/infer_headers_util.py	124	129	96.12%
gcp_variant_transforms/libs/bigquery_util.py	22	39	56.41%

Files with Coverage Reduction	New Missed Lines	%
gcp_variant_transforms/transforms/infer_headers_test.py	4	97.48%

Totals
Change from base Build 1542:	0.05%
Covered Lines:	7037
Relevant Lines:	7909

💛 - Coveralls

allieychen

Thank you Tural, great work! I have added a few nits.

allieychen · 2019-01-25T15:51:19Z

gcp_variant_transforms/libs/bigquery_util.py


 """Constants and simple utility functions related to BigQuery."""

+import exceptions


nit: Please keep the imported packages sorted. Swap the order of exceptions and enum.

allieychen · 2019-01-25T15:52:52Z

gcp_variant_transforms/libs/bigquery_util.py

  else:
    return avro_type
+
+def update_bigquery_schema_on_append(schema_fields, output_table):


nit: Please add a docstring for this function since it is public now.

allieychen · 2019-01-25T15:53:27Z

gcp_variant_transforms/libs/bigquery_util.py

    return avro_type
+
+def update_bigquery_schema_on_append(schema_fields, output_table):
+  # type: (bool) -> None


Please update the type.

allieychen · 2019-01-25T15:57:02Z

gcp_variant_transforms/libs/bigquery_util.py

+    raise RuntimeError('BigQuery schema update failed: %s' % str(e))
+
+
+def get_merged_field_schemas(


I think it is fine to still keep it as private since it is only used in the above function.

I see, so if it's used in tests, we don't care?
kk, Done.

allieychen · 2019-01-25T16:02:32Z

gcp_variant_transforms/libs/bigquery_util.py

+  # type: (...) -> List[bigquery.TableFieldSchema]
+  """Merges the `field_schemas_1` and `field_schemas_2`.
+
+  Args:


nit: Leave one empty line between Args, Returns, and Raises. See one example. I know this is what the old code looks like; can you also help refine the existing ones? :)

sure, Done.
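
For reference, a sketch of the layout being requested, with one empty line between the Args, Returns, and Raises sections. The function `merge_names` and its behaviour are hypothetical and exist only to carry the docstring; this is not the real `get_merged_field_schemas`.

```python
def merge_names(names_1, names_2):
  # type: (List[str], List[str]) -> List[str]
  """Merges `names_1` and `names_2`, keeping first-seen order.

  Args:
    names_1: A list of names.
    names_2: Another list of names.

  Returns:
    A list with every distinct name from both inputs.

  Raises:
    ValueError: If either input contains an empty name.
  """
  merged = []
  for name in list(names_1) + list(names_2):
    if not name:
      raise ValueError('Empty name is not allowed.')
    if name not in merged:
      merged.append(name)
  return merged
```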

allieychen · 2019-01-25T16:17:21Z

gcp_variant_transforms/libs/infer_headers_util.py

+    cardinality as the alternate bases. Correct the num to be `None`.
+  - Defined type is `Integer`, but the provided value is float. Correct the
+    type to be `Float`.
+  Args:


nit: add one empty line here and below.
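
The second correction described in that docstring (defined type `Integer`, provided value is a float) could be sketched roughly as below. `infer_corrected_type` is a hypothetical helper written for illustration, not the module's actual implementation, and it ignores the cardinality case.

```python
def infer_corrected_type(defined_type, value):
  """Returns 'Float' if an 'Integer' field carries a float, else None.

  Hypothetical sketch: `value` may be a single value or a list of values.
  """
  values = value if isinstance(value, list) else [value]
  if defined_type == 'Integer' and any(isinstance(v, float) for v in values):
    return 'Float'
  # No correction is needed.
  return None
```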

allieychen · 2019-01-25T16:18:23Z

gcp_variant_transforms/libs/infer_headers_util.py

+                defined_header.get(_HeaderKeyConstants.VERSION))
+  return None
+
+def _infer_mismatched_format_field(field_key,  # type: str


I feel the parameters can fit on one line. It will be easier to read if they are on one line.

allieychen · 2019-01-25T16:20:09Z

gcp_variant_transforms/libs/infer_headers_util.py

+                  defined_header.get(_HeaderKeyConstants.DESC))
+  return None
+
+def _infer_standard_info_fields(variant, infos, defined_headers):


I feel standard is not that easy to understand. Is it non annotation info?

Yeap. I guess I'll rename to .._non_annotation_.. then.
Done.

allieychen · 2019-01-25T16:26:10Z

gcp_variant_transforms/libs/infer_headers_util.py

+        variant, infos, defined_headers, annotation_fields_to_infer)
+  return infos
+
+def infer_format_fields(variant, defined_headers):


In general, I prefer to put the public methods at the beginning of one script, rather than scrolling all the way down to find out what this file can provide. If you agree, let's stick to the same rule. Of course, there is no wrong/right for the other way. Just let me know your preference.

Sooo I personally prefer doing it the other way - when we were working on Java, the style guide enforced that a method, if possible, should only call methods that were defined before it. However, that's Java, I have no idea about Python. Also, if you guys have already been defining public methods first, we should continue doing that - inconsistency is the worst.

I'll modify the order.
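
On the Python side of the ordering question: names inside a function body are looked up when the function is called, not when it is defined, so a public function can sit above the private helpers it uses, provided the call happens after the module has finished loading. A minimal sketch with hypothetical names:

```python
def describe(value):
  """Public entry point, defined first so readers see the API up front."""
  return _format(value)


def _format(value):
  """Private helper, defined after its caller."""
  return 'value={}'.format(value)
```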

allieychen · 2019-01-25T16:35:54Z

gcp_variant_transforms/libs/annotation/annotation_parser.py

 _BREAKEND_ALT_RE = (re.compile(
    r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$'))

+# Filled with annotation field and name data, then used as a header ID.


I prefer to still keep these in infer_headers_util.py.
This module is used to parse/reconstruct the existing annotation header (something like A|upstream_gene_variant|MODIFIER|PSMF1|||||), which is usually added by a third-party tool.

The new get_inferred_annotation_type_header_key creates a name for each individual annotation field when infer_annotation_type is set. For instance, we may add something like the line below to a temporary header file to help load the VCF into BQ.
##INFO=<ID=CSQ_VT_SWISSPROT_TYPE,Number=1,Type=String,Description="Inferred type field for annotation SWISSPROT.",Source="",Version="">

Sounds good, moved the logic into util.
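
To illustrate how such an ID could be produced, here is a small sketch. The template string and the `'CSQ_VT'` / `'SWISSPROT'` inputs are assumptions chosen to reproduce the header ID quoted above; the real value of `_BASE_ANNOTATION_TYPE_KEY` may differ.

```python
# Assumed template; the real _BASE_ANNOTATION_TYPE_KEY may differ.
_BASE_ANNOTATION_TYPE_KEY = '{}_{}_TYPE'


def get_inferred_annotation_type_header_key(annot_field, name):
  # type: (str, str) -> str
  """Creates the ID for the inferred type header of one annotation field."""
  return _BASE_ANNOTATION_TYPE_KEY.format(annot_field, name)
```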

tneymanov

Thanks for the review, addressed the comments.

tneymanov · 2019-01-28T16:08:22Z

gcp_variant_transforms/libs/infer_headers_util.py

+
+from gcp_variant_transforms.beam_io import vcf_header_io
+from gcp_variant_transforms.beam_io import vcfio  # pylint: disable=unused-import
+from gcp_variant_transforms.libs.annotation import annotation_parser


allieychen

Thank you so much Tural. I added a few nits.

allieychen · 2019-01-31T15:43:37Z

gcp_variant_transforms/libs/bigquery_util.py


 def update_bigquery_schema_on_append(schema_fields, output_table):
-  # type: (bool) -> None
+  # type: (bool, str) -> None


Oops, Done.

allieychen · 2019-01-31T15:46:43Z

gcp_variant_transforms/libs/bigquery_util.py

+  # type: (bool, str) -> None
  # if table does not exist, do not need to update the schema.
  # TODO (yifangchen): Move the logic into validate().
+  """Update BQ schema by combining existing one with a new one, if possible."""


nit: A docstring is a string that is the first statement in a package, module, class or function. Read more here. However, if we have a type annotation, the docstring goes below the type annotation.

Hmm, kinda weird to have a comment, docstring and then a comment again :/. Done, nonetheless.

I didn't see any changes here.

I agree with your point. You can refactor it to something like:
"""Update BQ schema by combining existing one with a new one, if possible.

If table does not exist, do not need to update the schema.
TODO (yifangchen): Move the logic into validate().
"""
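
A quick way to check the placement rule: comments, including `# type:` comments, are not statements, so a string literal placed after them is still picked up as the docstring, whereas one placed after a real statement is not. The two functions below are hypothetical and only demonstrate this behaviour.

```python
def with_type_comment(flag, table):
  # type: (bool, str) -> None
  """Docstring placed right after the type comment."""
  return None


def with_statement_first(flag, table):
  _ = flag
  """This string is not a docstring; it follows a statement."""
  return None
```

Inspecting `with_type_comment.__doc__` and `with_statement_first.__doc__` shows the difference: the first holds the string, the second is `None`.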

allieychen · 2019-01-31T15:47:45Z

gcp_variant_transforms/libs/infer_headers_util.py

 # limitations under the License.

-"""A Helper module for Header Inference operations."""
+"""A Helper module for header inference operations."""


nit: s/Helper/helper

allieychen · 2019-01-31T15:48:33Z

gcp_variant_transforms/libs/infer_headers_util.py

+  return _BASE_ANNOTATION_TYPE_KEY.format(annot_field, name)
+
+def infer_info_fields(
+    variant,


nit: Please add type for this variable and the one below.

tneymanov

Addressed the comments

allieychen

LGTM. Please feel free to merge the code after you address the last comment.

allieychen

LGTM

tneymanov requested a review from allieychen January 24, 2019 19:42

allieychen reviewed Jan 25, 2019

tneymanov commented Jan 28, 2019

allieychen reviewed Jan 31, 2019

tneymanov commented Feb 4, 2019

allieychen approved these changes Feb 5, 2019

allieychen approved these changes Feb 7, 2019

tneymanov added 4 commits February 8, 2019 13:38

Parse out the reusable logic from infer_headers and variant_to_bigquery modules into libs directory.

8ecb414

Applied the requested changes.

aa369be

Addressed 2nd iteration of comments.

3fed341

Modified docstring for update_bigquery_schema_on_append, as per review comment.

c6f611c

tneymanov force-pushed the move_infer_headers branch from 515d162 to c6f611c Compare February 8, 2019 18:39

tneymanov merged commit 217b0ce into googlegenomics:master Feb 12, 2019
