Skip to content

Conversation

@tneymanov
Copy link
Collaborator

Continue with our effort to simplify PTransform code.

  • Migrate reusable logic in infer_headers.py into a new module.
  • Migrate reusable logic in variant_to_bigquery.py into bigquery_util module.

@tneymanov tneymanov requested a review from allieychen January 24, 2019 19:42
@coveralls
Copy link

coveralls commented Jan 24, 2019

Pull Request Test Coverage Report for Build 1547

  • 355 of 378 (93.92%) changed or added relevant lines in 7 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.05%) to 88.975%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/transforms/variant_to_bigquery.py 0 1 0.0%
gcp_variant_transforms/libs/infer_headers_util.py 124 129 96.12%
gcp_variant_transforms/libs/bigquery_util.py 22 39 56.41%
Files with Coverage Reduction New Missed Lines %
gcp_variant_transforms/transforms/infer_headers_test.py 4 97.48%
Totals Coverage Status
Change from base Build 1542: 0.05%
Covered Lines: 7037
Relevant Lines: 7909

💛 - Coveralls

Copy link
Contributor

@allieychen allieychen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Tural, great work! I have added a few nits.


"""Constants and simple utility functions related to BigQuery."""

import exceptions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please keep the imported package sorted. Change the order of exceptions and enum.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

else:
return avro_type

def update_bigquery_schema_on_append(schema_fields, output_table):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add a docstring for this function since it is public now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return avro_type

def update_bigquery_schema_on_append(schema_fields, output_table):
# type: (bool) -> None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

raise RuntimeError('BigQuery schema update failed: %s' % str(e))


def get_merged_field_schemas(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to still keep it as private since it is only used in the above function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so if it's used in tests, we don't care?
kk, Done.

# type: (...) -> List[bigquery.TableFieldSchema]
"""Merges the `field_schemas_1` and `field_schemas_2`.
Args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Leave one empty line between Args, Returns, and Raises. See one example. I know this is how the old code looks like, can you also help refining them? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, Done.

cardinality as the alternate bases. Correct the num to be `None`.
- Defined type is `Integer`, but the provided value is float. Correct the
type to be `Float`.
Args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add one empty line here and below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

defined_header.get(_HeaderKeyConstants.VERSION))
return None

def _infer_mismatched_format_field(field_key, # type: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the parameters can fit into one line. It will be easier to read if them are in one line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

defined_header.get(_HeaderKeyConstants.DESC))
return None

def _infer_standard_info_fields(variant, infos, defined_headers):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel standard is not that easy to understand. Is it non annotation info?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap. I guess I'll rename to .._non_annotation_.. then.
Done.

variant, infos, defined_headers, annotation_fields_to_infer)
return infos

def infer_format_fields(variant, defined_headers):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I prefer to put the public methods at the beginning of one script, rather than scrolling all the way down to find out what this file can provide. If you agree, let's stick to the same rule. Of course, there is no wrong/right for the other way. Just let me know your preference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sooo I personally prefer doing it other way - when we were working on Java, the style guide enforced that a method, if possible, should only be calling methods that were defined before it. However, that's Java, I have no idea about Python. Also if you guys already have been defining public methods first, we should continue doing that - inconsistency is the worst.

I'll modify the order.

_BREAKEND_ALT_RE = (re.compile(
r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$'))

# Filled with annotation field and name data, then used as a header ID.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to still keep these in infer_headers_util.py.
This module is used to parse/reconstruct the existing annotation header (something like A|upstream_gene_variant|MODIFIER|PSMF1|||||), which is usually added by the third party tool.

The new get_inferred_annotation_type_header_key, creates a name for each individual annotation field when infer_annotation_type is set. For instance, we may add something like below into one temporary header file to help loading the VCF to BQ.
##INFO=<ID=CSQ_VT_SWISSPROT_TYPE,Number=1,Type=String,Description="Inferred type field for annotation SWISSPROT.",Source="",Version="">

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, moved the logic into util.

Copy link
Collaborator Author

@tneymanov tneymanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, addressed the comments.

_BREAKEND_ALT_RE = (re.compile(
r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$'))

# Filled with annotation field and name data, then used as a header ID.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, moved the logic into util.


"""Constants and simple utility functions related to BigQuery."""

import exceptions
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

else:
return avro_type

def update_bigquery_schema_on_append(schema_fields, output_table):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return avro_type

def update_bigquery_schema_on_append(schema_fields, output_table):
# type: (bool) -> None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

raise RuntimeError('BigQuery schema update failed: %s' % str(e))


def get_merged_field_schemas(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so if it's used in tests, we don't care?
kk, Done.


from gcp_variant_transforms.beam_io import vcf_header_io
from gcp_variant_transforms.beam_io import vcfio # pylint: disable=unused-import
from gcp_variant_transforms.libs.annotation import annotation_parser
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

cardinality as the alternate bases. Correct the num to be `None`.
- Defined type is `Integer`, but the provided value is float. Correct the
type to be `Float`.
Args:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

defined_header.get(_HeaderKeyConstants.VERSION))
return None

def _infer_mismatched_format_field(field_key, # type: str
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

defined_header.get(_HeaderKeyConstants.DESC))
return None

def _infer_standard_info_fields(variant, infos, defined_headers):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap. I guess I'll rename to .._non_annotation_.. then.
Done.

variant, infos, defined_headers, annotation_fields_to_infer)
return infos

def infer_format_fields(variant, defined_headers):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sooo I personally prefer doing it other way - when we were working on Java, the style guide enforced that a method, if possible, should only be calling methods that were defined before it. However, that's Java, I have no idea about Python. Also if you guys already have been defining public methods first, we should continue doing that - inconsistency is the worst.

I'll modify the order.

Copy link
Contributor

@allieychen allieychen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much Tural. I added a few nits.


def update_bigquery_schema_on_append(schema_fields, output_table):
# type: (bool) -> None
# type: (bool, str) -> None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, Done.

# type: (bool, str) -> None
# if table does not exist, do not need to update the schema.
# TODO (yifangchen): Move the logic into validate().
"""Update BQ schema by combining existing one with a new one, if possible."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: docstring is a string that is the first statement in a package, module, class or function. Read more here. However, if we have type annotation, the doc string goes below the type annotation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, kinda weird to have a comment, docstring and then a comment again :/. Done, nonetheless.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any changes here.

I agree with your point. You can refactor it to something like:
"""Update BQ schema by combining existing one with a new one, if possible.

If table does not exist, do not need to update the schema.
TODO (yifangchen): Move the logic into validate().
"""

# limitations under the License.

"""A Helper module for Header Inference operations."""
"""A Helper module for header inference operations."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/Helper/helper

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return _BASE_ANNOTATION_TYPE_KEY.format(annot_field, name)

def infer_info_fields(
variant,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add type for this variable and the one below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator Author

@tneymanov tneymanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comments


def update_bigquery_schema_on_append(schema_fields, output_table):
# type: (bool) -> None
# type: (bool, str) -> None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, Done.

# type: (bool, str) -> None
# if table does not exist, do not need to update the schema.
# TODO (yifangchen): Move the logic into validate().
"""Update BQ schema by combining existing one with a new one, if possible."""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, kinda weird to have a comment, docstring and then a comment again :/. Done, nonetheless.

return _BASE_ANNOTATION_TYPE_KEY.format(annot_field, name)

def infer_info_fields(
variant,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# limitations under the License.

"""A Helper module for Header Inference operations."""
"""A Helper module for header inference operations."""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@allieychen allieychen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please feel free to merge the code after you address the last comment.

# type: (bool, str) -> None
# if table does not exist, do not need to update the schema.
# TODO (yifangchen): Move the logic into validate().
"""Update BQ schema by combining existing one with a new one, if possible."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any changes here.

I agree with your point. You can refactor it to something like:
"""Update BQ schema by combining existing one with a new one, if possible.

If table does not exist, do not need to update the schema.
TODO (yifangchen): Move the logic into validate().
"""

Copy link
Contributor

@allieychen allieychen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tneymanov tneymanov merged commit 217b0ce into googlegenomics:master Feb 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants