-
Notifications
You must be signed in to change notification settings - Fork 59
Parse out the reusable logic from infer_headers and variant_to_bigquery #435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse out the reusable logic from infer_headers and variant_to_bigquery #435
Conversation
Pull Request Test Coverage Report for Build 1547
💛 - Coveralls |
allieychen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Tural, great work! I have added a few nits.
|
|
||
| """Constants and simple utility functions related to BigQuery.""" | ||
|
|
||
| import exceptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please keep the imported package sorted. Change the order of exceptions and enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| else: | ||
| return avro_type | ||
|
|
||
| def update_bigquery_schema_on_append(schema_fields, output_table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please add a docstring for this function since it is public now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return avro_type | ||
|
|
||
| def update_bigquery_schema_on_append(schema_fields, output_table): | ||
| # type: (bool) -> None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| raise RuntimeError('BigQuery schema update failed: %s' % str(e)) | ||
|
|
||
|
|
||
| def get_merged_field_schemas( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to still keep it as private since it is only used in the above function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so if it's used in tests, we don't care?
kk, Done.
| # type: (...) -> List[bigquery.TableFieldSchema] | ||
| """Merges the `field_schemas_1` and `field_schemas_2`. | ||
| Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Leave one empty line between Args, Returns, and Raises. See one example. I know this is how the old code looks like, can you also help refining them? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, Done.
| cardinality as the alternate bases. Correct the num to be `None`. | ||
| - Defined type is `Integer`, but the provided value is float. Correct the | ||
| type to be `Float`. | ||
| Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add one empty line here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| defined_header.get(_HeaderKeyConstants.VERSION)) | ||
| return None | ||
|
|
||
| def _infer_mismatched_format_field(field_key, # type: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel the parameters can fit into one line. It will be easier to read if them are in one line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| defined_header.get(_HeaderKeyConstants.DESC)) | ||
| return None | ||
|
|
||
| def _infer_standard_info_fields(variant, infos, defined_headers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel standard is not that easy to understand. Is it non annotation info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap. I guess I'll rename to .._non_annotation_.. then.
Done.
| variant, infos, defined_headers, annotation_fields_to_infer) | ||
| return infos | ||
|
|
||
| def infer_format_fields(variant, defined_headers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I prefer to put the public methods at the beginning of one script, rather than scrolling all the way down to find out what this file can provide. If you agree, let's stick to the same rule. Of course, there is no wrong/right for the other way. Just let me know your preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sooo I personally prefer doing it other way - when we were working on Java, the style guide enforced that a method, if possible, should only be calling methods that were defined before it. However, that's Java, I have no idea about Python. Also if you guys already have been defining public methods first, we should continue doing that - inconsistency is the worst.
I'll modify the order.
| _BREAKEND_ALT_RE = (re.compile( | ||
| r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$')) | ||
|
|
||
| # Filled with annotation field and name data, then used as a header ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to still keep these in infer_headers_util.py.
This module is used to parse/reconstruct the existing annotation header (something like A|upstream_gene_variant|MODIFIER|PSMF1|||||), which is usually added by the third party tool.
The new get_inferred_annotation_type_header_key, creates a name for each individual annotation field when infer_annotation_type is set. For instance, we may add something like below into one temporary header file to help loading the VCF to BQ.
##INFO=<ID=CSQ_VT_SWISSPROT_TYPE,Number=1,Type=String,Description="Inferred type field for annotation SWISSPROT.",Source="",Version="">
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, moved the logic into util.
tneymanov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, addressed the comments.
| _BREAKEND_ALT_RE = (re.compile( | ||
| r'^(?P<up_to_chr>.*([\[\]]).*):(?P<pos>.*)([\[\]]).*$')) | ||
|
|
||
| # Filled with annotation field and name data, then used as a header ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, moved the logic into util.
|
|
||
| """Constants and simple utility functions related to BigQuery.""" | ||
|
|
||
| import exceptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| else: | ||
| return avro_type | ||
|
|
||
| def update_bigquery_schema_on_append(schema_fields, output_table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return avro_type | ||
|
|
||
| def update_bigquery_schema_on_append(schema_fields, output_table): | ||
| # type: (bool) -> None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| raise RuntimeError('BigQuery schema update failed: %s' % str(e)) | ||
|
|
||
|
|
||
| def get_merged_field_schemas( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so if it's used in tests, we don't care?
kk, Done.
|
|
||
| from gcp_variant_transforms.beam_io import vcf_header_io | ||
| from gcp_variant_transforms.beam_io import vcfio # pylint: disable=unused-import | ||
| from gcp_variant_transforms.libs.annotation import annotation_parser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| cardinality as the alternate bases. Correct the num to be `None`. | ||
| - Defined type is `Integer`, but the provided value is float. Correct the | ||
| type to be `Float`. | ||
| Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| defined_header.get(_HeaderKeyConstants.VERSION)) | ||
| return None | ||
|
|
||
| def _infer_mismatched_format_field(field_key, # type: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| defined_header.get(_HeaderKeyConstants.DESC)) | ||
| return None | ||
|
|
||
| def _infer_standard_info_fields(variant, infos, defined_headers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap. I guess I'll rename to .._non_annotation_.. then.
Done.
| variant, infos, defined_headers, annotation_fields_to_infer) | ||
| return infos | ||
|
|
||
| def infer_format_fields(variant, defined_headers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sooo I personally prefer doing it other way - when we were working on Java, the style guide enforced that a method, if possible, should only be calling methods that were defined before it. However, that's Java, I have no idea about Python. Also if you guys already have been defining public methods first, we should continue doing that - inconsistency is the worst.
I'll modify the order.
allieychen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much Tural. I added a few nits.
|
|
||
| def update_bigquery_schema_on_append(schema_fields, output_table): | ||
| # type: (bool) -> None | ||
| # type: (bool, str) -> None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, Done.
| # type: (bool, str) -> None | ||
| # if table does not exist, do not need to update the schema. | ||
| # TODO (yifangchen): Move the logic into validate(). | ||
| """Update BQ schema by combining existing one with a new one, if possible.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: docstring is a string that is the first statement in a package, module, class or function. Read more here. However, if we have type annotation, the doc string goes below the type annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, kinda weird to have a comment, docstring and then a comment again :/. Done, nonetheless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any changes here.
I agree with your point. You can refactor it to something like:
"""Update BQ schema by combining existing one with a new one, if possible.
If table does not exist, do not need to update the schema.
TODO (yifangchen): Move the logic into validate().
"""
| # limitations under the License. | ||
|
|
||
| """A Helper module for Header Inference operations.""" | ||
| """A Helper module for header inference operations.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/Helper/helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return _BASE_ANNOTATION_TYPE_KEY.format(annot_field, name) | ||
|
|
||
| def infer_info_fields( | ||
| variant, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please add type for this variable and the one below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tneymanov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed the comments
|
|
||
| def update_bigquery_schema_on_append(schema_fields, output_table): | ||
| # type: (bool) -> None | ||
| # type: (bool, str) -> None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, Done.
| # type: (bool, str) -> None | ||
| # if table does not exist, do not need to update the schema. | ||
| # TODO (yifangchen): Move the logic into validate(). | ||
| """Update BQ schema by combining existing one with a new one, if possible.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, kinda weird to have a comment, docstring and then a comment again :/. Done, nonetheless.
| return _BASE_ANNOTATION_TYPE_KEY.format(annot_field, name) | ||
|
|
||
| def infer_info_fields( | ||
| variant, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| # limitations under the License. | ||
|
|
||
| """A Helper module for Header Inference operations.""" | ||
| """A Helper module for header inference operations.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
allieychen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please feel free to merge the code after you address the last comment.
| # type: (bool, str) -> None | ||
| # if table does not exist, do not need to update the schema. | ||
| # TODO (yifangchen): Move the logic into validate(). | ||
| """Update BQ schema by combining existing one with a new one, if possible.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any changes here.
I agree with your point. You can refactor it to something like:
"""Update BQ schema by combining existing one with a new one, if possible.
If table does not exist, do not need to update the schema.
TODO (yifangchen): Move the logic into validate().
"""
allieychen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
515d162 to
c6f611c
Compare
Continue with our effort to simplify PTransform code.