-
Notifications
You must be signed in to change notification settings - Fork 59
Extract header fields from variants. #114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 332
💛 - Coveralls |
85bd40f to
e4b8b4d
Compare
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool! Thanks! Made some comments mainly related to naming and formatting. The design looks good overall.
| # from variants, therefore we let the default value (True) for it | ||
| # be used. | ||
| | 'MergeHeaders' >> merge_headers.MergeHeaders( | ||
| split_alternate_allele_info_fields=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since it defaults to true, you can just not specify it here. I'd even say to remove the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the comment so we remember to change this default choice if, in future, we are able to deduce NIM=A from variants only. For the sake of clarity in the comment, I kept the arg with default value.
| section of the VCF files. This is used to skip already defined | ||
| header fields. | ||
| """ | ||
| self._headers = headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming to defined_header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
gcp_variant_transforms/vcf_to_bq.py
Outdated
| | 'WriteHeaders' >> vcf_header_io.WriteVcfHeaders( | ||
| known_args.representative_header_file)) | ||
|
|
||
| merged_headers = (headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming to merged_header since we expect one header object at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
gcp_variant_transforms/vcf_to_bq.py
Outdated
| _read_variants(p, known_args) | ||
| | 'FilterVariants' >> filter_variants.FilterVariants( | ||
| reference_names=known_args.reference_names) | ||
| | 'ExtractHeaderFieldsFromVariants' >> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming to InferUndefinedHeaderFields (I think you can drop 'FromVariants') and the return object would be inferred_headers as we should distinguish between defined and inferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
gcp_variant_transforms/vcf_to_bq.py
Outdated
| | 'MergeHeaders' >> merge_headers.MergeHeaders( | ||
| known_args.split_alternate_allele_info_fields)) | ||
| if known_args.robust_header_extraction: | ||
| headers_from_variants = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider extracting the body of this if as separate function that just returns a merged_header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| infos[info_field_key] = Info(info_field_key, | ||
| self._get_field_counts(info_field_value), | ||
| self._get_field_type(info_field_value), | ||
| _NO_DESCRIPTION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use empty strings for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| for call in variant.calls: | ||
| for format_key, format_value in call.info.iteritems(): | ||
| if not self._is_defined_format_field(format_key, headers): | ||
| assert format_key not in formats, ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider raising a ValueError instead of asserting (e.g. we may want to catch the error and do something with it later instead of crashing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| for format_key, format_value in call.info.iteritems(): | ||
| if not self._is_defined_format_field(format_key, headers): | ||
| assert format_key not in formats, ( | ||
| "Bad VCF file. Duplicate FORMAT field. %s" %format_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/Bad VCF file/Invalid VCF record.
And please also print the entire variant record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
|
|
||
| def process(self, variant, headers): | ||
| return self._extract_header_fields(variant, headers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the body of self._extract_header_fields is simple enough to be inlined here. That way, you can use yield instead of sometimes returning empty list as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| 'even if this is provided.')) | ||
|
|
||
| parser.add_argument( | ||
| '--robust_header_extraction', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something like infer_undefined_headers is a more representative name for this flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
2694de1 to
16ed0fb
Compare
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Just a few more nits ...
| @@ -0,0 +1,119 @@ | |||
| # Copyright 2017 Google Inc. All Rights Reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2018 :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """A PTransform to output a PCollection of single ``HeaderFields`` extracted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/HeaderFields/VcfHeader here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """A PTransform to output a PCollection of single ``HeaderFields`` extracted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The style guide says to have a 1-line summary for modules and then a more descriptive version below it separated by a blank line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| __all__ = ['InferUndefinedHeaderFields'] | ||
|
|
||
|
|
||
| _FLAG = 'Flag' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a class VcfFieldType with values like this (similar to ColumnKeyConstants in bigquery_vcf_schema).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| def __init__(self, defined_headers): | ||
| """Initializes the transform. | ||
| Args: | ||
| defined_headers (:class:``vcf_header_io.VCFHeader``): side input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the new type format, but for now, please at least add args sections for the methods so that we can change it easily later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| def __init__(self, defined_headers): | ||
| """Initializes the transform. | ||
| Args: | ||
| defined_headers (:class:``vcf_header_io.VCFHeader``): side input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/side/Side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| def __init__(self, defined_headers): | ||
| """Initializes the transform. | ||
| Args: | ||
| defined_headers (:class:``vcf_header_io.VCFHeader``): side input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: when using :class: , you only need one backquote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
PTAL |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just comment nits :)
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """A PTransform to infer undefined header fields""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please add a dot at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| """ | ||
| Args: | ||
| field_value (list, bool, integer, or string): value for the field | ||
| returned by pyvcf. E.g. [0.33, 0.66] is a field value for Allele freq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/pyvcf/PyVCF
| """ | ||
| Args: | ||
| field_value (list, bool, integer, or string): value for the field | ||
| returned by pyvcf. E.g. [0.33, 0.66] is a field value for Allele freq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/for Allele freq../for the allele frequency (AF) field.
(no need to imply semantics of the field here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S. sorry, i went back and forth in my comment about implying the semantics lol and I just decided to reword your original comment. Please ignore my "no need to imply semantics..." comment as it doesnt make sense.
| field_value (list, bool, integer, or string): value for the field | ||
| returned by pyvcf. E.g. [0.33, 0.66] is a field value for Allele freq | ||
| field (AF). | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add Returns as well (here and below).
| return VcfFieldType.STRING | ||
|
|
||
| def _infer_undefined_info_fields(self, variant, defined_headers): | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a short description as well, since it's not obvious that it only returns fields not defined in the header (for both info and format).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| return infos | ||
|
|
||
| def _infer_undefined_format_fields(self, variant, defined_headers): | ||
| formats = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add args/returns here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
nmousavi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
Add a PTransform to extract header fields (FORMAT, INFO) form variants. This feature is activated by a flag, and is useful when there are header fields in variant with no definition in headers. Without this PTransform we currently fail the pipeline in such cases.
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Add a PTransform to extract header fields (FORMAT, INFO) form variants.
This feature is activated by a flag, and is useful when there are header
fields in variant with no definition in headers. Without this PTransform
we currently fail the pipeline in such cases.
Related to issue #61, #101, #119
Tested:
unit test, ran pipeline for 1000genome dataset