Skip to content

Conversation

@nmousavi
Copy link
Contributor

@nmousavi nmousavi commented Feb 16, 2018

Add a PTransform to extract header fields (FORMAT, INFO) form variants.
This feature is activated by a flag, and is useful when there are header
fields in variant with no definition in headers. Without this PTransform
we currently fail the pipeline in such cases.

Related to issue #61, #101, #119

Tested:
unit test, ran pipeline for 1000genome dataset

@coveralls
Copy link

coveralls commented Feb 16, 2018

Pull Request Test Coverage Report for Build 332

  • 117 of 128 (91.41%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.09%) to 90.478%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/options/variant_transform_options.py 0 1 0.0%
gcp_variant_transforms/transforms/infer_undefined_headers.py 50 52 96.15%
gcp_variant_transforms/vcf_to_bq.py 2 10 20.0%
Files with Coverage Reduction New Missed Lines %
gcp_variant_transforms/vcf_to_bq.py 1 52.63%
Totals Coverage Status
Change from base Build 321: 0.09%
Covered Lines: 3031
Relevant Lines: 3350

💛 - Coveralls

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool! Thanks! Made some comments mainly related to naming and formatting. The design looks good overall.

# from variants, therefore we let the default value (True) for it
# be used.
| 'MergeHeaders' >> merge_headers.MergeHeaders(
split_alternate_allele_info_fields=True))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since it defaults to true, you can just not specify it here. I'd even say to remove the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the comment so we remember to change this default choice if, in future, we are able to deduce NIM=A from variants only. For the sake of clarity in the comment, I kept the arg with default value.

section of the VCF files. This is used to skip already defined
header fields.
"""
self._headers = headers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming to defined_header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

| 'WriteHeaders' >> vcf_header_io.WriteVcfHeaders(
known_args.representative_header_file))

merged_headers = (headers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming to merged_header since we expect one header object at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

_read_variants(p, known_args)
| 'FilterVariants' >> filter_variants.FilterVariants(
reference_names=known_args.reference_names)
| 'ExtractHeaderFieldsFromVariants' >>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming to InferUndefinedHeaderFields (I think you can drop 'FromVariants') and the return object would be inferred_headers as we should distinguish between defined and inferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

| 'MergeHeaders' >> merge_headers.MergeHeaders(
known_args.split_alternate_allele_info_fields))
if known_args.robust_header_extraction:
headers_from_variants = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider extracting the body of this if as separate function that just returns a merged_header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

infos[info_field_key] = Info(info_field_key,
self._get_field_counts(info_field_value),
self._get_field_type(info_field_value),
_NO_DESCRIPTION,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use empty strings for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for call in variant.calls:
for format_key, format_value in call.info.iteritems():
if not self._is_defined_format_field(format_key, headers):
assert format_key not in formats, (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider raising a ValueError instead of asserting (e.g. we may want to catch the error and do something with it later instead of crashing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for format_key, format_value in call.info.iteritems():
if not self._is_defined_format_field(format_key, headers):
assert format_key not in formats, (
"Bad VCF file. Duplicate FORMAT field. %s" %format_key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/Bad VCF file/Invalid VCF record.
And please also print the entire variant record.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done



def process(self, variant, headers):
return self._extract_header_fields(variant, headers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the body of self._extract_header_fields is simple enough to be inlined here. That way, you can use yield instead of sometimes returning empty list as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

'even if this is provided.'))

parser.add_argument(
'--robust_header_extraction',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like infer_undefined_headers is a more representative name for this flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@nmousavi nmousavi force-pushed the schema branch 2 times, most recently from 2694de1 to 16ed0fb Compare February 22, 2018 20:33
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just a few more nits ...

@@ -0,0 +1,119 @@
# Copyright 2017 Google Inc. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018 :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# See the License for the specific language governing permissions and
# limitations under the License.

"""A PTransform to output a PCollection of single ``HeaderFields`` extracted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/HeaderFields/VcfHeader here and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

# See the License for the specific language governing permissions and
# limitations under the License.

"""A PTransform to output a PCollection of single ``HeaderFields`` extracted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The style guide says to have a 1-line summary for modules and then a more descriptive version below it separated by a blank line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

__all__ = ['InferUndefinedHeaderFields']


_FLAG = 'Flag'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a class VcfFieldType with values like this (similar to ColumnKeyConstants in bigquery_vcf_schema).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

def __init__(self, defined_headers):
"""Initializes the transform.
Args:
defined_headers (:class:``vcf_header_io.VCFHeader``): side input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the new type format, but for now, please at least add args sections for the methods so that we can change it easily later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def __init__(self, defined_headers):
"""Initializes the transform.
Args:
defined_headers (:class:``vcf_header_io.VCFHeader``): side input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/side/Side

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def __init__(self, defined_headers):
"""Initializes the transform.
Args:
defined_headers (:class:``vcf_header_io.VCFHeader``): side input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: when using :class: , you only need one backquote.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@nmousavi
Copy link
Contributor Author

PTAL

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just comment nits :)

# See the License for the specific language governing permissions and
# limitations under the License.

"""A PTransform to infer undefined header fields"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please add a dot at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

"""
Args:
field_value (list, bool, integer, or string): value for the field
returned by pyvcf. E.g. [0.33, 0.66] is a field value for Allele freq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/pyvcf/PyVCF

"""
Args:
field_value (list, bool, integer, or string): value for the field
returned by pyvcf. E.g. [0.33, 0.66] is a field value for Allele freq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/for Allele freq../for the allele frequency (AF) field.
(no need to imply semantics of the field here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. sorry, i went back and forth in my comment about implying the semantics lol and I just decided to reword your original comment. Please ignore my "no need to imply semantics..." comment as it doesnt make sense.

field_value (list, bool, integer, or string): value for the field
returned by pyvcf. E.g. [0.33, 0.66] is a field value for Allele freq
field (AF).
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Returns as well (here and below).

return VcfFieldType.STRING

def _infer_undefined_info_fields(self, variant, defined_headers):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a short description as well, since it's not obvious that it only returns fields not defined in the header (for both info and format).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

return infos

def _infer_undefined_format_fields(self, variant, defined_headers):
formats = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add args/returns here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

@nmousavi nmousavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

Add a PTransform to extract header fields (FORMAT, INFO) form variants.
This feature is activated by a flag, and is useful when there are header
fields in variant with no definition in headers. Without this PTransform
we currently fail the pipeline in such cases.
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@nmousavi nmousavi merged commit 9de46ac into googlegenomics:master Feb 23, 2018
@nmousavi nmousavi deleted the schema branch April 26, 2018 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants