Skip to content

Conversation

@snarfed
Copy link
Contributor

@snarfed snarfed commented Feb 3, 2018

...to skip empty calls, e.g. samples that don't have a given variant. Useful for multi-sample VCFs, especially when most variants are sparse and are only called in a small fraction of samples.

This switches from translating VCFs literally to using BigTable's support for repeated columns and sparse storage more effectively.

I'm using this to load our CNVs, which generally have hundreds of samples but usually just one sample per call. This makes the resulting rows in BigTable easier to work with, since each call row is an actual call.

I'm flexible on pretty much all the details and code here. Open to other ideas. Thanks in advance!

@arostamianfar arostamianfar self-requested a review February 5, 2018 14:59
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ryan! This is awesome! Our first external contribution :)
I added a few comments. Please also check the lint errors (you'd need to pip install pylint and run pylint gcp_variant_transforms or ./run_presubmit to run both the tests and pylint).

"""Yields BigQuery rows according to the schema from the given variant.
There is a 10MB limit for each BigQuqery row, which can exceed by having
There is a 10MB limit for each BigQuery row, which can be exceeded by having
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching and fixing these! :)

for call in variant.calls:
call_record = _get_call_record(call)

if omit_empty_sample_calls:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cleaner to modify _get_call_record to return call_record, is_empty. You can then omit the call here based on is_empty.
As for the logic in _get_call_record, you can check for genotype emptiness using call.genotype and then add a condition on the return value of _get_bigquery_sanitized_field.
This avoids the 2nd loop and is easier to follow b/c it uses the native object instead of the transformed map.
P.S. I don't think 'name' can be empty per schema. Also, phaseset doesn't really mean anything if genotype is empty, so we can ignore those fields.

# Omit empty calls, e.g. samples that don't have this variant.
for key, val in call_record.items():
if key == 'genotype':
if val and set(val) != set((vcfio.MISSING_GENOTYPE_VALUE,)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, genotype is actually an integer and we use '-1' to denote missing field. Please use vcfio.MISSING_GENOTYPE_VALUE.
P.S. These kinds of bugs will be easier to catch once we have more thorough integration tests. Our plan is to have at least one integration test for every option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! I just noticed you were using the correct value and I misread it! Please ignore my comment.

if key == 'genotype':
if val and set(val) != set((vcfio.MISSING_GENOTYPE_VALUE,)):
break
elif (key != 'name' and val and val != vcfio.MISSING_FIELD_VALUE and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider creating an _is_empty_field function that returns true for "empty fields". This is a bit tricky (especially since 0 evaluates to False in python), so I feel like we should case on type to return a correct value (ugly, I know :/). I'll look around to see if there is a better way of doing this (crazy idea: can we just cast to string and check with a regular expression for just \.\]\[\{\}\(\)?)

@snarfed
Copy link
Contributor Author

snarfed commented Feb 6, 2018

thanks for the review! these suggestions all sound good. I'll ping once I've incorporated them.

@snarfed
Copy link
Contributor Author

snarfed commented Feb 6, 2018

done. definitely cleaner this way, thanks for the suggestions.

i kept the boolean coercion in _is_empty_field() and just check for 0 explicitly. apart from genotype, input values all come from _get_bigquery_sanitized_field, so i'm not too worried about unusual types accidentally boolean evaluating to False. otherwise, checking for an explicit set of known empty values might be the next best thing, e.g. None, {}, [], (), '', set(), frozenset().

i'm also happy to squash the two commits if you want, just let me know.

@snarfed snarfed force-pushed the omit_empty_sample_calls branch 2 times, most recently from 6c5c12a to 0be71af Compare February 6, 2018 02:18
@coveralls
Copy link

coveralls commented Feb 6, 2018

Pull Request Test Coverage Report for Build 292

  • 19 of 21 (90.48%) changed or added relevant lines in 4 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.02%) to 90.607%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/options/variant_transform_options.py 0 1 0.0%
gcp_variant_transforms/transforms/variant_to_bigquery.py 2 3 66.67%
Files with Coverage Reduction New Missed Lines %
gcp_variant_transforms/transforms/variant_to_bigquery.py 2 68.18%
Totals Coverage Status
Change from base Build 284: -0.02%
Covered Lines: 2778
Relevant Lines: 3066

💛 - Coveralls

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ryan! Just two more nits.
Please also squash the commits (a convention used by our team). You can also use --amend so that you only work on one commit.
See https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docs/development_guide.md#updating-changes

And please ignore the coverage error. We recently added it and I wasn't expecting it to be a "failure". I'll look into how to treat it as "FYI" rather than failed check.

"""A helper method for ``get_rows_from_variant`` to get a call as JSON.
Args:
call: PyVCF Variant call to convert.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is actually ``VariantCall`` inside vcfio.
Please use:
call (``VariantCall``): Variant call to convert.

ColumnKeyConstants.CALLS_PHASESET: call.phaseset,
ColumnKeyConstants.CALLS_GENOTYPE: call.genotype or []
}
is_empty = _is_empty_field('genotype', call.genotype)
Copy link
Contributor

@arostamianfar arostamianfar Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of treating genotype as a special field and just using is_empty_field for other info fields? This avoids special-casing genotype inside is_empty_field
In other words, something like: is_empty = not call.genotype or set(call.genotype) == set((vcfio.MISSING_GENOTYPE_VALUE,))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure! done. i actually considered this originally, but didn't do it because the boolean value check gets duplicated. i'm fine either way though!

@snarfed snarfed force-pushed the omit_empty_sample_calls branch 3 times, most recently from 24101c5 to 7912bed Compare February 6, 2018 16:17
@snarfed
Copy link
Contributor Author

snarfed commented Feb 6, 2018

thanks! all done, and squashed.

btw, we also use coveralls, and had the same concern with it as a check on github, since it's overly sensitive and marks any drop as a failure. we still use it, but we disconnected it from github and track it more offline, over the longer term.

...to skip empty calls, e.g. samples that don't have a given variant. Useful for multi-sample VCFs, especially when most variants are sparse and are only called in a small fraction of samples.
@snarfed snarfed force-pushed the omit_empty_sample_calls branch from 7912bed to f0e3d9a Compare February 6, 2018 16:31
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@arostamianfar arostamianfar merged commit 9ab7430 into googlegenomics:master Feb 6, 2018
@snarfed
Copy link
Contributor Author

snarfed commented Feb 6, 2018

woo, thanks for merging!

@arostamianfar
Copy link
Contributor

I'll release a new docker image with this change once our integration tests pass :)

@arostamianfar
Copy link
Contributor

FYI, the docker image is updated as well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants