-
Notifications
You must be signed in to change notification settings - Fork 59
Add new --omit_empty_sample_calls option. #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new --omit_empty_sample_calls option. #105
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ryan! This is awesome! Our first external contribution :)
I added a few comments. Please also check the lint errors (you'd need to pip install pylint and run pylint gcp_variant_transforms or ./run_presubmit to run both the tests and pylint).
| """Yields BigQuery rows according to the schema from the given variant. | ||
| There is a 10MB limit for each BigQuqery row, which can exceed by having | ||
| There is a 10MB limit for each BigQuery row, which can be exceeded by having |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching and fixing these! :)
| for call in variant.calls: | ||
| call_record = _get_call_record(call) | ||
|
|
||
| if omit_empty_sample_calls: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's cleaner to modify _get_call_record to return call_record, is_empty. You can then omit the call here based on is_empty.
As for the logic in _get_call_record, you can check for genotype emptiness using call.genotype and then add a condition on the return value of _get_bigquery_sanitized_field.
This avoids the 2nd loop and is easier to follow b/c it uses the native object instead of the transformed map.
P.S. I don't think 'name' can be empty per schema. Also, phaseset doesn't really mean anything if genotype is empty, so we can ignore those fields.
| # Omit empty calls, e.g. samples that don't have this variant. | ||
| for key, val in call_record.items(): | ||
| if key == 'genotype': | ||
| if val and set(val) != set((vcfio.MISSING_GENOTYPE_VALUE,)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, genotype is actually an integer and we use '-1' to denote missing field. Please use vcfio.MISSING_GENOTYPE_VALUE.
P.S. These kinds of bugs will be easier to catch once we have more thorough integration tests. Our plan is to have at least one integration test for every option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! I just noticed you were using the correct value and I misread it! Please ignore my comment.
| if key == 'genotype': | ||
| if val and set(val) != set((vcfio.MISSING_GENOTYPE_VALUE,)): | ||
| break | ||
| elif (key != 'name' and val and val != vcfio.MISSING_FIELD_VALUE and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating an _is_empty_field function that returns true for "empty fields". This is a bit tricky (especially since 0 evaluates to False in python), so I feel like we should case on type to return a correct value (ugly, I know :/). I'll look around to see if there is a better way of doing this (crazy idea: can we just cast to string and check with a regular expression for just \.\]\[\{\}\(\)?)
|
thanks for the review! these suggestions all sound good. I'll ping once I've incorporated them. |
|
done. definitely cleaner this way, thanks for the suggestions. i kept the boolean coercion in i'm also happy to squash the two commits if you want, just let me know. |
6c5c12a to
0be71af
Compare
Pull Request Test Coverage Report for Build 292
💛 - Coveralls |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ryan! Just two more nits.
Please also squash the commits (a convention used by our team). You can also use --amend so that you only work on one commit.
See https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docs/development_guide.md#updating-changes
And please ignore the coverage error. We recently added it and I wasn't expecting it to be a "failure". I'll look into how to treat it as "FYI" rather than failed check.
| """A helper method for ``get_rows_from_variant`` to get a call as JSON. | ||
| Args: | ||
| call: PyVCF Variant call to convert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is actually ``VariantCall`` inside vcfio.
Please use:
call (``VariantCall``): Variant call to convert.
| ColumnKeyConstants.CALLS_PHASESET: call.phaseset, | ||
| ColumnKeyConstants.CALLS_GENOTYPE: call.genotype or [] | ||
| } | ||
| is_empty = _is_empty_field('genotype', call.genotype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of treating genotype as a special field and just using is_empty_field for other info fields? This avoids special-casing genotype inside is_empty_field
In other words, something like: is_empty = not call.genotype or set(call.genotype) == set((vcfio.MISSING_GENOTYPE_VALUE,))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure! done. i actually considered this originally, but didn't do it because the boolean value check gets duplicated. i'm fine either way though!
24101c5 to
7912bed
Compare
|
thanks! all done, and squashed. btw, we also use coveralls, and had the same concern with it as a check on github, since it's overly sensitive and marks any drop as a failure. we still use it, but we disconnected it from github and track it more offline, over the longer term. |
...to skip empty calls, e.g. samples that don't have a given variant. Useful for multi-sample VCFs, especially when most variants are sparse and are only called in a small fraction of samples.
7912bed to
f0e3d9a
Compare
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks!
|
woo, thanks for merging! |
|
I'll release a new docker image with this change once our integration tests pass :) |
|
FYI, the docker image is updated as well! |
...to skip empty calls, e.g. samples that don't have a given variant. Useful for multi-sample VCFs, especially when most variants are sparse and are only called in a small fraction of samples.
This switches from translating VCFs literally to using BigTable's support for repeated columns and sparse storage more effectively.
I'm using this to load our CNVs, which generally have hundreds of samples but usually just one sample per call. This makes the resulting rows in BigTable easier to work with, since each
callrow is an actual call.I'm flexible on pretty much all the details and code here. Open to other ideas. Thanks in advance!