[query] Teach hail to export a VDS to an SVCR-VCF#14743
[query] Teach hail to export a VDS to an SVCR-VCF#14743hail-ci-robot merged 7 commits intohail-is:mainfrom
Conversation
hail/python/hail/vds/impex.py
Outdated
| dataset : :class:`.MatrixTable` | ||
| Dataset. |
hail/python/hail/vds/impex.py
Outdated
| ref = ref.drop('END') | ||
|
|
||
| if 'gvcf_info' in var.entry and isinstance(var.gvcf_info.dtype, tstruct): | ||
| warning( |
There was a problem hiding this comment.
Do you need to warn here? You've documented this above and given an example for how to preserve this information.
I view warnings as a way of telling users about changing behaviour or when they've been a bit naughty (but not terrible) and this doesn't really fall into that. What's your view?
| if 'LGT' in var.entry: | ||
| if 'GT' not in var.entry: | ||
| var = var.annotate_entries(GT=lgt_to_gt(var.LGT, var.LA)) | ||
| var = var.drop('LGT') |
There was a problem hiding this comment.
Can you comment on dropping these fields if GT/PT are present? Is that safe to do?
There was a problem hiding this comment.
Yes. It's a data error if lgt_to_gt(var.LGT, var.LA) != var.GT.
I'm thinking about writing the reverse (gt_to_lgt). Not entirely sure why one would use it, which is what's stopping me.
| _create_row_uids=bool, | ||
| _create_col_uids=bool, | ||
| ) | ||
| def import_vcf( |
There was a problem hiding this comment.
would it make sense to call this import_svcr to make it distinct from the extant import_vcf?
There was a problem hiding this comment.
Yes. But I like the symmetry.
There was a problem hiding this comment.
I think import_vcf is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.
hail/python/hail/vds/impex.py
Outdated
| ---- | ||
| The following validations and transformations take place: | ||
|
|
||
| #. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32` |
There was a problem hiding this comment.
Should the types listed in FORMAT and INFO sections be vcf types rather than hail types?
hail/python/hail/vds/impex.py
Outdated
| The following validations and transformations take place: | ||
|
|
||
| #. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32` | ||
| #. One of ``GT`` or ``LGT`` must be a FORMAT field of type :py:data:`.tcall` |
There was a problem hiding this comment.
Should LGT exist given you rename/drop it in export?
There was a problem hiding this comment.
I don't want to preclude it from existing, it's well formed to have it.
hail/python/hail/vds/impex.py
Outdated
| Parameters | ||
| ---------- | ||
| path : :class:`str` or :obj:`list` of :obj:`str` | ||
| One or more paths to VCF files to read. Each path may or may not include glob expressions |
There was a problem hiding this comment.
Good point, will fix both this and import_vcf
ehigham
left a comment
There was a problem hiding this comment.
Mostly a review on the documentation for new public-facing functionality. Your implementation looks good. It's nice how much existing functionality you could use.
Usage and notes in the documentation for the new methods.
ba167ab to
67602a0
Compare
ehigham
left a comment
There was a problem hiding this comment.
Looks good. Thanks for making the changes.
patrick-schultz
left a comment
There was a problem hiding this comment.
This looks great, nice and clean! Just have a few simple questions and typos.
| if 'END' in ref.entry: | ||
| ref = ref.drop('END') |
There was a problem hiding this comment.
I forget, does VDS now have a manditory LEN field, and optional END field?
| 'information."\n' | ||
| ) | ||
| if VariantDataset.ref_block_max_length_field in ref.globals: | ||
| rbml = hl.eval(ref[VariantDataset.ref_block_max_length_field]) |
There was a problem hiding this comment.
This could potentially be bad if the VDS isn't coming directly from disk, though we'd hopefully be able to optimize this read of a global field to not do much work. Do you think we should recommend writing before exporting?
There was a problem hiding this comment.
I don't think it's necessary. We do optimize away the entire computation if we can simply read one global field. We're not that bad.
There was a problem hiding this comment.
I'm not certain that's always guaranteed (what if we add a global field as the result of an aggregation?), but I think you're right that it would be rare enough we should scare people into writing when it isn't necessary, especially in this case when we can be pretty certain the max length field comes from disk.
| _create_row_uids=bool, | ||
| _create_col_uids=bool, | ||
| ) | ||
| def import_vcf( |
There was a problem hiding this comment.
I think import_vcf is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.
| assert 'LAA' in format_md, format_md | ||
| assert 'LA' not in format_md, format_md | ||
| assert 'gvcf_info' not in format_md, format_md | ||
| # LPGT is in this VDS, so we can check for PGT existing and LPGT not existing |
There was a problem hiding this comment.
Is this comment backwards?
There was a problem hiding this comment.
no, but it's clearly unclear
Co-authored-by: Patrick Schultz <pschultz@broadinstitute.org>
Usage and notes in the documentation for the new methods.
Part of #14655.
Security Assessment
Impact Description
Does not increase attack surface. New methods only call into old ones.