Skip to content

[query] Teach hail to export a VDS to an SVCR-VCF#14743

Merged
hail-ci-robot merged 7 commits intohail-is:mainfrom
chrisvittal:vds/export-vcf
Oct 30, 2024
Merged

[query] Teach hail to export a VDS to an SVCR-VCF#14743
hail-ci-robot merged 7 commits intohail-is:mainfrom
chrisvittal:vds/export-vcf

Conversation

@chrisvittal
Copy link
Collaborator

Usage and notes in the documentation for the new methods.

Part of #14655.

Security Assessment

  • This change has no security impact

Impact Description

Does not increase attack surface. New methods only call into old ones.

Comment on lines +71 to +72
dataset : :class:`.MatrixTable`
Dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it a VariantDataset?

ref = ref.drop('END')

if 'gvcf_info' in var.entry and isinstance(var.gvcf_info.dtype, tstruct):
warning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to warn here? You've documented this above and given an example for how to preserve this information.

I view warnings as a way of telling users about changing behaviour or when they've been a bit naughty (but not terrible) and this doesn't really fall into that. What's your view?

if 'LGT' in var.entry:
if 'GT' not in var.entry:
var = var.annotate_entries(GT=lgt_to_gt(var.LGT, var.LA))
var = var.drop('LGT')
Copy link
Member

@ehigham ehigham Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on dropping these fields if GT/PT are present? Is that safe to do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's a data error if lgt_to_gt(var.LGT, var.LA) != var.GT.

I'm thinking about writing the reverse (gt_to_lgt). Not entirely sure why one would use it, which is what's stopping me.

_create_row_uids=bool,
_create_col_uids=bool,
)
def import_vcf(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to call this import_svcr to make it distinct from the extant import_vcf?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But I like the symmetry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think import_vcf is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.

----
The following validations and transformations take place:

#. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the types listed in FORMAT and INFO sections be vcf types rather than hail types?

The following validations and transformations take place:

#. The ``LEN`` FORMAT field must exist and be of type :py:data:`.tint32`
#. One of ``GT`` or ``LGT`` must be a FORMAT field of type :py:data:`.tcall`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should LGT exist given you rename/drop it in export?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to preclude it from existing, it's well formed to have it.

Parameters
----------
path : :class:`str` or :obj:`list` of :obj:`str`
One or more paths to VCF files to read. Each path may or may not include glob expressions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is it lol?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will fix both this and import_vcf

Copy link
Member

@ehigham ehigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly a review on the documentation for new public-facing functionality. Your implementation looks good. It's nice how much existing functionality you could use.

Copy link
Member

@ehigham ehigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for making the changes.

Copy link
Member

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, nice and clean! Just have a few simple questions and typos.

Comment on lines +95 to +96
if 'END' in ref.entry:
ref = ref.drop('END')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget, does VDS now have a manditory LEN field, and optional END field?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

'information."\n'
)
if VariantDataset.ref_block_max_length_field in ref.globals:
rbml = hl.eval(ref[VariantDataset.ref_block_max_length_field])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially be bad if the VDS isn't coming directly from disk, though we'd hopefully be able to optimize this read of a global field to not do much work. Do you think we should recommend writing before exporting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary. We do optimize away the entire computation if we can simply read one global field. We're not that bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain that's always guaranteed (what if we add a global field as the result of an aggregation?), but I think you're right that it would be rare enough we should scare people into writing when it isn't necessary, especially in this case when we can be pretty certain the max length field comes from disk.

_create_row_uids=bool,
_create_col_uids=bool,
)
def import_vcf(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think import_vcf is actually more correct. We use "svcr" to refer to the logical representation, of which "vds" is our native implementation. The new vcf standard now supports an svcr-like representation, which is what this imports from.

assert 'LAA' in format_md, format_md
assert 'LA' not in format_md, format_md
assert 'gvcf_info' not in format_md, format_md
# LPGT is in this VDS, so we can check for PGT existing and LPGT not existing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment backwards?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but it's clearly unclear

chrisvittal and others added 2 commits October 30, 2024 10:51
Co-authored-by: Patrick Schultz <pschultz@broadinstitute.org>
Copy link
Member

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris!

@hail-ci-robot hail-ci-robot merged commit 056d9f2 into hail-is:main Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants