[Prototype] Parallel exports of Hail Structures -> VCFs #629
Closed
MattWellie
requested review from
cassimons,
illusional,
vivbak and
michael-harper
February 27, 2024 07:30
illusional
reviewed
Feb 28, 2024
Keen to follow this along! To avoid daily notifications, leaving a review but re-request if there's something specific I can look at.
This looks good 👌 I'm keen to take a dig at this when time permits
What sucks
Hail has a pretty cool way of generating VCF data from a parallelised data structure, and we're not using it
What's cool
Our current RD pipeline's gVCF -> Joint calling in intervals -> VCF fragments may (hopefully?) be replaced with the VDS combiner at some point, so the way the RD pipeline generates/handles VCF fragments will need a re-think - this could be a base for that, instead of manual interval generation.
Hail's repartition with shuffle=True would ensure balanced distribution of variants in each partition, then each partition is exported as a separate VCF fragment, which would be optimal for feeding into VEP/VQSR/whatever.
What's this
Proof of concept script which takes a Hail Table/MatrixTable/VDS, and exports as a VCF. Instead of our current process, this uses a parallel export + post-processing.
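For illustration only, here is a minimal sketch of what a repartition + parallel export could look like. It is not the script in this PR: it assumes a MatrixTable input and Hail's `hl.export_vcf` with `parallel='separate_header'`, and the function names, paths and the `variants_per_fragment` target are all hypothetical.

```python
import math


def partitions_for(n_variants: int, variants_per_fragment: int) -> int:
    """Partition count so each partition holds roughly variants_per_fragment rows.

    Hypothetical helper: the target fragment size is an assumption, not a
    value taken from the PR.
    """
    return max(1, math.ceil(n_variants / variants_per_fragment))


def export_parallel_vcf(
    mt_path: str, out_path: str, variants_per_fragment: int = 100_000
) -> None:
    """Repartition a MatrixTable and export one VCF fragment per partition."""
    # Imported lazily so partitions_for() can be used without Hail installed
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    n_parts = partitions_for(mt.count_rows(), variants_per_fragment)

    # shuffle=True forces a full shuffle, which should even out partition sizes
    mt = mt.repartition(n_parts, shuffle=True)

    # Parallel export: one shard per partition, with the header written separately
    hl.export_vcf(mt, out_path, parallel='separate_header')
```

A Table or VDS input would need its own read/convert step before the export; that part is left out here.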
This wouldn't slot perfectly into a pipeline in its current form - the shard manifest has to exist before the downstream jobs can be scheduled. That can be overcome in a few ways, but I'm not keen to develop this any further if it's just a curio.
As far as I've been able to test, this works great. Variant ordering is maintained, each file is block-zipped so they can just be cat'd together without specific tooling. Happy days.
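The "just cat them together" property comes from gzip allowing several compressed members back to back in one file, and block-zipped (BGZF) files are a series of such members. A small stdlib-only sketch of the idea, using plain gzip as a stand-in for BGZF; the file names (`header.bgz`, `part-*.bgz`) and the toy records are assumptions for the demo, not output from the actual script:

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_shards(header: Path, shards: list[Path], out: Path) -> None:
    """Byte-wise concatenation: header first, then shards in the given order."""
    with out.open('wb') as dst:
        for part in [header, *shards]:
            with part.open('rb') as src:
                shutil.copyfileobj(src, dst)


def demo() -> str:
    """Write a toy header and two shards as gzip files, join them, read back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pieces = {
            'header.bgz': '##fileformat=VCFv4.2\n'
                          '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n',
            'part-00000.bgz': 'chr1\t100\t.\tA\tG\t.\t.\t.\n',
            'part-00001.bgz': 'chr1\t200\t.\tC\tT\t.\t.\t.\n',
        }
        for name, text in pieces.items():
            (root / name).write_bytes(gzip.compress(text.encode()))

        # Sorting by name keeps shard order, and therefore variant order
        shards = sorted(root.glob('part-*.bgz'))
        out = root / 'combined.vcf.bgz'
        concat_shards(root / 'header.bgz', shards, out)

        # gzip.decompress reads all concatenated members in sequence
        return gzip.decompress(out.read_bytes()).decode()


if __name__ == '__main__':
    print(demo(), end='')
```

Variant order in the combined file depends entirely on the order the shards are concatenated in, so the shard list should come from the manifest or a sorted listing rather than an arbitrary glob.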
However, I've only used it on minimal examples with no filtering/processing steps, and I'm not sure it's any faster than the native export. Will try on some larger datasets.