[Prototype] Parallel exports of Hail Structures -> VCFs #629

Closed
wants to merge 22 commits

Conversation

@MattWellie (Contributor) commented Feb 27, 2024

What sucks

Hail has a pretty cool way of generating VCF data from a parallelised data structure, and we're not using it

What's cool

Our current RD pipeline's gVCF -> joint calling in intervals -> VCF fragments flow may (hopefully?) be replaced with the VDS combiner at some point, so the way the RD pipeline generates and handles VCF fragments will need a rethink. This could be a base for that, instead of manual interval generation.

Hail's repartition with shuffle=True ensures a balanced distribution of variants across partitions; each partition is then exported as a separate VCF fragment, which would be ideal for feeding into VEP/VQSR/whatever.
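A minimal sketch of that idea (the input path and partition count are illustrative, not taken from the prototype):

```python
import hail as hl

# illustrative input path and partition count
mt = hl.read_matrix_table('gs://my-bucket/dataset.mt')

# shuffle=True forces a full shuffle, giving evenly sized partitions
mt = mt.repartition(500, shuffle=True)

# one bgzipped VCF fragment per partition, with the header written separately
hl.export_vcf(mt, 'gs://my-bucket/fragments.vcf.bgz', parallel='separate_header')
```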

What's this

A proof-of-concept script that takes a Hail Table/MatrixTable/VDS and exports it as a VCF. Instead of our current process, this uses a parallel export plus post-processing:

  1. Load a Hail object, and force a repartition (if it isn't already partitioned appropriately)
  2. Slim the object down to sites-only (if required)
  3. Write one VCF file per partition, with a separate header file
  4. Open the shard-manifest file, which lists all shard names in genomic coordinate order
  5. Group the shards into smaller clusters, and create a batch job to concatenate all files within each cluster (this probably isn't a necessary step, but Anna recently encountered a 73,300-shard VDS, and at some point we'd hit the bash argument limit again if we tried to cat all of those together. Alternatively, repartition to a realistic cap before writing VCF fragments - this is just a prototype...)
  6. A final job gathers all intermediate files, cats them together, tabixes the output, and writes it to the output location (see the sketch after the example runs below)
  • Example run using a VDS here (approx. 8 mins total for sites-only)
  • Example run using a MT here (similar time; it failed due to storage provisioning, which has since been fixed)
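A rough sketch of the grouping/concatenation side (steps 4-6), assuming the parallel export wrote a shard-manifest.txt into the output directory; the chunk size, file names, and printing plain shell commands (rather than building real batch jobs) are illustrative only:

```python
from pathlib import Path

CHUNK_SIZE = 100  # illustrative cap, well below any bash argument limit

export_dir = Path('fragments.vcf.bgz')  # directory written by the parallel export
manifest = (export_dir / 'shard-manifest.txt').read_text().splitlines()
shards = [str(export_dir / name) for name in manifest]

# group shards into clusters, preserving genomic coordinate order
chunks = [shards[i:i + CHUNK_SIZE] for i in range(0, len(shards), CHUNK_SIZE)]

# one concatenation command per cluster - each of these would be a separate batch job
chunk_outputs = []
for idx, chunk in enumerate(chunks):
    out = f'chunk_{idx:05d}.vcf.bgz'
    chunk_outputs.append(out)
    print(f"cat {' '.join(chunk)} > {out}")

# final gather: header first, then the ordered chunks, then index
# ('header.bgz' is a placeholder for whatever the separate header file is called)
print(f"cat header.bgz {' '.join(chunk_outputs)} > final.vcf.bgz")
print('tabix -p vcf final.vcf.bgz')
```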

This wouldn't slot perfectly into a pipeline in its current form - the shard manifest has to exist before the concatenation jobs can be scheduled. That can be overcome in a few ways, but I'm not keen to develop this any further if it's just a curio.

As far as I've been able to test, this works great: variant ordering is maintained, and each file is block-gzipped, so the fragments can just be cat'd together without any specific tooling. Happy days.
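For context on why plain cat is enough: BGZF is gzip with many small members, and a concatenation of gzip members is still a valid gzip stream, so no decompress/recompress cycle is needed. A tiny illustration (file names hypothetical):

```python
import gzip
import shutil

# byte-for-byte concatenation of two block-gzipped VCF fragments
with open('combined.vcf.bgz', 'wb') as out:
    for shard in ('part-00000.bgz', 'part-00001.bgz'):
        with open(shard, 'rb') as f:
            shutil.copyfileobj(f, out)

# gzip readers skip across member boundaries, so the combined file
# decompresses as one continuous VCF body
with gzip.open('combined.vcf.bgz', 'rt') as f:
    n_lines = sum(1 for _ in f)
print(n_lines)
```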

However, I've only used it on minimal examples with no filtering/processing steps, and I'm not sure it's any faster than the native export. I'll try it on some larger datasets.

@illusional (Contributor) left a comment:

Keen to follow along with this! To avoid daily notifications, I'm leaving a review, but re-request if there's something specific I can look at.

@michael-harper (Contributor) commented:
This looks good 👌 I'm keen to take a dig at this when time permits
