[Prototype] Parallel exports of Hail Structures -> VCFs #629

Closed
wants to merge 22 commits

Conversation

@MattWellie (Contributor) commented Feb 27, 2024

What sucks

Hail has a pretty cool way of generating VCF data from a parallelised data structure, and we're not using it

What's cool

Our current RD pipeline's gVCF -> joint calling in intervals -> VCF fragments flow may (hopefully?) be replaced with the VDS combiner at some point, so the way the RD pipeline generates and handles VCF fragments will need a rethink. This could be a base for that, instead of manual interval generation.

Hail's repartition with shuffle=True ensures a balanced distribution of variants across partitions; each partition is then exported as a separate VCF fragment, which would be ideal for feeding into VEP/VQSR/whatever.
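A minimal sketch of that idea (the input path and partition count are illustrative, not taken from the prototype):

```python
import hail as hl

# illustrative input path and partition count
mt = hl.read_matrix_table('gs://my-bucket/dataset.mt')

# shuffle=True forces a full shuffle, giving evenly sized partitions
mt = mt.repartition(500, shuffle=True)

# one bgzipped VCF fragment per partition, with the header written separately
hl.export_vcf(mt, 'gs://my-bucket/fragments.vcf.bgz', parallel='separate_header')
```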

What's this

A proof-of-concept script that takes a Hail Table/MatrixTable/VDS and exports it as a VCF. Instead of our current process, this uses a parallel export plus post-processing:

  1. Load a Hail object, and force a repartition (if it isn't already partitioned appropriately)
  2. Slim the object down to sites-only (if required)
  3. Write one VCF file per partition, with a separate header file
  4. Open the shard-manifest file, which lists all shard names in genomic coordinate order
  5. Group the shards into smaller clusters, and create a batch job to concatenate all files within each cluster (this probably isn't a necessary step, but Anna recently encountered a 73,300-shard VDS, and at some point we'd hit the bash argument limit again if we tried to cat all of those together. Alternatively, repartition to a realistic cap before writing VCF fragments - this is just a prototype...)
  6. A final job gathers all intermediate files, cats them together, tabixes the output, and writes it to the output location (see the sketch after the example runs below)
  • Example run using a VDS here (approx. 8 mins total for sites-only)
  • Example run using a MT here (similar time; it failed due to storage provisioning, which has since been fixed)
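A rough sketch of the grouping/concatenation side (steps 4-6), assuming the parallel export wrote a shard-manifest.txt into the output directory; the chunk size, file names, and printing plain shell commands (rather than building real batch jobs) are illustrative only:

```python
from pathlib import Path

CHUNK_SIZE = 100  # illustrative cap, well below any bash argument limit

export_dir = Path('fragments.vcf.bgz')  # directory written by the parallel export
manifest = (export_dir / 'shard-manifest.txt').read_text().splitlines()
shards = [str(export_dir / name) for name in manifest]

# group shards into clusters, preserving genomic coordinate order
chunks = [shards[i:i + CHUNK_SIZE] for i in range(0, len(shards), CHUNK_SIZE)]

# one concatenation command per cluster - each of these would be a separate batch job
chunk_outputs = []
for idx, chunk in enumerate(chunks):
    out = f'chunk_{idx:05d}.vcf.bgz'
    chunk_outputs.append(out)
    print(f"cat {' '.join(chunk)} > {out}")

# final gather: header first, then the ordered chunks, then index
# ('header.bgz' is a placeholder for whatever the separate header file is called)
print(f"cat header.bgz {' '.join(chunk_outputs)} > final.vcf.bgz")
print('tabix -p vcf final.vcf.bgz')
```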

This wouldn't slot perfectly into a pipeline in its current form - the shard manifest has to exist before the concatenation jobs can be scheduled. That can be overcome in a few ways, but I'm not keen to develop this any further if it's just a curio.

As far as I've been able to test, this works great: variant ordering is maintained, and each file is block-gzipped, so the fragments can just be cat'd together without any specific tooling. Happy days.
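For context on why plain cat is enough: BGZF is gzip with many small members, and a concatenation of gzip members is still a valid gzip stream, so no decompress/recompress cycle is needed. A tiny illustration (file names hypothetical):

```python
import gzip
import shutil

# byte-for-byte concatenation of two block-gzipped VCF fragments
with open('combined.vcf.bgz', 'wb') as out:
    for shard in ('part-00000.bgz', 'part-00001.bgz'):
        with open(shard, 'rb') as f:
            shutil.copyfileobj(f, out)

# gzip readers skip across member boundaries, so the combined file
# decompresses as one continuous VCF body
with gzip.open('combined.vcf.bgz', 'rt') as f:
    n_lines = sum(1 for _ in f)
print(n_lines)
```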

However, I've only used it on minimal examples with no filtering/processing steps, and I'm not sure it's any faster than the native export. I'll try it on some larger datasets.

@illusional (Contributor) left a comment:

Keen to follow along with this! To avoid daily notifications, I'm leaving a review, but re-request if there's something specific I can look at.

@michael-harper (Contributor) commented:
This looks good 👌 I'm keen to take a dig at this when time permits
