586 Multiallelic splitting #602

MattWellie · 2024-01-19T02:47:06Z

Closes #586

Proposal here

The relevant changes match the proposal in the linked document:

During per-interval joint genotyping, create a sites-only version of each fragment (for use in VQSR in fragments, and also as a whole VCF)
Also generate a version of each VCF with multiallelics split (to be joined, becoming the final callset VCF)
Also generate a sites-only version of each split fragment (for use in VEP)

This reverts commit c3fe7c5.

cassimons · 2024-01-29T21:05:03Z

cpg_workflows/jobs/joint_genotyping.py

+
+            # depend on the previous job
+            if siteonly_j:
+                siteonly_j.depends_on(jobs[-1])


nit: I am not sure how I feel about the use of jobs[-1] in this context.

It's safer to depend on all prior jobs, and Hail can handle much bigger dependency graphs than this. Will update.
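A minimal sketch of the "depend on all prior jobs" approach, for illustration only. StubJob and add_siteonly_job are hypothetical stand-ins, not code from this PR; the stub only mirrors the variadic shape of Hail Batch's Job.depends_on(*jobs) so the example runs without a Batch backend.

```python
from __future__ import annotations


class StubJob:
    """Hypothetical stand-in for a batch job; tracks a name and its dependencies."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.dependencies: list[StubJob] = []

    def depends_on(self, *jobs: StubJob) -> StubJob:
        # Variadic, mirroring the shape of Hail Batch's Job.depends_on(*jobs)
        self.dependencies.extend(jobs)
        return self


def add_siteonly_job(jobs: list[StubJob], name: str) -> StubJob:
    """Queue a job that waits on every previously queued job, not only jobs[-1]."""
    siteonly_j = StubJob(name)
    if jobs:
        # Unpack the whole list so the new job cannot start early if the
        # order of the list ever changes.
        siteonly_j.depends_on(*jobs)
    jobs.append(siteonly_j)
    return siteonly_j


jobs = [StubJob('genotype_0'), StubJob('split_0')]
siteonly_j = add_siteonly_job(jobs, 'siteonly_0')
print([j.name for j in siteonly_j.dependencies])
```

Depending on the full list costs a few extra edges in the graph but removes the implicit assumption that the last element is the right upstream job.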

cassimons · 2024-01-29T21:08:59Z

cpg_workflows/jobs/joint_genotyping.py

+
+    # bcftools norm splits multiallelic variants
+    # -m -any: split all multiallelic variant types
+    # -N: don't left-align indels (NO Normalisation)


I would be keen to discuss this at some point. I know it would require some further rejig of our pipelines, but it still feels like normalisation would make sense for us.
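To make the trade-off concrete, here is a hedged sketch of how the two variants of the command could be assembled. It is not the command used in this PR: bcftools_norm_cmd and its ref_fasta parameter are hypothetical, and the file names are placeholders. It assumes the standard bcftools norm options -m -any (split multiallelics), -N (do not normalise), and -f (reference FASTA, needed for left-alignment).

```python
from __future__ import annotations

import shlex


def bcftools_norm_cmd(in_vcf: str, out_vcf: str, ref_fasta: str | None = None) -> str:
    """Build a bcftools norm command that splits multiallelic sites.

    With no reference, indels are left as called (-N).
    With a reference, bcftools also left-aligns and normalises indels (-f).
    """
    args = ['bcftools', 'norm', '-m', '-any']
    if ref_fasta is None:
        args.append('-N')  # split only, no normalisation
    else:
        args += ['-f', ref_fasta]  # split and normalise against the reference
    args += ['-Oz', '-o', out_vcf, in_vcf]
    return shlex.join(args)


print(bcftools_norm_cmd('in.vcf.gz', 'out.vcf.gz'))
print(bcftools_norm_cmd('in.vcf.gz', 'out.vcf.gz', ref_fasta='ref.fa'))
```

Switching on normalisation can change variant positions and alleles, so anything downstream that keys on locus and alleles would need the same treatment.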

cassimons · 2024-01-29T21:10:29Z

cpg_workflows/query_modules/seqr_loader.py

@@ -88,14 +83,9 @@ def annotate_cohort(

    logging.info('Annotating with clinvar and munging annotation fields')
    mt = mt.annotate_rows(
-        AC=mt.info.AC[mt.a_index - 1],
-        AF=mt.info.AF[mt.a_index - 1],
+        AC=mt.info.AC[0],  # TODO not sure about these two


Do these need to go up into where the split happens now?

I need to check the behaviour here. After splitting, these should be either single numbers or lists of length 1.
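One way to stay safe until the behaviour is confirmed is to accept both shapes and fail loudly on anything else. The sketch below uses plain Python values rather than Hail expressions, and first_allele_value is a hypothetical helper, not part of seqr_loader.py.

```python
from __future__ import annotations


def first_allele_value(value):
    """Return a per-allele INFO value (e.g. AC or AF) for a split record.

    Accepts either a scalar or a list of length 1; a longer list means the
    record was not actually split, so raise rather than silently take [0].
    """
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError(f'expected one value per split record, got {len(value)}')
        return value[0]
    return value


print(first_allele_value([3]))
print(first_allele_value(0.25))
```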

EddieLF · 2024-01-30T00:48:28Z

cpg_workflows/stages/joint_genotyping.py

@@ -67,17 +71,27 @@ def queue_jobs(self, cohort: Cohort, inputs: StageInput) -> StageOutput | None:
        vcf_path = self.expected_outputs(cohort)['vcf']
        siteonly_vcf_path = self.expected_outputs(cohort)['siteonly']
        scatter_count = joint_calling_scatter_count(len(cohort.get_sequencing_groups()))
+        # vcf framents, stripped of genotypes


Suggested change

-        # vcf framents, stripped of genotypes
+        # vcf fragments, stripped of genotypes

EddieLF · 2024-01-30T00:56:33Z

cpg_workflows/stages/joint_genotyping.py


        jc_jobs = joint_genotyping.make_joint_genotyping_jobs(
            b=get_batch(),
            out_vcf_path=vcf_path,
+            out_split_sitesonly_vcf_part_paths=out_split_vcf_part_paths,


The make_joint_genotyping_jobs changes in jobs/joint_genotyping.py add two new arguments:

out_split_vcf_part_paths: list[Path] | None = None,
out_split_sitesonly_vcf_part_paths: list[Path] | None = None,

But only the latter is passed in here. Does this check out? Is it so that we trigger the "# if requested, make a site-only VCF for this split fragment (for VEP)" condition? And would we only have an out_split_vcf_part_paths argument to pass in if the split VCF had already been made?

Sorry, this is a bit convoluted. I confused myself a bit.

This process makes all the fragments separately and joins them up, with the code offering to save the end product, the fragments, or both. The files that we want at the end of this process are:

The whole VCF with split multiallelics and genotypes (for combining annotations -> Seqr)

The whole VCF without splitting and without genotypes (for VQSR)

The fragments that are both split and without genotypes (for VEP)

All these VCF fragments are generated; presence in this list just determines which are generated in tmp and which are persisted to GCP. I think it's correct to:

Not persist unsplit sites-only fragments (VQSR re-splits the whole sites-only VCF based on different logic. I think doing that is dumb, but that's a topic for another PR)

Persist split VCF with genotypes for annotateCohort later on

Persist the split & sites-only VCF fragments for feeding into VEP

I've made a bunch of corrections to bring it in line with this intent ^ which it was definitely not doing before
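The "tmp unless a persistent path was requested" rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the logic in make_joint_genotyping_jobs: fragment_output_path, the bucket names, and the file naming are all hypothetical.

```python
from __future__ import annotations


def fragment_output_path(
    idx: int,
    persist_paths: list[str] | None,
    tmp_prefix: str,
    suffix: str,
) -> str:
    """Pick where fragment idx is written.

    Every fragment is always generated; a list of persistent paths only
    changes the destination from the temporary prefix to permanent storage.
    """
    if persist_paths is not None:
        return persist_paths[idx]
    return f'{tmp_prefix}/part{idx}{suffix}'


# Hypothetical paths: split sites-only fragments are persisted for VEP,
# unsplit sites-only fragments stay in tmp.
split_sitesonly = ['gs://example-main/split_sitesonly/part0.vcf.gz']
print(fragment_output_path(0, split_sitesonly, 'gs://example-tmp/jg', '.split.siteonly.vcf.gz'))
print(fragment_output_path(0, None, 'gs://example-tmp/jg', '.siteonly.vcf.gz'))
```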

MattWellie · 2024-01-31T04:24:07Z

Test batches keep failing when re-aggregating all the VEP data
https://batch.hail.populationgenomics.org.au/batches/432509

MattWellie · 2024-02-15T04:46:19Z

Latest batch with the highmem worker patch has succeeded - requires evaluation
https://batch.hail.populationgenomics.org.au/batches/433214?q=&last_job_id=50

MattWellie added 9 commits January 19, 2024 11:47

add new Split VCF jobs

7e0451b

VQSR docstring clarification

e55ff91

essential VEP changes

3357b79

essential VEP changes 2

bf61b08

Optional VEP changes

c3fe7c5

Revert "Optional VEP changes"

4d5b42a

This reverts commit c3fe7c5.

sites-only version of the split vcf for VEP

14d3bd4

change input path for JG stage

d65b93d

Merge branch 'main' into 586_multiallelic-splitting

059576a

MattWellie requested review from vivbak, cassimons and EddieLF January 28, 2024 23:46

cassimons reviewed Jan 29, 2024

EddieLF reviewed Jan 30, 2024

MattWellie added 8 commits January 30, 2024 12:53

fix quite a few potential bugs

1612bde

don't pass argument for split parts

f815410

push that last change

1437140

maybe this?

a13da15

Job split name correction

1b073d0

input file name correction

505091a

sort split output file

c54a081

comment change

e61a079

MattWellie added 2 commits February 14, 2024 12:12

highmem worker for vep coercion

e0e7009

force-bump cpg-utils

fc04d12

MattWellie added 3 commits February 15, 2024 15:46

Merge branch 'main' into 586_multiallelic-splitting

38ff18b

remove VQSR dependency for pipeline VEP Stage

f5b17e2

black

d408b4f
