Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

586 Multiallelic splitting #602

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

Conversation

MattWellie
Copy link
Contributor

Closes #586

Proposal here

The relevant changes match the proposal in the linked document:

  • during joint genotyping-per-interval, create a sites-only version of each fragment (for use in VQSR in fragments, and also as a whole VCF)
  • also generate a version of each VCF with multiallelics split (to be joined, becoming the final callset VCF)
  • ALSO generate a sites-only version of each split fragment (for use in VEP)

cpg_workflows/jobs/joint_genotyping.py Show resolved Hide resolved

# depend on the previous job
if siteonly_j:
siteonly_j.depends_on(jobs[-1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I am not sure how I feel about the use of jobs[-1] in this context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safer to depend on all prior jobs, and Hail can do much bigger dependency graphs than this. Will update


# bcftools norm splits multiallelic variants
# -m -any: split all multiallelic variant types
# -N: don't left-align indels (NO Normalisation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be keen to discuss this at some point. I know it would require some further rejig of our pipelines, but it still feels like normalisation would make sense for us.

@@ -88,14 +83,9 @@ def annotate_cohort(

logging.info('Annotating with clinvar and munging annotation fields')
mt = mt.annotate_rows(
AC=mt.info.AC[mt.a_index - 1],
AF=mt.info.AF[mt.a_index - 1],
AC=mt.info.AC[0], # TODO not sure about these two
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to go up into where the split happens now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check the behaviour here, the splitting should mean these are either single numbers, or lists of length 1

@@ -67,17 +71,27 @@ def queue_jobs(self, cohort: Cohort, inputs: StageInput) -> StageOutput | None:
vcf_path = self.expected_outputs(cohort)['vcf']
siteonly_vcf_path = self.expected_outputs(cohort)['siteonly']
scatter_count = joint_calling_scatter_count(len(cohort.get_sequencing_groups()))
# vcf framents, stripped of genotypes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# vcf framents, stripped of genotypes
# vcf fragments, stripped of genotypes


jc_jobs = joint_genotyping.make_joint_genotyping_jobs(
b=get_batch(),
out_vcf_path=vcf_path,
out_split_sitesonly_vcf_part_paths=out_split_vcf_part_paths,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the make_joint_genotyping_jobs changes in the jobs/joint_genotyping, two new arguments are added:

out_split_vcf_part_paths: list[Path] | None = None,
out_split_sitesonly_vcf_part_paths: list[Path] | None = None,

But only the latter is passed in here, does this check out? Is it because we want to trigger the # if requested, make a site-only VCF for this split fragment (for VEP) condition? And that we would only have a out_split_vcf_part_paths argument to pass in if the split vcf had already been made?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this is a bit convoluted. I confused myself a bit.

This process makes all the fragments separately and joins them up, with the code offering to save the end product, the fragments, or both. The files that we want at the end of this process are:

  1. The whole VCF with split multiallelics and genotypes (for combining annotations -> Seqr)
  2. The whole VCF without splitting, with no genotypes for VQSR
  3. The fragments which are both split and no genotypes for VEP

All these VCF fragments are generated, presence in this list just determines which are generated in tmp, and which are persisted to GCP. I think it's correct to:

  1. Not persist unsplit sites-only fragments (VQSR re-splits the whole sites-only VCF based on different logic. I think doing that is dumb, but the topic for another PR)
  2. Persist split VCF with genotypes for annotateCohort later on
  3. Persist the split & sites-only VCF fragments for feeding into VEP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a bunch of corrections to bring it in line with this intent ^ which it was definitely not doing before

@MattWellie
Copy link
Contributor Author

Test batches keep failing when re-aggregating all the VEP data
https://batch.hail.populationgenomics.org.au/batches/432509

@MattWellie
Copy link
Contributor Author

Latest batch with the highmem worker patch has succeeded - requires evaluation
https://batch.hail.populationgenomics.org.au/batches/433214?q=&last_job_id=50

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Split Multiallelic Variants in main pipeline
3 participants