-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
586 Multiallelic splitting #602
base: main
Are you sure you want to change the base?
Conversation
This reverts commit c3fe7c5.
|
||
# depend on the previous job | ||
if siteonly_j: | ||
siteonly_j.depends_on(jobs[-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I am not sure how I feel about the use of jobs[-1] in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safer to depend on all prior jobs, and Hail can do much bigger dependency graphs than this. Will update
|
||
# bcftools norm splits multiallelic variants | ||
# -m -any: split all multiallelic variant types | ||
# -N: don't left-align indels (NO Normalisation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be keen to discuss this at some point. I know it would require some further rejig of our pipelines, but it still feels like normalisation would make sense for us.
@@ -88,14 +83,9 @@ def annotate_cohort( | |||
|
|||
logging.info('Annotating with clinvar and munging annotation fields') | |||
mt = mt.annotate_rows( | |||
AC=mt.info.AC[mt.a_index - 1], | |||
AF=mt.info.AF[mt.a_index - 1], | |||
AC=mt.info.AC[0], # TODO not sure about these two |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to go up into where the split happens now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to check the behaviour here, the splitting should mean these are either single numbers, or lists of length 1
@@ -67,17 +71,27 @@ def queue_jobs(self, cohort: Cohort, inputs: StageInput) -> StageOutput | None: | |||
vcf_path = self.expected_outputs(cohort)['vcf'] | |||
siteonly_vcf_path = self.expected_outputs(cohort)['siteonly'] | |||
scatter_count = joint_calling_scatter_count(len(cohort.get_sequencing_groups())) | |||
# vcf framents, stripped of genotypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# vcf framents, stripped of genotypes | |
# vcf fragments, stripped of genotypes |
|
||
jc_jobs = joint_genotyping.make_joint_genotyping_jobs( | ||
b=get_batch(), | ||
out_vcf_path=vcf_path, | ||
out_split_sitesonly_vcf_part_paths=out_split_vcf_part_paths, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the make_joint_genotyping_jobs
changes in the jobs/joint_genotyping
, two new arguments are added:
out_split_vcf_part_paths: list[Path] | None = None,
out_split_sitesonly_vcf_part_paths: list[Path] | None = None,
But only the latter is passed in here, does this check out? Is it because we want to trigger the # if requested, make a site-only VCF for this split fragment (for VEP)
condition? And that we would only have a out_split_vcf_part_paths
argument to pass in if the split vcf had already been made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this is a bit convoluted. I confused myself a bit.
This process makes all the fragments separately and joins them up, with the code offering to save the end product, the fragments, or both. The files that we want at the end of this process are:
- The whole VCF with split multiallelics and genotypes (for combining annotations -> Seqr)
- The whole VCF without splitting, with no genotypes for VQSR
- The fragments which are both split and no genotypes for VEP
All these VCF fragments are generated, presence in this list just determines which are generated in tmp, and which are persisted to GCP. I think it's correct to:
- Not persist unsplit sites-only fragments (VQSR re-splits the whole sites-only VCF based on different logic. I think doing that is dumb, but the topic for another PR)
- Persist split VCF with genotypes for annotateCohort later on
- Persist the split & sites-only VCF fragments for feeding into VEP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a bunch of corrections to bring it in line with this intent ^ which it was definitely not doing before
Test batches keep failing when re-aggregating all the VEP data |
Latest batch with the highmem worker patch has succeeded - requires evaluation |
Closes #586
Proposal here
The relevant changes match the proposal in the linked document: