Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option for outputting gvcfs instead of vcfs #908

Open
asp8200 opened this issue Jan 11, 2023 · 8 comments
Open

Add option for outputting gvcfs instead of vcfs #908

asp8200 opened this issue Jan 11, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@asp8200
Copy link
Contributor

asp8200 commented Jan 11, 2023

Description of feature

It might be useful to add a CLI-option for outputting gvcf-files instead of vcf-files. It seems that with the current version that is only possible when running --joint_germline. (However, I could be wrong).

@asp8200 asp8200 added the enhancement New feature or request label Jan 11, 2023
@FriederikeHanssen
Copy link
Contributor

No that is correct, mostly based on some discussions we had with @nickhsmith that for most the final ones would be the most interesting . as well as the instructions for the single sample workflow by GATK here

We would have to see how we process with single sample filtering then. However, since that step seems to cause quite some issues for some, because of the time requirements it has, it might be good to also allow skipping it with a parameter. we could probably combine then both new parameters here.

@asp8200
Copy link
Contributor Author

asp8200 commented May 26, 2023

Some work has now been done in this direction.

The new sentieon-haplotyper-flow is able to output vcf, gvcf or both:

#1007

@amizeranschi
Copy link
Contributor

Is there any update on this regarding GATK? There was some recent progress on a related issue, but that was also related to joint_germline, and not the single sample workflow: #764

@FriederikeHanssen
Copy link
Contributor

No, when running GATK in single sample it is not using -ERC GVCF, so no gvcf are produced as according to the gatk docs:https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels- . You could still force it with a custom config, but not sure if we should at it in by default

@IsmailM
Copy link

IsmailM commented Aug 24, 2023

Thanks for the update.

I agree it probably shouldn't be the default, but it would be nice if it could be created if using a flag (--generate_gvcf etc.).

Would you mind providing details (or pointing me in the right direction/docs) on how to create a custom config to force the gvcf generation?

@FriederikeHanssen
Copy link
Contributor

sure, this should work:

process {

     withName: GATK4_HAPLOTYPECALLER {
       ext.args = "-ERC GVCF"
      ext.prefix = { meta.num_intervals <= 1 ? "${meta.id}.haplotypecaller.g" : "${meta.id}.haplotypecaller.${intervals.simpleName}.g"  }
     }
     
      withName: 'MERGE_HAPLOTYPECALLER' {
        ext.prefix       = { "${meta.id}.haplotypecaller.g" }
    }
}

should work.
single sample is by default enabled (CNNScoreVariants/FilterVarianttranches), but you can skip it if you'd like

@IsmailM
Copy link

IsmailM commented Oct 10, 2023

Hi Friederike,

Many thanks for your help here.

We managed to produce the gVCF successfully with your custom config.

However, even though the CNNScoreVariants finishes running (after 2 days), it does not seem to doesn't actually work, so the next step fails with a pretty telling error:

02:13:39.663 WARN  IntelInflater - Zero Bytes Written : 0
02:13:39.676 INFO  FilterVariantTranches - Finished pass 0 through the variants
02:13:39.676 INFO  FilterVariantTranches - Found 0 SNPs and 0 indels with INFO score key:CNN_1D.
02:13:39.676 INFO  FilterVariantTranches - Found 0 SNPs and 4894774 indels in the resources.
02:13:39.677 INFO  FilterVariantTranches - Filtered 0 SNPs out of 0 and filtered 0 indels out of 0 with INFO score: CNN_1D.
02:13:39.681 INFO  FilterVariantTranches - Shutting down engine
[October 6, 2023 at 2:13:39 AM GMT] org.broadinstitute.hellbender.tools.walkers.vqsr.FilterVariantTranches done. Elapsed time: 29.85 minutes.
Runtime.totalMemory()=2281701376
***********************************************************************


A USER ERROR has occurred: Bad input: VCF contains no variants or no variants with INFO score key "CNN_1D"

Any recommendations/hints here would be really appreciated.

(I have had a look at the documentation/google for CNNScoreVariants and there doesn't seem to be an option for gVcf (or any hints on how to run with gvcf).

(or should we simply skip the CNNScoreVariants / FilterVarianttranches steps)?

@FriederikeHanssen
Copy link
Contributor

FriederikeHanssen commented Oct 10, 2023

I guess CNNScoreVariants doesn't like gvcf. Yes you can skip by setting --skip_tools haplotypecaller_filter. You could ask on GATk forum if there is such an optionm would be happy to add it
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants