Kids First DRC Joint Genotyping Workflow

Kids First Data Resource Center Joint Genotyping Workflow (cram-to-deNovoGVCF). Cohort sample variant calling and genotype refinement.

Using existing gVCFs, typically produced by GATK HaplotypeCaller, this workflow follows the GATK best-practices workflow Germline short variant discovery (SNPs + Indels) to produce family joint calls and joint trio (typically mother-father-child) variant calls. Peddy is run to flag potential issues in family relationship definitions and sex assignment.

If you would like to run this workflow using the Cavatica public app, a basic primer on running public apps can be found here. Alternatively, if you'd like to run it locally using cwltool, a basic primer on that can be found here; combine it with the app-specific information in the readme below. This workflow is the current production workflow, equivalent to this Cavatica public app.

Runtime Estimates

  • Single 5 GB gVCF Input: 90 Minutes & $2.25
  • Trio of 6 GB gVCFs Input: 240 Minutes & $3.25

Tips To Run:

  1. The input VCF files are the gVCF files from GATK HaplotypeCaller; their .tbi index files must be copied to the same project too.
  2. If you are experiencing issues with Variant Recalibration, either in VariantRecalibrator or ApplyVQSR, consider adjusting max_gaussians. If a dataset yields fewer variants than the expected scale, the number of Gaussians used for training should be turned down. Lowering max_gaussians forces the program to group variants into a smaller number of clusters, which results in more variants per cluster.
  3. The ped file input describes the family relationships between samples. Its format should be the same as the one documented on the GATK website (link), and the Individual ID, Paternal ID and Maternal ID must match the sample names in the headers of the input VCF files.
  4. We recommend using GRCh38 as the reference genome for the analysis; positions in the gVCFs should be GRCh38 too.
  5. Reference locations:
  6. Suggested inputs:
    • Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
    • Homo_sapiens_assembly38.dbsnp138.vcf
    • hapmap_3.3.hg38.vcf.gz
    • Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    • 1000G_omni2.5.hg38.vcf.gz
    • 1000G_phase1.snps.high_confidence.hg38.vcf.gz
    • Homo_sapiens_assembly38.dict
    • Homo_sapiens_assembly38.fasta.fai
    • Homo_sapiens_assembly38.fasta
    • 1000G_phase3_v4_20130502.sites.hg38.vcf
    • hg38.even.handcurated.20k.intervals
    • homo_sapiens_vep_93_GRCh38_convert_cache.tar.gz, from ftp://ftp.ensembl.org/pub/release-93/variation/indexed_vep_cache/ - variant effect predictor cache.
    • wgs_evaluation_regions.hg38.interval_list
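
Tips 1 and 3 can be checked before submitting a run. The following is a minimal Python sketch, not part of the workflow itself. It assumes each index sits next to its gVCF as `<gvcf>.tbi`, and that the ped file is whitespace-delimited with the columns family, individual, father, mother, sex, phenotype, with `0` marking an unknown parent. All function names here are hypothetical helpers introduced for illustration.

```python
import gzip
import os


def missing_indexes(gvcf_paths):
    """Return the gVCF paths that have no '<path>.tbi' file next to them."""
    return [p for p in gvcf_paths if not os.path.exists(p + ".tbi")]


def vcf_samples(gvcf_path):
    """Read sample names from the '#CHROM' header line of a gzipped VCF."""
    with gzip.open(gvcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; the rest are samples.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # header ended without a #CHROM line
    return []


def ped_ids(ped_path):
    """Collect individual, paternal and maternal IDs from a ped file."""
    ids = set()
    with open(ped_path) as handle:
        for line in handle:
            fields = line.split()
            if line.startswith("#") or len(fields) < 4:
                continue  # skip comments and malformed lines
            # Assumed columns: family, individual, father, mother, sex, phenotype.
            # "0" is treated as "parent unknown" and is not expected in any header.
            ids.update(f for f in fields[1:4] if f != "0")
    return ids


def unmatched_ped_ids(ped_path, gvcf_paths):
    """Return ped IDs that do not appear as a sample name in any input gVCF."""
    samples = set()
    for path in gvcf_paths:
        samples.update(vcf_samples(path))
    return sorted(ped_ids(ped_path) - samples)
```

Calling `missing_indexes(paths)` and `unmatched_ped_ids("family.ped", paths)` on your own list of gVCF paths (the file name here is a placeholder) should both return empty lists when the inputs satisfy tips 1 and 3. Any ID returned by the second call is spelled differently in the ped file than in the gVCF headers, or belongs to a sample whose gVCF was not supplied.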

Other Resources

pipeline flowchart