Basic multi-sample variant calling workflow with GATK

ibebio/vc-gatk4-snakemake

Snakemake workflow: GATK variant calling using gVCFs and hard filtering

Authors

  • Snakemake workflow: Ilja Bezrukov

Usage

Step 1: Obtain a copy of this workflow

Clone the repository into the place where you want to perform the data analysis.

git clone https://github.com/ibebio/vc-gatk4-snakemake.git

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and samples.csv to specify your sample setup.

Run the following command to make the required scripts executable:

chmod u+x workflow/scripts/*.*

Step 3: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake "snakemake==5.28.0" "python>=3.7"

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

For the Weigel lab, set up your SGE cluster profile as follows:

git clone https://github.com/ibebio/snakemake_profiles.git
cd snakemake_profiles
mkdir -p ~/.config/snakemake/
chmod u+x sge/*.py
cp -r sge ~/.config/snakemake/

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake -n

A helper script, run-workflow.sh, is included to run the workflow conveniently, either locally or on the cluster. To run the pipeline on the SGE cluster, as set up previously:

./run-workflow.sh sge

To run it on the local machine:

./run-workflow.sh local

To customize how many cores and jobs are used, you can either modify the run-workflow.sh script or run the commands required to run the workflow by hand, as described below.

To clean up all output files and conda environments to rerun the workflow from scratch, the helper script clean-all.sh is included.

Run the workflow with custom settings

Execute the workflow locally via

snakemake --use-conda --cores $N --scheduler greedy

using $N cores.

To run it in a cluster environment, first create all required conda environments via

snakemake --use-conda --conda-create-envs-only --cores 4

Then, run the workflow via

snakemake --use-conda --profile sge --jobs 100 --scheduler greedy

The number of jobs can be adjusted as required. Additional arguments for Snakemake can also be supplied.
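The two invocations above differ only in where the jobs run and how parallelism is capped. As a minimal sketch of how they could be wrapped behind a single switch, the hypothetical function below prints the matching command for a given mode. The function name `snakemake_cmd` and its defaults are assumptions made for illustration; this is not the content of run-workflow.sh.

```shell
# Hypothetical helper, NOT the contents of run-workflow.sh.
# Prints the Snakemake command for a mode so it can be inspected before running.
#   $1: "local" or "sge"
#   $2: number of cores (local) or number of jobs (sge); defaults are assumptions
snakemake_cmd() {
    local mode="$1" n="$2"
    case "$mode" in
        local)
            # Local run: cap the number of cores (assumed default: 1)
            echo "snakemake --use-conda --cores ${n:-1} --scheduler greedy"
            ;;
        sge)
            # Cluster run: use the sge profile and cap the number of submitted jobs
            echo "snakemake --use-conda --profile sge --jobs ${n:-100} --scheduler greedy"
            ;;
        *)
            echo "usage: snakemake_cmd {local|sge} [count]" >&2
            return 1
            ;;
    esac
}
```

For example, `snakemake_cmd local 8` prints the local command for 8 cores. Printing the command instead of running it directly makes it easy to review it, or to append `-n` for a dry run, before executing it.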

Step 5: Investigate results

All output is stored in the results/ subfolder. Logs for each step are stored in logs/.

The pipeline produces the following outputs:

Filtered variants

results/variants/filtered/all.vcf contains SNPs and INDELs for all samples filtered according to the variant_filtering section in the config file.

Biallelic SNPs

results/variants/filtered/biallelic-snps.vcf contains only biallelic SNPs. Non-PASS variants are removed, and a requirement on the missing fraction (default 0.1) is added. These variants are the basis for further analysis.
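As a quick sanity check on this file, the sketch below counts the records that are PASS, have exactly one ALT allele, and have single-base REF and ALT. It is an independent check written for illustration, not the filter the workflow itself applies. It assumes an uncompressed VCF with the standard tab-separated columns (REF in column 4, ALT in column 5, FILTER in column 7); the function name is hypothetical, and the missing-fraction requirement is not checked.

```shell
# Hypothetical sanity check, not part of the workflow.
# Counts records that are PASS and are biallelic single-base substitutions.
#   $1: path to an uncompressed VCF file
count_biallelic_pass_snps() {
    awk -F'\t' '
        /^#/ { next }                  # skip meta-information and header lines
        $7 == "PASS" &&                # FILTER column must be PASS
        $4 ~ /^[ACGT]$/ &&             # REF is a single base
        $5 ~ /^[ACGT]$/ { n++ }        # ALT is a single base, so exactly one allele
        END { print n + 0 }            # print 0 when nothing matched
    ' "$1"
}
```

Running `count_biallelic_pass_snps results/variants/filtered/biallelic-snps.vcf` and comparing the result with the total number of non-header lines (`grep -vc '^#'` on the same file) should give the same count if every record is a PASS biallelic SNP.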

FASTA files for specific regions for each sample

The directory results/region_fasta/ contains FASTA nucleotide sequence files for specific regions, e.g. effectors. The files are generated separately for each sample and contain the variants from the previous step (Biallelic SNPs) that are present in that sample. They are named {sample}.{region}.fasta. The regions are defined in the regions section of the config file.
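Because the file names follow the {sample}.{region}.fasta pattern, the expected set of files can be enumerated and checked after a run. The sketch below is a hypothetical helper, not part of the workflow. The sample and region names passed to it are placeholders that you would replace with the names from your own samples.csv and config file.

```shell
# Hypothetical helpers, not part of the workflow.
# Print the expected FASTA path for every sample/region combination.
#   $1: space-separated sample names
#   $2: space-separated region names
#   $3: output directory (defaults to results/region_fasta)
expected_region_fastas() {
    local samples="$1" regions="$2" dir="${3:-results/region_fasta}"
    local s r
    for s in $samples; do
        for r in $regions; do
            echo "${dir}/${s}.${r}.fasta"
        done
    done
}

# Report every expected file that does not exist on disk.
report_missing_fastas() {
    expected_region_fastas "$@" | while read -r f; do
        [ -e "$f" ] || echo "missing: $f"
    done
}
```

For example, `report_missing_fastas "sampleA sampleB" "region1"` lists any of the two expected files that are absent. The placeholder names must not contain spaces, since the lists are split on whitespace.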

Step 6: Obtain updates from upstream

Whenever you want to synchronize your workflow copy with bugfixes or new developments from the upstream repository, do the following:

  1. At the very least, your config files will differ from the upstream examples, so save your changes before obtaining the upstream copy: git stash
  2. Obtain the updates from the GitHub repository: git pull
  3. Restore your modifications to the config files: git stash pop

The above steps assume that you did not modify any part of the workflow except the config files. If the config format has changed upstream, you might need to update your config files accordingly.

The workflow/ folder contains the Snakemake files and scripts that are needed to run the workflow. It does not need to be changed unless the workflow has to be modified.

See the Snakemake documentation for further details.
