
Potato mapping and variant calling from short read sequencing

Before running, install the Python3 version of Miniconda

Installation instructions

Once Miniconda is installed, install software dependencies using the included environment.yaml file. To set up a conda environment, see the tutorial:

Environment creation instructions
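
As a rough sketch, the environment can be created from the included file like this (the environment's name is defined inside environment.yaml; check `conda env list` if unsure):

    # Create the environment from the included dependency file
    conda env create -f environment.yaml

    # List environments to find the name defined in environment.yaml
    conda env list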

To activate this environment on the command line:

source activate <name_of_environment>

This Snakemake workflow runs the following steps (a sketch of the underlying commands follows the list):

  1. Download the DM1-3 reference genome from http://solanaceae.plantbiology.msu.edu/pgsc_download.shtml
  2. Download Illumina sequence data for all publicly available samples in units.tsv (those that start with "SRR")
  3. Read quality and adapter trimming
  4. BWA mem alignment to the reference, including SAM-to-BAM conversion, BAM sorting, and samtools indexing. Paired-end reads should not be interleaved; they are parsed from units.tsv by giving the reverse-read file name in the fq2 column (see units.tsv for an example). For single-end reads, set the fq2 field of that row to "NaN"
  5. HaplotypeCaller from GATK4 in GVCF mode on each sample, scattered across intervals.
  6. Combine GVCFs across biological samples using GATK4 CombineGVCFs
  7. Joint genotype calling on per-sample GVCFs using GATK4 GenotypeGVCFs
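
For orientation, the core commands wrapped by steps 4 through 7 resemble the following. This is a sketch only: reference and read file names, thread counts, and interval paths are placeholders, not the workflow's actual values.

    # Step 4: BWA mem alignment piped to samtools for sorting, then indexing
    bwa mem -t 4 ref.fa sample_1.trimmed.fq.gz sample_2.trimmed.fq.gz \
        | samtools sort -o sample.sorted.bam -
    samtools index sample.sorted.bam

    # Step 5: per-sample GVCF calling with HaplotypeCaller over one scatter interval
    gatk HaplotypeCaller -R ref.fa -I sample.sorted.bam \
        -L intervals/region_01.list -ERC GVCF -O sample.region_01.g.vcf.gz

    # Step 6: merge per-sample GVCFs into one cohort GVCF
    gatk CombineGVCFs -R ref.fa \
        -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O cohort.g.vcf.gz

    # Step 7: joint genotyping to produce the population VCF
    gatk GenotypeGVCFs -R ref.fa -V cohort.g.vcf.gz -O population.vcf.gz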

Outputs:

  1. One GVCF file per sample and its index
  2. Population VCF file
  3. Currently, all intermediate files (e.g., trimmed reads, unprocessed BAMs, per-region GVCFs) are also kept. This uses too much disk space, so it will likely go away soon.

Configuration:

  1. Modify units.tsv to suit your needs (see the example units.tsv below). Each column specifies:

     - sample: unique biological sample
     - unit: unique combination of biological sample, library prep, and sequencing run
     - fq1: name of the forward read file
     - fq2: name of the reverse read file (enter NaN here if reads are single-ended)
     - parhap: not actually used, yet

    Note: Avoid using the "-" character in the sample and unit fields.

  2. Place any reads listed in fq1 and fq2 that did not come from SRA in the subfolder data/reads/

  3. Modify parameters, thread usage, and names of target output files in config.yaml

  4. Snakemake will automatically spawn jobs when running on a cluster. If desired, you can change the memory and CPU requirements of each job (as well as other params) by modifying the file cluster.yaml; the params it currently specifies have worked in practice (see the cluster.yaml sketch below).

  5. Run the pipeline. On a cluster, the job can be submitted with the following command:

sbatch runSnakes.slurm

This command will run two Snakefiles in succession. The first, init_genome.snakes, downloads the reference, generates reference index files, and sets up intervals that GATK4 will operate on in parallel. The second file, Snakefile, downloads and processes sequencing reads.
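
For reference, a hypothetical units.tsv might look like the following (columns are tab-separated; sample names, accessions, and file names are made up, and NaN is used as a placeholder for the unused parhap column). The first data row describes a paired-end SRA sample, the second a single-end sample with local reads:

    sample  unit    fq1                     fq2                     parhap
    A       A_1     SRR0000001_1.fastq.gz   SRR0000001_2.fastq.gz   NaN
    B       B_1     B_local_reads.fastq.gz  NaN                     NaN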
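
The key names in cluster.yaml must match how runSnakes.slurm references them in its submission string, so the following is only a sketch of the common Snakemake cluster-config pattern; every key and value shown is an assumption, not the repository's actual contents:

    # Hypothetical cluster config: __default__ applies to every rule,
    # and a named rule entry overrides individual values.
    __default__:
        time: "12:00:00"   # assumed key: wall-clock limit per job
        mem: 8G            # assumed key: memory per job
        cpus: 4            # assumed key: CPUs per job
    align:                 # hypothetical rule name requesting more memory
        mem: 16G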
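
The batch script itself is not reproduced here, but given the description above, its core plausibly amounts to two sequential Snakemake runs along these lines (a sketch under assumptions: job counts are illustrative, and the {cluster.*} placeholders must match the keys actually defined in cluster.yaml):

    # First pass: download the reference, build index files, set up GATK intervals
    snakemake -s init_genome.snakes --jobs 10 \
        --cluster-config cluster.yaml \
        --cluster "sbatch -t {cluster.time} --mem={cluster.mem} -c {cluster.cpus}"

    # Second pass: download, trim, and align reads, then call variants
    snakemake -s Snakefile --jobs 10 \
        --cluster-config cluster.yaml \
        --cluster "sbatch -t {cluster.time} --mem={cluster.mem} -c {cluster.cpus}"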
