
RAPDB variant call pipeline

How to run the RAPDB pipeline

A tutorial for running the pipeline on the NIG SuperComputer is available (in Japanese).

Prerequisites

  • Python 3 and pip (required to install cwltool with pip install cwltool; see the example after this list)
  • cwltool
  • samtools, awk, perl (required to create a depth file for Tasuke+)
  • Docker
    To run the pipeline on the NIG SuperComputer, use Singularity instead of Docker; node.js is also required (see below).
  • 20 GB of memory
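For reference, cwltool can be installed and checked as follows (a minimal sketch; adjust to your own Python environment):

$ pip install cwltool    # installs cwltool into the current Python 3 environment
$ cwltool --version      # confirm the installation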

Prepare reference files

Execute the command below to download the reference files.

$ ./scripts/download_reference_files.sh

The genomic FASTA, annotation GTF, and protein FASTA will be downloaded into the reference directory.
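If the download succeeds, the directory should contain the files referenced in the commands below (listing shown for illustration; additional files such as indexes may also be present):

$ ls reference/
IRGSP-1.0_genome_M_C_unanchored.fa
RAP-DB_MSU_concatenated_for_snpEff.gtf
RAP-DB_MSU_concatenated_protein.fa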

How to Run

Shown below are the commands to run the pipeline from the test directory. You can invoke the pipeline from any directory by adjusting the file paths accordingly.

  • Help
    cwltool ../workflows/rapdb-pipeline.cwl -h
    
  • Run
    All output files will be written to the directory specified with --outdir. If it is not specified, files will be written to the current working directory.
    cwltool --outdir test_out ../workflows/rapdb-pipeline.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa
    
    or by specifying a job YAML file (see the sketch after this list):
    cwltool --outdir test_out ../workflows/rapdb-pipeline.cwl rapdb-pipeline.test.job.yaml 
    
  • Run using singularity (NIG-SC)
    Use --singularity. Singularity image files will be generated in the current directory, so it is recommended to always run the pipeline from the same working directory to avoid generating redundant image files. node.js is also required. See the tutorial linked above for details (in Japanese).
    cwltool --singularity --outdir test_out ../workflows/rapdb-pipeline.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa
    
  • For debugging (use --cachedir to keep cache files so the job can be resumed)
    cwltool --cachedir test_cache --outdir test_out ../workflows/rapdb-pipeline.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa 
    
  • Run without making depth data for Tasuke+
    If the depth file for Tasuke+ is not required, use rapdb-pipeline_wo_tasuke.cwl.
    cwltool --outdir test_out ../workflows/rapdb-pipeline_wo_tasuke.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa
    
    or
    cwltool --outdir test_out ../workflows/rapdb-pipeline_wo_tasuke.cwl rapdb-pipeline.test.job.yaml 
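The job YAML file referenced above is not reproduced here; a minimal sketch of such a file, assuming the workflow input ids match the command-line option names, might look like this (paths as in the commands above):

$ cat rapdb-pipeline.test.job.yaml   # hypothetical contents
fastq1: {class: File, path: read1.fq.gz}
fastq2: {class: File, path: read2.fq.gz}
outprefix: SAMDxxxxxxx
threads: 2
reference: {class: File, path: ../reference/IRGSP-1.0_genome_M_C_unanchored.fa}
ref_gtf: {class: File, path: ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf}
ref_protein: {class: File, path: ../reference/RAP-DB_MSU_concatenated_protein.fa}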
    

Options

Name                  Description                                        Mandatory / (default)
--fastq1              FASTQ file for the forward read                    YES
--fastq2              FASTQ file for the reverse read                    YES
--reference           FASTA file for the reference genome                YES
--ref_gtf             Reference annotation in GTF for snpEff             YES
--ref_protein         Reference protein FASTA for snpEff                 YES
--outprefix           Prefix for the output file                         (out)
--filter-expression   Filtering condition for GATK VariantFiltration     ("QD < 5.0 || FS > 50.0 || SOR > 3.0 || MQ < 50.0 || MQRankSum < -2.5 || ReadPosRankSum < -1.0 || ReadPosRankSum > 3.5")
--threads             Number of threads for parallel processing          (1)

Reference files (genomic FASTA, protein FASTA, GTF) can be retrieved using download_reference_files.sh.
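For example, --filter-expression can be overridden on the command line. The condition string below is illustrative only; when the option is omitted, the default shown in the table is used.

$ cwltool --outdir test_out ../workflows/rapdb-pipeline.cwl \
    --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 \
    --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa \
    --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf \
    --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa \
    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0"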

Workflow

See Analysis workflow for detection of genome-wide variations in TASUKE+ of RAP-DB
Files in brackets ([ ]) will be written to the result directory.

  1. Read preprocessing (read_preprocessing.cwl)
    1.1 FASTQ stats for raw reads (seqkit stats) [stats report]
    1.2 Quality check for raw reads (FastQC) [Report HTML]
    1.3 Adapter trimming and read QC (Trimmomatic) [summary]
    1.4 FASTQ stats for preprocessed reads (seqkit stats) [stats report]
    1.5 Read quality check (FastQC) [Report HTML]
  2. Read mapping and BAM conversion (fastq2bam.cwl)
    2.1 Read mapping (BWA)
    2.2 Convert SAM to BAM (samtools)
    2.3 Sort BAM (samtools)
    2.4 Create unmapped BAM (Picard FastqToSam)
    2.5 Merge mapped and unmapped BAM (Picard Merge)
    2.6 Remove duplicated reads (Picard MarkDuplicates) [BAM, metrics file]
    2.7 Create BAM index (samtools index) [BAM index (.bai)]
    2.8 Statistics for de-duplicated BAM (samtools stats) [stats.txt]
    2.9 Extract unmapped reads from BAM (samtools, picard) [FASTQ]
  3. Variant calling, genotyping, filtering (bam2vcf.cwl)
    3.1 Variant calling (GATK HaplotypeCaller) [gVCF, tbi]
    3.2 Genotyping (GATK GenotypeGVCFs)
    3.3 Filtering (GATK VariantFiltration)
    Filtering condition: "QD < 5.0 || FS > 50.0 || SOR > 3.0 || MQ < 50.0 || MQRankSum < -2.5 || ReadPosRankSum < -1.0 || ReadPosRankSum > 3.5"
    3.4 Variant selection for SNP and INDEL (GATK SelectVariants) [VCF, tbi]
  4. Variant Annotation (snpeff_all.cwl)
    4.1 Build SnpEff database from reference files (SnpEff build)
    4.2 Annotate variants (SnpEff) [VCF, tbi, summary, affected genes]
  5. Make depth data for TASUKE+ (bam2tasuke.cwl) [Tasuke+ depth file, average depth]
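Each numbered step corresponds to one of the sub-workflow CWL files named above. Assuming these files sit alongside rapdb-pipeline.cwl in the workflows directory, their inputs can be inspected in the same way as the main pipeline, for example:

$ cwltool ../workflows/fastq2bam.cwl -h
$ cwltool ../workflows/bam2vcf.cwl -h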

Output

Step  File name                                    Description
1.1   {outprefix}_read-stats-raw.tsv               Stats file for raw FASTQ files
1.2   {fastq1/2}_fastqc.html                       FastQC report for raw reads
1.3   {outprefix}.trimmomatic.summary.txt          Summary of the Trimmomatic result
1.4   {outprefix}_read-stats.tsv                   Stats file for preprocessed FASTQ files
1.5   {fastq1/2}.trimmomatic-pe_fastqc.html        FastQC report for reads processed with Trimmomatic
2.6   {outprefix}.rmdup.bam                        Alignment result in BAM format, after de-duplication
2.7   {outprefix}.rmdup.bam.bai                    BAM index for the file above
2.6   {outprefix}.rmdup.metrics                    Metrics file for BAM de-duplication
2.8   {outprefix}.rmdup.bam.stats.txt              Statistics for the de-duplicated BAM (samtools depth, in-house script)
2.9   {outprefix}.unmapped-read.{r1/r2}.fastq.gz   Reads not mapped to the reference genome
3.1   variants_{outprefix}.g.vcf.gz                gVCF file generated by GATK HaplotypeCaller
3.1   variants_{outprefix}.g.vcf.gz.tbi            Tabix index for the file above
3.4   variants_{outprefix}.varonly.vcf.gz          VCF file after genotyping, filtering, and variant selection
3.4   variants_{outprefix}.varonly.vcf.gz.tbi      Tabix index for the file above
4.2   variants_{outprefix}.snpEff.vcf.gz           VCF file annotated using snpEff
4.2   variants_{outprefix}.snpEff.vcf.gz.tbi       Tabix index for the file above
4.2   snpEff_genes_{outprefix}.txt                 snpEff result for affected genes
4.2   snpEff_summary_{outprefix}.html              Summary of the snpEff result
5     {outprefix}_tasuke_depth.tsv.gz              Depth file for Tasuke+
5     {outprefix}.rmdup.bam.avg-depth.txt          Average depth of the de-duplicated BAM (samtools depth)
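After a run finishes, the main result files can be checked quickly, for example (assuming --outdir test_out and --outprefix SAMDxxxxxxx as in the commands above):

$ ls test_out/                       # list the generated files
$ zcat test_out/variants_SAMDxxxxxxx.varonly.vcf.gz | grep -v '^##' | head   # peek at the first filtered variant records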

Misc

  • Create depth file for Tasuke+
    tasuku_bamtodepth.cwl now runs locally without using Docker. Make sure that samtools is installed and available in the PATH environment variable.
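    A quick check before running this step (illustrative commands):

    $ which samtools
    $ samtools --version | head -n 1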