Fully automated RNAseq pipeline that goes from raw reads to a count matrix. Includes quality control.
This pipeline automates and parallelizes Illumina NGS processing to produce a count matrix ready for differential expression analyses. It also automates all quality control for this process. Reads can be paired end (PE) or single end (SE). It builds indexes and performs alignment using STAR. Trimming is optionally performed by Trimmomatic if specified. Read quality control is performed by FastQC. Aligned reads are aggregated to genes using featureCounts. Quality control reports are generated using MultiQC.
The pipeline itself uses BigDataScript (BDS) to organize the execution of bash and R scripts for each required task. BDS allows this pipeline to be deployed on personal machines or clusters with minimal modification; see the BDS docs. BDS handles task parallelization and queueing, including interfacing with cluster task managers.
This pipeline is an automated wrapper for incredible pre-existing software. Please check out the work involved in everything under the hood:
- Java - Required by BigDataScript
Bad news: The only way to install is through git and I've only tried it so far on Linux systems.
Good news: It self assembles.
git clone https://github.com/mattisabrat/CountMatrix
cd CountMatrix/
./Install.sh
./CountMatrix.sh -e ${Experimental_Directory}
- -e ${Experimental_Directory} : Full path to a correctly formatted experimental directory
- -n ${nThreads} : The number of threads to be used by each task within the pipeline, default is 1
- -t : Activates trimming
- -f : Activates user-supplied flags, read from flags/user_flags.config in the Experimental_Directory
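As a rough illustration of how these options combine, the sketch below assembles a call that uses four threads per task and activates trimming. The path /path/to/my_experiment is a placeholder, not a real location, and the command is only printed here rather than executed.

```shell
# Placeholder path; substitute the full path to your own experimental directory.
Experimental_Directory="/path/to/my_experiment"
# Threads used by each task (the pipeline default is 1).
nThreads=4

# Assemble the call: -e points at the experiment, -n sets threads, -t activates trimming.
Command="./CountMatrix.sh -e ${Experimental_Directory} -n ${nThreads} -t"

# Print the command for inspection instead of running it.
echo "${Command}"
```

Leaving out -n and -t gives the basic call shown above, with one thread per task and no trimming.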
Your Experimental_Directory must be correctly formatted and contain the requisite files for the pipeline to run. Regardless of pipeline mode, the Experimental_Directory must contain a folder named raw_reads, which contains a subfolder for each sample (let's call them sample_folders), with the sample name as the sample_folder name. Each of these sample_folders must contain the .fastq.gz files for that sample. The actual filenames don't really matter, since the pipeline assigns the sample name based on the sample_folder, not the .fastq.gz files, though it's always best to be consistent when naming files.
The pipeline requires the read files be in .fastq.gz or .fq.gz format.
Don't worry about it. Place the read files into sample folders in raw_reads and the pipeline will infer if you've given it PE or SE reads based on the number of .fastq.gz files. It should even be able to run if an experiment contains a mix of PE and SE samples, though I don't know why you would ever do that.
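The inference described above could look something like the following sketch. It is not the pipeline's actual code: the helper name infer_layout, the demo folder names, and the rule that one file means SE and two files mean PE are all assumptions made for illustration.

```shell
# Hypothetical helper: guess the read layout of one sample folder
# from the number of .fastq.gz files it contains (assumed rule: 1 = SE, 2 = PE).
infer_layout() {
    folder="${1%/}"          # drop a trailing slash, if any
    n_files=0
    for f in "${folder}"/*.fastq.gz; do
        # An unmatched glob stays literal, so only count paths that exist.
        if [ -e "${f}" ]; then
            n_files=$((n_files + 1))
        fi
    done
    case "${n_files}" in
        1) echo "SE" ;;
        2) echo "PE" ;;
        *) echo "unknown" ;;
    esac
}

# Demo on a throwaway directory with empty placeholder files.
Demo=$(mktemp -d)
mkdir -p "${Demo}/raw_reads/sample_A" "${Demo}/raw_reads/sample_B"
touch "${Demo}/raw_reads/sample_A/reads_1.fastq.gz" "${Demo}/raw_reads/sample_A/reads_2.fastq.gz"
touch "${Demo}/raw_reads/sample_B/reads.fastq.gz"

# Report the guessed layout for every sample folder.
for sample_folder in "${Demo}"/raw_reads/*/; do
    echo "$(basename "${sample_folder}"): $(infer_layout "${sample_folder}")"
done

rm -rf "${Demo}"
```

Running a check like this over raw_reads before launching the pipeline is a cheap way to catch a sample folder with a missing or extra read file.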
The Experimental_Directory must contain a folder named genome. This folder needs to contain an .fa or .fna genome file and a .gtf annotation file to build the genome indices.
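To make the layout concrete, this sketch builds an empty skeleton containing the raw_reads and genome folders described above. The names my_experiment, sample_A, and sample_B are placeholders, and a temporary directory is used so the sketch is safe to run; the read, genome, and annotation files still have to be copied in by hand.

```shell
# Placeholder location inside a temporary directory.
Experimental_Directory="$(mktemp -d)/my_experiment"

# One subfolder per sample inside raw_reads; the folder name becomes the sample name.
mkdir -p "${Experimental_Directory}/raw_reads/sample_A"
mkdir -p "${Experimental_Directory}/raw_reads/sample_B"

# The genome folder will hold the .fa/.fna genome file and the .gtf annotation file.
mkdir -p "${Experimental_Directory}/genome"

# List the folders that were created.
find "${Experimental_Directory}" -type d | sort
```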
Once the pipeline has been run, you can reuse the processed genome folder in subsequent experiments where the same genome is needed. Simply copy the whole folder into a new Experimental_Directory. When this new experiment is run through the pipeline, it will skip the indexing steps, saving a bit of time.
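From the command line, reusing a genome folder might look like the sketch below. The directory names old_experiment and new_experiment and the file placeholder_genome.fa are stand-ins created in a temporary workspace; they only mimic a processed genome folder.

```shell
# Temporary stand-ins for a finished experiment and a new one (placeholder names).
Workspace=$(mktemp -d)
Old_Experiment="${Workspace}/old_experiment"
New_Experiment="${Workspace}/new_experiment"
mkdir -p "${Old_Experiment}/genome" "${New_Experiment}/raw_reads"

# Placeholder standing in for the contents of a processed genome folder.
touch "${Old_Experiment}/genome/placeholder_genome.fa"

# Copy the whole genome folder, contents included, into the new experiment.
cp -r "${Old_Experiment}/genome" "${New_Experiment}/"

# Confirm the contents arrived.
ls "${New_Experiment}/genome"
```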
This pipeline uses overwritable defaults to manage the options used by each of its constituent software packages. The idea behind this is that a lab could set up the pipeline with their preferred settings. If these default settings are ever inappropriate for an experiment, they can be overwritten for a single pipeline run without having to touch the defaults, leaving the lab's typical pipeline settings unaltered.
The flags, whether default or user supplied, are appended to the base call for each task at the time of execution. For the Aggregate calls, which are in R, additional options are added before the closing parenthesis. The base calls can be found below.
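As a sketch of what appending to the base call means for a shell task, the snippet below joins the FastQC base call with a flag string and prints the result rather than running it. The variable names FastQC_Flags and Full_Call, the example file names, and the choice of --quiet as the flag are assumptions for illustration, not the pipeline's own code.

```shell
# Placeholder inputs; in the pipeline these are filled in for each sample.
Fastqs_Joined="sample_A_1.fastq.gz sample_A_2.fastq.gz"
FastQC_Destination="fastqc_out"

# Flags for this task, whether default or user supplied (--quiet is just an example).
FastQC_Flags="--quiet"

# Base call first, flags appended at the end.
Full_Call="fastqc ${Fastqs_Joined} --outdir ${FastQC_Destination} ${FastQC_Flags}"

# Print the assembled call instead of executing it.
echo "${Full_Call}"
```

Because the flags are simply appended, anything placed in a flag string has to be valid command-line syntax for that tool.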
Hidden in the head directory of the pipeline is a .Default_Flags folder containing a CountMatrix_Defaults.config file with the default flags for each task. Edit this file to set up your defaults; see each software's docs (linked above) for how it uses flags.
To overwrite the default flags for a pipeline run, the -f flag must be provided when invoking the pipeline and the Experimental_Directory should contain a folder named flags. This folder should contain a user_flags.config file. Any flags provided in this file will overwrite the corresponding default flags. The pipeline will continue to use the defaults for all tasks not specified in user_flags.config.
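The override behaviour can be pictured with the sketch below. It deliberately does not read or write either config file, since their format is not described here; the helper resolve_flags, the variable names, and the example flag values are hypothetical, and treating an override as a per-task replacement of the whole flag string is an assumption.

```shell
# Hypothetical helper: use the user flags for a task if any were given,
# otherwise fall back to that task's defaults.
resolve_flags() {
    default_flags="$1"
    user_flags="$2"
    if [ -n "${user_flags}" ]; then
        echo "${user_flags}"
    else
        echo "${default_flags}"
    fi
}

# Example values only; the real defaults live in CountMatrix_Defaults.config.
STAR_Default="--outSAMtype BAM Unsorted"
STAR_User="--outSAMtype BAM SortedByCoordinate"
FastQC_Default="--quiet"
FastQC_User=""   # nothing supplied for this task, so the default is kept

STAR_Flags=$(resolve_flags "${STAR_Default}" "${STAR_User}")
FastQC_Flags=$(resolve_flags "${FastQC_Default}" "${FastQC_User}")

echo "STAR flags:   ${STAR_Flags}"
echo "FastQC flags: ${FastQC_Flags}"
```

In this picture the STAR task picks up the user's setting for the run, while FastQC keeps the lab default because nothing was supplied for it.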
- STAR indexing :
  STAR --runThreadN ${nThreads} --runMode genomeGenerate --genomeDir ${Index_Destination} --genomeFastaFiles ${UnIndexed_FA} --sjdbGTFfile ${Annotation}
- STAR quantifying PE :
  STAR --genomeDir ${Index_Path} --outFileNamePrefix ${Quant_Destination} --runThreadN ${nThreads} --readFilesIn ${Read_1} ${Read_2} --readFilesCommand gunzip -c
- STAR quantifying SE :
  STAR --genomeDir ${Index_Path} --outFileNamePrefix ${Quant_Destination} --runThreadN ${nThreads} --readFilesIn ${Read_1} --readFilesCommand gunzip -c
- STAR aggregating (featureCounts) :
  featureCounts( files = Quant_File_Paths, annot.ext = opt$Annotation, isGTFAnnotationFile = TRUE, nthreads = opt$nThreads, isPairedEnd = Is_Paired,)
- Trimmomatic PE :
  java -jar ${Jar} PE -threads ${nThreads} ${Fastqs_Joined} -baseout ${Trim_Output_File}
- Trimmomatic SE :
  java -jar ${Jar} SE -threads ${nThreads} ${Fastqs_Joined} ${Trim_Output_File}
- MultiQC :
  multiqc ${Experiment} -f -o ${Experiment} -n Quality_Control
- FastQC :
  fastqc ${Fastqs_Joined} --outdir ${FastQC_Destination}
- R-3.6.0
- Python-3.5.5
- STAR-2.7.1a
- Trimmomatic-0.36
- tximport-1.12.3
- Rsubread-1.34.4
This pipeline was written and is maintained by Matt Davenport (mdavenport@rockefeller.edu).