Skip to content

lzy604/RNA-seq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 

Repository files navigation

RNA-SEQ PIPELINE

This pipeline performs the following tasks:

  • Create an isolated environment for RNA-seq analysis
  • Intallation and Reference Genomes
  • Quality control on FastQ files (FastQC)
  • Adapter Trim(Trim Galore)
  • Alignment(STAR)
  • QC reports for RNA-seq(RSeQC)
  • Quantifying gene expression(HTSeq-count)
  • RPKMs and TPM (using edgeR or StringTie)
  • Differential expression analysis for standard designs (DESeq2)
  • Visualization(R)
  • Running the pipeline

RNA-seq Data Standards

  1. A bulk RNA-seq experiment is an RNA-seq assay in which the average library insert size is 200 base pairs.
  2. Experiments should have two or more replicates. Assays performed using EN-TEx samples may be exempted due to limited availability of experimental material.
  3. Each replicate should have 30 million aligned reads, although older projects aimed for 20 million reads. Best practices for ENCODE2 RNA-seq experiments have been outlined here.
  4. Replicate concordance: the gene level quantification should have a Spearman correlation of >0.9 between isogenic replicates and >0.8 between anisogenic replicates (i.e. replicates from different donors).

System requirements

  • Linux/Unix
  • Python
  • R

Installation

We uses the Miniconda3 package management system to harmonize all of the software packages. Use the following commands to install Minicoda3:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Create an isolated environment for RNA-seq

conda create -n rna-seq
conda activate rna-seq

Install tools

Tools needed for this analysis are: R, samtools, FastQC, Trim Galore, STAR, RSeQC, stringtie, gffcompare, htseq-count.

conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c r r 
conda install -c bioconda samtools
conda install -c bioconda fastqc
conda install trim-galore
conda install STAR
conda install -c bioconda rseqc 
conda install -c bioconda htseq
conda install -c bioconda bioconductor-deseq2
conda install -c bioconda stringtie 

Genome files

Obtain a reference genome from Ensembl, iGenomes, NCBI or UCSC. In this example analysis we will use the mouse mm10 version of the genome from UCSC.

mkdir anno
cd anno
mkdir mm10
cd mm10
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz

Generate genome indexes files for STAR mapping

tar zxvf Mus_musculus_UCSC_mm10.tar.gz
STAR --runThreadN 30 --runMode genomeGenerate --genomeDir star_index_mm10 --genomeFastaFiles /Sequence/WholeGenomeFasta/genome.fa --sjdbGTFfile /Annotation/Archives/archive-current/Genes/genes.gtf 

Download the public data

for ((i = 12;i<=15;i++)); #
do
fastq-dump --split-3 -O data/ SRR0020$i.sra.sra 
done

Quality control on FastQ files

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.

run FastQC interactively or using ht CLI, which offers the following options:

fastqc seqfile1 seqfile2 .. seqfileN

Adapter Trim[OPTIONAL]

Use trim_glore to trim sequence adapter from the read FASTQ files.

trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 \
            --paired $dir/cmp/01raw_data/$fq1 $dir/cmp/01raw_data/$fq2  \
            --gzip -o $input_data

Alignment

Perform alignments with STAR to the genome and transcriptome.

STAR --runThreadN 10 --genomeDir ~/anno/mm10/ --readFilesCommand zcat --readFilesIn R1.fastq.gz R2.fastq.gz --outFileNamePrefix samplename   --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 5

Visualization the mapping ratio by R

Rscript ~rcode/mapping.R

QC reports for RNA-seq

RSeQC package comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure.

‘geneBody_coverage.py’ scales all transcripts to 100 nt and calculates the number of reads covering each nucleotide position. Finally, it generates a plot illustrating the coverage profile along the gene body.

samtools index input.sorted.bam 
geneBody_coverage.py -r hg19.housekeeping.bed -i test1.bam,test2.bam,test3.bam  -o output

'clipping_profile.py' calculate the distributions of clipped nucleotides across reads.This program is used to estimate clipping profile of RNA-seq reads from BAM or SAM file. Note that to use this funciton, CIGAR strings within SAM/BAM file should have ‘S’ operation (This means your reads aligner should support clipped mapping).

clipping_profile.py -i test1.bam -s "PE" -o out

Quantifying gene expression

Counting reads in features with htseq-count

htseq-count -f bam -r name -s no -a 10 -t exon -i gene_id -m intersection-nonempty yourfile_name.bam ~/reference/hisat2_reference/Homo_sapiens.GRCh38.86.chr_patch_hapl_scaff.gtf > counts.txt

Calculate RPKM and TPM by R

Rscript ~rcode/counts2rpkm_tpm.R

Differential expression analysis

DE analysis is done using the DESeq2 Bioconductor package. It takes the merged raw read counts (from HTseq-count) as an input:

Rscript ~rcode/deseq2.R

Running the pepline using the bash file

One command for running

bash rna_seq.bash

About

RNA-seq analysis

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published