Whilst IGV is used to initial glance at coverage, these methods normalise & objectively quantify gene expression. To compare the expression rates of individual genes between samples you need to quantify the number of reads per gene. We first use RNA seq to determine the abundance of mRNA (cDNA) fragments, rather than the composition of the fragments.
Counting reads requires us to know the number of reads originating from a transcript, however counting reads is frequently ambiguous as to their transcript of origin because of multi-isoform genes. Precise measurement of transcript counts is not currently possible. Count based methods (DESeq2 / edgeR) study at gene level - they first requre counting reads across genome features using HTSeq or featureCounts.
The alternative approach is to use a transcript-level count rather than gene level. This infers an estimate of read number from each transcript.
- Overlapping bundles of reads are assembled
- Reads are connected in an overlap graph that models variability in read count for each gene across replicates (if the reads are compatible then these are assumed to come from the same genetic locus - calculates mutually incompatible reads where splice sites are mid exon or intron).
- Transcript isoforms are inferred from the minimum paths required to cover the graph
- Abundance of each gene isoform is estimated with a maximum likelihood probabilstic model. The variance estimates of reads count for each transcript are statistically tested to report DE genes.
Cufflinks is now outdated and superseded by StringTie
Raw refers to non-normalised read counts. Simply assign reads to a defined set of genes/transcripts & determine raw counts. Does not have a step to calculate assembly and different isoforms.
Use a BAM file & GTF file and assign each read as best as possible to a known gene to calculate counts. Then run statistical methods on these counts for differental expression.
Tools: HTSeq featureCounts QoRTs (Nobby uses this as it also does QC simultaneously) STAR (also does counts if you give it GTF file)
ml HTSeq ml Pysam
https://htseq.readthedocs.io/en/release_0.11.1/
for SAMPLE in $BAM
do
SRRID=`echo $SAMPLE | grep -E -o 'SRR[0-9]+'`
sbatch -N 1 -c 8 --mem=24GB --wrap="htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id $SAMPLE $GTF > ${OUT}_${SRRID}.tab"
done
### NEW RUNING ALL SAMPLES AT ONCE FROM ALL TIME POINTS
# create one htseq/ output folder in expression/
mkdir -p /home/camp/ziffo/working/oliver/projects/airals/expression/htseq/
#set GTF annoation
GTF=/home/camp/ziffo/working/oliver/genomes/annotation/gencode.v28.primary_assembly.annotation.gtf
# set timepoint folders
TIMEPOINT=/home/camp/ziffo/working/oliver/projects/airals/alignment/D*_samples
#set htseq-count output file aligned
## run multiple alignments using in for loop
for SAMPLE in $TIMEPOINT;
do
DAY=`echo $SAMPLE | grep -E -o 'D[0-9]+_samples'`
#set the SAM file to read in (using STAR output)
SAM=$SAMPLE/*.sam
echo "Running timepoint $SAMPLE"
for REPLICATE in $SAM
do
#set htseq output file
OUT=/home/camp/ziffo/working/oliver/projects/airals/expression/htseq
SRRID=`echo $REPLICATE | grep -E -o 'SRR[0-9]+'`
sbatch -N 1 -c 8 --mem 40G --wrap="htseq-count -s reverse $REPLICATE $GTF > $OUT/$DAY_${SRRID}.tab"
echo "Running sample $REPLICATE"
done
done
join SRR5483788.tsv SRR5483789.tsv | join - SRR5483790.tsv | join - SRR5483794.tsv | join - SRR5483795.tsv | join - SRR5483796.tsv > htseq_counts_table_temp.tsv
echo "GeneID SRR5483788 SRR5483789 SRR5483790 SRR5483794 SRR5483795 SRR5483796" > header.txt
cat header.txt htseq_counts_table_temp.tsv | grep -v "__" | perl -ne 'chomp $_; $_ =~ s/\s+/\t/g; print "$_\n"' > htseq_counts_table.tsv
rm -f htseq_counts_table_temp.tsv header.txt
head htseq_counts_table.tsv
Raphaelle runs htseq on each sample from each time point separately to create one output table per sample. The output of these are then loaded into the R script SVD_analysis.Rmd script where we run the Differential Gene Expression & Visualisation steps. This logs read counts, filters out low read counts & normalises read counts. She then runs heirachical clustering, PCA & SVD
This merged output can then be analysed for differential expression using edgeR or DESeq2 (see next chapter)
ml QoRTs
see Chapter 10.2 on Merging Count Data in http://hartleys.github.io/QoRTs/doc/QoRTs-vignette.pdf see Chapter 7 (page 12) of walkthrough example
QoRTs package is composed of 2 parts: java jar-file (for data processing) & R package (for generating tables, figures, plots).
QoRTs already produced the Counts files during the initial processing QC run and individual files formatted for DESeq. So skip straight to next chapter for DE analysis.
ml Subread
- Count reads (estimate abundance) per sample:
# Create output folder
mkdir -p featureCounts
#set gene coordinates
GTF=/home/camp/ziffo/working/oliver/genomes/annotation/gencode.v28.primary_assembly.annotation.gtf
#set BAM input file
BAM=/home/camp/ziffo/working/oliver/projects/airals/alignment/D7_samples/trimmed_filtered_depleted/SRR54837*_Aligned.sortedByCoord.out.bam
#set Counts.txt output file
OUT=/home/camp/ziffo/working/oliver/projects/airals/featureCounts/D7_samples/featureCounts/
#run featureCounts on each BAM file separately
for SAMPLE in $BAM
do
SRRID=`echo $SAMPLE | grep -E -o 'SRR[0-9]+'`
sbatch -N 1 -c 8 --mem=24GB --wrap="featureCounts -a $GTF -g gene_name -o ${OUT}_${SRRID} $SAMPLE"
done
#run featureCounts on all BAM files together. By default it uses gene_id in the GTF - override with gene_name
# Using * wildcard list all BAM files into 1 file. The output file contains a column for each sample.
featureCounts -a $GTF -g gene_name -o $OUT $BAM
For each featureCounts command run there are 2 output files:
- featureCounts_results.txt has actual read counts per gene - tab delimited file where the first six columns contain feature specific information and the rest of the columns contain the read counts that overlap with that feature.
- featureCounts_results.txt.sumary gives quick overview of how many reads were assigned to genes.
- Find sequences with highest abundance
To find sequences with most hits sort by column 7: cat counts.txt | sort -rn -k 7 | head
Output table is in columns as:
Geneid Chr Start End Strand Length Hits
Run MultiQC in the counts directory to generate a report of expression using MultiQC from different Quality Control tools eg featureCounts, QoRTs. MultiQC aggregates results from bioinformatic analyses across samples into a single report
MultiQC searches a folder for analyses & compiles a HTLM report that summarises the output from multiple bioinformatic tools
Go to the counts
folder with the aligned QC files in and run: multiqc .
Interpret the HTML report.
ml StringTie
StringTie manual StringTie is principally designed for Alignment by HISAT2. In STAR when mapping you need to specify --outSAMstrandField parameter
mkdir -p stringtie
GTF=/home/camp/ziffo/working/oliver/genomes/annotation/gencode.v28.primary_assembly.annotation.gtf
GTF_OUT=/home/camp/ziffo/working/oliver/projects/airals/expression/D7_samples/stringtie/
TSV_OUT=/home/camp/ziffo/working/oliver/projects/airals/expression/D7_samples/stringtie/
BAM=/home/camp/ziffo/working/oliver/projects/airals/alignment/D7_samples/trimmed_filtered_depleted/SRR54837*_Aligned.sortedByCoord.out.bam
for SAMPLE in $BAM
do
SRRID=`echo $SAMPLE | grep -E -o 'SRR[0-9]+'`
sbatch -N 1 -c 8 --mem=24GB --wrap="stringtie -p 8 -G $GTF -e -B -o $GTF_OUT_transcripts_$SRRID -A $TSV_OUT_gene_abundances_$SRRID $BAM"
done
# '-p 8' tells Stringtie to use eight CPUs
# '-G <known transcripts file>' reference annotation to use for guiding the assembly process (GTF/GFF3)
# '-e' only estimate the abundance of given reference transcripts (requires -G)
# '-B' enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended)
# '-o' output path/file name for the assembled transcripts GTF (default: stdout)
# '-A' output path/file name for gene abundance estimates
View transcript records & expression values (FPKM):
awk '{if ($3=="transcript") print}' $GTF_OUT_${SAMPLE} | cut -f 1,4,9 | less
Press 'q' to exit the 'less' display
Create a tidy expression matrix file for StringTie results:
# cd to stringtie output directory
wget https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial/master/scripts/stringtie_expression_matrix.pl
chmod +x stringtie_expression_matrix.pl
./stringtie_expression_matrix.pl --expression_metric=TPM --result_dirs='SRR5483788,SRR5483789, SRR5483790,SRR5483794,SRR5483795,SRR5483796' --transcript_matrix_file=transcript_tpm_all_samples.tsv --gene_matrix_file=gene_tpm_all_samples.tsv
./stringtie_expression_matrix.pl --expression_metric=FPKM --result_dirs='SRR5483788,SRR5483789, SRR5483790,SRR5483794,SRR5483795,SRR5483796' --transcript_matrix_file=transcript_fpkm_all_samples.tsv --gene_matrix_file=gene_fpkm_all_samples.tsv
./stringtie_expression_matrix.pl --expression_metric=Coverage --result_dirs='SRR5483788,SRR5483789, SRR5483790,SRR5483794,SRR5483795,SRR5483796' --transcript_matrix_file=transcript_coverage_all_samples.tsv --gene_matrix_file=gene_coverage_all_samples.tsv
head transcript_tpm_all_samples.tsv gene_tpm_all_samples.tsv
- Spike-ins provide a known concentrations of transcripts that we can compare to the experimental samples
- A common spike in product is ERCC ExFold RNA spike-in control mix which is added to the experimental samples
- Fold change in transcript expression between 2 samples tells you about the difference between the 2; not about whether they are highly or lowly expressed.
- At lower transcript expression levels accuracy in determining fold change deteriorates.
Using read counts from FPKM or raw counts, plot the linearity of the ERCC spike-in reads counts versus the known concentration of the ERCC spike-in Mix.
# 1. Download file with concentrations & fold change differences for the ERCC spike-in reagent
wget http://genomedata.org/rnaseq-tutorial/ERCC_Controls_Analysis.txt
cat ERCC_Controls_Analysis.txt
# 2. Use perl script to organise the ERCC expected values & our obersved counts
wget https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial/master/scripts/Tutorial_ERCC_expression.pl
chmod +x Tutorial_ERCC_expression.pl
./Tutorial_ERCC_expression.pl
cat $RNA_HOME/expression/htseq_counts/ercc_read_counts.tsv
# 3. Run R script to produce x-y scatter plot comparing expected and observed values
wget https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial/master/scripts/Tutorial_ERCC_expression.R
chmod +x Tutorial_ERCC_expression.R
./Tutorial_ERCC_expression.R ercc_read_counts.tsv
To view the resulting figure, navigate to the below URL replacing YOUR_IP_ADDRESS with your IP address:
- http://YOUR_IP_ADDRESS/rnaseq/expression/htseq_counts/Tutorial_ERCC_expression.pdf