DEE2 quality metrics
The DEE2 pipeline gathers a large number of metrics that are critical to determining the validity of transcriptome data. Remember that DEE2 is organised by SRA runs. Here, I will describe how these metrics are generated and what they mean. Further down, I discuss the logic behind classification of pass, warn and fail datasets.
Prior to a full analysis, a sample of 4000 reads is obtained from the .sra file to perform some checks. The output is checked and the .sra file is classified as single (SE) or paired end (PE).
Fastq files come in a variety of subtle flavours. These vary in the way that quality scores are defined by ASCII characters. The standard in the field is Sanger/Illumina1.9 but there may be a small number of files with different quality encodings.
After these initial checks, the entire fastq data are unpacked and FastQC is run. We can see the minimum, median and maximum read length of read 1 and read 2 (if paired end). These metrics come under the following headings:
Is the total number of reads or "spots" in the fastq file before filtering.
Is the number of reads that pass quality control filtering. This includes 3' quality trimming with Skewer, adapter clipping (if an adapter was detected with a frequency of >2.5%), progressive 5' trimming (this is described in the manuscript) and removal of reads shorted than 18 bp.
The percentage of reads that pass quality control filtering.
Before performing a full alignment, a sample of 10,000 reads are selected and mapped with STAR. If a dataset is paired end, then read 1 and read 2 are mapped independently (PE_Read1_StarMapRateTest and PE_Read2_StarMapRateTest). This value ranges from 0 to 100 and is an integer.
If the StarMapRateTest number is below 40 for one read, then that read is dropped from the pair. If the value for PE_Read1_Excluded is TRUE, then read 1 has been excluded. if PE_Read2_Excluded is TRUE then read 2 has been excluded.
After performing all of the checks, the mapping is performed: SE for single end and PE for paired end. In the case that SequenceFormat=PE and MappingFormat=SE, one read from the pair has been omitted due to low mapping rate.
This is the number of uniquely mapped reads as determined by STAR.
After mapping, STAR produces three sets of expression counts. These are based on assumptions of the strand orientation of the library. The pipeline calculates the strand bias and classifies the strand information if there is a bias >5.
Star read assignment metrics
STAR classifies reads in a number of ways that can be useful to know:
- STAR_UnmappedReads: Number of reads that didn't map to the reference genome
- STAR_MultiMappedReads: Number of reads that mapped to multiple locations in the genome
- STAR_NoFeatureReads: Number of reads that mapped uniquely to a locus without an annotated gene
- STAR_AmbiguousReads: Number of reads that were not assigned because it only partially mapped to an exon
- STAR_AssignedReads: Number of reads that are uniquely mapped to a gene and included in the gene expression count data. This information is important because the larger the number of assigned reads, the more accurate the quantificaiton will be.
- STAR_UniqMapRate: Proportion of reads that were uniquely mapped to the genome. This is a key quality indicator as a high proportion of unmapped reads could be indicative of a high carryover of rRNA or a sample swap.
- STAR_AssignRate: Proportion of reads that are assigned to genes. This is another important metric as a low value suggests contamination with gDNA or a high rate of intron mapping.
This is the k-mer parameter used by Kallisto to map reads to transcripts. Normally this is 31 by default, but when the read length is shorter, the kmer used might be smaller.
This is the number of reads mapped & assigned by Kallisto. This information is important because the larger the number of assigned reads, the more accurate the quantificaiton will be.
This is the proportion of reads that were assigned to transcripts. A low value here could be a sign that there are issues with the library preparation.
This is the Pearson correlation between the gene expression profile of the dataset as compared to an average profile of all other "PASS" datasets. The rationale behind this is that datasets that are too different can be easily filtered out as they might not be standard RNA-seq but some other derivative assay. As the "average" profile will vary slightly over time with the inclusion of more datasets, the correlation coefficient stated here might also vary slightly. This is why it is reported to 2 significant figures only.
Using the QC metrics outlined above, we have classified the datasets as “pass”, “warn” and “fail” according to some simple rules as summarised in the table below. Each rule has a numeric code.
|Metric||Fail threshold||Warn threshold||Code|
|NumReadsQcPass||< 50 reads per gene||< 500 reads per gene||1|
|QcPassRate||< 60%||< 80%||2|
|STAR_UniqMapRate||< 50%||< 70%||3|
|STAR_AssignRate||< 40%||< 60%||4|
|STAR_AssignedReads||< 50 reads per gene||< 500 reads per gene||5|
|Kallisto_MapRate||< 40%||< 60%||6|
|Kallisto_MappedReads||< 50 reads per gene||< 500 reads per gene||7|
Datasets with metrics above these thresholds are given a "pass" classification.