Roberto Vera Alvarez edited this page Dec 4, 2018 · 17 revisions

TPMCalculator quantifies mRNA abundance directly from alignments by parsing BAM files. Its inputs are the same GTF annotation used to generate the alignments and one or more BAM files containing either single-end or paired-end sequencing reads. For each sample, TPMCalculator writes four output files reporting the TPM values and raw read counts for genes, transcripts, exons and introns, respectively.
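For readers unfamiliar with the unit, the sketch below shows the standard definition of TPM (transcripts per million): each raw count is divided by the feature length, and the resulting rates are rescaled so that they sum to one million. This is only an illustration of the general formula; the function name `tpm` and the example numbers are hypothetical, and this page does not state how TPMCalculator computes feature lengths internally.

```python
def tpm(counts, lengths):
    """Standard TPM: length-normalise the counts, then scale to sum to 1e6.

    counts  -- raw read counts per feature
    lengths -- feature lengths in bases (same order as counts)
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same size")
    # Reads per base for each feature.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        # No reads at all: every feature gets a TPM of zero.
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


# Hypothetical counts and lengths for three features.
example_counts = [100, 200, 300]
example_lengths = [1000, 2000, 1000]
example_tpm = tpm(example_counts, example_lengths)
```

The first two features receive the same TPM even though their raw counts differ, because the second feature is twice as long.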

The model describing the genomic features of a gene is built from the GTF provided by the user. TPMCalculator applies two transformations to the genomic coordinates, generating gene regions that include the exons and “pure” intron regions, as shown in Figure S1. The first transformation merges the overlapping exons of all alternatively spliced forms of a gene, producing a single gene model with unique exons and introns that includes the sequence of all exonic regions. The second transformation creates a list of pure intron regions that replaces the introns generated by the first transformation. Note that only the intron regions are modified, so that they are not overlapped by exons of other genes. Reporting TPM values for these unique introns allows further identification of alternative splicing events such as intron retention. Additionally, a set of non-overlapping gene features (exons and introns) is generated and used for TPM calculation.
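The two transformations can be illustrated with plain interval arithmetic. The following is a minimal sketch, not TPMCalculator's actual implementation: the function names, the coordinates, and the choice to merge directly adjacent exons are all assumptions made for the example. Intervals are 1-based and inclusive, as in GTF.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gaps(merged):
    """Return the gaps between consecutive merged exons (the introns)."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(merged, merged[1:])]


def subtract(regions, blockers):
    """Remove from each region every base covered by any blocker."""
    blockers = merge_intervals(blockers)
    result = []
    for start, end in regions:
        cursor = start
        for b_start, b_end in blockers:
            if b_end < cursor:
                continue  # blocker lies entirely before the remaining region
            if b_start > end:
                break     # blocker (and all later ones) lies after the region
            if b_start > cursor:
                result.append((cursor, b_start - 1))
            cursor = max(cursor, b_end + 1)
            if cursor > end:
                break
        if cursor <= end:
            result.append((cursor, end))
    return result


# Hypothetical gene A with two transcripts, and an exon of a second gene B.
gene_a_exons = [(100, 200), (400, 500),               # transcript 1
                (150, 250), (400, 450), (600, 700)]   # transcript 2
gene_b_exons = [(300, 350)]

# First transformation: one gene model with unique exons and introns.
union_exons = merge_intervals(gene_a_exons)   # [(100, 250), (400, 500), (600, 700)]
introns = gaps(union_exons)                   # [(251, 399), (501, 599)]

# Second transformation: keep only intron bases not covered by other genes' exons.
pure_introns = subtract(introns, gene_b_exons)  # [(251, 299), (351, 399), (501, 599)]
```

In this example the first intron of gene A is split in two because the exon of gene B falls inside it, while the exons of gene A are left unchanged.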

Gene model

Discussion

TPMCalculator results were compared with three popular, freely available tools used by the scientific community to obtain raw read counts or calculate FPKM values from aligned RNA-seq samples in BAM format. The comparison shows a high correlation between TPMCalculator and all three tools. However, a few samples had correlation coefficients below 0.2 when HTSeq and featureCounts were compared with TPMCalculator ExonReads at the gene level. We summarized the correlation coefficients for those samples in Table S3. This table uses a two-color scale to make visual comparison easy: white for a correlation coefficient of zero and green for a correlation coefficient of one.

Table S3 shows that, for MAPQ values of 1, 3 and 255, every sample with a low correlation between TPMCalculator and either HTSeq or featureCounts has a high correlation with the other tool and with the FPKM values. For a MAPQ of 0, the same tendency holds for the HTSeq and featureCounts results, but the correlation with FPKM is not as high as in the previously discussed cases. We analyzed the low-correlation samples with the RSeQC quality control tools as described by Qi et al. (Qi, et al., 2017). We found no differences between those samples and the rest of the dataset that could explain the low correlation, and we were not able to identify its cause.

As discussed above, TPMCalculator results correlate highly with those obtained from well-tested tools for the quantification of mRNA abundance. However, TPMCalculator reports, in a single analysis, raw read counts and TPM values for genes, transcripts, exons and introns. None of the currently available tools is able to generate a complete set of data for all genomic features as TPMCalculator does. Additionally, TPMCalculator reduces the compute time and resource requirements of RNA-Seq pipelines by eliminating multiple steps.
TPMCalculator processes a 7.0 GB BAM file in ~20 minutes, requiring only 4 GB of RAM.

Credits

Roberto Vera Alvarez Email: veraalva@ncbi.nlm.nih.gov

Lorinc Pongor Email: pongorlorinc@gmail.com

Leonardo Mariño-Ramírez Email: marino@ncbi.nlm.nih.gov

David Landsman Email: landsman@ncbi.nlm.nih.gov

Public Domain notice

National Center for Biotechnology Information.

This software is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.