Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Majority of FPKM values is 0 ( Stringtie-1.2.3) #61

Closed
tylim opened this issue Jul 14, 2016 · 2 comments
Closed

Majority of FPKM values is 0 ( Stringtie-1.2.3) #61

tylim opened this issue Jul 14, 2016 · 2 comments

Comments

@tylim
Copy link

tylim commented Jul 14, 2016

Hiya,

We are using Stringtie version 1.2.3 to compute the relative expressions of a few transcripts-of-interest, these are all single-exon transcripts. We found that most of the computed FPKM values are 0, even though there are reads mapping to the annotated region. This is in both the t_data.ctab and the output transcripts GTF file. The link to a folder containing all the files (incl. a subset of the bam and gff files) is here

We used the following command right after mapping the reads to the reference genome using HISAT with the --rna-strandness FR option,

stringtie -e -b $sampleName -G $annotation $bam -p 2 -o $output

The alignment suggests that there are reads mapping to our region of interest, so does the e_data.ctab file.

e_id chr strand start end rcount ucount mrcount cov cov_sd mcov mcov_sd
4 Supercontig_1.1 - 2344806 2344985 7 7 7.00 2.4944 2.0235 2.4944 2.0235

t_id chr strand start end t_name num_exons length gene_id gene_name cov FPKM
4 Supercontig_1.1 - 2344806 2344985 orf_29038 1 180 (null) . 0.000000 0.000000

Running stringtie in default, without the annotation file produces the following result.

Supercontig_1.1 StringTie transcript 2344602 2345152 1000 - . gene_id "STRG.146"; transcript_id "STRG.146.1"; cov "3.232305"; FPKM "14.725345"; TPM "24.361681";

Please can you have a look?

Cheers and thanks in advance,
Joanne

@gpertea
Copy link
Owner

gpertea commented Jul 20, 2016

Thank you for the well documented report. We took a look at this case and we acknowledge that it exposed some issues that we are going to correct (like the missing gene_id causing some ugly "(null)" entries and "inf" TPM values there -- these are going to be fixed soon).
However the main issue of FPKMs being 0 seems to be caused by a structural mismatch between the "reference annotation" used and the read alignments overlapping them. As you have noted, the "raw counts" reported in the .ctab files seem to indicate a great deal of overlaps between read alignments with some of the reference transcripts, but those are just based on a very permissive counting of any overlap larger than 5bp between a read alignment and a reference exon. But when StringTie goes into actually assessing coverage and calculating FPKM values etc. with the -e option, on a given set of transcripts, it will use only structurally sound overlaps between such read alignments and the reference transcripts, so many of those "raw" overlaps reported in the .ctab files may be discarded. This seems to be the case here for many of those reference single-exon transcripts. For example, since the reference annotation was all short single-exon transcripts, all the spliced alignments were automatically discarded and so were the alignments extending significantly beyond the boundary of the reference transcripts (exons). Since all the alignments in this data set had a transcription strand assigned (XS tag), even for unspliced alignments, all those alignments on the opposite strand were also discarded, of course.
I will take an even closer look at all those overlaps (it still seems weird that in most cases, all of those alignments were discarded) but so far the glaring structural difference between the given reference annotation and the actual read alignments for that sample does seem to explain, at least partially, why when using -e option of StringTie (which is, not try to assemble the read alignments but just carefully "map" them, and structurally assign them to the given transcript), a lot of the overlaps might be discarded.

I guess this could be considered a way of warning you that your reference annotation might be wrong (or incomplete) here -- which is in fact the main problem with the -e option, as StringTie is forced to take the given transcripts for granted as the valid, true set of transcripts being sampled by the RNA-Seq experiment; but if that's wrong... then the GIGO principle might kick in :).

On the other hand if you allow StringTie to look at the "evidence" (the actual read alignments) and assemble the transcripts by itself, you'll see that there will be some transcripts generated there which do overlap the reference annotation significantly but it will have different start-end coordinates - however the cov and FPKM estimates will be non-zero for those assembled transcripts.

@tylim
Copy link
Author

tylim commented Jul 22, 2016

@gpertea , thanks for the thorough explanation. It's very enlightening.

Since all the alignments in this data set had a transcription strand assigned (XS tag), even for unspliced alignments, all those alignments on the opposite strand were also discarded, of course.

I made a mistake with the strandness option in HISAT2, this is probably one of the causes for the alignments to be discarded -- The samples were prepared with the Illumina TruSeq stranded mRNA HT library preparation kit, and I recently learnt that the option should be switched to --library-type fr-firststrand in Tophat and --rna-strandness RF in HISAT2 instead of FR. On a related note, say if the alignment was set to the default unstranded option in HISAT2, how does Stringtie assemble the transcripts without the XS tag information?

On the other hand if you allow StringTie to look at the "evidence" (the actual read alignments) and assemble the transcripts by itself, you'll see that there will be some transcripts generated there which do overlap the reference annotation significantly but it will have different start-end coordinates

As we were only interested in quantifying the expression of those transcripts-of-interest with Stringtie, this got missed out in the initial plan! We'll try it out and make a comparison with the current annotation. Thanks for the suggestion.

Cheers and have a nice weekend,
Joanne

@gpertea gpertea closed this as completed Sep 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants