Enable quantification using StringTie AND a custom Ensembl genome #1074

mplescher · 2023-08-31T15:05:31Z

Description of feature

Hi.

I am using the nf-core rnaseq pipeline, version 3.12.0.
Since you pointed out that the transcriptome and GTF files in iGenomes are vastly out of date here, I am using a custom Ensembl genome, version 110.
I tried out these two:

I would like to use StringTie for transcript assembly and quantification, but ran into this bug. It seems that all gene entries in the Ensembl GTF lack the transcript_id attribute that StringTie requires. Since StringTie only needs the transcript annotation anyway, simply removing all gene lines from the GTF file solves the problem.
For example:

 awk -F'\t' '($3!="gene")' my_genome.gtf > my_genome.no_genes.gtf

Would it be possible to detect Ensembl genomes automatically and (temporarily) remove the gene lines if necessary?
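To make the request concrete, here is a minimal Python sketch of such a conditional filter. Unlike the awk one-liner, it drops a gene line only when that line has no transcript_id attribute. The function name and the sample records are hypothetical, and this is not the pipeline's actual implementation.

```python
import re

# Matches a transcript_id attribute with any quoted value (assumed GTF attribute syntax).
TRANSCRIPT_ID = re.compile(r'\btranscript_id "[^"]*"')


def drop_gene_lines_without_transcript_id(lines):
    """Yield GTF lines, skipping 'gene' features that carry no transcript_id.

    Hypothetical helper for illustration only.
    """
    for line in lines:
        if line.startswith("#"):
            # Header and comment lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        is_gene = len(fields) >= 9 and fields[2] == "gene"
        if is_gene and not TRANSCRIPT_ID.search(fields[8]):
            continue  # gene feature without transcript_id: drop it
        yield line


# Made-up records in the usual nine-column GTF layout.
sample = [
    "#!genome-build example\n",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\tensembl\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
kept = list(drop_gene_lines_without_transcript_id(sample))
```

Here the header and the transcript line are kept and the gene line is dropped. A GTF whose gene lines already carry a transcript_id would pass through unchanged.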
Many thanks.

pinin4fjords · 2023-11-07T18:54:28Z

@mplescher could you please provide a reproducible example of this? I've actually written a solution up quickly, but wanted to replicate your issue before I ask to merge, and find that I can't.

For example, in the test profile, I intercepted the stringtie process, and changed the first entry to a gene without a transcript_id attribute, and that seemed to be fine.

MatthiasZepper · 2023-11-08T14:48:57Z

Probably related to/similar to #1102?

For an example GTF file see this thread in the rnaseq-Slack channel and this gffread issue for some background information.

pinin4fjords · 2023-11-09T11:37:16Z

Well, this is a stringtie error, so possibly not directly related, but yes, maybe it's empty transcript_id attributes rather than missing ones which are the issue.

MatthiasZepper · 2023-11-09T16:31:39Z

Yes, that is what I was thinking.

Multiple tools struggle with empty strings in the GTF attributes, because they violate the original format specification. Yet NCBI and Ensembl now release GTF files of this kind.

This is why I linked the gffread issue above, where Geo Pertea states:

Apparently somebody had the brilliant idea to allow stuff like this (empty string values!) in GTF:
This is the kind of abomination that a zombie format like GTF becomes as it tries to "evolve" and be something that was never meant to be. GTF was really meant only for annotating transcripts, nothing else. For any other features -- use GFF3, that's why it's there!

Let's be fair, the vast majority of bioinformatics pipelines care only about transcript/exon/CDS annotation. I assumed/hoped that's why GTF was allowed to survive alongside GFF3 for so long -- as a convenient shortcut for the annotation users to get only the transcript annotation whenever that was all they needed, instead of the GFF3 which can have many other genomic features that might not be needed for those annotation users.

This is an easy fix for me but I really dislike encouraging the survival of bad ideas (which this GTF2.2 perversion clearly is), let's allow "natural selection" of such ideas take its course instead (hopefully).

Based on this, I do not expect a fix in gffread, and possibly not in other tools either.

Hence, I think we must remove those lines, or at least include a check, because everyone trying to use an up-to-date reference transcriptome in GTF format will likely download an invalid file.
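A check along these lines could look like the following Python sketch. It separates rows whose transcript_id is an empty string from the rest, so the caller can either warn about them or drop them. The function name and sample rows are hypothetical, and the regular expression assumes the usual `key "value";` attribute syntax. It illustrates the idea and is not the code that went into the pipeline.

```python
import re

# An explicitly empty transcript_id value, e.g. transcript_id "";
EMPTY_TRANSCRIPT_ID = re.compile(r'\btranscript_id ""')


def split_on_empty_transcript_id(lines):
    """Return (kept, rejected); rejected rows have an empty transcript_id.

    Hypothetical helper for illustration only.
    """
    kept, rejected = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Column 9 holds the attributes; shorter lines are treated as having none.
        attributes = fields[8] if len(fields) >= 9 else ""
        if not line.startswith("#") and EMPTY_TRANSCRIPT_ID.search(attributes):
            rejected.append(line)
        else:
            kept.append(line)
    return kept, rejected


# Made-up rows: one with an empty transcript_id, one with a real one.
rows = [
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "";\n',
    '1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
valid_rows, invalid_rows = split_on_empty_transcript_id(rows)
```

Returning the rejected rows instead of silently discarding them lets a pipeline report how many lines were affected before deciding whether to continue.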

pinin4fjords · 2023-11-09T18:00:44Z

OK, think this is now addressed in #1107. I've also added a line to the GTF in the test data to make doubly sure I'm right.

mplescher added the enhancement label Aug 31, 2023

mplescher mentioned this issue Sep 19, 2023: "process hisat2-align-s runs for several days using a single thread until time limit is reached" #1075 (closed)

drpatelh added this to the 3.12.1 milestone Oct 15, 2023

pinin4fjords mentioned this issue Nov 7, 2023: "Expand GTF filtering to remove rows with empty transcript ID when required, fix STAR GTF usage" #1107 (merged)

pinin4fjords closed this as completed Nov 15, 2023
