Enable quantification using StringTie AND a custom Ensembl genome #1074

Closed
mplescher opened this issue Aug 31, 2023 · 5 comments · Fixed by #1107

@mplescher

Description of feature

Hi.

I am using the nf-core rnaseq pipeline, version 3.12.0.
Since you pointed out that the transcriptome and GTF files in iGenomes are vastly out of date here, I am using a custom Ensembl genome, version 110.
I tried out these two:

I would like to use StringTie for transcript assembly and quantification, but ran into this bug. It seems that all gene records in the Ensembl GTF lack the transcript_id attribute that StringTie requires. Since StringTie only needs the transcript annotation anyway, simply removing all gene records from the GTF file solves the problem.
e.g. run:

 awk -F'\t' '($3!="gene")' my_genome.gtf > my_genome.no_genes.gtf

Would it be possible to check for Ensembl genomes automatically and (temporarily) remove the gene lines if necessary?
Many thanks.
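
Before removing anything, it may help to confirm which feature types actually lack the attribute. The following is a minimal sketch, not part of the pipeline: the function name `features_missing_transcript_id` is hypothetical, and it assumes a tab-separated, 9-column GTF in which a missing attribute simply means the key `transcript_id` does not appear in the last column.

```python
from collections import Counter


def features_missing_transcript_id(lines):
    """Count, per feature type (column 3), records with no transcript_id key."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        feature, attributes = fields[2], fields[8]
        if "transcript_id" not in attributes:
            counts[feature] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python check_gtf.py my_genome.gtf
    with open(sys.argv[1]) as handle:
        for feature, n in features_missing_transcript_id(handle).items():
            print(f"{feature}\t{n}")
```

If only `gene` shows up in the output, the awk filter above removes exactly the affected records.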

@pinin4fjords
Member

@mplescher could you please provide a reproducible example of this? I've actually written a solution up quickly, but wanted to replicate your issue before I ask to merge, and find that I can't.

For example, in the test profile, I intercepted the stringtie process, and changed the first entry to a gene without a transcript_id attribute, and that seemed to be fine.

@MatthiasZepper
Member

Probably related to/similar to #1102?

For an example GTF file see this thread in the rnaseq-Slack channel and this gffread issue for some background information.

@pinin4fjords
Member

Well, this is a stringtie error, so possibly not directly related, but yes, maybe it's empty transcript_id attributes rather than missing ones which are the issue.

@MatthiasZepper
Member

MatthiasZepper commented Nov 9, 2023

Yes, that is what I was thinking.

Multiple tools struggle with empty strings in the GTF attributes, because they violate the original format specification. Yet NCBI and Ensembl now release GTF files of this kind.

This is why I linked the gffread issue above, where Geo Pertea states:

Apparently somebody had the brilliant idea to allow stuff like this (empty string values!) in GTF:
This is the kind of abomination that a zombie format like GTF becomes as it tries to "evolve" and be something that was never meant to be. GTF was really meant only for annotating transcripts, nothing else. For any other features -- use GFF3, that's why it's there!

Let's be fair, the vast majority of bioinformatics pipelines care only about transcript/exon/CDS annotation. I assumed/hoped that's why GTF was allowed to survive alongside GFF3 for so long -- as a convenient shortcut for the annotation users to get only the transcript annotation whenever that was all they needed, instead of the GFF3 which can have many other genomic features that might not be needed for those annotation users.

This is an easy fix for me but I really dislike encouraging the survival of bad ideas (which this GTF2.2 perversion clearly is), let's allow "natural selection" of such ideas take its course instead (hopefully).

Based on this, I do not expect a fix for gffread and possibly other tools as well.

Hence, I think we must remove those lines or at least include a check, because everyone trying to use an up-to-date reference transcriptome in GTF format will likely download an invalid file.
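
A filter along these lines could be sketched as follows. This is only an illustration under stated assumptions, not necessarily what the pipeline does: the function name `drop_records_without_transcript_id` is hypothetical, and the sketch treats both a missing `transcript_id` and an empty one (`transcript_id "";`) as invalid.

```python
import re

# transcript_id followed by a non-empty quoted value; \b avoids matching
# longer keys that merely end in "transcript_id".
NONEMPTY_TRANSCRIPT_ID = re.compile(r'\btranscript_id "[^"]+"')


def drop_records_without_transcript_id(lines):
    """Yield comment lines plus records whose transcript_id is present and non-empty.

    Blank lines and rows that do not have exactly 9 tab-separated columns
    are dropped as well.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header comments untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and NONEMPTY_TRANSCRIPT_ID.search(fields[8]):
            yield line


if __name__ == "__main__":
    import sys

    # Usage (paths are examples): python filter_gtf.py in.gtf > out.gtf
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(drop_records_without_transcript_id(handle))
```

Writing the result to a new file rather than editing in place keeps the original annotation available for tools that do use the gene records.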

@pinin4fjords
Member

OK, think this is now addressed in #1107. I've also added a line to the GTF in the test data to make doubly sure I'm right.
