Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTF output file is different from GFF3 input file #74

Open
eduardopdev opened this issue May 21, 2021 · 7 comments
Open

GTF output file is different from GFF3 input file #74

eduardopdev opened this issue May 21, 2021 · 7 comments

Comments

@eduardopdev
Copy link

Hello. I installed gffread 0.12.1 through anaconda and used to convert a GFF3 file into GTF format.
The file I used is the gff3 annotation file from Araport11:
https://www.arabidopsis.org/download_files/Genes/Araport11_genome_release/Araport11_GFF3_genes_transposons.Mar92021.gff.gz

the i executed gffread:
gffread test_gffread/Araport11_GFF3_genes_transposons.Mar92021.gff -v -T -F --keep-exon-attrs --keep-genes -o test_gffread/Araport11_GFF3_genes_transposons.Mar92021_gffread.gtf

i wrote a small script to check if the files are the same, and some information from the gff3 file is not in
the gtf output file and vice versa.
Could someone do the same thing as i did and tell me what is wrong?
I can send the script i used to check if the files have the same info.

@gpertea gpertea added the bug label May 21, 2021
@gpertea
Copy link
Owner

gpertea commented May 21, 2021

I suppose you are talking about specific GFF attributes being lost? Or it's worse -- actual transcript structures?! (that would be a serious bug).
If you could give me more details (output/log of your script), I would appreciate it.

I do have to admit that back in the day when I started writing this code I made some rather arbitrary decisions to discard some attributes which I found annoying and useless in my work at the time. But now I want to revise that - I agree there should be a way to preserve the full integrity of the original data (without having to use 3 options to hope to achieve that !).

@eduardopdev
Copy link
Author

@gpertea would you like me to send you the information through here or email?

@gpertea
Copy link
Owner

gpertea commented May 21, 2021

Having the issue documented/summarized here would be better, so we can keep track of it properly (and github is neater than my mailbox). I was hoping it would be easy to summarize what is being lost (attributes? features?), we don't even need the whole log file attached here (or the script you used) if there are obvious patterns in what is lost/mishandled (which features or attributes).

@eduardopdev
Copy link
Author

So, i executed gffread again with the following command line:

gffread test_gffread/Araport11_GFF3_genes_transposons.Mar92021.gff -v -T -F --keep-exon-attrs --keep-genes -o test_gffread/Araport11_GFF3_genes_transposons.Mar92021_gffread.gtf

and the output i got from the '-v' option was:
Command line was: gffread tests/test_gffread[v0.12.1]/Araport11_GFF3_genes_transposons.Mar92021.gff -v -T -F --keep-exon-attrs --keep-genes -o tests/test_gffread[v0.12.1]/Araport11_GFF3_genes_transposons.Mar92021_gffread.gtf Warning: exon feature found before transcript ID ATMG01370 Warning: duplicate feature ID ATMG01275 (230304-234132) (discontinuous feature?) Warning: duplicate feature ID ATMG00665 (234352-236628) (discontinuous feature?) Warning: duplicate feature ID ATMG00790 (286837-286915) (discontinuous feature?) Error: discarding overlapping duplicate gene feature (329976-330773) with ID=ATMG00735 Warning: duplicate feature ID ATMG00790 (358842-358913) (discontinuous feature?) Warning: duplicate feature ID ATMG00665 (360107-360128) (discontinuous feature?) Warning: adjusted transcript ATMG00735 boundaries according to terminal exons. .. loaded 97444 genomic features from tests/test_gffread[v0.12.1]/Araport11_GFF3_genes_transposons.Mar92021.gff

So i ran my scripts. (I can paste the code to if you want)
The first one prints a dictionary with the amount of features that are in one file but not in the other.
For the amount of features in the GFF3 that were not in the output GTF File i got this:
{'five_prime_UTR': 35208, 'protein': 37444, 'three_prime_UTR': 36053, 'transposable_element': 31061, 'transposon_fragment': 34728, 'gene': 3309, 'transcript_region': 726, 'exon': 1518, 'pseudogene': 32, 'uORF': 184, 'region': 1}
and for the amount of features in the output GTF file not in the input GFF3 file, i got:
{'exon': 36, 'transcript': 2}

Then i ran the second script, the one who checks for IDs in the GFF3 file lost in the output GTF file:
For the number of IDs lost i got: 31722
and something happend with the id ATMG01370 (the one in the gffread output above): in the input file it only appears as the parent for the tRNA ATMG01375, but in the outpufile it appears as the gene_id for the same features:

INPUT FILE:

ChrM Araport11 tRNA 124603 124676 . - . ID=ATMG01375.1;Parent=ATMG01370;Name=ATMG01370.1;id2=id-trnH(GTG);parent2=gene-trnH(GTG);gbkey=tRNA;gene=trnH(GTG);product=tRNA-His;Note=chloroplast-like

ChrM Araport11 exon 124603 124676 . - . ID=ATMG01375.1;Parent=ATMG01370;Name=ATMG01370.1;id2=exon-id-trnH(GTG)-1;parent2=id-trnH(GTG);gbkey=tRNA;gene=trnH(GTG);product=tRNA-His

OUTPUT FILE:

ChrM Araport11 transcript 124603 124676 . - . transcript_id "ATMG01370"; gene_id "ATMG01370"; gene_name "trnH(GTG)";

ChrM Araport11 exon 124603 124676 . - . transcript_id "ATMG01370"; gene_name "trnH(GTG)"; Name "ATMG01370.1"; id2 "exon-id-trnH(GTG)-1"; parent2 "id-trnH(GTG)"; gbkey "tRNA"; gene "trnH(GTG)"; product "tRNA-His";

ChrM Araport11 transcript 124603 124676 . - . transcript_id "ATMG01375.1"; gene_id "ATMG01375.1"; gene_name "trnH(GTG)"; Name "ATMG01370.1"; id2 "id-trnH(GTG)"; parent2 "gene-trnH(GTG)"; gbkey "tRNA"; gene "trnH(GTG)"; product "tRNA-His"; Note "chloroplast-like";

what really matter for me are the exons. No exons can be lost. But it seems that some are lost and some are created.

@gpertea
Copy link
Owner

gpertea commented May 22, 2021

To address your 2nd point first -- from what you are showing me, no exons were lost -- just their useless IDs were discarded :)

Now I recall the contempt I've always had for exon or CDS IDs in GFF3 files -- what's their point, besides increasing the file size? An exon is uniquely and more informatively identified by its location (chr, strand and start-end coordinates), and as an annotation feature they are the lowest in the feature hierarchy - they don't "parent" any sub-features (which would indeed justify an ID).. So I never encountered any reason to keep exon/CDS ID in the output of gffread.

Also, GTF does not even support exon IDs as an attribute for the exon feature -- sure one can add a custom "exon_id" attribute to each exon line.. but again, why?

As for your 1st point, a few notes:

  • gffread merges UTR features into exons, but no structural information is lost about them when both exon and CDS features are present (just subtract the CDS intervals from exon intervals and you get the UTRs)
  • by default gffread discards non-transcript features like "protein", "region" etc. These features are usually redundant or pointless for structural transcriptomics (the "region" is just a non-transcript, non-gene annotation of a genomic region; a "protein" feature is just duplicating the CDS intervals). More importantly for your use case, such features are not even compatible with GTF output
  • other features that seemed lost might have been silently converted (pseudogenes -> genes etc.)

As I said before, I still think as a matter of principle gffread should provide some option to fully preserve the original attributes (when it is used for transcript filtering), but I don't think this is reasonable to enforce when converting from GFF3 to GTF. GTF is a much simpler, transcript-centered format and actually one of the motivations for writing gffread was to convert bloated GFF3 files into a much "leaner" GFF3 output where only the transcripts' structural information is retained -- which is what GTF was meant for as well.

@eduardopdev
Copy link
Author

Thanks for the response. First i want to apologize. The transcripts in the output file that are not in the input file are not there because in the annotation file the gene_id is the same for 3 genes, and gffread merged the coordinates. I made, by my message, make it look like a bug.
So the exons that are not in both files were actually merged, right? I got that, but there is also a 'weird' behaviour relative to 'transcript_regions' and 'transposable elements'. I can't find some of their IDs in the GTF output file.
Is there a way to keep transcript_region and transposable_elements and to stop the 'merging behaviour'?

I am asking this because I need to use hisat2 in my pipeline but i am working with rice and the only annotation file I have is a GFF3 one. I have two annotation files (GFF3 and GTF) on Arabidopsis thaliana and when I extract the exons and splice sites coordinates from them, the results are the same. I need to obtain the same result from the GFF3 and the GTF output file i get from gffread.

@gpertea
Copy link
Owner

gpertea commented May 24, 2021

Thank you for clarifying the purpose of this conversion, I understand better the issue now. Merging UTR features into exons does not affect the splice sites at all (and not the full exon coordinates that hisat2 also takes as input when building the index).

I haven't worked with transposable element annotations before but they don't seem to actually have exons per se but just transposon fragments as their sub-feature and generally only one of them, so there are no splice sites for those.. In the rare cases where a transposable_element has more than one transposon_fragment (2659 out of 31189 in that file), I do not think there is a proper intron between them, so likely no splice sites (but I might be wrong).

I've never seen "transcript_region" annotations before, I suppose those are some sort of predicted potential transcripts (or fragments). Those do have "exons" as sub-features in that file and I suppose gffread should've seen those and preserved them in the GTF output (and converted them to transcripts, since there is no "transcript_region" feature accepted in GTF). So there might be a bug in there after all, or at least an option to preserve those transcript-like feature, since they do have exons defined -- I'll investigate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants