GTF output file is different from GFF3 input file #74
Comments
I suppose you are talking about specific GFF attributes being lost? Or is it worse: actual transcript structures? (That would be a serious bug.) I have to admit that back when I started writing this code I made some rather arbitrary decisions to discard attributes that I found annoying and useless in my work at the time. Now I want to revise that. I agree there should be a way to preserve the full integrity of the original data (without having to combine 3 options and hope that achieves it!).
@gpertea would you like me to send you the information here or by email?
Having the issue documented and summarized here would be better, so we can keep track of it properly (and GitHub is neater than my mailbox). I was hoping it would be easy to summarize what is being lost (attributes? features?). We don't even need the whole log file attached here (or the script you used) if there are obvious patterns in what is lost or mishandled (which features or attributes).
So, I executed gffread again with the following command line:
and the output I got from the '-v' option was:
So I ran my scripts (I can paste the code too if you want). Then I ran the second script, the one that checks for IDs in the GFF3 file that are lost in the output GTF file:
INPUT FILE:
OUTPUT FILE:
What really matters to me are the exons. No exons can be lost, but it seems that some are lost and some are created.
To address your 2nd point first: from what you are showing me, no exons were lost, only their useless IDs were discarded :) Now I recall the contempt I've always had for exon or CDS IDs in GFF3 files. What's their point, besides increasing the file size? An exon is uniquely and more informatively identified by its location (chr, strand and start-end coordinates), and as annotation features exons are the lowest in the feature hierarchy: they don't "parent" any sub-features (which would indeed justify an ID). So I never encountered any reason to keep exon/CDS IDs in the output of gffread. Also, GTF does not even support exon IDs as an attribute for the exon feature. Sure, one can add a custom "exon_id" attribute to each exon line, but again, why? As for your 1st point, a few notes:
As I said before, I still think that, as a matter of principle, gffread should provide some option to fully preserve the original attributes (when it is used for transcript filtering), but I don't think this is reasonable to enforce when converting from GFF3 to GTF. GTF is a much simpler, transcript-centered format, and one of the motivations for writing gffread was actually to convert bloated GFF3 files into a much "leaner" GFF3 output where only the transcripts' structural information is retained, which is what GTF was meant for as well.
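The point above, that an exon is identified by its location rather than by its ID, suggests comparing the two files by exon coordinates instead of by IDs. The following is a minimal sketch of such a check, not the script used in this thread. The names `exon_keys` and `compare_exons` are hypothetical, and it assumes both files are tab-delimited with the usual nine columns and that the GFF3 input has explicit `exon` features (a file that only has CDS and UTR features would need those merged first).

```python
from typing import Iterable, Set, Tuple

# An exon is keyed by location only: (seqid, strand, start, end).
ExonKey = Tuple[str, str, int, int]


def exon_keys(lines: Iterable[str]) -> Set[ExonKey]:
    """Collect location keys for every 'exon' line of a GFF3 or GTF file.

    The first eight columns have the same layout in both formats, so the
    attribute column (where IDs live) is never inspected.
    """
    keys: Set[ExonKey] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        keys.add((cols[0], cols[6], int(cols[3]), int(cols[4])))
    return keys


def compare_exons(
    gff3_lines: Iterable[str], gtf_lines: Iterable[str]
) -> Tuple[Set[ExonKey], Set[ExonKey]]:
    """Return (exons only in the GFF3 input, exons only in the GTF output)."""
    before = exon_keys(gff3_lines)
    after = exon_keys(gtf_lines)
    return before - after, after - before
```

With real files, each argument can simply be an open file handle, for example `compare_exons(open(gff3_path), open(gtf_path))`, where both paths are placeholders. Two empty sets mean that every exon location is present in both files, regardless of which IDs were dropped.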
Thanks for the response. First, I want to apologize. The transcripts in the output file that are not in the input file appear because, in the annotation file, the gene_id is the same for 3 genes, and gffread merged the coordinates. My message made it look like a bug. I am asking this because I need to use hisat2 in my pipeline, but I am working with rice and the only annotation file I have is a GFF3 one. I have two annotation files (GFF3 and GTF) for Arabidopsis thaliana, and when I extract the exon and splice site coordinates from them, the results are the same. I need to obtain the same result from the GFF3 file and from the GTF output file I get from gffread.
Thank you for clarifying the purpose of this conversion, I understand the issue better now. Merging UTR features into exons does not affect the splice sites at all (nor the full exon coordinates that hisat2 also takes as input when building the index). I haven't worked with transposable element annotations before, but they don't seem to have exons per se, just transposon fragments as their sub-features, and generally only one of them, so there are no splice sites for those. In the rare cases where a transposable_element has more than one transposon_fragment (2659 out of 31189 in that file), I do not think there is a proper intron between them, so likely no splice sites (but I might be wrong). I've never seen "transcript_region" annotations before; I suppose those are some sort of predicted potential transcripts (or fragments). Those do have "exons" as sub-features in that file, and I suppose gffread should have seen those and preserved them in the GTF output (and converted them to transcripts, since there is no "transcript_region" feature accepted in GTF). So there might be a bug in there after all, or at least there should be an option to preserve those transcript-like features, since they do have exons defined. I'll investigate.
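The claim that merging UTR features into exons leaves splice sites untouched can be illustrated with a small sketch. This is an illustration under stated assumptions, not gffread's or hisat2's actual code: `merge_adjacent` and `introns` are hypothetical helpers, coordinates are 1-based and inclusive as in GFF/GTF, and all segments belong to one transcript on one strand.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def merge_adjacent(segments: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly abutting segments (e.g. UTR + CDS) into exons."""
    merged: List[Interval] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + 1:
            # Segment touches or overlaps the previous one: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def introns(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons; their ends are the splice sites."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Two exons annotated directly ...
exons = [(100, 200), (300, 400)]
# ... and the same transcript annotated as UTR and CDS pieces.
pieces = [(100, 149), (150, 200), (300, 350), (351, 400)]

print(introns(exons))                   # [(201, 299)]
print(introns(merge_adjacent(pieces)))  # [(201, 299)]
```

Because a UTR and its neighbouring CDS abut with no gap, merging them only removes internal boundaries inside an exon; the intron boundaries, and therefore the splice sites, stay the same.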
Hello. I installed gffread 0.12.1 through Anaconda and used it to convert a GFF3 file into GTF format.
The file I used is the GFF3 annotation file from Araport11:
https://www.arabidopsis.org/download_files/Genes/Araport11_genome_release/Araport11_GFF3_genes_transposons.Mar92021.gff.gz
Then I executed gffread:
gffread test_gffread/Araport11_GFF3_genes_transposons.Mar92021.gff -v -T -F --keep-exon-attrs --keep-genes -o test_gffread/Araport11_GFF3_genes_transposons.Mar92021_gffread.gtf
I wrote a small script to check whether the files are the same, and some information from the GFF3 file is not in the GTF output file and vice versa.
Could someone do the same thing as I did and tell me what is wrong?
I can send the script I used to check whether the files have the same info.