CDS gets filtered out and its attributes overwrite attributes for another CDS in the same gene #112
Comments
And in the same file again here:
turns into this where again a CDS disappears and its attributes overwrite another CDS for a different protein (same gene):
Interesting, thank you for these reports on parsing failures with this viral annotation. In this particular case I can see that some CDSs are dropped when they are found to be "contained" in another CDS, or to have a large overlap with another CDS from the same transcript (so it's not a programmed ribosomal frameshift, which would be part of the same CDS). The error seems to stem from the assumption that there can be at most one CDS (one chain of CDS segments) per transcript ID, i.e. one protein per transcript, which is clearly not the case in this annotation: each CDS segment here, even though parented by the same transcript ID, seems to be a distinct coding sequence and thus leads to a different protein product (!). Currently the transcript data structure I am using only keeps track of one CDS segment chain per transcript, and changing that would have a large impact on the downstream code I am using. Perhaps the easiest and least disruptive workaround would be for the parser to emit two transcripts in such cases, and thus treat these distinct CDSs as distinct transcript "isoforms" of the same gene. This workaround might also be useful with other bioinformatics software that may have assumed "one transcript => one protein".
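To make the proposed workaround concrete, here is a rough Python sketch of splitting the CDS features under one transcript into separate "virtual" transcripts. It is not how the parser is implemented. It assumes that distinct coding sequences under the same `Parent` carry distinct `ID` attributes, and that the segments of one multi-part CDS share a single `ID` (the GFF3 convention for multi-line features). The function names, the `.cdsN` suffix, and the feature IDs in the example are made up for illustration; only the two coordinate pairs come from this report.

```python
from collections import defaultdict


def parse_attrs(col9):
    """Parse a GFF3 attribute column into a dict (no URL-decoding)."""
    attrs = {}
    for field in col9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def split_cds_chains(gff_lines):
    """Group CDS segments by transcript, then by CDS ID.

    Returns {parent_id: {cds_id: [(start, end), ...]}}.
    Assumption: segments of the same coding sequence share one ID.
    """
    chains = defaultdict(lambda: defaultdict(list))
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attrs(cols[8])
        parent, cds_id = attrs.get("Parent"), attrs.get("ID")
        if parent is None or cds_id is None:
            continue  # this sketch cannot tell such chains apart
        chains[parent][cds_id].append((int(cols[3]), int(cols[4])))
    return chains


def virtual_transcripts(chains):
    """Emit one transcript per CDS chain.

    A transcript with a single chain keeps its ID; a transcript with
    several chains is split into hypothetical IDs '<parent>.cds1', ...
    """
    out = {}
    for parent, by_id in chains.items():
        if len(by_id) == 1:
            (segments,) = by_id.values()
            out[parent] = sorted(segments)
        else:
            for n, segments in enumerate(by_id.values(), start=1):
                out[f"{parent}.cds{n}"] = sorted(segments)
    return out


# Hypothetical records: only the coordinates are taken from the report.
example = [
    "seq1\tGenbank\tCDS\t1405\t2262\t.\t+\t0\tID=cds-A;Parent=rna-1",
    "seq1\tGenbank\tCDS\t1692\t1838\t.\t+\t0\tID=cds-B;Parent=rna-1",
]
for tid, segs in virtual_transcripts(split_cds_chains(example)).items():
    print(tid, segs)
```

Grouping by `ID` rather than by overlap keeps a nested CDS such as 1692..1838 from being treated as a redundant piece of the 1405..2262 chain, which is the failure described in this issue.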
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/845/545/GCA_000845545.1_ViralProj14434/GCA_000845545.1_ViralProj14434_genomic.gff.gz
When I run with
-F --keep-exon-attrs
to show what happens, these lines: get converted into this below, where CDS 1692..1838 gets filtered out for some reason, and its attributes overwrite those of CDS 1405..2262, which is for a different protein (same gene):