Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDS gets filtered out and its attributes overwrite attributes for another CDS in the same gene #112

Open
hermidalc opened this issue Nov 16, 2022 · 2 comments

Comments

@hermidalc
Copy link

hermidalc commented Nov 16, 2022

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/845/545/GCA_000845545.1_ViralProj14434/GCA_000845545.1_ViralProj14434_genomic.gff.gz

When I run with -F --keep-exon-attrs to show what happens, these lines:

AF234533.1	Genbank	gene	1405	2262	.	+	.	ID=gene-P;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding
AF234533.1	Genbank	mRNA	1405	2262	.	+	.	ID=rna-P;Parent=gene-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1	Genbank	exon	1405	2262	.	+	.	ID=exon-P-1;Parent=rna-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1	Genbank	CDS	1415	2251	.	+	0	ID=cds-AAG10410.1;Parent=rna-P;Dbxref=NCBI_GP:AAG10410.1;Name=AAG10410.1;gbkey=CDS;gene=P;product=polymerase-associated protein P;protein_id=AAG10410.1
AF234533.1	Genbank	CDS	1692	1838	.	+	0	ID=cds-AAG10411.1;Parent=rna-P;Dbxref=NCBI_GP:AAG10411.1;Name=AAG10411.1;gbkey=CDS;gene=P;product=protein P';protein_id=AAG10411.1

get converted into this below, where CDS 1692..1838 get filtered out for some reason, and its attributes overwrite CDS 1405..2262 which is for a different protein (same gene):

AF234533.1	Genbank	mRNA	1405	2262	.	+	.	ID=rna-P;geneID=gene-P;gene_name=P;gbkey=mRNA;gene=P;product=polymerase-associated protein P;Name=P;gene_biotype=protein_coding
AF234533.1	Genbank	exon	1405	2262	.	+	.	Parent=rna-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1	Genbank	CDS	1415	2251	.	+	0	Parent=rna-P;Dbxref=NCBI_GP:AAG10411.1;Name=AAG10411.1;gbkey=CDS;gene=P;product=protein P';protein_id=AAG10411.1
@hermidalc
Copy link
Author

hermidalc commented Nov 16, 2022

And in the same file again here:

AF234533.1	Genbank	gene	6816	7453	.	+	.	ID=gene-alpha;Name=alpha;gbkey=Gene;gene=alpha;gene_biotype=protein_coding
AF234533.1	Genbank	mRNA	6816	7453	.	+	.	ID=rna-alpha;Parent=gene-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1	Genbank	exon	6816	7453	.	+	.	ID=exon-alpha-1;Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1	Genbank	CDS	6824	7090	.	+	0	ID=cds-AAG10415.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1	Genbank	CDS	7092	7442	.	+	0	ID=cds-AAG10416.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10416.1;Name=AAG10416.1;gbkey=CDS;gene=alpha;product=alpha 2 protein;protein_id=AAG10416.1
AF234533.1	Genbank	CDS	7166	7321	.	+	0	ID=cds-AAG10417.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1

turns into this where again a CDS disappears and its attributes overwrite another CDS for a different protein (same gene):

AF234533.1	Genbank	mRNA	6816	7453	.	+	.	ID=rna-alpha;geneID=gene-alpha;gene_name=alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein;Name=alpha;gene_biotype=protein_coding
AF234533.1	Genbank	exon	6816	7453	.	+	.	Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1	Genbank	CDS	6824	7090	.	+	0	Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1	Genbank	CDS	7092	7442	.	+	0	Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1

@hermidalc hermidalc changed the title Valid CDS feature gets filtered out and its attributes overwrite attributes for another valid CDS Valid CDS feature gets filtered out and its attributes overwrite attributes for another valid CDS in the same gene Nov 16, 2022
@hermidalc hermidalc changed the title Valid CDS feature gets filtered out and its attributes overwrite attributes for another valid CDS in the same gene CDS feature gets filtered out and its attributes overwrite attributes for another CDS in the same gene Nov 16, 2022
@hermidalc hermidalc changed the title CDS feature gets filtered out and its attributes overwrite attributes for another CDS in the same gene CDS gets filtered out and its attributes overwrite attributes for another CDS in the same gene Nov 16, 2022
@gpertea
Copy link
Owner

gpertea commented Nov 16, 2022

interesting - thank you for these reports related to parsing failures of this viral annotation.

In this particular case I can see that some CDSs are dropped when they are found to be "contained" in another CDS or having large overlaps with another CDS from the same transcript (so it's not a programmed ribosomal shift, which would be part of the same CDS).

It seems the error stems from the assumption that there can be at most one CDS (one chain of CDS segments) per transcript ID (i.e. one protein per transcript), which is clearly not the case in this annotation -- each CDS segment here, even though parented by the same transcript ID, seems to be a distinct coding sequence and thus leading to different protein products (!).

Currently the transcript data structure I am using only keeps track of one CDS segment chain per transcript, and changing that will be quite impactful in other downstream code I am using. Perhaps the easiest/least impactful workaround would be for the parser to emit two transcripts in such cases (and thus treat these distinct CDSs as distinct transcript "isoforms" from the same gene..). This workaround might actually be useful with other bioinformatics software that may have also assumed "one transcript => one protein".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants