Issue with translating genes on complement strand #1

Closed
reubwn opened this issue Nov 24, 2017 · 6 comments

reubwn commented Nov 24, 2017

Hello,

I've come across an issue with how CDS features are printed for genes encoded on the complementary strand. The problem manifests itself clearly when using the --translate flag, as it produces lots of erroneous translations riddled with stop codons *.
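A quick way to flag such translations is to look for `*` anywhere except the last position. This is only a sketch: `internal_stops` is a hypothetical helper, not part of the tool, and the input is the first few residues of the translation shown in the example that follows.

```python
def internal_stops(protein):
    """Return 0-based positions of '*' that are not the final character."""
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]

# First residues of the reported translation.
print(internal_stops("QKFI*SNIWC*HLV"))
```

An empty list would be expected for a correctly translated, complete CDS.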

I give an example below.

The EMBL output for an affected gene looks like:

FT   gene            complement(123273..128445)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.g00007"
FT   mRNA            complement(join(128366..128445,126919..127115,124188..124
FT                   406,123273..123484))
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007"
FT   exon            complement(128366..128445)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E1"
FT   exon            complement(126919..127115)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E2"
FT   exon            complement(124188..124406)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E3"
FT   exon            complement(123273..123484)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E4"
FT   CDS             complement(join(<128366..128445,126919..127115,124188..12
FT                   4406,123273..>123484))
FT                   /locus_tag="BANY_locus6"
FT                   /codon_start=1
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-CDS"
FT                   /translation="QKFI*SNIWC*HLVIRS*TTNALTLVCVTFSACRRGSSIRCRVVS
FT                   LHVAAALSSRAMEIPPRAMTTPL*VSS*QTNMDRE*RASNDRHTVVQRNVWRTCEDRKI
FT                   DS*RRNSNRKRLSV*GRCR*CCF*MWFR*L**MGSSYKL*FGEKCEIIKISKPIKSHWA
FT                   KENNLNLNELLSDGEYKELYRLAMIKWSEDMREKDYGCFCRAACENDVSTSNFTVQR*E
FT                   KVWQRFFN*SLKRK"
FT                   /transl_table=1

The mRNA feature looks fine, but there are some puzzling < and > characters in the CDS feature that I think may be the problem. The translation is then wrong, and in fact appears to be the translation of the exons in reverse order, as QKFI* corresponds to the first 4 "codons" of the last exon (E4, 123273..123484).
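For reference, a location written as complement(join(...)) in the INSDC feature table is read by joining the segments in the order listed and then reverse-complementing the result. The sketch below uses a made-up 30-base sequence and two made-up exons (not the BANY00001 data) to show that listing the segments in descending order puts the low-coordinate exon first, which would be consistent with the symptom described above. The helper names are illustrative only.

```python
# Toy illustration; sequence and coordinates are invented, not taken from BANY00001.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def complement_join(genome, segments):
    """Evaluate complement(join(...)): join the 1-based inclusive segments
    in the order listed, then reverse-complement the joined string."""
    joined = "".join(genome[start - 1:end] for start, end in segments)
    return revcomp(joined)

genome = "AAAAACCCCCGGGGGTTTTTACGTACGTAC"
exon_low = (1, 5)      # lower coordinates: last exon of a minus-strand gene
exon_high = (21, 30)   # higher coordinates: first exon of a minus-strand gene

# Ascending order: the transcript starts with the high-coordinate exon.
ascending = complement_join(genome, [exon_low, exon_high])
# Descending order: the low-coordinate exon ends up first.
descending = complement_join(genome, [exon_high, exon_low])

print(ascending)
print(descending)
```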

Hopefully an easy issue, and thanks for a great tool, this is going to be extremely useful :-)

Or maybe something funny in the GFF? The entry for this gene is:

BANY00001       GenomeHubs      gene    123273  128445  .       -       .       ID=BANY.1.2.g00007
BANY00001       GenomeHubs      mRNA    123273  128445  .       -       .       ID=BANY.1.2.t00007;Parent=BANY.1.2.g00007
BANY00001       GenomeHubs      exon    128366  128445  .       -       .       ID=BANY.1.2.t00007-E1;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    126919  127115  .       -       .       ID=BANY.1.2.t00007-E2;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    124188  124406  .       -       .       ID=BANY.1.2.t00007-E3;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    123273  123484  .       -       .       ID=BANY.1.2.t00007-E4;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     128366  128445  .       -       0       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     126919  127115  .       -       2       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     124188  124406  .       -       1       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     123273  123484  .       -       1       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007

Running biopython version: 1.67 and bcbio-gff version: 0.6.4

Collaborator

Juke34 commented Nov 24, 2017

Interesting, we will look into it. Could you provide the FASTA for the sequence BANY00001?

Thank you for pointing that out.

Author

reubwn commented Nov 24, 2017

No problem, here's the file. Cheers!

BANY00001.fa.gz

Collaborator

Juke34 commented Nov 24, 2017

It looks like it's due to your odd gff3. In your file, when a gene is on the minus strand, its sub-features (exons and CDS) are listed in decreasing coordinate order. I will add a fix to systematically sort those sub-features when we go through them to avoid such cases.
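A minimal sketch of that kind of fix, assuming sub-features are available as simple (start, end) pairs. This is not the tool's actual code, `build_location` is a hypothetical helper, and partial-location markers (< and >) are left out. The coordinates are the CDS segments from the report above.

```python
def build_location(segments, strand):
    """Format an EMBL-style location from (start, end) pairs.

    Segments are sorted by start coordinate first, so the result does not
    depend on the order in which the GFF3 file listed them.
    """
    ordered = sorted(segments, key=lambda seg: seg[0])
    parts = ",".join(f"{start}..{end}" for start, end in ordered)
    location = f"join({parts})" if len(ordered) > 1 else parts
    return f"complement({location})" if strand == "-" else location

# CDS rows in the order they appear in the GFF3 (descending).
cds = [(128366, 128445), (126919, 127115), (124188, 124406), (123273, 123484)]
print(build_location(cds, "-"))
```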

Author

reubwn commented Nov 24, 2017

Ah, gff3, the file format with no fixed format... This gff was downloaded directly from an ensembl database (ensembl.lepbase.org) too. Could you say what exactly is odd about it?

Thanks for the fix!

Collaborator

Juke34 commented Nov 24, 2017

Usually all the cds or exon features are sorted in increasing order of their locations, no matter their strand.
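One way to check a file against this convention is to group exon and CDS rows by parent and test whether their start coordinates are ascending. The sketch below is an illustration under assumptions: it parses tab-separated GFF3 text by hand rather than with bcbio-gff, `unsorted_parents` is a hypothetical name, and the sample holds just two of the CDS rows from the report above.

```python
from collections import defaultdict

def unsorted_parents(gff_text, types=("exon", "CDS")):
    """Return (Parent, type) keys whose rows are not in ascending start order."""
    starts = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] not in types:
            continue
        # Column 9 holds key=value attributes separated by semicolons.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        starts[(attrs.get("Parent", ""), cols[2])].append(int(cols[3]))
    return sorted(key for key, vals in starts.items() if vals != sorted(vals))

sample = "\n".join([
    "BANY00001\tGenomeHubs\tCDS\t128366\t128445\t.\t-\t0\tID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007",
    "BANY00001\tGenomeHubs\tCDS\t126919\t127115\t.\t-\t2\tID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007",
])
print(unsorted_parents(sample))
```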

Collaborator

Juke34 commented Nov 27, 2017

Issue fixed.

Juke34 closed this as completed Nov 27, 2017