KEGG annotations - the number is much lower number than expected #285

Closed
tvtv195 opened this issue Apr 25, 2024 · 4 comments
Labels
question Further information is requested

Comments

@tvtv195

tvtv195 commented Apr 25, 2024

Dear bakta team,
We ran bakta on several bacterial genomes, with >50% estimated completeness and <10% contamination, to obtain KEGG annotations.
Bakta ran just fine, without any issues, e.g.:

bakta --db /gpfs/gpfs1/scratch/cb761220/databases/bakta_db_2024/db \
  -o /scratch/cb761203/02.analysis/13.module13/01.bakta/ -v \
  /scratch/cb761203/02.analysis/12.module12/bins_hq/sample.bin.35.fa

However, we only got between 0 and 17 KEGG annotations (KO IDs) per genome.
For example, a very small bacterial genome of 531,276 bp had 517 CDS but not even a single KEGG annotation.
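
For anyone who wants to reproduce such counts, here is a minimal sketch that tallies KO IDs in bakta's tab-separated output. It assumes that KO cross-references show up in the DbXrefs column as `KEGG:K` followed by five digits and that header lines start with `#`. The function name and the example rows are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Assumption: KO cross-references look like "KEGG:K02313" in the DbXrefs column.
KO_PATTERN = re.compile(r"KEGG:(K\d{5})")

def count_ko_ids(tsv_lines):
    """Return (number of features with at least one KO ID, Counter of KO IDs)."""
    features_with_ko = 0
    ko_counts = Counter()
    for line in tsv_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        kos = KO_PATTERN.findall(line)
        if kos:
            features_with_ko += 1
            ko_counts.update(kos)
    return features_with_ko, ko_counts

# Hypothetical rows for illustration only (not real bakta output).
example = [
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs",
    "contig_1\tcds\t1\t900\t+\tTAG_00001\tdnaA\tDnaA\tKEGG:K02313, UniRef:UniRef90_X",
    "contig_1\tcds\t1000\t1500\t-\tTAG_00002\t\thypothetical protein\t",
]
n, counts = count_ko_ids(example)
print(n, dict(counts))
```

On a real run, the lines could come from `open(path)` on the `.tsv` file in the output directory.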

Is there something wrong with the way bakta assigns KEGG annotations?
Best,
Chris

@tvtv195 tvtv195 added the bug Something isn't working label Apr 25, 2024
@tvtv195 tvtv195 changed the title KEGG annotations - much lower number than expected KEGG annotations - the number is much lower number than expected Apr 26, 2024
@oschwengers
Owner

Hi Chris, and thanks for reaching out. Based on the command line above, I guess you're working on a MAG. Depending on the species, it could simply be the case that few or no genes are similar to those stored in KEGG. Could you provide some information about how many UniRef90-annotated genes and how many hypotheticals you get?
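
One way to gather these numbers is sketched below. It assumes the bakta TSV layout has the feature type in the second column, the product in the eighth, and the DbXrefs in the ninth, that UniRef90 hits appear as `UniRef90_...` cross-references, and that unannotated CDS carry the product "hypothetical protein". The function name and the example rows are hypothetical.

```python
def summarize_cds(tsv_lines):
    """Count CDS, UniRef90-annotated CDS, and hypothetical proteins."""
    totals = {"cds": 0, "uniref90": 0, "hypothetical": 0}
    for line in tsv_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # Assumed column order: ..., Type (index 1), ..., Product (7), DbXrefs (8)
        if len(fields) < 8 or fields[1].lower() != "cds":
            continue
        totals["cds"] += 1
        product = fields[7]
        dbxrefs = fields[8] if len(fields) > 8 else ""
        if "UniRef90_" in dbxrefs:
            totals["uniref90"] += 1
        if "hypothetical protein" in product.lower():
            totals["hypothetical"] += 1
    return totals

# Hypothetical rows for illustration only (not real bakta output).
rows = [
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs",
    "contig_1\tcds\t1\t900\t+\tTAG_00001\tdnaA\tDnaA\tUniRef:UniRef90_X",
    "contig_1\tcds\t1000\t1500\t-\tTAG_00002\t\thypothetical protein\t",
    "contig_1\ttRNA\t2000\t2075\t+\tTAG_00003\t\ttRNA-Ala\t",
]
summary = summarize_cds(rows)
print(summary)
```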

@oschwengers oschwengers added question Further information is requested and removed bug Something isn't working labels Apr 28, 2024
@cpauvert

Just wanted to add that not all the KEGG annotations are included in the database (see the line below), which could also explain the discrepancy between the results and your expectations @tvtv195

if(knum != '-' and hmm['f_measure'] > 0.77): # discard the lower 10th percentile
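
To make the effect of that condition concrete, here is a small sketch that applies the same test to a few made-up profiles. Only the `0.77` cutoff and the `knum != '-'` check come from the quoted line. The helper name and the profile records are hypothetical and not part of the Bakta code base.

```python
F_MEASURE_CUTOFF = 0.77  # value taken from the quoted line

def keep_profile(knum, f_measure, cutoff=F_MEASURE_CUTOFF):
    """Mirror the quoted condition: a K number is present and the F-measure is strictly above the cutoff."""
    return knum != '-' and f_measure > cutoff

# Hypothetical profile records for illustration only.
profiles = [
    {"knum": "K00001", "f_measure": 0.95},  # kept
    {"knum": "K00002", "f_measure": 0.60},  # dropped: F-measure too low
    {"knum": "-", "f_measure": 0.99},       # dropped: no K number
    {"knum": "K00003", "f_measure": 0.77},  # dropped: comparison is strict
]
kept = [p["knum"] for p in profiles if keep_profile(p["knum"], p["f_measure"])]
print(kept)
```

Any KO whose profile fails this test never enters the database, so genes matching it cannot receive that KO ID.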

Best,

@tvtv195
Author

tvtv195 commented Jul 17, 2024

Thanks, that's good to know. We have found a workaround: we run the gene calling through bakta and then submit the output to KEGG's BlastKOALA (KEGG Orthology And Links Annotation; https://www.kegg.jp/blastkoala/).
This way we went from, e.g., 1 KEGG annotation (bakta) to >100 (BlastKOALA). Our downstream checks (gene neighborhood and operon analysis, BLAST, phylogeny, etc.) confirmed the BlastKOALA annotations.
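
A comparison like this can be sketched as follows. The sketch assumes the BlastKOALA result is available as a tab-separated table with the gene ID in the first column and the K number, if any, in the second. The bakta assignments are taken as a plain gene-to-KO mapping. All names and example values are hypothetical.

```python
def parse_ko_table(lines):
    """Parse 'gene_id<TAB>K number' lines, skipping genes without an assignment."""
    assignments = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1].strip():
            assignments[parts[0]] = parts[1].strip()
    return assignments

def compare_assignments(bakta, koala):
    """Summarize agreement between two gene -> KO mappings."""
    common = bakta.keys() & koala.keys()
    return {
        "agree": sum(1 for g in common if bakta[g] == koala[g]),
        "conflict": sum(1 for g in common if bakta[g] != koala[g]),
        "bakta_only": len(bakta.keys() - koala.keys()),
        "koala_only": len(koala.keys() - bakta.keys()),
    }

# Hypothetical data for illustration only.
koala_lines = ["TAG_00001\tK02313", "TAG_00002", "TAG_00003\tK02338"]
bakta_kos = {"TAG_00001": "K02313"}
result = compare_assignments(bakta_kos, parse_ko_table(koala_lines))
print(result)
```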

@oschwengers
Owner

Thanks @cpauvert for bringing this up - 100% correct!

The Bakta database integrates annotation information from various external databases, ranking them so that larger, more comprehensive DBs come first and smaller, often more specific, higher-quality DBs come later. In this way, we try to exploit the potentially more specific, higher-quality information from the smaller DBs. However, it is far from trivial to formalize these rankings. Hence, I decided to only take into account the upper 90% of all KEGG annotations (by HMM F-measure), as @cpauvert mentioned.

I hope that, over time, more and more additional information, such as dbxrefs, ECs, etc., will make it into the Bakta annotation database. Needless to say, we're always open to and thankful for any ideas and feedback on how to improve these things.
