KEGG annotations - the number is much lower number than expected #285

tvtv195 · 2024-04-25T17:10:17Z

Dear bakta team,
We ran bakta on several bacterial genomes with >50% estimated completeness and <10% contamination to obtain KEGG annotations.
Bakta ran just fine, without any issues, e.g.:

bakta --db /gpfs/gpfs1/scratch/cb761220/databases/bakta_db_2024/db \
    -o /scratch/cb761203/02.analysis/13.module13/01.bakta/ -v \
    /scratch/cb761203/02.analysis/12.module12/bins_hq/sample.bin.35.fa

However, we only got between 0 and 17 KEGG annotations (KO IDs) per genome.
For example, a very small bacterial genome of 531,276 bp had 517 CDS but not even a single KEGG annotation.

Is there something wrong with the way bakta assigns KEGG annotations?
Best,
Chris

oschwengers · 2024-04-28T08:06:18Z

Hi Chris, and thanks for reaching out. Based on the command line above, I guess you're working on a MAG. Depending on the species, it could simply be that few or no genes are similar to those stored in KEGG. Could you provide some information about how many UniRef90-annotated genes and how many hypotheticals you get?
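
The counts asked for here can be pulled from Bakta's tab-separated annotation table. The following is a minimal sketch, not an official Bakta utility. It assumes nine columns in the order given in the comment, cross-references written as `UniRef:UniRef90_...` and `KEGG:K...`, and the product `hypothetical protein` for unannotated CDS. The function name and the sample rows are made up for illustration, so check the layout against your own output file first.

```python
from collections import Counter


def summarize_cds(lines):
    """Count CDS rows by annotation source in a Bakta-style TSV (assumed layout)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip header/comment lines
            continue
        cols = line.split("\t")
        # Assumed column order: Sequence Id, Type, Start, Stop, Strand,
        # Locus Tag, Gene, Product, DbXrefs
        if len(cols) < 9 or cols[1].lower() != "cds":
            continue
        product = cols[7]
        dbxrefs = [x.strip() for x in cols[8].split(",") if x.strip()]
        counts["cds"] += 1
        if product.lower() == "hypothetical protein":
            counts["hypothetical"] += 1
        if any(x.startswith("UniRef:UniRef90_") for x in dbxrefs):
            counts["uniref90"] += 1
        if any(x.startswith("KEGG:K") for x in dbxrefs):
            counts["kegg"] += 1
    return counts


# Made-up rows for illustration only.
example = [
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs",
    "contig_1\tcds\t1\t300\t+\tTAG_1\t\thypothetical protein\t",
    "contig_1\tcds\t400\t900\t-\tTAG_2\tabcA\tABC transporter\tUniRef:UniRef90_X1, KEGG:K00001",
    "contig_1\ttRNA\t950\t1020\t+\tTAG_3\t\ttRNA-Ala\t",
]
print(dict(summarize_cds(example)))
```

For a real run, pass an open file handle instead of the list, e.g. `summarize_cds(open(path))` with `path` pointing at the TSV in the output directory.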

cpauvert · 2024-07-15T08:46:54Z

Just wanted to add that not all the KEGG annotations are included in the database (see the line below), which could also explain the discrepancy between the results and your expectations @tvtv195

bakta/db-scripts/annotate-kofams.py, line 58 in c93c3f1:

    if(knum != '-' and hmm['f_measure'] > 0.77):  # discard the lower 10th percentile

Best,
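
To illustrate what such a cutoff does, here is a minimal sketch. It is not the actual `annotate-kofams.py` code. It assumes a list of profile records with a KO identifier (`knum`) and an `f_measure` value, mirroring the names in the quoted line; the helper name and the sample records are hypothetical.

```python
F_MEASURE_CUTOFF = 0.77  # value taken from the quoted line


def keep_profile(hmm):
    """Return True if a profile passes the filter shown in the quoted line."""
    return hmm["knum"] != "-" and hmm["f_measure"] > F_MEASURE_CUTOFF


# Made-up records for illustration only.
profiles = [
    {"knum": "K00001", "f_measure": 0.95},
    {"knum": "K00002", "f_measure": 0.60},  # dropped: below the cutoff
    {"knum": "-", "f_measure": 0.99},       # dropped: no KO identifier
    {"knum": "K00003", "f_measure": 0.77},  # dropped: the comparison is strict
]

kept = [p["knum"] for p in profiles if keep_profile(p)]
print(kept)
```

Under these assumptions, a gene whose only matching profile falls at or below the cutoff would receive no KO cross-reference from this step.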

tvtv195 · 2024-07-17T06:38:12Z

Thanks, that's good to know. We have found a workaround: we run the gene calling through bakta and then submit the output to KEGG BlastKOALA (KEGG Orthology And Links Annotation, https://www.kegg.jp/blastkoala/).
This way we went from, e.g., 1 KEGG annotation (bakta) to >100 (BlastKOALA). Our downstream checks (gene neighborhood and operon analysis, blast, phylogeny, etc.) confirmed the BlastKOALA annotations.
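
For a side-by-side comparison, the two result sets can be reduced to per-gene KO assignments. This is a rough sketch under stated assumptions: the BlastKOALA download is taken to be a two-column, tab-separated list of gene identifier and K number (second column empty when nothing was assigned), and the Bakta assignments are taken to be already collected into a dictionary keyed by the same identifiers. All function names and sample values are hypothetical.

```python
def parse_ko_list(lines):
    """Map gene id -> K number from two-column, tab-separated lines."""
    assignments = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Keep only genes that actually received a K number.
        if len(parts) >= 2 and parts[1].strip():
            assignments[parts[0]] = parts[1].strip()
    return assignments


def compare(bakta_ko, koala_ko):
    """Summarise agreement between two gene -> KO mappings."""
    shared = bakta_ko.keys() & koala_ko.keys()
    return {
        "bakta_only": len(bakta_ko.keys() - koala_ko.keys()),
        "koala_only": len(koala_ko.keys() - bakta_ko.keys()),
        "same_ko": sum(bakta_ko[g] == koala_ko[g] for g in shared),
        "different_ko": sum(bakta_ko[g] != koala_ko[g] for g in shared),
    }


# Made-up values for illustration only.
koala_lines = ["TAG_1\tK00001", "TAG_2\t", "TAG_3\tK00003", "TAG_4"]
bakta_ko = {"TAG_1": "K00001"}
result = compare(bakta_ko, parse_ko_list(koala_lines))
print(result)
```

The `koala_only` count is the number of genes annotated by BlastKOALA but not by bakta, which is the gap discussed in this thread.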

oschwengers · 2024-07-17T08:18:19Z

Thanks @cpauvert for bringing this up - 100% correct!

The Bakta database integrates annotation information from various external databases. We try to rank these so that larger, more comprehensive DBs come first and smaller, often more specific, higher-quality DBs come later, so that the potentially more specific, higher-quality information from the smaller DBs is exploited. However, it is far from trivial to formalize these rankings. Hence, I decided to take into account only the upper 90% of all KOfam HMM annotations (ranked by F-measure), as @cpauvert mentioned.

I hope that, over time, more and more additional information, such as dbxrefs, ECs, etc., will make it into the Bakta annotation database. Needless to say, we're always open to and thankful for any ideas and feedback on how to improve these things.

tvtv195 added the bug ("Something isn't working") label Apr 25, 2024

tvtv195 changed the title ~~KEGG annotations - much lower number than expected~~ KEGG annotations - the number is much lower number than expected Apr 26, 2024

oschwengers added the question ("Further information is requested") label and removed the bug ("Something isn't working") label Apr 28, 2024

oschwengers closed this as completed Aug 19, 2024
