KEGG annotations - the number is much lower than expected #285
Comments
Hi Chris, and thanks for reaching out. Based on the command line above, I guess you're working on a MAG. Depending on the species, it could simply be that few or no genes are similar to those stored in KEGG. Could you provide some information about how many UniRef90-annotated genes and how many ...
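For anyone wanting to gather the counts asked for here, a minimal sketch follows. It assumes the Bakta TSV output has comment lines starting with `#`, the feature type in the second column, and comma-separated DbXrefs in the ninth column, using prefixes such as `KEGG:` and `UniRef:UniRef90_`; please check this against your own output file first. The function name and the sample rows are made up for illustration.

```python
def count_dbxref_sources(tsv_lines):
    """Count CDS rows and how many of them carry a KEGG or UniRef90 DbXref.

    Assumed layout (verify against your own file): tab-separated columns,
    feature type in column 2, comma-separated DbXrefs in column 9.
    """
    counts = {"cds": 0, "kegg": 0, "uniref90": 0}
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 2 or fields[1].lower() != "cds":
            continue  # only protein-coding features are counted
        counts["cds"] += 1
        dbxrefs = fields[8] if len(fields) > 8 else ""
        xrefs = [x.strip() for x in dbxrefs.split(",") if x.strip()]
        # Count each CDS at most once per source, even with several xrefs.
        if any(x.startswith("KEGG:") for x in xrefs):
            counts["kegg"] += 1
        if any(x.startswith("UniRef:UniRef90_") for x in xrefs):
            counts["uniref90"] += 1
    return counts


# Invented sample rows, only to show the assumed layout.
SAMPLE = [
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs",
    "contig_1\tcds\t1\t900\t+\tX_00001\tgeneA\tprotein A\tKEGG:K00001, UniRef:UniRef90_AAAAAA",
    "contig_1\tcds\t1000\t1500\t-\tX_00002\t\thypothetical protein\t",
    "contig_1\ttRNA\t1600\t1675\t+\tX_00003\t\ttRNA-Ala\tSO:0000254",
    "contig_1\tcds\t1700\t2300\t+\tX_00004\t\tprotein B\tUniRef:UniRef90_BBBBBB, UniRef:UniRef50_BBBBBB",
]

print(count_dbxref_sources(SAMPLE))
```

Because the function only iterates over lines, an open file handle on a real TSV can be passed in place of `SAMPLE`. A genome with many UniRef90 hits but almost no KEGG hits would point towards the database content rather than the gene calling.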
Just wanted to add that not all the KEGG annotations are included in the database (see the line referenced below), which could also explain the discrepancy between the results and your expectations, @tvtv195.
bakta/db-scripts/annotate-kofams.py, line 58 in c93c3f1
Best,
Thanks, that's good to know. We have found a workaround: we run the gene calling through bakta and then submit the output to KEGG's BlastKOALA (KEGG Orthology And Links Annotation, https://www.kegg.jp/blastkoala/).
Thanks @cpauvert for bringing this up - 100% correct! The Bakta database integrates annotation information from various external databases. We try to rank these so that larger, more comprehensive DBs come first and smaller, often more specific and higher-quality DBs come later. This way, the more specific, higher-quality information from the smaller DBs can take precedence where it is available. However, it is far from trivial to formalize these rankings. Hence, I decided to take into account only the upper 90% of all BlastKOALA annotations, as @cpauvert mentioned. I hope that, over time, more and more additional information, like dbxrefs, ECs, etc., will make it into the Bakta annotation database. Needless to say, we're always open to and thankful for any ideas and feedback on how to improve these things.
Dear bakta team,
We ran bakta on several bacterial genomes (>50% estimated completeness and <10% contamination) to obtain KEGG annotations.
Bakta ran just fine, without any issues, e.g.:
bakta --db /gpfs/gpfs1/scratch/cb761220/databases/bakta_db_2024/db \
  -o /scratch/cb761203/02.analysis/13.module13/01.bakta/ -v \
  /scratch/cb761203/02.analysis/12.module12/bins_hq/sample.bin.35.fa
However, we only got between 0 and 17 KEGG annotations (KO IDs) per genome.
For example, a very small bacterial genome of 531,276 bp had 517 CDS but not even a single KEGG annotation.
Is there something wrong with the way bakta assigns KEGG annotations?
Best,
Chris