Compare gunc profiling with dbs progenomes and GTDB #16
Hello Silas,
The differences are really only in the number and quality of genomes.
That must be due to the larger number of reference genomes (and thus more information) available in GTDB, which helps a genome reach a CSS above the cutoff, i.e. gives more confidence in chimerism.
What exactly do you mean by "evaluate"? GUNC reports scores based on taxonomic distribution patterns at each level. If you mean the level at which the maximum CSS occurs, then the answer is that GTDB has more genomes at all levels, which increases CSS values regardless of the level. As a result, you also get more genomes labelled as chimeric with the maximum CSS at the kingdom level.
Hope the above helps. Otherwise, I would be happy to give a more detailed answer if you clarify what you meant by "evaluate".
We believe that progenomes is cleaner, as we applied stricter filtering to it, while we took GTDB as is. While GTDB has more genomes, it potentially also contains a higher proportion of chimeric genomes, the effect of which is difficult to estimate. Hope this is helpful!
Hi Silas! I agree with everything that @Askarbek-orakov wrote. I can just add two more points. First, we did a rather thorough benchmark comparing GTDB and NCBI taxonomies for the original study. In the paper you'll find the results in Figures S4 & S5. Askarbek tried several things:
As you will see from those tests, the performance on the various simulated genomes was really comparable. The biggest drawback of using GTDB is efficiency: the db is larger, and each run takes longer and is more resource-hungry, while the results are not noticeably better. Askarbek has outlined possible reasons for that above.
Second, I can maybe comment on the 'kingdom' level issue you observed. I don't know what type of data you're processing, but the default GUNC db is certainly biased against several archaeal and CPR phyla, which are much better represented in the GTDB. So if you expect loads of such genomes, GUNC with the default db would give you cautious results (low reference-representation scores, basically signifying that these are outside of GUNC's comfort zone), whereas GTDB may resolve them better. On the flip side, the taxonomy in those particular parts of the tree also tends to be shakier, so I'd expect more false-positive chimerism calls (inflated CSS scores with inflated confidence). But we haven't systematically explored this so far.
Thank you for your answers. If I understood it correctly the As you say, there is the potential for contamination if you took the GTDB as is, especially if the CSS is calculated at the genus level. But I don't understand why many genomes could only be evaluated at the kingdom level using GTDB.
Hi Silas! The CSS is calculated at every taxonomic level, accessible via the
Hello,
I ran GUNC on a collection of MAGs and wanted to find out what the difference is between the two dbs, progenomes and gtdb. The first thing I saw is that more MAGs fail when using GTDB. I also checked and found that more genomes are evaluated at the genus level, which makes sense, as I expect GTDB to have many more genus clusters to evaluate on. But there are also more genomes evaluated at the kingdom level, which doesn't make sense to me. Do you have any explanation? Is the taxonomic placement more complicated?
Which do you generally recommend, gtdb or progenomes?
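For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of the tabulation described above: per database, count how many genomes fail and at which taxonomic level each genome is reported. It assumes the GUNC output is a tab-separated table with columns named `taxonomic_level` and `pass.GUNC`; please check these names against your own output files. The function name `summarize` and the toy tables are made up for illustration and are not real results.

```python
import csv
import io
from collections import Counter


def summarize(tsv_text, level_col="taxonomic_level", pass_col="pass.GUNC"):
    """Return (Counter of reported levels, number of genomes not passing).

    Column names are assumptions about the GUNC output table.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    levels = Counter()
    failed = 0
    for row in reader:
        levels[row[level_col]] += 1
        # Treat anything other than "True" as a failed genome.
        if row[pass_col].strip().lower() != "true":
            failed += 1
    return levels, failed


# Invented toy tables standing in for the two GUNC runs.
progenomes_tsv = (
    "genome\ttaxonomic_level\tpass.GUNC\n"
    "mag1\tgenus\tTrue\n"
    "mag2\tphylum\tTrue\n"
    "mag3\tkingdom\tFalse\n"
)
gtdb_tsv = (
    "genome\ttaxonomic_level\tpass.GUNC\n"
    "mag1\tgenus\tFalse\n"
    "mag2\tgenus\tTrue\n"
    "mag3\tkingdom\tFalse\n"
)

for name, text in [("progenomes", progenomes_tsv), ("gtdb", gtdb_tsv)]:
    levels, failed = summarize(text)
    print(name, dict(levels), "failed:", failed)
```

With real data, the table text would be read from each run's output file (for example with `open(path).read()`) instead of the inline strings.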