Compare gunc profiling with dbs progenomes and GTDB #16

Closed
SilasK opened this issue Oct 22, 2021 · 4 comments

Comments

@SilasK

SilasK commented Oct 22, 2021

Hello,

I ran GUNC on a collection of MAGs and wanted to find out what the difference is between the two databases, progenomes and GTDB. First, I saw that more MAGs fail when using GTDB. I also saw that more genomes are evaluated at the genus level, which makes sense, as I expect GTDB to have many more genus-level clusters to evaluate on. But there are also more genomes evaluated at the kingdom level, which doesn't make sense to me.

Do you have any explanation? Is the taxonomic placement more complicated?

Which do you generally recommend, GTDB or progenomes?

@Askarbek-orakov

Hello Silas,

> I run gunc on a collection of MAGs and wanted to find out what is the difference between the two dbs progenomes and gtdb.

The differences are really only in the number and quality of genomes.

> What I saw is first that more MAGs fail when using GTDB.

That is most likely because GTDB contains more reference genomes and therefore more information, which helps the CSS rise above the cutoff, i.e. gives more confidence in calling chimerism.

> I also checked that more genomes are evaluated at the genus level. Which makes sense as I expect GTDB to have much more genera clusters to evaluate on. But then there are also more genomes evaluated at the kingdom level. Which doesn't make sense to me?

What exactly do you mean by "evaluate"? GUNC reports scores based on taxonomic distribution patterns at each level. If you mean the level at which the maximum CSS value occurs, then the answer is that GTDB has more genomes at all levels, which increases CSS values regardless of the level. As a result, you also get more genomes labelled as chimeric with the maximum CSS at the kingdom level.
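
To make "the level at which the maximum CSS value occurs" concrete, here is a minimal sketch. It is not GUNC's own code: the function name `max_css_level` is hypothetical, the per-level scores are made up, and the 0.45 cutoff is an assumption that should be checked against the GUNC documentation.

```python
# Taxonomic levels from broadest to most specific (assumed ordering).
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def max_css_level(css_by_level, cutoff=0.45):
    """Return (level, score, flagged) for the level with the highest CSS.

    `cutoff` is an assumed threshold, not read from GUNC itself.
    Ties go to the broader level because LEVELS is scanned in order.
    """
    level = max(LEVELS, key=lambda lv: css_by_level[lv])
    score = css_by_level[level]
    return level, score, score > cutoff


# Made-up per-level scores for one hypothetical genome.
example = {
    "kingdom": 0.62, "phylum": 0.40, "class": 0.35, "order": 0.30,
    "family": 0.22, "genus": 0.15, "species": 0.05,
}
print(max_css_level(example))  # ('kingdom', 0.62, True)
```

With a larger reference database, scores at every level can shift upwards, so the maximum may land at any level, including kingdom.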

> Do you have any explanation? Is the taxonomic placement more complicated?

Hope the above helps. Otherwise, I would be happy to give a more detailed answer if you clarify what you meant by "evaluate".

> What do you generally recommend gtdb or progenomes?

We believe that progenomes is cleaner, as we applied harsher filtering to it while we took GTDB as is. While GTDB has more genomes, it also potentially contains a higher proportion of chimeric genomes, the effect of which is difficult to estimate.

Hope this is helpful!

@defleury

Hi Silas!

I agree with everything that @Askarbek-orakov wrote. I can just add two more points.

First, we did a rather thorough benchmark comparing GTDB and NCBI taxonomies for the original study. In the paper you'll find the results in Figures S4 & S5. Askarbek tried several things:

  • GUNC db sequences, but with GTDB taxonomy (Fig S4)
  • GTDB sequences with GTDB taxonomy (Fig S5)

As you will see from those tests, the performance on the various simulated genomes was really comparable. The biggest drawback of using GTDB is efficiency: the db is larger, and each run takes longer and is more resource-hungry, while the results are not noticeably better. Askarbek has outlined possible reasons for that above.

Second, I can maybe comment on the 'kingdom' level issue you observed. I don't know what type of data you're processing, but the default GUNC db is certainly biased against several archaeal and CPR phyla which are much better represented in the GTDB. So if you expect loads of such genomes, GUNC with default db would give you cautious results (low reference-representation scores, basically signifying that these are outside of GUNC's comfort zone), whereas GTDB may resolve them better. On the flipside, the taxonomy in those particular parts of the tree also tends to be more shaky, so I'd expect more false positive chimerism calls (inflated CSS scores with inflated confidence). But we haven't systematically explored this so far.

@SilasK
Author

SilasK commented Oct 25, 2021

Thank you for your answers.

If I understood correctly, the taxonomic_level column indicates the taxonomic level at which the CSS was calculated, right? That's what I mean by "evaluated". The CSS at the genus level would therefore be more precise/informative than at the kingdom level. Am I right?

As you say, there is the potential for contamination if you took GTDB as is, especially if the CSS is calculated at the genus level. But I don't understand why many genomes could only be evaluated at the kingdom level using GTDB.

@defleury

Hi Silas!

The CSS is calculated at every taxonomic level, accessible via the --detailed_output flag. In the default output, the tax level you see is the one at which CSS went above the threshold. Could you paste an example output (ideally using the --detailed_output flag) where you are surprised by a kingdom-level chimerism call?
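
One way to find such examples is to line up the two runs by genome and list the MAGs that are called at the kingdom level with GTDB but not with progenomes. The sketch below uses only the standard library. The column names other than `taxonomic_level` are assumed to match GUNC's default TSV output, the helper names are hypothetical, and the three MAGs and their values are made up; the inline tables stand in for reading the real output files.

```python
import csv
import io

# Column names assumed to match GUNC's default output table.
COLUMNS = ["genome", "taxonomic_level", "clade_separation_score", "pass.GUNC"]


def to_tsv(rows):
    """Build a small TSV string; a stand-in for a real GUNC output file."""
    return "\n".join("\t".join(r) for r in [COLUMNS] + rows)


def load(tsv_text):
    """Index the rows of a GUNC-style table by genome name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["genome"]: row for row in reader}


def kingdom_only_in_gtdb(progenomes, gtdb):
    """List genomes called at kingdom level with GTDB but not with progenomes."""
    hits = []
    for genome, g_row in gtdb.items():
        p_row = progenomes.get(genome)
        if p_row is None:
            continue  # genome missing from one run, nothing to compare
        if (g_row["taxonomic_level"] == "kingdom"
                and p_row["taxonomic_level"] != "kingdom"):
            hits.append((genome, p_row["taxonomic_level"],
                         g_row["clade_separation_score"]))
    return sorted(hits)


# Made-up values for three hypothetical MAGs.
progenomes_run = load(to_tsv([
    ["MAG1", "genus", "0.10", "True"],
    ["MAG2", "phylum", "0.30", "True"],
    ["MAG3", "kingdom", "0.70", "False"],
]))
gtdb_run = load(to_tsv([
    ["MAG1", "genus", "0.20", "True"],
    ["MAG2", "kingdom", "0.55", "False"],
    ["MAG3", "kingdom", "0.80", "False"],
]))

print(kingdom_only_in_gtdb(progenomes_run, gtdb_run))
```

The genomes this returns are the ones worth re-running with --detailed_output and pasting here.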

@fullama fullama closed this as completed May 31, 2023