Threshold CSS values when dealing with genomes in reference database #38

Closed
mgabriell1 opened this issue Jun 1, 2023 · 4 comments

Comments

@mgabriell1

mgabriell1 commented Jun 1, 2023

Hi,
I have a question about the CSS threshold and how it relates to a genome's RRS value and to whether the genome is included in the GUNC reference database.
I'm checking the quality of a set of genomes, and in some cases I obtain CSS values between 0.48 and 0.50, which, given the default CSS threshold, causes them to be flagged as contaminated.
Several of these genomes have RRS values above 0.5 (in some cases up to 0.97). In fact, some of them were downloaded from RefSeq, while others have a GTDB-Tk classification that is included in the reference database I'm using (ProGenomes). Here is the table with the results:

| Genome | CheckM2 completeness | CheckM2 contamination | CSS | GUNC contamination | GUNC effective surplus clades | GUNC mean hit identity | RRS |
|---|---|---|---|---|---|---|---|
| GCA_947444635.1 | 60.28 | 0.08 | 0.07 | 0.05 | 0.1 | 0.72 | 0.6 |
| GCA_947470005.1 | 75.45 | 0.15 | 0.27 | 0.07 | 0.16 | 0.67 | 0.55 |
| GCA_027437095.1 | 88.44 | 2.67 | 0.33 | 0.03 | 0.06 | 0.69 | 0.54 |
| GCF_001467945.1 | 99.94 | 0.4 | 0.44 | 0.05 | 0.11 | 0.99 | 0.97 |
| GCF_900461585.1 | 99.96 | 0.4 | 0.44 | 0.05 | 0.11 | 0.99 | 0.97 |
| GCF_900639925.1 | 99.96 | 0.37 | 0.47 | 0.05 | 0.11 | 0.99 | 0.97 |
| GCA_903907265.1 | 52.32 | 0.03 | 0.47 | 0.05 | 0.11 | 0.62 | 0.52 |
| GCA_003507515.1 | 76.3 | 0.89 | 0.48 | 0.03 | 0.05 | 0.68 | 0.58 |
| GCF_001736145.1 | 99.93 | 1.78 | 0.48 | 0.05 | 0.11 | 0.99 | 0.97 |
| GCA_002352055.1 | 99.95 | 4.17 | 0.52 | 0.05 | 0.1 | 0.72 | 0.64 |
| GCA_027358185.1 | 94.09 | 1.29 | 0.59 | 0.03 | 0.06 | 0.67 | 0.57 |
| GCF_001468135.1 | 100 | 2.54 | 0.6 | 0.03 | 0.07 | 0.99 | 0.96 |
| GCA_903885895.1 | 94.68 | 5.11 | 0.63 | 0.1 | 0.23 | 0.61 | 0.44 |
| GCA_945865355.1 | 86.73 | 3.6 | 0.67 | 0.02 | 0.05 | 0.69 | 0.61 |
| GCF_900452545.1 | 100 | 1.29 | 0.7 | 0.04 | 0.09 | 0.99 | 0.97 |
| GCF_900639855.1 | 100 | 1.28 | 0.7 | 0.04 | 0.09 | 0.99 | 0.97 |
| GCA_947474165.1 | 97.55 | 0.22 | 0.72 | 0.04 | 0.08 | 0.67 | 0.56 |
| GCF_001467695.1 | 100 | 1.29 | 0.73 | 0.04 | 0.09 | 0.99 | 0.97 |
| GCF_900639975.1 | 100 | 3.29 | 0.74 | 0.03 | 0.07 | 0.99 | 0.96 |
| GCA_903842685.1 | 73.45 | 4.49 | 0.82 | 0.12 | 0.28 | 0.63 | 0.45 |
| GCA_947485955.1 | 91.54 | 4.96 | 0.86 | 0.03 | 0.05 | 0.67 | 0.52 |
| GCA_903901775.1 | 67.23 | 1.56 | 1 | 0.04 | 0.08 | 0.69 | 0.53 |

From what I understand from Figure S12 of the manuscript, the optimal CSS threshold for genomes that are included in the reference database is slightly higher than for out-of-reference genomes (with a peak around 0.475). For this reason, I was wondering whether it would make sense not to discard genomes with a CSS between 0.45 and 0.48-0.50.
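To make the question concrete, here is a minimal Python sketch of the triage I have in mind. It is only an illustration, not part of GUNC: the 0.45 default cutoff is the value implied above, while the relaxed cutoff of 0.50, the minimum RRS of 0.5, and the `triage` function are hypothetical choices of mine. The five rows are copied from the table.

```python
# Hypothetical triage of GUNC results around the CSS cutoff.
# DEFAULT_CSS_CUTOFF is the default implied in this thread; the other
# two constants are illustrative assumptions, not GUNC settings.
DEFAULT_CSS_CUTOFF = 0.45
RELAXED_CSS_CUTOFF = 0.50
MIN_RRS_FOR_REVIEW = 0.5


def triage(css: float, rrs: float) -> str:
    """Return 'pass', 'review' or 'flagged' for one genome."""
    if css <= DEFAULT_CSS_CUTOFF:
        return "pass"
    # Borderline CSS with good reference representation: inspect by hand
    # instead of discarding automatically.
    if css <= RELAXED_CSS_CUTOFF and rrs > MIN_RRS_FOR_REVIEW:
        return "review"
    return "flagged"


# (genome, CSS, RRS) rows taken from the table above.
rows = [
    ("GCF_001467945.1", 0.44, 0.97),
    ("GCA_903907265.1", 0.47, 0.52),
    ("GCF_001736145.1", 0.48, 0.97),
    ("GCA_002352055.1", 0.52, 0.64),
    ("GCA_903885895.1", 0.63, 0.44),
]

results = {genome: triage(css, rrs) for genome, css, rrs in rows}

for genome, label in results.items():
    print(f"{genome}\t{label}")
```

Genomes labelled `review` would be the ones to inspect manually rather than discard outright.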
Thanks for the support and the great tool!

@defleury

defleury commented Jun 5, 2023

Dear Marco,

since these are only a handful of genomes I'd suggest looking at the (interactive) output of gunc plot and then deciding based on that. This will give you an idea where the signal for the CSS comes from – which contigs are labelled as originating from a different source, and at which taxonomic level the conflict is introduced. In particular, this allows you to further follow up the "offending" contigs and see if there is a systematic bias.

I see that the genomes in question are all in the Legionella group. It might be that the contaminant pattern between them is systematic, e.g. if there is a mislabelled Legionella genome elsewhere in the database or another genome incorrectly labelled as Legionella. This would become visible based on the plots. I know very little about Legionella biology, but cryptic extrachromosomal elements that were not detected by standard plasmid removal tools could be an alternative explanation.

In general, I would always value expert curation (in this case, your assessment of the genomes since you have worked with them in the past I suppose) over any tool's output.

@mgabriell1
Author

Dear Sebastian,
Thanks for the suggestions. I initially used Anvio to spot potential sources of contamination, but I will also give gunc plot a try on the manually curated genomes.

@mgabriell1
Author

Just to follow up briefly on this:
I noticed that all the RefSeq genomes I checked that showed a high CSS value reached it at the genus level, and the issue was that, besides Legionella, the genus Fluoribacter was also detected. However, Fluoribacter is a synonym for Legionella (https://lpsn.dsmz.de/genus/fluoribacter), so in my case the high CSS values were false positives.
On the one hand, this highlights the importance of using gunc plot to better understand the data; on the other, it suggests that synonyms may be worth addressing in future db versions.
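As a rough illustration of the synonym problem, the Python sketch below collapses genus synonyms before counting the distinct genera assigned to a genome's contigs. It is an assumption-laden toy example, not GUNC's internal logic: the `GENUS_SYNONYMS` map, the helper names, and the contig labels are all hypothetical.

```python
from collections import Counter

# Hypothetical synonym map: each key is rewritten to its accepted genus name.
GENUS_SYNONYMS = {"Fluoribacter": "Legionella"}


def canonical_genus(name: str) -> str:
    """Return the accepted genus name, or the input if no synonym is known."""
    return GENUS_SYNONYMS.get(name, name)


def genus_counts(labels: list[str]) -> Counter:
    """Count contigs per genus after collapsing synonyms."""
    return Counter(canonical_genus(label) for label in labels)


# Made-up genus-level labels for the contigs of a single genome.
contig_genera = ["Legionella", "Fluoribacter", "Legionella", "Fluoribacter"]

raw = Counter(contig_genera)          # two apparent genera
merged = genus_counts(contig_genera)  # one genus once synonyms are collapsed

print(dict(raw))
print(dict(merged))
```

With the raw labels the genome appears to mix two genera, whereas after collapsing synonyms every contig falls under Legionella, which is the kind of false positive described above.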
Thanks again for the great tool

@defleury
Copy link

defleury commented Jun 9, 2023

Dear Marco,

thanks for the update! Glad you could resolve the issue this way.

And thanks in particular for pointing out the Fluoribacter issue; we mostly inherited taxonomy from NCBI via proGenomes2, although we already did some extensive curation for the first db release. We are currently finalising the release of a new, larger db that will use GTDB taxonomy under the hood – I just checked and GTDB indeed lists Fluoribacter genomes under Legionella, so this particular problem should hopefully not occur any more in the future.
