Threshold CSS values when dealing with genomes in reference database #38
Comments
Dear Marco, since these are only a handful of genomes I'd suggest looking at the (interactive) output of I see that the genomes in question are all in the Legionella group. It might be that the contaminant pattern between them is systematic, e.g. if there is a mislabelled Legionella genome elsewhere in the database or another genome incorrectly labelled as Legionella. This would become visible based on the plots. I know very little about Legionella biology, but cryptic extrachromosomal elements that were not detected by standard plasmid removal tools could be an alternative explanation. In general, I would always value expert curation (in this case, your assessment of the genomes since you have worked with them in the past I suppose) over any tool's output.
Dear Sebastian,
Just to follow up briefly on this:
Dear Marco, thanks for the update! Glad you could resolve the issue this way. And thanks in particular for pointing out the Fluoribacter issue; we mostly inherited taxonomy from NCBI via proGenomes2, although we already did some extensive curation for the first db release. We are currently finalising the release of a new, larger db that will use GTDB taxonomy under the hood. I just checked and GTDB indeed lists Fluoribacter genomes under Legionella, so this particular problem should hopefully not occur any more in the future.
Hi,
I have a question regarding the CSS threshold and how it relates to a genome's RRS and to whether the genome is included in the GUNC database.
I'm checking the quality of a set of genomes, and in some cases I obtain CSS values between 0.48 and 0.50, which are flagged as contaminated under the default CSS threshold.
In some cases these genomes present RRS values above 0.5 (in some cases up to 0.97). In fact, some of these genomes have been downloaded from RefSeq while others have a GTDB-Tk classification included in the reference database that I'm using (ProGenomes). Here is the table with the results:
From what I understand from Figure S12 of the manuscript, the optimal CSS threshold for genomes that are included in the reference database is slightly higher than for out-of-reference genomes (with a peak around 0.475). For this reason I was wondering whether it would make sense not to discard genomes with CSS between 0.45 and 0.48-0.50.
Thanks for the support and the great tool!
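To make the question concrete, here is a minimal sketch of how the borderline window described above could be separated from clear passes and clear failures. It is not part of GUNC: the function name `triage`, the 0.50 upper bound, and the 0.5 RRS cutoff are illustrative assumptions taken from the numbers in this post, and the column names `genome`, `clade_separation_score` and `reference_representation_score` are assumed to match the GUNC output table. The example rows are made-up values, not my actual results.

```python
def triage(rows, css_cutoff=0.45, css_upper=0.50, rrs_min=0.5):
    """Label each genome as 'pass', 'borderline' or 'flagged'.

    rows: iterable of dicts, e.g. from csv.DictReader on a GUNC table
          (column names are assumed, see the note above).
    css_cutoff: CSS threshold above which a genome is flagged by default.
    css_upper / rrs_min: hypothetical bounds of the "borderline" window
          that would be kept for manual inspection instead of discarded.
    """
    labels = {}
    for row in rows:
        css = float(row["clade_separation_score"])
        rrs = float(row["reference_representation_score"])
        if css <= css_cutoff:
            labels[row["genome"]] = "pass"
        elif css <= css_upper and rrs > rrs_min:
            # Just above the cutoff but well represented in the reference
            labels[row["genome"]] = "borderline"
        else:
            labels[row["genome"]] = "flagged"
    return labels


# Made-up example rows, for illustration only
example = [
    {"genome": "genome_A", "clade_separation_score": "0.49",
     "reference_representation_score": "0.97"},
    {"genome": "genome_B", "clade_separation_score": "0.20",
     "reference_representation_score": "0.90"},
    {"genome": "genome_C", "clade_separation_score": "0.80",
     "reference_representation_score": "0.60"},
]
print(triage(example))
```

Genomes labelled "borderline" would then be the ones to check by hand rather than drop automatically.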