Use skani for genome clustering

@SilasK SilasK released this 15 Jun 13:11
da97853

Skani

The tool skani claims to be better and faster than the combination of Mash + FastANI used by dRep.
I implemented skani for species clustering.
We now do the species clustering in the atlas run binning step.
So you get information about the number of dereplicated species in the binning report. This allows you to run different binners before choosing the one to use for the genome annotation.
File storage was also improved: all important files are now in Binning/{binner}/

My custom species clustering does the following steps:

  1. Pre-cluster genomes with single-linkage at 92.5% ANI.
  2. Re-calibrate checkm2 results.
  • If a minority of genomes in a pre-cluster use a different translation table, they are removed.
  • If some genomes of a pre-cluster don't use the specific completeness model, we re-calibrate completeness to the minimum value.
    This ensures that a poor genome evaluated with the general model is not preferred over a better genome evaluated with the specific model.
    See also https://silask.github.io/post/better_genomes/ Section 2.
  • Drop genomes that no longer meet the filter criteria after re-calibration.
  3. Cluster genomes with an ANI threshold (default 95%).
  4. Select the best genome as representative based on the quality score: Completeness - 5 x Contamination.
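
To make the pre-clustering and representative-selection steps more concrete, here is a minimal Python sketch. It is not the atlas implementation: the function names, the genome labels, and the ANI and quality values are all made up for illustration. It assumes pairwise ANI values are already available (for example from skani) and treats single-linkage clustering as finding connected components of the graph whose edges are genome pairs at or above the threshold.

```python
from collections import defaultdict


def single_linkage_clusters(genomes, ani_pairs, threshold):
    """Single-linkage clustering: connected components of the graph whose
    edges are genome pairs with ANI >= threshold (union-find)."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the root, compressing the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), ani in ani_pairs.items():
        if ani >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for g in genomes:
        clusters[find(g)].append(g)
    return sorted(sorted(members) for members in clusters.values())


def quality_score(completeness, contamination):
    """Quality score from the release notes: Completeness - 5 x Contamination."""
    return completeness - 5 * contamination


def pick_representative(cluster, quality):
    """Return the genome in the cluster with the highest quality score."""
    return max(cluster, key=lambda g: quality_score(*quality[g]))


# Hypothetical example data (ANI in percent; quality = (completeness, contamination)).
genomes = ["A", "B", "C", "D"]
ani_pairs = {("A", "B"): 96.0, ("B", "C"): 93.0, ("A", "C"): 91.0, ("C", "D"): 80.0}
quality = {"A": (95.0, 2.0), "B": (90.0, 0.5), "C": (99.0, 4.0), "D": (80.0, 1.0)}

pre_clusters = single_linkage_clusters(genomes, ani_pairs, threshold=92.5)
print(pre_clusters)  # [['A', 'B', 'C'], ['D']]
print(pick_representative(pre_clusters[0], quality))  # B
```

In this toy example, A and C end up in the same pre-cluster even though their direct ANI is below 92.5%, because both are linked through B. Genome B is selected as representative even though C has the highest completeness, since contamination is weighted five times.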

Full Changelog: v2.16.3...v2.17.0