## Checking completeness and contamination using CheckM

Now that we have made bins with MetaBAT, we can check them for completeness and contamination (quality). For this we will use CheckM, a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage (also called marker or signature genes). See http://ecogenomics.github.io/CheckM/ for more information.

As you can see in the CheckM help pages, CheckM has a workflow (`lineage_wf`) that runs all the steps needed to assess bin quality.

`lineage_wf` (lineage-specific workflow) steps: <br>
- The tree command places genome bins into a reference genome tree.
- The lineage_set command creates a marker file indicating lineage-specific marker sets suitable for evaluating each individual bin with the most appropriate reference set of markers. 
- This marker file is passed to the analyze command to identify marker genes and estimate the completeness and contamination of each genome bin. 
- Finally, the qa command can be used to produce different tables summarizing the quality of each genome bin. <br>
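The four steps above can also be run one at a time. The sketch below is a dry run that only prints the commands. The `run` helper, the `DRY_RUN` switch, and the output directory `data/checkm_steps` are illustrative assumptions and not part of CheckM; the `-x fa -t 4` options mirror the ones used elsewhere in this notebook. Check each command's `-h` output before relying on the argument order.

```shell
# Hypothetical paths, following the directory layout used in this notebook.
BINS=data/bins
OUT=data/checkm_steps

# DRY_RUN=1 (default) only prints each command; DRY_RUN=0 also executes it.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# 1. Place the bins in the reference genome tree (the memory-hungry step).
run checkm tree -x fa -t 4 "$BINS" "$OUT"
# 2. Write the lineage-specific marker sets to a marker file.
run checkm lineage_set "$OUT" "$OUT/lineage.ms"
# 3. Identify marker genes in each bin using that marker file.
run checkm analyze -x fa -t 4 "$OUT/lineage.ms" "$BINS" "$OUT"
# 4. Summarise completeness and contamination per bin.
run checkm qa "$OUT/lineage.ms" "$OUT"
```

Running the steps separately makes it easier to see which one fails or runs out of memory.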


Unfortunately, the `tree` step of this workflow is very memory-intensive (about 32 GB of RAM!), so we will cheat a bit. **Instead, we will use the `checkm taxonomy_wf domain Bacteria` command.** This way we don't load the full tree to find the most appropriate marker set for each bin; we assume that all bins are bacterial (a reasonable assumption in this case) and look no deeper than that.
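According to its help text, `taxonomy_wf` runs `taxon_set`, `analyze`, and `qa`. As a rough sketch, the same three steps could be run by hand as shown below. This is again a dry run that only prints the commands; the `run` helper, the `DRY_RUN` switch, and the output directory `data/checkm_taxonomy_steps` are illustrative assumptions, not part of CheckM.

```shell
# Hypothetical paths, following the directory layout used in this notebook.
BINS=data/bins
OUT=data/checkm_taxonomy_steps

# DRY_RUN=1 (default) only prints each command; DRY_RUN=0 also executes it.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

run mkdir -p "$OUT"
# 1. Build one marker set for the whole domain instead of one per lineage.
run checkm taxon_set domain Bacteria "$OUT/Bacteria.ms"
# 2. Identify those marker genes in every bin.
run checkm analyze -x fa -t 4 "$OUT/Bacteria.ms" "$BINS" "$OUT"
# 3. Summarise completeness and contamination per bin.
run checkm qa "$OUT/Bacteria.ms" "$OUT"
```

Because every bin is scored against the same domain-level marker set, no tree placement is needed.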


In [1]:
checkm taxonomy_wf -h 

usage: checkm taxonomy_wf [-h] [--ali] [--nt] [-g] [--individual_markers]
                          [--skip_adj_correction]
                          [--skip_pseudogene_correction]
                          [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE]
                          [--ignore_thresholds] [-e E_VALUE] [-l LENGTH]
                          [-c COVERAGE_FILE] [-f FILE] [--tab_table]
                          [-x EXTENSION] [-t THREADS] [-q] [--tmpdir TMPDIR]
                          {life,domain,phylum,class,order,family,genus,species}
                          taxon bin_dir output_dir

Runs taxon_set, analyze, qa

positional arguments:
  {life,domain,phylum,class,order,family,genus,species}
                        taxonomic rank
  taxon                 taxon of interest
  bin_dir               directory containing bins (fasta format)
  output_dir            directory to write output files

optional arguments:
  -h, --help            show this help message and exit
  --ali   

In [3]:
checkm taxonomy_wf domain Bacteria data/bins/ data/checkm_taxonomy -x fa -t 4

[2022-03-22 15:58:15] INFO: CheckM v1.1.3
[2022-03-22 15:58:15] INFO: checkm taxonomy_wf domain Bacteria data/bins/ data/checkm_taxonomy -x fa -t 4
[2022-03-22 15:58:15] INFO: [CheckM - taxon_set] Generate taxonomic-specific marker set.
[2022-03-22 15:58:18] INFO: Marker set for Bacteria contains 104 marker genes arranged in 58 sets.
[2022-03-22 15:58:18] INFO: Marker set inferred from 5449 reference genomes.
[2022-03-22 15:58:18] INFO: Marker set written to: data/checkm_taxonomy/Bacteria.ms
[2022-03-22 15:58:18] INFO: { Current stage: 0:00:03.642 || Total: 0:00:03.642 }
[2022-03-22 15:58:18] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2022-03-22 15:58:19] INFO: Identifying marker genes in 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.
[2022-03-22 16:00:30] INFO: Saving HMM info to file.
[2022-03-22 16:00:30] INFO: { Current stage: 0:02:12.072 || Total: 0:02:15.715 }
[2022-03-22 16:00:30] INFO: Parsing HMM hits to marker genes:
    Finished parsin

In [5]:
checkm lineage_wf --help

usage: checkm lineage_wf [-h] [-r] [--ali] [--nt] [-g] [-u UNIQUE] [-m MULTI]
                         [--force_domain] [--no_refinement]
                         [--individual_markers] [--skip_adj_correction]
                         [--skip_pseudogene_correction]
                         [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE]
                         [--ignore_thresholds] [-e E_VALUE] [-l LENGTH]
                         [-f FILE] [--tab_table] [-x EXTENSION] [-t THREADS]
                         [--pplacer_threads PPLACER_THREADS] [-q]
                         [--tmpdir TMPDIR]
                         bin_dir output_dir

Runs tree, lineage_set, analyze, qa

positional arguments:
  bin_dir               directory containing bins (fasta format)
  output_dir            directory to write output files

optional arguments:
  -h, --help            show this help message and exit
  -r, --reduced_tree    use reduced tree (requires <16GB of memory) for determining lineage of each bin

In [8]:
checkm lineage_wf ./data/bins ./data/checkm_lineage --pplacer_threads 1 -t 4 -x fa 

[2022-03-22 16:02:16] INFO: CheckM v1.1.3
[2022-03-22 16:02:16] INFO: checkm lineage_wf ./data/bins ./data/checkm_lineage --pplacer_threads 1 -t 4 -x fa
[2022-03-22 16:02:16] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2022-03-22 16:02:17] INFO: Identifying marker genes in 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.
[2022-03-22 16:03:18] INFO: Saving HMM info to file.
[2022-03-22 16:03:18] INFO: Calculating genome statistics for 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.
[2022-03-22 16:03:18] INFO: Extracting marker genes to align.
[2022-03-22 16:03:18] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.
[2022-03-22 16:03:18] INFO: Extracting 43 HMMs with 4 threads:
    Finished extracting 43 of 43 (100.00%) HMMs.
[2022-03-22 16:03:19] INFO: Aligning 43 marker genes with 4 threads:
    Finished aligning 43 of 43 (100.00%) marker genes.
[2022-03-22 16:03:19] INFO: Reading mark


The CheckM manual may seem somewhat intimidating. Remember, however, that arguments in square brackets are optional (`[optional argument]`), while those without brackets are mandatory.

(Think about and discuss these questions.) <br>

What did you do? <br>
Where is your output? <br>
What does your output look like? <br>
What can you say about the bins with this output? <br>
What can you say about the lineages of the bins? <br>

You can create an extended CheckM table with more information.

1. Read the CheckM manual and find out how to do this.
2. Save the table in tab-delimited format, so you can download it and open it in Excel, Google Sheets, or LibreOffice.
3. Choose what information you find valuable and discard the rest.
4. Add the mean depth +/- SEM (standard error of the mean) of each bin, per sample type.
   - Sample types are E (Leaf) and P (Plant)
5. Congratulations, you have your first table for a manuscript or thesis about your metagenome analysis!
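For step 4, the SEM is the sample standard deviation divided by the square root of the number of samples. Below is a minimal sketch with `awk`. It assumes a hypothetical tab-delimited table with one row per bin and sample (columns `bin`, `sample`, `depth`) and sample names that start with the sample type letter. The depth values are made up for illustration, and `summarise_depth` is a helper defined here, not an existing tool.

```shell
# Made-up depth table: one row per bin and sample.
# Assumption: sample names start with the sample type letter (E or P).
depth_table=$(mktemp)
{
    printf 'bin\tsample\tdepth\n'
    printf 'bin.1\tE1\t10\n'
    printf 'bin.1\tE2\t12\n'
    printf 'bin.1\tE3\t14\n'
    printf 'bin.1\tP1\t20\n'
    printf 'bin.1\tP2\t24\n'
    printf 'bin.2\tP1\t30\n'
} > "$depth_table"

# Print: bin, sample type, number of samples, mean depth, SEM.
summarise_depth() {
    awk -F'\t' 'NR > 1 {
        key = $1 "\t" substr($2, 1, 1)   # bin + first letter of sample name
        n[key]++; sum[key] += $3; sumsq[key] += $3 * $3
    }
    END {
        for (key in n) {
            mean = sum[key] / n[key]
            if (n[key] > 1) {
                # sample variance uses n - 1; SEM = sd / sqrt(n)
                var = (sumsq[key] - n[key] * mean * mean) / (n[key] - 1)
                if (var < 0) var = 0     # guard against rounding error
                sem = sprintf("%.2f", sqrt(var / n[key]))
            } else {
                sem = "NA"               # SEM is undefined for one sample
            }
            printf "%s\t%d\t%.2f\t%s\n", key, n[key], mean, sem
        }
    }' "$1" | LC_ALL=C sort
}

summarise_depth "$depth_table"
```

With the made-up values above, bin.1 has a mean depth of 12.00 +/- 1.15 in the E samples and 22.00 +/- 2.00 in the P samples, while bin.2 gets `NA` because it has only one sample.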

```
checkm qa --help
```
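Once you have a tab-delimited table, step 3 (keeping only the columns you find valuable) can be done on the command line as well as in a spreadsheet. The sketch below selects columns by header name with `awk`. The example table and its values are made up, the column names are assumed to match the header of your own file, and `select_columns` is a helper defined here, not a CheckM command.

```shell
# Made-up example table; check the column names against your own file.
qa_table=$(mktemp)
{
    printf 'Bin Id\tMarker lineage\tCompleteness\tContamination\tStrain heterogeneity\n'
    printf 'bin.1\tBacteria\t95.50\t1.20\t0.00\n'
    printf 'bin.2\tBacteria\t40.10\t12.75\t25.00\n'
} > "$qa_table"

# Usage: select_columns FILE "name1,name2,..."
# Keeps only the named columns, in the order given.
select_columns() {
    awk -F'\t' -v OFS='\t' -v wanted="$2" '
        NR == 1 {
            n = split(wanted, names, ",")
            for (i = 1; i <= NF; i++) pos[$i] = i   # header name -> column number
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            line = $(pos[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[names[j]])
            print line
        }' "$1"
}

select_columns "$qa_table" "Bin Id,Completeness,Contamination"
```

Selecting by name rather than by column number keeps the command working if the table layout changes between output formats.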

# Bonus
Did you try varying the binning parameters in the previous notebook (M7)?
If so, run those bins through CheckM as well.
Remember to create clear and separate output directories. 

Are the bins of similar quality?