# Prokaryote genomics: Genome stats and Taxonomy prediction


***

## Introduction

Generate genome statistics and taxonomy predictions for assembled isolate genomes or metagenome-assembled genomes (MAGs).

NOTE: The process is comparable whether the assembled genomes come from:

- culture isolate sequencing (see `Data_processing_and_assembly/1C.Prokaryote_isolate_sequencing_Nanopore/`)
- mixed metagenome assemblies, i.e. MAGs (see `Data_processing_and_assembly/1A.Metagenomics_HiSeq/` and `2.Prokaryote_metagenomics_Binning/`)

## Index

- [1 Genome stats](#1-Genome-stats-via-checkM)
- [2 Taxonomy](#2-Taxonomy-prediction-via-gtdb)
- [3 Summary table of genome stats and taxonomy](#3-Summary-table-of-genome-stats-and-taxonomy)


## 1 Genome stats via *checkM*

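Run *checkM* on the assembled genomes to estimate completeness, contamination, and strain heterogeneity for each genome. The `-x` option must match the file extension of the input genome files (`fasta` here).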

In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J 3_isolate_genomes_checkm
#SBATCH --time 00:40:00
#SBATCH --ntasks=1
#SBATCH --mem 45GB
#SBATCH --cpus-per-task=20
#SBATCH -e 3_isolate_genomes_checkm.err
#SBATCH -o 3_isolate_genomes_checkm.out

# Working directory
cd /working/dir
mkdir -p 1.checkm/

# load CheckM
module purge
module load CheckM/1.0.13-gimkl-2018b-Python-2.7.16

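# Run the CheckM lineage workflow
# -t: threads (matches --cpus-per-task above); -x: extension of the input genome files
# --tab_table with -f: write the summary as a tab-separated table to the given file
checkm lineage_wf -t 20 --pplacer_threads 10 --tab_table \
-x fasta \
-f 1.checkm/checkm_bin_summary.txt \
/path/to/assembled/genome/files/ \
1.checkm/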
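
Once the job has finished, it can be worth confirming that every input genome has a row in the summary table before moving on. The sketch below assumes that the `Bin Id` column holds each input file name with the `-x` extension removed. The helper name `genomes_missing_from_checkm` is hypothetical and not part of *checkM*.

```python
import csv
from pathlib import Path


def genomes_missing_from_checkm(genome_dir, checkm_table, extension="fasta"):
    """Return input genome IDs that have no row in the CheckM summary table.

    Assumes 'Bin Id' is the input file name without its extension.
    """
    # Genome IDs expected from the files in the input directory
    expected = {p.stem for p in Path(genome_dir).glob(f"*.{extension}")}
    # Genome IDs actually reported in the tab-separated summary
    with open(checkm_table, newline="") as handle:
        reported = {row["Bin Id"] for row in csv.DictReader(handle, delimiter="\t")}
    return sorted(expected - reported)
```

For the paths used above, this would be called as `genomes_missing_from_checkm("/path/to/assembled/genome/files/", "1.checkm/checkm_bin_summary.txt")`. An empty list means every input genome was reported.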


## 2 Taxonomy prediction via *gtdb*

Generate taxonomy predictions for each genome via *GTDB-Tk*, using the GTDB release 202 reference data (`gtdbtk_202`).

In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J 3_isolate_genomes_gtdb
#SBATCH --time 02:00:00
#SBATCH --ntasks=1
#SBATCH --mem 260GB
#SBATCH --cpus-per-task=20
#SBATCH -e 3_isolate_genomes_gtdb.err
#SBATCH -o 3_isolate_genomes_gtdb.out

# Working directory
cd /working/dir
mkdir -p 2.gtdbtk

# load module
module purge
module load GTDB-Tk/1.5.0-gimkl-2020a-Python-3.8.2

# Point GTDB-Tk at the release 202 reference data (in case this is not the default for the loaded module)
export GTDBTK_DATA_PATH=/opt/nesi/db/gtdbtk_202/

# Run gtdb-tk
gtdbtk classify_wf --cpus 20 \
-x fasta \
--genome_dir /path/to/assembled/genome/files/ \
--out_dir 2.gtdbtk
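
The `classification` column of the GTDB-Tk summary files holds the full lineage as a single semicolon-separated string with rank prefixes (`d__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`). If per-rank values are more convenient downstream, the string can be split roughly as sketched below. The function name `split_gtdb_classification` and the example lineage are illustrative only, not taken from this workflow's output.

```python
# Rank prefixes used in GTDB lineage strings, from domain to species
GTDB_RANKS = {
    "d__": "domain", "p__": "phylum", "c__": "class", "o__": "order",
    "f__": "family", "g__": "genus", "s__": "species",
}


def split_gtdb_classification(classification):
    """Split a GTDB lineage string into a {rank: name} dictionary.

    Ranks left unassigned (e.g. a bare 's__') map to an empty string.
    """
    ranks = {name: "" for name in GTDB_RANKS.values()}
    for field in classification.split(";"):
        field = field.strip()
        prefix, name = field[:3], field[3:]
        if prefix in GTDB_RANKS:
            ranks[GTDB_RANKS[prefix]] = name
    return ranks


# Illustrative lineage string (hypothetical example, species unassigned)
example = ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
           "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__")
print(split_gtdb_classification(example)["genus"])  # Escherichia
```

Values that carry no rank prefixes at all simply yield empty strings for every rank, so they remain easy to spot.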



## 3 Summary table of genome stats and taxonomy

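Generate a summary table that combines the *checkM* results with the *gtdb* taxonomy predictions. The commands below load Python and are then run in an interactive Python session.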


In [None]:
# Working directory
cd /working/dir

# Load Python and start an interactive session
module purge
module load Python/3.8.2-gimkl-2020a
python3

## Import required libraries (all following commands are run inside the Python session)
import pandas as pd
import glob

## Compile results

# checkm: keep the genome ID and the quality estimates
checkm_df = (
    pd.read_csv('1.checkm/checkm_bin_summary.txt', sep='\t')
    [['Bin Id', 'Completeness', 'Contamination', 'Strain heterogeneity']]
    .rename({'Bin Id': 'genomeID'}, axis=1)
)
# taxonomy: read every summary file GTDB-Tk wrote (one per domain) and stack them
gtdb_df = pd.concat([pd.read_csv(f, sep='\t') for f in glob.glob("2.gtdbtk/*.summary.tsv")],
                      ignore_index=True)[['user_genome', 'classification']]
gtdb_df.columns = ['genomeID', 'taxonomy_gtdb']

# Compile into one table (the outer join keeps genomes that appear in only one of the two results)
summary_df = pd.merge(checkm_df, gtdb_df, how="outer", on="genomeID")

# Write out summary table
summary_df.to_csv('summary_table_checkm_gtdb.tsv', sep='\t', index=False)

quit()
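
Because the two tables are joined with `how="outer"`, a genome present in only one of the two results still gets a row, with the other columns left empty. The sketch below re-reads the written table and lists such genomes. It assumes the default `to_csv` behaviour of writing missing values as empty fields; the function name `rows_with_missing_values` is hypothetical.

```python
import csv


def rows_with_missing_values(summary_tsv, id_column="genomeID"):
    """Map each genome ID to the columns that are empty in the summary table."""
    incomplete = {}
    with open(summary_tsv, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Collect the names of columns with no value for this genome
            empty = [col for col, value in row.items() if not value]
            if empty:
                incomplete[row[id_column]] = empty
    return incomplete
```

Calling `rows_with_missing_values('summary_table_checkm_gtdb.tsv')` returns an empty dictionary when every genome has both *checkM* statistics and a taxonomy prediction.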


***