# Prokaryote genomics: Genome stats


***

## Introduction

Generate genome stats, taxonomy predictions, and gene annotation predictions for assembled isolate genomes or metagenome-assembled genomes (MAGs).

NOTE: This process is comparable whether working with assembled genomes generated from either:

- culture isolate sequencing (see `Data_processing_and_assembly/1C.Prokaryote_isolate_sequencing_Nanopore/`)
- genomes recovered from mixed metagenome assemblies, i.e. metagenome-assembled genomes (MAGs) (see `Data_processing_and_assembly/1A.Metagenomics_HiSeq/` and `2.Prokaryote_metagenomics_Binning/`)

## Index

- [3.1 Genome stats](#3.1-Genome-stats-via-checkM)
- [3.2 Taxonomy](#3.2-Taxonomy-prediction-via-gtdb)
- [3.3 Summary table of genome stats and taxonomy](#3.3-Summary-table-of-genome-stats-and-taxonomy)
- [3.4 Gene calling and annotation](#3.4-Gene-Calling-and-Annotation)


## 3.1 Genome stats via *checkM*

Run *checkM* on the assembled genomes to estimate completeness, contamination, and strain heterogeneity for each genome (see the metagenomics processing pipeline for more information).

In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J 3_isolate_genomes_checkm
#SBATCH --time 00:40:00
#SBATCH --ntasks=1
#SBATCH --mem 45GB
#SBATCH --cpus-per-task=20
#SBATCH -e 3_isolate_genomes_checkm.err
#SBATCH -o 3_isolate_genomes_checkm.out

# Working directory
cd /working/dir
mkdir -p 2.isolate_genomes/1.checkm/

# load CheckM
module purge
module load CheckM/1.0.13-gimkl-2018b-Python-2.7.16

# Run CheckM
checkm lineage_wf -t 20 --pplacer_threads 10 --tab_table \
-x fasta \
-f 2.isolate_genomes/1.checkm/checkm_bin_summary.txt \
1.assembly/2.assembly.flye.nano_hq.LR_polished/polished_assembly_files/ \
2.isolate_genomes/1.checkm/
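
Once *checkM* has finished, the tab-separated summary can be screened for genomes that may need a closer look. The following is a minimal sketch using only the Python standard library. The column names (`Bin Id`, `Completeness`, `Contamination`) are those of the *checkM* summary table; the thresholds (90% completeness, 5% contamination) and the example rows are illustrative assumptions, not recommendations from this workflow.

```python
import csv
import io

# Hypothetical excerpt of checkm_bin_summary.txt (tab-separated); values are invented for illustration
EXAMPLE_SUMMARY = (
    "Bin Id\tCompleteness\tContamination\tStrain heterogeneity\n"
    "isolate_1.consensus\t99.10\t0.45\t0.00\n"
    "isolate_3.consensus\t84.20\t6.30\t25.00\n"
)

def flag_low_quality(handle, min_completeness=90.0, max_contamination=5.0):
    """Return the Bin Ids that fail either of the (assumed) quality thresholds."""
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness < min_completeness or contamination > max_contamination:
            flagged.append(row["Bin Id"])
    return flagged

flagged_bins = flag_low_quality(io.StringIO(EXAMPLE_SUMMARY))
print(flagged_bins)
```

To run this on real output, pass an open file handle for `2.isolate_genomes/1.checkm/checkm_bin_summary.txt` in place of the `io.StringIO` example.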


## 3.2 Taxonomy prediction via *gtdb*

Generate taxonomy predictions for each isolate via *GTDB-Tk*, using the GTDB release 202 reference data (`gtdbtk_202`). See the metagenomics processing pipeline for more information.

In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J 3_isolate_genomes_gtdb
#SBATCH --time 02:00:00
#SBATCH --ntasks=1
#SBATCH --mem 260GB
#SBATCH --cpus-per-task=20
#SBATCH -e 3_isolate_genomes_gtdb.err
#SBATCH -o 3_isolate_genomes_gtdb.out

# Working directory
cd /working/dir
mkdir -p 2.isolate_genomes/2.gtdbtk

# load module
module purge
module load GTDB-Tk/1.5.0-gimkl-2020a-Python-3.8.2

# Set the path to the reference data for the latest available gtdbtk (in case this isn't default for the loaded module)
export GTDBTK_DATA_PATH=/opt/nesi/db/gtdbtk_202/

# Run gtdb-tk
gtdbtk classify_wf --cpus 20 \
-x fasta \
--genome_dir 1.assembly/2.assembly.flye.nano_hq.LR_polished/polished_assembly_files/ \
--out_dir 2.isolate_genomes/2.gtdbtk
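
The `classification` column of the *GTDB-Tk* summary file holds the full lineage as a single semicolon-separated string with rank prefixes (`d__`, `p__`, ... `s__`). As a minimal sketch, the string can be split into named ranks as shown here. The function name and the example lineage are hypothetical and used only for illustration.

```python
# Rank prefixes used in GTDB classification strings
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def split_gtdb_classification(classification):
    """Split a GTDB lineage string into a dict of rank name -> taxon name."""
    ranks = {}
    for field in classification.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANK_PREFIXES:
            # An empty name (e.g. 's__') means no assignment at that rank
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks

# Hypothetical lineage string, for illustration only
example = ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
           "o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__")
parsed = split_gtdb_classification(example)
print(parsed["genus"], repr(parsed["species"]))
```

Fields without a recognised rank prefix are skipped, so a genome that could not be classified yields an empty dict rather than an error.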



## 3.3 Summary table of genome stats and taxonomy

Generate a summary table of *checkM* results and *gtdb* taxonomy predictions.

In this example, we will also incorporate the previously generated summary table of polished assembly stats.

In [None]:
# Working directory
cd /working/dir

# Load python
module purge
module load Python/3.8.2-gimkl-2020a
python3

## Import required libraries
import pandas as pd
import numpy as np
import re
import glob

## Compile results

# polished assembly stats
assembly_stats_df = pd.read_csv('1.assembly/2.assembly.flye.nano_hq.LR_polished/polished_assembly_stats/summary_table_LRpolished_assembly_stats.tsv', sep='\t')
# checkm (rename 'Bin Id' to 'sampleID', the key used for merging)
checkm_df = pd.read_csv('2.isolate_genomes/1.checkm/checkm_bin_summary.txt', sep='\t')[['Bin Id', 'Completeness', 'Contamination', 'Strain heterogeneity']].rename({'Bin Id': 'sampleID'}, axis=1)
# taxonomy (concatenate all gtdbtk summary files that were generated)
gtdb_df = pd.concat([pd.read_csv(f, sep='\t') for f in glob.glob("2.isolate_genomes/2.gtdbtk/*.summary.tsv")],
                      ignore_index=True)[['user_genome', 'classification']]
gtdb_df.columns = ['sampleID', 'taxonomy_gtdb']

# Strip '.consensus' off sampleIDs for merging (regex=False so that '.' is matched literally)
checkm_df['sampleID'] = checkm_df['sampleID'].str.replace('.consensus', '', regex=False)
gtdb_df['sampleID'] = gtdb_df['sampleID'].str.replace('.consensus', '', regex=False)

# Compile into one table
summary_df = pd.merge(assembly_stats_df, checkm_df, how="outer", on="sampleID").merge(gtdb_df, how="outer", on="sampleID")

# Write out summary table
summary_df.to_csv('2.isolate_genomes/summary_table_checkm_gtdb.tsv', sep='\t', index=False)

quit()
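
Because the tables are combined with outer merges, IDs that do not match exactly (for example, a leftover `.consensus` suffix) produce extra, partly empty rows rather than an error. A minimal sketch of a sanity check that could be run on the ID column of each table before merging is given here. The function name and the example IDs are hypothetical.

```python
def unmatched_ids(**id_lists):
    """For each named table, list the IDs seen in other tables but absent from this one."""
    all_ids = set()
    for ids in id_lists.values():
        all_ids.update(ids)
    return {name: sorted(all_ids.difference(ids)) for name, ids in id_lists.items()}

# Hypothetical IDs: the gtdb table still carries a '.consensus' suffix for one isolate
report = unmatched_ids(
    assembly=["isolate_1", "isolate_3"],
    checkm=["isolate_1", "isolate_3"],
    gtdb=["isolate_1", "isolate_3.consensus"],
)
for table, missing in report.items():
    print(table, "is missing:", missing)
```

An empty list for every table indicates that all IDs line up and the merged table should contain one complete row per genome.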


## 3.4 Gene Calling and Annotation

#### Introduction

As per the metagenomics processing pipeline, we can now predict genes and the annotations of those genes for each of our assembled isolate genomes.

*prodigal* is a commonly used tool for calling genes. It can be combined with other tools to call, for example, rRNA (*metaxa2*) and tRNA (*aragorn*). Identified genes can then be searched against a suite of gene annotation databases, such as *KEGG*, *UniProt*, *UniRef*, *pfam*, and *tigrfam*, to generate predicted annotations for each gene (e.g. via *blast*, *usearch*, *diamond*, or *hmmsearch*). Finally, these results can be compiled into a single user-friendly table of the annotation predictions from each database for each called gene.

*DRAM* is a convenient annotation tool that completes each of the above steps for a set of common annotation databases and compiles a user-friendly output table of predicted gene annotations. In the example below, we will run gene calling and annotation via *DRAM*, which is installed as a NeSI module.

For more information on *DRAM*, see here: https://github.com/WrightonLabCSU/DRAM

NOTE:

- *DRAM* comes installed with a number of freely available databases that it searches against. The full *KEGG* database, however, requires a paid licence and is not available as part of the NeSI module. Unfortunately, *DRAM* v1.3 (e.g. NeSI module `DRAM/1.3.5-Miniconda3`) cannot be set to include the full *KEGG* database, even if your group holds a *KEGG* licence, because the config file pointing to the database locations is fixed in *DRAM* versions before v1.4. From v1.4, *DRAM* is reportedly going to include an option to set your own config file: the current config file can be copied to retrieve the NeSI paths to the other available databases, and then modified to add your own database (e.g. the full *KEGG* database). *DRAM* v1.4 includes several major updates, so will hopefully be made available on NeSI in the near future.


#### Gene prediction and annotation via *DRAM*

NOTE:

- We no longer have the full sequential set of isolateIDs (because isolate_2 and isolate_7 failed to assemble). Previously, we simply listed the exact isolateID numbers in the `#SBATCH --array=` line. As an alternative, here we will:
  - run a slurm array for the total number of final assemblies (here, 12-2=10, so `--array=0-9`)
  - generate an array of all assembly file names (`ASSEMBLY_FILES_ARRAY`)
  - extract an individual file name from this array based on the `SLURM_ARRAY_TASK_ID`
  - use this individual file name for input and output names
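
The index-to-file mapping described above is sketched here in Python, purely to illustrate the logic (the job script itself does this with a bash array). The file names are hypothetical, following the `isolate_N.consensus.fasta` pattern. Note that the array index is a position in the sorted file list, not the isolate number, and that the exact order produced by a shell glob can depend on the locale.

```python
def array_spec(n_files):
    """Return the zero-based Slurm --array range that covers n_files tasks."""
    if n_files < 1:
        raise ValueError("need at least one assembly file")
    return f"0-{n_files - 1}"

# Hypothetical file names: 12 isolates, with isolate_2 and isolate_7 absent
assembly_files = sorted(
    f"isolate_{i}.consensus.fasta" for i in range(1, 13) if i not in (2, 7)
)

print("--array=" + array_spec(len(assembly_files)))

# Each task picks its file by position in the sorted list
for task_id in (0, 1, 4):
    print(task_id, assembly_files[task_id])
```

With plain string sorting, `isolate_10` sorts before `isolate_3`, so task 1 does not process isolate_2 or isolate_3. This is harmless here because output names are derived from the file name rather than from the task ID.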


In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J 3_isolate_genomes_DRAM
#SBATCH --time 2-00:00:00
#SBATCH --mem 100GB
#SBATCH --ntasks=1
#SBATCH --array=0-9
#SBATCH --cpus-per-task=36
#SBATCH -e 3_isolate_genomes_DRAM_%a.err
#SBATCH -o 3_isolate_genomes_DRAM_%a.out

# Working directory
cd /working/dir
mkdir -p 2.isolate_genomes/3.annotations_DRAM/dram_annotation/isolate_subsets

# Load module 
module purge
module load DRAM/1.3.5-Miniconda3

# Array of assembly files
ASSEMBLY_FILES_ARRAY=(1.assembly/2.assembly.flye.nano_hq.LR_polished/polished_assembly_files/*.fasta)
# Set variables: pick this task's file, and strip the path and '.consensus.fasta' for the output name
ASSEMBLY_FILE="${ASSEMBLY_FILES_ARRAY[${SLURM_ARRAY_TASK_ID}]}"
OUTPUT_FILE=$(basename "${ASSEMBLY_FILE}" .consensus.fasta)

## Run DRAM
# n.b. can add --gtdb_taxonomy twice *if* there's also an ar122 summary file available
# For DRAM_1.4 you can manually set the config file (e.g. to include the KEGG path): --config_loc /path/to/DRAM_1.4_CONFIG_EDITED
DRAM.py annotate --threads 36 --use_uniref \
--input_fasta ${ASSEMBLY_FILE} \
--checkm_quality ./2.isolate_genomes/1.checkm/checkm_bin_summary.txt \
--gtdb_taxonomy ./2.isolate_genomes/2.gtdbtk/gtdbtk.bac120.summary.tsv \
-o 2.isolate_genomes/3.annotations_DRAM/dram_annotation/isolate_subsets/${OUTPUT_FILE}


#### Merge *DRAM* annotations via *compile_dram_annotations.py*

The script `compile_dram_annotations.py` was written to recompile subsets of *DRAM* outputs, while allowing for cases where some results files were not generated (e.g. no tRNAs were identified for a given subset). It takes as input the path of the directory that contains the *DRAM* output for each subset. This script is available in `../scripts/`.

In [None]:
# Working directory
cd /working/dir/2.isolate_genomes/3.annotations_DRAM

# Load python
module purge
module load Python/3.8.2-gimkl-2020a

# Run compile dram annotations
/path/to/scripts/compile_dram_annotations.py \
-i dram_annotation/isolate_subsets/ \
-o dram_annotation/collated_dram_
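
To illustrate the idea of collating per-subset outputs while tolerating missing files, here is a minimal sketch. It is not the code of `compile_dram_annotations.py`: the function name, the example directory layout, and the example columns are hypothetical, and it assumes every subset file shares the same header.

```python
import csv
import tempfile
from pathlib import Path

def collate_tsv(subsets_dir, file_name, output_path):
    """Concatenate file_name from every subset directory, skipping subsets without it."""
    header_written = False
    n_merged = 0
    with open(output_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
        for subset in sorted(Path(subsets_dir).iterdir()):
            tsv_path = subset / file_name
            if not tsv_path.is_file():
                continue  # e.g. no tRNAs were identified for this subset
            with open(tsv_path, newline="") as in_handle:
                reader = csv.reader(in_handle, delimiter="\t")
                header = next(reader, None)
                if header is None:
                    continue  # empty file
                if not header_written:
                    writer.writerow(header)  # keep the header from the first file only
                    header_written = True
                writer.writerows(reader)
            n_merged += 1
    return n_merged

# Demonstration on a temporary, hypothetical directory layout
with tempfile.TemporaryDirectory() as tmp:
    subsets = Path(tmp) / "isolate_subsets"
    (subsets / "isolate_1").mkdir(parents=True)
    (subsets / "isolate_3").mkdir()  # this subset has no trnas.tsv
    (subsets / "isolate_1" / "trnas.tsv").write_text("gene\ttype\ng1\tAla\n")
    out_path = Path(tmp) / "collated_trnas.tsv"
    n_merged = collate_tsv(subsets, "trnas.tsv", out_path)
    collated_text = out_path.read_text()

print(n_merged)
```

Skipping absent files, rather than failing, is what allows a single run to collate results when only some subsets produced a given output.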


#### *DRAM* distill

`DRAM.py distill` can be used to generate summaries of annotations and some metabolic pathways.

For more information, see here: https://github.com/WrightonLabCSU/DRAM


In [None]:
# Working directory
cd /working/dir/2.isolate_genomes/3.annotations_DRAM

# Load modules
module purge
module load DRAM/1.3.5-Miniconda3

# Run DRAM distill on the collated annotations table
DRAM.py distill \
-i dram_annotation/collated_dram_annotations.tsv \
-o dram_distillation


***