# ONT sequencing data processing for prokaryote isolates

# Isolate genomes

***

Generate genome stats, taxonomy predictions, and gene annotation predictions for assembled isolate genomes.

This is directly analogous to the steps outlined in the metagenomics processing pipeline (e.g. [here](https://github.com/GenomicsAotearoa/environmental_metagenomics)). In this case, the inputs are isolate genomes assembled from Nanopore long-read sequencing data rather than metagenome-assembled genomes generated from Illumina HiSeq sequencing, but the process is the same.

**Important caveat**: Several of these tools will not work well if assembled contigs from more than one organism are present (e.g. if your starting cultures are not pure cultures). If you know in advance that your samples contain more than one organism (and/or results from CheckM and GTDB-Tk suggest there may be more than one organism present), you will first need to separate out the genomes of the individual organisms. In simple cases (e.g. your target organism + 1 contaminating organism) with genomes assembled into relatively complete single contigs, this might be possible by manually splitting the contigs into separate fna files. For more complex situations, you will need to do this via binning into separate genomes. This is not covered here, but should be analogous to the process described in the [metagenomics processing pipeline](https://github.com/GenomicsAotearoa/environmental_metagenomics) (although the tools *may* need to be specific for long read data).
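
As a minimal sketch of the manual splitting described above, the following Python function writes each contig of a multi-contig FASTA file to its own `.fna` file. The function name and the example paths are hypothetical, and it assumes the contig IDs (the first word of each header line) are safe to use as file names. Deciding which contigs belong to which organism is still up to you.

```python
from pathlib import Path


def split_fasta_by_contig(fasta_path, out_dir):
    """Write each contig of a multi-FASTA file to its own .fna file.

    Returns the list of files written, in the order the contigs appear.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # New contig: close the previous output file, open the next one
                if handle:
                    handle.close()
                contig_id = line[1:].split()[0]  # first word of the header
                out_path = out_dir / f"{contig_id}.fna"
                handle = open(out_path, "w")
                written.append(out_path)
            if handle:
                handle.write(line)
    if handle:
        handle.close()
    return written


# Example usage (hypothetical file names):
# split_fasta_by_contig("isolate_01.fna", "isolate_01.split_contigs")
```

The resulting single-contig files can then be grouped by organism and concatenated into one fna file per organism.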

## Genome stats via CheckM

See the metagenomics processing pipeline [here](https://github.com/GenomicsAotearoa/environmental_metagenomics) for more detailed information.

Note: CheckM v1 and v2 are built differently and can produce different results. E.g. CheckM2 reportedly handles ultrasmall prokaryote genomes better, but may introduce some errors in placing some bacterial lineages. It can be worth trying both.
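
Once both tools have been run (see the commands that follow), a side-by-side comparison can help spot genomes where the two estimates disagree. The sketch below is one way to do this with the Python standard library. It assumes that CheckM2 writes a `quality_report.tsv` file with `Name`, `Completeness`, and `Contamination` columns into its output directory, so check this against your own output. The function names are hypothetical.

```python
import csv


def read_quality(path, id_col):
    """Return {genome ID: (completeness, contamination)} from a tab-separated table."""
    with open(path, newline="") as handle:
        return {
            row[id_col]: (float(row["Completeness"]), float(row["Contamination"]))
            for row in csv.DictReader(handle, delimiter="\t")
        }


def compare_checkm(checkm1_path, checkm2_path):
    """List CheckM1 and CheckM2 estimates for genomes present in both tables."""
    v1 = read_quality(checkm1_path, "Bin Id")  # ID column of CheckM1 --tab_table output
    v2 = read_quality(checkm2_path, "Name")    # assumed ID column of CheckM2 output
    rows = []
    for genome in sorted(v1.keys() & v2.keys()):
        rows.append({
            "sampleID": genome,
            "completeness_checkm1": v1[genome][0],
            "completeness_checkm2": v2[genome][0],
            "completeness_diff": round(v2[genome][0] - v1[genome][0], 2),
            "contamination_checkm1": v1[genome][1],
            "contamination_checkm2": v2[genome][1],
            "contamination_diff": round(v2[genome][1] - v1[genome][1], 2),
        })
    return rows


# Example usage (the CheckM2 file name is an assumption):
# for row in compare_checkm("3.isolate_genomes/1.checkm1/checkm_bin_summary.txt",
#                           "3.isolate_genomes/1.checkm2/quality_report.tsv"):
#     print(row)
```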

#### Genome stats via *CheckM1*

In [None]:
#!/bin/bash
#SBATCH -A <>
#SBATCH -J checkm
#SBATCH --time 08:00:00
#SBATCH --ntasks=1
#SBATCH --mem 60GB
#SBATCH --cpus-per-task=20
#SBATCH -e checkm.err
#SBATCH -o checkm.out

cd /work/dir
mkdir -p 3.isolate_genomes/1.checkm1

module purge
module load CheckM/1.2.3-foss-2023a-Python-3.11.6

checkm lineage_wf -t 20 --pplacer_threads 10 --tab_table \
-x fna \
-f 3.isolate_genomes/1.checkm1/checkm_bin_summary.txt \
2.assembly/2.assembly.LR_polished.fna_files.m1000/ \
3.isolate_genomes/1.checkm1/


#### Genome stats via *CheckM2*

In [None]:
#!/bin/bash
#SBATCH -A <>
#SBATCH -J checkm2
#SBATCH --time 12:00:00
#SBATCH --ntasks=1
#SBATCH --mem 120GB
#SBATCH --cpus-per-task=20
#SBATCH -e checkm2.err
#SBATCH -o checkm2.out

cd /work/dir
mkdir -p 3.isolate_genomes/1.checkm2

module purge
module load CheckM2/1.0.1-Miniconda3

checkm2 predict --force --threads ${SLURM_CPUS_PER_TASK} -x fna \
--input 2.assembly/2.assembly.LR_polished.fna_files.m1000 \
--output-directory 3.isolate_genomes/1.checkm2


***

## Taxonomy prediction via *gtdb*

Generate taxonomy predictions for each isolate via *GTDB-Tk*. See the metagenomics processing pipeline [here](https://github.com/GenomicsAotearoa/environmental_metagenomics) for more detailed information.

In [None]:
#!/bin/bash
#SBATCH -A <>
#SBATCH -J gtdbtk
#SBATCH --time 04:00:00
#SBATCH --ntasks=1
#SBATCH --mem 100GB
#SBATCH --cpus-per-task=20
#SBATCH -e gtdbtk.err
#SBATCH -o gtdbtk.out

cd /work/dir
mkdir -p 3.isolate_genomes/2.gtdbtk

module purge
module load GTDB-Tk/2.4.0-foss-2023a-Python-3.11.6

gtdbtk classify_wf --cpus ${SLURM_CPUS_PER_TASK} \
-x fna --skip_ani_screen \
--genome_dir 2.assembly/2.assembly.LR_polished.fna_files.m1000 \
--out_dir 3.isolate_genomes/2.gtdbtk


***

## Summary table of genome stats and taxonomy

Example python script to generate a summary table of *CheckM* results and *GTDB-Tk* taxonomy predictions.

In this example, we will also incorporate the previously generated summary table of polished assembly stats.

In [None]:
# Working directory
cd /working/dir

# Load python
module purge
module load Python/3.8.2-gimkl-2020a
python3

## Import required libraries
import pandas as pd
import numpy as np
import re
import glob

## Compile results

# polished assembly stats
assembly_stats_df = pd.read_csv('2.assembly/2.assembly.LR_polished.assembly_stats/summary_table.polished_assembly.stats.tsv', sep='\t')
# checkm
checkm_df = pd.read_csv('3.isolate_genomes/1.checkm1/checkm_bin_summary.txt', sep='\t')[['Bin Id', 'Completeness', 'Contamination', 'Strain heterogeneity']].rename({'Bin Id': 'sampleID'}, axis=1)
# taxonomy
gtdb_df = pd.concat([pd.read_csv(f, sep='\t') for f in glob.glob("3.isolate_genomes/2.gtdbtk/*.summary.tsv")],
                      ignore_index=True)[['user_genome', 'classification']]
gtdb_df.columns = ['sampleID', 'taxonomy_gtdb']

# # If necessary, strip '.consensus' off sampleIDs for merging
# # (regex=False so that the '.' is matched literally rather than as a wildcard)
# checkm_df['sampleID'] = checkm_df['sampleID'].str.replace('.consensus', '', regex=False)
# gtdb_df['sampleID'] = gtdb_df['sampleID'].str.replace('.consensus', '', regex=False)

# Compile into one table (all three tables are merged on the shared 'sampleID' column)
summary_df = pd.merge(assembly_stats_df, checkm_df, how="outer", on="sampleID").merge(gtdb_df, how="outer", on="sampleID")

# Write out summary table
summary_df.to_csv('3.isolate_genomes/summary_table_checkm_gtdb.tsv', sep='\t', index=False)

quit()

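An outer merge keeps rows whose IDs do not match across tables and fills the missing values with `NaN`, so a naming mismatch (e.g. a stray `.consensus` suffix) produces extra, half-empty rows without raising an error. A quick check of the ID sets before merging can catch this. The helper below is a hypothetical sketch using only built-in Python. With pandas you would pass, for example, `checkm_df['sampleID']` as one of the ID collections.

```python
def report_id_mismatches(tables):
    """Given {table name: iterable of IDs}, return {table name: sorted IDs missing from it}.

    An ID counts as missing from a table if it occurs in any other table
    but not in that one.
    """
    id_sets = {name: set(ids) for name, ids in tables.items()}
    all_ids = set().union(*id_sets.values())
    return {name: sorted(all_ids - ids) for name, ids in id_sets.items()}


# Toy example: 'iso2' carries a '.consensus' suffix in one table only
example = {
    "assembly_stats": ["iso1", "iso2", "iso3"],
    "checkm": ["iso1", "iso2.consensus", "iso3"],
    "gtdb": ["iso1", "iso3"],
}
for table, missing in report_id_mismatches(example).items():
    print(table, "is missing:", missing)
```

If every list in the result is empty, all tables cover the same set of IDs and the merge will give one complete row per isolate.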

***

## Gene Calling and Annotation

#### Introduction

As per the metagenomics processing pipeline, we can now predict genes and the annotations of those genes for each of our assembled isolate genomes.

*prodigal* is a commonly used tool for calling genes. This can also be combined with other tools to call, for example, rRNA (*metaxa2*) and tRNA (*aragorn*). Identified genes can then be searched against a suite of gene annotation databases, such as *KEGG*, *UniProt*, *UniRef*, *pfam*, *tigrfam*, etc. to generate predicted annotations for each gene (e.g. searches via *blast*, *usearch*, *diamond*, or *hmmsearch*). Finally, these can be compiled to generate a single user friendly table of all annotation predictions (by each of the databases searched against) for each called gene.

*DRAM* is a convenient annotation tool that completes each of the above steps for a set of common annotation databases and compiles a user friendly output table of predicted gene annotations. In the example below, we will run gene calling and annotation via *DRAM*, which is installed as a NeSI module.

For more information on *DRAM*, see here: https://github.com/WrightonLabCSU/DRAM

NOTE:

- *DRAM* comes installed with a number of freely available databases that it searches against. The full *KEGG* database, however, requires a paid licence and is not available as part of the NeSI module. Unfortunately, *DRAM* v1.3 (e.g. NeSI module `DRAM/1.3.5-Miniconda3`) cannot be set to include the full *KEGG* database, even if your group holds a full *KEGG* licence, because the config file pointing to the database locations is a fixed setting in *DRAM* < v1.4. Versions from v1.4 onwards reportedly include an extra option to set your own config file: the current config file can be copied to retrieve the NeSI paths to the rest of the available databases, and then modified to include your own database (e.g. the full *KEGG* database). *DRAM* v1.4 includes several major updates, so it will hopefully be made available on NeSI in the future.


#### Gene prediction and annotation via *DRAM*

Note: We may no longer have the full sequential set of 96 sampleIDs (due to some samples failing to assemble). To save wasting resources running jobs on missing samples, we previously suggested you could list the exact sampleID numbers in the `#SBATCH --array=` line. As an alternative, here we will:

- Inside the slurm script:
  - generate an array of all assembly file names (`ASSEMBLY_FILES_ARRAY`)
  - extract an individual file name from this array based on the `SLURM_ARRAY_TASK_ID`
  - use this individual file name for input and output names
- When submitting the slurm job:
  - add `--array=0-n` to the slurm submission based on the assembly file count.
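
The indexing behind this approach can be sketched in Python as follows. Array indices are zero-based, so for a given number of assembly files the `n` in `--array=0-n` is the file count minus one, and each task ID selects exactly one file. The function names are hypothetical and the sketch only illustrates the logic; the actual job uses the bash equivalents.

```python
from pathlib import Path


def list_assemblies(fna_dir):
    """Return the .fna files in fna_dir as a sorted list."""
    return sorted(Path(fna_dir).glob("*.fna"))


def array_range(n_files):
    """Return the value for `sbatch --array=` covering n_files zero-based tasks."""
    if n_files < 1:
        raise ValueError("no assembly files found")
    return f"0-{n_files - 1}"


def file_for_task(files, task_id):
    """Return the assembly file handled by the array task with this ID."""
    return files[task_id]


# Example usage (directory name as used in this workflow):
# files = list_assemblies("2.assembly/2.assembly.LR_polished.fna_files.m1000")
# print(array_range(len(files)))        # value to pass to --array
# print(file_for_task(files, 0).stem)   # basename of the first file, without .fna
```

The list must be built the same way at submission time and inside the job, otherwise a task ID could point at a different file than intended.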
  


In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J isolate_genomes_DRAM
#SBATCH --time 2-00:00:00
#SBATCH --mem 100GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=36
#SBATCH -e isolate_genomes_DRAM_%a.err
#SBATCH -o isolate_genomes_DRAM_%a.out

# Working directory
cd /working/dir
mkdir -p 3.isolate_genomes/3.gene_annotations/dram_annotation/isolate_subsets

# Load module 
module purge
module load DRAM/1.3.5-Miniconda3

# array of assembly files
ASSEMBLY_FILES_ARRAY=(2.assembly/2.assembly.LR_polished.fna_files.m1000/*.fna)                    
# Set variables
ASSEMBLY_FILE=${ASSEMBLY_FILES_ARRAY[${SLURM_ARRAY_TASK_ID}]}
OUTPUT_FILE=$(basename "${ASSEMBLY_FILE}" .fna)

## Run DRAM
# n.b. --gtdb_taxonomy can be given twice *if* there is also an archaeal summary file available
# (named ar53 in recent GTDB-Tk releases, ar122 in older ones)
DRAM.py annotate --threads ${SLURM_CPUS_PER_TASK} --use_uniref \
--input_fasta ${ASSEMBLY_FILE} \
--checkm_quality 3.isolate_genomes/1.checkm1/checkm_bin_summary.txt \
--gtdb_taxonomy 3.isolate_genomes/2.gtdbtk/gtdbtk.bac120.summary.tsv \
-o 3.isolate_genomes/3.gene_annotations/dram_annotation/isolate_subsets/${OUTPUT_FILE}


In [None]:
# Build the array with the same glob as inside the slurm script, so the file count matches
ASSEMBLY_FILES_ARRAY=(2.assembly/2.assembly.LR_polished.fna_files.m1000/*.fna)
ZNUM=$(( ${#ASSEMBLY_FILES_ARRAY[@]} - 1 ))
sbatch --array=0-${ZNUM} isolate_genomes_DRAM.sl

#### Merge *DRAM* annotations via *compile_dram_annotations.py*

The script `compile_dram_annotations.py` was written to recompile subsets of *DRAM* outputs, while allowing for cases where some results files were not generated (e.g. no tRNAs were identified for a given subset). It takes as input the path to the directory that contains the *DRAM* output for each subset. This script is available in `../scripts/`.

In [None]:
# Working directory
cd /working/dir/3.isolate_genomes/3.gene_annotations

# Load python
module purge
module load Python/3.8.2-gimkl-2020a

# Run compile dram annotations
/path/to/scripts/compile_dram_annotations.py \
-i dram_annotation/isolate_subsets/ \
-o dram_annotation/collated_dram_

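For illustration only, the core idea of such a compilation step can be sketched as below: concatenate one result file from every subset directory, keep the header once, and skip subsets where that file was never generated. This is not the code of `compile_dram_annotations.py`. The function name is hypothetical, and the sketch assumes every file of a given type has identical columns.

```python
from pathlib import Path


def concat_tsv(subset_dirs, file_name, out_path):
    """Concatenate file_name from each subset directory into out_path.

    The header line is written once. Directories that lack the file are
    skipped. Returns the number of files that were merged.
    """
    merged = 0
    with open(out_path, "w") as out:
        for subset in sorted(Path(d) for d in subset_dirs):
            src = subset / file_name
            if not src.is_file():
                continue  # e.g. no tRNAs were identified for this subset
            with open(src) as handle:
                header = handle.readline()
                if merged == 0:
                    out.write(header)
                for line in handle:
                    # Guard against a missing newline at the end of a file
                    out.write(line if line.endswith("\n") else line + "\n")
            merged += 1
    return merged


# Example usage (output file name is hypothetical):
# subsets = Path("dram_annotation/isolate_subsets").iterdir()
# concat_tsv(subsets, "trnas.tsv", "dram_annotation/example_collated_trnas.tsv")
```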

#### *DRAM* distill

`DRAM.py distill` can be used to generate summaries of annotations and some metabolic pathways.

For more information, see here: https://github.com/WrightonLabCSU/DRAM


In [None]:
# Working directory
cd /working/dir/3.isolate_genomes/3.gene_annotations

# Load modules
module purge
module load DRAM/1.3.5-Miniconda3

# Run DRAM distill on the collated annotations (file name follows the `collated_dram_` prefix set above)
DRAM.py distill \
-i dram_annotation/collated_dram_annotations.tsv \
-o dram_distillation


***