# Prokaryote genomics: Gene Calling and Annotation


***

## Introduction

With assembled genomes in hand, we can now predict genes and generate annotations for those genes in each genome.

NOTE: This process is comparable whether working with assembled genomes generated from either:

- culture isolate sequencing (c.f. `Data_processing_and_assembly/1C.Prokaryote_isolate_sequencing_Nanopore/`) 
- genomes recovered from mixed metagenome assemblies (metagenome-assembled genomes (MAGs)) (c.f. `Data_processing_and_assembly/1A.Metagenomics_HiSeq/` and `2.Prokaryote_metagenomics_Binning/`)

## Index

- [Gene calling and Annotation](#Gene-Calling-and-Annotation)

## Gene Calling and Annotation

#### Preamble

*prodigal* is a commonly used tool for calling genes. It can be combined with other tools to call, for example, rRNA genes (*metaxa2*) and tRNA genes (*aragorn*). Identified genes can then be searched against a suite of gene annotation databases, such as *KEGG*, *UniProt*, *UniRef*, *pfam*, *tigrfam*, etc., to generate predicted annotations for each gene (e.g. searches via *blast*, *usearch*, *diamond*, or *hmmsearch*). Finally, these results can be compiled into a single user-friendly table of all annotation predictions (one set per database searched) for each called gene.

*DRAM* is a convenient annotation tool that completes each of the above steps for a set of common annotation databases and compiles a user-friendly output table of predicted gene annotations. In the example below, we will run gene calling and annotation via *DRAM*, which is installed as a NeSI module.

For more information on *DRAM*, see here: https://github.com/WrightonLabCSU/DRAM

NOTE:

- *DRAM* comes installed with a number of freely available databases that it searches against. The full *KEGG* database, however, requires a paid licence and is not available as part of the NeSI module. Unfortunately, *DRAM* v1.3 (e.g. NeSI module `DRAM/1.3.5-Miniconda3`) cannot be set to include the full *KEGG* database, even if your group holds a full *KEGG* licence, because the config file pointing to database locations is a fixed setting in *DRAM* versions before v1.4. From v1.4, *DRAM* is reportedly going to include an extra option to set your own config file. This includes options to copy the current config file, which retrieves the NeSI paths to the rest of the available databases, and then modify it to include your own database (e.g. the full *KEGG* database). *DRAM* v1.4 includes several major updates, so it will hopefully be upgraded on NeSI in the near future.


#### Gene prediction and annotation via *DRAM*

NOTE:

- The example below runs *DRAM* separately (in a slurm array) for *each* assembled genome and then merges the results via *compile_dram_annotations.py*. Alternatively, if the dataset is not especially large, you could concatenate all genomes together and run them in a single job.
  - You may want to ensure that contig headers in each of your assembled genome files include the individual genomeID *prior* to running this annotation step (or at least have a contig2genome_lookupTable file prepared, so that gene annotations from specific contigs can be linked back to individual genomes).
- In case the assembled genome files are not labelled based on a sequential set of IDs, to run as a slurm array in this example we will:
  - run a slurm array for the total number of final assembled genome files (in this example, 10 genomes, so `#SBATCH --array=0-9`)
  - generate an array of all assembled genome file names (`GENOME_FILES_ARRAY`)
    - NOTE: if the file extensions of your assembled genome files end in something other than .fasta, amend the `GENOME_FILES_ARRAY` and `OUTPUT_FILE` lines accordingly.
  - extract an individual file name from this array based on the `SLURM_ARRAY_TASK_ID`
  - use this individual file name for input and output names
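The contig-header note above can be sketched as follows. This is a minimal illustration, not part of the original workflow: it builds two tiny made-up genomes in a temporary directory, prefixes every contig header with the genome ID taken from the file name, and writes a two-column lookup table. The directory layout, the `renamed/` output folder, and the lookup table column names are all assumptions made for the demo; point `GENOME_DIR` at your own files and adjust the `.fasta` extension as needed.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical demo input: two tiny "genomes" in a temporary directory
GENOME_DIR=$(mktemp -d)
printf '>contig_1\nACGT\n>contig_2\nGGCC\n' > "${GENOME_DIR}/genomeA.fasta"
printf '>contig_1\nTTAA\n' > "${GENOME_DIR}/genomeB.fasta"

# Hypothetical output locations (names chosen for this sketch only)
OUT_DIR="${GENOME_DIR}/renamed"
mkdir -p "${OUT_DIR}"
LOOKUP="${OUT_DIR}/contig2genome_lookupTable.tsv"
printf 'contigID\tgenomeID\n' > "${LOOKUP}"

for GENOME_FILE in "${GENOME_DIR}"/*.fasta; do
    GENOME_ID=$(basename "${GENOME_FILE}" .fasta)
    # Prefix each header with the genome ID, e.g. >contig_1 becomes >genomeA_contig_1
    sed "s/^>/>${GENOME_ID}_/" "${GENOME_FILE}" > "${OUT_DIR}/${GENOME_ID}.fasta"
    # Record each renamed contig ID (first word of the header) against its genome
    grep '^>' "${OUT_DIR}/${GENOME_ID}.fasta" \
        | sed 's/^>//' \
        | awk -v g="${GENOME_ID}" '{print $1 "\t" g}' >> "${LOOKUP}"
done

cat "${LOOKUP}"
```

The renamed files in `OUT_DIR` would then be the ones passed to the annotation step, so that every gene ID carries its genome of origin.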


In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J gene_annotation_DRAM
#SBATCH --time 2-00:00:00
#SBATCH --mem 100GB
#SBATCH --ntasks=1
#SBATCH --array=0-9
#SBATCH --cpus-per-task=36
#SBATCH -e gene_annotation_DRAM_%a.err
#SBATCH -o gene_annotation_DRAM_%a.out

# Working directory
cd /working/dir
mkdir -p 3.gene_annotations/dram_annotation/subsets

# Load module 
module purge
module load DRAM/1.3.5-Miniconda3

# array of assembly files
GENOME_FILES_ARRAY=(/path/to/assembled/genome/files/*.fasta)                    
# Set variables
GENOME_FILE=${GENOME_FILES_ARRAY[${SLURM_ARRAY_TASK_ID}]}
OUTPUT_FILE=$(basename "${GENOME_FILE}" .fasta)

## Run DRAM
# n.b. can add --gtdb_taxonomy twice *if* there's also an ar122 summary file available
# For DRAM_1.4 you can manually set the config file (e.g. to include the KEGG path): --config_loc /path/to/DRAM_1.4_CONFIG_EDITED
DRAM.py annotate --threads 36 --use_uniref \
--input_fasta "${GENOME_FILE}" \
--checkm_quality 1.checkm/checkm_bin_summary.txt \
--gtdb_taxonomy 2.gtdbtk/gtdbtk.bac120.summary.tsv \
-o "3.gene_annotations/dram_annotation/subsets/${OUTPUT_FILE}"
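
The file-selection logic in the script above can be checked outside slurm before submitting. The sketch below is a dry run under stated assumptions: it creates three empty, made-up genome files in a temporary directory, builds `GENOME_FILES_ARRAY` the same way as the script, reports the `--array` range that matches the number of files, and prints which genome each task ID would process. `N_GENOMES` and `TASK_ID` are names introduced for this illustration only.

```shell
#!/bin/bash
set -euo pipefail
# With nullglob, a pattern with no matches gives an empty array, not the literal pattern
shopt -s nullglob

# Hypothetical demo input: three empty genome files in a temporary directory
GENOME_DIR=$(mktemp -d)
touch "${GENOME_DIR}/genome_a.fasta" "${GENOME_DIR}/genome_b.fasta" "${GENOME_DIR}/genome_c.fasta"

# Glob expansion is sorted, so each index maps to the same file on every run
GENOME_FILES_ARRAY=("${GENOME_DIR}"/*.fasta)
N_GENOMES=${#GENOME_FILES_ARRAY[@]}

if [ "${N_GENOMES}" -eq 0 ]; then
    echo "No .fasta files found in ${GENOME_DIR}" >&2
    exit 1
fi

# Array indices start at 0, so N files need tasks 0 to N-1
echo "Found ${N_GENOMES} genomes: use #SBATCH --array=0-$((N_GENOMES - 1))"

# Print the task ID to genome mapping that the slurm array would use
for TASK_ID in "${!GENOME_FILES_ARRAY[@]}"; do
    GENOME_FILE=${GENOME_FILES_ARRAY[${TASK_ID}]}
    OUTPUT_FILE=$(basename "${GENOME_FILE}" .fasta)
    echo "task ${TASK_ID}: ${OUTPUT_FILE}"
done
```

If the reported range differs from the `#SBATCH --array` line, some genomes would be skipped or some tasks would start with no input file.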


#### Merge *DRAM* annotations via *compile_dram_annotations.py*

The script `compile_dram_annotations.py` was written to recompile subsets of *DRAM* outputs, while allowing for cases where some results files were not generated (e.g. no tRNAs were identified for a given subset). It takes as input the path to the directory that contains each of the *DRAM* subset outputs. This script is available in `../scripts/`.
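Before merging, it may be useful to see which subsets are missing which results files, so that a gap in the collated output is not mistaken for a failed run. The sketch below is a hedged illustration on made-up directories: it assumes each subset directory holds `annotations.tsv` and, where any were found, `trnas.tsv` and `rrnas.tsv`. These file names are an assumption about the *DRAM* output layout, so check them against your own subset directories.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical demo input: two subset directories, one of them without trnas.tsv
SUBSETS_DIR=$(mktemp -d)
mkdir -p "${SUBSETS_DIR}/genome_a" "${SUBSETS_DIR}/genome_b"
touch "${SUBSETS_DIR}/genome_a/annotations.tsv" \
      "${SUBSETS_DIR}/genome_a/trnas.tsv" \
      "${SUBSETS_DIR}/genome_a/rrnas.tsv"
touch "${SUBSETS_DIR}/genome_b/annotations.tsv" \
      "${SUBSETS_DIR}/genome_b/rrnas.tsv"

# Assumed DRAM output file names; adjust if yours differ
EXPECTED_FILES=(annotations.tsv trnas.tsv rrnas.tsv)
MISSING_REPORT=""

for SUBSET in "${SUBSETS_DIR}"/*/; do
    SUBSET_ID=$(basename "${SUBSET}")
    for FILE_NAME in "${EXPECTED_FILES[@]}"; do
        # SUBSET ends in a slash, so the file path can be appended directly
        if [ ! -e "${SUBSET}${FILE_NAME}" ]; then
            MISSING_REPORT+="${SUBSET_ID}"$'\t'"${FILE_NAME}"$'\n'
        fi
    done
done

# One line per missing file; only the header is printed if nothing is missing
printf 'subset\tmissing_file\n'
printf '%s' "${MISSING_REPORT}"
```

A missing `trnas.tsv` or `rrnas.tsv` can simply mean none were detected in that genome, whereas a missing `annotations.tsv` more likely points to a job that did not finish.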

In [None]:
# Working directory
cd /working/dir/3.gene_annotations

# Load python
module purge
module load Python/3.8.2-gimkl-2020a

# Run compile dram annotations
/path/to/scripts/compile_dram_annotations.py \
-i dram_annotation/subsets/ \
-o dram_annotation/collated_dram_


#### *DRAM* distill

`DRAM.py distill` can be used to generate summaries of annotations and some metabolic pathways.

For more information, see here: https://github.com/WrightonLabCSU/DRAM
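Before distilling, it can help to confirm that the collated table contains the genomes you expect. The sketch below counts annotated genes per genome in a small made-up table. It assumes the table is tab-separated with a header row and a `fasta` column holding the genome name; that column name is an assumption about the *DRAM* annotations format, so check the header of your own file first.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical demo table; a real annotations file has many more columns
ANNOTATIONS=$(mktemp)
printf '\tfasta\tscaffold\tkegg_id\n' > "${ANNOTATIONS}"
printf 'genome_a_c1_1\tgenome_a\tgenome_a_c1\tK00001\n' >> "${ANNOTATIONS}"
printf 'genome_a_c1_2\tgenome_a\tgenome_a_c1\t\n' >> "${ANNOTATIONS}"
printf 'genome_b_c1_1\tgenome_b\tgenome_b_c1\tK00002\n' >> "${ANNOTATIONS}"

# Find the 'fasta' column by name in the header, then count rows per genome
GENE_COUNTS=$(awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "fasta") col = i; next }
    col     { counts[$col]++ }
    END     { for (g in counts) print g "\t" counts[g] }
' "${ANNOTATIONS}" | sort)

printf 'genome\tn_genes\n%s\n' "${GENE_COUNTS}"
```

Looking the column up by name keeps the check working if the column order differs between *DRAM* versions.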


In [None]:
# Working directory
cd /working/dir/3.gene_annotations

# Load modules
module purge
module load DRAM/1.3.5-Miniconda3

# Run DRAM distill on the collated annotations table
DRAM.py distill \
-i dram_annotation/collated_dram_annotations.tsv \
-o dram_distillation


***