# Gene prediction and annotation

***

*DRAM* is a tool for microbial gene prediction and annotation, and it comes with a viral-focused equivalent, *DRAM-v*. *DRAM-v* outputs predicted annotations for genes based on several relevant databases, as well as predicted auxiliary metabolic genes (AMGs).

To run *DRAM-v*, we must first re-run *VirSorter2* (this time with the filtering steps omitted) to prepare the necessary input files. This step also calculates the metrics that *DRAM-v* uses to flag AMGs.

Further information on *DRAM* is available here: https://github.com/WrightonLabCSU/DRAM

NOTE:

- On large data sets, it can be necessary to split the vOTUs into subsets, run these in parallel, and then concatenate the results files.
- In the example below, we will:
  1. Split the viral contigs input file into subsets (e.g. 100 subsets) to run *DRAM-v* in parallel
    - See [here](https://github.com/WrightonLabCSU/DRAM/issues/54) for more info on parallelisation and restarting runs
  1. Prepare files for *DRAM-v* via *VirSorter2* (with filtering switched off)
  1. Concatenate the files that *VirSorter2* prepared for *DRAM-v* for other downstream use (taxonomy, etc.)
  1. Run all `prepped-for-dramv` files through *DRAM-v* for annotations and AMG outputs
  1. Concatenate all results files together

***

## DRAM-v: vOTUs

#### DRAM-v Prep: Split fasta file

For large data sets, split the vOTUs fna file into subset chunks to run *DRAM-v* in parallel.

In [None]:
cd /working/dir/
mkdir -p 2.annotation/1.DRAMv/vsort2_prepfiles/split_input_fasta

module purge
module load BBMap/38.95-gimkl-2020a

partition.sh \
in=1.viral_identification/6.checkv_vOTUs/vOTUs.checkv_filtered.fna \
out=2.annotation/1.DRAMv/vsort2_prepfiles/split_input_fasta/vOTUs_filtered_subset_%.fna ways=100
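
After splitting, it can be worth confirming that the subsets together contain the same number of sequences as the input file. The block below is a minimal sketch of such a check and is not part of the original workflow: `count_fasta_records` is a hypothetical helper, and the toy files created in a temporary directory stand in for the real input and the `vOTUs_filtered_subset_*.fna` outputs.

```shell
#!/bin/bash
# Hypothetical helper: count FASTA records (header lines starting with ">")
# across one or more files.
count_fasta_records() {
    cat "$@" | grep -c '^>' || true
}

# Toy data standing in for the real input and its subsets (assumption)
demo_dir=$(mktemp -d)
printf '>contig_1\nACGT\n>contig_2\nGGCC\n>contig_3\nTTAA\n' > "${demo_dir}/input.fna"
printf '>contig_1\nACGT\n>contig_3\nTTAA\n' > "${demo_dir}/subset_0.fna"
printf '>contig_2\nGGCC\n' > "${demo_dir}/subset_1.fna"

n_input=$(count_fasta_records "${demo_dir}/input.fna")
n_split=$(count_fasta_records "${demo_dir}"/subset_*.fna)

if [ "${n_input}" -eq "${n_split}" ]; then
    echo "OK: ${n_split} of ${n_input} sequences present across subsets"
else
    echo "MISMATCH: input has ${n_input} sequences, subsets have ${n_split}" >&2
fi
```

For a real run, the two `count_fasta_records` calls would point at the `in=` file and the `out=` subset files used in the `partition.sh` command.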


#### DRAM-v Prep: VirSorter2

NOTE: 

- *VirSorter* version 2.2.3 **must** be run with `module unload XALT` after `module purge`
- Replace `/path/to/Databases/virsorter2_database/` with the appropriate path
- To omit filtering and to prepare for *DRAM-v*, the following flags are included: `--viral-gene-enrich-off --provirus-off --prep-for-dramv --min-score 0`


In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 4_DRAMv_prep
#SBATCH --time 01:00:00
#SBATCH --mem=2GB
#SBATCH --array=0-99
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH -e 4_DRAMv_prep_%a.err
#SBATCH -o 4_DRAMv_prep_%a.out

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

# Set up working directories
cd /working/dir
mkdir -p 2.annotation/1.DRAMv/vsort2_prepfiles/vOTUs_filtered_subsets/

# run virsorter2
srun virsorter run -j 24 \
--seqname-suffix-off --viral-gene-enrich-off --provirus-off --prep-for-dramv \
-i 2.annotation/1.DRAMv/vsort2_prepfiles/split_input_fasta/vOTUs_filtered_subset_${SLURM_ARRAY_TASK_ID}.fna \
-d /path/to/Databases/virsorter2_database/ \
--min-score 0 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-l vOTUs_filtered_subset_${SLURM_ARRAY_TASK_ID} \
-w 2.annotation/1.DRAMv/vsort2_prepfiles/vOTUs_filtered_subsets/vOTUs_filtered_subset_${SLURM_ARRAY_TASK_ID} \
--tmpdir ${SLURM_JOB_ID}.tmp \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}
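
When some array tasks fail or time out, only those subsets need to be resubmitted. The block below is a minimal sketch of one way to find them, assuming the output layout used in the script above; `list_incomplete_subsets` is a hypothetical helper (not a *VirSorter2* or Slurm command), and the demo directory is toy data.

```shell
#!/bin/bash
# Hypothetical helper: print a comma-separated list of subset indices whose
# expected DRAM-v prep fasta is missing or empty.
# Arguments: base directory holding the per-subset output folders, number of subsets.
list_incomplete_subsets() {
    local base_dir=$1 n_subsets=$2 missing=() i prefix
    for ((i = 0; i < n_subsets; i++)); do
        prefix="vOTUs_filtered_subset_${i}"
        # -s is true only if the file exists and is non-empty
        if [ ! -s "${base_dir}/${prefix}/${prefix}-for-dramv/final-viral-combined-for-dramv.fa" ]; then
            missing+=("${i}")
        fi
    done
    (IFS=,; echo "${missing[*]}")
}

# Toy layout (assumption): subsets 0 and 2 finished, 1 never ran, 3 left an empty file
demo_dir=$(mktemp -d)
for i in 0 2 3; do
    mkdir -p "${demo_dir}/vOTUs_filtered_subset_${i}/vOTUs_filtered_subset_${i}-for-dramv"
done
for i in 0 2; do
    printf '>contig\nACGT\n' \
        > "${demo_dir}/vOTUs_filtered_subset_${i}/vOTUs_filtered_subset_${i}-for-dramv/final-viral-combined-for-dramv.fa"
done
: > "${demo_dir}/vOTUs_filtered_subset_3/vOTUs_filtered_subset_3-for-dramv/final-viral-combined-for-dramv.fa"

to_rerun=$(list_incomplete_subsets "${demo_dir}" 4)
echo "Subsets to rerun: ${to_rerun}"
```

The resulting list can be passed to `sbatch --array=` on the command line, which takes precedence over the `#SBATCH --array` line in the script. Checking the corresponding `.err` logs first is advisable, since a missing file does not by itself explain why a task failed.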


#### DRAM-v Prep: Concatenate fasta files for downstream use

In [None]:
cd /working/dir/2.annotation/1.DRAMv/vsort2_prepfiles
mkdir -p vOTUs_filtered_concatenated

# concatenate fasta files
> vOTUs_filtered_concatenated/final-viral-combined-for-dramv.fa
for i in {0..99}; do
    cat vOTUs_filtered_subsets/vOTUs_filtered_subset_${i}/vOTUs_filtered_subset_${i}-for-dramv/final-viral-combined-for-dramv.fa \
    >> vOTUs_filtered_concatenated/final-viral-combined-for-dramv.fa
done


#### DRAM-v: annotate

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 4_DRAMv_annotation_vOTUs
#SBATCH --time=01:00:00
#SBATCH --mem=80GB
#SBATCH --ntasks=1
#SBATCH --array=0-99
#SBATCH --cpus-per-task=32
#SBATCH -e 4_DRAMv_annotation_vOTUs_%a.err 
#SBATCH -o 4_DRAMv_annotation_vOTUs_%a.out 

# set up
cd /working/dir/
mkdir -p 2.annotation/1.DRAMv/dramv_annotation

# Load modules
module purge
module load DRAM/1.3.5-Miniconda3

# Run DRAM-v annotate
DRAM-v.py annotate --threads 32 \
--min_contig_size 1000 \
-i 2.annotation/1.DRAMv/vsort2_prepfiles/vOTUs_filtered_subsets/vOTUs_filtered_subset_${SLURM_ARRAY_TASK_ID}/vOTUs_filtered_subset_${SLURM_ARRAY_TASK_ID}-for-dramv/final-viral-combined-for-dramv.fa \
-v 2.annotation/1.DRAMv/vsort2_prepfiles/vOTUs_filtered_subsets/vOTUs_filtered_subset_${SLURM_ARRAY_TASK_ID}/vOTUs_filtered_subset_${SLURM_ARRAY_TASK_ID}-for-dramv/viral-affi-contigs-for-dramv.tab \
-o 2.annotation/1.DRAMv/dramv_annotation/dramv_annotation_subset_${SLURM_ARRAY_TASK_ID}


#### DRAM-v: Compile DRAM-v annotation subsets

The script `compile_dram_annotations.py` was written to recompile subsets of *DRAM* outputs, while allowing for cases where some results files were not generated (e.g. no tRNAs were identified for a given subset). It takes as input the path to the directory that contains the *DRAM-v* subset outputs. This script is available in `../scripts/`.

In [None]:
# Working directory
cd /working/dir

# Load python
module purge
module load Python/3.8.2-gimkl-2020a

# Run compile dram annotations
/path/to/scripts/compile_dram_annotations.py \
-i 2.annotation/1.DRAMv/dramv_annotation \
-o 2.annotation/1.DRAMv/dramv_annotation/collated_dramv_
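
The general idea behind this kind of collation (keep one header line, append the data rows from each subset, and tolerate subsets where a file is absent) can be illustrated as follows. This is a minimal sketch under stated assumptions and not the `compile_dram_annotations.py` script itself: `collate_tsv` is a hypothetical helper, the toy files only mimic per-subset results tables, and all inputs are assumed to share the same columns.

```shell
#!/bin/bash
# Hypothetical helper: concatenate tab-separated files into one output file,
# keeping the header from the first usable input and skipping any input that
# is missing or empty.
# Arguments: output file, then one or more input files.
collate_tsv() {
    local out_file=$1 f wrote_header=0
    shift
    : > "${out_file}"
    for f in "$@"; do
        if [ ! -s "${f}" ]; then
            echo "Skipping missing or empty file: ${f}" >&2
            continue
        fi
        if [ "${wrote_header}" -eq 0 ]; then
            cat "${f}" >> "${out_file}"          # first file: keep header
            wrote_header=1
        else
            tail -n +2 "${f}" >> "${out_file}"   # later files: drop header
        fi
    done
}

# Toy data (assumption): three subsets, one of which produced no table
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/subset_0" "${demo_dir}/subset_1" "${demo_dir}/subset_2"
printf 'gene\tscaffold\ng1\tcontig_1\n' > "${demo_dir}/subset_0/annotations.tsv"
printf 'gene\tscaffold\ng2\tcontig_7\n' > "${demo_dir}/subset_2/annotations.tsv"

collate_tsv "${demo_dir}/collated.tsv" \
    "${demo_dir}/subset_0/annotations.tsv" \
    "${demo_dir}/subset_1/annotations.tsv" \
    "${demo_dir}/subset_2/annotations.tsv"

n_lines=$(wc -l < "${demo_dir}/collated.tsv")
echo "Collated table has ${n_lines} lines (including header)"
```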


#### DRAM-v: distill

`DRAM-v.py distill` can be used to output predicted auxiliary metabolic genes. 

*Optional*: The flags `--remove_transposons` and `--remove_fs` can be included to exclude predicted AMGs on scaffolds with transposons and those near the ends of scaffolds, respectively (both situations increase the likelihood of false positives).


In [None]:
# Working directory
cd /working/dir

# Load modules
module purge
module load DRAM/1.3.5-Miniconda3

# Run DRAM-v distill
DRAM-v.py distill --remove_transposons --remove_fs \
-i 2.annotation/1.DRAMv/dramv_annotation/collated_dramv_annotations.tsv \
-o 2.annotation/1.DRAMv/dramv_distillation
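
Once distillation has finished, a quick tally of predicted AMGs per contig can help prioritise which vOTUs to inspect manually. The block below is a minimal sketch of such a tally: `count_by_column` is a hypothetical helper (not part of *DRAM*), the demo table is toy data, and the assumption that the distill output includes a tab-separated AMG summary with a `scaffold` column should be checked against the files in your own output directory.

```shell
#!/bin/bash
# Hypothetical helper: count data rows per value of a named column in a
# tab-separated file with a header line.
# Arguments: TSV file, column name.
count_by_column() {
    local tsv=$1 column=$2
    awk -F'\t' -v col="${column}" '
        NR == 1 {
            # Locate the requested column by its header name
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { counts[$idx]++ }
        END { for (k in counts) print k "\t" counts[k] }
    ' "${tsv}" | sort
}

# Toy table (assumption): column names and values are illustrative only
demo_tsv=$(mktemp)
printf 'gene\tscaffold\tgene_id\n'  > "${demo_tsv}"
printf 'g1\tcontig_1\tid_a\n'      >> "${demo_tsv}"
printf 'g2\tcontig_1\tid_b\n'      >> "${demo_tsv}"
printf 'g3\tcontig_5\tid_c\n'      >> "${demo_tsv}"

amg_counts=$(count_by_column "${demo_tsv}" scaffold)
echo "${amg_counts}"
```

Looking the column up by name rather than by position keeps the sketch usable if the column order differs between *DRAM* versions.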


***

## Additional resources

A great resource on standards in viromics, including specific discussion of auxiliary metabolic gene discovery and confirmation, is available here: https://doi.org/10.7717/peerj.11447