# 3.4 Viruses - Gene prediction and annotation

## Software and versions used in this study

- DRAM v1.4.6
- prodigal-gv v2.9.0

## Additional custom scripts

Note: custom scripts have been tested in Python v3.11.6 and R v4.2.1 and may not be stable in other versions.

- scripts/general/compile_dram_annotations.py
- scripts/general/dramv_compile_summary_table.py

*Required Python packages: argparse, pandas, numpy, os, re, glob, Bio.SeqIO.FastaIO*

***

## Virus gene prediction and annotation

#### DRAM-v prep: Split fna file (via BBMap's partition.sh)

To speed up a run with a large number of viral contigs, you can:

- split the vOTUs into equal parts
- run each subset through VirSorter2 (DRAM-v prep) and DRAM-v annotate in parallel (e.g. as a slurm array)
- compile the results (*compile_dram_annotations.py*)
- run the compiled annotations through DRAM-v distill.

In [None]:
mkdir -p DNA/3.viruses/7.gene_annotation/vsort2_prepfiles/split_input_fasta

partition.sh \
in=DNA/3.viruses/5.checkv_vOTUs/vOTUs.fna \
out=DNA/3.viruses/7.gene_annotation/vsort2_prepfiles/split_input_fasta/vOTUs_subset_%.fna ways=100
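
Before moving on, it can be worth confirming that no sequences were lost in the split. The sketch below is one way to do this by counting FASTA header lines; the `count_seqs` helper and the demo files are hypothetical stand-ins, so substitute the real `vOTUs.fna` and `vOTUs_subset_*.fna` paths.

```shell
# Count FASTA records (lines starting with ">") across one or more files.
count_seqs() {
    cat "$@" | grep -c '^>' || true
}

# Demo data in a throwaway directory (stand-ins for the real input and subsets).
demo_dir=$(mktemp -d)
printf '>a\nACGT\n>b\nGGCC\n>c\nTTAA\n' > "$demo_dir/vOTUs.fna"
printf '>a\nACGT\n>c\nTTAA\n' > "$demo_dir/vOTUs_subset_0.fna"
printf '>b\nGGCC\n' > "$demo_dir/vOTUs_subset_1.fna"

n_input=$(count_seqs "$demo_dir/vOTUs.fna")
n_split=$(count_seqs "$demo_dir"/vOTUs_subset_*.fna)

if [ "$n_input" -eq "$n_split" ]; then
    echo "OK: $n_split sequences across subsets"
else
    echo "Mismatch: input=$n_input, subsets=$n_split" >&2
fi
```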


#### DRAM-v Prep: VirSorter2

Re-run the vOTU subsets through VirSorter2 with filtering switched off to generate the inputs required by DRAM-v.

Note: the example below runs as a slurm array and uses the SLURM_ARRAY_TASK_ID and SLURM_JOB_ID variables. You can simplify this if running a single, smaller set of vOTUs.

In [None]:
virsorter run -j 24 \
--seqname-suffix-off --viral-gene-enrich-off --provirus-off --prep-for-dramv \
-i DNA/3.viruses/7.gene_annotation/vsort2_prepfiles/split_input_fasta/vOTUs_subset_${SLURM_ARRAY_TASK_ID}.fna \
-d Databases/virsorter2/ \
--min-score 0 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-l vOTUs_subset_${SLURM_ARRAY_TASK_ID} \
-w DNA/3.viruses/7.gene_annotation/vsort2_prepfiles/vOTUs_subsets/vOTUs_subset_${SLURM_ARRAY_TASK_ID} \
--tmpdir ${SLURM_JOB_ID}.tmp \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}
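
With many array tasks, some may fail or time out. The sketch below lists the subsets that have no (or an empty) `final-viral-combined-for-dramv.fa`, so they can be re-run. It follows the output layout used in this section; the `list_missing_subsets` helper is hypothetical, and it assumes subsets are numbered from 0 (check the file names produced by partition.sh).

```shell
# Print the ID of every subset whose DRAM-v input FASTA is missing or empty.
# Usage: list_missing_subsets <prep_dir> <n_subsets>
list_missing_subsets() {
    local prep_dir=$1
    local n_subsets=$2
    local i=0   # assumption: subsets are numbered from 0
    local fasta
    while [ "$i" -lt "$n_subsets" ]; do
        fasta="$prep_dir/vOTUs_subset_${i}/vOTUs_subset_${i}-for-dramv/final-viral-combined-for-dramv.fa"
        [ -s "$fasta" ] || echo "$i"
        i=$((i + 1))
    done
}

# Demo on a throwaway directory: subsets 0 and 2 finished, subset 1 did not.
demo_dir=$(mktemp -d)
for i in 0 2; do
    mkdir -p "$demo_dir/vOTUs_subset_${i}/vOTUs_subset_${i}-for-dramv"
    printf '>contig-cat_1\nACGT\n' \
        > "$demo_dir/vOTUs_subset_${i}/vOTUs_subset_${i}-for-dramv/final-viral-combined-for-dramv.fa"
done

missing=$(list_missing_subsets "$demo_dir" 3)
echo "Subsets to re-run: ${missing:-none}"
```

For the real run, point the function at `DNA/3.viruses/7.gene_annotation/vsort2_prepfiles/vOTUs_subsets` with the number of subsets used in the split.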


#### DRAM-v: annotate

Note: 

If you wish to run DRAM-v in `--low_mem_mode` to speed up the run, note that `--low_mem_mode` currently excludes vogdb, which DRAM-v's *distill* requires. A rough workaround is to temporarily edit DRAM-v so that vogdb is still used with `--low_mem_mode`.

The relevant script is *mag_annotator/database_handler.py*.

Make the following edits:

- comment out the line: `dbs_to_use = [i for i in dbs_to_use if i not in ("uniref", "kegg", "vogdb")]`
- add line: `dbs_to_use = [i for i in dbs_to_use if i not in ("uniref", "kegg")]`
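
As an alternative to editing by hand, the same change could be scripted. The sketch below edits the line in place with `sed` (rather than commenting it out and adding a new one) and keeps a `.bak` copy for restoring afterwards. It is shown on a mock file containing only the relevant line; the location of *database_handler.py* in your DRAM installation is not assumed here.

```shell
# Mock file standing in for mag_annotator/database_handler.py (relevant line only).
demo_dir=$(mktemp -d)
handler="$demo_dir/database_handler.py"
printf '%s\n' \
    '        dbs_to_use = [i for i in dbs_to_use if i not in ("uniref", "kegg", "vogdb")]' \
    > "$handler"

# Drop "vogdb" from the exclusion tuple; the untouched original is saved as
# database_handler.py.bak.
sed -i.bak 's/("uniref", "kegg", "vogdb")/("uniref", "kegg")/' "$handler"

# Show the edited line to confirm the change.
grep -n 'dbs_to_use = ' "$handler"

# To undo the temporary edit later:
#   mv "$handler.bak" "$handler"
```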


In [None]:
mkdir -p DNA/3.viruses/7.gene_annotation/dramv_annotation

DRAM-v.py annotate --threads 32 --min_contig_size 1000 \
-i DNA/3.viruses/7.gene_annotation/vsort2_prepfiles/vOTUs_subsets/vOTUs_subset_${SLURM_ARRAY_TASK_ID}/vOTUs_subset_${SLURM_ARRAY_TASK_ID}-for-dramv/final-viral-combined-for-dramv.fa \
-v DNA/3.viruses/7.gene_annotation/vsort2_prepfiles/vOTUs_subsets/vOTUs_subset_${SLURM_ARRAY_TASK_ID}/vOTUs_subset_${SLURM_ARRAY_TASK_ID}-for-dramv/viral-affi-contigs-for-dramv.tab \
-o DNA/3.viruses/7.gene_annotation/dramv_annotation/dramv_annotation_subset_${SLURM_ARRAY_TASK_ID}

#### DRAM-v: Compile DRAM-v annotation subsets

In [None]:
scripts/general/compile_dram_annotations.py \
-i DNA/3.viruses/7.gene_annotation/dramv_annotation \
-o DNA/3.viruses/7.gene_annotation/dramv_annotation/collated_dramv_

#### DRAM-v: distill

Note: the script below includes the following options: 

- `--remove_transposons` ("Do not consider genes on scaffolds with transposons as potential AMGs")
- `--remove_fs` ("Do not consider genes near ends of scaffolds as potential AMGs")

Note re: AMG prediction: "By default a gene is considered a potential AMG if it has an M flag, no V flag, no A flag and an auxiliary score of 3 or lower." (https://github.com/WrightonLabCSU/DRAM/wiki)
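
To make the quoted default rule concrete, the sketch below applies it to a small tab-separated table. It is an illustration only, not DRAM-v's implementation: the column names `amg_flags` and `auxiliary_score` are assumptions about the annotation table, the demo rows are invented, and the extra filtering from `--remove_transposons` and `--remove_fs` is not reproduced.

```shell
# Print the header plus rows with an M flag, no V flag, no A flag,
# and an auxiliary score of 3 or lower.
# Columns are located by header name (assumed: "amg_flags", "auxiliary_score").
filter_potential_amgs() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "amg_flags") flag_col = i
                if ($i == "auxiliary_score") score_col = i
            }
            if (!flag_col || !score_col) {
                print "required columns not found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        $flag_col ~ /M/ && $flag_col !~ /[VA]/ && $score_col != "" && $score_col + 0 <= 3
    ' "$1"
}

# Invented demo rows: only g1 and g5 satisfy the default rule.
demo_tsv=$(mktemp)
printf 'gene\tamg_flags\tauxiliary_score\n' > "$demo_tsv"
printf 'g1\tM\t2\n'   >> "$demo_tsv"
printf 'g2\tMV\t1\n'  >> "$demo_tsv"
printf 'g3\tMK\t4\n'  >> "$demo_tsv"
printf 'g4\tMA\t3\n'  >> "$demo_tsv"
printf 'g5\tMKF\t3\n' >> "$demo_tsv"

filter_potential_amgs "$demo_tsv"
```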

In [None]:
DRAM-v.py distill --remove_transposons --remove_fs \
-i DNA/3.viruses/7.gene_annotation/dramv_annotation/collated_dramv_annotations.tsv \
-o DNA/3.viruses/7.gene_annotation/dramv_distillation

#### DRAM-v: compile results

This script compiles the DRAM-v annotation, tRNA, rRNA, and AMG (distill) outputs, together with gene coordinates and contig (genome) lengths, into one table for downstream use.

Note: this script also trims "-cat_n" from contig/genome IDs (introduced by VirSorter2 during the prep-for-DRAMv step) to keep IDs consistent with other analyses.
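
For reference, the ID trimming described above can be sketched as a one-line `sed` filter. This is not the code used in *dramv_compile_summary_table.py*; it assumes "n" is an integer at the end of the contig ID, and the example IDs are hypothetical.

```shell
# Remove a trailing "-cat_<n>" suffix from each contig ID (one ID per line).
strip_cat_suffix() {
    sed -E 's/-cat_[0-9]+$//'
}

# Hypothetical contig IDs; the last one has no suffix and is left unchanged.
printf '%s\n' 'vOTU_001-cat_2' 'vOTU_002-cat_5' 'vOTU_003' | strip_cat_suffix
```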


In [None]:
scripts/general/dramv_compile_summary_table.py \
-a DNA/3.viruses/7.gene_annotation/dramv_annotation/collated_dramv_annotations.tsv \
-d DNA/3.viruses/7.gene_annotation/dramv_distillation/amg_summary.tsv \
-t DNA/3.viruses/7.gene_annotation/dramv_annotation/collated_dramv_trnas.tsv \
-r DNA/3.viruses/7.gene_annotation/dramv_annotation/collated_dramv_rrnas.tsv \
-f DNA/3.viruses/5.checkv_vOTUs/vOTUs.fna \
-o DNA/3.viruses/7.gene_annotation/DRAMv.summary_table.tsv

***