# Taxonomy and gene-sharing network

***

## Introduction

Viral taxonomy is a challenging and constantly evolving field. For example, 2022 saw a major overhaul of taxonomic identification and ranking structures. For more information on viral taxonomy, see the website of the International Committee on Taxonomy of Viruses (ICTV): https://ictv.global/

In addition, predicting viral taxonomy remains challenging in metagenomics-based studies.

One leading approach is to generate a gene-sharing network that relates all viral genomes (or contigs) to reference genomes based on the extent of shared genes.

*vConTACT2* can generate a gene-sharing network that includes the viral RefSeq reference database, which can then be used to infer the potential taxonomy of the viral contigs in your data set. *vConTACT2* generates this output from predicted genes (e.g. the *prodigal* output file). Taxonomy can then be *inferred* (note: this is not a definitive taxonomy assignment) based on clustering with reference genomes.

Key outputs from *vConTACT2* are:

- `genome_by_genome_overview.csv`: clustering results for all included contigs (including the viral RefSeq genomes). In some cases, taxonomy can be inferred from how closely your query viral genome clusters with one or more reference genomes. The script `tax_predict.py` was written to automate taxonomy predictions *en masse* based on this output file. However, note that in practice, for metagenome-based data sets, this yields taxonomy predictions for only a very small percentage of contigs. It is generally necessary to examine the clustering statistics and visualise the network to assess the likely taxonomy of each viral genome of interest.
- `c1.ntw`: network file that can be visualised in, e.g., *Cytoscape*.
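As a quick first look at `genome_by_genome_overview.csv`, the sketch below tallies how many genomes fall into each clustering status. It assumes the file has a header row containing a column named `VC Status` and that no field contains a quoted comma; check this against your own output, as column names may differ between *vConTACT2* versions. The function name `vc_status_summary` is hypothetical.

```bash
vc_status_summary() {
    # $1: path to genome_by_genome_overview.csv
    # Assumption: header row includes a column named "VC Status"
    awk -F',' -v name="VC Status" '
        NR == 1 {
            # Locate the status column by its header name
            for (i = 1; i <= NF; i++) if ($i == name) c = i
            if (!c) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next
        }
        { n[$c]++ }                                # count rows per status
        END { for (s in n) print s "\t" n[s] }     # status <TAB> count
    ' "$1" | sort
}
```

For example, `vc_status_summary 4.taxonomy/vConTACT2/vConTACT2_Results/genome_by_genome_overview.csv` prints one tab-separated line per status. Note that the tally includes the reference genomes as well as your own contigs.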

Further information on *vConTACT2* can be found here: https://bitbucket.org/MAVERICLab/vcontact2/src/master/

A helpful worked example (including visualisation in *Cytoscape*) is available here: https://www.protocols.io/view/applying-vcontact-to-viral-sequences-and-visualizi-kqdg3pnql25z/v5?step=7
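Before loading the full network into *Cytoscape*, it can be useful to pull out the edges for a single contig of interest. The sketch below assumes `c1.ntw` is a whitespace-delimited edge list with one edge per line (source node, target node, edge weight); verify this with `head c1.ntw` first. The function name `ntw_neighbours` and the node IDs are illustrative only.

```bash
ntw_neighbours() {
    # $1: node (genome/contig) ID, $2: path to c1.ntw
    # Assumption: each line is "source target weight"
    awk -v id="$1" '
        $1 == id { print $2 "\t" $3 }   # edge listed as id -> neighbour
        $2 == id { print $1 "\t" $3 }   # edge listed as neighbour -> id
    ' "$2" |
        sort -u |          # drop duplicates if edges are listed in both directions
        sort -k2,2nr       # strongest connections first
}
```

For example, `ntw_neighbours my_contig_id 4.taxonomy/vConTACT2/vConTACT2_Results/c1.ntw` lists each neighbouring genome with its edge weight, which gives a rough idea of which reference genomes a contig is most strongly connected to.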

***

## Run vConTACT2

*vConTACT2* is not currently available on NeSI and must be installed in a conda environment (note: you may also need to install `cluster_one-1.0.jar` separately). It is advisable to discuss conda installs with a member of the NeSI support team.

The example below assumes a conda install is available and runs all required steps in one slurm script: *prodigal* gene prediction, generating the *vConTACT2* gene2genome lookup table, running *vConTACT2*, and running `tax_predict.py`.

NOTE:

- In this example, we first run *prodigal* to predict all genes in our viral contig data set. It may be possible to take the predicted genes directly from the *DRAM-v* output instead (assuming the formatting is appropriate for *vConTACT2*).
- Run `vcontact2 -h` to see the available RefSeq reference databases.
- The version of `tax_predict.py` below was based on the output generated by *vConTACT2* v0.11.3, and may not work with other versions of *vConTACT2*.
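Before submitting the job, it can save a failed run to confirm that every tool the script calls is actually on your `PATH`. The following is a minimal sketch; the function name `check_tools` is hypothetical, and the list of commands to check is up to you.

```bash
check_tools() {
    # Print one line per command that cannot be found on PATH;
    # return non-zero if any are missing.
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}
```

For example, after loading the modules and activating the conda environment used in the script below, run `check_tools prodigal vcontact2 vcontact2_gene2genome diamond mcl java` (`java` is included on the assumption that it is needed to run the ClusterONE `.jar`).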

#### If required: Install ClusterONE into conda/envs/vContact2/bin/


```bash
# Install ClusterONE
cd /path/to/conda/envs/vContact2/bin/
wget https://paccanarolab.org/static_content/clusterone/cluster_one-1.0.jar
```

#### Run vConTACT2

```bash
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 6_Taxonomy_vcontact2
#SBATCH --time 03:00:00
#SBATCH --mem=20GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH -e 6_Taxonomy_vcontact2.err
#SBATCH -o 6_Taxonomy_vcontact2.out
#SBATCH --profile=task

# Set up working directories
cd /working/dir
mkdir -p 4.taxonomy/vConTACT2

### Prodigal gene prediction

# Load dependencies
module purge
module load Prodigal/2.6.3-GCC-9.2.0

# Run prodigal
srun prodigal -p meta -q \
-i 1.viral_identification/6.checkv_vOTUs/vOTUs.checkv_filtered.fna \
-a 4.taxonomy/vConTACT2/vOTUs.faa

### vConTACT2 gene2genome

cd /working/dir

# Generate gene2genome mapping file for vConTACT2
module purge
module load Miniconda3
source activate vContact2
# Binary paths
export PATH="/path/to/conda/envs/vContact2/bin:$PATH"
# Dependency module loads
module load DIAMOND/0.9.32-GCC-9.2.0
module load MCL/14.137-gimkl-2020a

# run with prodigal output .faa file
vcontact2_gene2genome -p 4.taxonomy/vConTACT2/vOTUs.faa -o 4.taxonomy/vConTACT2/viral_genomes_g2g.csv -s 'Prodigal-FAA'

### Run vConTACT2

# Run vcontact2
srun vcontact2 \
-t 32 \
--raw-proteins 4.taxonomy/vConTACT2/vOTUs.faa \
--rel-mode Diamond \
--proteins-fp 4.taxonomy/vConTACT2/viral_genomes_g2g.csv \
--db 'ProkaryoticViralRefSeq211-Merged' \
--c1-bin /path/to/conda/envs/vContact2/bin/cluster_one-1.0.jar \
--output-dir 4.taxonomy/vConTACT2/vConTACT2_Results

### deactivate conda environment

conda deactivate

### tax_predict.py

# Load dependencies
module purge
module load Python/3.8.2-gimkl-2020a

# Set up working directories
cd /working/dir

# Run tax_predict.py
/path/to/scripts/tax_predict_vConTACT2.0.11.3.py \
-i 4.taxonomy/vConTACT2/vConTACT2_Results/genome_by_genome_overview.csv \
-o 4.taxonomy/vConTACT2/vOTUs_
```
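Once the job has finished, a manual look at which of your contigs share a viral cluster with annotated reference genomes can complement the automated predictions. The sketch below is not part of `tax_predict.py`; it is a rough illustration that assumes `genome_by_genome_overview.csv` has header columns named `Genome`, `Genus` and `VC`, that your own contigs carry `Unassigned` in the `Genus` column, and that no field contains a quoted comma. The function name `vc_reference_genera` is hypothetical.

```bash
vc_reference_genera() {
    # $1: path to genome_by_genome_overview.csv (read twice by awk)
    awk -F',' '
        FNR == 1 {
            # Map header names to column numbers (assumed: Genome, Genus, VC)
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Genome" in col) || !("Genus" in col) || !("VC" in col)) {
                print "expected columns not found" > "/dev/stderr"; exit 1
            }
            next
        }
        # Pass 1: collect the genera of annotated genomes in each viral cluster
        NR == FNR {
            vc = $(col["VC"]); genus = $(col["Genus"])
            if (vc != "" && genus != "" && genus != "Unassigned" && !seen[vc, genus]++)
                genera[vc] = ((vc in genera) ? genera[vc] ";" : "") genus
            next
        }
        # Pass 2: report unassigned genomes whose cluster contains such a genus
        $(col["Genus"]) == "Unassigned" && ($(col["VC"]) in genera) {
            print $(col["Genome"]) "\t" $(col["VC"]) "\t" genera[$(col["VC"])]
        }
    ' "$1" "$1"
}
```

For example, `vc_reference_genera 4.taxonomy/vConTACT2/vConTACT2_Results/genome_by_genome_overview.csv` prints the contig ID, its viral cluster, and the reference genera found in that cluster. Treat the result as a list of candidates to examine further, not as a taxonomy assignment.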


***