# 3.3 Viruses - Taxonomy prediction

## Software and versions used in this study

- vConTACT2 v2.0.11.3
- Cytoscape v3.8.2

## Additional custom scripts

Note: custom scripts have been tested with Python v3.11.6 and R v4.2.1 and may not be stable with other versions.

- scripts/viruses.identification/ictv_reconcile_refseq_taxonomy.py
- scripts/viruses.identification/vcontact2_update_refseq_taxonomy.py 
- scripts/viruses.identification/tax_predict_vConTACT2.0.11.3.tax_update_202309.py

*Required python packages: argparse, pandas, numpy, re*

## Additional databases used in this study

- IMG/VR v7.1
- viralRefSeq v211 (within vConTACT2)
- viralRefSeq v223 (for taxonomy updates)
- ICTV taxonomy MSL38.v3 (propagation of higher ranks for taxonomy updates to viralRefSeq)

***

## Virus taxonomy: Method 1. vConTACT2 clustering with viralRefSeq references

Predict virus taxonomy based on interactions and/or clustering within a gene-sharing network via vConTACT2.

Note: gene predictions can be generated here via *prodigal-gv* or taken from DRAMv output (which uses prodigal-gv in the background).

#### Run vConTACT2 with viralRefSeq references

In [None]:
mkdir -p DNA/3.viruses/6.taxonomy/vConTACT2

# gene prediction (can skip if you already have an appropriate proteins.faa file available)
prodigal-gv -p meta -q \
-i DNA/3.viruses/5.checkv_vOTUs/vOTUs.fna \
-a DNA/3.viruses/6.taxonomy/vConTACT2/proteins.faa 

# vConTACT2 gene2genome
vcontact2_gene2genome \
-p DNA/3.viruses/6.taxonomy/vConTACT2/proteins.faa \
-o DNA/3.viruses/6.taxonomy/vConTACT2/viral_genomes_g2g.csv \
-s 'Prodigal-FAA'

# Run vcontact2
vcontact2 -t 32 \
--raw-proteins DNA/3.viruses/6.taxonomy/vConTACT2/proteins.faa \
--rel-mode Diamond \
--proteins-fp DNA/3.viruses/6.taxonomy/vConTACT2/viral_genomes_g2g.csv \
--db 'ProkaryoticViralRefSeq211-Merged' \
--c1-bin /path/to/cluster_one-1.0.jar \
--output-dir DNA/3.viruses/6.taxonomy/vConTACT2/vConTACT2_Results

#### Update taxonomy in genome_by_genome_overview.csv

Virus taxonomy was overhauled in recent years (by [ICTV](https://ictv.global/)), and the version of vConTACT2 used here relies on a version of viralRefSeq (v211) that contains outdated taxonomy assignments. 

The following steps update viralRefSeq taxonomy assignments in the genome_by_genome_overview.csv file generated by vConTACT2 based on ICTV taxonomy updates. The process first updates taxonomy based on the latest viralRefSeq database (here, v223), then propagates higher-rank assignments based on ICTV's taxonomy table.



The file viralRefSeq_ICTV_reconciled_taxonomy.tsv was generated in house by merging [viralRefSeq_metadata.csv](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&SourceDB_s=RefSeq) (viralRefSeq v223) with the [ICTV latest taxonomy table](https://ictv.global/taxonomy) (ICTV_Master_Species_List_2022_MSL38.v3.xlsx):


In [None]:
scripts/viruses.identification/ictv_reconcile_refseq_taxonomy.py \
-i DNA/3.viruses/6.taxonomy/viralRefSeq_taxonomy/ICTV_Master_Species_List_2022_MSL38.v3.xlsx \
-r DNA/3.viruses/6.taxonomy/viralRefSeq_taxonomy/viralRefSeq_metadata.csv \
-o DNA/3.viruses/6.taxonomy/viralRefSeq_taxonomy/viralRefSeq_ICTV_reconciled_taxonomy.tsv
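
To illustrate what "propagating higher ranks" means, here is a minimal Python sketch of one possible reconciliation rule. It is not the logic of *ictv_reconcile_refseq_taxonomy.py*: the column names (`Realm` ... `Species`), the function names, and the matching rule (match on species name, fall back to genus) are all assumptions made for this example.

```python
# Sketch only: column names and the species-then-genus matching rule are
# assumptions for illustration, not the behaviour of the in-house script.
RANKS = ["Realm", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def build_ictv_lookup(ictv_rows, key):
    """Map each value of rank `key` to its ICTV lineage from Realm down to `key`."""
    upto = RANKS.index(key) + 1
    lookup = {}
    for row in ictv_rows:
        name = row.get(key, "")
        if name:
            # Keep the first lineage seen for each name.
            lookup.setdefault(name, {r: row.get(r, "") for r in RANKS[:upto]})
    return lookup


def reconcile(refseq_rows, ictv_rows):
    """Overwrite RefSeq ranks with ICTV ranks wherever ICTV provides a value."""
    by_species = build_ictv_lookup(ictv_rows, "Species")
    by_genus = build_ictv_lookup(ictv_rows, "Genus")
    reconciled = []
    for row in refseq_rows:
        lineage = (
            by_species.get(row.get("Species", ""))
            or by_genus.get(row.get("Genus", ""))
            or {}
        )
        merged = dict(row)
        for rank in RANKS:
            if lineage.get(rank):
                merged[rank] = lineage[rank]
        reconciled.append(merged)
    return reconciled
```

Records with no ICTV match are passed through unchanged, so outdated assignments remain visible rather than being silently dropped.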


Update viralRefSeq taxonomy assignments in the genome_by_genome_overview.csv file generated by vConTACT2 based on ICTV taxonomy updates (using viralRefSeq_ICTV_reconciled_taxonomy.tsv generated via *ictv_reconcile_refseq_taxonomy.py*). 

In [None]:
scripts/viruses.identification/vcontact2_update_refseq_taxonomy.py \
-i DNA/3.viruses/6.taxonomy/vConTACT2/vConTACT2_Results/genome_by_genome_overview.csv \
-t DNA/3.viruses/6.taxonomy/viralRefSeq_taxonomy/viralRefSeq_ICTV_reconciled_taxonomy.tsv \
-o DNA/3.viruses/6.taxonomy/vConTACT2/vConTACT2_Results/genome_by_genome_overview.tax_update.csv


#### Predict taxonomy of viruses based on vConTACT2 clustering with references 

*tax_predict.py* was written to generate taxonomy predictions for DNA virus genomes based on summaries of the taxonomy of all reference genomes that cluster together with your query viruses via vConTACT2 clustering analysis.

Note:

- Due to the parameters of vConTACT2 clustering, only viruses related at *approximately* the rank of genus cluster together. Therefore, only closely related viruses will receive taxonomy predictions here. To predict relatedness at higher ranks, other approaches are required (see below).
- The original script *tax_predict.py* breaks with updates made in vConTACT2 version ~2.0.11.3. *tax_predict_vConTACT2.0.11.3.tax_update_202309.py* is a modified version for vConTACT2.0.11.3 with additional taxonomy ranks added to also take into account the taxonomy reconciliation with ICTV.

In [None]:
scripts/viruses.identification/tax_predict_vConTACT2.0.11.3.tax_update_202309.py \
-i DNA/3.viruses/6.taxonomy/vConTACT2/vConTACT2_Results/genome_by_genome_overview.tax_update.csv \
-o DNA/3.viruses/6.taxonomy/vConTACT2/
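
The idea of summarising the taxonomy of references that share a cluster with each query can be sketched as follows. This is a simplified illustration, not the code of *tax_predict_vConTACT2.0.11.3.tax_update_202309.py*. It assumes each row of the overview table is a dict with `Genome` and `VC` keys plus one key per rank, that unassigned ranks are written as `Unassigned`, and that the caller supplies the set of query genome IDs; the function name is hypothetical.

```python
from collections import defaultdict


def predict_from_clusters(rows, query_ids, rank="Genus"):
    """For each query genome, join the distinct `rank` values of co-clustered references."""
    # Collect the reference taxa observed in each viral cluster (VC).
    refs_by_vc = defaultdict(set)
    for row in rows:
        vc = row.get("VC", "")
        taxon = row.get(rank, "")
        is_reference = row["Genome"] not in query_ids
        if vc and is_reference and taxon and taxon != "Unassigned":
            refs_by_vc[vc].add(taxon)

    # Assign each query the summary of its cluster, or "Unassigned" if none.
    predictions = {}
    for row in rows:
        if row["Genome"] in query_ids:
            taxa = sorted(refs_by_vc.get(row.get("VC", ""), set()))
            predictions[row["Genome"]] = "|".join(taxa) if taxa else "Unassigned"
    return predictions
```

In this sketch, a query that clusters with references from more than one taxon gets all of them joined with `|`, so ambiguous clusters stay visible for manual review.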

## Virus taxonomy: Method 2. Manual taxonomy prediction for higher ranks

The steps above generate taxonomy predictions only for viruses that cluster with reference genomes, i.e. those with approximately genus-level relatedness to reference sequences. In environmental metagenomics datasets, this is often a small subset of the viruses.

For those that don't cluster, an alternative or additional approach is to infer taxonomy affiliations at higher ranks by examining vConTACT2's protein-sharing network in Cytoscape and predicting taxonomy based on nearest neighbours and/or reference genomes that share an interaction with them (e.g. for each vOTU, if all its interactions in the network are with references from the same class or family, you might predict that the vOTU also belongs to that class or family).

For this study, manual predictions were made this way. For all vOTUs still without taxonomy assignments after the vConTACT2 clustering step above, these manual predictions were then added to the table output by *tax_predict_vConTACT2.0.11.3.tax_update_202309.py* to generate `tax_predict_table.with_manual_curation.tsv`.
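
In this study the neighbour-based predictions were made by eye in Cytoscape. The Python sketch below only illustrates the rule described above (all reference neighbours agree at a given rank) and could help shortlist candidates before manual inspection. The edge-list format, the `ref_taxonomy` mapping, and the function name are assumptions made for this example.

```python
from collections import defaultdict


def predict_from_neighbours(edges, ref_taxonomy, ranks=("Family", "Order", "Class")):
    """Predict the lowest rank at which all reference neighbours of a vOTU agree.

    edges: iterable of (node_a, node_b) pairs from the protein-sharing network.
    ref_taxonomy: dict mapping reference genome ID -> {rank: name}.
    Returns {vOTU: (rank, name)} for vOTUs with a unanimous prediction.
    """
    # Build an undirected adjacency map.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    predictions = {}
    for node, linked in neighbours.items():
        if node in ref_taxonomy:
            continue  # only query vOTUs receive predictions
        refs = [ref_taxonomy[n] for n in linked if n in ref_taxonomy]
        if not refs:
            continue  # no reference neighbours, nothing to infer
        for rank in ranks:  # ranks are tried from lowest to highest
            values = {ref.get(rank, "") for ref in refs}
            if len(values) == 1 and "" not in values:
                predictions[node] = (rank, values.pop())
                break
    return predictions
```

A strict unanimity rule is deliberately conservative: one conflicting or unclassified neighbour at a rank blocks the prediction at that rank, which leaves borderline cases for manual curation.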

## Virus taxonomy: Method 3. vConTACT2 clustering with IMG/VR references

An additional approach used in this study was to predict taxonomy affiliations based on relatedness to genomes in the IMG/VR database (filtered for high-quality viruses). Due to the size of the dataset, it was first necessary to split the data into 50 parts (partition.sh from BBMap), run each partition through vConTACT2, and then from all 50 runs subset any IMG/VR sequence that had any interaction with any vOTU (i.e. shared an edge). Finally, this subset was run back through vConTACT2 with the vOTUs again. From this, taxonomy predictions were made based on the taxonomy of IMG/VR genomes that clustered with each vOTU.
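
The subsetting step (keeping every IMG/VR sequence that shares an edge with any vOTU in any of the partitioned runs) can be sketched in Python as below. This is an illustration, not the code used in the study. It assumes each run's network is available as whitespace-delimited lines with the two node IDs in the first two fields; both function names are hypothetical.

```python
def parse_edges(lines):
    """Yield (node_a, node_b) from whitespace-delimited network lines.

    Assumes the first two fields are node IDs; later fields (e.g. an edge
    weight) are ignored, and blank or short lines are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[0], fields[1]


def imgvr_neighbours(edge_lists, votu_ids):
    """Return the IDs of all non-vOTU nodes that share an edge with any vOTU.

    edge_lists: one iterable of (node_a, node_b) pairs per partitioned run.
    votu_ids: set of query vOTU IDs.
    """
    keep = set()
    for edges in edge_lists:
        for a, b in edges:
            if a in votu_ids and b not in votu_ids:
                keep.add(b)
            elif b in votu_ids and a not in votu_ids:
                keep.add(a)
    return keep
```

The returned set can then be used to extract the matching IMG/VR sequences for the final combined vConTACT2 run.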

## Virus taxonomy: final set of vOTU taxonomy assignments

In this study, final taxonomy assignments were made based on the three methods outlined above, in the following order of priority: 

1. vConTACT2 clustering with viralRefSeq references
1. manual prediction at higher taxonomy ranks based on interactions observed within the visualised protein-sharing network
1. vConTACT2 clustering with IMG/VR genomes
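
The priority order above amounts to taking, for each vOTU, the first method that gives an assignment. A minimal Python sketch of that merge follows. The per-method dictionaries, the method labels, the `Unassigned` placeholder, and the function name are assumptions made for this example.

```python
def merge_by_priority(votu_ids, *methods):
    """Pick, per vOTU, the prediction from the highest-priority method that has one.

    methods: (label, predictions) pairs in decreasing priority, where
    predictions maps vOTU ID -> taxonomy string.
    Returns {vOTU: (taxonomy, label)}; the label is None if nothing was assigned.
    """
    final = {}
    for votu in votu_ids:
        final[votu] = ("Unassigned", None)
        for label, predictions in methods:
            taxon = predictions.get(votu, "")
            if taxon and taxon != "Unassigned":
                final[votu] = (taxon, label)
                break  # lower-priority methods are not consulted
    return final


# Toy usage with placeholder names:
refseq = {"vOTU_1": "GenusA"}
manual = {"vOTU_1": "FamilyX", "vOTU_2": "ClassB"}
imgvr = {"vOTU_2": "FamilyC", "vOTU_3": "FamilyD"}
final = merge_by_priority(
    ["vOTU_1", "vOTU_2", "vOTU_3", "vOTU_4"],
    ("refseq_clustering", refseq),
    ("manual_network", manual),
    ("imgvr_clustering", imgvr),
)
```

Recording the method label next to each assignment keeps the source of every prediction traceable in the final table.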

***