# 1B. Viral Identification

### Simplified method: single tool (*VirSorter2*)

***

## NOTE: 

All examples are based on multiple individual assemblies (in this case, nine data sets assembled separately and labelled S1-S9, originating from nine separate samples). Scripts use loops and arrays to accommodate all nine data sets, based on the labelling schema S1-S9.

If you *only* have a single co-assembly, then modify appropriately to only run on that one data set.

If you have *both* individual assemblies and a co-assembly (and/or mini-co-assemblies) then modify the scripts appropriately to run on all data sets (you can treat the co-assembly as simply an additional assembly data set; so in this case, if you had nine sample assemblies and one co-assembly, you can run these as if you had 10 individual assemblies).

The examples that follow are based on assembled environmental metagenomics (Illumina HiSeq) data from multiple samples. However, the process is comparable for data generated from Oxford Nanopore long read sequencing of isolates.

In the case of data generated from Oxford Nanopore long read sequencing of isolates, note that:

- DNA extractions in the study used to generate the Nanopore data processing docs were from bacterial isolates grown in liquid media and spun down into a pellet for extraction. When it comes to viral identification, note that this process will therefore predominantly target *prophage* that are integrated into the host genome *at the time of DNA extraction*. Lytic viruses are less likely to be caught here. However, in our experience, there may also be some cases where assembled contigs are fully circular viral genomes, which may represent intracellular viruses that are not integrated into the genome (e.g. replicating at the time; or viruses that are intracellular but do not integrate into the genome (akin to a plasmid)), or extracellular viral particles caught within the pellet during centrifugation and DNA extraction.
- It may be preferable to skip the final dereplication step (via `Cluster_genomes_5.1.pl`). For individual isolate genomes, you will likely want to retain closely related viruses as distinct genomes rather than clustering them together (especially if they originated from different isolates, or are from the same isolate but one sequence is excised from an integrated prophage and another identical sequence is a circular genome, suggesting both integration and replication in the host).

***

## Index

- [1B.1 Introduction](#1B.1-Introduction)
- [1B.2 Prepare assembly files](#1B.2-Prepare-assembly-files)
- 1B.3 Identifying Viral Contigs
  - [1B.3.1 VirSorter2](#1B.3.1-Identifying-viral-contigs:-VirSorter2)
- [1B.4 Summary tables for each assembly](#1B.4-Summary-tables-for-each-assembly)
- [1B.5 Viral contigs: Filter by size](#1B.5-Filter-contigs)
- [1B.6 Viral contigs: Per sample QC and filtering](#1B.6-Per-sample-quality-assessment-and-additional-filtering)
- [1B.7 Viral contigs: Multiple assembly dereplication](#1B.7-Dereplication-across-samples)
- [1B.8 vOTUs: QC and filtering](#1B.8-vOTUs-assessment-and-additional-filtering)
- [1B.9 Final dereplicated vOTUs](#1B.9-Final-set-of-dereplicated-viral-contigs)
- [1B.10 Additional Resources](#1B.10-Additional-Resources)

***

## 1B.1 Introduction

Viral metagenomics is a rapidly progressing field, with new software released each year that aims to better identify and characterise viral genomic sequences from assembled metagenomic sequence reads.

Currently, the most commonly used methods are *VirSorter2*, *VIBRANT*, and *VirFinder* (or its deep learning successor, *DeepVirFinder*). Each tool has strengths and weaknesses and, as this is an evolving field, none is perfect. Recent studies use either one of these tools or a combination of several.

**The examples below use *VirSorter2*.**

For a more comprehensive method that uses *VirSorter2*, *VIBRANT*, and/or *DeepVirFinder* in combination, see the alternative document, **Viromics_WGS_3A_Viral_Identification_Full**

##### *VirSorter2*

- *VirSorter2* combines a reference database approach based on predicted protein homology with a search for a number of pre-defined metrics based on known viral genomic features. *VirSorter2* has been designed to target dsDNA phage, ssDNA and RNA viruses, and the viral groups *Nucleocytoviricota* (NCLDV) and *Lavidaviridae*.
- paper: https://doi.org/10.1186/s40168-020-00990-y (original *VirSorter* paper: https://peerj.com/articles/985/)
- github: https://github.com/jiarong/VirSorter2

##### *VIBRANT*

- *VIBRANT* uses a machine learning approach based on protein similarity (non-reference-based similarity searches with multiple HMM sets), and is in principle applicable to bacterial and archaeal DNA and RNA viruses, integrated proviruses (which are excised from contigs by *VIBRANT*), and eukaryotic viruses.
- paper: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00867-0
- github: https://github.com/AnantharamanLab/VIBRANT

##### *DeepVirFinder*

- *DeepVirFinder* uses a machine learning approach based on k-mer frequencies. Having developed a database of the differences in k-mer frequencies between prokaryote and viral genomes, *DeepVirFinder* examines assembled contigs and identifies whether their k-mer frequencies are comparable to known viruses in the database, using this to predict viral genomic sequence.
- This method has some limitations based on the viruses that were included when building the database (bacterial DNA viruses, but very few archaeal viruses, and, at least in some versions of the software, no eukaryotic viruses). However, tools are also provided to build your own database should you wish to develop an expanded one.
- Due to its distinctive k-mer frequency-based approach, *DeepVirFinder* may also be able to identify some novel viruses overlooked by tools such as *VIBRANT* or *VirSorter2*. However, it will also likely return many more false positives, and so requires more careful curation.
- *DeepVirFinder* also no longer appears to be actively developed or supported, so may be outdated compared to, for example, *VirSorter2*.
- Original VirFinder paper: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0283-5
- github: https://github.com/jessieren/DeepVirFinder

**At the time of writing, if you were to pick one tool, we would recommend *VirSorter2***, as this appears to have the most recent ongoing development work and also ties in nicely with *DRAM-v* (for viral annotation).

NOTE: 

- As this version (Viromics_WGS_3B_Viral_Identification_Simplified) only uses one tool to identify viral sequences, dereplication *across the multiple tools* is *not* necessary here. However, dereplication *across assemblies* (via `Cluster_genomes_5.1.pl`) still is.


***

## 1B.2 Prepare assembly files


#### Optional: Add assembly ID to contig headers

If you now have data from multiple assemblies, it can be useful to add the assembly ID to contig headers to avoid conflicts downstream (on very rare occasions, contigs from different assemblies can end up with identical contig headers), and to make it easier to spot where contigs of interest originated from after dereplication. 

In [None]:
cd /working/dir
mkdir -p /path/to/wgs/assembly/2.spades_assembly_edit

# Individual sample assemblies
for i in {1..9}; do
    sed "s/^>/>S${i}_/" /path/to/wgs/assembly/1.spades_assembly_S${i}/scaffolds.fasta > /path/to/wgs/assembly/2.spades_assembly_edit/S${i}.assembly.fasta
done 
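
For readers who prefer Python, the header edit above can be sketched as follows. This is a minimal illustration rather than part of the workflow: the function name `prefix_fasta_headers` and the toy records are assumptions made for the example. Only lines starting with `>` are changed; sequence lines pass through untouched.

```python
def prefix_fasta_headers(lines, prefix):
    """Yield FASTA lines with `prefix` inserted directly after the leading '>'."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


# Toy two-contig assembly (invented for illustration; real SPAdes headers are longer)
demo = [
    ">NODE_1_length_12_cov_3.5\n",
    "ACGTACGTACGT\n",
    ">NODE_2_length_4_cov_1.0\n",
    "ACGT\n",
]
renamed = list(prefix_fasta_headers(demo, "S1_"))
print("".join(renamed), end="")
```
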


#### Filter out short contigs

For downstream processing, it can be a good idea to filter out short contigs (for example, those less than 1000 or 2000 bp). *VIBRANT*, for example, recommends removing contigs < 1000 bp, as it then filters based on presence of 4 identified putative genes, rather than contig length. 

If you wish to filter out short contigs, you can do so via `seqmagick`:

In [None]:
# Set up working directories
cd /working/dir
mkdir -p /path/to/wgs/assembly/2.spades_assembly_edit_m1000

# Load seqmagick
module purge
module load seqmagick/0.7.0-gimkl-2017a-Python-3.6.3

## Filter out contigs < 1000 bp using seqmagick
for i in {1..9}; do
    seqmagick convert --min-length 1000 /path/to/wgs/assembly/2.spades_assembly_edit/S${i}.assembly.fasta /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${i}.assembly.m1000.fasta
done
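
To make the effect of the length filter concrete, here is a small pure-Python sketch of the same idea. It assumes the threshold is inclusive (sequences of exactly 1000 bp are kept); the helper names `read_fasta` and `filter_min_length` and the toy sequences are hypothetical and not part of `seqmagick`.

```python
def read_fasta(lines):
    """Parse FASTA text lines into (header, sequence) tuples."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(records, min_length):
    """Keep records whose sequence is at least `min_length` bp long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy contigs of 1200 bp, 1000 bp (split over two lines) and 999 bp
demo = [">S1_a", "ACGT" * 300, ">S1_b", "ACGT" * 200, "ACGT" * 50, ">S1_c", "A" * 999]
kept = filter_min_length(read_fasta(demo), 1000)
print([(h, len(s)) for h, s in kept])
```
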


***

## 1B.3.1 Identifying viral contigs: VirSorter2

In the steps below, we first run *VirSorter2* with a minimum score threshold of 0.75. Python code is then provided to filter these results, retaining only contigs with a score >= 0.9 *or* at least one viral hallmark gene. There are no set rules on how best to set filter thresholds here, but these roughly follow the screening thresholds discussed in the VirSorter2 protocols page: https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-5qpvoyqebg4o/v3

NOTE: 

- **Due to an issue with the current install in NeSI, `module unload XALT` *must* be run before loading the VirSorter module!**
- While *VirSorter2* is available as a NeSI module, the reference databases must be downloaded separately (~10 GB).
- In the current version (2.2.3), `--include-groups` must be included with all available groups listed. I believe an include-all option will be added in later versions to replace this.
- The last line (`--config LOCAL_SCRATCH=${TMPDIR:-/tmp}`) is recommended when running on NeSI (it relates to how some of the temp files for the HMM profiles are handled).

If you don't already have the databases available, download these via `virsorter setup`.

In [None]:
# Set up database directory (you may want to name the database directory with the date downloaded)
cd /path/to/Databases/
mkdir -p virsorter2_database

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

# Download databases
virsorter setup -d virsorter2_database

Run *VirSorter2* on each assembly file


In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_vsort2
#SBATCH --time 12:00:00
#SBATCH --mem=20GB
#SBATCH --array=1-9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH -e 3_vsort2_%a.err
#SBATCH -o 3_vsort2_%a.out

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

# Set up working directories
cd /working/dir
mkdir -p 1.viral_identification/1.virsorter2
 
## run virsorter2
srun virsorter run -j 32 \
-i /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta \
-d /path/to/Databases/virsorter2_database/ \
--min-score 0.75 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-w 1.viral_identification/1.virsorter2/S${SLURM_ARRAY_TASK_ID} -l S${SLURM_ARRAY_TASK_ID} \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}
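
Once the array job has finished, it can be worth confirming that every task produced its score table before moving on. The sketch below is one way to do this; the function name `missing_virsorter2_outputs` is an assumption for illustration, and the file layout (`S<n>/S<n>-final-viral-score.tsv`) simply follows the naming used in this document.

```python
from pathlib import Path


def missing_virsorter2_outputs(base_dir, sample_numbers):
    """Return the expected score tables that are absent or empty."""
    missing = []
    for n in sample_numbers:
        score_file = Path(base_dir) / f"S{n}" / f"S{n}-final-viral-score.tsv"
        if not score_file.is_file() or score_file.stat().st_size == 0:
            missing.append(str(score_file))
    return missing


# Example call, using the directory layout from this document:
# missing_virsorter2_outputs("1.viral_identification/1.virsorter2", range(1, 10))
```

An empty list means all samples have a non-empty score table; any listed paths point to array tasks whose `.err` logs are worth checking.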


Filter *VirSorter2* results to only retain contigs with a score >= 0.9 *or* at least one viral hallmark gene identified.

In [None]:
cd /working/dir

# Load python3
module purge
module load Python/3.8.2-gimkl-2020a
python3
import pandas as pd
import numpy as np

## Filter results (score >= 0.9 OR hallmark > 0)
# Loop through all samples, and output new 'SampleX-final-viral-score_filt_0.9.tsv' file for each.
for number in range(1, 10):
    # Load ...final-viral-score.tsv file
    vsort_score = pd.read_csv('1.viral_identification/1.virsorter2/S'+str(number)+'/S'+str(number)+'-final-viral-score.tsv', sep='\t')
    # Filter by score threshold and/or hallmark gene (e.g. score >= 0.9 OR hallmark > 0)
    vsort_score = vsort_score[np.logical_or.reduce((vsort_score['max_score'] >= 0.9, vsort_score['hallmark'] > 0))]
    # Write out filtered file
    vsort_score.to_csv('1.viral_identification/1.virsorter2/S'+str(number)+'/S'+str(number)+'-final-viral-score_filt_0.9.tsv', sep='\t', index=False)

quit()
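
The filter rule itself is simple enough to show on a few toy rows. The sketch below restates the score-or-hallmark logic in plain Python; the helper `keep_contig`, the contig names and the values are all invented for illustration.

```python
def keep_contig(max_score, hallmark, min_score=0.9):
    """Return True if a VirSorter2 row passes the score-or-hallmark rule."""
    return max_score >= min_score or hallmark > 0


# Toy rows: (seqname, max_score, hallmark); values are invented for illustration
rows = [
    ("S1_contig_a||full", 0.97, 0),  # high score, no hallmark gene: kept
    ("S1_contig_b||full", 0.80, 2),  # lower score, but hallmark genes: kept
    ("S1_contig_c||full", 0.80, 0),  # lower score, no hallmark gene: dropped
]
kept = [name for name, score, hallmark in rows if keep_contig(score, hallmark)]
print(kept)
```
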


***

## 1B.4 Summary tables for each assembly

Generate summary tables of all contigs putatively identified as 'viral' (or containing viral sequence) for each assembly.

The script `summarise_viral_contigs.py` is available in `.../scripts/`

- This takes the files output from *VirSorter2*, *VIBRANT*, and/or *DeepVirFinder* and generates a summary table based on contig IDs.
- Not all inputs are required (e.g. if you excluded running *DeepVirFinder*, this can be omitted here)

In [None]:
# Working dir
cd /working/dir/1.viral_identification/
mkdir -p 2.summary_tables

# modules
module purge
module load Python/3.9.9-gimkl-2020a

# Run script
for i in {1..9}; do
    /path/to/scripts/summarise_viral_contigs.py \
    --virsorter2 1.virsorter2/S${i}/S${i}-final-viral-score_filt_0.9.tsv \
    --out_prefix 2.summary_tables/S${i}.viral_contigs.summary_table
done


***

## 1B.5 Filter contigs 

Filter putative 'viral' contigs to remove those < 5000 bp

Confidence in the viral calls of each of the tools generally increases with contig length. As such, various studies have only retained contigs greater than a set threshold (e.g. 3,000 bp, 5,000 bp, or 10,000 bp). This also can assist with reducing the dataset to a manageable size in large complex metagenome data sets.

The example below filters to only retain contigs >= 5000 bp.

In [None]:
# Load dependencies
module purge
module load seqmagick/0.7.0-gimkl-2017a-Python-3.6.3

# Working directory
cd /working/dir

for i in {1..9}; do
    seqmagick convert --min-length 5000 \
    1.viral_identification/1.virsorter2/S${i}/S${i}-final-viral-combined.fa \
    1.viral_identification/1.virsorter2/S${i}/S${i}-final-viral-combined.filt.fa
done
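
Note that the length filter above is applied to `S${i}-final-viral-combined.fa`, which holds the sequences reported by *VirSorter2* at the 0.75 minimum score. If you also want the FASTA file restricted to the contigs retained in `S${i}-final-viral-score_filt_0.9.tsv`, one option is to subset by contig ID. The following is a minimal sketch only: it assumes the `seqname` column of the score table matches the first word of each FASTA header (worth checking on your own output), and the function names are illustrative rather than part of any tool.

```python
import csv
import io


def ids_from_score_table(handle, id_column="seqname"):
    """Collect contig IDs from a tab-delimited score table."""
    return {row[id_column] for row in csv.DictReader(handle, delimiter="\t")}


def subset_records(records, keep_ids):
    """Keep (header, sequence) records whose ID (first word of header) is in keep_ids."""
    return [(h, s) for h, s in records if h.split()[0] in keep_ids]


# Toy inputs standing in for the filtered score table and the parsed FASTA records
table = io.StringIO("seqname\tmax_score\thallmark\nS1_a||full\t0.97\t0\n")
records = [("S1_a||full", "ACGT"), ("S1_b||full", "GGCC")]
kept = subset_records(records, ids_from_score_table(table))
print(kept)
```
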


***

## 1B.6 Per-sample quality assessment and additional filtering

#### CheckV perSample: All samples

*checkV* is a tool that has been developed as an analogue to *checkM*. *checkV* provides various statistics about the putative viral contigs data set, including length, gene count, viral and host gene counts, and estimated completeness and contamination.

We can run *checkV* on the size-filtered viral contigs identified by *VirSorter2* for each sample, and use the results from *checkV* as an additional filtering step prior to our final dereplication.

Run *checkV*:

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_checkv_perSample
#SBATCH --time 00:20:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --array=1-9
#SBATCH --cpus-per-task=16
#SBATCH -e 3_checkv_perSample_%a.err
#SBATCH -o 3_checkv_perSample_%a.out
#SBATCH --profile=task

# Load dependencies
module purge
module load CheckV/0.7.0-gimkl-2020a-Python-3.8.2

# Set up working directories
cd /working/dir
mkdir -p 1.viral_identification/4.perSample_checkv

# Run main analyses 
checkv_in="1.viral_identification/1.virsorter2/S${SLURM_ARRAY_TASK_ID}/S${SLURM_ARRAY_TASK_ID}-final-viral-combined.filt.fa"
checkv_out="1.viral_identification/4.perSample_checkv/S${SLURM_ARRAY_TASK_ID}"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet


#### Concatenate output fasta files (viruses.fna and proviruses.fna) for downstream use

- Note: this script also modifies contig headers for readability in any cases where *checkV* has trimmed any residual host sequence off the end of integrated prophage sequence (this is separate and additional to previous prophage excision by *VIBRANT* or *VirSorter2*).

In [None]:
cd /working/dir

for i in {1..9}; do
    # concatenate viruses and prophage files
    cat 1.viral_identification/4.perSample_checkv/S${i}/viruses.fna 1.viral_identification/4.perSample_checkv/S${i}/proviruses.fna > 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs.fna 
    # modify checkv prophage contig headers
    sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs.fna
done
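
To see what the chain of `sed` substitutions does to a single header, here is an equivalent sketch in Python. The example provirus header (`<contig>_1 <start>-<end>/<length>`) is an assumption about how *checkV* labels trimmed proviruses, and the contig name is invented; check it against your own `proviruses.fna` before relying on it.

```python
import re


def tidy_checkv_header(header):
    """Apply the same substitutions as the sed command to one header line."""
    header = re.sub(r"\s", "__excised_start_", header)  # whitespace before coordinates
    header = header.replace("-", "_end_")               # start-end separator
    header = header.replace("/", "_len_")               # end/length separator
    header = header.replace("|", "_", 1)                # first pipe becomes an underscore
    header = header.replace("|", "")                    # any remaining pipes are dropped
    return header


# Pass headers without the trailing newline, as sed works line by line
print(tidy_checkv_header(">S1_NODE_5||full_1 1001-21000/25000"))
# -> >S1_NODE_5_full_1__excised_start_1001_end_21000_len_25000
print(tidy_checkv_header(">S1_NODE_7||full"))
# -> >S1_NODE_7_full
```

As with the `sed` version, any hyphen in a contig name would also be rewritten to `_end_`, so this relies on contig names containing no hyphens.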


#### Filter putative 'viral' contigs by checkv results

The script `checkv_filter_contigs.py` (available in `.../scripts/`) further filters the sets of viral contigs based on *checkV* results. By default, this retains only contigs where: (viral_genes > 0) OR (viral_genes = 0 AND host_genes = 0). This script takes the *checkV* outputs (including the proviruses and viruses fna files, and quality summary), and returns fna and quality summary files with 'filtered' appended to the file name.


In [None]:
# Load dependencies
module purge
module load Python/3.8.2-gimkl-2020a

# Working directory
cd /working/dir

for i in {1..9}; do
    /path/to/scripts/checkv_filter_contigs.py \
    --checkv_dir_input 1.viral_identification/4.perSample_checkv/S${i}/ \
    --output_prefix 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs
done
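
The default retention rule described above can be written as a one-line predicate. This sketch only illustrates the stated rule on invented gene counts; it is not the code of `checkv_filter_contigs.py`, and the function name is hypothetical.

```python
def passes_checkv_filter(viral_genes, host_genes):
    """Keep contigs with any viral gene, or with neither viral nor host genes."""
    return viral_genes > 0 or (viral_genes == 0 and host_genes == 0)


# Toy (viral_genes, host_genes) pairs, invented for illustration
examples = [(3, 1), (0, 0), (0, 2)]
decisions = [passes_checkv_filter(v, h) for v, h in examples]
print(decisions)  # -> [True, True, False]
```

In other words, the only contigs removed are those with host genes but no viral genes.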


***

## 1B.7 Dereplication across samples

Contigs identified so far now need to be dereplicated into a single final set of viral contigs. This final set of representative (clustered) contigs can be referred to as 'viral operational taxonomic units' (vOTUs), representing distinct viral 'populations'. 

Here we can use the `Cluster_genomes_5.1.pl` script developed by Simon Roux's group: https://github.com/simroux/ClusterGenomes

This script clusters contigs based on sequence similarity thresholds, returning a representative (vOTU) sequence for each cluster. The following paper recommends a threshold of 95% similarity over 85% of the sequence length, based on currently available data: https://doi.org/10.1038/nbt.4306

In the case where multiple assemblies have been analysed, this step is necessary to reduce the viral data down to one representative set for all sample assemblies (which is important for read mapping to assess differential coverage across assemblies, for example). Where only one assembly data set has been processed, this step is still useful to reduce the data down into meaningful units for downstream analyses (i.e. viral 'populations' rather than unique sequences).
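
As a worked illustration of the 95% / 85% threshold, the sketch below checks whether a pair of contigs would meet it, given a summary of their alignment. This is a deliberately simplified assumption: the clustering script derives these values from *mummer* alignments, whereas here `aligned_bp` and `identity_pct` are hypothetical inputs supplied directly, and the function name is invented.

```python
def same_votu(len_a, len_b, aligned_bp, identity_pct,
              min_identity=95.0, min_coverage=85.0):
    """True if the alignment covers >= min_coverage % of the shorter
    sequence at >= min_identity % identity."""
    shorter = min(len_a, len_b)
    coverage_pct = 100.0 * aligned_bp / shorter
    return identity_pct >= min_identity and coverage_pct >= min_coverage


# 9,000 of the shorter contig's 10,000 bp aligned (90%) at 97.2% identity
print(same_votu(40000, 10000, 9000, 97.2))   # -> True
# Only 80% of the shorter contig aligned, despite 99% identity
print(same_votu(40000, 10000, 8000, 99.0))   # -> False
# Full-length alignment, but identity below 95%
print(same_votu(10000, 10000, 10000, 94.0))  # -> False
```
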

Note: 

- Download the `Cluster_genomes_5.1.pl` script from https://github.com/simroux/ClusterGenomes
- *mummer* is also required. Download the latest version and pass the path to its bin directory to `Cluster_genomes_5.1.pl` via `-d` when running the script below.

#### If required: Install *Cluster_genomes_5.1.pl* and *mummer*

In [None]:
# Install Cluster_genomes_5.1.pl
mkdir -p /path/to/Software/Cluster_genomes
cd /path/to/Software/Cluster_genomes
wget https://raw.githubusercontent.com/simroux/ClusterGenomes/master/Cluster_genomes_5.1.pl
chmod +x Cluster_genomes_5.1.pl

# Install mummer4
mkdir -p /path/to/Software/mummer_v4.0.0/
cd /path/to/Software/mummer_v4.0.0
wget https://github.com/mummer4/mummer/releases/download/v4.0.0rc1/mummer-4.0.0rc1.tar.gz
tar -xzf mummer-4.0.0rc1.tar.gz
cd mummer-4.0.0rc1/
./configure --prefix=/path/to/Software/mummer_v4.0.0
make
make install


#### File prep: Concatenate multiple sample fasta files together for `Cluster_genomes_5.1.pl`

In [None]:
cd /working/dir
mkdir -p 1.viral_identification/5.cluster_vOTUs/

# Create (or empty) the concatenated file before appending to it
> 1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.fna
for i in {1..9}; do
    cat 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs_filtered.fna >> 1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.fna
done

# Sort by sequence size
module purge
module load BBMap/38.95-gimkl-2020a 
sortbyname.sh in=1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.fna out=1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.sorted.fna length descending


#### Run Cluster_genomes_5.1.pl

Run `Cluster_genomes_5.1.pl` at a minimum identity of 95% over at least 85% of the shortest contig.


In [None]:
# Load dependencies
module purge

# Set up working directories
cd /working/dir/1.viral_identification/5.cluster_vOTUs

# Run
/path/to/Software/Cluster_genomes/Cluster_genomes_5.1.pl \
-f viral_contigs_allSamples.sorted.fna \
-d /path/to/Software/mummer_v4.0.0/bin/ \
-t 20 \
-c 85 \
-i 95


#### Check total number of clustered contigs (vOTUs)

In [None]:
cd /working/dir

# count contigs in cluster output file
grep -c ">" 1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.sorted_95-85.fna


#### Optional: Modify derep contig headers to be *vOTU_n*

- It can be useful for downstream processing to standardise the contig headers of the cluster representative sequences; for example, to replace all headers with *vOTU_n*. 
- The code below replaces all headers with *vOTU_n* and creates a table file of *vOTU_n* ids against the original full contig headers (of the *representative* sequences from each cluster).
  - Note: `Cluster_genomes_5.1.pl` also outputs a file matching cluster representative sequences to each of the sequences that are contained in the cluster.
- *Optional*: you may also wish to omit this step here and instead run it after having calculated differential coverage across sample assemblies (via read mapping), to enable first ordering the contigs by abundance (coverage), and *then* generating *vOTU_n* headers.
- This may ultimately be put into a script for ease of use, but for now we can use the python code below.

In [None]:
cd /working/dir/1.viral_identification/5.cluster_vOTUs

# LOAD PYTHON
module purge
module load Python/3.8.2-gimkl-2020a
python3
from Bio.SeqIO.FastaIO import SimpleFastaParser

fasta_in = 'viral_contigs_allSamples.sorted_95-85.fna'
fasta_out = 'vOTUs.fna'
lookup_table_out = 'vOTUs_lookupTable.txt'

# Read in fasta file, looping through each contig
# rename contig headers with incrementing vOTU_n headers
# write out new vOTUs.fna file and tab-delimited table file of matching vOTU_n and contigID headers.
with open(fasta_in, 'r') as read_fasta, \
     open(fasta_out, 'w') as write_fasta, \
     open(lookup_table_out, 'w') as write_table:
    write_table.write("vOTU\tcluster_rep_contigID\n")
    for i, (name, seq) in enumerate(SimpleFastaParser(read_fasta), start=1):
        write_table.write(f"vOTU_{i}\t{name}\n")
        write_fasta.write(f">vOTU_{i}\n{seq}\n")

quit()
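
If you choose to defer renaming until coverage has been calculated, the ordering step might look like the sketch below. It assumes you already have a per-contig coverage value from read mapping (not generated in this document); the function name `assign_votu_ids`, the `coverage` dictionary, and the contig names are all hypothetical.

```python
def assign_votu_ids(contig_ids, coverage):
    """Return [(vOTU_n, contig_id)] ordered by descending coverage.

    Contigs missing from `coverage` are treated as having zero coverage.
    """
    ordered = sorted(contig_ids, key=lambda c: coverage.get(c, 0.0), reverse=True)
    return [(f"vOTU_{n}", c) for n, c in enumerate(ordered, start=1)]


# Toy cluster representatives and coverage values, invented for illustration
contig_ids = ["S1_a", "S3_b", "S2_c"]
coverage = {"S1_a": 12.5, "S3_b": 80.0, "S2_c": 33.1}
lookup = assign_votu_ids(contig_ids, coverage)
print(lookup)
# -> [('vOTU_1', 'S3_b'), ('vOTU_2', 'S2_c'), ('vOTU_3', 'S1_a')]
```

The resulting pairs could then be written out in the same two-column layout as the lookup table above.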


***


## 1B.8 vOTUs assessment and additional filtering

### CheckV on vOTUs

Re-run *CheckV*, this time on the dereplicated contig set (vOTUs) to output *checkV* stats

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_checkv_vOTUs
#SBATCH --time 01:00:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e 3_checkv_vOTUs.err
#SBATCH -o 3_checkv_vOTUs.out
#SBATCH --profile=task

# Load dependencies
module purge
module load CheckV/0.7.0-gimkl-2020a-Python-3.8.2

# Set up working directories
cd /working/dir/1.viral_identification
mkdir -p 6.checkv_vOTUs

# Run main analyses 
checkv_in="5.cluster_vOTUs/vOTUs.fna"
checkv_out="6.checkv_vOTUs"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet


#### CheckV: Concatenate output fasta files (viruses.fna and proviruses.fna)

In [None]:
cd /working/dir/1.viral_identification

# concatenate viruses and prophage files
cat 6.checkv_vOTUs/viruses.fna  6.checkv_vOTUs/proviruses.fna >  6.checkv_vOTUs/vOTUs.checkv.fna 
# modify checkv prophage contig headers
sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 6.checkv_vOTUs/vOTUs.checkv.fna


#### Filter vOTUs based on *checkV* results

**NOTE**:

- As `checkv_filter_contigs.py` has already been run on all individual contigs (in the per-sample QC step above), this step may be redundant here.
- If you skip it, simply proceed with the concatenated file above (`vOTUs.checkv.fna`), rather than the `..._filtered...` files output by `checkv_filter_contigs.py`.


In [None]:
# Load dependencies
module purge
module load Python/3.8.2-gimkl-2020a

# Set up working directories
cd /working/dir/1.viral_identification

# Run for vOTUs
/path/to/scripts/checkv_filter_contigs.py \
    --checkv_dir_input 6.checkv_vOTUs/ \
    --output_prefix 6.checkv_vOTUs/vOTUs.checkv


***

## 1B.9 Final set of dereplicated viral contigs

At this stage we have a final set of dereplicated viral contigs for all downstream analyses.

Key files include:

- Final dereplicated viral contig data set: `/working/dir/1.viral_identification/6.checkv_vOTUs/vOTUs.checkv_filtered.fna` 
  - (Or `vOTUs.checkv.fna`, if the final `checkv_filter_contigs.py` step was not run)
- *checkV* stats for dereplicated viral contigs: `/working/dir/1.viral_identification/6.checkv_vOTUs/vOTUs.checkv_filtered_quality_summary.txt`

***

## 1B.10 Additional Resources

#### Notes on manual curation of vOTUs and general resources

Some very helpful notes on manual curation are available at the *VirSorter2* protocols page here: https://doi.org/10.17504/protocols.io.bwm5pc86

Further valuable reading on Minimum Information about an Uncultivated Virus Genome (MIUViG): https://doi.org/10.1038/nbt.4306

A great resource on standards in viromics: https://doi.org/10.7717/peerj.11447

***