# 1A. Viral Identification

### Full method: *VirSorter2*, *VIBRANT*, and/or *DeepVirFinder*

***

## NOTE: 

All examples are based on multiple individual assemblies (in this case, nine data sets assembled separately, originating from nine separate samples and labelled S1-S9). Scripts use loops and arrays to accommodate all nine data sets, following the S1-S9 labelling scheme.


If you *only* have a single co-assembly, then modify appropriately to only run on that one data set.


If you have *both* individual assemblies and a co-assembly (and/or mini-co-assemblies) then modify the scripts appropriately to run on all data sets (you can treat the co-assembly as simply an additional assembly data set; so in this case, if you had nine sample assemblies and one co-assembly, you can run these as if you had 10 individual assemblies).

***

## Index

- [1A.1 Introduction](#1A.1-Introduction)
- [1A.2 Prepare assembly files](#1A.2-Prepare-assembly-files)
- 1A.3 Identifying Viral Contigs
  - [1A.3.1 VirSorter2](#1A.3.1-Identifying-viral-contigs:-VirSorter2)
  - [1A.3.2 VIBRANT](#1A.3.2-Identifying-viral-contigs:-VIBRANT)
  - [1A.3.3 DeepVirFinder](#1A.3.3-Identifying-viral-contigs:-DeepVirFinder)
- [1A.4 Summary tables for each assembly](#1A.4-Summary-tables-for-each-assembly)
- [1A.5 Viral contigs: per sample dereplication](#1A.5-Viral-contigs:-per-sample-dereplication)
- [1A.6 Viral contigs: Per sample QC and filtering](#1A.6-Per-sample-quality-assessment-and-additional-filtering)
- [1A.7 Viral contigs: Multiple assembly dereplication](#1A.7-Dereplication-across-samples)
- [1A.8 vOTUs: QC and filtering](#1A.8-vOTUs-assessment-and-additional-filtering)
- [1A.9 Final dereplicated vOTUs](#1A.9-Final-set-of-dereplicated-viral-contigs)
- [1A.10 Additional Resources](#1A.10-Additional-Resources)

***

## 1A.1 Introduction

Viral metagenomics is a rapidly progressing field, and new software is constantly being developed and released to better identify and characterise viral genomic sequences from assembled metagenomic sequence reads.

Currently, the most commonly used methods are *VirSorter2*, *VIBRANT*, and *VirFinder* (or its deep learning successor, *DeepVirFinder*). Each tool has strengths and weaknesses and, given that this is an evolving field, none is perfect. A number of recent studies use either one of these tools or a combination of several.

**The examples below use a combination of three tools: *VirSorter2*, *VIBRANT*, and/or *DeepVirFinder***

For a simplified method that only uses *VirSorter2*, see the alternative document, **Viromics_WGS_3B_Viral_Identification_Simplified**

##### *VirSorter2*

- *VirSorter2* uses a reference database-based approach built on predicted protein homology, together with a number of pre-defined metrics based on known viral genomic features. *VirSorter2* has been designed to target dsDNA phages, ssDNA and RNA viruses, and the viral groups *Nucleocytoviricota* and *Lavidaviridae*.
- paper (original *VirSorter*): https://peerj.com/articles/985/
- github: https://github.com/jiarong/VirSorter2

##### *VIBRANT*

- *VIBRANT* uses a machine learning approach based on protein similarity (non-reference-based similarity searches with multiple HMM sets), and is in principle applicable to bacterial and archaeal DNA and RNA viruses, integrated proviruses (which are excised from contigs by *VIBRANT*), and eukaryotic viruses.
- paper: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00867-0
- github: https://github.com/AnantharamanLab/VIBRANT

##### *DeepVirFinder*

- *DeepVirFinder* uses a machine learning approach based on k-mer frequencies. Having developed a database of the differences in k-mer frequencies between prokaryotic and viral genomes, it examines assembled contigs and assesses whether their k-mer frequencies are comparable to those of known viruses in the database, using this to predict viral genomic sequence.
- This method has some limitations based on the viruses that were included when building the database (bacterial DNA viruses, but very few archaeal viruses, and, at least in some versions of the software, no eukaryotic viruses). However, tools are also provided to build your own database should you wish to develop an expanded one. 
- Due to its distinctive k-mer frequency-based approach, *DeepVirFinder* may also be able to identify some novel viruses overlooked by tools such as *VIBRANT* or *VirSorter2*. However, it is also likely to return many more false positives, and so requires more careful curation.
- *DeepVirFinder* also appears to no longer be under active development or supported, so may be outdated compared to, for example, *VirSorter2*.
- Original VirFinder paper: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0283-5
- github: https://github.com/jessieren/DeepVirFinder

**At the time of writing, if you were to pick one tool, we would recommend *VirSorter2***, as this appears to have the most recent ongoing development work and also ties in nicely with *DRAM-v* (for viral annotation).

NOTE: 

- If you apply more than one tool at this step, dereplication *across the multiple tools* will be necessary first, before proceeding to dereplication *across assemblies* 
- In the example below, the former (across all tools for *each* assembly) is achieved via `summarise_viral_contigs.py` and `virome_per_sample_derep.py`, and the latter (across all assemblies) via `Cluster_genomes_5.1.pl`.


***

## 1A.2 Prepare assembly files


#### Optional: Add assembly ID to contig headers

If you now have data from multiple assemblies, it can be useful to add the assembly ID to contig headers to avoid conflicts downstream (on very rare occasions, contigs from different assemblies can end up with identical contig headers), and to make it easier to spot where contigs of interest originated from after dereplication. 

In [None]:
cd /working/dir
mkdir -p /path/to/wgs/assembly/2.spades_assembly_edit

# Individual sample assemblies
for i in {1..9}; do
    # Anchor on the start of the line so only the leading '>' of each header is modified
    sed "s/^>/>S${i}_/" /path/to/wgs/assembly/1.spades_assembly_S${i}/scaffolds.fasta > /path/to/wgs/assembly/2.spades_assembly_edit/S${i}.assembly.fasta
done 


#### Filter out short contigs

For downstream processing, it can be a good idea to filter out short contigs (for example, those less than 1000 or 2000 bp). *VIBRANT*, for example, recommends removing contigs < 1000 bp, as it then filters based on presence of 4 identified putative genes, rather than contig length. 

If you wish to filter out short contigs, you can do so via `seqmagick`:

In [None]:
# Set up working directories
cd /working/dir
mkdir -p /path/to/wgs/assembly/2.spades_assembly_edit_m1000

# Load seqmagick
module purge
module load seqmagick/0.7.0-gimkl-2017a-Python-3.6.3

## Filter out contigs < 1000 bp using seqmagick
for i in {1..9}; do
    seqmagick convert --min-length 1000 /path/to/wgs/assembly/2.spades_assembly_edit/S${i}.assembly.fasta /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${i}.assembly.m1000.fasta
done


***

## 1A.3.1 Identifying viral contigs: VirSorter2

In the steps below, we first run *VirSorter2* with a minimum score threshold of 0.75. A python script is then provided to filter these results to only retain contigs with a score >= 0.9 *or* with at least one viral hallmark gene identified. There are no set rules on how best to set filter thresholds here, but the values used here roughly follow the screening thresholds discussed in the VirSorter2 protocols page here: https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-5qpvoyqebg4o/v3

NOTE: 

- **Due to an issue with the current install in NeSI, `module unload XALT` *must* be run before loading the VirSorter module!**
- While *VirSorter2* is available as a NeSI module, the reference databases must be downloaded separately (~10 GB).
- In the current version (2.2.3) `--include-groups ...` must be included with all available groups listed. I believe an include-all option will be added in later versions to replace this.
- The last line (`--config LOCAL_SCRATCH=${TMPDIR:-/tmp}`) is recommended when running on NeSI (it relates to how some of the temp files for the HMM profiles are handled).

If you don't already have the databases available, download these via `virsorter setup`

In [None]:
# Set up database directory (you may want to name the database directory with the date downloaded)
cd /path/to/Databases/
mkdir -p virsorter2_database

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

# Download databases
virsorter setup -d virsorter2_database

Run *VirSorter2* on each assembly file


In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_vsort2
#SBATCH --time 12:00:00
#SBATCH --mem=20GB
#SBATCH --array=1-9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH -e 3_vsort2_%a.err
#SBATCH -o 3_vsort2_%a.out

# Load module
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

# Set up working directories
cd /working/dir
mkdir -p 1.viral_identification/1.virsorter2
 
## run virsorter2
srun virsorter run -j 32 \
-i /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta \
-d /path/to/Databases/virsorter2_database/ \
--min-score 0.75 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
-w 1.viral_identification/1.virsorter2/S${SLURM_ARRAY_TASK_ID} -l S${SLURM_ARRAY_TASK_ID} \
--rm-tmpdir \
all \
--config LOCAL_SCRATCH=${TMPDIR:-/tmp}


Filter *VirSorter2* results to only retain contigs with a score >= 0.9 *or* with at least one viral hallmark gene identified

In [None]:
cd /working/dir

# Load python3
module purge
module load Python/3.8.2-gimkl-2020a
python3
import pandas as pd
import numpy as np

## Filter results (score >= 0.9 OR hallmark > 0)
# Loop through all samples, and output new 'SampleX-final-viral-score_filt_0.9.tsv' file for each.
for number in range(1, 10):
    # Load ...final-viral-score.tsv file
    vsort_score = pd.read_csv('1.viral_identification/1.virsorter2/S'+str(number)+'/S'+str(number)+'-final-viral-score.tsv', sep='\t')
    # Filter by score threshold and/or hallmark gene (e.g. score >= 0.9 OR hallmark > 0)
    vsort_score = vsort_score[np.logical_or.reduce((vsort_score['max_score'] >= 0.9, vsort_score['hallmark'] > 0))]
    # Write out filtered file
    vsort_score.to_csv('1.viral_identification/1.virsorter2/S'+str(number)+'/S'+str(number)+'-final-viral-score_filt_0.9.tsv', sep='\t', index=False)

quit()


***

## 1A.3.2 Identifying viral contigs: VIBRANT

Run *VIBRANT*

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_vibrant
#SBATCH --time 12:00:00
#SBATCH --mem=20GB
#SBATCH --ntasks=1
#SBATCH --array=1-9
#SBATCH --cpus-per-task=16
#SBATCH -e 3_vibrant_%a.err
#SBATCH -o 3_vibrant_%a.out
#SBATCH --profile=task

# Load dependencies
module purge
module load VIBRANT/1.2.1-gimkl-2020a

# Set up working directories
cd /working/dir
mkdir -p 1.viral_identification/1.vibrant

# Run main analyses 
srun VIBRANT_run.py -t 16 \
-i /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta \
-d $DB_PATH \
-folder 1.viral_identification/1.vibrant/


***

## 1A.3.3 Identifying viral contigs: DeepVirFinder

Note:

- *DeepVirFinder* is not available as a NeSI module, and must be installed separately (modify the *DeepVirFinder* path in the script below).
- The *DeepVirFinder* github also recommends calculating FDR values from the *DeepVirFinder* output to then filter by. This can be done via the script `dvfind_add_fdr.R` (using the R package *qvalue*), which is available in `...../scripts/`
  - This script by default also filters to retain only results meeting the following thresholds: score >= 0.9 & pvalue <= 0.05 & p.adj <= 0.1
- *DeepVirFinder* runs the risk of identifying eukaryotic sequences as putative viral sequences (due to the k-mer profiling model being trained on bacterial (and maybe archaeal?) vs. viral sequences in databases). 
  - It is important to include a filter step on *DeepVirFinder* outputs to remove any contigs that match eukaryotic sequences.
  - Below, eukaryotic contigs are identified via *Kraken2* and then filtered out of the *DeepVirFinder* results
  - The required scripts (*except* `extract_kraken_reads.py`) are available in `...../scripts/`
  - The script `extract_kraken_reads.py` comes from KrakenTools, and is available from https://github.com/jenniferlu717/KrakenTools
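
The internals of `dvfind_add_fdr.R` are not shown here, so the following is only a rough sketch of the thresholding idea described above. It swaps in a plain Benjamini-Hochberg adjustment in place of the Storey q-values computed by the R package *qvalue*, so the adjusted values will not match the script's output exactly. The contig names and numbers are made-up toy data.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy records (hypothetical): (contig name, score, pvalue)
records = [
    ("S1_contig_a", 0.99, 0.001),
    ("S1_contig_b", 0.95, 0.04),
    ("S1_contig_c", 0.85, 0.001),
    ("S1_contig_d", 0.92, 0.20),
]
p_adj = bh_adjust([r[2] for r in records])

# Same thresholds as listed above: score >= 0.9 & pvalue <= 0.05 & p.adj <= 0.1
kept = [
    r[0]
    for r, q in zip(records, p_adj)
    if r[1] >= 0.9 and r[2] <= 0.05 and q <= 0.1
]
print(kept)
```

Here `S1_contig_c` is dropped on score and `S1_contig_d` on p-value, which illustrates that all three conditions must hold for a contig to be retained.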

If required: Install *DeepVirFinder* and dependencies (python packages theano and keras, and R package qvalue)

In [None]:
## Installing DeepVirFinder
cd /path/to/Software/

# Clone git and add permissions
git clone https://github.com/jessieren/DeepVirFinder
chmod -R 777 DeepVirFinder

# Load python and R
module purge
module load Python/3.8.2-gimkl-2020a
module load R/3.6.2-gimkl-2020a

# Requires the additional python packages theano and keras (some files are installed into ~/.local/bin, so this also needs to be added to PATH when running DeepVirFinder).
pip install --user theano keras

# Set up to use with R package "qvalue" to calculate FDR q-values. Need to install this locally (not installed on R module)
R --vanilla
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("qvalue")
q()


Run *DeepVirFinder*

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_dvf
#SBATCH --time 08:00:00
#SBATCH --mem=20GB
#SBATCH --array=1-9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH -e 3_dvf_%a.err
#SBATCH -o 3_dvf_%a.out

# Update this path to your own local bin path (required for the additional python packages theano and keras)
export PATH="/home/<your_home_directory>/.local/bin:$PATH"
# Variables for DeepVirFinder and required scripts directories
# (plain directory paths: these are used as prefixes below, not added to PATH)
DeepVirFinder_PATH="/path/to/Software/DeepVirFinder"
SCRIPTS_PATH="/path/to/scripts"

# Set up working directories
cd /working/dir

# Load modules
module purge
module load Python/3.8.2-gimkl-2020a
module load R/3.6.2-gimkl-2020a

# After an issue with previous DeepVirFinder run, I found a fix online that suggested running this first:
export MKL_THREADING_LAYER=GNU

# run DeepVirFinder
mkdir -p 1.viral_identification/1.deepvirfinder/
srun ${DeepVirFinder_PATH}/dvf.py \
-i /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta \
-m ${DeepVirFinder_PATH}/models \
-o 1.viral_identification/1.deepvirfinder/ \
-l 1000 \
-c 20

# Calculate fdr q values
srun ${SCRIPTS_PATH}/dvfind_add_fdr.R "1.viral_identification/1.deepvirfinder/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta_gt1000bp_dvfpred.txt"


Filter *DeepVirFinder* results to remove eukaryotic contigs

Method: 

1. Extract sequences for contigs identified as putatively 'viral' by DeepVirFinder (`dvfpred_extract_fasta.py`)
1. Assign taxonomy via Kraken2
1. Extract sequences matching Eukaryota (`extract_kraken_reads.py`, from KrakenTools)
1. Filter Eukaryota sequences out of DeepVirFinder results (`dvfpred_filter_euk.py`)


In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_dvf_filter
#SBATCH --time 00:10:00
#SBATCH --mem=180GB
#SBATCH --array=1-9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH -e 3_dvf_filter_%a.err
#SBATCH -o 3_dvf_filter_%a.out

# Update this path to your own local bin path
export PATH="/home/<your_home_directory>/.local/bin:$PATH"
# Add required custom scripts to PATH
export PATH="/path/to/scripts:$PATH"

# Set up working directory
cd /working/dir

# Load modules
module purge
module load Python/3.8.2-gimkl-2020a
module load R/3.6.2-gimkl-2020a
module load Kraken2/2.1.2-GCC-9.2.0

## Run filtering steps
# Run dvfpred_extract_fasta.py
dvfpred_extract_fasta.py \
    --deepvirfinder_results 1.viral_identification/1.deepvirfinder/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta_gt1000bp_dvfpred.txt \
    --assembly_fasta /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta \
    --output 1.viral_identification/1.deepvirfinder/S${SLURM_ARRAY_TASK_ID}_dvfpred.fna
# Run Kraken2 taxonomy classification of dvfpred results
mkdir -p 1.viral_identification/1.deepvirfinder/dvfpred_kraken
srun kraken2 \
    --threads 20 \
    --db nt \
    --use-names \
    --report 1.viral_identification/1.deepvirfinder/dvfpred_kraken/S${SLURM_ARRAY_TASK_ID}.kraken_report.txt \
    --output 1.viral_identification/1.deepvirfinder/dvfpred_kraken/S${SLURM_ARRAY_TASK_ID}.kraken_output.txt \
    1.viral_identification/1.deepvirfinder/S${SLURM_ARRAY_TASK_ID}_dvfpred.fna
# Run extract_kraken_reads.py (n.b. -t 2759 = Eukaryota)
extract_kraken_reads.py \
    -k 1.viral_identification/1.deepvirfinder/dvfpred_kraken/S${SLURM_ARRAY_TASK_ID}.kraken_output.txt \
    -r 1.viral_identification/1.deepvirfinder/dvfpred_kraken/S${SLURM_ARRAY_TASK_ID}.kraken_report.txt \
    -s 1.viral_identification/1.deepvirfinder/S${SLURM_ARRAY_TASK_ID}_dvfpred.fna \
    -t 2759 --include-children \
    -o 1.viral_identification/1.deepvirfinder/dvfpred_kraken/S${SLURM_ARRAY_TASK_ID}.kraken_Euk.fna
# Run dvfpred_filter_euk.py
dvfpred_filter_euk.py \
    --deepvirfinder_results 1.viral_identification/1.deepvirfinder/S${SLURM_ARRAY_TASK_ID}.assembly.m1000.fasta_gt1000bp_dvfpred.txt \
    --Euk_fasta 1.viral_identification/1.deepvirfinder/dvfpred_kraken/S${SLURM_ARRAY_TASK_ID}.kraken_Euk.fna \
    --output 1.viral_identification/1.deepvirfinder/S${SLURM_ARRAY_TASK_ID}.dvfpred_filtered.txt
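
The last step of the method above (dropping contigs that *Kraken2* assigned to Eukaryota) can be sketched as follows. This is a minimal illustration using in-memory toy data, not the actual `dvfpred_filter_euk.py`; it assumes the *DeepVirFinder* table is tab-delimited with a `name` column holding the contig header, and the function names are hypothetical.

```python
import csv
import io


def fasta_ids(handle):
    """Return the set of sequence IDs (first word of each FASTA header)."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    }


def drop_eukaryotic(dvf_handle, euk_ids):
    """Yield DeepVirFinder rows whose contig name is not in euk_ids."""
    for row in csv.DictReader(dvf_handle, delimiter="\t"):
        if row["name"].split()[0] not in euk_ids:
            yield row


# Toy stand-ins for S*.kraken_Euk.fna and S*_dvfpred.txt
euk_fasta = io.StringIO(">S1_contig_b\nACGT\n")
dvf_table = io.StringIO(
    "name\tlen\tscore\tpvalue\n"
    "S1_contig_a\t5000\t0.99\t0.001\n"
    "S1_contig_b\t4000\t0.95\t0.01\n"
)

kept = [row["name"] for row in drop_eukaryotic(dvf_table, fasta_ids(euk_fasta))]
print(kept)
```

Matching on the first word of the header keeps the comparison robust if either file carries extra description text after the contig ID.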



Note:

In the event that samples fail these filtering steps, this may be due to no "viral" contigs being identified by *DeepVirFinder* for this assembly file. 

In these cases, to enable downstream steps to run smoothly, simply copy (the empty file) `S*.assembly.m1000.fasta_gt1000bp_dvfpred.txt` to `S*.dvfpred_filtered.txt` (replace the `*` with the appropriate sample number)

For example, assuming samples 3 and 7 failed:

In [None]:
cd /working/dir

for i in 3 7; do
    cp 1.viral_identification/1.deepvirfinder/S${i}.assembly.m1000.fasta_gt1000bp_dvfpred.txt 1.viral_identification/1.deepvirfinder/S${i}.dvfpred_filtered.txt
done

***

## 1A.4 Summary tables for each assembly

Generate summary tables of all contigs putatively identified as 'viral' (or containing viral sequence) by each of the tools for each assembly

The script `summarise_viral_contigs.py` is available in `.../scripts/`

- This takes the files output from *VirSorter2*, *VIBRANT*, and/or *DeepVirFinder* and generates a summary table based on contig IDs.
- Not all inputs are required (e.g. if you excluded running *DeepVirFinder*, this can be omitted here)

In [None]:
# Working dir
cd /working/dir/1.viral_identification/
mkdir -p 2.summary_tables

# modules
module purge
module load Python/3.9.9-gimkl-2020a

# Run script
for i in {1..9}; do
    /path/to/scripts/summarise_viral_contigs.py \
    --vibrant 1.vibrant/VIBRANT_S${i}.assembly.m1000/VIBRANT_results_S${i}.assembly.m1000/VIBRANT_summary_results_S${i}.assembly.m1000.tsv \
    --virsorter2 1.virsorter2/S${i}/S${i}-final-viral-score_filt_0.9.tsv \
    --deepvirfinder 1.deepvirfinder/S${i}.dvfpred_filtered.txt \
    --out_prefix 2.summary_tables/S${i}.viral_contigs.summary_table
done


***

## 1A.5 Viral contigs: per-sample dereplication

For *each* assembly data set, dereplicate *across the separate tools used* to identify putative viral contigs, and generate new fasta files of 'dereplicated' viral contigs (i.e. fasta files of the combined results from all three methods, based on the summary table generated above).

NOTE:

- Both *VirSorter2* and *VIBRANT* can return prophage sequences that have been excised out of the original assembly contigs (which may also contain contaminating host sequence on either end of the prophage sequence). 
- As `virome_per_sample_derep.py` is currently written, all prophage genomic regions excised from contigs (i.e. after trimming off host regions) by either *VIBRANT* or *VirSorter2* are retained (*VIBRANT* IDs for excised prophage have 'fragment' in the contig header, *VirSorter2* IDs include 'partial'). If both tools excise a prophage from the same contig, both will be retained here (i.e. there will likely be a number of duplicate prophage sequences). This is fine, as they will be dereplicated downstream in the `Cluster_genomes_5.1.pl` step.
- In cases where either tool (*VirSorter2* or *VIBRANT*) has excised a prophage from a contig, this script also ensures that the original full contig is *not* retained, even if it has been identified as 'viral' by one of the other tools (e.g. *DeepVirFinder*). This ensures that the excised prophage is not simply absorbed back into the full contig (with contaminating host sequence on either end) and lost during the downstream `Cluster_genomes_5.1.pl` dereplication step.

In [None]:
# Working dir
cd /working/dir
mkdir -p 1.viral_identification/3.perSample_derep

# modules
module purge
module load Python/3.9.9-gimkl-2020a

# Run script 
for i in {1..9}; do
    /path/to/scripts/virome_per_sample_derep.py \
    --assembly_fasta /path/to/wgs/assembly/2.spades_assembly_edit_m1000/S${i}.assembly.m1000.fasta \
    --summary_table 1.viral_identification/2.summary_tables/S${i}.viral_contigs.summary_table_VIRUSES.txt \
    --vibrant 1.viral_identification/1.vibrant/VIBRANT_S${i}.assembly.m1000/VIBRANT_phages_S${i}.assembly.m1000/S${i}.assembly.m1000.phages_combined.fna \
    --virsorter2 1.viral_identification/1.virsorter2/S${i}/S${i}-final-viral-combined.fa \
    --output 1.viral_identification/3.perSample_derep/S${i}.viral_contigs.fna
done
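
The prophage-handling rule described in the notes above can be sketched roughly as follows. This is not the code of `virome_per_sample_derep.py`; it is a toy illustration that assumes excised regions are named `<contig>_fragment_<n>` by *VIBRANT* and `<contig>||<n>_partial` by *VirSorter2* (the notes only state that the headers contain 'fragment' and 'partial'), and all function names are hypothetical.

```python
def parent_contig(seq_id):
    """Strip the assumed tool-specific suffix to recover the original contig ID."""
    if "||" in seq_id:            # assumed VirSorter2 style: <contig>||<suffix>
        return seq_id.split("||")[0]
    if "_fragment_" in seq_id:    # assumed VIBRANT style: <contig>_fragment_<n>
        return seq_id.split("_fragment_")[0]
    return seq_id


def is_excised(seq_id):
    """True if the ID looks like an excised prophage region."""
    return "_fragment_" in seq_id or seq_id.endswith("_partial")


def per_sample_derep(ids):
    """Keep every excised prophage, plus whole contigs that have no excised region."""
    excised = [s for s in ids if is_excised(s)]
    parents_with_prophage = {parent_contig(s) for s in excised}
    whole = {parent_contig(s) for s in ids if not is_excised(s)}
    return sorted(whole - parents_with_prophage) + excised


# Toy IDs pooled from all tools for one sample
ids = [
    "S1_c1||full",        # whole contig, VirSorter2
    "S1_c1",              # same contig, another tool
    "S1_c2||0_partial",   # prophage excised by VirSorter2
    "S1_c2_fragment_1",   # prophage excised by VIBRANT
    "S1_c2",              # full contig also flagged by another tool
    "S1_c3",
]
result = per_sample_derep(ids)
print(result)
```

Note that `S1_c2` is dropped as a whole contig because a prophage was excised from it, while both excised copies are kept for the later cross-sample clustering to resolve.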


#### Filter dereplicated contigs to remove those < 5000 bp

Confidence in the viral calls of each of the tools generally increases with contig length. As such, various studies have only retained contigs greater than a set threshold (e.g. 3,000 bp, 5,000 bp, or 10,000 bp). This can also help reduce the dataset to a manageable size for large, complex metagenome data sets.

The example below filters to only retain contigs >= 5000 bp

In [None]:
# Load dependencies
module purge
module load seqmagick/0.7.0-gimkl-2017a-Python-3.6.3

# Working directory
cd /working/dir

for i in {1..9}; do
    seqmagick convert --min-length 5000 \
    1.viral_identification/3.perSample_derep/S${i}.viral_contigs.fna \
    1.viral_identification/3.perSample_derep/S${i}.viral_contigs.filt.fna
done


***

## 1A.6 Per-sample quality assessment and additional filtering

#### CheckV perSample: All samples

*checkV* is a tool that has been developed as an analogue to *checkM*. *checkV* provides various statistics about the putative viral contigs data set, including length, gene count, viral and host gene counts, and estimated completeness and contamination. 

We can run *checkV* on the output from dereplication of contigs identified by the three methods, and use the results from *checkV* as an additional filtering step prior to our final dereplication.

Run *checkV*:

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_checkv_perSample
#SBATCH --time 00:20:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --array=1-9
#SBATCH --cpus-per-task=16
#SBATCH -e 3_checkv_perSample_%a.err
#SBATCH -o 3_checkv_perSample_%a.out
#SBATCH --profile=task

# Load dependencies
module purge
module load CheckV/0.7.0-gimkl-2020a-Python-3.8.2

# Set up working directories
cd /working/dir
mkdir -p 1.viral_identification/4.perSample_checkv

# Run main analyses 
checkv_in="1.viral_identification/3.perSample_derep/S${SLURM_ARRAY_TASK_ID}.viral_contigs.filt.fna"
checkv_out="1.viral_identification/4.perSample_checkv/S${SLURM_ARRAY_TASK_ID}"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet


#### Concatenate output fasta files (viruses.fna and proviruses.fna) for downstream use

- Note: this script also modifies contig headers for readability in any cases where *checkV* has trimmed any residual host sequence off the end of integrated prophage sequence (this is separate and additional to previous prophage excision by *VIBRANT* or *VirSorter2*).

In [None]:
cd /working/dir

for i in {1..9}; do
    # concatenate viruses and prophage files
    cat 1.viral_identification/4.perSample_checkv/S${i}/viruses.fna 1.viral_identification/4.perSample_checkv/S${i}/proviruses.fna > 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs.fna 
    # modify checkv prophage contig headers
    sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs.fna
done
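
To make the effect of the `sed` substitutions easier to see, the sketch below applies the same five replacements to a single header in Python. It assumes that *checkV* writes trimmed provirus headers in the form `<contig>_<n> <start>-<end>/<length>`; the example headers are made up, and the function name is hypothetical.

```python
import re


def rewrite_header(header):
    """Apply the same substitutions as the sed command to one header line
    (without its trailing newline)."""
    header = re.sub(r"\s", "__excised_start_", header)
    header = header.replace("-", "_end_")
    header = header.replace("/", "_len_")
    header = header.replace("|", "_", 1)   # first '|' becomes '_'
    header = header.replace("|", "")       # any remaining '|' are dropped
    return header


provirus = rewrite_header(">S1_NODE_5_1 1-5000/8000")
virsorter_style = rewrite_header(">S1_NODE_7||full")
print(provirus)
print(virsorter_style)
```

One consequence worth keeping in mind is that the `sed` command is applied to every line of the file, so any `-`, `/` or `|` already present in other contig headers is rewritten in the same way.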


#### Add checkv results to summary table

This may ultimately be put into a script for ease of use. But for now we can use the python code below.

In [None]:
cd /working/dir/1.viral_identification

# LOAD PYTHON
module purge
module load Python/3.8.2-gimkl-2020a
python3
import pandas as pd
import numpy as np

# Loop through all sample summary_tables, add key checkv results columns and write out.
for i in range(1, 10):
    # Import summary table (manually set dtypes)
    summary_table = pd.read_csv('2.summary_tables/S'+str(i)+'.viral_contigs.summary_table_VIRUSES.txt', sep='\t', \
                                dtype={'contig_ID': str, 'vibrant_contig_ID': str, 'vibrant_total_genes': float, \
                                       'vibrant_KEGG_genes': float, 'vibrant_KEGG_v_score': float, 'vibrant_Pfam_genes': float, \
                                       'vibrant_Pfam_v_score': float, 'vibrant_VOG_genes': float, 'vibrant_VOG_v_score': float, \
                                       'vsort2_contig_ID': str, 'vsort2_max_score': float, 'vsort2_max_score_group': str, \
                                       'vsort2_hallmark_genes': float, 'vsort2_viral_component': float, 'vsort2_cellular_component': float, \
                                       'dvfind_contig_ID': str, 'dvfind_score': float, 'dvfind_p_adj': float})        
    # Import checkv summary results
    checkv = pd.read_csv('4.perSample_checkv/S'+str(i)+'/quality_summary.tsv', sep='\t', \
                        dtype={'contig_id': str, 'contig_length': float, 'provirus': str, 'proviral_length': float, \
                               'gene_count': float, 'viral_genes': float, 'host_genes': float, 'checkv_quality': str, \
                               'miuvig_quality': str, 'completeness': float, 'completeness_method': str, \
                               'contamination': float, 'kmer_freq': float, 'warnings': str})
    checkv = checkv.add_prefix('checkv_')
    checkv['contig_ID'] = checkv['checkv_contig_id']
    # Merge with summary_table
    summary_table = pd.merge(summary_table, checkv, left_on="contig_ID", right_on="contig_ID", how='outer')
    # Output summary table
    summary_table.to_csv('2.summary_tables/S'+str(i)+'.viral_contigs.summary_table_VIRUSES_checkv.txt', sep='\t', index=False)

quit()


#### Filter putative 'viral' contigs by checkv results

The script `checkv_filter_contigs.py` (available in `../scripts/`) further filters the sets of viral contigs based on *checkV* results. By default, this retains only contigs where: (viral_genes > 0) OR (viral_genes = 0 AND host_genes = 0). This script takes the *checkV* outputs (including the proviruses and viruses fna files, and the quality summary), and returns fna and quality summary files with 'filtered' appended to the file name.


In [None]:
# Load dependencies
module purge
module load Python/3.8.2-gimkl-2020a

# Working directory
cd /working/dir

for i in {1..9}; do
    /path/to/scripts/checkv_filter_contigs.py \
    --checkv_dir_input 1.viral_identification/4.perSample_checkv/S${i}/ \
    --output_prefix 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs
done
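
The default retention rule applied by `checkv_filter_contigs.py`, as described above, can be illustrated with a few toy rows. This is a sketch of the rule only, not the script itself; it assumes the `contig_id`, `viral_genes` and `host_genes` columns of *checkV*'s `quality_summary.tsv`, and the values are made up.

```python
import csv
import io


def keep_contig(row):
    """Retain if viral_genes > 0, or if no viral and no host genes were found."""
    viral = int(row["viral_genes"])
    host = int(row["host_genes"])
    return viral > 0 or (viral == 0 and host == 0)


# Toy stand-in for a quality_summary.tsv file (only the relevant columns)
quality_summary = io.StringIO(
    "contig_id\tviral_genes\thost_genes\n"
    "S1_c1\t3\t1\n"
    "S1_c2\t0\t0\n"
    "S1_c3\t0\t4\n"
)

kept = [
    row["contig_id"]
    for row in csv.DictReader(quality_summary, delimiter="\t")
    if keep_contig(row)
]
print(kept)
```

In effect, the only contigs removed are those with host genes but no viral genes (`S1_c3` here).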


***

## 1A.7 Dereplication across samples

Contigs identified so far now need to be dereplicated into a single final set of viral contigs. This final set of representative (clustered) contigs can be referred to as 'viral operational taxonomic units' (vOTUs), representing distinct viral 'populations'. 

Here we can use the `Cluster_genomes_5.1.pl` script developed by Simon Roux's group: https://github.com/simroux/ClusterGenomes

This script clusters contigs based on sequence similarity thresholds, returning a representative (vOTU) sequence for each cluster. The following paper recommends a threshold of 95% similarity over 85% of the sequence length, based on currently available data: https://doi.org/10.1038/nbt.4306
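
As a rough illustration of what the 95% / 85% threshold means for a pair of contigs, the sketch below tests the criterion directly. It does not reproduce how `Cluster_genomes_5.1.pl` derives identity and aligned length from its alignments; the function name and inputs are hypothetical, and coverage is taken here relative to the shorter of the two contigs.

```python
def same_votu(identity_pct, aligned_bp, len_a, len_b,
              min_identity=95.0, min_coverage=85.0):
    """True if two contigs meet the identity and coverage thresholds.

    identity_pct: average nucleotide identity over the aligned region (%)
    aligned_bp:   total aligned length in bp
    len_a, len_b: lengths of the two contigs in bp
    """
    shorter = min(len_a, len_b)
    coverage_pct = 100.0 * aligned_bp / shorter
    return identity_pct >= min_identity and coverage_pct >= min_coverage


# 9,000 of 10,000 bp aligned at 97.2% identity: 90% coverage, passes
print(same_votu(97.2, 9000, 10000, 40000))
# 8,000 of 10,000 bp aligned: 80% coverage, fails
print(same_votu(97.2, 8000, 10000, 40000))
# Fully aligned but only 93% identity: fails
print(same_votu(93.0, 10000, 10000, 12000))
```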

In the case where multiple assemblies have been analysed, this step is necessary to reduce the viral data down to one representative set for all sample assemblies (which is important for read mapping to assess differential coverage across assemblies, for example). Where only one assembly data set has been processed, this step is still useful to reduce the data down into meaningful units for downstream analyses (i.e. viral 'populations' rather than unique sequences).

Note: 

- Download the `Cluster_genomes_5.1.pl` script from https://github.com/simroux/ClusterGenomes
- *mummer* is also required. Download the latest version and add the path to the bin directory in the `Cluster_genomes_5.1.pl` script below.

#### If required: Install *Cluster_genomes_5.1.pl* and *mummer*

In [None]:
# Install Cluster_genomes_5.1.pl
mkdir -p /path/to/Software/Cluster_genomes
cd /path/to/Software/Cluster_genomes
wget https://raw.githubusercontent.com/simroux/ClusterGenomes/master/Cluster_genomes_5.1.pl
chmod 777 Cluster_genomes_5.1.pl

# Install mummer4
mkdir -p /path/to/Software/mummer_v4.0.0/
cd /path/to/Software/mummer_v4.0.0
wget https://github.com/mummer4/mummer/releases/download/v4.0.0rc1/mummer-4.0.0rc1.tar.gz
tar -xzf mummer-4.0.0rc1.tar.gz
cd mummer-4.0.0rc1/
./configure --prefix=/path/to/Software/mummer_v4.0.0
make
make install


#### File prep: Concatenate multiple sample fasta files together for `Cluster_genomes_5.1.pl`

In [None]:
cd /working/dir
mkdir -p 1.viral_identification/5.cluster_vOTUs/

# Start from an empty file, then append each sample's filtered contigs
> 1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.fna
for i in {1..9}; do
    cat 1.viral_identification/4.perSample_checkv/S${i}/viral_contigs_filtered.fna >> 1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.fna
done

# Sort by sequence size
module purge
module load BBMap/38.95-gimkl-2020a 
sortbyname.sh in=1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.fna out=1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.sorted.fna length descending


#### Run Cluster_genomes_5.1.pl

Run `Cluster_genomes_5.1.pl` with a minimum identity of 95% over at least 85% of the shortest contig


In [None]:
# Load dependencies
module purge

# Set up working directories
cd /working/dir/1.viral_identification/5.cluster_vOTUs

# Run
/path/to/Software/Cluster_genomes/Cluster_genomes_5.1.pl \
-f viral_contigs_allSamples.sorted.fna \
-d /path/to/Software/mummer_v4.0.0/bin/ \
-t 20 \
-c 85 \
-i 95


#### Check total number of clustered contigs (vOTUs)

In [None]:
cd /working/dir

# count contigs in cluster output file
grep -c ">" 1.viral_identification/5.cluster_vOTUs/viral_contigs_allSamples.sorted_95-85.fna


#### Optional: Modify derep contig headers to be *vOTU_n*

- It can be useful for downstream processing to standardise the contig headers of the cluster representative sequences; for example, to replace all headers with *vOTU_n*. 
- The script below replaces all headers with *vOTU_n* and creates a table file of *vOTU_n* IDs against the original full contig headers (of the *representative* sequences from each cluster).
  - Note: Cluster_genomes_5.1.pl also outputs a file matching cluster representative sequences to each of the sequences that are contained in the cluster.
- *Optional*: you may also wish to omit this step here and instead run it after having calculated differential coverage across sample assemblies (via read mapping), to enable first ordering the contigs by abundance (coverage), and *then* generating *vOTU_n* headers
- This may ultimately be put into a script for ease of use. But for now we can use the python code below
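
For the optional abundance-ordered variant mentioned above, the numbering step could look something like the sketch below. It assumes you already have a coverage value per cluster representative from read mapping. The `mean_coverage` dictionary and the function name are hypothetical, and how coverage is summarised across samples is left to you.

```python
def assign_votu_ids(mean_coverage):
    """Map contig IDs to vOTU_n labels, most abundant first.

    mean_coverage -- dict of {contig_id: coverage value} (hypothetical input,
                     e.g. mean coverage across samples from read mapping)
    Ties are broken by contig ID so the numbering is reproducible.
    """
    ranked = sorted(mean_coverage, key=lambda cid: (-mean_coverage[cid], cid))
    return {cid: f"vOTU_{n}" for n, cid in enumerate(ranked, start=1)}
```

With this approach *vOTU_1* is always the most abundant representative, which can make downstream tables easier to read.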

In [None]:
cd /working/dir/1.viral_identification/5.cluster_vOTUs

# LOAD PYTHON
module purge
module load Python/3.8.2-gimkl-2020a
python3
# Only Biopython's FASTA parser is needed for the renaming below
from Bio.SeqIO.FastaIO import SimpleFastaParser

fasta_in = 'viral_contigs_allSamples.sorted_95-85.fna'
fasta_out = 'vOTUs.fna'
lookup_table_out = 'vOTUs_lookupTable.txt'

# Read in fasta file, looping through each contig
# rename contig headers with incrementing vOTU_n headers
# write out new vOTUs.fna file and tab-delimited table file of matching vOTU_n and contigID headers.
i=1
with open(fasta_in, 'r') as read_fasta:
    with open(fasta_out, 'w') as write_fasta:
        with open (lookup_table_out, 'w') as write_table:
            write_table.write("vOTU" + "\t" + "cluster_rep_contigID" + "\n")
            for name, seq in SimpleFastaParser(read_fasta):
                write_table.write("vOTU_" + str(i) + "\t" + name + "\n")
                write_fasta.write(">" + "vOTU_" + str(i) + "\n" + str(seq) + "\n")
                i += 1

quit()
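
After renaming, a quick consistency check between `vOTUs.fna` and `vOTUs_lookupTable.txt` can catch truncated or mismatched outputs. This is a minimal sketch, assuming the tab-delimited layout written above (one header row, with the *vOTU_n* ID in the first column); the function name is hypothetical.

```python
def check_lookup(fasta_lines, table_lines):
    """Return a list of problems found; an empty list means the files agree."""
    fasta_ids = [line[1:].strip() for line in fasta_lines if line.startswith(">")]
    rows = [line.rstrip("\n").split("\t") for line in table_lines if line.strip()]
    table_ids = [row[0] for row in rows[1:]]  # skip the header row
    problems = []
    if len(set(fasta_ids)) != len(fasta_ids):
        problems.append("duplicate vOTU headers in FASTA")
    if fasta_ids != table_ids:
        problems.append("FASTA headers and lookup table differ in content or order")
    return problems

# Example usage:
# with open("vOTUs.fna") as fa, open("vOTUs_lookupTable.txt") as tb:
#     print(check_lookup(fa, tb))
```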


***


## 1A.8 vOTUs assessment and additional filtering

### CheckV on vOTUs

Re-run *CheckV*, this time on the dereplicated contig set (vOTUs), to generate *CheckV* quality statistics for the final vOTUs.

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J 3_checkv_vOTUs
#SBATCH --time 01:00:00
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e 3_checkv_vOTUs.err
#SBATCH -o 3_checkv_vOTUs.out
#SBATCH --profile=task

# Load dependencies
module purge
module load CheckV/0.7.0-gimkl-2020a-Python-3.8.2

# Set up working directories
cd /working/dir/1.viral_identification
mkdir -p 6.checkv_vOTUs

# Run main analyses 
checkv_in="5.cluster_vOTUs/vOTUs.fna"
checkv_out="6.checkv_vOTUs"
srun checkv end_to_end ${checkv_in} ${checkv_out} -t 16 --quiet
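
Once the job has finished, a quick tally of the quality tiers gives a first impression of the vOTU set. The sketch below assumes the *CheckV* output directory contains a tab-delimited `quality_summary.tsv` with a `checkv_quality` column. Check the column name against your own output, as it may differ between *CheckV* versions. The function name is hypothetical.

```python
import csv
from collections import Counter

def quality_tally(tsv_lines, column="checkv_quality"):
    """Count contigs per value of `column` in a CheckV quality summary."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return Counter(row[column] for row in reader)

# Example usage (path relative to 1.viral_identification):
# with open("6.checkv_vOTUs/quality_summary.tsv", newline="") as fh:
#     for tier, n in quality_tally(fh).most_common():
#         print(tier, n, sep="\t")
```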


#### CheckV: Concatenate output fasta files (viruses.fna and proviruses.fna)

In [None]:
cd /working/dir/1.viral_identification

# concatenate viruses and prophage files
cat 6.checkv_vOTUs/viruses.fna  6.checkv_vOTUs/proviruses.fna >  6.checkv_vOTUs/vOTUs.checkv.fna 
# modify checkv prophage contig headers
sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" 6.checkv_vOTUs/vOTUs.checkv.fna
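
To see what the `sed` command does to an individual header, the sketch below applies the same five substitutions in the same order. The example header format (`<contig>_<n> <start>-<end>/<length>`) is an assumption about how *CheckV* labels excised proviruses, so inspect `proviruses.fna` to confirm it matches your version. The function name is hypothetical.

```python
import re

def rename_checkv_header(header):
    """Apply the same five substitutions as the sed command, in order."""
    header = re.sub(r"\s", "__excised_start_", header)  # whitespace
    header = header.replace("-", "_end_")                # every hyphen
    header = header.replace("/", "_len_")                # every slash
    header = header.replace("|", "_", 1)                 # first "|" only
    header = header.replace("|", "")                     # any remaining "|"
    return header

print(rename_checkv_header("S1_contig_12_1 2041-38977/52310"))
# S1_contig_12_1__excised_start_2041_end_38977_len_52310
```

Because every hyphen is replaced, a contig ID that itself contains `-` would also be rewritten (for example, `my-contig` becomes `my_end_contig`). Check your contig naming before relying on the modified headers.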


#### Filter vOTUs based on *checkV* results

**NOTE**:

- As `checkv_filter_contigs.py` was already run on all individual contigs in the earlier per-sample QC step, running it again here may be redundant.
- If so, simply proceed with the concatenated file above (`vOTUs.checkv.fna`) rather than the `..._filtered...` files output by `checkv_filter_contigs.py`.


In [None]:
# Load dependencies
module purge
module load Python/3.8.2-gimkl-2020a
export PATH="/nesi/project/uoa02469/custom-scripts/MikeH/:$PATH"

# Set up working directories
cd /working/dir/1.viral_identification

# Run for vOTUs
/path/to/scripts/checkv_filter_contigs.py \
    --checkv_dir_input 6.checkv_vOTUs/ \
    --output_prefix 6.checkv_vOTUs/vOTUs.checkv


***

## 1A.9 Final set of dereplicated viral contigs

At this stage we have a final set of dereplicated viral contigs for all downstream analyses.

Key files include:

- Final dereplicated viral contig data set: `/working/dir/1.viral_identification/6.checkv_vOTUs/vOTUs.checkv_filtered.fna` 
  - (Or `vOTUs.checkv.fna`, if the final `checkv_filter_contigs.py` step was not run)
- *checkV* stats for dereplicated viral contigs: `/working/dir/1.viral_identification/6.checkv_vOTUs/vOTUs.checkv_filtered_quality_summary.txt`

***

## 1A.10 Additional Resources

#### Notes on manual curation of vOTUs and general resources

Some very helpful notes on manual curation are available at the *VirSorter2* protocols page here: https://doi.org/10.17504/protocols.io.bwm5pc86

Further valuable reading on Minimum Information about an Uncultivated Virus Genome (MIUViG): https://doi.org/10.1038/nbt.4306

A great resource on standards in viromics: https://doi.org/10.7717/peerj.11447

***