# 3.1 Viruses - DNA virus identification

## Software and versions used in this study

- seqmagick v0.7.0
- VirSorter2 v2.2.3
- VIBRANT v1.2.1
- DeepVirFinder v1.0
- Kraken2 v2.0.9
- extract_kraken_reads.py (from KrakenTools)

## Additional custom scripts

Note: custom scripts were tested with Python v3.11.6 and R v4.2.1 and may not behave identically in other versions.

- scripts/viruses.identification/dvfind_add_fdr.R
- scripts/viruses.identification/dvfpred_extract_fasta.py
- scripts/viruses.identification/dvfpred_filter_euk.py

*Required Python packages: argparse and os (both standard library), pandas, numpy, Biopython (Bio.SeqIO.FastaIO)*

*Required R libraries: dplyr, readr, qvalue*

***

## Virus identification

Identify putative DNA viruses in metagenome assemblies via VirSorter2, VIBRANT, and DeepVirFinder

#### Data prep: filter out short contigs

In [None]:
mkdir -p DNA/1.assembly.m1000
for i in {1..9}; do
    seqmagick convert --min-length 1000 DNA/1.assembly/S${i}.spades/assembly.fasta DNA/1.assembly.m1000/S${i}.assembly.m1000.fasta
done
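For readers who want to see what the length filter does, the following is a minimal sketch in plain Python rather than the seqmagick implementation. The helper names `read_fasta` and `filter_min_length` and the toy contig names are hypothetical, and the sketch assumes that contigs exactly at the threshold length are retained.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def filter_min_length(records, min_length=1000):
    """Keep only records whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in records if len(s) >= min_length]

# Toy assembly (made-up data): one 1,200 bp contig and one 300 bp contig
toy_fasta = io.StringIO(
    ">contig_1\n" + "A" * 1200 + "\n>contig_2\n" + "C" * 300 + "\n"
)
kept = filter_min_length(read_fasta(toy_fasta), min_length=1000)
kept_ids = [h for h, _ in kept]
print(kept_ids)
```

Only the 1,200 bp contig passes the 1000 bp threshold in this toy input.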

#### VirSorter2

Identify virus contigs via VirSorter2

In [None]:
for i in {1..9}; do
    virsorter run -j 32 \
    -i DNA/1.assembly.m1000/S${i}.assembly.m1000.fasta \
    -d virsorter2_database/ \
    --min-score 0.75 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
    -w DNA/3.viruses/1.identification/1.virsorter2/S${i} -l S${i} \
    --rm-tmpdir \
    all \
    --config LOCAL_SCRATCH=${TMPDIR:-/tmp}
done

Filter results in Python to retain only contigs with a max_score >= 0.9 or with at least one viral hallmark gene

In [None]:
python3
import pandas as pd

for i in range(1, 10):
    prefix = f'DNA/3.viruses/1.identification/1.virsorter2/S{i}/S{i}'
    vsort_score = pd.read_csv(f'{prefix}-final-viral-score.tsv', sep='\t')
    # Keep contigs with max_score >= 0.9 or at least one viral hallmark gene
    keep = (vsort_score['max_score'] >= 0.9) | (vsort_score['hallmark'] > 0)
    vsort_score[keep].to_csv(f'{prefix}-final-viral-score_filt_0.9.tsv', sep='\t', index=False)

quit()
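To make the retention rule concrete, here is a small sketch that applies it to a made-up table. It uses only the standard library `csv` module instead of pandas so that it runs anywhere. The toy table carries only a subset of the columns in a real `final-viral-score.tsv`, and the helper `keep_row` is hypothetical.

```python
import csv
import io

# Toy stand-in for a VirSorter2 final-viral-score table (values are made up)
toy_tsv = (
    "seqname\tmax_score\thallmark\n"
    "contig_1||full\t0.95\t0\n"   # high score, no hallmark gene -> kept
    "contig_2||full\t0.80\t2\n"   # lower score, but has hallmark genes -> kept
    "contig_3||full\t0.80\t0\n"   # lower score, no hallmark gene -> dropped
)

def keep_row(row, min_score=0.9):
    """Retain a contig if max_score >= min_score OR it has a hallmark gene."""
    return float(row["max_score"]) >= min_score or float(row["hallmark"]) > 0

rows = list(csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"))
kept = [r["seqname"] for r in rows if keep_row(r)]
print(kept)
```

The two conditions are combined with a logical OR, so a contig scoring below 0.9 is still retained when it carries a hallmark gene.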

#### VIBRANT

Identify virus contigs via VIBRANT

In [None]:
for i in {1..9}; do
    VIBRANT_run.py -t 16 \
    -i DNA/1.assembly.m1000/S${i}.assembly.m1000.fasta \
    -d $DB_PATH \
    -folder DNA/3.viruses/1.identification/1.vibrant/
done

#### DeepVirFinder

Identify virus contigs via DeepVirFinder

In [None]:
for i in {1..9}; do
    ${DeepVirFinder_PATH}/dvf.py \
    -i DNA/1.assembly.m1000/S${i}.assembly.m1000.fasta \
    -m ${DeepVirFinder_PATH}/models \
    -o DNA/3.viruses/1.identification/1.deepvirfinder/ \
    -l 1000 \
    -c 20
    # Calculate fdr q values and filter by (score >= 0.9) & (pvalue <=0.05) & (FDR.p.adj <= 0.1)
    scripts/viruses.identification/dvfind_add_fdr.R "DNA/3.viruses/1.identification/1.deepvirfinder/S${i}.assembly.m1000.fasta_gt1000bp_dvfpred.txt"
done
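The internals of `dvfind_add_fdr.R` are not shown here, and it lists the R `qvalue` library as a dependency. As a rough illustration of the same idea (adjust p-values for multiple testing, then apply the three thresholds), the sketch below swaps in a plain Benjamini-Hochberg adjustment. This is a different estimator from the one in the `qvalue` package, so the numbers will not match the script's output. The function name and toy rows are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy DeepVirFinder-style rows (made-up values): (name, score, pvalue)
toy = [
    ("contig_1", 0.99, 0.001),
    ("contig_2", 0.95, 0.04),
    ("contig_3", 0.92, 0.2),
    ("contig_4", 0.50, 0.03),
]
padj = benjamini_hochberg([p for _, _, p in toy])

# Apply the thresholds named in the loop above:
# score >= 0.9, pvalue <= 0.05, adjusted p-value <= 0.1
kept = [
    name
    for (name, score, pvalue), q in zip(toy, padj)
    if score >= 0.9 and pvalue <= 0.05 and q <= 0.1
]
print(kept)
```

In this toy input, `contig_3` fails the p-value threshold and `contig_4` fails the score threshold, even though its adjusted p-value would pass.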

Filter DeepVirFinder results to remove eukaryotic contigs

In [None]:
for i in {1..9}; do
    # Extract sequences for contigs identified as putatively 'viral' by DeepVirFinder
    scripts/viruses.identification/dvfpred_extract_fasta.py \
    --deepvirfinder_results DNA/3.viruses/1.identification/1.deepvirfinder/S${i}.assembly.m1000.fasta_gt1000bp_dvfpred.txt \
    --assembly_fasta DNA/1.assembly.m1000/S${i}.assembly.m1000.fasta \
    --output DNA/3.viruses/1.identification/1.deepvirfinder/S${i}_dvfpred.fna
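    # Assign taxonomy via Kraken2 (ensure the output directory exists first)
    mkdir -p DNA/3.viruses/1.identification/1.deepvirfinder/dvfpred_kraken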
    kraken2 --threads 20 --db nt --use-names \
    --report DNA/3.viruses/1.identification/1.deepvirfinder/dvfpred_kraken/S${i}.kraken_report.txt \
    --output DNA/3.viruses/1.identification/1.deepvirfinder/dvfpred_kraken/S${i}.kraken_output.txt \
    DNA/3.viruses/1.identification/1.deepvirfinder/S${i}_dvfpred.fna
    # Extract sequences matching Eukaryota (extract_kraken_reads.py, from KrakenTools) (n.b. -t 2759 = Eukaryota)
    extract_kraken_reads.py \
    -k DNA/3.viruses/1.identification/1.deepvirfinder/dvfpred_kraken/S${i}.kraken_output.txt \
    -r DNA/3.viruses/1.identification/1.deepvirfinder/dvfpred_kraken/S${i}.kraken_report.txt \
    -s DNA/3.viruses/1.identification/1.deepvirfinder/S${i}_dvfpred.fna \
    -t 2759 --include-children \
    -o DNA/3.viruses/1.identification/1.deepvirfinder/dvfpred_kraken/S${i}.kraken_Euk.fna
    # Filter Eukaryota sequences out of DeepVirFinder results (dvfpred_filter_euk.py)
    scripts/viruses.identification/dvfpred_filter_euk.py \
    --deepvirfinder_results DNA/3.viruses/1.identification/1.deepvirfinder/S${i}.assembly.m1000.fasta_gt1000bp_dvfpred.txt \
    --Euk_fasta DNA/3.viruses/1.identification/1.deepvirfinder/dvfpred_kraken/S${i}.kraken_Euk.fna \
    --output DNA/3.viruses/1.identification/1.deepvirfinder/S${i}.dvfpred_filtered.txt
done
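The final filtering step can be pictured as a set difference: collect the IDs of contigs that Kraken2 assigned to Eukaryota, then drop those rows from the DeepVirFinder table. The sketch below is an illustration under that assumption, not the contents of `dvfpred_filter_euk.py`. The helper `fasta_ids` and the toy data are hypothetical, and contig names are matched on the first whitespace-delimited word of each header.

```python
import csv
import io

def fasta_ids(handle):
    """Collect sequence IDs (first word of each header line) from a FASTA handle."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }

# Toy inputs (made-up data): one contig flagged as eukaryotic, two DeepVirFinder rows
toy_euk_fasta = io.StringIO(">contig_2\nACGT\n")
toy_dvf = io.StringIO(
    "name\tlen\tscore\tpvalue\n"
    "contig_1\t1500\t0.98\t0.001\n"
    "contig_2\t2100\t0.93\t0.02\n"
)

euk = fasta_ids(toy_euk_fasta)
rows = [
    r
    for r in csv.DictReader(toy_dvf, delimiter="\t")
    if r["name"].split()[0] not in euk  # drop contigs classified as Eukaryota
]
remaining = [r["name"] for r in rows]
print(remaining)
```

Matching on the first word of the header is a deliberate simplification. If FASTA headers and table names are formatted differently in practice, the ID normalisation would need to be adjusted.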

***