# 3.2 Viruses - data processing and vOTU clustering

## Software and versions used in this study

- seqmagick v0.7.0
- CheckV v0.7.0
- BBMap v39.01
- Cluster_genomes_5.1.pl

## Additional custom scripts

Note: custom scripts were tested with Python v3.11.6 and R v4.2.1 and may not be stable in other versions.

- scripts/viruses.identification/summarise_viral_contigs.py
- scripts/viruses.identification/virome_per_sample_derep.py
- scripts/viruses.identification/checkv_filter_contigs.py
- scripts/viruses.identification/filter_viruses_by_completeness.py

*Required Python modules: argparse, os, re (standard library); pandas, numpy, Bio.SeqIO.FastaIO (Biopython)*

***

## Dereplication, quality filtering, and vOTU clustering

#### Generate summary table

In [None]:
for i in {1..9}; do
    scripts/viruses.identification/summarise_viral_contigs.py \
    --virsorter2 DNA/3.viruses/1.identification/1.virsorter2/S${i}/S${i}-final-viral-score_filt_0.9.tsv \
    --vibrant DNA/3.viruses/1.identification/1.vibrant/VIBRANT_S${i}.m1000/VIBRANT_results_S${i}.m1000/VIBRANT_summary_results_S${i}.m1000.tsv \
    --deepvirfinder DNA/3.viruses/1.identification/1.deepvirfinder/S${i}.dvfpred_filtered.txt \
    --out_prefix DNA/3.viruses/1.identification/2.summary_tables/S${i}.viral_contigs.summary_table
done

#### Dereplication per assembly

Dereplicate virus contigs that were identified by more than one tool.

In [None]:
for i in {1..9}; do
    scripts/viruses.identification/virome_per_sample_derep.py \
    --assembly_fasta DNA/1.assembly.m1000/S${i}.assembly.m1000.fasta \
    --summary_table DNA/3.viruses/1.identification/2.summary_tables/S${i}.viral_contigs.summary_table_VIRUSES.txt \
    --vibrant DNA/3.viruses/1.identification/1.vibrant/VIBRANT_S${i}.assembly.m1000/VIBRANT_phages_S${i}.assembly.m1000/S${i}.assembly.m1000.phages_combined.fna \
    --virsorter2 DNA/3.viruses/1.identification/1.virsorter2/S${i}/S${i}-final-viral-combined.fa \
    --output DNA/3.viruses/2.perSample_derep/S${i}.viral_contigs.fna
done

#### Remove contigs < 3000 bp

In [None]:
for i in {1..9}; do
    seqmagick convert --min-length 3000 \
    DNA/3.viruses/2.perSample_derep/S${i}.viral_contigs.fna \
    DNA/3.viruses/2.perSample_derep/S${i}.viral_contigs.filt.fna
done
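
The length filter above can be sketched in plain Python to make the boundary explicit. This is an illustration only, not how seqmagick is implemented: it assumes that `--min-length 3000` discards sequences shorter than 3000 bp and keeps those of exactly 3000 bp. The helper names (`read_fasta`, `filter_min_length`) and the example contigs are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(records, min_length=3000):
    """Keep records whose sequence is at least min_length long (assumed inclusive)."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]


# Hypothetical example: one contig just under the threshold, one exactly on it.
example = io.StringIO(
    ">contig_a\n" + "A" * 2999 + "\n" + ">contig_b\n" + "C" * 3000 + "\n"
)
kept_ids = [header for header, _ in filter_min_length(read_fasta(example))]
print(kept_ids)
```

Only `contig_b` is retained in this toy input, because `contig_a` is one base short of the threshold.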

#### Quality assessment via CheckV

In [None]:
for i in {1..9}; do
    checkv end_to_end DNA/3.viruses/2.perSample_derep/S${i}.viral_contigs.filt.fna DNA/3.viruses/3.perSample_checkv/S${i} -t 16 --quiet
    # Concatenate output fasta files
    cat DNA/3.viruses/3.perSample_checkv/S${i}/viruses.fna DNA/3.viruses/3.perSample_checkv/S${i}/proviruses.fna > DNA/3.viruses/3.perSample_checkv/S${i}/viral_contigs.fna 
done

#### Filter virus contigs based on CheckV results

Retain contigs that meet the following criterion: (viral_genes > 0) OR (viral_genes = 0 AND host_genes = 0)

In [None]:
for i in {1..9}; do
    scripts/viruses.identification/checkv_filter_contigs.py \
    --checkv_dir_input DNA/3.viruses/3.perSample_checkv/S${i}/ \
    --output_prefix DNA/3.viruses/3.perSample_checkv/S${i}/viral_contigs
done
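
The retention criterion can be expressed as a small predicate. The sketch below is not the code of `checkv_filter_contigs.py`; it assumes the counts are read from the `viral_genes` and `host_genes` columns of CheckV's `quality_summary.tsv`, and the three example rows are invented for illustration.

```python
import csv
import io


def passes_gene_filter(viral_genes: int, host_genes: int) -> bool:
    """Return True if (viral_genes > 0) OR (viral_genes = 0 AND host_genes = 0)."""
    return viral_genes > 0 or (viral_genes == 0 and host_genes == 0)


# Hypothetical excerpt of a quality_summary.tsv (only the columns used here).
example_tsv = (
    "contig_id\tviral_genes\thost_genes\n"
    "contig_1\t5\t2\n"   # has viral genes: retained
    "contig_2\t0\t0\n"   # no viral and no host genes: retained
    "contig_3\t0\t3\n"   # host genes only: removed
)

rows = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
retained = [
    row["contig_id"]
    for row in rows
    if passes_gene_filter(int(row["viral_genes"]), int(row["host_genes"]))
]
print(retained)
```

In effect, the only contigs removed are those with host genes but no viral genes.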

#### Cluster vOTUs

Concatenate virus contigs from individual assemblies

In [None]:
cat DNA/3.viruses/3.perSample_checkv/S*/viral_contigs_filtered.fna > DNA/3.viruses/4.cluster_vOTUs/viral_contigs_allSamples.fna

Sort by sequence length, longest first (sortbyname.sh from BBMap)

In [None]:
sortbyname.sh in=DNA/3.viruses/4.cluster_vOTUs/viral_contigs_allSamples.fna \
out=DNA/3.viruses/4.cluster_vOTUs/viral_contigs_allSamples.sorted.fna length descending

Dereplicate virus contigs across assemblies into virus populations (viral operational taxonomic units; vOTUs), clustering at 95% identity (-i 95) over 85% coverage (-c 85).

*Note: The developers of Cluster_genomes_5.1.pl have since recommended using CheckV's anicalc and aniclust scripts for this step instead.*

In [None]:
Cluster_genomes_5.1.pl -t 20 -c 85 -i 95 \
-f DNA/3.viruses/4.cluster_vOTUs/viral_contigs_allSamples.sorted.fna \
-d /path/to/Software/mummer_v4.0.0/bin/
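
To illustrate what clustering at 95% identity over 85% coverage means in practice, here is a simplified greedy, longest-first clustering sketch. It is not the implementation of Cluster_genomes_5.1.pl (which derives identity and coverage from MUMmer alignments); the pairwise values are supplied by hand, and the function name, contig IDs, lengths, and hit values are all hypothetical.

```python
def greedy_cluster(lengths, hits, min_identity=95.0, min_coverage=85.0):
    """Greedy longest-first clustering on precomputed pairwise values.

    lengths: dict of contig ID -> length in bp
    hits:    dict of (member, representative) -> (% identity, % of member covered)
    Returns a dict of representative ID -> list of member IDs (representative first).
    """
    # Longest contigs are considered first, so they become cluster representatives.
    order = sorted(lengths, key=lambda contig: (-lengths[contig], contig))
    clusters = {}
    assigned = set()
    for rep in order:
        if rep in assigned:
            continue
        clusters[rep] = [rep]
        assigned.add(rep)
        for other in order:
            if other in assigned:
                continue
            identity, coverage = hits.get((other, rep), (0.0, 0.0))
            if identity >= min_identity and coverage >= min_coverage:
                clusters[rep].append(other)
                assigned.add(other)
    return clusters


# Hypothetical contigs and pairwise alignment summaries.
lengths = {"S1_c1": 40000, "S2_c7": 38000, "S3_c2": 12000}
hits = {
    ("S2_c7", "S1_c1"): (97.2, 91.0),  # passes both thresholds
    ("S3_c2", "S1_c1"): (96.0, 40.0),  # identity passes, coverage does not
}
clusters = greedy_cluster(lengths, hits)
print(clusters)
```

Here `S2_c7` joins the cluster represented by `S1_c1`, while `S3_c2` forms its own cluster because its coverage falls below 85% even though its identity exceeds 95%.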

Rename the dereplicated contig headers to vOTU_n using Python, and write a lookup table that maps each vOTU to its cluster representative contig ID

In [None]:
python3
from Bio.SeqIO.FastaIO import SimpleFastaParser

# Rename each cluster representative to vOTU_n (numbered in file order)
# and record the original contig ID in a tab-separated lookup table.
with open('DNA/3.viruses/4.cluster_vOTUs/viral_contigs_allSamples.sorted_95-85.fna', 'r') as read_fasta, \
     open('DNA/3.viruses/4.cluster_vOTUs/vOTUs.fna', 'w') as write_fasta, \
     open('DNA/3.viruses/4.cluster_vOTUs/vOTUs_lookupTable.txt', 'w') as write_table:
    write_table.write("vOTU\tcluster_rep_contigID\n")
    for i, (name, seq) in enumerate(SimpleFastaParser(read_fasta), start=1):
        write_table.write(f"vOTU_{i}\t{name}\n")
        write_fasta.write(f">vOTU_{i}\n{seq}\n")

quit()

#### Run CheckV on clustered vOTUs

In [None]:
checkv end_to_end DNA/3.viruses/4.cluster_vOTUs/vOTUs.fna DNA/3.viruses/5.checkv_vOTUs -t 16 --quiet
# Concatenate output fasta files
cat DNA/3.viruses/5.checkv_vOTUs/viruses.fna DNA/3.viruses/5.checkv_vOTUs/proviruses.fna > DNA/3.viruses/5.checkv_vOTUs/vOTUs.fna
# Modify CheckV-excised prophage contig headers
sed -i -e "s/\s/__excised_start_/g" -e "s/-/_end_/g" -e "s/\//_len_/g" -e "s/|/_/" -e "s/|//g" DNA/3.viruses/5.checkv_vOTUs/vOTUs.fna
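
The effect of the sed expressions is easier to see on a single header. The Python sketch below applies the same five substitutions in the same order. The example header is hypothetical and assumes that CheckV writes excised provirus headers in the form `<contig_id>_<n> <start>-<end>/<contig_length>`; the function name `rewrite_header` is likewise made up for illustration.

```python
import re


def rewrite_header(line: str) -> str:
    """Apply the same substitutions as the sed command, in the same order."""
    line = re.sub(r"\s", "__excised_start_", line)  # s/\s/__excised_start_/g
    line = line.replace("-", "_end_")               # s/-/_end_/g
    line = line.replace("/", "_len_")               # s/\//_len_/g
    line = line.replace("|", "_", 1)                # s/|/_/  (first pipe only)
    line = line.replace("|", "")                    # s/|//g  (remaining pipes)
    return line


# Hypothetical excised-provirus header (line passed without its trailing newline).
before = ">vOTU_12_1 1-35000/48000"
after = rewrite_header(before)
print(after)
# >vOTU_12_1__excised_start_1_end_35000_len_48000
```

The excision coordinates remain recoverable from the new header, which no longer contains whitespace, hyphens, slashes, or pipes.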

#### Filter by predicted completeness threshold (50%)

In [None]:
scripts/viruses.identification/filter_viruses_by_completeness.py \
-i DNA/3.viruses/5.checkv_vOTUs/vOTUs.fna \
-c DNA/3.viruses/5.checkv_vOTUs/quality_summary.tsv \
-t 50 \
-o DNA/3.viruses/5.checkv_vOTUs/vOTUs.completeness_50.fna
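
As a rough sketch of what a completeness filter involves, the snippet below selects contig IDs from the `contig_id` and `completeness` columns of CheckV's `quality_summary.tsv`. It is not the code of `filter_viruses_by_completeness.py`: treating the threshold as inclusive (>= 50) and dropping contigs with no completeness estimate (`NA`) are assumptions made here, and the example rows are invented.

```python
import csv
import io


def ids_meeting_completeness(tsv_handle, threshold=50.0):
    """Return contig IDs whose estimated completeness is >= threshold.

    Assumption: rows without an estimate ("NA" or empty) are dropped.
    """
    keep = []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        value = row["completeness"]
        if value in ("", "NA"):
            continue
        if float(value) >= threshold:
            keep.append(row["contig_id"])
    return keep


# Hypothetical excerpt of a quality_summary.tsv (only the columns used here).
example_tsv = (
    "contig_id\tcompleteness\n"
    "vOTU_1\t92.5\n"
    "vOTU_2\tNA\n"
    "vOTU_3\t50.0\n"
    "vOTU_4\t12.3\n"
)
kept = ids_meeting_completeness(io.StringIO(example_tsv), threshold=50.0)
print(kept)
```

With these assumptions, `vOTU_1` and `vOTU_3` are kept; check how the actual script treats the boundary value and missing estimates before relying on either behaviour.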

***