# 3.5 Inferred *Caudoviricetes* phylogeny

## Software and versions used in this study

- CheckV v0.7.0
- prodigal-gv v2.9.0
- HMMER v3.3.2
- Clustal-Omega v1.2.4
- IQ-TREE v2.2.2.2

## Additional custom scripts

Note: custom scripts have been tested in Python v3.11.6 and R v4.2.1 and may not be stable in other versions.

- scripts/viruses.caudo_phylogeny/filter_refseq_by_taxonomy.py
- scripts/viruses.caudo_phylogeny/identify_core_genes.py
- scripts/viruses.caudo_phylogeny/collate_core_genes.py
- scripts/viruses.caudo_phylogeny/concatenate_protein_alignments.py

*Required Python packages: pandas, numpy, Bio (Biopython); argparse, re, glob, and os are part of the Python standard library*

## Databases used

- viralRefSeq v223
- VOGdb v222

***

## *Caudoviricetes* "core genes" phylogeny

Workflow for inferring *Caudoviricetes* phylogeny via concatenated protein alignments of putative single-copy core genes.

For this study, trees of inferred *Caudoviricetes* phylogeny were generated via concatenated protein alignments of putative "core genes", based on the method described in Low *et al.*, 2019 (doi: 10.1038/s41564-019-0448-z).

Notes:

- Previous analyses (Low *et al.*, 2019) used the 2017 version of VOG, and the VOG IDs identified there are no longer compatible with the latest version.
- In brief, the workflow below:
  - re-identifies putative single-copy core genes in *Caudoviricetes* viruses, broadly following the method of Low *et al.* (2019), using the latest viralRefSeq references and the latest VOG database
  - extracts the identified putative core genes from viralRefSeq references, Waiwera vOTUs, and IMG/VR vOTUs
  - generates filtered concatenated core gene protein alignments for all included *Caudoviricetes* viruses
  - builds trees to infer phylogeny and for visualisation in iTOL

Analysed data included: Waiwera vOTUs; high-quality sequences from the IMG/VR database; viralRefSeq references

***

## 1. Identify putative core genes

*Caudoviricetes* "core genes" were identified using the viral RefSeq database (v223), based on the method described in Low *et al.*, 2019 (doi: 10.1038/s41564-019-0448-z).

Note: a subsampled set of viralRefSeq *Caudoviricetes* genomes (*n*=50) is provided for basic workflow testing (*data/refseq.Caudoviricetes.n50.genomic.fna*). Runtimes provided are based on this *n*=50 test set.

#### Filter RefSeq by taxonomy string (*Caudoviricetes*)

Note: for workflow testing, a subset of viralRefSeq *Caudoviricetes* genomes (*n*=50) is provided to use in place of *viral.1.genomic.gbff* below: *data/refseq.Caudoviricetes.n50.genomic.fna*


In [None]:
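# Create both output directories used below (filter output: 0.core_genes; downstream steps: 0.core_genes.RefSeq)
mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes \
DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq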

module purge
module load Python/3.11.6-foss-2023a

scripts/viruses.caudo_phylogeny/filter_refseq_by_taxonomy.py \
-i Databases/viralRefSeq/viral.1.genomic.gbff \
-t Caudoviricetes \
-o DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes


#### CheckV on viralRefSeq *Caudoviricetes* references

In [None]:
mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/0.checkv_out

# Run main analyses 
checkv end_to_end -t 16 --quiet \
DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes/Caudoviricetes.genomic.fna \
DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/0.checkv_out

*n=50 runtime < 3 min, MaxRSS < 3 GB (Full viralRefSeq: <40 hr, <8 GB)*

#### Predict genes via prodigal-gv 

Notes:

- If you are already running DRAMv for annotations, you can use the genes.faa file generated by DRAMv (which runs prodigal-gv in the background) rather than running prodigal-gv separately here.
- For a large dataset (e.g. *Caudoviricetes* from the full viralRefSeq or IMG/VR), you can speed up this step by first splitting the data into equal parts via BBMap's `partition.sh` and running the parts as a Slurm array.

In [None]:
mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/1.prodigal_gv

prodigal-gv -p meta -q \
-i DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes/Caudoviricetes.genomic.fna \
-f "gff" -o DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/1.prodigal_gv/Caudoviricetes.genomic.prod.gff \
-a DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/1.prodigal_gv/Caudoviricetes.genomic.prod.faa \
-d DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/1.prodigal_gv/Caudoviricetes.genomic.prod.fna


*n=50 runtime ~1 min*


#### hmmsearch of VOGdb HMMs against RefSeq references

Notes: 

- requires downloaded VOGdb file *vog.hmm.tar.gz* (Compressed archive of the HMMER3 compatible Hidden Markov Models obtained from the multiple sequence alignments for each VOG)
- For large datasets, you can speed up the process by splitting the initial sequences (e.g. via BBMap's `partition.sh`), running prodigal-gv on the subsets, then running hmmsearch on the individual prodigal-gv genes.faa files as a Slurm array
  - In this case, you need to set `-Z` in the hmmsearch command based on the *total number of protein sequences* (i.e. sum the count of protein sequences in all subset prodigal-gv genes.faa files)
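
As a minimal sketch of the `-Z` calculation described above, the snippet below counts FASTA header lines across all subset protein files. The function name and the example file pattern are hypothetical and are not part of the provided scripts.

```python
import glob


def count_fasta_records(paths):
    """Return the total number of FASTA records (lines starting with '>') across files."""
    total = 0
    for path in paths:
        with open(path) as handle:
            total += sum(1 for line in handle if line.startswith(">"))
    return total


if __name__ == "__main__":
    # Hypothetical location and naming of the per-subset prodigal-gv protein files
    subset_files = sorted(glob.glob("1.prodigal_gv/subset_*.prod.faa"))
    # The printed value is what would be passed to hmmsearch via -Z for every array task
    print(f"-Z {count_fasta_records(subset_files)}")
```

Using the same total for every subset keeps E-values comparable with a single run over the full protein set.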

In [None]:
mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/2.VOGdb_hmmsearch

hmmsearch -E 1e-3 --cpu 24 \
--tblout DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/2.VOGdb_hmmsearch/Caudoviricetes.genomic.vogdb \
--domtblout DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/2.VOGdb_hmmsearch/Caudoviricetes.genomic.domain_hits.vogdb \
Databases/vogdb_v222/vogdb_all_hmm.hmm \
DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/1.prodigal_gv/Caudoviricetes.genomic.prod.faa > /dev/null


*n=50 runtime 20 min, MaxRSS < 0.5 GB (Full viralRefSeq: <16 hr, <8 GB; Waiwera vOTUs: <12 hr, <4 GB)*
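
For orientation, the `--tblout` file written above is a whitespace-delimited table with `#` comment lines, in which the target is the protein sequence and the query is the VOG HMM. The sketch below shows one way such a table could be read. The function names are hypothetical, keeping only the best-scoring VOG per protein is an assumption rather than the documented behaviour of *identify_core_genes.py*, and the genome ID is recovered on the assumption that prodigal-gv names proteins `<contig>_<gene number>`.

```python
def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into dicts (target = protein, query = VOG HMM)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        # 18 fixed columns followed by a free-text description
        fields = line.split(maxsplit=18)
        hits.append({
            "protein_id": fields[0],
            "vog_id": fields[2],
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
        })
    return hits


def genome_id_from_protein(protein_id):
    """Strip the trailing '_<gene number>' added by prodigal-style gene callers (assumption)."""
    return protein_id.rsplit("_", 1)[0]


def best_hit_per_protein(hits):
    """Keep the highest-scoring VOG hit for each protein (an assumed tie-breaking rule)."""
    best = {}
    for hit in hits:
        current = best.get(hit["protein_id"])
        if current is None or hit["score"] > current["score"]:
            best[hit["protein_id"]] = hit
    return best
```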

#### Assess *Caudoviricetes* viralRefSeq VOGdb gene hits for putative single copy core genes

Note: requires downloaded VOGdb file *vog.annotations.tsv*

Method:

- References are filtered for >= x% completeness as predicted by CheckV (set via `-t`; default = 95)
- putative core genes selected based on the following criteria:
  - present in >= 10% of reference virus genomes
  - average gene copy number <= 1.2
  - average predicted protein length > 100 amino acid residues
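
The three selection criteria above can be sketched as follows. This is an illustration only, not the code of *identify_core_genes.py*: the function name and input layout are hypothetical, and it assumes that the average copy number is taken over the genomes in which a VOG is present.

```python
from collections import defaultdict


def select_core_genes(hits, n_genomes, min_prevalence=0.10,
                      max_mean_copies=1.2, min_mean_length=100):
    """Select putative core genes from (genome_id, vog_id, protein_length) tuples.

    `hits` should only include genomes that already passed the completeness filter;
    `n_genomes` is the number of such genomes.
    """
    copies = defaultdict(lambda: defaultdict(int))  # vog -> genome -> copy count
    lengths = defaultdict(list)                     # vog -> protein lengths (aa)
    for genome_id, vog_id, protein_length in hits:
        copies[vog_id][genome_id] += 1
        lengths[vog_id].append(protein_length)

    selected = {}
    for vog_id, per_genome in copies.items():
        prevalence = len(per_genome) / n_genomes
        # Assumption: mean copy number is computed over genomes carrying the VOG
        mean_copies = sum(per_genome.values()) / len(per_genome)
        mean_length = sum(lengths[vog_id]) / len(lengths[vog_id])
        if (prevalence >= min_prevalence
                and mean_copies <= max_mean_copies
                and mean_length > min_mean_length):
            selected[vog_id] = {
                "prevalence": prevalence,
                "mean_copies": mean_copies,
                "mean_length": mean_length,
            }
    return selected
```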


In [None]:
mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/3.predict_core_genes

scripts/viruses.caudo_phylogeny/identify_core_genes.py \
-v DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/2.VOGdb_hmmsearch/Caudoviricetes.genomic.vogdb \
-a Databases/vogdb_v222/vog.annotations.tsv \
-p DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/1.prodigal_gv/Caudoviricetes.genomic.prod.faa \
-c DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/0.checkv_out/quality_summary.tsv \
-t 95 \
-o DNA/3.viruses/8.Caudoviricetes_phylogeny/0.core_genes.RefSeq/3.predict_core_genes


*n=50 runtime ~1 s*

***

## 2. *Caudoviricetes* phylogeny inference via concatenated protein alignments of putative core genes

Overview: 

- Identify protein sequences for putative *Caudoviricetes* "core genes" in all datasets of interest
- Generate filtered concatenated protein alignments
- Build and visualise tree

Concatenated alignments and filtering are based on the method described in Low *et al.*, 2019 (doi: 10.1038/s41564-019-0448-z).

For this study, analysed data included: Waiwera vOTUs; high-quality sequences from the IMG/VR database; viralRefSeq references. 

#### Collate core genes for viralRefSeq references

Notes: 

- Generates protein sequence files for each "core gene" (based on VOG ID)
- `-t` sets the minimum threshold of completeness (predicted via CheckV) for a genome to be included
- The reference list of core genes that were identified in this study are provided in `data/refseq.Caudoviricetes.core_genes.vogdb.annotations.tsv` (IDs based on VOGdb v222)
  - Note: for the IDs to match appropriately, all annotations must be generated based on same version of VOGdb used to identify "core genes"
- To include other sequences (i.e. your own data or other database sequences, e.g. IMG/VR):
  - First run CheckV, prodigal-gv, and the VOGdb hmmsearch on those sequences (as per above for the RefSeq sequences), then run each dataset separately through *collate_core_genes.py*, then concatenate the resulting .faa files together for each VOG ID.
  - Alternatively, combine the datasets and run them as one dataset through CheckV, prodigal-gv, and the VOGdb hmmsearch, then run them as one batch through *collate_core_genes.py*.
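
To illustrate what the `-t` completeness threshold does, the sketch below reads a CheckV `quality_summary.tsv` and returns the genomes at or above a cut-off. It assumes the `contig_id` and `completeness` columns of that file and treats `NA` (completeness not determined) as failing. The function name is hypothetical, and whether *collate_core_genes.py* applies `>=` or `>` at the boundary is an assumption.

```python
import csv


def genomes_passing_completeness(quality_summary_path, min_completeness=85.0):
    """Return the set of contig IDs whose CheckV completeness is >= min_completeness."""
    passing = set()
    with open(quality_summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            value = row["completeness"]
            if value in ("NA", ""):
                continue  # completeness could not be estimated for this contig
            if float(value) >= min_completeness:
                passing.add(row["contig_id"])
    return passing
```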


In [None]:
# This example assumes three sets of files have been previously generated based on: viralRefSeq references; IMG/VR vOTUs; Waiwera vOTUs
for dataset in viralRefseq imgvr Ww_vOTUs; do
    mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/1.core_genes.faa_files.${dataset}
    scripts/viruses.caudo_phylogeny/collate_core_genes.py \
    -r data/refseq.Caudoviricetes.core_genes.vogdb.annotations.tsv \
    -v DNA/3.viruses/8.Caudoviricetes_phylogeny/${dataset}/VOGdb_hmmsearch/${dataset}.vogdb \
    -p DNA/3.viruses/8.Caudoviricetes_phylogeny/${dataset}/prodigal_gv/${dataset}.prod.faa \
    -c DNA/3.viruses/8.Caudoviricetes_phylogeny/${dataset}/checkv_out/quality_summary.tsv \
    -t 85 \
    -o DNA/3.viruses/8.Caudoviricetes_phylogeny/1.core_genes.faa_files.${dataset}
done

mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/1.core_genes.faa_files.combined
# concatenate faa files from each data subset above into one set: DNA/3.viruses/8.Caudoviricetes_phylogeny/1.core_genes.faa_files.combined
# note: the end results should be a single *set* of protein sequence .faa files; one file per "core gene" VOG ID.

*n=50 runtime ~20 s*
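
The final concatenation step is left as a comment in the cell above. One possible sketch is given below, assuming that *collate_core_genes.py* writes one `.faa` file per VOG ID and uses the same file name for a given VOG ID in every dataset directory. That naming is an assumption, and the function name is hypothetical.

```python
import glob
import os
from collections import defaultdict


def combine_faa_by_name(dataset_dirs, combined_dir):
    """Merge same-named .faa files from several directories into one file per name."""
    sources = defaultdict(list)
    for directory in dataset_dirs:
        for path in sorted(glob.glob(os.path.join(directory, "*.faa"))):
            sources[os.path.basename(path)].append(path)

    os.makedirs(combined_dir, exist_ok=True)
    for name, paths in sources.items():
        # Opening with "w" means re-running the step overwrites rather than appends
        with open(os.path.join(combined_dir, name), "w") as out:
            for path in paths:
                with open(path) as handle:
                    text = handle.read()
                # Guard against a missing final newline fusing two records
                if text and not text.endswith("\n"):
                    text += "\n"
                out.write(text)
    return sorted(sources)
```

For the example above, `dataset_dirs` would be the three `1.core_genes.faa_files.${dataset}` directories and `combined_dir` the `1.core_genes.faa_files.combined` directory.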

#### Protein alignments for each marker gene

Note: protein alignments at this step are independent for each "core gene", so can be run in parallel (e.g. slurm array) rather than a loop for large datasets.
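
Where a Slurm array is not available, the same independence can be exploited on a single node. The sketch below is an alternative to the loop, not the method used in this study: it builds the same `clustalo` command for each file and runs several at once with a thread pool. The function names and the worker and thread counts are hypothetical, and `clustalo` is assumed to be on the `PATH`.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def clustalo_command(faa_path, out_dir, threads=2):
    """Build the clustalo command for one core gene .faa file."""
    out_id = os.path.splitext(os.path.basename(faa_path))[0]
    log_path = os.path.join(out_dir, f"aln.{out_id}.log")
    return [
        "clustalo", f"--threads={threads}", "--force", "--outfmt=fa",
        "-i", faa_path,
        "-o", os.path.join(out_dir, f"aln.{out_id}.faa"),
        f"--log={log_path}",
    ]


def align_all(faa_paths, out_dir, workers=4, threads=2):
    """Run one clustalo process per input file, `workers` at a time."""
    os.makedirs(out_dir, exist_ok=True)
    commands = [clustalo_command(path, out_dir, threads) for path in faa_paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # check=True raises CalledProcessError if any alignment fails
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
```

Keep `workers * threads` at or below the number of available cores.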

In [None]:
mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/2.core_genes.alignments

for file in DNA/3.viruses/8.Caudoviricetes_phylogeny/1.core_genes.faa_files.combined/*.faa; do
    echo "Running protein alignment: ${file}"
    # Quote variables so paths containing spaces are handled correctly
    out_id=$(basename "${file}" .faa)
    clustalo --threads=8 --force --outfmt=fa \
    -i "${file}" \
    -o DNA/3.viruses/8.Caudoviricetes_phylogeny/2.core_genes.alignments/aln.${out_id}.faa \
    --log=DNA/3.viruses/8.Caudoviricetes_phylogeny/2.core_genes.alignments/aln.${out_id}.log
done


#### Concatenate protein alignments

Alignment filtering is based on the method described in Low *et al.*, 2019 (doi: 10.1038/s41564-019-0448-z).

In brief:

- marker MSAs individually trimmed by removing columns represented in <50% of taxa
- individual alignments concatenated by introducing gaps in positions where markers were absent from a genome
- concatenated MSA further filtered to remove genomes with <5% amino acid representation of the total alignment length
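
The three filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the code of *concatenate_protein_alignments.py*: each alignment is represented as a dict keyed by genome ID (one sequence per genome), the 50% column threshold is applied relative to the taxa present in that marker's alignment, and all function names are hypothetical.

```python
GAP = "-"


def trim_columns(alignment, min_fraction=0.5):
    """Drop alignment columns with residues in fewer than min_fraction of sequences."""
    if not alignment:
        return {}
    seqs = list(alignment.values())
    n_seqs = len(seqs)
    keep = [
        i for i in range(len(seqs[0]))
        if sum(seq[i] != GAP for seq in seqs) / n_seqs >= min_fraction
    ]
    return {gid: "".join(seq[i] for i in keep) for gid, seq in alignment.items()}


def concatenate(alignments):
    """Concatenate marker alignments, gap-filling markers absent from a genome."""
    genomes = sorted({gid for aln in alignments for gid in aln})
    parts = {gid: [] for gid in genomes}
    for aln in alignments:
        width = len(next(iter(aln.values()))) if aln else 0
        for gid in genomes:
            parts[gid].append(aln.get(gid, GAP * width))
    return {gid: "".join(chunks) for gid, chunks in parts.items()}


def drop_sparse_genomes(concatenated, min_fraction=0.05):
    """Remove genomes with residues in fewer than min_fraction of alignment positions."""
    return {
        gid: seq for gid, seq in concatenated.items()
        if seq and sum(c != GAP for c in seq) / len(seq) >= min_fraction
    }
```

In use, each marker alignment would be passed through `trim_columns`, the trimmed alignments through `concatenate`, and the result through `drop_sparse_genomes` before tree building.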

In [None]:
mkdir -p DNA/3.viruses/8.Caudoviricetes_phylogeny/3.concatenated_alignment

scripts/viruses.caudo_phylogeny/concatenate_protein_alignments.py \
-r data/refseq.Caudoviricetes.core_genes.vogdb.annotations.tsv \
-a DNA/3.viruses/8.Caudoviricetes_phylogeny/2.core_genes.alignments \
-o DNA/3.viruses/8.Caudoviricetes_phylogeny/3.concatenated_alignment


*n=50 runtime ~10 s*

#### Build tree

In [None]:
cd DNA/3.viruses/8.Caudoviricetes_phylogeny/3.concatenated_alignment

iqtree2 -T 32 -m TEST -B 1000 -s concatenated_alignment.faa

*n=50 runtime 1 hr, MaxRSS < 1 GB*

#### Visualise tree

Visualise the consensus tree file (e.g. in iTOL): `concatenated_alignment.faa.contree`

***