# 3.7 Virus-Host matching

## Software and versions used in this study

- crass v1.0.1
- blast v2.13.0

## Additional custom scripts

Note: custom scripts have been tested in Python v3.11.6 and R v4.2.1 and may not be stable in other versions.

- scripts/viruses.general/hostmatch_crispr_summary_table.py

*Required Python packages: pandas, numpy (argparse, re, and glob are also imported, but are part of the Python standard library)*

***

## Virus-host matching

For this study, putative prokaryote hosts for vOTUs were predicted using a combination of approaches: CRISPR-spacer, tRNA, and genome homology via pairwise blast searches (BLAST v2.13.0); analysis of vOTUs co-binned with prokaryote MAGs; oligonucleotide frequency similarity via VirHostMatcher v1.0.0; and the machine learning-based methods RaFAH v0.3 and HostG (accessed 06 Dec 2021).

As fine-scale information (e.g. CRISPR spacers, tRNA) might be lost (or reduced to one representative) during MAG dereplication, the sets of prokaryote MAGs and viral contigs prior to dereplication across assemblies (i.e. the pre-dRep MAGs and pre-vOTU-clustering viral contigs) were used for these analyses, with final results then summarised by representative (post-dRep) MAGs and (post-vOTU-clustering) vOTUs.

Results were frequently inconsistent between the different methods, in some cases up to the rank of phylum, and ultimately only results based on CRISPR-spacer matches were presented.

## Predict prokaryote hosts of vOTUs via CRISPR-spacer matching

#### Prep: Concatenate MAGs into one fna file 

(Note: for downstream analyses, it is helpful to ensure that unique MAG IDs are incorporated into the start of contig IDs for each MAG.)

In [None]:
mkdir -p DNA/3.viruses/10.host_prediction/crispr/mags
cat DNA/2.prokaryote_mags/2.bin_dereplication_DAS_Tool/DASTool_All_bins/*.fa > DNA/3.viruses/10.host_prediction/crispr/mags/all.hosts.fna
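
The note above about incorporating MAG IDs into contig IDs could be handled at concatenation time. The following is a minimal sketch of one way to do this, assuming each `.fa` file name (minus the extension) is the MAG ID. The function name `concat_mags_with_ids` and the `__` separator are illustrative choices and are not part of the original workflow.

```shell
# Sketch only: prepend each MAG's ID (taken from its file name) to the
# contig headers of that MAG while concatenating all MAGs into one file.
# The "__" separator is a hypothetical choice; any unambiguous string works.
concat_mags_with_ids() {
    local bin_dir="$1" out_fna="$2" fa mag_id
    : > "${out_fna}"                      # start from an empty output file
    for fa in "${bin_dir}"/*.fa; do
        [ -e "${fa}" ] || continue        # skip if the glob matched nothing
        mag_id=$(basename "${fa}" .fa)
        # Rewrite header lines only; sequence lines pass through unchanged
        awk -v id="${mag_id}" '/^>/ { sub(/^>/, ">" id "__") } { print }' \
            "${fa}" >> "${out_fna}"
    done
}

# Example call (paths as used in this section):
# concat_mags_with_ids \
#     DNA/2.prokaryote_mags/2.bin_dereplication_DAS_Tool/DASTool_All_bins \
#     DNA/3.viruses/10.host_prediction/crispr/mags/all.hosts.fna
```

If contig IDs already carry the MAG ID, the plain `cat` shown above is sufficient.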

#### Identify CRISPR spacers from quality-filtered sequencing reads via *crass*

In [None]:
mkdir -p DNA/3.viruses/10.host_prediction/crispr

# Identify spacers
for i in {1..9}; do
    mkdir -p DNA/3.viruses/10.host_prediction/crispr/S${i}
    /path/to/crass/bin/crass \
    DNA/1.Qual_filtered_trimmomatic/S${i}_R1.fastq \
    DNA/1.Qual_filtered_trimmomatic/S${i}_R2.fastq \
    -o DNA/3.viruses/10.host_prediction/crispr/S${i}/
done
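
Before moving on to spacer extraction, it can be useful to confirm that each sample produced a non-empty `crass.crispr` file (the file read in the extraction step of this workflow). The following is a minimal sketch of such a check; the function name `check_crass_outputs` is hypothetical.

```shell
# Sketch only: report samples whose crass output directory lacks a
# non-empty crass.crispr file. Returns non-zero if any are missing.
check_crass_outputs() {
    local base_dir="$1" s missing=0
    shift
    for s in "$@"; do
        if [ ! -s "${base_dir}/${s}/crass.crispr" ]; then
            echo "missing or empty: ${base_dir}/${s}/crass.crispr" >&2
            missing=$((missing + 1))
        fi
    done
    [ "${missing}" -eq 0 ]
}

# Example call for the nine samples used here:
# check_crass_outputs DNA/3.viruses/10.host_prediction/crispr S{1..9}
```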

#### Extract CRISPR spacer sequences

In [None]:
mkdir -p DNA/3.viruses/10.host_prediction/crispr/spacer_seqs

# Extract spacers
for i in {1..9}; do
    /path/to/crass/bin/crisprtools extract \
    -o DNA/3.viruses/10.host_prediction/crispr/ \
    -O DNA/3.viruses/10.host_prediction/crispr/S${i}/ \
    -s DNA/3.viruses/10.host_prediction/crispr/S${i}/crass.crispr \
    > DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/S${i}_spacers.fna
done

# Add sample info to sequence headers of the *_spacers.fna files
# (anchored to line start so that only the leading ">" of each header is modified)
for i in {1..9}; do
    sed -i "s/^>/>S${i}_/" DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/S${i}_spacers.fna
done

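# Concatenate per-sample spacer sequences
# (the S*_spacers.fna pattern keeps all_spacer_seqs.fna itself out of the input on re-runs)
cat DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/S*_spacers.fna \
> DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/all_spacer_seqs.fna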
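
As a quick sanity check on the concatenated file, the number of spacers contributed by each sample can be tallied from the sample prefix added to the headers. This is a minimal sketch, assuming every header starts with the sample ID followed by an underscore (as added by the `sed` step above); the function name `count_spacers_per_sample` is hypothetical.

```shell
# Sketch only: print "<sample><TAB><number of spacer sequences>" per sample,
# using the text between ">" and the first "_" of each header as the sample ID.
count_spacers_per_sample() {
    awk '/^>/ { split(substr($0, 2), parts, "_"); n[parts[1]]++ }
         END  { for (s in n) print s "\t" n[s] }' "$1" | sort
}

# Example call:
# count_spacers_per_sample \
#     DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/all_spacer_seqs.fna
```

A sample that is absent from the output contributed no spacers, which may be worth checking against the crass output for that sample.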

#### Search CRISPR spacer matches in MAGs and vOTUs via *blastn*

In [None]:
# Build blast database from the concatenated spacer sequences
makeblastdb -dbtype nucl \
-in DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/all_spacer_seqs.fna \
-out DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/all_spacer_seqs.fna

# Search MAG contigs (query) against the spacer database
blastn -num_threads 12 -dust no -word_size 7 \
-query DNA/3.viruses/10.host_prediction/crispr/mags/all.hosts.fna \
-db DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/all_spacer_seqs.fna \
-outfmt "6 qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs" \
-out DNA/3.viruses/10.host_prediction/crispr/blastn_crisprSpacers.mags.txt

# Search vOTUs (query) against the spacer database
blastn -num_threads 12 \
-query DNA/3.viruses/5.checkv_vOTUs/vOTUs.fna \
-db DNA/3.viruses/10.host_prediction/crispr/spacer_seqs/all_spacer_seqs.fna \
-dust no -word_size 7 \
-outfmt "6 qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs" \
-out DNA/3.viruses/10.host_prediction/crispr/blastn_crisprSpacers.votus.txt
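
To get a first impression of the blast tables before compiling the summary, hits can be restricted to those covering the full spacer length. The following is a minimal sketch, assuming the 15-column output format used above (column 4 = `slen`, i.e. spacer length; column 6 = alignment `length`; column 7 = `mismatch`; column 8 = `gapopen`). It is one plausible reading of a "full-length match" and is not taken from *hostmatch_crispr_summary_table.py*; the function name `filter_full_length_hits` is hypothetical.

```shell
# Sketch only: keep hits where the alignment spans the whole spacer,
# contains no gaps, and has at most <max_mismatch> mismatches (default 0).
filter_full_length_hits() {
    local blast_tsv="$1" max_mismatch="${2:-0}"
    awk -F'\t' -v m="${max_mismatch}" \
        '$6 == $4 && $7 <= m && $8 == 0' "${blast_tsv}"
}

# Example call:
# filter_full_length_hits \
#     DNA/3.viruses/10.host_prediction/crispr/blastn_crisprSpacers.votus.txt 0 | wc -l
```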


#### Compile CRISPR-spacer host matching summary results table via script *hostmatch_crispr_summary_table.py*

Note: 

- blast results (as above) must be in the following format: `"6 qseqid qlen sseqid slen pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"`
- `-m 0` sets the number of mismatches allowed over the full length of the spacer sequence (default = 0)
- `-b 2000` sets the maximum number of blast hits to return per vOTU, ranked by lowest e-value (default = 100)
- `-t 2` sets the minimum number of *distinct* spacer matches required between an individual virus and host for the virus-host pair to be retained (e.g. with `-t 2`, virus-host pairs that share fewer than 2 unique spacer sequences are excluded)

In [None]:
scripts/viruses.general/hostmatch_crispr_summary_table.py \
-m 0 -b 2000 -t 2 \
-p DNA/3.viruses/10.host_prediction/crispr/blastn_crisprSpacers.mags.txt \
-v DNA/3.viruses/10.host_prediction/crispr/blastn_crisprSpacers.votus.txt \
-o DNA/3.viruses/10.host_prediction/crispr/host_matching.crispr.summary_table.tsv
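
To illustrate the idea behind the `-t` threshold, the sketch below counts the distinct spacers shared by each virus-host pair from the two blast tables and keeps pairs at or above a minimum. It is not the logic of *hostmatch_crispr_summary_table.py* (which is not shown here) and ignores the `-m` and `-b` filters. It assumes that column 1 is the contig or vOTU ID, that column 3 is the spacer ID, and that the MAG ID is the part of the contig ID before a hypothetical `__` separator. The function name `count_shared_spacers` is also hypothetical.

```shell
# Sketch only: print "<vOTU><TAB><MAG><TAB><distinct shared spacers>" for
# virus-host pairs sharing at least <min_shared> spacers (default 2).
# Both input tables are expected to be non-empty.
count_shared_spacers() {
    awk -F'\t' -v t="${3:-2}" '
        # Pass 1 (MAG hits): record each distinct spacer-MAG combination once
        NR == FNR {
            split($1, p, "__")
            if (!(($3, p[1]) in seen_host)) {
                seen_host[$3, p[1]] = 1
                hosts[$3] = hosts[$3] (hosts[$3] == "" ? "" : SUBSEP) p[1]
            }
            next
        }
        # Pass 2 (vOTU hits): for each distinct spacer-vOTU combination,
        # credit one shared spacer to every MAG that carries the spacer
        ($3 in hosts) && !(($3, $1) in seen_virus) {
            seen_virus[$3, $1] = 1
            n = split(hosts[$3], h, SUBSEP)
            for (i = 1; i <= n; i++) shared[$1 "\t" h[i]]++
        }
        END {
            for (pair in shared)
                if (shared[pair] >= t) print pair "\t" shared[pair]
        }' "$1" "$2" | sort
}

# Example call:
# count_shared_spacers \
#     DNA/3.viruses/10.host_prediction/crispr/blastn_crisprSpacers.mags.txt \
#     DNA/3.viruses/10.host_prediction/crispr/blastn_crisprSpacers.votus.txt 2
```

Repeated hits of the same spacer to the same MAG or vOTU are counted once, which is what makes the threshold apply to *distinct* spacers rather than to raw blast hits.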

***