# 10_interesting_virus_host_pairs

This document explores which virus-host pairs are the most interesting, based on the strength of the connection, the taxonomy of the host (biogeochemical significance), and the quality of the viral sequence/MAG.

## Load packages and data

In [1]:
import pandas as pd
import os
import sys
import csv
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import glob
import seaborn as sns
from collections import Counter

pd.set_option('display.max_columns', None)

sags = pd.read_csv("~/Documents/Bigelow/Virus_Project/OMZ_MH_Analysis/Data/sag_data/MPvsag_info_230818.csv")
# rename the GTDB-Tk classification column
sags.rename(columns={'classification_via_GTDBTk': 'classification'}, inplace=True)
# split the classification into one column per taxonomic level
tax_levels = ['domain', 'phyla', 'class', 'order', 'family', 'genus', 'species']
sags[tax_levels] = sags['classification'].str.split(';', expand=True)
# strip the rank prefix (d__, p__, c__, ...) in front of all observations
for level in tax_levels:
    sags[level] = sags[level].str.replace(f'{level[0]}__', '', regex=False)

vMAGs = pd.read_csv('~/Documents/Bigelow/Virus_Project/OMZ_MH_Analysis/Data/proximeta_viral_files/vMAG_associations.csv')

## SAGs

In [None]:
len(sags)

### Criteria: Taxonomy

**Orders with more SAGs than vMAGs:**
+ Flavobacteriales
+ Woesearchaeales
+ Pelagibacterales
+ Marinisomatales
+ Rhodospirillales
+ Phycisphaerales
+ Pedosphaerales
+ Dehalococcoidales
+ Arenicellales
+ MGIII
+ SAR324
+ HIMB59
+ PS1

In [2]:
sag_abund_list = ['Flavobacteriales', 'Woesearchaeales', 'Pelagibacterales', 'Marinisomatales', 'Rhodospirillales', 'Phycisphaerales', 'Pedosphaerales', 
                  'Dehalococcoidales', 'Arenicellales', 'MGIII', 'SAR324', 'HIMB59', 'PS1']
sag_abund = sags[sags['order'].isin(sag_abund_list)]
len(sag_abund)

33
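The order list above is entered by hand. As a minimal sketch of how such a list could be derived from per-order counts, the example below compares two small made-up tables. The toy rows and the helper name `orders_with_more` are illustrative assumptions, not part of this project's data or code.

```python
import pandas as pd

def orders_with_more(left, right, col="order"):
    """Return the orders that occur more often in `left` than in `right`."""
    left_counts = left[col].value_counts()
    right_counts = right[col].value_counts()
    # align on the union of orders; an order missing from one table counts as 0
    diff = left_counts.sub(right_counts, fill_value=0)
    return sorted(diff[diff > 0].index)

# toy stand-ins for the `sags` and `vMAGs` tables (hypothetical rows)
toy_sags = pd.DataFrame({"order": ["Pelagibacterales"] * 3 + ["Pirellulales"]})
toy_vmags = pd.DataFrame({"order": ["Pirellulales"] * 2 + ["Pelagibacterales"]})

sag_heavy = orders_with_more(toy_sags, toy_vmags)
vmag_heavy = orders_with_more(toy_vmags, toy_sags)
```

Because missing orders are filled with 0, an order that appears in only one table is still assigned to that table's list.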

### Criteria: Length

In [3]:
long_sags = sag_abund[sag_abund['contig_length'] >= 15000]
len(long_sags)

27

### Criteria: Number of Viral Genes

In [4]:
sag_viral_genes = long_sags[long_sags['viral_genes'] != 0]
sag_viral_genes = sag_viral_genes.sort_values(by='viral_genes',ascending=False)
len(sag_viral_genes)

21
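The three criteria (taxonomy, length, number of viral genes) are applied one cell at a time. A hedged sketch of the same sequence wrapped in a single function follows; the function name `filter_candidates` and the toy rows are hypothetical, and the 15000 threshold simply mirrors the cells above.

```python
import pandas as pd

def filter_candidates(df, orders, length_col, min_length=15000):
    """Apply the taxonomy, length, and viral-gene filters in sequence."""
    kept = df[df["order"].isin(orders)]             # criteria: taxonomy
    kept = kept[kept[length_col] >= min_length]     # criteria: length
    kept = kept[kept["viral_genes"] != 0]           # criteria: viral genes
    return kept.sort_values(by="viral_genes", ascending=False)

# hypothetical rows, only to exercise the function
toy = pd.DataFrame({
    "order": ["PS1", "PS1", "SAR324", "Other"],
    "contig_length": [40000, 9000, 20000, 50000],
    "viral_genes": [19, 5, 0, 30],
})
result = filter_candidates(toy, ["PS1", "SAR324"], length_col="contig_length")
```

The length column is a parameter so the same function could be reused for a table that stores length under a different name, such as `virus_length`.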

In [5]:
sag_viral_genes

Unnamed: 0,vir_id,sag,contig_length,provirus,proviral_length,gene_count,viral_genes,host_genes,checkv_quality,miuvig_quality,completeness,completeness_method,contamination,kmer_freq,warnings,classification,plate,depth,domain,phyla,class,order,family,genus,species
4,vir_AM-654-B04,AM-654-B04,128345,Yes,116455.0,155,84,8,Medium-quality,Genome-fragment,60.33,AAI-based (high-confidence),9.26,1.0,,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,AM-654,80,Bacteria,Bacteroidota,Bacteroidia,Flavobacteriales,BACL11,DUAL01,
22,vir_AM-654-E17,AM-654-E17,163891,No,,195,77,4,Medium-quality,Genome-fragment,88.22,AAI-based (medium-confidence),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-654,80,Bacteria,Proteobacteria,Alphaproteobacteria,Pelagibacterales,Pelagibacteraceae,AG-414-E02,
15,vir_AM-656-P04,AM-656-P04,39217,No,,45,28,1,Medium-quality,Genome-fragment,86.6,AAI-based (high-confidence),0.0,1.0,,d__Bacteria;p__Marinisomatota;c__Marinisomatia...,AM-656,95,Bacteria,Marinisomatota,Marinisomatia,Marinisomatales,TCS55,UBA2126,UBA2126 sp002730315
41,vir_AM-666-P13,AM-666-P13,129323,No,,178,25,19,Low-quality,Genome-fragment,47.53,HMM-based (lower-bound),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-666,400,Bacteria,Proteobacteria,Alphaproteobacteria,HIMB59,GCA-002718135,MarineAlpha5-Bin3,MarineAlpha5-Bin3 sp002938255
1,vir_AM-654-B17,AM-654-B17,172516,No,,226,25,8,Medium-quality,Genome-fragment,83.14,AAI-based (medium-confidence),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-654,80,Bacteria,Proteobacteria,Alphaproteobacteria,Pelagibacterales,Pelagibacteraceae,Pelagibacter,
3,vir_AM-654-C02,AM-654-C02,60274,Yes,40315.0,69,24,16,High-quality,High-quality,98.78,AAI-based (medium-confidence),33.11,1.0,,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-654,80,Bacteria,Proteobacteria,Alphaproteobacteria,Pelagibacterales,Pelagibacteraceae,Pelagibacter,
16,vir_AM-656-A07,AM-656-A07,83878,Yes,54319.0,86,24,30,Complete,High-quality,100.0,Provirus (high-confidence),35.24,1.0,contig >1.5x longer than expected genome length,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-656,95,Bacteria,Proteobacteria,Alphaproteobacteria,HIMB59,GCA-002718135,JAGFWP01,JAGFWP01 sp017639935
21,vir_AM-654-C23,AM-654-C23,108177,No,,114,20,5,Medium-quality,Genome-fragment,61.55,AAI-based (high-confidence),0.0,1.0,,d__Bacteria;p__Planctomycetota;c__Phycisphaera...,AM-654,80,Bacteria,Planctomycetota,Phycisphaerae,Phycisphaerales,SM1A02,GCA-002718515,GCA-002718515 sp018656645
26,vir_AM-662-D22,AM-662-D22,40167,No,,57,19,1,High-quality,High-quality,100.0,AAI-based (high-confidence),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,AM-662,140,Bacteria,Proteobacteria,Gammaproteobacteria,PS1,Thioglobaceae,,
25,vir_AM-654-D05,AM-654-D05,52398,No,,81,19,0,Medium-quality,Genome-fragment,87.35,AAI-based (medium-confidence),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-654,80,Bacteria,Proteobacteria,Alphaproteobacteria,Pelagibacterales,Pelagibacteraceae,GCA-2704625,


### SAG Results

| row name | vir_id         | viral_genes | depth | order            | notes                                                    |
|----------|----------------|-------------|-------|------------------|----------------------------------------------------------|
| 4        | vir_AM-654-B04 | 84          | 80    | Flavobacteriales | highest num of viral genes overall                       |
| 15       | vir_AM-656-P04 | 28          | 95    | Marinisomatales  | highest num of viral genes for depth                     |
| 41       | vir_AM-666-P13 | 25          | 400   | HIMB59           | highest num of viral genes for depth                     |
| 26       | vir_AM-662-D22 | 19          | 140   | PS1              | highest num of viral genes for depth; completeness = 100 |
| 22       | vir_AM-654-E17 | 77          | 80    | Pelagibacterales | second highest num of viral genes overall                |
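Several rows in this table are picked as the "highest num of viral genes for depth". One way to compute that selection rather than reading it off the sorted output is a per-depth `idxmax`. The sketch below is illustrative only: it uses the `viral_genes` and `depth` values of six rows copied from the output above, not the full `sag_viral_genes` table.

```python
import pandas as pd

# viral_genes and depth for six rows, copied from the output above
toy = pd.DataFrame(
    {"viral_genes": [84, 77, 28, 24, 25, 19],
     "depth": [80, 80, 95, 95, 400, 140]},
    index=[4, 22, 15, 16, 41, 26],
)

# index label of the row with the most viral genes within each depth
top_per_depth = toy.groupby("depth")["viral_genes"].idxmax()
```

When two rows tie within a depth, `idxmax` returns the first one, so ties would still need a manual look.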



In [6]:
sag_candidates = sag_viral_genes.loc[[4, 15, 41, 26, 22]]
sag_candidates

Unnamed: 0,vir_id,sag,contig_length,provirus,proviral_length,gene_count,viral_genes,host_genes,checkv_quality,miuvig_quality,completeness,completeness_method,contamination,kmer_freq,warnings,classification,plate,depth,domain,phyla,class,order,family,genus,species
4,vir_AM-654-B04,AM-654-B04,128345,Yes,116455.0,155,84,8,Medium-quality,Genome-fragment,60.33,AAI-based (high-confidence),9.26,1.0,,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,AM-654,80,Bacteria,Bacteroidota,Bacteroidia,Flavobacteriales,BACL11,DUAL01,
15,vir_AM-656-P04,AM-656-P04,39217,No,,45,28,1,Medium-quality,Genome-fragment,86.6,AAI-based (high-confidence),0.0,1.0,,d__Bacteria;p__Marinisomatota;c__Marinisomatia...,AM-656,95,Bacteria,Marinisomatota,Marinisomatia,Marinisomatales,TCS55,UBA2126,UBA2126 sp002730315
41,vir_AM-666-P13,AM-666-P13,129323,No,,178,25,19,Low-quality,Genome-fragment,47.53,HMM-based (lower-bound),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-666,400,Bacteria,Proteobacteria,Alphaproteobacteria,HIMB59,GCA-002718135,MarineAlpha5-Bin3,MarineAlpha5-Bin3 sp002938255
26,vir_AM-662-D22,AM-662-D22,40167,No,,57,19,1,High-quality,High-quality,100.0,AAI-based (high-confidence),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,AM-662,140,Bacteria,Proteobacteria,Gammaproteobacteria,PS1,Thioglobaceae,,
22,vir_AM-654-E17,AM-654-E17,163891,No,,195,77,4,Medium-quality,Genome-fragment,88.22,AAI-based (medium-confidence),0.0,1.0,,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,AM-654,80,Bacteria,Proteobacteria,Alphaproteobacteria,Pelagibacterales,Pelagibacteraceae,AG-414-E02,


## vMAGs

In [2]:
len(vMAGs)

132

### Criteria: Taxonomy

All four orders that the SAG and vMAG datasets have in common contain more vMAGs than SAGs.

In [3]:
vMAG_abund_list = ['Pirellulales', 'Pseudomonadales', 'Brocadiales', 'Verrucomicrobiales']
vMAG_abund = vMAGs[vMAGs['order'].isin(vMAG_abund_list)]
len(vMAG_abund)

34

### Criteria: Length

In [5]:
# keep vMAGs whose virus_length is at least 15000
long_vMAG = vMAG_abund[vMAG_abund['virus_length'] >= 15000]
len(long_vMAG)

22

### Criteria: Number of Viral Genes

In [7]:
vMAG_viral_genes = long_vMAG[long_vMAG['viral_genes'] != 0] # none are 0
vMAG_viral_genes = vMAG_viral_genes.sort_values(by='viral_genes',ascending=False)
len(vMAG_viral_genes)

22
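The `!= 0` filter used for this criterion also keeps rows whose `viral_genes` value is missing, because a missing value compares as not equal to zero. Whether either table actually contains missing values in that column is not shown here, so the sketch below only illustrates the difference on made-up values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"viral_genes": [12, 0, np.nan]})  # hypothetical values

kept_ne = toy[toy["viral_genes"] != 0]  # NaN != 0 is True, so the NaN row is kept
kept_gt = toy[toy["viral_genes"] > 0]   # NaN > 0 is False, so the NaN row is dropped
```

If missing values should be excluded, `> 0` is the stricter of the two comparisons.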

### vMAG Results

| row name | virus_name     | viral_genes | depth | order            | notes                                                    |
|----------|----------------|-------------|-------|------------------|----------------------------------------------------------|
| 21       | vMAG_29        | 59          | 400   | Pirellulales     | highest num of viral genes overall                       |
| 48       | vMAG_31        | 12          | 140   | Pirellulales     | highest num of viral genes for depth                     |
| 44       | vMAG_31        | 12          | 95    | Brocadiales      | highest num of viral genes for depth                     |
| 2        | vMAG_32        | 22          | 400   | Pseudomonadales  | second highest num of viral genes overall                |
| 127      | vMAG_44        | 11          | 140   | Pirellulales     |                                                          |



In [14]:
vMAG_candidates = vMAG_viral_genes.loc[[21, 48, 44, 2, 127]]
vMAG_candidates

Unnamed: 0,virus_name,virus_length,virus_read_count,virus_read_depth,virus_read_depth_in_host,host_name,host_length,host_read_count,host_read_depth,intra_read_count,intra_linkage_density,inter_read_count,raw_inter_linkage_density,raw_inter_vs_intra_ratio,viral_copies_per_cell,adjusted_inter_linkage_density,adjusted_inter_vs_intra_ratio,sample_name,virus_type,sample_depth,classification,fastani_reference,fastani_reference_radius,fastani_taxonomy,fastani_ani,fastani_af,closest_placement_reference,closest_placement_radius,closest_placement_taxonomy,closest_placement_ani,closest_placement_af,pplacer_taxonomy,classification_method,note,"other_related_references(genome_id,species_name,radius,ANI,AF)",msa_percent,translation_table,red_value,warnings_x,domain,phyla,class,order,family,genus,species,contig_id,contig_length,provirus,proviral_length,gene_count,viral_genes,host_genes,checkv_quality,miuvig_quality,completeness,completeness_method,contamination,kmer_freq,warnings_y,N,L
21,vMAG_29,51030,32321,633.372526,564.179287,bin_1,10194662,6580708,645.505265,2228123,0.043011,22178,0.042631,0.991156,0.874012,0.048776,1.13403,JV119,vMAG,400,d__Bacteria;p__Planctomycetota;c__Planctomycet...,GCA_002685655.1,95.0,d__Bacteria;p__Planctomycetota;c__Planctomycet...,95.46,0.757,GCA_002685655.1,95.0,d__Bacteria;p__Planctomycetota;c__Planctomycet...,95.46,0.757,d__Bacteria;p__Planctomycetota;c__Planctomycet...,taxonomic classification defined by topology a...,topological placement and ANI have congruent s...,,82.13,11.0,,,Bacteria,Planctomycetota,Planctomycetia,Pirellulales,Pirellulaceae,ARS98,ARS98 sp002685655,vMAG_29|N=3|L=67397,67797,No,,70,59,0,Low-quality,Genome-fragment,35.89,AAI-based (high-confidence),0.0,1.0,,3,67397
48,vMAG_31,22679,11504,507.253406,427.711768,bin_1,9896228,4482963,452.997142,1398908,0.028659,7313,0.032584,1.136935,0.944182,0.03451,1.204148,JV154,vMAG,140,d__Bacteria;p__Planctomycetota;c__Planctomycet...,GCA_002685655.1,95.0,d__Bacteria;p__Planctomycetota;c__Planctomycet...,95.54,0.735,GCA_002685655.1,95.0,d__Bacteria;p__Planctomycetota;c__Planctomycet...,95.54,0.735,d__Bacteria;p__Planctomycetota;c__Planctomycet...,taxonomic classification defined by topology a...,topological placement and ANI have congruent s...,,82.92,11.0,,,Bacteria,Planctomycetota,Planctomycetia,Pirellulales,Pirellulaceae,ARS98,ARS98 sp002685655,vMAG_31|N=2|L=29948,30148,No,,30,12,1,Low-quality,Genome-fragment,16.44,AAI-based (medium-confidence),0.0,1.0,,2,29948
44,vMAG_31,55032,9477,172.208897,2.7173,bin_93,770815,6349,8.236736,326,0.001158,8,0.000189,0.162794,0.3299,0.000572,0.493464,JV121,vMAG,95,d__Bacteria;p__Planctomycetota;c__Brocadiia;o_...,,,,,,,,,,,d__Bacteria;p__Planctomycetota;c__Brocadiia;o_...,taxonomic classification defined by topology a...,classification based on placement in class-lev...,"GCA_024654465.1, s__Scalindua sp024654465, 95....",18.07,11.0,0.97387,Genome not assigned to closest species as it f...,Bacteria,Planctomycetota,Brocadiia,Brocadiales,Scalinduaceae,Scalindua,,vMAG_31|N=2|L=29948,30148,No,,30,12,1,Low-quality,Genome-fragment,16.44,AAI-based (medium-confidence),0.0,1.0,,2,29948
2,vMAG_32,21616,791,36.593264,0.741755,bin_23,3688193,23817,6.457634,357,5.3e-05,3,3.8e-05,0.703704,0.114865,0.000328,6.126361,JV119,vMAG,400,d__Bacteria;p__Pseudomonadota;c__Gammaproteoba...,,,,,,,,,,,d__Bacteria;p__Pseudomonadota;c__Gammaproteoba...,taxonomic classification fully defined by topo...,classification based on placement in class-lev...,,48.26,11.0,0.75999,,Bacteria,Pseudomonadota,Gammaproteobacteria,Pseudomonadales,HTCC2089,,,vMAG_32|N=2|L=59685,59885,No,,32,22,0,Low-quality,Genome-fragment,32.9,AAI-based (high-confidence),0.0,1.0,,2,59685
127,vMAG_44,67417,9735,144.39978,98.904902,bin_1,9896228,4482963,452.997142,1398908,0.028659,1437,0.002154,0.075154,0.218334,0.009865,0.344214,JV154,vMAG,140,d__Bacteria;p__Planctomycetota;c__Planctomycet...,GCA_002685655.1,95.0,d__Bacteria;p__Planctomycetota;c__Planctomycet...,95.54,0.735,GCA_002685655.1,95.0,d__Bacteria;p__Planctomycetota;c__Planctomycet...,95.54,0.735,d__Bacteria;p__Planctomycetota;c__Planctomycet...,taxonomic classification defined by topology a...,topological placement and ANI have congruent s...,,82.92,11.0,,,Bacteria,Planctomycetota,Planctomycetia,Pirellulales,Pirellulaceae,ARS98,ARS98 sp002685655,vMAG_44|N=3|L=67417,67817,No,,92,11,1,High-quality,High-quality,98.55,AAI-based (medium-confidence),0.0,1.01,,3,67417
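The introduction lists the strength of the connection as a criterion, and the candidate table above carries linkage columns that could serve as a proxy for it. The sketch below ranks the five candidates by `adjusted_inter_vs_intra_ratio`, with values copied from the output above. Treating this column as the measure of connection strength is an assumption, not something the notebook states.

```python
import pandas as pd

# adjusted_inter_vs_intra_ratio values copied from the candidate table above
linkage = pd.DataFrame(
    {
        "virus_name": ["vMAG_29", "vMAG_31", "vMAG_31", "vMAG_32", "vMAG_44"],
        "host_name": ["bin_1", "bin_1", "bin_93", "bin_23", "bin_1"],
        "adjusted_inter_vs_intra_ratio": [1.13403, 1.204148, 0.493464,
                                          6.126361, 0.344214],
    },
    index=[21, 48, 44, 2, 127],
)

# strongest apparent virus-host linkage first
ranked = linkage.sort_values("adjusted_inter_vs_intra_ratio", ascending=False)
```

Row 2 ranks first on this ratio, but its `inter_read_count` in the table above is only 3, so a ratio-based ranking would probably need a minimum read-count threshold alongside it.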
