# 08_interesting_virus_host_pairs

This notebook explores which virus-host pairs are the most interesting, based on three criteria: the strength of the virus-host connection, the taxonomy of the host (biogeochemical significance), and the quality of the viral sequence or viral MAG (vMAG).

## Load packages and data

In [2]:
import pandas as pd
import os
import sys
import csv
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import glob
import seaborn as sns
from collections import Counter

df = pd.read_csv('~/Documents/Bigelow/Virus_Project/OMZ_MH_Analysis/Data/all_associations_gtdb.csv')

vMAG_119 = pd.read_csv("~/Documents/Bigelow/Virus_Project/OMZ_MH_Analysis/Data/proximeta_viral_files/jv-119_874814_Viral_Files/viral_MAGs/viral_mags_summary.tsv", sep = '\t')
vMAG_121 = pd.read_csv("~/Documents/Bigelow/Virus_Project/OMZ_MH_Analysis/Data/proximeta_viral_files/jv-121_874818_Viral_Files/viral_MAGs/viral_mags_summary.tsv", sep = '\t')
vMAG_132 = pd.read_csv("~/Documents/Bigelow/Virus_Project/OMZ_MH_Analysis/Data/proximeta_viral_files/jv-132_874826_Viral_Files/viral_MAGs/viral_mags_summary.tsv", sep = '\t')
vMAG_154 = pd.read_csv("~/Documents/Bigelow/Virus_Project/OMZ_MH_Analysis/Data/proximeta_viral_files/jv-154_874822_Viral_Files/viral_MAGs/viral_mags_summary.tsv", sep = '\t')

vMAGs = pd.concat([vMAG_119,vMAG_121,vMAG_132,vMAG_154])
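By default `pd.concat` keeps each file's original row index, and vMAG IDs such as `vMAG_7` look like they are numbered within each sample's summary file, so rows from different samples may be hard to tell apart once combined. A minimal sketch of one way to handle this is below. The helper name `combine_vmag_summaries`, the added `sample_name` column, and the sample labels are assumptions for illustration, and the toy frames stand in for the real `viral_mags_summary.tsv` tables.

```python
import pandas as pd

def combine_vmag_summaries(frames):
    """Concatenate per-sample vMAG summaries, tagging each row with its sample.

    `frames` maps a sample label (hypothetical, e.g. 'jv-119') to that
    sample's summary DataFrame. The `sample_name` column is added here and
    is not part of the original summary files.
    """
    tagged = [frame.assign(sample_name=label) for label, frame in frames.items()]
    # ignore_index=True gives the combined table a fresh 0..n-1 index
    return pd.concat(tagged, ignore_index=True)

# Toy stand-ins for two per-sample summary tables
demo_frames = {
    'jv-119': pd.DataFrame({'contig_id': ['vMAG_1', 'vMAG_2'], 'viral_genes': [4, 7]}),
    'jv-121': pd.DataFrame({'contig_id': ['vMAG_1'], 'viral_genes': [11]}),
}
combined = combine_vmag_summaries(demo_frames)
print(combined)
```

With the real data, the dictionary values would be the `vMAG_119`, `vMAG_121`, `vMAG_132`, and `vMAG_154` frames loaded above.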

In [3]:
vMAGs

Unnamed: 0,contig_id,contig_length,provirus,proviral_length,gene_count,viral_genes,host_genes,checkv_quality,miuvig_quality,completeness,completeness_method,contamination,kmer_freq,warnings
0,vMAG_7|N=2|L=36972,37172,No,,48,4,0,Medium-quality,Genome-fragment,66.90,AAI-based (medium-confidence),0.0,1.00,
1,vMAG_32|N=2|L=21616,21816,No,,37,7,2,Low-quality,Genome-fragment,5.53,AAI-based (medium-confidence),0.0,1.00,
2,vMAG_12|N=2|L=30847,31047,No,,52,7,4,Medium-quality,Genome-fragment,70.44,AAI-based (medium-confidence),0.0,1.00,
3,vMAG_36|N=2|L=13927,14127,No,,16,11,0,Low-quality,Genome-fragment,3.68,AAI-based (medium-confidence),0.0,1.03,
4,vMAG_1|N=10|L=116593,118393,No,,156,53,9,Medium-quality,Genome-fragment,54.91,AAI-based (medium-confidence),0.0,1.00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43,vMAG_45|N=2|L=14520,14720,No,,15,6,0,Low-quality,Genome-fragment,36.45,AAI-based (medium-confidence),0.0,1.00,
44,vMAG_3|N=2|L=33980,34180,No,,28,2,0,Medium-quality,Genome-fragment,57.21,AAI-based (high-confidence),0.0,1.00,
45,vMAG_27|N=2|L=33970,34170,No,,50,11,2,Medium-quality,Genome-fragment,75.07,AAI-based (medium-confidence),0.0,1.00,
46,vMAG_17|N=3|L=25751,26151,No,,27,2,1,Low-quality,Genome-fragment,24.50,HMM-based (lower-bound),0.0,1.00,


In [3]:
df.columns

Index(['virus_name', 'virus_length', 'virus_read_count', 'virus_read_depth',
       'virus_read_depth_in_host', 'host_name', 'host_length',
       'host_read_count', 'host_read_depth', 'intra_read_count',
       'intra_linkage_density', 'inter_read_count',
       'raw_inter_linkage_density', 'raw_inter_vs_intra_ratio',
       'viral_copies_per_cell', 'adjusted_inter_linkage_density',
       'adjusted_inter_vs_intra_ratio', 'sample_name', 'virus_type',
       'sample_depth', 'classification', 'fastani_reference',
       'fastani_reference_radius', 'fastani_taxonomy', 'fastani_ani',
       'fastani_af', 'closest_placement_reference', 'closest_placement_radius',
       'closest_placement_taxonomy', 'closest_placement_ani',
       'closest_placement_af', 'pplacer_taxonomy', 'classification_method',
       'note',
       'other_related_references(genome_id,species_name,radius,ANI,AF)',
       'phyla', 'class', 'order', 'family', 'genus', 'species'],
      dtype='object')

## Criteria: Interesting Taxonomy

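To do: decide how to determine which host taxa count as interesting, possibly through background reading on biogeochemically significant lineages.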
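One data-driven starting point, offered only as a sketch, is to tally how many associations fall into each host lineage before deciding which lineages matter. The helper name `host_taxon_counts` and the toy taxon labels are hypothetical. The `phyla` column name comes from `df.columns` above, and treating missing values as `Unclassified` is an assumption.

```python
import pandas as pd

def host_taxon_counts(associations, rank='phyla'):
    """Count associations per host taxon at the given rank.

    Missing taxonomy is labelled 'Unclassified' (an assumption) so that
    unclassified hosts still appear in the tally.
    """
    labels = associations[rank].fillna('Unclassified')
    return labels.value_counts()

# Toy table with placeholder taxon names, not real results
toy_associations = pd.DataFrame({
    'host_name': ['bin_1', 'bin_2', 'bin_3', 'bin_4'],
    'phyla': ['p__TaxonA', 'p__TaxonA', None, 'p__TaxonB'],
})
counts = host_taxon_counts(toy_associations)
print(counts)
```

The same call with `rank='class'` or `rank='order'` would give a finer breakdown.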

## Criteria: Quality

In [6]:
# Virus length
long_length = df[df['virus_length'] >= 1500]
len(long_length)

336

In [None]:
# Number of identified viral genes
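One possible way to fill in this criterion, sketched under assumptions, is to filter the vMAG summary on its `viral_genes` column and, optionally, on `checkv_quality`. The helper name `filter_vmags` and the thresholds (`min_viral_genes=5`, excluding `Low-quality`) are placeholders, not values chosen in this analysis. The toy rows reuse the first four rows of the `vMAGs` output above.

```python
import pandas as pd

def filter_vmags(vmags, min_viral_genes=5, exclude_quality=('Low-quality',)):
    """Keep vMAGs with enough viral genes and an acceptable CheckV quality tier.

    Both thresholds are illustrative placeholders.
    """
    enough_genes = vmags['viral_genes'] >= min_viral_genes
    acceptable = ~vmags['checkv_quality'].isin(list(exclude_quality))
    return vmags[enough_genes & acceptable]

# First four rows of the vMAG summary shown earlier (subset of columns)
toy_vmags = pd.DataFrame({
    'contig_id': ['vMAG_7|N=2|L=36972', 'vMAG_32|N=2|L=21616',
                  'vMAG_12|N=2|L=30847', 'vMAG_36|N=2|L=13927'],
    'viral_genes': [4, 7, 7, 11],
    'checkv_quality': ['Medium-quality', 'Low-quality',
                       'Medium-quality', 'Low-quality'],
})
kept = filter_vmags(toy_vmags)
print(kept['contig_id'].tolist())
```

Passing `exclude_quality=()` applies the gene-count filter alone.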


## Connection Strength

Proposed criterion: adjusted inter-linkage density (`adjusted_inter_linkage_density`) >= 0.05 reads/kbp^2

In [19]:
inter = df[df['adjusted_inter_linkage_density'] >= 0.05]
print(len(inter), 'associations are >= 0.05, which is about', round(len(inter)/len(df)*100), '%.')

87 associations are >= 0.05, which is about 26 %.


In [23]:
ratio = df[df['adjusted_inter_vs_intra_ratio'] >= 2.5]
print(len(ratio), 'associations are >= 2.5, which is about', round(len(ratio)/len(df)*100), '%.')

88 associations are >= 2.5, which is about 26 %.


In [31]:
both = df[(df['adjusted_inter_linkage_density'] >= 0.05) & (df['adjusted_inter_vs_intra_ratio'] >= 2.5)]
both

Unnamed: 0,virus_name,virus_length,virus_read_count,virus_read_depth,virus_read_depth_in_host,host_name,host_length,host_read_count,host_read_depth,intra_read_count,...,translation_table,red_value,warnings,domain,phyla,class,order,family,genus,species
209,k141_3298423,5405,231,42.738205,15.114731,bin_110,155303,4643,29.896396,142,...,,,No bacterial or archaeal marker,Unclassified,,,,,,
286,k141_2834672,16753,366,21.846833,8.516562,bin_135,110072,2484,22.567047,106,...,,,No bacterial or archaeal marker,Unclassified,,,,,,


Only 2 associations have both `adjusted_inter_linkage_density` >= 0.05 and `adjusted_inter_vs_intra_ratio` >= 2.5, so I recommend choosing one of these criteria rather than requiring both.
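To help choose between the two criteria, it may be useful to see how the associations split across them. The sketch below counts associations that pass only the density threshold, only the ratio threshold, both, or either. The helper name `summarize_criteria` is hypothetical and the toy values are made up for illustration. The column names and default thresholds are the ones used in the cells above.

```python
import pandas as pd

def summarize_criteria(associations, density_min=0.05, ratio_min=2.5):
    """Count associations passing each connection-strength criterion."""
    dense = associations['adjusted_inter_linkage_density'] >= density_min
    high_ratio = associations['adjusted_inter_vs_intra_ratio'] >= ratio_min
    return {
        'density_only': int((dense & ~high_ratio).sum()),
        'ratio_only': int((high_ratio & ~dense).sum()),
        'both': int((dense & high_ratio).sum()),
        'either': int((dense | high_ratio).sum()),
    }

# Toy values, not taken from the real association table
toy = pd.DataFrame({
    'adjusted_inter_linkage_density': [0.10, 0.01, 0.06, 0.00],
    'adjusted_inter_vs_intra_ratio': [1.0, 3.0, 2.5, 0.2],
})
summary = summarize_criteria(toy)
print(summary)
```

Running `summarize_criteria(df)` on the real table would show how large the `either` set is compared with the two associations in `both`.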