# Merged Metadata Analysis

In this analysis we merge the following data sources:
- Prophage counts
- GenBank data
- RAST data
- GTDB data
- CheckV predictions

Merging these sources lets us filter on different fields and identify interesting characters.
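As a rough illustration of that goal, the sketch below merges two toy tables on a shared `assembly_accession` key and then filters the result. The accessions, the `phylum` column, and all values are made-up placeholders and are not taken from the real data; only the `assembly_accession` and `Kept` column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for the prophage counts (values are invented for illustration)
phages = pd.DataFrame({
    "assembly_accession": ["GCA_000000001.1", "GCA_000000002.1", "GCA_000000003.1"],
    "Kept": [2, 0, 5],
})

# Toy stand-in for one of the metadata sources (hypothetical `phylum` column)
taxonomy = pd.DataFrame({
    "assembly_accession": ["GCA_000000001.1", "GCA_000000003.1"],
    "phylum": ["Firmicutes", "Proteobacteria"],
})

# A left join keeps every genome, even those with no matching metadata row
merged = phages.merge(taxonomy, on="assembly_accession", how="left")

# Filter on columns that came from different sources
with_phages = merged[(merged["Kept"] > 0) & merged["phylum"].notna()]
print(with_phages)
```

A left join is used here so that genomes without metadata remain visible as missing values rather than silently disappearing; an inner join would be the alternative if only fully annotated genomes are wanted.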

In [1]:
import os
import sys
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd
import seaborn as sns
import numpy as np

import math
import re

from PhiSpyAnalysis import theils_u, DateConverter, file_to_accession

from scipy.stats import pearsonr, f_oneway
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, LeaveOneGroupOut
from sklearn import metrics

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd, tukeyhsd, MultiComparison
from statsmodels.multivariate.manova import MANOVA
from sklearn import decomposition
from sklearn.ensemble import RandomForestClassifier

# for parsing collection dates
from dateutil.parser import parse, ParserError
import pytz

import subprocess
import gzip

# this is a neat trick for getting markdown in our output
# see https://stackoverflow.com/questions/23271575/printing-bold-colored-etc-text-in-ipython-qtconsole
# for the inspiration
from IPython.display import Markdown, display
def printmd(string, color="black"):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))

# Read the phage data. Check the version!

We have two data sets: the `small` set contains just 99 genomes and 1,561 phages and should run quickly for development, while the full set (`use_small_data = False`) contains all the data.

Here we also reduce each file name to its accession and store it in a column named `assembly_accession`, so that every data set can be merged on the same key.
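The actual conversion is done by `file_to_accession` from `PhiSpyAnalysis`, whose implementation is not shown here. The sketch below is a hypothetical stand-in, assuming that file names start with a GenBank or RefSeq assembly accession of the form `GCA_<digits>.<version>` or `GCF_<digits>.<version>`; the function name `accession_from_filename` is invented for this example.

```python
import os
import re

# Assumption: the accession is the leading GCA_/GCF_ token with its version suffix
ACCESSION_RE = re.compile(r"^(GC[AF]_\d+\.\d+)")

def accession_from_filename(filename: str) -> str:
    """Return the assembly accession at the start of a genome file name."""
    # Drop any directory part so only the file name itself is matched
    match = ACCESSION_RE.match(os.path.basename(filename))
    if match is None:
        raise ValueError(f"No assembly accession found in {filename!r}")
    return match.group(1)

print(accession_from_filename("GCA_000003135.1_ASM313v1_genomic.gbff.gz"))
# prints: GCA_000003135.1
```

Raising on file names that do not match makes unexpected inputs visible early, instead of producing a key that quietly fails to merge later.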

In [9]:
use_small_data = False

In [10]:
# Choose the data directory once so the file name is only written in one place
data_dir = "../small_data" if use_small_data else "../data"
phagesdf = pd.read_csv(f"{data_dir}/phages_per_genome.tsv.gz", compression='gzip', header=0, delimiter="\t")
phagesdf['assembly_accession'] = phagesdf['Contig'].apply(file_to_accession)
phagesdf

Unnamed: 0,Contig,Genome length,Contigs,Phage Contigs,Total Predicted Prophages,Kept,No phage genes,Not enough genes,bp prophage,assembly_accession
0,GCA_000003135.1_ASM313v1_genomic.gbff.gz,2396359,114,10,16,2,1,13,48916,GCA_000003135.1
1,GCA_000003645.1_ASM364v1_genomic.gbff.gz,5269725,1,1,31,1,10,20,40297,GCA_000003645.1
2,GCA_000003925.1_ASM392v1_genomic.gbff.gz,5561906,1,1,38,6,13,19,268081,GCA_000003925.1
3,GCA_000003955.1_ASM395v1_genomic.gbff.gz,5790501,1,1,46,6,11,29,166286,GCA_000003955.1
4,GCA_000005825.2_ASM582v2_genomic.gbff.gz,4249248,3,3,33,3,9,21,93416,GCA_000005825.2
...,...,...,...,...,...,...,...,...,...,...
553077,GCA_902860175.1_LMG_5997_genomic.gbff.gz,7197255,38,21,33,2,14,17,69051,GCA_902860175.1
553078,GCA_902860185.1_LMG_6103_genomic.gbff.gz,6497464,13,8,22,0,10,12,0,GCA_902860185.1
553079,GCA_902860195.1_LMG_7053_genomic.gbff.gz,6702936,200,148,33,1,11,21,12819,GCA_902860195.1
553080,GCA_902860205.1_LMG_6001_genomic.gbff.gz,6320373,36,19,35,2,21,12,41572,GCA_902860205.1


In [11]:
githash = subprocess.check_output(["git", "describe", "--always"]).strip().decode()
print(f"Please note that this was run with git commit {githash} that has {phagesdf.shape[0]:,} genomes parsed and {phagesdf['Total Predicted Prophages'].sum():,} total prophages")

Please note that this was run with git commit 71da118 that has 553,082 genomes parsed and 20,946,107 total prophages


In [8]:
# Choose the data directory once so the file name is only written in one place
data_dir = "../small_data" if use_small_data else "../data"
metadf = pd.read_csv(f"{data_dir}/patric_genome_metadata.tsv.gz", compression='gzip', header=0, delimiter="\t")
dc = DateConverter()
metadf['isolation_date'] = metadf.collection_date.apply(dc.convert_date)
metadf

Unnamed: 0,genome_id,genome_name,organism_name,taxon_id,genome_status,strain,serovar,biovar,pathovar,mlst,...,sporulation,temperature_range,optimal_temperature,salinity,oxygen_requirement,habitat,disease,comments,additional_metadata,isolation_date
0,469009.4,"""'Brassica napus' phytoplasma strain TW1""",,469009,WGS,TW1,,,,,...,,,,,,,,Genome sequence of a strain of bacteria that c...,sample_type:metagenomic assembly;collected_by:...,2017.577687
1,1309411.5,"""'Deinococcus soli' Cha et al. 2014 strain N5""",,1309411,Complete,N5,,,,,...,,,,,,,,Genome sequencing of a Gamma-Radiation-Resista...,sample_type:bacterial,2013.260096
2,1123738.3,"""'Echinacea purpurea' witches'-broom phytoplas...",,1123738,WGS,NCHU2014,,,,,...,,,C,,,,,'Echinacea purpurea' witches'-broom phytoplasm...,lab_host:Catharanthus roseus,2014.371663
3,551115.6,"""'Nostoc azollae' 0708""",'Nostoc azollae' 0708,551115,Complete,708,,,,,...,,Mesophilic,-,,Aerobic,Multiple,,"Nostoc azollae 0708. Nostoc azollae 0708, also...",,
4,1856298.3,"""'Osedax' symbiont bacterium Rs2_46_30_T18 str...",,1856298,WGS,Rs2_46_30_T18,,,,,...,,,,,,,,"In this study, we simulate the Deepwater Horiz...",sample_type:metagenomic assembly,2013.525667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
433517,1131286.3,zeta proteobacterium SCGC AB-137-J06,zeta proteobacterium SCGC AB-137-J06,1131286,WGS,SCGC AB-137-J06,,,,,...,,,,,,,,Single cell genome sequencing of biomineralizi...,,
433518,1131287.3,zeta proteobacterium SCGC AB-602-C20,zeta proteobacterium SCGC AB-602-C20,1131287,WGS,SCGC AB-602-C20,,,,,...,,,,,,,,Single cell genome sequencing of biomineralizi...,,
433519,1131288.3,zeta proteobacterium SCGC AB-602-E04,zeta proteobacterium SCGC AB-602-E04,1131288,WGS,SCGC AB-602-E04,,,,,...,,,,,,,,Single cell genome sequencing of biomineralizi...,,
433520,1131289.3,zeta proteobacterium SCGC AB-604-B04,zeta proteobacterium SCGC AB-604-B04,1131289,WGS,SCGC AB-604-B04,,,,,...,,,,,,,,Single cell genome sequencing of biomineralizi...,,
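The `isolation_date` values in the table above (for example `2017.577687`) look like dates expressed as decimal years. The `DateConverter` class that produces them is not shown, so the sketch below is only one plausible way to compute such a value, assuming the date has already been parsed into a `datetime`; the function name `to_decimal_year` is hypothetical.

```python
from datetime import datetime

def to_decimal_year(when: datetime) -> float:
    """Express a datetime as the year plus the elapsed fraction of that year."""
    start = datetime(when.year, 1, 1)
    end = datetime(when.year + 1, 1, 1)
    # Dividing two timedeltas yields a float, and leap years are handled
    # automatically because `end - start` is 366 days in those years
    return when.year + (when - start) / (end - start)

print(to_decimal_year(datetime(2019, 7, 2, 12)))
# prints: 2019.5
```

Storing dates as a single float makes them easy to plot and to use in regressions, at the cost of losing the original precision of the `collection_date` string.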
