# PhiSpy

This is a Jupyter Notebook that shows how to run PhiSpy manually. You can run through all the steps that PhiSpy takes to determine whether a genome contains a prophage, and inspect all of the data generated by PhiSpy.

You will need to install [PhiSpy](https://github.com/linsalrob/PhiSpy#installation), [Jupyter Notebooks](https://jupyter.org/install), and [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html).

<small>Note: PhiSpy does not normally use pandas, but we use it here to visualize the data!</small>

In [1]:
# set up the environment
import os
import sys
import gzip
from functools import reduce
import tempfile

import pandas as pd

from Bio import SeqIO

import PhiSpyModules

### Check the PhiSpy version

We recommend at least version 4.0.3, and preferably 4.1 or higher.

In [2]:
print("PhiSpy version: " + PhiSpyModules.__version__)

PhiSpy version: 4.1rc12


### Define your GenBank file here

You may need to set the full path to the file. The file can be gzip-compressed or uncompressed. You will of course know which yours is, but the next step demonstrates how PhiSpy detects a compressed file.

**This should be the only line you need to change to run PhiSpy completely!**

In [3]:
genbankfile = "../test_genbank_files/Yersinia_pestis_KIM.gb.gz"
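A common way to detect gzip compression, independent of the file extension, is to read the first two bytes and compare them with the gzip magic number (`1f 8b`). The sketch below illustrates that idea with the standard library only. The helper names `looks_like_gzip` and `open_text` are hypothetical, and this is not necessarily how `PhiSpyModules.is_gzip_file` is implemented.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    if looks_like_gzip(path):
        return gzip.open(path, "rt")
    return open(path, "r")


# Demonstrate on two throwaway files: one plain, one compressed.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "example.gb")
gz_path = os.path.join(tmpdir, "example.gb.gz")
with open(plain_path, "w") as out:
    out.write("LOCUS       example\n")
with gzip.open(gz_path, "wt") as out:
    out.write("LOCUS       example\n")

for path in (plain_path, gz_path):
    with open_text(path) as handle:
        first_line = handle.readline().strip()
    print(os.path.basename(path), looks_like_gzip(path), first_line)
```

Checking the content rather than the `.gz` suffix means a misnamed file is still opened correctly.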

### Parse the file

We use Biopython to parse the file, but we also add a few methods to the standard Biopython object to ease parsing. We also merge or split compound features (those with more than one location along the chromosome) so that they are handled appropriately.

Our `record` object is an extended `SeqIO.parse` object.

In [4]:
min_contig_size = 1000

if PhiSpyModules.is_gzip_file(genbankfile):
    handle = gzip.open(genbankfile, 'rt')
else:
    handle = open(genbankfile, 'r')

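# keep only contigs longer than min_contig_size
record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > min_contig_size, SeqIO.parse(handle, "genbank")))
handle.close()

# we check to make sure there are some contigs left to process
# (the accumulator is called `count` so that it does not shadow the built-in `sum`)
ncontigs = reduce(lambda count, element: count + 1, record, 0)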
print(f"There are {ncontigs} contigs to predict prophages on!")

There are 6 contigs to predict prophages on!


## Define the parameters that we will use

These are normally provided as command line options, but for Jupyter we set them here.

Parameter | Meaning | Options | Default value
--- | --- | --- | ---
kmers_type | What do we count kmers with? | `all`, `codon`, `simple` | `all`
window_size | How many consecutive ORFs to include? | an integer | 30
record | The `Bio.SeqIO` object with all the sequences | a `Bio.SeqIO` object | `record`
expand_slope | Whether to use the square of the slope of the Shannon scores | `True` or `False` | `False`
number | Number of consecutive genes in a region of window size that must be prophage genes | an integer | 5
nonprophage_genegaps | The number of non-phage genes between prophages | an integer | 10
quiet | Don't make additional outputs | `True` or `False` | `True`


*Note*: You can add an additional parameter, `make_training_data`, here (its actual value doesn't matter). It appends a column to the output for each ORF that contains `1` (`True`) if the ORF is thought or stated to be a phage gene, or `0` (`False`) otherwise.

In [5]:
parameters = {
    'kmers_type': "all",
    'window_size': 30,
    'record': record,
    'expand_slope': False,
    'training_set': "data/trainSet_genericAll.txt",
    'randomforest_trees': 5,
    'threads': 4,
    'quiet': True,
    'nonprophage_genegaps': 10,
    'number': 5,
    'color': True,
    'evaluate': False,
    'make_training_data': None,
    'skip_search': False,
    'phmms': None,
    'phage_genes': 2,
}
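Because the values are typed in by hand here rather than parsed from the command line, it can help to check them against the options in the table above before running the slow steps. The following is a minimal sketch: `check_parameters` is a hypothetical helper, not part of PhiSpy, and it only covers the options documented in the table.

```python
def check_parameters(params):
    """Return a list of problems found in `params` (hypothetical helper, not a PhiSpy function)."""
    problems = []
    if params.get("kmers_type") not in ("all", "codon", "simple"):
        problems.append("kmers_type must be one of: all, codon, simple")
    for key in ("window_size", "number", "nonprophage_genegaps"):
        value = params.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{key} must be an integer")
    for key in ("expand_slope", "quiet"):
        if not isinstance(params.get(key), bool):
            problems.append(f"{key} must be True or False")
    return problems


# A reduced copy of the dictionary above, limited to the documented options.
example = {
    "kmers_type": "all",
    "window_size": 30,
    "expand_slope": False,
    "number": 5,
    "nonprophage_genegaps": 10,
    "quiet": True,
}

print(check_parameters(example))
print(check_parameters({**example, "kmers_type": "kmer", "window_size": "30"}))
```

An empty list means that nothing in the documented options looks wrong.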

## Generate the test data

This is the step that actually does all the measurements!

In this example, we convert the output to a pandas dataframe for visualization and exploration.

In [6]:
parameters['test_data'] = PhiSpyModules.measure_features(**parameters)
# note that if you include make_training_data you will need to add an "is_phage" column here
test_df = pd.DataFrame(
    parameters['test_data'],
    columns=['orf_length_med', 'shannon_slope', 'at_skew', 'gc_skew', 'max_direction', 'phmms'],
)
test_df.head()

Unnamed: 0,orf_length_med,shannon_slope,at_skew,gc_skew,max_direction,phmms
0,-36.0,0.077197,0.801749,0.537848,7,0.0
1,-94.5,0.077016,0.820857,0.646716,7,0.0
2,-39.0,0.076645,0.818783,0.691991,7,0.0
3,-37.5,0.075699,0.795715,0.633846,7,0.0
4,-33.0,0.075065,0.900836,0.598348,7,0.0
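To give a feel for what columns such as `at_skew` and `gc_skew` describe, here is a sketch of the conventional skew statistic, (G - C) / (G + C), computed over windows of a sequence. This is the textbook definition only. PhiSpy calculates its features over windows of ORFs and its exact formulas may differ, so do not expect these numbers to match the table. The function names and the example sequence are made up for illustration.

```python
def skew(sequence, first, second):
    """Return (first - second) / (first + second) for two base counts, or 0.0 if neither occurs."""
    sequence = sequence.upper()
    a = sequence.count(first)
    b = sequence.count(second)
    return (a - b) / (a + b) if (a + b) else 0.0


def gc_skew(sequence):
    """Conventional GC skew: (G - C) / (G + C)."""
    return skew(sequence, "G", "C")


def at_skew(sequence):
    """Conventional AT skew: (A - T) / (A + T)."""
    return skew(sequence, "A", "T")


def windowed(sequence, size, step, func):
    """Apply `func` to consecutive windows of `sequence` of length `size`."""
    return [func(sequence[i:i + size]) for i in range(0, len(sequence) - size + 1, step)]


# A made-up sequence whose strand bias flips halfway along.
dna = "GGGGCATTTA" * 3 + "CCCCGTAAAT" * 3

print(windowed(dna, 10, 10, gc_skew))  # [0.6, 0.6, 0.6, -0.6, -0.6, -0.6]
print(windowed(dna, 10, 10, at_skew))  # [-0.2, -0.2, -0.2, 0.2, 0.2, 0.2]
```

The sign change in the middle of the output is the kind of signal that a skew-based feature can pick up.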


### Run the random forest

Here we run the random forest to identify the phages, and add its result for each ORF to our initial table as the `rank` column.

In [7]:
parameters['rfdata'] = PhiSpyModules.call_randomforest(**parameters)
parameters['initial_tbl'] = PhiSpyModules.make_initial_tbl(**parameters)
parameters['output_dir'] = tempfile.mkdtemp()
initial_table_df = pd.DataFrame(parameters['initial_tbl'], columns = ['gene id', 'function', 'contig', 'start', 'stop', 'position', 'rank', 'my status', 'pp'])
initial_table_df.head()

Unnamed: 0,gene id,function,contig,start,stop,position,rank,my status,pp
0,[2:368](+),Mobile element protein,NC_004839,3,368,0,0.2125,0,1.0
1,[664:1033](+),Plasmid conjugative transfer endonuclease,NC_004839,665,1033,1,0.0,0,0.0
2,[1170:1425](+),Replication regulatory protein repA2 (Protein ...,NC_004839,1171,1425,2,0.244444,0,0.0
3,[1467:1587](-),FIG01222423: hypothetical protein,NC_004839,1587,1468,3,0.252632,0,0.5
4,[1733:2600](+),DNA replication protein,NC_004839,1734,2600,4,0.26,0,0.0
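The `number` parameter above refers to consecutive prophage genes, so one way to build intuition for the `rank` column is to look for runs of neighbouring genes with high scores. The sketch below finds such runs in a plain list of scores. It is an illustration of run detection only, not PhiSpy's refinement algorithm, and the scores, the cutoff of 0.5, and the `find_runs` helper are all made up.

```python
def find_runs(flags, min_length):
    """Return (start, end) index pairs (end exclusive) for runs of True at least `min_length` long."""
    runs = []
    run_start = None
    for index, flag in enumerate(flags):
        if flag and run_start is None:
            run_start = index  # a new run begins here
        elif not flag and run_start is not None:
            if index - run_start >= min_length:
                runs.append((run_start, index))
            run_start = None
    # close a run that extends to the end of the list
    if run_start is not None and len(flags) - run_start >= min_length:
        runs.append((run_start, len(flags)))
    return runs


ranks = [0.1, 0.6, 0.7, 0.8, 0.9, 0.75, 0.2, 0.9, 0.1]  # made-up scores, one per gene
cutoff = 0.5                                            # hypothetical cutoff
flags = [rank > cutoff for rank in ranks]

print(find_runs(flags, 5))  # [(1, 6)]
```

With real data you could build `ranks` from `initial_table_df['rank']` one contig at a time.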


### Refine the predictions

Finally, we refine the predictions from the random forest and other metrics, and then predict the *att* sequences.

In [8]:
parameters['pp'] = PhiSpyModules.fixing_start_end(**parameters)
pp_df = pd.DataFrame.from_dict(parameters['pp']).transpose()
pp_df

Unnamed: 0,contig,start,stop,num genes,annotated_as_pp,phage_genes,att,atts
1,NC_004839,3,42351,39,False,11,"[1491, 1503, 39891, 39903, ATTTTTGGCTGG, ATTTT...",1491\t1503\t39891\t39903\tATTTTTGGCTGG\tATTTTT...
2,NC_004839,61369,70446,6,False,6,"[60751, 60763, 67104, 67116, AAATAAAAAAAT, AAA...",60751\t60763\t67104\t67116\tAAATAAAAAAAT\tAAAT...
3,NC_004838,4968,60134,39,False,15,"[3168, 3178, 58870, 58880, TAAACGTAAA, TTTACGT...",3168\t3178\t58870\t58880\tTAAACGTAAA\tTTTACGTT...
4,NC_004835,87,43700,50,False,12,"[1347, 1358, 41757, 41768, CATTCGCCACC, GGTGGC...",1347\t1358\t41757\t41768\tCATTCGCCACC\tGGTGGCG...
5,NC_004835,74658,88343,5,False,5,"[75243, 75255, 89094, 89106, CATTCGGGTATA, TAT...",75243\t75255\t89094\t89106\tCATTCGGGTATA\tTATA...
6,NC_004836,87,29743,34,False,5,"[2237, 2249, 29158, 29170, TCAACAACCGGT, ACCGG...",2237\t2249\t29158\t29170\tTCAACAACCGGT\tACCGGT...
7,NC_004836,48761,70502,12,False,12,"[49151, 49249, 67536, 67634, CGCCAGACATTCACGAC...",49151\t49249\t67536\t67634\tCGCCAGACATTCACGACT...
8,NC_004088,285945,305397,15,False,3,"[284448, 284459, 302596, 302607, CATCTGTTTGC, ...",284448\t284459\t302596\t302607\tCATCTGTTTGC\tG...
9,NC_004088,571205,603212,20,False,3,"[571699, 571711, 601754, 601766, GGACCGATGGGC,...",571699\t571711\t601754\t601766\tGGACCGATGGGC\t...
10,NC_004088,1131147,1146536,16,False,2,"[1132333, 1132346, 1145835, 1145848, TAGATAACG...",1132333\t1132346\t1145835\t1145848\tTAGATAACGA...


Our `pp_df` data frame has our final prophage predictions for this genome! 
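As a quick sanity check, you can summarise the predictions per contig. The sketch below does this in plain Python using the `contig`, `start`, and `stop` values copied from the table above. It assumes that the coordinates are 1-based and inclusive, so that a region's length is `stop - start + 1`; that is an assumption about the output, not something this notebook states.

```python
from collections import defaultdict

# (contig, start, stop) for each predicted prophage, copied from the table above
predictions = [
    ("NC_004839", 3, 42351),
    ("NC_004839", 61369, 70446),
    ("NC_004838", 4968, 60134),
    ("NC_004835", 87, 43700),
    ("NC_004835", 74658, 88343),
    ("NC_004836", 87, 29743),
    ("NC_004836", 48761, 70502),
    ("NC_004088", 285945, 305397),
    ("NC_004088", 571205, 603212),
    ("NC_004088", 1131147, 1146536),
]

summary = defaultdict(lambda: {"count": 0, "bases": 0})
for contig, start, stop in predictions:
    summary[contig]["count"] += 1
    summary[contig]["bases"] += stop - start + 1  # assumes 1-based inclusive coordinates

for contig, stats in summary.items():
    print(f"{contig}: {stats['count']} predicted prophage(s), {stats['bases']} bp in total")
```

The same summary is available from the data frame with `pp_df.groupby('contig')` if you prefer to stay in pandas.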

### Make the final table

Here we append the number of the prophage (pp) that each gene falls in to the table, so the data frame shows which genes lie in predicted prophage regions.

In [9]:
parameters['final_tbl'] = []
for i in parameters['initial_tbl']:
    my_fs = PhiSpyModules.evaluation.check_pp(i[2], i[3], i[4], parameters['pp'])
    parameters['final_tbl'].append(i + [my_fs])

# build the data frame once, after the loop has finished
final_df = pd.DataFrame(parameters['final_tbl'], columns = ['gene id', 'function', 'contig', 'start', 'stop', 'position', 'rank', 'my status', 'pp', 'final status'])
final_df.head()

Unnamed: 0,gene id,function,contig,start,stop,position,rank,my status,pp,final status
0,[2:368](+),Mobile element protein,NC_004839,3,368,0,0.2125,0,1.0,1
1,[664:1033](+),Plasmid conjugative transfer endonuclease,NC_004839,665,1033,1,0.0,0,0.0,1
2,[1170:1425](+),Replication regulatory protein repA2 (Protein ...,NC_004839,1171,1425,2,0.244444,0,0.0,1
3,[1467:1587](-),FIG01222423: hypothetical protein,NC_004839,1587,1468,3,0.252632,0,0.5,1
4,[1733:2600](+),DNA replication protein,NC_004839,1734,2600,4,0.26,0,0.0,1
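The lookup behind the `final status` column can be pictured as an interval test: for each gene, find the predicted region on the same contig that contains it. The sketch below shows one way to write such a test. `prophage_number` is a hypothetical function and is not the implementation of `PhiSpyModules.evaluation.check_pp`; returning `0` for genes outside every region is also an assumption. The two regions and the first two genes are taken from the tables above, and the remaining genes are made up.

```python
def prophage_number(contig, start, stop, prophages):
    """Return the number of the prophage that contains the gene, or 0 if none does."""
    # genes on the reverse strand are listed with start > stop, so normalise first
    low, high = min(start, stop), max(start, stop)
    for number, region in prophages.items():
        if region["contig"] == contig and region["start"] <= low and high <= region["stop"]:
            return number
    return 0


# The first two predictions from the prophage table, keyed by prophage number.
prophages = {
    1: {"contig": "NC_004839", "start": 3, "stop": 42351},
    2: {"contig": "NC_004839", "start": 61369, "stop": 70446},
}

genes = [
    ("NC_004839", 3, 368),        # from the table above
    ("NC_004839", 1587, 1468),    # from the table above, reverse strand
    ("NC_004839", 50000, 50500),  # made up: lies between the two regions
    ("NC_004088", 3, 368),        # made up: a contig with no region in this dictionary
]

for contig, start, stop in genes:
    print(contig, start, stop, prophage_number(contig, start, stop, prophages))
```

The first two genes are assigned to prophage 1, which agrees with the `final status` values shown above.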
