# Biological Pathways Notebook

In [None]:
import sys
import os
import pandas as pd
import numpy as np

sys.path.append("../.")
from phenolog.datasets.dataset import Dataset
from phenolog.datasets.pathways import Pathways

### Creating a dataset of descriptions and corresponding genes
The following cell creates a dataset object that holds the text descriptions, the associated genes, and the related data for each species of interest. The dataset object has a `describe` method that gives a high-level summary of its contents.

In [None]:
dataset = Dataset()
dataset.add_data(pd.read_csv("../data/reshaped/arabidopsis_phenotypes.csv", lineterminator="\n"))
dataset.add_data(pd.read_csv("../data/reshaped/maize_phenotypes.csv", lineterminator="\n"))
dataset.add_data(pd.read_csv("../data/reshaped/ppn_phenotypes.csv", lineterminator="\n"))
dataset.add_data(pd.read_csv("../data/reshaped/ppn_phenes.csv", lineterminator="\n"))
dataset.describe()

### Accessing information about biological pathways
An object containing the information about biological pathways relevant to this dataset can be generated as demonstrated in the following cells. Building this object requires two inputs: a dictionary that maps species names (here specified using the KEGG three-letter species codes) to files containing pathway data specific to each species, and a string that determines whether KEGG or the Plant Metabolic Network (PMN) is used as the source of the data. When KEGG is the source, the filenames in the dictionary are not used. When PMN is the source, they are used, and should be the PlantCyc data files available for each species. The pathways object has a `describe` method that summarizes its contents once it has been created.
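
Because the filenames only matter when PMN is the source, it can help to check them before building the object. The snippet below is a minimal sketch of such a check. `missing_pathway_files` is a hypothetical helper written for illustration and is not part of the `phenolog` package; it assumes only the two source strings used in this notebook, `"kegg"` and `"pmn"`.

```python
import os


def missing_pathway_files(species_dict, source):
    """Return the species codes whose pathway file is required but not found.

    Hypothetical helper, not part of phenolog. With source="kegg" the
    filenames are ignored, so nothing can be missing. With source="pmn"
    every value in the dictionary must be the path of an existing file.
    """
    if source not in ("kegg", "pmn"):
        raise ValueError("source must be 'kegg' or 'pmn'")
    if source == "kegg":
        return []
    return sorted(
        code for code, path in species_dict.items() if not os.path.isfile(path)
    )


# Illustrative input only: an empty string and a path that does not exist.
example = {"ath": "", "zma": "no/such/file"}
kegg_missing = missing_pathway_files(example, "kegg")
pmn_missing = missing_pathway_files(example, "pmn")
```

With this input, every species is reported as missing for PMN and none for KEGG, which matches the rule described above.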

In [None]:
# Species of interest and files related to each.
species_dict = {
    "ath":"../data/pathways/plantcyc/aracyc_pathways.20180702", 
    "zma":"../data/pathways/plantcyc/corncyc_pathways.20180702", 
    "mtr":"../data/pathways/plantcyc/mtruncatulacyc_pathways.20180702", 
    "osa":"../data/pathways/plantcyc/oryzacyc_pathways.20180702", 
    "gmx":"../data/pathways/plantcyc/soycyc_pathways.20180702",
    "sly":"../data/pathways/plantcyc/tomatocyc_pathways.20180702"}

# Building an object to contain information from the Plant Metabolic Network databases.
pathways = Pathways(species_dict, source="pmn")
pathways.describe()
pathways.write_to_csv("pmn_pathways.csv")

In [None]:
# For KEGG, the files aren't needed, only the species codes.
species_dict = {
    "ath":"", 
    "zma":"", 
    "mtr":"", 
    "osa":"", 
    "gmx":"",
    "sly":""}

# Building an object to contain information from the KEGG database.
pathways = Pathways(species_dict, source="kegg")
pathways.describe()
pathways.write_to_csv("kegg_pathways.csv")

### What biological pathways do the genes in this dataset belong to?
The primary function of the pathways object is to provide a dictionary that maps ID values to lists of pathway IDs, given a dictionary that maps those same ID values to gene objects. In the next cell, a dictionary mapping ID values to gene objects is obtained from the dataset object. The values of the resulting dictionary are then used to count how many of the genes in this dataset were mapped to at least one pathway in the pathways database for these species.
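
To make the shape of that mapping concrete, here is a small sketch on made-up data. The IDs and pathway names are hypothetical and do not come from KEGG or PMN, and the two helper functions are illustrative rather than part of the `Pathways` class. It shows the same counting done in the next cell, plus the inverse view (pathway ID to member IDs), which is often the more convenient form for grouping genes.

```python
# Hypothetical toy data: ID values mapped to lists of pathway IDs.
# This only mirrors the shape described above; it is not real output.
pathway_membership = {
    0: ["PWY-101", "PWY-202"],
    1: [],
    2: ["PWY-101"],
    3: [],
}


def summarize_membership(membership):
    """Count IDs with at least one pathway, IDs with none, and all IDs."""
    num_found = sum(1 for pathway_ids in membership.values() if pathway_ids)
    num_total = len(membership)
    return num_found, num_total - num_found, num_total


def invert_membership(membership):
    """Map each pathway ID to the sorted list of ID values assigned to it."""
    inverted = {}
    for record_id, pathway_ids in membership.items():
        for pathway_id in pathway_ids:
            inverted.setdefault(pathway_id, []).append(record_id)
    return {pathway_id: sorted(ids) for pathway_id, ids in inverted.items()}


num_found, num_missing, num_total = summarize_membership(pathway_membership)
pathway_to_ids = invert_membership(pathway_membership)
```

IDs with an empty list stay in the membership dictionary and count toward the total, but they never appear in the inverted mapping.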

In [None]:
genes = dataset.get_gene_dictionary()
pathway_membership = pathways.get_pathway_dict(genes)
num_found = len([x for x in pathway_membership.values() if len(x)>0])
num_missing = len([x for x in pathway_membership.values() if len(x)==0])
num_total = len(pathway_membership.values())

print("Number of genes mapped to at least one pathway: {}".format(num_found))
print("Number of genes mapped to no pathways:          {}".format(num_missing))
print("Number of total genes that were looked for:     {}".format(num_total))