## Extraction of Pathways (Reactome)

REACTOME is an open-source, open access, manually curated and peer-reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology and education. Founded in 2003, the Reactome project is led by Lincoln Stein of OICR, Peter D’Eustachio of NYULMC, Henning Hermjakob of EMBL-EBI, and Guanming Wu of OHSU. [Source](https://reactome.org/)

This notebook reads the pathway list from Reactome and parses it.

In [1]:
import json
import pandas as pd

#### Pathway Parsing

Each pathway has three components: an RID (its Reactome identifier), a name, and an associated species. Pathways whose species is `Homo sapiens` are also collected in a separate list.

In [2]:
DATA = []   # every pathway, all species
HUMAN = []  # pathways whose species is "Homo sapiens"
with open("ReactomePathways.txt", "r") as f1:
    for line in f1:
        # Each line holds three tab-separated fields: RID, name, species.
        # rstrip("\n") drops the trailing newline from the last field.
        RID, name, species = line.rstrip("\n").split("\t")[:3]

        record = {"RID": RID, "name": name, "species": species}
        DATA.append(record)
        if species == "Homo sapiens":
            HUMAN.append(record)

In [3]:
len(DATA)

20751

In [4]:
len(HUMAN)

2255
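
The parsing step in cell `In [2]` can also be sketched with the standard-library `csv` module, which handles the tab splitting and line endings. This is a minimal sketch, not the notebook's own method: the helper name `parse_pathways`, the policy of skipping short rows, and the two sample rows (copied from the outputs in this notebook, standing in for `ReactomePathways.txt`) are assumptions made for illustration.

```python
import csv
import io

def parse_pathways(lines):
    """Parse tab-separated (RID, name, species) rows into two lists of dicts."""
    all_rows, human_rows = [], []
    # QUOTE_NONE keeps any quote characters in pathway names as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # assumed policy: skip blank or malformed lines
        record = {"RID": row[0], "name": row[1], "species": row[2]}
        all_rows.append(record)
        if record["species"] == "Homo sapiens":
            human_rows.append(record)
    return all_rows, human_rows

# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO(
    "R-BTA-73843\t5-Phosphoribose 1-diphosphate biosynthesis\tBos taurus\n"
    "R-HSA-164843\t2-LTR circle formation\tHomo sapiens\n"
)
data, human = parse_pathways(sample)
print(len(data), len(human))  # 2 1
```

With the real file, the same function could be called on a handle opened as `open("ReactomePathways.txt", "r", newline="")`, which is the form the `csv` module documentation recommends.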

In [5]:
df = pd.DataFrame(DATA)
hdf = pd.DataFrame(HUMAN)

In [6]:
df.head()

|   | RID | name | species |
|---|-----|------|---------|
| 0 | R-BTA-73843 | 5-Phosphoribose 1-diphosphate biosynthesis | Bos taurus |
| 1 | R-BTA-1971475 | A tetrasaccharide linker sequence is required ... | Bos taurus |
| 2 | R-BTA-1369062 | ABC transporters in lipid homeostasis | Bos taurus |
| 3 | R-BTA-382556 | ABC-family proteins mediated transport | Bos taurus |
| 4 | R-BTA-9033807 | ABO blood group biosynthesis | Bos taurus |


In [7]:
hdf.head()

|   | RID | name | species |
|---|-----|------|---------|
| 0 | R-HSA-164843 | 2-LTR circle formation | Homo sapiens |
| 1 | R-HSA-73843 | 5-Phosphoribose 1-diphosphate biosynthesis | Homo sapiens |
| 2 | R-HSA-1971475 | A tetrasaccharide linker sequence is required ... | Homo sapiens |
| 3 | R-HSA-5619084 | ABC transporter disorders | Homo sapiens |
| 4 | R-HSA-1369062 | ABC transporters in lipid homeostasis | Homo sapiens |


#### Creates pathway dictionary
- A dictionary is created for each pathway, with the name and species lowercased
- The dictionaries are collected in a list (the list of pathways) in the format:
                        [{"rid": XXXX,\
                          "name" : XXXX,\
                          "species": XXXX}]
- Each list of dictionaries is written to a JSON file
- Two files are written: `pathway_dict.json` for all pathways and `human_pathway_dict.json` for human pathways only

In [8]:
allPathways = []
for r,n,s in zip(df['RID'],df['name'], df['species']):
    allPathways.append({"rid":r, "name":n.lower(), "species":s.lower()})

In [9]:
allPathways[0]

{'name': '5-phosphoribose 1-diphosphate biosynthesis',
 'rid': 'R-BTA-73843',
 'species': 'bos taurus'}

In [10]:
# Use a handle name other than `pd` so the pandas import is not shadowed.
with open("pathway_dict.json", 'w') as f:
    json.dump(allPathways, f)

In [11]:
humanPathways = []
for r,n,s in zip(hdf['RID'],hdf['name'], hdf['species']):
    humanPathways.append({"rid":r, "name":n.lower(), "species":s.lower()})

In [12]:
with open("human_pathway_dict.json", 'w') as hpd:
    json.dump(humanPathways, hpd)
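
As a quick check on the files written above, a consumer could load one back and index it by `rid`. The following is a minimal sketch under stated assumptions: the helper `load_pathway_index` is hypothetical, and a one-record sample file in a temporary directory stands in for `human_pathway_dict.json` so the example runs on its own.

```python
import json
import os
import tempfile

def load_pathway_index(path):
    """Load a list of pathway dictionaries and index it by the "rid" key."""
    with open(path, "r") as f:
        pathways = json.load(f)
    return {p["rid"]: p for p in pathways}

# Hypothetical stand-in for human_pathway_dict.json: one record in the
# {"rid", "name", "species"} format, with name and species lowercased.
sample = [
    {"rid": "R-HSA-164843", "name": "2-ltr circle formation", "species": "homo sapiens"}
]
tmp_path = os.path.join(tempfile.mkdtemp(), "human_pathway_dict.json")
with open(tmp_path, "w") as f:
    json.dump(sample, f)

index = load_pathway_index(tmp_path)
print(index["R-HSA-164843"]["name"])  # 2-ltr circle formation
```

Indexing by `rid` gives constant-time lookup of a pathway by its identifier, while the JSON file itself stays a plain list.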