<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-pathbank-data" data-toc-modified-id="Read-pathbank-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read pathbank data</a></span><ul class="toc-item"><li><span><a href="#Read-Pathbank-Pathway-metadata" data-toc-modified-id="Read-Pathbank-Pathway-metadata-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Read Pathbank Pathway metadata</a></span></li><li><span><a href="#Generate-species-pathway-ID" data-toc-modified-id="Generate-species-pathway-ID-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Generate species pathway ID</a></span></li><li><span><a href="#Generate-human-pathway-graph" data-toc-modified-id="Generate-human-pathway-graph-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Generate human pathway graph</a></span><ul class="toc-item"><li><span><a href="#Read-human-pathway" data-toc-modified-id="Read-human-pathway-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Read human pathway</a></span></li><li><span><a href="#Analysis-of-protein-classes-and-metabolite-types" data-toc-modified-id="Analysis-of-protein-classes-and-metabolite-types-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Analysis of protein classes and metabolite types</a></span></li></ul></li></ul></li></ul></div>

**Run this notebook in the `tf` conda environment:** <br>
`conda activate tf`

In [1]:
import pandas as pd

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import scipy.sparse as sp
import csv
import seaborn as sns
sns.set_style('whitegrid')
import glob
from rdkit import Chem
from rdkit.Chem import MACCSkeys
import json
from pathlib import Path
import pickle
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs



In [2]:
# Imports not already loaded by the previous cell
from sklearn.decomposition import PCA
from Bio import SeqIO

# Generation of MPI network

## Read Pathbank pathway metadata

In [3]:
meta_data_dir = '../Training_data/pathbank/Pathbank_Meta_Data/'
meta_data = pd.read_csv(meta_data_dir+'pathbank_pathways.csv')
meta_data.head(2)

Unnamed: 0,SMPDB ID,PW ID,Name,Subject,Description
0,SMP0000055,PW000001,Alanine Metabolism,Metabolic,Alanine (L-Alanine) is an α-amino acid that is...
1,SMP0000067,PW000002,Aspartate Metabolism,Metabolic,Aspartate is synthesized by transamination of ...


## Read Pathway protein/metabolite data

In [4]:
metabolite_data = pd.read_csv(meta_data_dir+'pathbank_all_metabolites.csv') 
protein_data = pd.read_csv(meta_data_dir+'pathbank_all_proteins.csv')
main_metabolite_data = pd.read_csv(meta_data_dir+'pathbank_primary_metabolites.csv') 
main_protein_data = pd.read_csv(meta_data_dir+'pathbank_primary_proteins.csv')


In [5]:
metabolite_data.head(2)

Unnamed: 0,PathBank ID,Pathway Name,Pathway Subject,Species,Metabolite ID,Metabolite Name,HMDB ID,KEGG ID,ChEBI ID,DrugBank ID,CAS,Formula,IUPAC,SMILES,InChI,InChI Key
0,SMP0000055,Alanine Metabolism,Metabolic,Homo sapiens,PW_C000414,Adenosine triphosphate,HMDB0000538,C00002,15422.0,DB00171,56-65-5,C10H16N5O13P3,"({[({[(2R,3S,4R,5R)-5-(6-amino-9H-purin-9-yl)-...",NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(O)(=O)OP(O...,InChI=1S/C10H16N5O13P3/c11-8-5-9(13-2-12-8)15(...,ZKHQWZAMYRWXGA-KQYNXXCUSA-N
1,SMP0000055,Alanine Metabolism,Metabolic,Homo sapiens,PW_C000105,L-Alanine,HMDB0000161,C00041,16977.0,DB00160,56-41-7,C3H7NO2,(2S)-2-aminopropanoic acid,C[C@H](N)C(O)=O,"InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5...",QNAYBMKLOCPYGJ-REOHCLBHSA-N


In [6]:
protein_data.head(2)

Unnamed: 0,PathBank ID,Pathway Name,Pathway Subject,Species,Uniprot ID,Protein Name,HMDBP ID,DrugBank ID,GenBank ID,Gene Name,Locus
0,SMP0000055,Alanine Metabolism,Metabolic,Homo sapiens,P49588,"Alanine--tRNA ligase, cytoplasmic",HMDBP00625,,AC012184,AARS,16q22
1,SMP0000055,Alanine Metabolism,Metabolic,Homo sapiens,P24298,Alanine aminotransferase 1,HMDBP00850,,U70732,GPT,8q24.3


In [7]:
for i in protein_data['Gene Name'].tolist()[:10]:
    print(i)

AARS
GPT
PC
AGXT
AARS2
MPC1
ABAT
GAD1
ASNS
NARS


In [8]:
import requests

## Obtain features of metabolites and proteins

In [9]:
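# Map each metabolite ID to its SMILES string
mets_dict = dict(zip(metabolite_data['Metabolite ID'].tolist(),metabolite_data['SMILES'].tolist()))
mets_vec = {}    # metabolite ID -> (fingerprint array, atom count), or None if the SMILES is unusable
mets_proc = []   # IDs with a valid fingerprint, in the same order as the rows of mets_fp
mets_fp = []
for k, smile in mets_dict.items():
    # MolFromSmiles returns None for SMILES it cannot parse; missing values (NaN) are skipped too
    mol = Chem.MolFromSmiles(smile) if isinstance(smile, str) else None
    if mol is None:
        mets_vec[k] = None
        continue
    counts = mol.GetNumAtoms()
    # 1024-bit Morgan fingerprint with radius 2
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    array = np.zeros((0, ), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, array)
    mets_vec[k] = (array,counts)
    mets_proc.append(k)
    mets_fp.append(array)
mets_fp = np.asarray(mets_fp)
mets_fp.shape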

RDKit ERROR: [19:13:03] Explicit valence for atom # 26 N, 4, is greater than permitted
RDKit ERROR: [19:13:03] Explicit valence for atom # 1 N, 4, is greater than permitted
RDKit ERROR: [19:13:03] Explicit valence for atom # 8 N, 4, is greater than permitted
RDKit ERROR: [19:13:03] Explicit valence for atom # 28 N, 4, is greater than permitted
RDKit ERROR: [19:13:03] Explicit valence for atom # 31 O, 3, is greater than permitted
RDKit ERROR: [19:13:03] Explicit valence for atom # 4 N, 4, is greater than permitted
RDKit ERROR: [19:13:04] Explicit valence for atom # 21 N, 4, is greater than permitted
RDKit ERROR: [19:13:05] Explicit valence for atom # 17 O, 3, is greater than permitted


(78261, 1024)

In [10]:
# Get PCA-transformed metabolite fingerprints at several dimensionalities
pca_mets_by_dim = {}
for n_components in (1024, 512, 256, 128, 64, 32, 16, 2):
    pca = PCA(n_components=n_components)
    pca.fit(mets_fp)
    pca_mets_fp = pca.transform(mets_fp)
    # metabolite ID -> reduced vector (rows of mets_fp follow the order of mets_proc)
    pca_mets_by_dim[n_components] = dict(zip(mets_proc, pca_mets_fp))

# Keep the per-size names used by the rest of the notebook
pca_mets_vec_1024 = pca_mets_by_dim[1024]
pca_mets_vec_512 = pca_mets_by_dim[512]
pca_mets_vec_256 = pca_mets_by_dim[256]
pca_mets_vec_128 = pca_mets_by_dim[128]
pca_mets_vec_64 = pca_mets_by_dim[64]
pca_mets_vec_32 = pca_mets_by_dim[32]
pca_mets_vec_16 = pca_mets_by_dim[16]
pca_mets_vec_2 = pca_mets_by_dim[2]

## Get protein vector

In [11]:
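protein_dir = '../Training_data/pathbank/pathbank_protein_data/'
# Load the precomputed protein vectors (UniProt ID -> vector) and close the file afterwards
with open(protein_dir+"all_protein_vector.p", "rb") as f:
    protein_vec = pickle.load(f)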

## Get PCA transformed protein feature

In [12]:
protein_np = []
protein_proc = []
for k,v in protein_vec.items():
    protein_proc.append(k)
    protein_np.append(v)
protein_np = np.array(protein_np)
protein_np.shape

(8291, 1024)

In [13]:
# Get PCA-transformed protein vectors at several dimensionalities
pca_protein_by_dim = {}
for n_components in (1024, 512, 256, 128, 64, 32, 16, 2):
    pca = PCA(n_components=n_components)
    pca.fit(protein_np)
    pca_protein_np = pca.transform(protein_np)
    # UniProt ID -> reduced vector (rows of protein_np follow the order of protein_proc)
    pca_protein_by_dim[n_components] = dict(zip(protein_proc, pca_protein_np))

# Keep the per-size names used by the rest of the notebook
pca_protein_np_1024 = pca_protein_by_dim[1024]
pca_protein_np_512 = pca_protein_by_dim[512]
pca_protein_np_256 = pca_protein_by_dim[256]
pca_protein_np_128 = pca_protein_by_dim[128]
pca_protein_np_64 = pca_protein_by_dim[64]
pca_protein_np_32 = pca_protein_by_dim[32]
pca_protein_np_16 = pca_protein_by_dim[16]
pca_protein_np_2 = pca_protein_by_dim[2]

In [14]:
protein_dir = '../Training_data/pathbank/pathbank_protein_data/'
protein_fasta = protein_dir+'pathbank_protein.fasta'

In [15]:
with open(protein_fasta) as fasta_file:
    identifiers = {}  # UniProt ID -> (sequence, sequence length)
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        # The UniProt ID is the text inside the last parentheses of the header
        description = seq_record.description.split('(')
        uniprot_id = description[-1][:-1]
        identifiers[uniprot_id] = (seq_record.seq,len(seq_record.seq))

In [16]:
len(identifiers.keys())

8291
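
The header parsing above takes the text inside the last pair of parentheses as the UniProt ID. Below is a minimal standard-library sketch of that rule, assuming every header ends in `(ID)`. The header layout and the helper name `uniprot_id_from_description` are assumptions for illustration; the two IDs are the ones shown in the protein table earlier in this notebook.

```python
def uniprot_id_from_description(description):
    """Return the text inside the last '(...)' of a FASTA description line."""
    # Same rule as the cell above: split on '(' and drop the trailing ')'
    return description.split('(')[-1][:-1]

# Hypothetical header layout, for illustration only
headers = [
    "Alanine aminotransferase 1 (P24298)",
    "Alanine--tRNA ligase, cytoplasmic (P49588)",
]
parsed_ids = [uniprot_id_from_description(h) for h in headers]
print(parsed_ids)
```

If a header does not end with a closing parenthesis, this rule silently drops the last character of whatever follows the final `(`, so it may be worth checking the parsed IDs against the keys of `protein_vec`.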

## Get species-specific data

| Species | Count |
|---|---|
| Homo sapiens | 20620 |
| Mus musculus | 12273 |
| Rattus norvegicus | 12207 |
| Escherichia coli | 3680 |
| Bos taurus | 3196 |
| Pseudomonas aeruginosa | 2736 |
| Arabidopsis thaliana | 2006 |
| Saccharomyces cerevisiae | 1757 |
| Drosophila melanogaster | 1713 |
| Caenorhabditis elegans | 1683 |

## Generation of species MPI network

In [17]:
def get_species_mpi_network(species,main_metabolite_data,main_protein_data):
    """Build the undirected MPI graph of one species by merging all of its pathway graphs."""
    # main_metabolite_data and main_protein_data are accepted but not used below
    spec_mets_data = metabolite_data[metabolite_data['Species'] == species]
    spec_mets_data_m = pd.merge(spec_mets_data,meta_data[['SMPDB ID','PW ID']],left_on='PathBank ID', right_on='SMPDB ID', how='left')
    spec_protein_data = protein_data[protein_data['Species'] == species]
    spec_protein_data_m = pd.merge(spec_protein_data,meta_data[['SMPDB ID','PW ID']],left_on='PathBank ID', right_on='SMPDB ID', how='left')
    # Pathway IDs that involve a metabolite or a protein of this species
    spec_pw_id_met = list(spec_mets_data_m['PW ID'].unique())
    spec_pw_id_protn = list(spec_protein_data_m['PW ID'].unique())
    spec_pw_id = list(set(spec_pw_id_met+spec_pw_id_protn))
    spec_pw_id = [x for x in spec_pw_id if str(x) != 'nan']
    g = nx.empty_graph(0, create_using=nx.Graph)
    for pwid in spec_pw_id:
        # Every adjacency-matrix CSV whose file name starts with this pathway ID
        pwfile = '../Training_data/pathbank/pathbank_pathway_csv/'+pwid+'*.csv'
        try:
            spwfile = glob.glob(pwfile)
            for spw in spwfile:
                df = pd.read_csv(spw,index_col = 0)
                g_df = nx.from_pandas_adjacency(df,create_using = nx.Graph())
                g = nx.compose(g, g_df)
        except Exception:
            # Skip pathway files that cannot be read or converted to a graph
            pass
    return g
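
`nx.compose` keeps the union of the nodes and edges of both graphs, so a metabolite or protein that appears in several pathways ends up as one shared node. The following plain-Python sketch mimics that merge for undirected graphs on two tiny made-up adjacency tables. The node names and the helper `edges_from_adjacency` are hypothetical, and the sketch only tracks edges, so unlike networkx it would lose isolated nodes.

```python
def edges_from_adjacency(adjacency):
    """Turn a {row: {col: weight}} table into a set of undirected edges."""
    edges = set()
    for row, cols in adjacency.items():
        for col, weight in cols.items():
            if weight:  # a non-zero entry means an edge
                edges.add(frozenset((row, col)))
    return edges

# Two made-up pathways that share the node "Metabolite X"
pathway_a = {
    "Enzyme 1": {"Enzyme 1": 0, "Metabolite X": 1},
    "Metabolite X": {"Enzyme 1": 1, "Metabolite X": 0},
}
pathway_b = {
    "Metabolite X": {"Metabolite X": 0, "Enzyme 2": 1},
    "Enzyme 2": {"Metabolite X": 1, "Enzyme 2": 0},
}

merged_edges = edges_from_adjacency(pathway_a) | edges_from_adjacency(pathway_b)
merged_nodes = set().union(*merged_edges)
print(len(merged_nodes), len(merged_edges))
```

Because nodes are merged by name, two different entities that share a label in different pathway files would also be collapsed into one node.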

In [18]:
meta_name_id_dict = dict(zip(metabolite_data['Metabolite Name'].tolist(),metabolite_data['Metabolite ID'].tolist()))
protein_name_id_dict = dict(zip(protein_data['Protein Name'].tolist(),protein_data['Uniprot ID'].tolist()))

## Generate features for MPI network

In [19]:
def random_mpi_features():
    # Placeholder for nodes without stored features: length 0, a random 1024-d
    # feature vector, and one random vector per PCA size (1024 down to 2)
    return [0,np.random.rand(1024)]+[np.random.rand(d) for d in (1024,512,256,128,64,32,16,2)]

def get_mpi_feature(g):
    """Return one row per node of g with its database ID, class, length, and feature vectors."""
    all_nodes_df = pd.DataFrame(columns=['node','dbid','class','length','features','pca_1024','pca_512','pca_256','pca_128','pca_64','pca_32','pca_16','pca_2'])
    for i, node in enumerate(g.nodes()):
        if node in meta_name_id_dict:
            name_id = meta_name_id_dict[node]
            node_class = 'metabolite'
            if mets_vec[name_id] is not None:
                # length = atom count, features = Morgan fingerprint
                feats = [mets_vec[name_id][1],mets_vec[name_id][0],pca_mets_vec_1024[name_id],pca_mets_vec_512[name_id],pca_mets_vec_256[name_id],pca_mets_vec_128[name_id],pca_mets_vec_64[name_id],pca_mets_vec_32[name_id],pca_mets_vec_16[name_id],pca_mets_vec_2[name_id]]
            else:
                feats = random_mpi_features()
        elif node in protein_name_id_dict:
            name_id = protein_name_id_dict[node]
            node_class = 'protein'
            if name_id in protein_vec:
                # length = sequence length, features = stored protein vector
                feats = [identifiers[name_id][1],protein_vec[name_id],pca_protein_np_1024[name_id],pca_protein_np_512[name_id],pca_protein_np_256[name_id],pca_protein_np_128[name_id],pca_protein_np_64[name_id],pca_protein_np_32[name_id],pca_protein_np_16[name_id],pca_protein_np_2[name_id]]
            else:
                feats = random_mpi_features()
        else:
            # Nodes that are neither a known metabolite nor a known protein
            name_id = None
            node_class = 'other'
            feats = random_mpi_features()
        all_nodes_df.loc[i,:] = [node, name_id, node_class] + feats
    return all_nodes_df

In [20]:
# Build, featurize, and save the MPI network of each species
species = ['Homo sapiens','Mus musculus','Rattus norvegicus','Escherichia coli','Bos taurus','Pseudomonas aeruginosa',
'Arabidopsis thaliana','Saccharomyces cerevisiae','Drosophila melanogaster','Caenorhabditis elegans']
for sp in species:
    mpi = get_species_mpi_network(sp,main_metabolite_data,main_protein_data)
    nodes_df = get_mpi_feature(mpi)
    mpi_file_name = '../features/mpi_network/'+'pca_mpi_'+str(sp).replace(' ','_')+'.pkl'
    df_file_name = '../features/mpi_features/'+'pca_feature_df_'+str(sp).replace(' ','_')+'.pkl'
    # The with-blocks close each file once the pickle has been written
    with open(mpi_file_name, 'wb') as mpi_data_file:
        pickle.dump(mpi,mpi_data_file)
    with open(df_file_name,'wb') as df_file:
        pickle.dump(nodes_df,df_file)


In [12]:
def get_species_mpi_network_directed(species,main_metabolite_data,main_protein_data):
    """Merge all pathway graphs of one species, reading each pathway as a directed graph."""
    # main_metabolite_data and main_protein_data are accepted but not used below
    spec_mets_data = metabolite_data[metabolite_data['Species'] == species]
    spec_mets_data_m = pd.merge(spec_mets_data,meta_data[['SMPDB ID','PW ID']],left_on='PathBank ID', right_on='SMPDB ID', how='left')
    spec_protein_data = protein_data[protein_data['Species'] == species]
    spec_protein_data_m = pd.merge(spec_protein_data,meta_data[['SMPDB ID','PW ID']],left_on='PathBank ID', right_on='SMPDB ID', how='left')
    # Pathway IDs that involve a metabolite or a protein of this species
    spec_pw_id_met = list(spec_mets_data_m['PW ID'].unique())
    spec_pw_id_protn = list(spec_protein_data_m['PW ID'].unique())
    spec_pw_id = list(set(spec_pw_id_met+spec_pw_id_protn))
    spec_pw_id = [x for x in spec_pw_id if str(x) != 'nan']
    # Note: the accumulator starts as an undirected nx.Graph, so the composed result is
    # reported as an undirected Graph; start from nx.DiGraph() if edge directions must be kept
    g = nx.empty_graph(0, create_using=nx.Graph)
    for pwid in spec_pw_id:
        pwfile = '../Training_data/pathbank/pathbank_pathway_csv/'+pwid+'*.csv'
        try:
            spwfile = glob.glob(pwfile)
            for spw in spwfile:
                df = pd.read_csv(spw,index_col = 0)
                g_df = nx.from_pandas_adjacency(df,create_using = nx.DiGraph())
                g = nx.compose(g, g_df)
        except Exception:
            # Skip pathway files that cannot be read, converted, or composed
            pass
    return g

In [13]:
species = ['Homo sapiens','Mus musculus','Rattus norvegicus','Escherichia coli','Bos taurus','Pseudomonas aeruginosa',
'Arabidopsis thaliana','Saccharomyces cerevisiae','Drosophila melanogaster','Caenorhabditis elegans']
for sp in species[:1]:
    mpi = get_species_mpi_network_directed(sp,main_metabolite_data,main_protein_data)

In [15]:
print(nx.info(mpi))

Name: 
Type: Graph
Number of nodes: 2306
Number of edges: 5785
Average degree:   5.0173


In [139]:
species = ['Homo sapiens','Escherichia coli','Bos taurus','Pseudomonas aeruginosa','Arabidopsis thaliana']
seed_number = [12345, 22345, 32345, 42345, 52345]
index = 0
result = {}
# Only the first species and the first seed are processed here
for spcs in species[:1]:
    spcs_temp = {}
    for seed_num in seed_number[:1]:
        print(spcs,seed_num)
        mpi_file_name = '../features/mpi_network/'+'mpi_'+str(spcs).replace(' ','_')+'.pkl'
        df_file_name = '../features/mpi_features/'+'feature_df_'+str(spcs).replace(' ','_')+'.pkl'
        with open(mpi_file_name, "rb") as f:
            g = pickle.load(f)
        with open(df_file_name, "rb") as f:
            node_feats = pickle.load(f)

Homo sapiens 12345


In [140]:
mets_id_select = metabolite_data[['Metabolite Name','Metabolite ID','HMDB ID','KEGG ID']]
mets_id_select = mets_id_select.drop_duplicates()
protn_id_select = protein_data[['Protein Name','Uniprot ID','Gene Name']]
protn_id_select = protn_id_select.drop_duplicates()

In [142]:
node_feats[node_feats['node'] == 'Glucose-6-phosphate isomerase']

Unnamed: 0,node,dbid,class,length,features,pca_1024,pca_512,pca_256
728,Glucose-6-phosphate isomerase,A0A072ZX40,protein,554,"[0.053579077, -0.013836206, -0.013869764, -0.0...","[-0.3121425, -0.2838221, 0.60420144, -0.609301...","[-0.31214222, -0.2838224, 0.60420084, -0.60930...","[-0.31214264, -0.2838218, 0.60420203, -0.60930..."


In [143]:
node_feats = node_feats.merge(mets_id_select,left_on='dbid',right_on='Metabolite ID',how='left')
print(len(node_feats))
node_feats = node_feats.merge(protn_id_select,left_on='dbid',right_on='Uniprot ID',how='left')
print(len(node_feats))

2306
4968


In [144]:
node_feats.drop_duplicates(subset='node', keep="last",inplace=True)
len(node_feats)

2306
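
The jump from 2306 to 4968 rows after the second merge is the usual effect of a left join against a lookup table whose key is not unique: each left row is repeated once per matching right row. The plain-Python sketch below reproduces the effect on made-up rows and then keeps the last row per node, as `drop_duplicates(subset='node', keep="last")` does above. The helper `left_join`, the IDs, and the gene names are hypothetical.

```python
def left_join(left_rows, right_rows, left_key, right_key):
    """Left join two lists of dicts; unmatched left rows are kept once."""
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if r[right_key] == left[left_key]]
        if not matches:
            joined.append(dict(left))
        for right in matches:
            joined.append({**left, **right})
    return joined

# Made-up rows: the lookup table holds two entries for the same ID
nodes = [{"node": "Protein A", "dbid": "ID1"}, {"node": "Metabolite B", "dbid": "ID2"}]
lookup = [
    {"Uniprot ID": "ID1", "Gene Name": "GENA"},
    {"Uniprot ID": "ID1", "Gene Name": "Gena"},
]
joined = left_join(nodes, lookup, "dbid", "Uniprot ID")
print(len(nodes), len(joined))

# Keeping the last row per node restores one row per node
deduplicated = list({row["node"]: row for row in joined}.values())
print(len(deduplicated))
```

Which duplicate survives depends on row order, so the gene name kept for a node is whichever variant happens to come last in the lookup table.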

In [145]:
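# Unique gene names; missing values are dropped explicitly because a set has no
# defined order, so slicing off its first element does not reliably remove NaN
protein_list = node_feats['Gene Name'].dropna().unique().tolist()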
proteins = '%0d'.join(protein_list)
url = 'https://string-db.org/api/tsv/network?identifiers=' + proteins + '&species=9606'
r = requests.get(url)
lines = r.text.split('\n') # pull the text from the response object and split based on new lines
data = [l.split('\t') for l in lines] # split each line into its components based on tabs
# convert to dataframe using the first row as the column names; drop empty, final row
df = pd.DataFrame(data[1:-1], columns = data[0]) 
# dataframe with the preferred names of the two proteins and the score of the interaction
interactions = df[['preferredName_A', 'preferredName_B', 'score']]
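
The response parsing above splits the text on newlines and tabs by hand. Here is a minimal sketch of the same step with the standard `csv` module, assuming a tab-separated response whose header contains `preferredName_A`, `preferredName_B`, and `score` (the column names used above). The sample text is invented so that the example runs without a network call.

```python
import csv
import io

# Invented sample in the shape of a tab-separated response (not real STRING output)
sample_text = (
    "preferredName_A\tpreferredName_B\tscore\n"
    "GENE1\tGENE2\t0.9\n"
    "GENE1\tGENE3\t0.4\n"
)

reader = csv.DictReader(io.StringIO(sample_text), delimiter="\t")
sample_interactions = [
    # Text parsing yields strings, so the score is converted to a number here
    (row["preferredName_A"], row["preferredName_B"], float(row["score"]))
    for row in reader
]
print(sample_interactions)
```

Reading by column name avoids the manual handling of the header row and of the empty final line.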

In [146]:
g_kg = nx.empty_graph(0, create_using=nx.Graph)

adj_df = pd.read_csv('../features/kegg/kegg_reactions.csv',index_col = 0)
g_df = nx.from_pandas_adjacency(adj_df,create_using = nx.Graph())
g_kg = nx.compose(g_kg, g_df)

In [147]:
# One row per KEGG reaction edge, with the 'cpd:' prefix stripped from both compound IDs
kegg_df_edge = pd.DataFrame(
    [(edge[0].replace('cpd:',''), edge[1].replace('cpd:','')) for edge in g_kg.edges()],
    columns=['kegg_id_1','kegg_id_2'])

In [149]:
node_feats.head()

Unnamed: 0,node,dbid,class,length,features,pca_1024,pca_512,pca_256,Metabolite Name,Metabolite ID,HMDB ID,KEGG ID,Protein Name,Uniprot ID,Gene Name
0,"Enoyl-CoA hydratase, mitochondrial",P14604,protein,290,"[-0.0115655055, -0.02298307, -0.023107061, 0.0...","[0.69804204, 1.5457662, -0.19488898, -0.072530...","[0.69804233, 1.5457658, -0.1948879, -0.0725309...","[0.6980426, 1.5457666, -0.19489, -0.0725313, -...",,,,,"Enoyl-CoA hydratase, mitochondrial",P14604,Echs1
1,3-Hydroxybutyryl-CoA,PW_C000904,metabolite,54,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, ...","[0.4531775710572908, 1.3835354738819938, 5.772...","[0.45317757105729123, 1.38353547388199, 5.7727...","[0.45317757105729006, 1.383535473881945, 5.772...",3-Hydroxybutyryl-CoA,PW_C000904,HMDB0001166,C03561,,,
2,Crotonoyl-CoA,PW_C001345,metabolite,53,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, ...","[0.04700508297648426, 1.1281863993703316, 5.78...","[0.04700508297648365, 1.1281863993703287, 5.78...","[0.047005082976483314, 1.1281863993702816, 5.7...",Crotonoyl-CoA,PW_C001345,HMDB0002009,C00877,,,
3,Water,PW_C001420,metabolite,1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-0.16956676756781652, 2.8263338162518443, 3.2...","[-0.1695667675678163, 2.8263338162518394, 3.22...","[-0.1695667675678263, 2.8263338162518137, 3.22...",Water,PW_C001420,HMDB0002111,C00001,,,
4,"Acetyl-CoA acetyltransferase, mitochondrial",P24752,protein,427,"[-0.10862053, -0.12662861, -0.16699015, 0.1391...","[0.66844195, 1.3619962, -0.062943734, -0.75548...","[0.66844237, 1.3619953, -0.062942795, -0.75548...","[0.6684426, 1.3619963, -0.062944755, -0.755485...",,,,,"Acetyl-CoA acetyltransferase, mitochondrial",P24752,ACAT1


In [89]:
# Plan for adding edges to g (compare with g.edges()):
#   gene name -> protein node name
#   KEGG ID   -> metabolite node name

In [150]:
gene_name_protein_dict = dict(zip(node_feats['Gene Name'].tolist(),node_feats['node'].tolist()))
kegg_id_metabolite_dict = dict(zip(node_feats['KEGG ID'].tolist(),node_feats['node'].tolist()))

In [151]:
# Reassign instead of using inplace=True, which raises SettingWithCopyWarning on a slice of df
interactions = interactions.drop_duplicates()


In [152]:
len(interactions)

12751

In [153]:
# Add an edge for every STRING interaction whose two gene names both map to a node
for index, row in interactions.iterrows():
    node_1 = row['preferredName_A']
    node_2 = row['preferredName_B']
    if node_1 in gene_name_protein_dict and node_2 in gene_name_protein_dict:
        n1 = gene_name_protein_dict[node_1]
        n2 = gene_name_protein_dict[node_2]
        g.add_edge(n1, n2)

In [154]:
kegg_df_edge.head()

Unnamed: 0,kegg_id_1,kegg_id_2
0,C00084,C00024
1,C00084,C05125
2,C00084,C00469
3,C00084,C00186
4,C00084,C00033


In [155]:
counts = 0
for index, row in kegg_df_edge.iterrows():
    node_1 = row['kegg_id_1']
    node_2 = row['kegg_id_2']
    if node_1 in kegg_id_metabolite_dict and node_2 in kegg_id_metabolite_dict:
        n1 = kegg_id_metabolite_dict[node_1]
        n2 = kegg_id_metabolite_dict[node_2]
        counts += 1
        # print((node_1,node_2))
        g.add_edge(n1, n2)
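
Both edge-adding loops follow the same pattern: look up each endpoint in an ID-to-node dictionary and add the edge only when both endpoints are known. The sketch below shows that pattern on its own. The helper `translate_edges` is hypothetical, the three KEGG IDs and names come from the node table shown above, and the edge list and the ID `C99999` are made up.

```python
def translate_edges(id_edges, id_to_node):
    """Map (id, id) pairs to (node, node) pairs, skipping unmapped IDs."""
    translated = []
    for id_1, id_2 in id_edges:
        if id_1 in id_to_node and id_2 in id_to_node:
            translated.append((id_to_node[id_1], id_to_node[id_2]))
    return translated

kegg_to_node = {
    "C00001": "Water",
    "C03561": "3-Hydroxybutyryl-CoA",
    "C00877": "Crotonoyl-CoA",
}
# Made-up edge list; "C99999" stands for an ID with no node in the graph
kegg_edges = [("C03561", "C00877"), ("C00877", "C00001"), ("C00001", "C99999")]

new_edges = translate_edges(kegg_edges, kegg_to_node)
print(new_edges)
```

The lookups are exact, case-sensitive string matches, so a gene name stored as `Echs1` would not match an identifier spelled `ECHS1`.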

In [156]:
print(nx.info(g))

Name: 
Type: Graph
Number of nodes: 2306
Number of edges: 8500
Average degree:   7.3721


In [157]:
len(g.nodes())

2306


In [158]:
len(node_feats)

2306

In [159]:
mpi_file_name = '../features/mpi_network/'+'mpi_hs_all.pkl'
df_file_name = '../features/mpi_features/'+'feature_df_mpi_hs_all.pkl'
# Save the extended graph and its node table; the with-blocks close the files after writing
with open(mpi_file_name, 'wb') as mpi_data_file:
    pickle.dump(g,mpi_data_file)
with open(df_file_name,'wb') as df_file:
    pickle.dump(node_feats,df_file)

## MPI network description

In [35]:
mpi_data_file = open('../features/mpi_network/mpi_Homo_sapiens.pkl', 'rb') 
g = pickle.load(mpi_data_file)
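
Pickle files are easiest to handle inside `with` blocks, which close the file even if an error occurs part-way through. A minimal sketch of a save-and-load round trip follows; the temporary directory, the file name `example_graph.pkl`, and the small dictionary standing in for the graph are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the graph object (hypothetical data)
example_graph = {"nodes": ["Water", "L-Alanine"], "edges": [("Water", "L-Alanine")]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_graph.pkl")
    # Write the object, then read it back from the same path
    with open(path, "wb") as handle:
        pickle.dump(example_graph, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == example_graph)
```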

In [36]:
nodes = list(g.nodes())

In [39]:
c_mets = 0
c_prtn = 0
c_prtn_nodes = []
c_mets_nodes = []
for i in nodes:
    if i in meta_name_id_dict:
        c_mets += 1
        c_mets_nodes.append(i)
    if i in protein_name_id_dict:
        c_prtn += 1
        c_prtn_nodes.append(i)
print('number of metabolites ',c_mets)
print('number of protein ',c_prtn)
print('total number ',len(nodes),c_mets+c_prtn,len(nodes)-c_prtn)

number of metabolites  1261
number of protein  855
total number  2306 2116 1451


In [41]:
protein_name_id_dict

{'Alanine--tRNA ligase, cytoplasmic': 'P50475',
 'Alanine aminotransferase 1': 'Q4V7F7',
 'Pyruvate carboxylase, mitochondrial': 'Q64555',
 'Serine--pyruvate aminotransferase': 'Q94055',
 'Alanine--tRNA ligase, mitochondrial': 'D3ZX08',
 'Mitochondrial pyruvate carrier 1': 'Q4V8N5',
 '4-aminobutyrate aminotransferase, mitochondrial': 'Q66HM1',
 'Glutamate decarboxylase 1': 'P18088',
 'Asparagine synthetase [glutamine-hydrolyzing]': 'Q66HR8',
 'Asparagine--tRNA ligase, cytoplasmic': 'Q8BP47',
 'Argininosuccinate synthase': 'V6AB05',
 'Aspartate--tRNA ligase, cytoplasmic': 'P15178',
 'Adenylosuccinate lyase': 'Q9I0K9',
 'D-aspartate oxidase': 'Q922Z0',
 'Aspartoacylase': 'Q6AZ03',
 'Adenylosuccinate synthetase isozyme 1': 'M0R629',
 'CAD protein': 'B7ZN27',
 'Isoaspartyl peptidase/L-asparaginase': 'Q8CG44',
 'Argininosuccinate lyase': 'Q02EA0',
 'L-amino-acid oxidase': 'Q9CXK7',
 'Unknown': 'Unknown',
 'Glutamine synthetase': 'Q9HU65',
 'Glutaminase liver isoform, mitochondrial': 'Q64606

In [154]:
test_g = g.copy()

In [155]:
redundant_nodes = ['Water','Hydrogen Ion','Phosphate','Oxygen','Pyrophosphate','Carbon dioxide','Hydrogen peroxide','Ammonia','Sodium']

In [156]:
for i in nodes:
    if i not in c_mets_nodes and i not in c_prtn_nodes:
        print(i,g.degree(i))
        test_g.remove_node(i)
    if i in redundant_nodes:
        test_g.remove_node(i)

electron-transfer flavoprotein 8
Reduced electron-transfer flavoprotein 8
Acetyl-CoA acyltransferase 14
Trifunctional enzyme, mitochondrial 38
Carnitine O-palmitoyltransferase 1 6
SubPathwayInput 152
ATP-binding cassette sub-family D 4
phosphatidylinositol 4,5-bisphosphate 1
1,2-diacyl-sn-glycerol 4
Sodium/potassium ATPase 6
Nicotinic Acetylcholine Receptor 3
L type Calcium channel 1
Sodium channel 1
Voltage Gated Potassium Channel 1
Glutamate receptor ionotropic 1
SubPathwayActivator 20
Tenase complex 1
SubPathwayInhibition 38
Fibrinogen 2
Fibrin (loose) 3
Coagulation factor XIIIa 2
Fibrin (mesh) 4
Fibrin degradation products 3
Tissue factor:Coagulation factor VIIa 1
SubPathwayActivation 8
Precursors of Prothrombin and coagulation factors VII, IX, ad X 1
Prothrombin and coagulation factors VII, IX and X 1
SubPathwayOutput 125
(S)-5-Diphosphomevalonic acid 2
Oxoglutarate dehydrogenase complex 11
Epoxide hydratase 2 9
AH2 2
A  1
Alcohol 1
Inward rectifier potassium channel (IK1) 1
Acety

In [157]:
attributes_type_dict = {}

In [158]:
# Label every remaining node as a metabolite or a protein
for node in test_g.nodes():
    if node in c_mets_nodes:
        attributes_type_dict[node]='metabolite'
    elif node in c_prtn_nodes:
        attributes_type_dict[node]='protein'

In [159]:
nx.set_node_attributes(test_g, attributes_type_dict, name="type")

In [160]:
nx.write_gexf(test_g, "test_g.gexf")

In [161]:
def plot_degree_dist(G,nodes):
    degrees = [G.degree(n) for n in nodes]
    plt.hist(degrees)
    plt.show()

met_degrees = [(n,g.degree(n)) for n in c_mets_nodes]
prtn_degrees = [(n,g.degree(n)) for n in c_prtn_nodes]

In [162]:
met_degrees_counts = [i[1] for i in met_degrees]
prtn_degrees_counts = [i[1] for i in prtn_degrees]

In [163]:
np.median(met_degrees_counts)

2.0

In [164]:
np.median(prtn_degrees_counts)

5.0

In [165]:
met_degrees_sort = sorted(met_degrees, key = lambda x: x[1],reverse=True)
prtn_degrees_sort = sorted(prtn_degrees, key = lambda x: x[1],reverse=True)

In [166]:
met_degrees_sort[int(len(met_degrees_sort)/2)]

('D-Maltose', 2)

In [167]:
prtn_degrees_sort[int(len(prtn_degrees_sort)/2)]

('D-3-phosphoglycerate dehydrogenase', 5)

In [168]:
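# Degree histogram: index k-1 counts nodes of degree k (1 to 10); the last bin counts degrees above 10.
# A node of degree 0 would also land in the last bin, because index -1 is the last element.
c_met_degrees = [0]*11
c_prtn_degrees = [0]*11
for _, deg in met_degrees:
    if deg <= 10:
        c_met_degrees[deg-1] += 1
    else:
        c_met_degrees[-1] += 1
for _, deg in prtn_degrees:
    if deg <= 10:
        c_prtn_degrees[deg-1] += 1
    else:
        c_prtn_degrees[-1] += 1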

In [169]:
c_met_degrees

[387, 367, 170, 109, 65, 51, 26, 14, 9, 8, 55]

In [170]:
c_prtn_degrees

[57, 92, 78, 173, 69, 112, 75, 49, 28, 17, 105]
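
The two lists above hold 11 bins each: degrees 1 to 10, followed by one pooled bin for degrees above 10. The sketch below shows an alternative binning with `collections.Counter` that labels each bin and gives degree 0 a bin of its own. The helper `bin_degrees` and the example degrees are made up for illustration and are not the notebook's data.

```python
from collections import Counter

def bin_degrees(degrees, max_degree=10):
    """Count degrees 0..max_degree individually and pool larger ones."""
    # Every degree above max_degree is mapped to the single overflow slot
    counts = Counter(min(d, max_degree + 1) for d in degrees)
    labels = [str(d) for d in range(max_degree + 1)] + [">" + str(max_degree)]
    return {label: counts.get(i, 0) for i, label in enumerate(labels)}

# Made-up degrees for illustration
example_degrees = [0, 1, 1, 2, 5, 10, 11, 38, 152]
binned = bin_degrees(example_degrees)
print(binned)
```

Labelled bins can be passed directly to a bar plot, and keeping degree 0 separate avoids mixing isolated nodes with hubs.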