# This notebook is created to organize data from MetaCyc into one dataframe

### What kind of data is available by MetaCyc zipped file?
Noted that we recieved data from Pathway Tools platform version 22.6

**Important data that we extracted**
- compounds.dat
- enzrxns.dat
- reactions.dat

**Somewhat useful**
- atom-mapping and atom-mappint-smiles
- pathways

**May not be useful**
- classes
- gene_association
- genes
- metabolic-reactions.xml
- proteins
- protligandcplxes
- protseq
- pubs
- regulation
- rnas
- species
- transporters

In [1]:
import numpy as np
import pandas as pd
import math
import ast

### This is how our desired master dataframe look from last quarter

For now, make a dataset with enzyme/compounds linked to each other and then we can use RDKit-based function and existing functions to generate Mol-files, Distance, and Negative data

In [2]:
master_df = pd.read_csv('../datasets/MASTER_DF.csv')

In [3]:
master_df.head()

Unnamed: 0,entry,product,reacts,PubChem,SMILES,Mol,Fingerprint,dist,enzyme_class_1,enzyme_class_2,...,enzyme_class_7,n_C,n_H,n_O,n_N,n_P,n_S,n_X,DoU,MW
0,1.8.99.5,C00094,1.0,3394,OS(=O)O,<rdkit.Chem.rdchem.Mol object at 0x1ac9b8a210>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,0.0,1,0,...,0,0.0,2.0,3.0,0.0,0.0,1.0,0.0,0.0,82.08
1,1.13.11.18,C00094,1.0,3394,OS(=O)O,<rdkit.Chem.rdchem.Mol object at 0x1ac9b8a580>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,0.511007,1,0,...,0,0.0,2.0,3.0,0.0,0.0,1.0,0.0,0.0,82.08
2,1.8.99.5,C00283,1.0,3578,S,<rdkit.Chem.rdchem.Mol object at 0x1ac9b8ac10>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,0.0,1,0,...,0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,34.083
3,2.8.1.2,C00283,1.0,3578,S,<rdkit.Chem.rdchem.Mol object at 0x1ac9b8a2b0>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,0.241667,0,1,...,0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,34.083
4,4.4.1.28,C00283,1.0,3578,S,<rdkit.Chem.rdchem.Mol object at 0x1ac9b8a120>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,0.294605,0,0,...,0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,34.083


In [4]:
feature_df = master_df[['PubChem', 'dist', 'enzyme_class_1', 'enzyme_class_2', 'enzyme_class_3',
       'enzyme_class_4', 'enzyme_class_5', 'enzyme_class_6', 'enzyme_class_7',
        'n_O', 'n_N', 'n_P', 'n_S', 'n_X', 'DoU']]
feature_df.set_index(keys=['PubChem'], inplace=True)
feature_df.head()

Unnamed: 0_level_0,dist,enzyme_class_1,enzyme_class_2,enzyme_class_3,enzyme_class_4,enzyme_class_5,enzyme_class_6,enzyme_class_7,n_O,n_N,n_P,n_S,n_X,DoU
PubChem,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
3394,0.0,1,0,0,0,0,0,0,3.0,0.0,0.0,1.0,0.0,0.0
3394,0.511007,1,0,0,0,0,0,0,3.0,0.0,0.0,1.0,0.0,0.0
3578,0.0,1,0,0,0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0
3578,0.241667,0,1,0,0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0
3578,0.294605,0,0,0,1,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0


## The current database from MetaCyc starts here

In [41]:
df_cpd = pd.read_csv('df_cpd.csv', index_col = 0)
df_rxn = pd.read_csv('parsed_rxns.csv', index_col = 0)
df_enz = pd.read_csv('df_enzrxns.csv', index_col = 0)
df_cpd = df_cpd.set_index(keys ='UNIQUE-ID')
df_rxn = df_rxn.set_index(keys = 'UNIQUE-ID')
df_enz = df_enz.set_index(keys = 'UNIQUE-ID')

# Here is an important note
The data from csv reading is in text formatted which lead to some problematic data handling.
- Some int data was read as float e.g. PubChemID which can be fixed by .astype() shown in cell #4
- List was formatted in text which required decoding by ast.literal_eval(x) which was written as a fucntion below

In [3]:
def recover_list(df, column):
    """This function will recover a list formatted string read from .csv into a list"""
    assert type(df[column][0]) != type([]), "TypeError: The data type is already a list, it should not be converted again"
    replacement = []
    for index, row in df.iterrows():
        
        data = []
        
        if type(row[column]) == type('string'):
            data = ast.literal_eval(row[column])
        else:
            pass
        replacement.append(data)
    df[column] = replacement
    return

In [42]:
# Change PubChemID into int type in df_cpd
PubChemID_int = df_cpd['PubChemID'].fillna(0).astype(int)
df_cpd['PubChemID'] = PubChemID_int

# Recover list format of df_rxn
rxn_list_fix = ['EC-NUMBER', 'ERXN-NUMBER', 'SUBSTRATES', 'PRODUCTS']
for col in rxn_list_fix:
    recover_list(df_rxn, col)

# Recover list format of df_enz
enz_list_fix = ['REACTION', 'ALTERNATIVE-SUBSTRATES', '^SUBSTRATE', 'KM', 'KCAT', 'VMAX']
for col in enz_list_fix:
    recover_list(df_enz, col)

##### the code below verify that all `.astype()` of PubChemID data didn't change the data unless it is `nan`
df_cpd = pd.read_csv('df_cpd.csv', index_col = 0)
df_cpd = df_cpd.set_index(keys ='UNIQUE-ID')
PubChemID_int = df_cpd['PubChemID'].fillna(0).astype(int)
for i in range(PubChemID_int.shape[0]):
    
#print (df_cpd['PubChemID'].loc['CPD-14743'])
#a = df_cpd['PubChemID'].loc['CPD-14743']
    if PubChemID_int[i] != df_cpd['PubChemID'][i]:
        print(PubChemID_int[i], df_cpd['PubChemID'][i])

In [11]:
df_cpd.head()

Unnamed: 0_level_0,COMMON-NAME,GIBBS-0,INCHI,SMILES,PubChemID
UNIQUE-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
CPD-14966,"(2Z,4Z)-2-hydroxy-5-carboxymuconate-6-semialde...",-150.73376,InChI=1S/C7H6O6/c8-3-4(6(10)11)1-2-5(9)7(12)13...,[CH](=O)C(C(=O)[O-])=CC=C(O)C([O-])=O,90657979
CPD-14905,"4&alpha;,14&alpha;-dimethyl-porifersta-8,25(27...",568.9283,InChI=1S/C31H52O/c1-9-23(20(2)3)11-10-21(4)24-...,CCC(C(C)=C)CCC(C)[CH]3(CCC4(C)(C2(CC[CH]1(C(C)...,102515047
CPD-14658,(+)-orobanchyl acetate,90.736885,InChI=1S/C21H24O7/c1-10-8-14(27-19(10)23)25-9-...,CC1(C(=O)OC(C=1)OC=C2(C(=O)O[CH]3([CH]2C(OC(C)...,24796587
CPD-14885,3-aminobenzoate,12.324662,InChI=1S/C7H7NO2/c8-6-3-1-2-5(4-6)7(9)10/h1-4H...,C(=O)([O-])C1(C=C(N)C=CC=1),3014145
CPD-14884,3-amino-4-hydroxybenzenesulfonate,-40.905334,"InChI=1S/C6H7NO4S/c7-5-3-4(12(9,10)11)1-2-6(5)...",C1(C(S(=O)([O-])=O)=CC(N)=C(O)C=1),4146016


In [12]:
df_rxn.head()

Unnamed: 0_level_0,ERXN-NUMBER,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
UNIQUE-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RXN-12314,[ENZRXN-19061],[],"[CPD-7557, Red-NADPH-Hemoprotein-Reductases, O...","[CPD-13248, Ox-NADPH-Hemoprotein-Reductases, W...",-108.40886
RXN-16877,[ENZRXN-24025],[],[Cys-Cys-HlmE],[4N-2mthio-5oxo-3Spyrrolidine-2-COOH-HlmE],-29.53003
PORPHOBILSYNTH-RXN,"[PORPHOBILSYNTH-ENZRXN, ENZRXN66-1464, ENZRXN-...",[EC-4.2.1.24],[5-AMINO-LEVULINATE],"[PROTON, WATER, PORPHOBILINOGEN]",-33.708138
RXN-9204,[ENZRXN-14772],[EC-2.5.1.74],"[CPD-21340, DIHYDROXYNAPHTHOATE, PROTON]","[CPD-12118, PPI, CARBON-DIOXIDE]",-16.385834
RXN-9516,"[ENZRXN-15226, ENZRXN3O-10296, ENZRXN1G-511, E...","[EC-2.3.1.179, EC-2.3.1.41]","[Butanoyl-ACPs, MALONYL-ACP, PROTON]","[3-oxo-hexanoyl-ACPs, CARBON-DIOXIDE, ACP]",-6.862854


In [13]:
df_enz.head()

Unnamed: 0_level_0,COMMON-NAME,ENZYME,REACTION,ALTERNATIVE-SUBSTRATES,^SUBSTRATE,KM,KCAT,VMAX,PH-OPT,SPECIFIC-ACTIVITY,TEMPERATURE-OPT
UNIQUE-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
ENZRXN-15384,maleylacetate reductase,MONOMER-14207,[MALEYLACETATE-REDUCTASE-RXN],[],[],[],[],[],,,
ENZRXN-26017,neurosporene &beta;-cyclase,MONOMER-20355,[RXN-8038],[],[],[],[],[],,,
ENZRXN-18089,&beta;-<i>N</i>-acetylhexosaminidase,CPLX-8130,[3.2.1.52-RXN],[],[],[],[],[],,,
ENZRXN-19053,cellulase,MONOMER-16501,[RXN-2043],"[(CELLULOSE |Lichenin|), (CELLULOSE |Xylogluca...",[],[],[],[],,,
ENZRXN-952,IPP isomerase,YPL117C-MONOMER,[IPPISOM-RXN],[],[],[],[],[],,,


`&beta;-<i>N</i>-acetylhexosaminidase` refer to __&beta;-<i>N</i>-acetylhexosaminidase__

Some weird notation here
- RXN-12314 doesn't have any enzyme affilated
- However, it links to a certain enzymatic reaction ENZRXN-19061 which has the reaction linked back as well
- What I don't understand is ENZYME column from enzrxn dataset which link to somewhere else that we cannot find an enzyme database in this current version of MetaCyc database
    - This ENZYME name and common name match with ENZYME name in rxn dataset. However, this is still unclear

##### Write a function that call a `UNIQUE-ID` or `COMMON-NAME` into InChI
This is an easiest approach to link df_rxn with df_cpd

In [5]:
def get_inchi(ID):
    
    """This function accept UNIQUE-ID and return InChI string of a certain compound"""
    
    inchi = df_cpd['INCHI'][ID]
    
    return inchi

def get_smiles(ID):
    
    """This function accept UNIQUE-ID and return SMILES string of a certain compound"""
    
    smiles = df_cpd['SMILES'][ID]
    
    return smiles

In [195]:
get_inchi('CPD-7557')

'InChI=1S/C15H22O/c1-10-4-6-13-11(2)5-7-14(12(3)9-16)15(13)8-10/h8-9,11,13-15H,3-7H2,1-2H3/t11-,13+,14+,15+/m1/s1'

In [197]:
get_smiles('CPD-7557')

'CC1(CC[CH]2(C(CC[CH]([CH](C=1)2)C(C=O)=C)C))'

In [198]:
df_enz.loc['ENZRXN-19061']

COMMON-NAME               artemisinic aldehyde monooxygenase
ENZYME                                         MONOMER-12185
REACTION                                         [RXN-12314]
ALTERNATIVE-SUBSTRATES                                    []
^SUBSTRATE                                                []
KM                                                        []
KCAT                                                      []
VMAX                                                      []
PH-OPT                                                   NaN
SPECIFIC-ACTIVITY                                        NaN
TEMPERATURE-OPT                                          NaN
Name: ENZRXN-19061, dtype: object

In [63]:
type(int(df_cpd['PubChemID'].loc['CPD-14966']))

int

In [57]:
df_cpd.loc['5-AMINO-LEVULINATE']

COMMON-NAME                                    5-aminolevulinate
GIBBS-0                                                 -20.4499
INCHI          InChI=1S/C5H9NO3/c6-3-4(7)1-2-5(8)9/h1-3,6H2,(...
SMILES                                   C(C(C[N+])=O)CC([O-])=O
PubChemID                                            7.04852e+06
Name: 5-AMINO-LEVULINATE, dtype: object

In [58]:
df_cpd.loc['PROTON']

COMMON-NAME    H<SUP>+</SUP>
GIBBS-0             0.429061
INCHI           InChI=1S/p+1
SMILES                  [H+]
PubChemID               1038
Name: PROTON, dtype: object

In [14]:
df_rxn.head()

Unnamed: 0_level_0,ERXN-NUMBER,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
UNIQUE-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RXN-12314,[ENZRXN-19061],[],"[CPD-7557, Red-NADPH-Hemoprotein-Reductases, O...","[CPD-13248, Ox-NADPH-Hemoprotein-Reductases, W...",-108.40886
RXN-16877,[ENZRXN-24025],[],[Cys-Cys-HlmE],[4N-2mthio-5oxo-3Spyrrolidine-2-COOH-HlmE],-29.53003
PORPHOBILSYNTH-RXN,"[PORPHOBILSYNTH-ENZRXN, ENZRXN66-1464, ENZRXN-...",[EC-4.2.1.24],[5-AMINO-LEVULINATE],"[PROTON, WATER, PORPHOBILINOGEN]",-33.708138
RXN-9204,[ENZRXN-14772],[EC-2.5.1.74],"[CPD-21340, DIHYDROXYNAPHTHOATE, PROTON]","[CPD-12118, PPI, CARBON-DIOXIDE]",-16.385834
RXN-9516,"[ENZRXN-15226, ENZRXN3O-10296, ENZRXN1G-511, E...","[EC-2.3.1.179, EC-2.3.1.41]","[Butanoyl-ACPs, MALONYL-ACP, PROTON]","[3-oxo-hexanoyl-ACPs, CARBON-DIOXIDE, ACP]",-6.862854


In [147]:
df_rxn.loc['3.2.1.52-RXN']

ERXN-NUMBER    ['ENZRXN0-241', 'ENZRXN-18087', 'ENZRXN-18089'...
EC-NUMBER                                        ['EC-3.2.1.52']
SUBSTRATES            ['N-Acetyl-beta-D-Hexosaminides', 'WATER']
PRODUCTS             ['N-acetyl-beta-D-hexosamines', 'Alcohols']
GIBBS                                                    130.412
Name: 3.2.1.52-RXN, dtype: object

To strip the string see method below
>>> import ast
>>> x = u'[ "A","B","C" , " D"]'
>>> x = ast.literal_eval(x)
>>> x
['A', 'B', 'C', ' D']
>>> x = [n.strip() for n in x]
>>> x
['A', 'B', 'C', 'D']

In [129]:
text = df_rxn['EC-NUMBER'].iloc[4]
text

"['EC-2.3.1.179', 'EC-2.3.1.41']"

In [135]:
type(text)

str

In [130]:
df_rxn.loc['PORPHOBILSYNTH-RXN']

ERXN-NUMBER    ['PORPHOBILSYNTH-ENZRXN', 'ENZRXN66-1464', 'EN...
EC-NUMBER                                        ['EC-4.2.1.24']
SUBSTRATES                                ['5-AMINO-LEVULINATE']
PRODUCTS                  ['PROTON', 'WATER', 'PORPHOBILINOGEN']
GIBBS                                                   -33.7081
Name: PORPHOBILSYNTH-RXN, dtype: object

In [131]:
text = df_rxn['EC-NUMBER'].loc['PORPHOBILSYNTH-RXN']

In [138]:
text = df_rxn['EC-NUMBER'].loc['RXN-9516']
split = text[2:-2].split("', '")
split

['EC-2.3.1.179', 'EC-2.3.1.41']

In [139]:
type(split)

list

In [140]:
split[0]

'EC-2.3.1.179'

In [146]:
# count how many reactions doesn't have EC-NUMBER
counter_1 = 0
counter_m = 0
counter_n = 0

for index, row in df_rxn.iterrows():
    
    if type(row['EC-NUMBER']) != type('string'):
        #   if math.isnan(row['EC-NUMBER']):
        counter_n += 1
    else:
        data = ast.literal_eval(row['EC-NUMBER'])
        
        if len(data) == 1:
            counter_1 += 1
        elif len(data) > 1:
            counter_m += 1
        else:
            pass
print('Out of total', df_rxn.shape[0], 'row of df_rxn')
print('The data with only one, multiple, and no EC-Number are', counter_1, counter_m, 'and', counter_n, 'respectively')

Out of total 16834 row of df_rxn
The data with only one, multiple, and no EC-Number are 13251 200 and 3383 respectively


# Rearrange them into master dataframe format
Here, we have to turn all df_rxn into that of master dataframe by re-indexing with EC-Number as an index

In [6]:
df_rxn.head()

Unnamed: 0_level_0,ERXN-NUMBER,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
UNIQUE-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RXN-12314,[ENZRXN-19061],[],"[CPD-7557, Red-NADPH-Hemoprotein-Reductases, O...","[CPD-13248, Ox-NADPH-Hemoprotein-Reductases, W...",-108.40886
RXN-16877,[ENZRXN-24025],[],[Cys-Cys-HlmE],[4N-2mthio-5oxo-3Spyrrolidine-2-COOH-HlmE],-29.53003
PORPHOBILSYNTH-RXN,"[PORPHOBILSYNTH-ENZRXN, ENZRXN66-1464, ENZRXN-...",[EC-4.2.1.24],[5-AMINO-LEVULINATE],"[PROTON, WATER, PORPHOBILINOGEN]",-33.708138
RXN-9204,[ENZRXN-14772],[EC-2.5.1.74],"[CPD-21340, DIHYDROXYNAPHTHOATE, PROTON]","[CPD-12118, PPI, CARBON-DIOXIDE]",-16.385834
RXN-9516,"[ENZRXN-15226, ENZRXN3O-10296, ENZRXN1G-511, E...","[EC-2.3.1.179, EC-2.3.1.41]","[Butanoyl-ACPs, MALONYL-ACP, PROTON]","[3-oxo-hexanoyl-ACPs, CARBON-DIOXIDE, ACP]",-6.862854


In [9]:
df_rxn['EC-NUMBER']['RXN-9204'][0]

'EC-2.5.1.74'

In [19]:
df_rxn.iloc[0].index

Index(['ERXN-NUMBER', 'EC-NUMBER', 'SUBSTRATES', 'PRODUCTS', 'GIBBS'], dtype='object')

In [19]:
EC = []
rxn = []

for index, row in df_rxn.iterrows():
    
    if len(row['EC-NUMBER']) > 1:
        for i in range(len(row['EC-NUMBER'])):
            EC.append(row['EC-NUMBER'][i])
            rxn.append(index)
    elif len(row['EC-NUMBER']) == 1:
        EC.append(row['EC-NUMBER'][0])
        rxn.append(index)
    else:
        EC.append('No_Data')
        rxn.append(index)

In [20]:
df_master = pd.DataFrame({'EC-NUMBER' : EC,
                          'UNIQUE-ID' : rxn})

In [21]:
rxn_num = []
subs = []
pdts = []
gibbs = []

for index, row in df_master.iterrows():
    ID = row['UNIQUE-ID']
    rxn_num.append(df_rxn['ERXN-NUMBER'][ID])
    subs.append(df_rxn['SUBSTRATES'][ID])
    pdts.append(df_rxn['PRODUCTS'][ID])
    gibbs.append(df_rxn['GIBBS'][ID])

In [22]:
df_master['ERXN-NUMBER'] = rxn_num
df_master['SUBSTRATES'] = subs
df_master['PRODUCTS'] = pdts
df_master['GIBBS'] = gibbs
df_master.head()

Unnamed: 0,EC-NUMBER,UNIQUE-ID,ERXN-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
0,No_Data,RXN-12314,[ENZRXN-19061],"[CPD-7557, Red-NADPH-Hemoprotein-Reductases, O...","[CPD-13248, Ox-NADPH-Hemoprotein-Reductases, W...",-108.40886
1,No_Data,RXN-16877,[ENZRXN-24025],[Cys-Cys-HlmE],[4N-2mthio-5oxo-3Spyrrolidine-2-COOH-HlmE],-29.53003
2,EC-4.2.1.24,PORPHOBILSYNTH-RXN,"[PORPHOBILSYNTH-ENZRXN, ENZRXN66-1464, ENZRXN-...",[5-AMINO-LEVULINATE],"[PROTON, WATER, PORPHOBILINOGEN]",-33.708138
3,EC-2.5.1.74,RXN-9204,[ENZRXN-14772],"[CPD-21340, DIHYDROXYNAPHTHOATE, PROTON]","[CPD-12118, PPI, CARBON-DIOXIDE]",-16.385834
4,EC-2.3.1.179,RXN-9516,"[ENZRXN-15226, ENZRXN3O-10296, ENZRXN1G-511, E...","[Butanoyl-ACPs, MALONYL-ACP, PROTON]","[3-oxo-hexanoyl-ACPs, CARBON-DIOXIDE, ACP]",-6.862854


In [105]:
df_master.tail()

Unnamed: 0,EC-NUMBER,UNIQUE-ID,ERXN-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
17109,EC-4.2.1.59,RXN-9537,"[ENZRXN0-7977, ENZRXN-21704, ENZRXN1G-803, ENZ...",[R-3-hydroxymyristoyl-ACPs],"[Tetradec-2-enoyl-ACPs, WATER]",22.705246
17110,No_Data,ARACHIDONATE-5-LIPOXYGENASE-RXN,[ENZRXN66-1526],"[OXYGEN-MOLECULE, ARACHIDONIC_ACID]",[6E8Z11Z14Z-5S-5-HYDROPEROXYCOSA-6],-69.979996
17111,EC-1.14.13,RXN-16742,[ENZRXN-23928],"[CPD-18060, NADPH, PROTON, OXYGEN-MOLECULE]","[CPD-18056, NADP, WATER]",-82.77051
17112,EC-3.4.23.50,RXN-11145,[],"[WATER, SQNYPIVQ-Cleavage-Sites]","[Mature-P17-Matrix, P24-Capsid-Proteins]",
17113,EC-2.1.1.M46,RXN-19533,[ENZRXN-26293],"[S-ADENOSYLMETHIONINE, CPD-21063]","[CPD-21067, ADENOSYL-HOMO-CYS, PROTON]",12.328987


In [53]:
df_duplicate = df_master.groupby('EC-NUMBER').size().reset_index(name='count')
df_duplicate.head()

Unnamed: 0,EC-NUMBER,count
0,EC-1,11
1,EC-1.1,26
2,EC-1.1.1,172
3,EC-1.1.1.1,15
4,EC-1.1.1.10,1


In [122]:
df_duplicate.tail(200)

Unnamed: 0,EC-NUMBER,count
6653,EC-7.6.2.3,2
6654,EC-7.6.2.4,1
6655,EC-7.6.2.5,2
6656,EC-7.6.2.6,1
6657,EC-7.6.2.7,1
6658,EC-7.6.2.8,3
6659,EC-7.6.2.9,2
6660,No_Data,3383
6661,|EC-1.1.1.co|,2
6662,|EC-1.1.1.eb|,4


In [23]:
df_sorted = df_master.sort_values(by=['EC-NUMBER'])
df_sorted.head()

Unnamed: 0,EC-NUMBER,UNIQUE-ID,ERXN-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
2905,EC-1,RXN0-6277,"[ENZRXN-153, ENZRXN-483]","[CPD-722, Red-Thioredoxin]","[BIOTIN, Ox-Thioredoxin, WATER]",4.545235
7737,EC-1,RXN-8705,[],"[CPD-8922, OXYGEN-MOLECULE]",[CPD-8928],-50.319992
11493,EC-1,RXN-13817,[],"[CPD-8922, OXYGEN-MOLECULE, Donor-H2]","[CPD-14837, WATER, Acceptor]",
16594,EC-1,R303-RXN,[ENZRXN-503],"[Nitroaromatic-Ox-Compounds, NADPH]","[Nitroaromatic-Red-Compounds, NADP]",
16307,EC-1,RXN-9620,[],"[CPD-10257, OXYGEN-MOLECULE, Donor-H2]","[CPD-10258, WATER, Acceptor]",


In [24]:
df_sorted.reset_index(inplace=True, drop=True)
df_sorted

Unnamed: 0,EC-NUMBER,UNIQUE-ID,ERXN-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
0,EC-1,RXN0-6277,"[ENZRXN-153, ENZRXN-483]","[CPD-722, Red-Thioredoxin]","[BIOTIN, Ox-Thioredoxin, WATER]",4.545235
1,EC-1,RXN-8705,[],"[CPD-8922, OXYGEN-MOLECULE]",[CPD-8928],-50.319992
2,EC-1,RXN-13817,[],"[CPD-8922, OXYGEN-MOLECULE, Donor-H2]","[CPD-14837, WATER, Acceptor]",
3,EC-1,R303-RXN,[ENZRXN-503],"[Nitroaromatic-Ox-Compounds, NADPH]","[Nitroaromatic-Red-Compounds, NADP]",
4,EC-1,RXN-9620,[],"[CPD-10257, OXYGEN-MOLECULE, Donor-H2]","[CPD-10258, WATER, Acceptor]",
5,EC-1,RXN-9169,[],"[OXOPENTENOATE, Acceptor]","[CPD-9734, Donor-H2]",
6,EC-1,RXN-13553,"[ENZRXN-20651, ENZRXN-20646]","[CPD-14537, OXYGEN-MOLECULE, PROTON, Donor-H2]","[CPD-14536, WATER, Acceptor]",
7,EC-1,RXN-8704,[],"[CPD-8927, OXYGEN-MOLECULE]","[CPD-8928, WATER]",-121.353966
8,EC-1,RXN-12254,[],"[SALICYLALDEHYDE, Donor-H2]","[CPD-173, Acceptor]",
9,EC-1,RXN-8738,[],"[CPD-8961, OXYGEN-MOLECULE, Donor-H2]","[CPD-8962, WATER, Acceptor]",


In [116]:
df_sorted['EC-NUMBER'][0]

'EC-1'

In [25]:
EC_a = 'EC-1'

EC = []
ID = []
erxn = []
subs = []
pdts = []
gibbs = []
counter = 0

ID_temp = []
erxn_temp = []
subs_temp = []
pdts_temp = []
gibbs_temp = []

for index, row in df_sorted.iterrows():
    
    if row['EC-NUMBER'] == EC_a:
        ID_temp.append(row['UNIQUE-ID'])
        erxn_temp.append(row['ERXN-NUMBER'])
        subs_temp.append(row['SUBSTRATES'])
        pdts_temp.append(row['PRODUCTS'])
        gibbs_temp.append(row['GIBBS'])
        counter += 1
        
    elif counter == 0:
        ID.append(row['UNIQUE-ID'])
        erxn.append(row['ERXN-NUMBER'])
        subs.append(row['SUBSTRATES'])
        pdts.append(row['PRODUCTS'])
        gibbs.append(row['GIBBS'])
        
        EC.append(EC_a)
        EC_a = row['EC-NUMBER']
    else:
        ID.append(ID_temp)
        erxn.append(erxn_temp)
        subs.append(subs_temp)
        pdts.append(pdts_temp)
        gibbs.append(gibbs_temp)
        
        EC.append(EC_a)
        counter = 0
        EC_a = row['EC-NUMBER']


In [26]:
df_sorted_master = pd.DataFrame({'EC-NUMBER' : EC,
                                'UNIQUE-ID' : ID,
                                'ERXN-NUMBER' : erxn,
                                'SUBSTRATES' : subs,
                                'PRODUCTS' : pdts,
                                'GIBBS' : gibbs})

In [119]:
df_sorted_master

Unnamed: 0,EC-NUMBER,UNIQUE-ID,ERXN-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
0,EC-1,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
1,EC-1.1,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
2,EC-1.1.1,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
3,EC-1.1.1.1,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
4,EC-1.1.1.10,RXN-9633,"[ENZRXN1G-246, ENZRXN-15237]","[R-3-hydroxystearoyl-ACPs, NADP]","[3-oxo-stearoyl-ACPs, NADPH, PROTON]",20.3757
5,EC-1.1.1.100,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
6,EC-1.1.1.101,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
7,EC-1.1.1.102,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
8,EC-1.1.1.103,"[RXN0-6277, RXN-8705, RXN-13817, R303-RXN, RXN...","[[ENZRXN-153, ENZRXN-483], [], [], [ENZRXN-503...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, nan, nan, nan, nan, na..."
9,EC-1.1.1.104,RETINOL-DEHYDROGENASE-RXN,"[ENZRXN-19467, ENZRXN-19433]","[CPD-13524, NAD]","[RETINAL, NADH, PROTON]",0.991837


#### Still, there is a problem
For some `EC-NUMBER` value, they are in `|_|` bracket with weird annotation with character instead of number. To solve this, we can modify some string value but if there is no repeating `EC-NUMBER` with and without bracket, we can just change the `EC-NUMBER` column without renewing the dataframe.

## So, we have a sortof Master Dataframe v1 right now

### What is the next step?
- Alter CPD-ID into PubChemID or InChI ---> go for PubChemID first
    - Done!
- Make reversible reaction
- Make negative dataset

In [31]:
def get_pubchem(ID):
    
    """This function accept UNIQUE-ID and return InChI string of a certain compound"""
    if ID in df_cpd['PubChemID']:
        pubchem = df_cpd['PubChemID'][ID]
    else:
        pubchem = ID
        
    return pubchem

In [125]:
df_rxn.head()

Unnamed: 0_level_0,ERXN-NUMBER,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS
UNIQUE-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RXN-12314,[ENZRXN-19061],[],"[CPD-7557, Red-NADPH-Hemoprotein-Reductases, O...","[CPD-13248, Ox-NADPH-Hemoprotein-Reductases, W...",-108.40886
RXN-16877,[ENZRXN-24025],[],[Cys-Cys-HlmE],[4N-2mthio-5oxo-3Spyrrolidine-2-COOH-HlmE],-29.53003
PORPHOBILSYNTH-RXN,"[PORPHOBILSYNTH-ENZRXN, ENZRXN66-1464, ENZRXN-...",[EC-4.2.1.24],[5-AMINO-LEVULINATE],"[PROTON, WATER, PORPHOBILINOGEN]",-33.708138
RXN-9204,[ENZRXN-14772],[EC-2.5.1.74],"[CPD-21340, DIHYDROXYNAPHTHOATE, PROTON]","[CPD-12118, PPI, CARBON-DIOXIDE]",-16.385834
RXN-9516,"[ENZRXN-15226, ENZRXN3O-10296, ENZRXN1G-511, E...","[EC-2.3.1.179, EC-2.3.1.41]","[Butanoyl-ACPs, MALONYL-ACP, PROTON]","[3-oxo-hexanoyl-ACPs, CARBON-DIOXIDE, ACP]",-6.862854


In [16]:
'Nitroaromatic-Ox-Compounds' in df_cpd['PubChemID']

False

In [14]:
'BIOTIN' in df_cpd['PubChemID']

True

In [30]:
if 'BIOTIN' in df_cpd['PubChemID']:
    print('Yes')

Yes


In [39]:
get_pubchem('MALONYL-ACP')

'MALONYL-ACP'

In [35]:
# Start from df_rxn and rerun the master dataframe again
subs_id = []
pdts_id = []

for index, row in df_rxn.iterrows():
    
    subs = []
    for item in row['SUBSTRATES']:
        subs.append(get_pubchem(item))
    subs_id.append(subs)
    
    pdts = []
    for item in row['PRODUCTS']:
        pdts.append(get_pubchem(item))
    pdts_id.append(pdts)

In [43]:
df_rxn['SUBSTRATES_PubChemID'] = subs_id
df_rxn['PRODUCTS_PubChemID'] = pdts_id
df_rxn.head()

Unnamed: 0_level_0,ERXN-NUMBER,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS,SUBSTRATES_PubChemID,PRODUCTS_PubChemID
UNIQUE-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
RXN-12314,[ENZRXN-19061],[],"[CPD-7557, Red-NADPH-Hemoprotein-Reductases, O...","[CPD-13248, Ox-NADPH-Hemoprotein-Reductases, W...",-108.40886,"[15983961, Red-NADPH-Hemoprotein-Reductases, 977]","[52940136, Ox-NADPH-Hemoprotein-Reductases, 96..."
RXN-16877,[ENZRXN-24025],[],[Cys-Cys-HlmE],[4N-2mthio-5oxo-3Spyrrolidine-2-COOH-HlmE],-29.53003,[Cys-Cys-HlmE],[4N-2mthio-5oxo-3Spyrrolidine-2-COOH-HlmE]
PORPHOBILSYNTH-RXN,"[PORPHOBILSYNTH-ENZRXN, ENZRXN66-1464, ENZRXN-...",[EC-4.2.1.24],[5-AMINO-LEVULINATE],"[PROTON, WATER, PORPHOBILINOGEN]",-33.708138,[7048523],"[1038, 962, 6921588]"
RXN-9204,[ENZRXN-14772],[EC-2.5.1.74],"[CPD-21340, DIHYDROXYNAPHTHOATE, PROTON]","[CPD-12118, PPI, CARBON-DIOXIDE]",-16.385834,"[25244603, 54706667, 1038]","[45479709, 1023, 280]"
RXN-9516,"[ENZRXN-15226, ENZRXN3O-10296, ENZRXN1G-511, E...","[EC-2.3.1.179, EC-2.3.1.41]","[Butanoyl-ACPs, MALONYL-ACP, PROTON]","[3-oxo-hexanoyl-ACPs, CARBON-DIOXIDE, ACP]",-6.862854,"[Butanoyl-ACPs, MALONYL-ACP, 1038]","[3-oxo-hexanoyl-ACPs, 280, ACP]"


In [45]:
counter_s_y = 0
counter_s_n = 0

counter_p_y = 0
counter_p_n = 0

for index, row in df_rxn.iterrows():
    
    for item in row['SUBSTRATES']:
        if item in df_cpd['PubChemID']:
            counter_s_y += 1
        else:
            counter_s_n += 1
    
    for item in row['PRODUCTS']:
        if item in df_cpd['PubChemID']:
            counter_p_y += 1
        else:
            counter_p_n += 1

print('Counting substrates True is', counter_s_y, 'False is', counter_s_n)
print('Counting products True is', counter_p_y, 'False is', counter_p_n)

Counting substrates True is 29253 False is 8287
Counting products True is 34480 False is 7752


In [49]:
percent_s = counter_s_y/(counter_s_y + counter_s_n)
percent_p = counter_p_y/(counter_p_y + counter_p_n)

print('Percents of available substrate and product are', percent_s*100, 'and', percent_p*100)

Percents of available substrate and product are 77.92488012786362 and 81.64425080507672


In [139]:
get_pubchem('Nitroaromatic-Ox-Compounds')

KeyError: 'Nitroaromatic-Ox-Compounds'

### Problems found
Where there is no data available for a certain UNIQUE-ID, the function goes wrong!!!
- Those are macromolecules, general group of compounds, etc.

Already solve by check availability by `in` function