# This notebook organizes data from MetaCyc into one dataframe

### What kind of data is available in the MetaCyc zipped file?
Note that we received the data from Pathway Tools platform version 22.6

**Important data that we extracted**
- compounds.dat
- enzrxns.dat
- reactions.dat

**Somewhat useful**
- atom-mapping and atom-mapping-smiles
- pathways

**May not be useful**
- classes
- gene_association
- genes
- metabolic-reactions.xml
- proteins
- protligandcplxes
- protseq
- pubs
- regulation
- rnas
- species
- transporters

In [1]:
import numpy as np
import pandas as pd
import math
import ast

### This is how our desired master dataframe looked last quarter

For now, build a dataset in which enzymes and compounds are linked to each other; then we can use RDKit-based functions and our existing functions to generate Mol files, distances, and negative data

master_df = pd.read_csv('../datasets/MASTER_DF.csv')

master_df.head()

feature_df = master_df[['PubChem', 'dist', 'enzyme_class_1', 'enzyme_class_2', 'enzyme_class_3',
       'enzyme_class_4', 'enzyme_class_5', 'enzyme_class_6', 'enzyme_class_7',
        'n_O', 'n_N', 'n_P', 'n_S', 'n_X', 'DoU']]
feature_df.set_index(keys=['PubChem'], inplace=True)
feature_df.head()

## The current database from MetaCyc starts here

In [113]:
df_cpd = pd.read_csv('df_cpd.csv', index_col = 0)
df_rxn = pd.read_csv('parsed_rxns.csv', index_col = 0)
df_enz = pd.read_csv('df_enzrxns.csv', index_col = 0)
df_cpd = df_cpd.set_index(keys ='UNIQUE-ID')
df_rxn = df_rxn.set_index(keys = 'UNIQUE-ID')
df_enz = df_enz.set_index(keys = 'UNIQUE-ID')

# Here is an important note
Data read back from csv is text-formatted, which leads to some problematic data handling.
- Some int data was read as float, e.g. PubChemID, which can be fixed with .astype() as shown in cell #4
- Lists were stored as text, which requires decoding with ast.literal_eval(x); this is wrapped in the function below
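
A minimal sketch of the round-trip problem described above, using a made-up two-row table (the PubChem ID and compound names are placeholders, not values taken from MetaCyc). It also shows the pandas nullable `Int64` dtype as one possible alternative to `fillna(0).astype(int)`; that choice is an assumption, not what this notebook uses.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy table: one integer-like column with a missing value, one list column.
original = pd.DataFrame({
    "PubChemID": [5793, np.nan],            # placeholder ID, not real data
    "SUBSTRATES": [["WATER", "ATP"], ["PROTON"]],
})

# Write to an in-memory CSV and read it back, as the notebook does with files.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
raw = pd.read_csv(buffer)

# After the round trip the IDs are floats and the lists are plain strings.
print(raw.dtypes)
print(type(raw["SUBSTRATES"].iloc[0]))

# Repair both columns.
fixed = raw.copy()
fixed["PubChemID"] = fixed["PubChemID"].astype("Int64")          # keeps missing values as <NA>
fixed["SUBSTRATES"] = fixed["SUBSTRATES"].apply(ast.literal_eval)  # text -> list
print(fixed)
```

With `Int64`, missing IDs stay missing instead of becoming 0, at the cost of a less common dtype that downstream code must tolerate.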

In [86]:
def recover_list(df, column):
    """Convert a column of list-formatted strings (as read from .csv) back into lists, in place.

    Non-string cells (e.g. NaN) are replaced by an empty list.
    """
    # .iloc[0] works regardless of whether the index holds integers or UNIQUE-ID strings
    assert not isinstance(df[column].iloc[0], list), "TypeError: The data type is already a list, it should not be converted again"
    replacement = []
    for value in df[column]:
        if isinstance(value, str):
            # Parse the text representation, e.g. "['A', 'B']" -> ['A', 'B']
            replacement.append(ast.literal_eval(value))
        else:
            replacement.append([])
    df[column] = replacement

In [114]:
# Change PubChemID into int type in df_cpd
PubChemID_int = df_cpd['PubChemID'].fillna(0).astype(int)
df_cpd['PubChemID'] = PubChemID_int

# Recover list format of df_rxn
rxn_list_fix = ['EC-NUMBER', 'ERXN-NUMBER', 'SUBSTRATES', 'PRODUCTS']
for col in rxn_list_fix:
    recover_list(df_rxn, col)

# Recover list format of df_enz
enz_list_fix = ['REACTION', 'ALTERNATIVE-SUBSTRATES', '^SUBSTRATE', 'KM', 'KCAT', 'VMAX']
for col in enz_list_fix:
    recover_list(df_enz, col)

##### The code below verifies that `.astype()` did not change any PubChemID value unless it was `nan`
df_cpd = pd.read_csv('df_cpd.csv', index_col = 0)
df_cpd = df_cpd.set_index(keys ='UNIQUE-ID')
PubChemID_int = df_cpd['PubChemID'].fillna(0).astype(int)
for i in range(PubChemID_int.shape[0]):
    
    #print (df_cpd['PubChemID'].loc['CPD-14743'])
    #a = df_cpd['PubChemID'].loc['CPD-14743']
    # Use positional access (.iloc) because the index holds UNIQUE-ID strings.
    # A pair is printed only where the values differ, i.e. where NaN became 0.
    if PubChemID_int.iloc[i] != df_cpd['PubChemID'].iloc[i]:
        print(PubChemID_int.iloc[i], df_cpd['PubChemID'].iloc[i])

In [None]:
df_cpd.head()

In [None]:
df_rxn.head()

In [None]:
df_enz.head()

`&beta;-<i>N</i>-acetylhexosaminidase` refers to __&beta;-<i>N</i>-acetylhexosaminidase__

Some odd notation here
- RXN-12314 has no enzyme affiliated with it
- However, it links to an enzymatic reaction, ENZRXN-19061, which links back to the reaction as well
- What I don't understand is the ENZYME column in the enzrxn dataset: it points somewhere else, and we cannot find an enzyme database in the current version of the MetaCyc database
    - The ENZYME name and common name match the ENZYME name in the rxn dataset. However, this is still unclear

##### Write a function that converts a `UNIQUE-ID` or `COMMON-NAME` into InChI
This is the easiest approach to link df_rxn with df_cpd

In [None]:
def get_inchi(ID):
    
    """Accept a UNIQUE-ID and return the InChI string of that compound"""
    
    inchi = df_cpd['INCHI'][ID]
    
    return inchi

def get_smiles(ID):
    
    """Accept a UNIQUE-ID and return the SMILES string of that compound"""
    
    smiles = df_cpd['SMILES'][ID]
    
    return smiles

In [112]:
def get_pubchem(ID):
    
    """Accept a UNIQUE-ID and return the PubChemID of that compound ('0' if the ID is unknown)"""
    # `in` on a Series tests the index labels (the UNIQUE-IDs), not the values
    if ID in df_cpd['PubChemID']:
        pubchem = df_cpd['PubChemID'][ID]
    else:
        pubchem = '0'
        
    return pubchem

In [None]:
get_inchi('CPD-7557')

In [None]:
get_smiles('CPD-7557')

df_enz.loc['ENZRXN-19061']

type(int(df_cpd['PubChemID'].loc['CPD-14966']))

df_cpd.loc['5-AMINO-LEVULINATE']

df_cpd.loc['PROTON']

df_rxn.head()

df_rxn.loc['3.2.1.52-RXN']

To strip the string see method below
>>> import ast
>>> x = u'[ "A","B","C" , " D"]'
>>> x = ast.literal_eval(x)
>>> x
['A', 'B', 'C', ' D']
>>> x = [n.strip() for n in x]
>>> x
['A', 'B', 'C', 'D']

text = df_rxn['EC-NUMBER'].iloc[4]
text

type(text)

df_rxn.loc['PORPHOBILSYNTH-RXN']

text = df_rxn['EC-NUMBER'].loc['PORPHOBILSYNTH-RXN']

text = df_rxn['EC-NUMBER'].loc['RXN-9516']
split = text[2:-2].split("', '")
split

type(split)

split[0]

In [None]:
# count how many reactions don't have an EC-NUMBER
counter_1 = 0
counter_m = 0
counter_n = 0

for index, row in df_rxn.iterrows():
    
    if type(row['EC-NUMBER']) != type('string'):
        #   if math.isnan(row['EC-NUMBER']):
        counter_n += 1
    else:
        data = ast.literal_eval(row['EC-NUMBER'])
        
        if len(data) == 1:
            counter_1 += 1
        elif len(data) > 1:
            counter_m += 1
        else:
            pass
print('Out of a total of', df_rxn.shape[0], 'rows in df_rxn')
print('Rows with exactly one, multiple, and no EC-NUMBER:', counter_1, counter_m, 'and', counter_n, 'respectively')

# Rearrange them into the master dataframe format
Here, we have to reshape df_rxn into the layout of the master dataframe by re-indexing with EC-NUMBER as the index

In [None]:
df_rxn.head()

df_rxn['EC-NUMBER']['RXN-9204'][0]

df_rxn.iloc[0].index

# start running here

In [None]:
EC = []
rxn = []

for index, row in df_rxn.iterrows():
    
    if len(row['EC-NUMBER']) > 1:
        for i in range(len(row['EC-NUMBER'])):
            EC.append(row['EC-NUMBER'][i])
            rxn.append(index)
    elif len(row['EC-NUMBER']) == 1:
        EC.append(row['EC-NUMBER'][0])
        rxn.append(index)
    else:
        EC.append('No_Data')
        rxn.append(index)
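
The loop above could probably also be written with `Series.explode` (available in pandas 0.25 and later). The sketch below uses a toy frame with hypothetical reaction IDs; an empty list becomes NaN after `explode`, which is then replaced with the same `'No_Data'` marker used above.

```python
import pandas as pd

# Toy stand-in for df_rxn: list-valued EC-NUMBER column, UNIQUE-ID index.
toy_rxn = pd.DataFrame(
    {"EC-NUMBER": [["EC-1.1.1.1", "EC-1.1.1.2"], ["EC-2.7.1.1"], []]},
    index=pd.Index(["RXN-A", "RXN-B", "RXN-C"], name="UNIQUE-ID"),  # hypothetical IDs
)

# One output row per (reaction, EC number) pair; reactions without an
# EC number keep a single row marked 'No_Data'.
long_form = (
    toy_rxn["EC-NUMBER"]
    .explode()
    .fillna("No_Data")
    .reset_index()
)
print(long_form)
```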

In [None]:
df_master = pd.DataFrame({'EC-NUMBER' : EC,
                          'UNIQUE-ID' : rxn})

In [None]:
df_master.head()

In [None]:
rxn_num = []
subs = []
pdts = []
gibbs = []

for index, row in df_master.iterrows():
    ID = row['UNIQUE-ID']
    rxn_num.append(df_rxn['ERXN-NUMBER'][ID])
    subs.append(df_rxn['SUBSTRATES'][ID])
    pdts.append(df_rxn['PRODUCTS'][ID])
    gibbs.append(df_rxn['GIBBS'][ID])

In [None]:
df_master['ERXN-NUMBER'] = rxn_num
df_master['SUBSTRATES'] = subs
df_master['PRODUCTS'] = pdts
df_master['GIBBS'] = gibbs
df_master.head()

In [None]:
df_master.tail()

df_duplicate = df_master.groupby('EC-NUMBER').size().reset_index(name='count')
df_duplicate.head()

df_duplicate.tail()

In [None]:
df_sorted = df_master.sort_values(by=['EC-NUMBER'])
df_sorted.head()

In [None]:
df_sorted.reset_index(inplace=True, drop=True)

In [None]:
df_sorted['GIBBS'][2]

df_sorted.loc[2, 'GIBBS'] = 'No-Data'

In [None]:
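for index, row in df_sorted.iterrows():
    
    # pd.isna() also tolerates cells that already hold a string such as 'No-Data'
    if pd.isna(row['GIBBS']):
        # .loc avoids chained assignment, which may silently write to a copy
        df_sorted.loc[index, 'GIBBS'] = 'No-Data'
        
df_sorted.head()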

In [None]:
df_sorted['EC-NUMBER'][0]

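This cell is probably wrong and may duplicate or mislabel entries such as EC-1: when the EC-NUMBER changes, the row that triggers the flush is not stored in the new group, a single row can be saved under the previous EC-NUMBER label, and the last group is never written out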

In [None]:
EC_a = 'EC-1'

EC = []
ID = []
erxn = []
subs = []
pdts = []
gibbs = []
counter = 0

ID_temp = []
erxn_temp = []
subs_temp = []
pdts_temp = []
gibbs_temp = []

for index, row in df_sorted.iterrows():
    
    if row['EC-NUMBER'] == EC_a:
        ID_temp.append(row['UNIQUE-ID'])
        erxn_temp.append(row['ERXN-NUMBER'])
        subs_temp.append(row['SUBSTRATES'])
        pdts_temp.append(row['PRODUCTS'])
        gibbs_temp.append(row['GIBBS'])
        counter += 1
        
    elif counter == 0:
        ID.append(row['UNIQUE-ID'])
        erxn.append(row['ERXN-NUMBER'])
        subs.append(row['SUBSTRATES'])
        pdts.append(row['PRODUCTS'])
        gibbs.append(row['GIBBS'])
        
        EC.append(EC_a)
        EC_a = row['EC-NUMBER']
    else:
        ID.append(ID_temp)
        erxn.append(erxn_temp)
        subs.append(subs_temp)
        pdts.append(pdts_temp)
        gibbs.append(gibbs_temp)
        
        ID_temp = []
        erxn_temp = []
        subs_temp = []
        pdts_temp = []
        gibbs_temp = []

        EC.append(EC_a)
        counter = 0
        EC_a = row['EC-NUMBER']
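
As a cross-check for the loop above, the grouping can also be sketched with `groupby(...).agg(list)`, which collects every row of each EC-NUMBER into lists without tracking counters by hand. The toy frame and its reaction IDs are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_sorted: one row per (EC number, reaction) pair.
toy_sorted = pd.DataFrame({
    "EC-NUMBER": ["EC-1", "EC-1", "EC-1.1", "EC-1.1.1", "EC-1.1.1"],
    "UNIQUE-ID": ["RXN-A", "RXN-B", "RXN-C", "RXN-D", "RXN-E"],  # hypothetical IDs
    "GIBBS": [4.5, "No-Data", "No-Data", -1.6, -0.2],
})

# Every remaining column becomes a list of the values in that group.
grouped = toy_sorted.groupby("EC-NUMBER", sort=True).agg(list)
print(grouped)

# Sanity check: no row is lost or duplicated by the grouping.
n_rows_after = sum(len(ids) for ids in grouped["UNIQUE-ID"])
print(n_rows_after == len(toy_sorted))
```

Comparing the total number of grouped entries with the number of input rows is a quick way to detect the dropped or duplicated rows suspected in the loop.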


In [None]:
df_sorted_master = pd.DataFrame({'EC-NUMBER' : EC,
                                'UNIQUE-ID' : ID,
                                'ERXN-NUMBER' : erxn,
                                'SUBSTRATES' : subs,
                                'PRODUCTS' : pdts,
                                'GIBBS' : gibbs})

In [None]:
df_sorted_master.head(10)

In [None]:
df_sorted_master.tail(10)

In [None]:
df_sorted_master.set_index(keys=['EC-NUMBER'], inplace=True)

In [None]:
df_sorted_master.index[4].count('.')

In [None]:
df_sorted_master

In [None]:
df_sorted_master.index

In [None]:
drop = []
for index, row in df_sorted_master.iterrows():
    if index.count('.') < 'EC-1.1.1.1'.count('.'):
        #print(index)
        drop.append(index)

In [None]:
drop[:10]

In [None]:
df_sorted_master_drop = df_sorted_master
for item in drop:
    df_sorted_master_drop = df_sorted_master_drop.drop(item)

In [None]:
for index, row in df_sorted_master_drop.iterrows():
    if index.count('.') < 'EC-1.1.1.1'.count('.'):
        print(index)

In [None]:
df_sorted_master_drop.drop(['UNIQUE-ID', 'ERXN-NUMBER'], axis=1)

This approach can be done by changing the substrates and products in df_rxn into PubChemID first ('0' for NaN)

In [None]:
# the file is too big!!! up to 1.7 G --- should reduce the size first
df_sorted_master.to_csv('df_master_1st.csv')

# changing to PubChemID still give 1.3 G left

# Here is the first version of the master dataframe
However, it is not perfect
- CPD-IDs are not converted to PubChemID or InChI yet
- cofactors are not cleaned
- reversibility is not applied
- Anyway, it should have a similar shape to the previous master dataframe, but it contains single reactions

In [None]:
df_master_1st = pd.read_csv('df_master_1st.csv')

In [None]:
df_master = df_master_1st.drop(['UNIQUE-ID', 'ERXN-NUMBER'], axis = 1)
df_master['DIRECTION'] = 1
df_master.head()

In [None]:
recover_list(df_master, 'SUBSTRATES')
recover_list(df_master, 'PRODUCTS')

In [None]:
df_master.head()

In [None]:
df_master_rev = pd.DataFrame({'EC-NUMBER': df_master['EC-NUMBER'],
                             'SUBSTRATES': df_master['PRODUCTS'],
                             'PRODUCTS': df_master['SUBSTRATES'],
                             'GIBBS': df_master['GIBBS'],
                             'DIRECTION': df_master['DIRECTION']*(-1)})
df_master_rev.head()

In [None]:
df_merged =pd.concat([df_master, df_master_rev], ignore_index=True)
df_merged.head()

In [None]:
df_merged.sort_values('EC-NUMBER').reset_index(drop = True)

df_merged.to_csv('df_merged.csv')

In [83]:
master = pd.read_csv('../notebooks/df_merged_1st.csv')
master.shape
master.head()

Unnamed: 0.1,Unnamed: 0,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS,DIRECTION
0,0,EC-1,"[['CPD-722', 'Red-Thioredoxin'], ['CPD-8922', ...","[['BIOTIN', 'Ox-Thioredoxin', 'WATER'], ['CPD-...","[4.5452347, -50.319992, 'No-Data', 'No-Data', ...",1
1,1,EC-1,"[['BIOTIN', 'Ox-Thioredoxin', 'WATER'], ['CPD-...","[['CPD-722', 'Red-Thioredoxin'], ['CPD-8922', ...","[4.5452347, -50.319992, 'No-Data', 'No-Data', ...",-1
2,2,EC-1.1,"[['CPD-19423', 'Acceptor'], ['C86-cis-keto-myc...","[['CPD-19443', 'Donor-H2'], ['3-oxo-C86-cis-ke...","['No-Data', 'No-Data', 'No-Data', 'No-Data', '...",1
3,3,EC-1.1,"[['CPD-19443', 'Donor-H2'], ['3-oxo-C86-cis-ke...","[['CPD-19423', 'Acceptor'], ['C86-cis-keto-myc...","['No-Data', 'No-Data', 'No-Data', 'No-Data', '...",-1
4,4,EC-1.1.1,"[['D-galactopyranose', 'NADP'], ['CPD-14807', ...","[['CPD-1242', 'NADPH', 'PROTON'], ['CPD-14806'...","[-1.6264648000000002, -0.1665039, 'No-Data', 2...",1


In [91]:
recover_list(master,'SUBSTRATES')

# ELLIE PAY ATTENTION TO ME

## you will probably need to run the recover_list function and then the rmv cofactors function.

## you are working in this region of the notebook.

## you need to find a way to make the get_pubchem function work with the master dataframe. this will hopefully get rid of things (like the "ACCEPTOR") if they do not have a pubchem id. You will also need to run a couple of cells relating to the df_cpd at the tippity top of this notebook

In [92]:
master.head()

Unnamed: 0.1,Unnamed: 0,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS,DIRECTION
0,0,EC-1,"[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[4.5452347, -50.319992, 'No-Data', 'No-Data', ...",1
1,1,EC-1,"[[BIOTIN, Ox-Thioredoxin, WATER], [CPD-8928], ...","[[CPD-722, Red-Thioredoxin], [CPD-8922, OXYGEN...","[4.5452347, -50.319992, 'No-Data', 'No-Data', ...",-1
2,2,EC-1.1,"[[CPD-19423, Acceptor], [C86-cis-keto-mycolate...","[[CPD-19443, Donor-H2], [3-oxo-C86-cis-keto-my...","['No-Data', 'No-Data', 'No-Data', 'No-Data', '...",1
3,3,EC-1.1,"[[CPD-19443, Donor-H2], [3-oxo-C86-cis-keto-my...","[[CPD-19423, Acceptor], [C86-cis-keto-mycolate...","['No-Data', 'No-Data', 'No-Data', 'No-Data', '...",-1
4,4,EC-1.1.1,"[[D-galactopyranose, NADP], [CPD-14807, NADP],...","[[CPD-1242, NADPH, PROTON], [CPD-14806, NADPH,...","[-1.6264648000000002, -0.1665039, 'No-Data', 2...",1


In [90]:
master['PRODUCTS']

['BIOTIN', 'Ox-Thioredoxin', 'WATER']

In [100]:
subs_list = []
pdts_list = []
for index, row in master.iterrows():
    
    Subs = []
    Pdts = []
    
    for item_set in row['SUBSTRATES']:
        if type(item_set) == list:
            for item in item_set:
                Subs.append(item)
        else:
            Subs.append(item_set)
    for item_set in row['PRODUCTS']:
        if type(item_set) == list:
            for item in item_set:
                Pdts.append(item)
        else:
            Pdts.append(item_set)
    subs_list.append(Subs)
    pdts_list.append(Pdts)
    
master['SUBS'] = subs_list
master['PDTS'] = pdts_list

In [81]:
master.head()

Unnamed: 0.1,Unnamed: 0,EC-NUMBER,SUBSTRATES,PRODUCTS,GIBBS,DIRECTION,SUBS,PDTS
0,0,EC-1,"[['CPD-722', 'Red-Thioredoxin'], ['CPD-8922', ...","[['BIOTIN', 'Ox-Thioredoxin', 'WATER'], ['CPD-...","[4.5452347, -50.319992, 'No-Data', 'No-Data', ...",1,"[[, [, ', C, P, D, -, 7, 2, 2, ', ,, , ', R, ...","[[, [, ', B, I, O, T, I, N, ', ,, , ', O, x, ..."
1,1,EC-1,"[['BIOTIN', 'Ox-Thioredoxin', 'WATER'], ['CPD-...","[['CPD-722', 'Red-Thioredoxin'], ['CPD-8922', ...","[4.5452347, -50.319992, 'No-Data', 'No-Data', ...",-1,"[[, [, ', B, I, O, T, I, N, ', ,, , ', O, x, ...","[[, [, ', C, P, D, -, 7, 2, 2, ', ,, , ', R, ..."
2,2,EC-1.1,"[['CPD-19423', 'Acceptor'], ['C86-cis-keto-myc...","[['CPD-19443', 'Donor-H2'], ['3-oxo-C86-cis-ke...","['No-Data', 'No-Data', 'No-Data', 'No-Data', '...",1,"[[, [, ', C, P, D, -, 1, 9, 4, 2, 3, ', ,, , ...","[[, [, ', C, P, D, -, 1, 9, 4, 4, 3, ', ,, , ..."
3,3,EC-1.1,"[['CPD-19443', 'Donor-H2'], ['3-oxo-C86-cis-ke...","[['CPD-19423', 'Acceptor'], ['C86-cis-keto-myc...","['No-Data', 'No-Data', 'No-Data', 'No-Data', '...",-1,"[[, [, ', C, P, D, -, 1, 9, 4, 4, 3, ', ,, , ...","[[, [, ', C, P, D, -, 1, 9, 4, 2, 3, ', ,, , ..."
4,4,EC-1.1.1,"[['D-galactopyranose', 'NADP'], ['CPD-14807', ...","[['CPD-1242', 'NADPH', 'PROTON'], ['CPD-14806'...","[-1.6264648000000002, -0.1665039, 'No-Data', 2...",1,"[[, [, ', D, -, g, a, l, a, c, t, o, p, y, r, ...","[[, [, ', C, P, D, -, 1, 2, 4, 2, ', ,, , ', ..."


In [101]:
small = master.iloc[:5,:]

In [107]:
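def rm_cofactor_only_cpd1(df,prod_columnname,sub_columnname,cofactor_list):
    """Remove cofactors from list-valued product and substrate columns.

    Rows left with no compounds get 'NA'. Returns a new dataframe in which
    'PRODUCTS' and 'SUBSTRATES' are replaced by the cleaned lists.
    """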
    cleaned_product_column = []
    cleaned_substrate_column = []
    for index,row in df.iterrows():
        prod_compound_list =[]
        sub_compound_list = []
        for compound in row[prod_columnname]:
            if compound not in cofactor_list:
                prod_compound_list.append(compound)
            else:
                pass
        
        for comp in row[sub_columnname]:
            if comp not in cofactor_list:
                sub_compound_list.append(comp)
            else:
                pass
        if len(prod_compound_list)==0:
            cleaned_product_column.append('NA')
        else: 
            cleaned_product_column.append(prod_compound_list)
        if len(sub_compound_list)==0:
            cleaned_substrate_column.append('NA')
        else: 
            cleaned_substrate_column.append(sub_compound_list)
    newdf = df.drop(['PRODUCTS', 'SUBSTRATES'], axis=1)
    newdf['SUBSTRATES'] = cleaned_substrate_column
    newdf['PRODUCTS'] = cleaned_product_column

    return newdf

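def rm_cofactor_only_cpd(df,prod_columnname,sub_columnname,cofactor_list):
    """Same as rm_cofactor_only_cpd1, but for columns that still hold
    list-formatted strings; each cell is parsed with ast.literal_eval first.
    """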
    cleaned_product_column = []
    cleaned_substrate_column = []
    for index,row in df.iterrows():
        prod_compound_list =[]
        sub_compound_list = []
        compounds = ast.literal_eval(row[prod_columnname])
        for compound in compounds:
            if compound not in cofactor_list:
                prod_compound_list.append(compound)
            else:
                pass
        
        comps = ast.literal_eval(row[sub_columnname])
        for comp in comps:
            if comp not in cofactor_list:
                sub_compound_list.append(comp)
            else:
                pass
        if len(prod_compound_list)==0:
            cleaned_product_column.append('NA')
        else: 
            cleaned_product_column.append(prod_compound_list)
        if len(sub_compound_list)==0:
            cleaned_substrate_column.append('NA')
        else: 
            cleaned_substrate_column.append(sub_compound_list)
    newdf = df.drop(['PRODUCTS', 'SUBSTRATES'], axis=1)
    newdf['SUBSTRATES'] = cleaned_substrate_column
    newdf['PRODUCTS'] = cleaned_product_column

    return newdf

In [5]:
cofactor_list = ['WATER',
 'ATP',
 'NAD',
 'NADH',
 'NADPH',
 'NADP',
 'OXYGEN-MOLECULE',
 'ADP',
 'PROTON',
 'CARBON-MONOXIDE',
 'CARBON-DIOXIDE',
 'CO2',
 'Carbon dioxide',
 'FMN',
 'FAD',
 'FMNH',
 'FADH',
 'FMNH2',
 'FADH2',
 'PYRIDOXAL_PHOSPHATE',
 'RIBOFLAVIN',
 'AMP']

In [108]:
cleaned_df_productinList = rm_cofactor_only_cpd1(small,'PDTS','SUBS',cofactor_list)

In [111]:
cleaned_df_productinList.iloc[1,7]

['CPD-722',
 'Red-Thioredoxin',
 'CPD-8922',
 'CPD-8922',
 'Donor-H2',
 'Nitroaromatic-Ox-Compounds',
 'CPD-10257',
 'Donor-H2',
 'OXOPENTENOATE',
 'Acceptor',
 'CPD-14537',
 'Donor-H2',
 'CPD-8927',
 'SALICYLALDEHYDE',
 'Donor-H2',
 'CPD-8961',
 'Donor-H2',
 'COUMARATE']

In [119]:
def get_pubchem(ID):
    
    """Accept a UNIQUE-ID and return the PubChemID of that compound ('0' if the ID is unknown)"""
    if ID in df_cpd['PubChemID']:
        pubchem = df_cpd['PubChemID'][ID]
    else:
        pubchem = '0'
        
    return pubchem

In [117]:
def get_pubchem(df,colname):
    
    """Look up the PubChemID for the UNIQUE-ID in each row of df[colname] and
    store the result in a new column 'new' + colname ('0' where no ID is found)"""
    newlist = []
    for index, row in df.iterrows():
        ID = row[colname]
        if ID in df_cpd['PubChemID']:
            newlist.append(df_cpd['PubChemID'][ID])
        else:
            newlist.append('0')
    df['new'+colname] = newlist  
    return df

In [118]:
get_pubchem(master,'PRODUCTS')

TypeError: unhashable type: 'list'
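
The `TypeError` arises because each cell in the column holds a list, and a list cannot be used as a key for the `in` check against the index. A possible workaround, sketched below on toy data (the PubChem numbers are made up and `list_to_pubchem` is a hypothetical helper), is to look up each ID inside the list individually. It expects flat lists such as those in the `SUBS` and `PDTS` columns.

```python
import pandas as pd

# Toy stand-in for df_cpd['PubChemID']: UNIQUE-ID index, made-up ID values.
pubchem_lookup = pd.Series({"BIOTIN": 111, "WATER": 222}, name="PubChemID")


def list_to_pubchem(ids, lookup, missing=0):
    """Map each compound ID in a flat list to its PubChem ID, or `missing` if unknown."""
    return [int(lookup[cpd]) if cpd in lookup.index else missing for cpd in ids]


toy_master = pd.DataFrame({
    "PDTS": [["BIOTIN", "Ox-Thioredoxin", "WATER"], ["Acceptor"]],
})

# Apply the helper cell by cell; each cell stays a list.
toy_master["PDTS_pubchem"] = toy_master["PDTS"].apply(
    lambda ids: list_to_pubchem(ids, pubchem_lookup)
)
print(toy_master)
```

Entries such as generic acceptors that have no PubChem ID end up as the `missing` value and can then be filtered out in a later step.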

In [109]:
cleaned_df_productinList.iloc[1,5]

['CPD-722',
 'Red-Thioredoxin',
 'CPD-8922',
 'OXYGEN-MOLECULE',
 'CPD-8922',
 'OXYGEN-MOLECULE',
 'Donor-H2',
 'Nitroaromatic-Ox-Compounds',
 'NADPH',
 'CPD-10257',
 'OXYGEN-MOLECULE',
 'Donor-H2',
 'OXOPENTENOATE',
 'Acceptor',
 'CPD-14537',
 'OXYGEN-MOLECULE',
 'PROTON',
 'Donor-H2',
 'CPD-8927',
 'OXYGEN-MOLECULE',
 'SALICYLALDEHYDE',
 'Donor-H2',
 'CPD-8961',
 'OXYGEN-MOLECULE',
 'Donor-H2',
 'COUMARATE']

In [None]:
# from metamoles.py
from Bio.KEGG import REST  # Biopython; provides kegg_get(), used by parse_reversible_reactions()
def parse_reversible_reactions(reaction_list: list):
    """
    parse_reversible_reactions() queries the KEGG database with the input
        reaction list, and parses the results for all reactions that have been
        annotated with "<=>" in the reaction equation, which suggests that the
        catalyzed reaction is reversible

    Args:
        reaction_list (list): contains KEGG reaction IDs (e.g. 'R00709')

    Returns:
        list: contains KEGG IDs of reversible reactions
    """

    reversible_reaction = []
    for reaction in reaction_list:
        reaction_file = REST.kegg_get(reaction).read()
        for i in reaction_file.rstrip().split("\n"):
            if i.startswith("EQUATION") and "<=>" in i:
                reversible_reaction.append(reaction)
    return reversible_reaction

def combine_substrates_products(df: pd.DataFrame):
    """
    combine_substrates_products() is for use with a collection of enzymes
        in which it is understood that they are capable of catalyzing both the
        forward and reverse reactions. In this case, both the substrates and
        the products should be considered as bioreachable products.
        This function parses the list of substrates and products from their
        respective fields in the input dataframe, and returns a new dataframe
        with the combined substrates and products in a column labeled 'product'

    WARNING: combine_substrates_products() should not be run multiple times on
        the same dataframe because it will append duplicate substrates

    Args:
        df (pandas.DataFrame): must contain fields
            ['entry', 'substrate', 'product']

    Returns:
        pandas.DataFrame: contains only fields ['entry', 'product']
    """

    rowindex = np.arange(0,len(df))
    df_with_ordered_index = df.set_index(rowindex)

    newdf = df_with_ordered_index
    # should this be a .copy()?

    for index,row in df_with_ordered_index.iterrows():
        productlist = row['product']
        substratelist = row['substrate']
        newdf.iloc[index,2] = productlist + substratelist

    return newdf[["entry","product"]]


def explode_dataframe(dataframe: pd.DataFrame, explosion_function,
                        explosion_target_field: str, fields_to_include: list):
    """
    explode_dataframe() applies the input explosion_function to the target
        field in each row of a dataframe. Each item in the output of the
        explosion_function is an anchor for a new row in the new dataframe. All
        of the supplied fields_to_include are added to the explosion item,
        and appended to the new dataframe row.

    Args:
        dataframe (pandas.DataFrame): input dataset
        explosion_function (function): function to be applied to target
            column in dataframe
        explosion_target_field (str): name of field in dataframe to which the
            explosion function will be applied
        fields_to_include (list): a list of strings that denote the columns of
            the input dataframe to be included in the output

    Returns:
        pandas.DataFrame: new exploded dataframe
    """
    new_rows = []
    for _, row in dataframe.iterrows():
        explosion_list = explosion_function(row[explosion_target_field])
        for item in explosion_list:
            row_data = [row[field] for field in fields_to_include]
            row_data.append(item)
            new_rows.append(row_data)

    # Build the column list without mutating the caller's fields_to_include
    columns = list(fields_to_include) + [explosion_target_field]
    new_df = pd.DataFrame(new_rows, columns=columns)

    return new_df


def remove_cofactors(master_df: pd.DataFrame, master_cpd_field: str,
                     cofactor_df: pd.DataFrame, cofactor_field: str,
                     drop_na=True):
    """
    remove_cofactors() should be used to clean the dataset of cofactors. These
        will be included in the KEGG records as substrates and products, but
        are not actually products in the reaction

    Args:
        master_df (pandas.DataFrame): input dataset
        master_cpd_field (str): field that contains products
        cofactor_df (pandas.DataFrame): contains cofactors to be removed
        cofactor_field (str): field that contains cofactors
        drop_na (bool): default True

    Returns:
        pandas.DataFrame: cleaned data without cofactor entries
    """
    cofactor_list = parse_compound_ids(cofactor_df[cofactor_field])
    bool_mask = [False if cpd in cofactor_list else True for cpd in master_df[master_cpd_field]]
    clean_df = master_df[bool_mask]
    clean_df = clean_df.drop_duplicates()

    if drop_na:
        clean_df = clean_df[clean_df[master_cpd_field] != 'NA']
    else:
        pass

    return clean_df

def create_negative_matches(dataframe: pd.DataFrame,
                            enzyme_field: str, compound_field: str):
    """
    create_negative_matches() returns two dataframes.
        One dataframe is positive data that contains all the enzyme-compound
        pairs that exist in the input dataset.
        The second data frame is negative data made from matching all
        enzyme-compound pairs that do not exist in the dataset.

    Args:
        dataframe (pandas.DataFrame): input dataset
        enzyme_field (str): column in dataframe that contains enzyme ids
        compound_field (str): column in dataframe that contains compound ids

    Returns:
        pandas.DataFrame: positive data
            (contains fields ['enzyme', 'product', 'reacts'])
        pandas.DataFrame: negative data
            (contains fields ['enzyme', 'product', 'reacts'])
    """
    unique_enzymes = set(dataframe[enzyme_field].unique())
    # set of all unique enzymes in provided dataframes
    unique_cpds = set(dataframe[compound_field].unique())
    # set of all unique compounds in provided dataframe

    positive_data = []
    negative_data = []
    # initialize empty lists

    for enzyme in unique_enzymes:
    # iterate through unique enzyme set
        working_prods = set(dataframe[dataframe[enzyme_field] == enzyme][compound_field].unique())
        # unique set of all products reported to reaction with this enzyme in provided dataset
        non_working_prods = (unique_cpds - working_prods)
        # set math of all remaining products in the dataset minus those reported to react

        reactions = [{'reacts':1.0, 'enzyme':enzyme, 'product':product} for product in working_prods]
        # create new entry for each positive reaction
        non_reactions = [{'reacts':0.0, 'enzyme':enzyme, 'product':product} for product in non_working_prods]
        # create new entry for each negative reaction

        positive_data.extend(reactions)
        # add positive reactions to master list
        negative_data.extend(non_reactions)
        # add negative reactions to master list

    positive_df = pd.DataFrame(positive_data)
    negative_df = pd.DataFrame(negative_data)

    return positive_df, negative_df



#### Still, there is a problem
Some `EC-NUMBER` values are wrapped in `|_|` brackets and carry an odd annotation with a letter in place of a number. To solve this, we can modify those string values; if no `EC-NUMBER` appears both with and without the brackets, we can simply edit the `EC-NUMBER` column without rebuilding the dataframe.
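
A minimal sketch of that clean-up, assuming the brackets are literal `|` characters at both ends of the string. The EC values below are hypothetical examples, not entries taken from the dataset.

```python
import pandas as pd

# Hypothetical EC-NUMBER values, one of them wrapped in |...| brackets.
ec_numbers = pd.Series(["EC-1.1.1.1", "|EC-1.1.1.n1|", "EC-2.7.1.1"])

# Remove leading and trailing pipe characters only.
cleaned = ec_numbers.str.strip("|")

# If stripping creates duplicates, the same EC number existed with and
# without brackets, and the rows would need to be merged instead.
has_collision = bool(cleaned.duplicated().any())
print(cleaned.tolist(), has_collision)
```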

## So, we have a sort of Master Dataframe v1 right now

### What is the next step?
- Convert CPD-ID into PubChemID or InChI ---> go for PubChemID first
    - Done!
- Make reversible reactions
- Make negative dataset

In [None]:
def get_pubchem(ID):
    
    """Accept a UNIQUE-ID and return the PubChemID of that compound ('0' if the ID is unknown)"""
    if ID in df_cpd['PubChemID']:
        pubchem = df_cpd['PubChemID'][ID]
    else:
        pubchem = '0'
        
    return pubchem

In [None]:
df_rxn.head()

In [None]:
'Nitroaromatic-Ox-Compounds' in df_cpd['PubChemID']

In [None]:
'BIOTIN' in df_cpd['PubChemID']

In [None]:
if 'BIOTIN' in df_cpd['PubChemID']:
    print('Yes')

In [None]:
get_pubchem('MALONYL-ACP')

In [None]:
# Start from df_rxn and rerun the master dataframe again
subs_id = []
pdts_id = []

for index, row in df_rxn.iterrows():
    
    subs = []
    for item in row['SUBSTRATES']:
        subs.append(get_pubchem(item))
    subs_id.append(subs)
    
    pdts = []
    for item in row['PRODUCTS']:
        pdts.append(get_pubchem(item))
    pdts_id.append(pdts)

In [None]:
df_rxn['SUBSTRATES'] = subs_id
df_rxn['PRODUCTS'] = pdts_id
df_rxn.head()

In [None]:
counter_s_y = 0
counter_s_n = 0

counter_p_y = 0
counter_p_n = 0

for index, row in df_rxn.iterrows():
    
    for item in row['SUBSTRATES']:
        if item in df_cpd['PubChemID']:
            counter_s_y += 1
        else:
            counter_s_n += 1
    
    for item in row['PRODUCTS']:
        if item in df_cpd['PubChemID']:
            counter_p_y += 1
        else:
            counter_p_n += 1

print('Counting substrates True is', counter_s_y, 'False is', counter_s_n)
print('Counting products True is', counter_p_y, 'False is', counter_p_n)

In [None]:
percent_s = counter_s_y/(counter_s_y + counter_s_n)
percent_p = counter_p_y/(counter_p_y + counter_p_n)

print('Percents of available substrate and product are', percent_s*100, 'and', percent_p*100)

In [None]:
get_pubchem('Nitroaromatic-Ox-Compounds')

### Problems found
When no data is available for a certain UNIQUE-ID, the function fails!
- Those are macromolecules, general groups of compounds, etc.

Already solved by checking availability with the `in` operator
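
A short sketch of why the `in` check works here: on a pandas Series, `in` tests the index labels (the UNIQUE-IDs), not the stored values. The toy Series and its ID numbers are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_cpd['PubChemID'] with made-up ID values.
pubchem_ids = pd.Series([111, 222], index=["BIOTIN", "WATER"], name="PubChemID")

label_found = "BIOTIN" in pubchem_ids           # index label present
label_missing = "Acceptor" in pubchem_ids       # no such UNIQUE-ID
value_as_label = 111 in pubchem_ids             # a value is not a label
value_found = 111 in pubchem_ids.values         # search the values explicitly

# Series.get combines the membership check and the lookup in one call.
fallback = pubchem_ids.get("Acceptor", 0)

print(label_found, label_missing, value_as_label, value_found, fallback)
```

This also means that once a column has been converted from UNIQUE-IDs to PubChem IDs, an `in` test against `df_cpd['PubChemID']` no longer answers the same question.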