### 2023-11-29 Get the substrate for each pH optimum from BRENDA
* 1. Get the substrate for proteins I already curated
* 2. Assign the substrate based on papers where it's listed
      - if Km values are listed, use the substrate with the lowest Km (?)
* 3. For papers/organisms where multiple substrates are listed, need to review manually
* 4. SABIO-RK has substrates listed. About 1/4 of UniProt IDs overlap with BRENDA
* 5. RetroBioCat has the substrate curated along with the pH of the reaction
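
Steps 2 and 3 could be sketched roughly as below. This is only an illustration on a toy table: the column names (`uniprot_id`, `ref`, `substrate`, `km_mM`) and the helper `pick_substrate` are assumptions, not the actual BRENDA export layout. Groups with a measured Km get the lowest-Km substrate, groups with a single listed substrate get that substrate, and everything else is flagged for manual review.

```python
import pandas as pd

# Hypothetical toy table; column names and values are assumptions for illustration
km = pd.DataFrame({
    "uniprot_id": ["P1", "P1", "P2", "P3", "P3", "P4"],
    "ref": [1, 1, 2, 3, 3, 4],
    "substrate": ["A", "B", "C", "D", "E", "F"],
    "km_mM": [0.5, 0.1, 2.0, None, None, None],
})


def pick_substrate(km: pd.DataFrame) -> pd.DataFrame:
    keys = ["uniprot_id", "ref"]
    # Step 2: among rows with a measured Km, keep the lowest-Km row per group
    with_km = km.dropna(subset=["km_mM"])
    best = with_km.loc[
        with_km.groupby(keys)["km_mM"].idxmin(), keys + ["substrate", "km_mM"]
    ]
    # Count distinct substrates per group and remember the first one listed
    summary = (
        km.groupby(keys)["substrate"]
        .agg(n_substrates="nunique", first_substrate="first")
        .reset_index()
    )
    out = summary.merge(best, on=keys, how="left")
    # A group with a single listed substrate can be assigned without a Km
    single = out["substrate"].isna() & (out["n_substrates"] == 1)
    out.loc[single, "substrate"] = out.loc[single, "first_substrate"]
    # Step 3: whatever is still unassigned needs manual review
    out["needs_manual_review"] = out["substrate"].isna()
    return out.drop(columns="first_substrate")


result = pick_substrate(km)
print(result)
```

Whether the lowest Km is the right criterion is still an open question (see the "(?)" above), so the selection rule is kept in one place where it is easy to swap out.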

TODO UniProt pH data

In [1]:
import numpy as np
import pandas as pd
import os
import itertools
from typing import List, Tuple
import string
from pathlib import Path
from tqdm.auto import tqdm, trange

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [5]:
base_dir = Path("/projects/robustmicrob/jlaw/projects/prot_stability_engineering")
inputs_dir = base_dir / "inputs/brenda"

### Location of manually curated files
- inputs/brenda/manual_annot
- ph_range: 20230907_ph_range_redox_ann.csv, 20230907_ph_range_multi_prot_ec_reactions.csv
- ph_opt_redox: manual_annot/20230907_ph_opt_redox_ann.csv
  - only has oxidation/reduction assignment, not substrate info
- Google sheet with manually reviewed info: [20230907_ph_range_multi_prot_ec_reactions](https://docs.google.com/spreadsheets/d/1bEWh-YTyQF_R0x4COMM9Q1CX-Uzkd5dmIDqM26JnwHQ/edit?usp=sharing)

In [2]:
multi_file = inputs_dir / "manual_annot/20230907_ph_range_multi_prot_ec_reactions.csv"
data_multi = pd.read_csv(multi_file)
print(len(data_multi))
data_multi.head(2)


876


Unnamed: 0,index,ec_num,ph_min,ph_max,comments,organism,uniprot_id,ref,notes,ec_uniprot_grp,substrates,products,reaction_type,substrates_rev,enzyme_immobilization,condition,to_fix,notes.1
0,3270,4.1.2.42,5.0,10.0,"pH 5.0: about 55% of maximal activity, pH 10.0...",Delftia sp.,A0A031HCH9,747468,"{'aldol addition', 'retro-aldol addition'}",89,{'D-threonine'},{'glycine + acetaldehyde'},-,,,,,
1,3271,4.1.2.42,6.0,9.5,"pH 6.0: about 65% of maximal activity, pH 9.5:...",Delftia sp.,A0A031HCH9,747468,{'retro-aldol addition'},89,{'D-threonine'},{'glycine + acetaldehyde'},-,,,,,
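
Columns such as `substrates` and `products` come back from the CSV as strings that look like Python sets (for example `{'D-threonine'}`). A minimal sketch of converting them back into real sets is shown here; the helper name `parse_set_string` is hypothetical, and falling back to a one-element set for unparseable values is an assumption about what is convenient downstream.

```python
import ast


def parse_set_string(value):
    """Convert a string like "{'a', 'b'}" into a set of strings."""
    if not isinstance(value, str):
        return set()  # e.g. NaN from an empty CSV cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Plain text such as "xylan" or "-" is not a Python literal
        return {value}
    if isinstance(parsed, (set, list, tuple)):
        return set(parsed)
    return {str(parsed)}


example = parse_set_string("{'aldol addition', 'retro-aldol addition'}")
print(example)
```

On the real table this could be applied with something like `data_multi["substrates"].map(parse_set_string)`.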


In [4]:
redox_ann_file = inputs_dir / "manual_annot/20230907_ph_range_redox_ann.csv"
data_redox = pd.read_csv(redox_ann_file)
print(len(data_redox))
data_redox.head(2)

725


Unnamed: 0,index,ec_link,ref_link,ec_num,name,ph_min,ph_max,comments,organism,uniprot_id,...,substrates,products,notes,reaction_type,substrates_rev,condition,to_fix,notes.1,rxn_type_guess,all_substrates
0,13,https://www.brenda-enzymes.org/enzyme.php?ecno...,https://www.brenda-enzymes.org/literature.php?...,1.1.1.1,alcohol dehydrogenase,7.0,9.6,"pH 7.0: about 70% of maximal activity, pH 9.6:...",Aeropyrum pernix,Q9Y9P9,...,"{'1-propanol + NAD+', '2-propanol + NAD+', 'et...","{'acetaldehyde + NADH + H+', '4-methoxybenzald...",{'reduction of 2-pentanone'},reduction,2-pentanone,,,,oxidation_or_reduction,"{'benzyl alcohol', '4-methoxyphenylacetone', '..."
1,16,https://www.brenda-enzymes.org/enzyme.php?ecno...,https://www.brenda-enzymes.org/literature.php?...,1.1.1.1,alcohol dehydrogenase,9.6,11.5,"pH 9.5: about 40% of maximal activity, pH 11.5...",Aeropyrum pernix,Q9Y9P9,...,"{'1-propanol + NAD+', '2-propanol + NAD+', 'et...","{'acetaldehyde + NADH + H+', '4-methoxybenzald...",{'oxidation of 2-pentanol'},oxidation,2-pentanone,,,,oxidation_or_reduction,"{'benzyl alcohol', '4-methoxyphenylacetone', '..."


In [6]:
# "rev" stands for "reviewed"
data_multi.substrates_rev.value_counts()

inulin + H2O                            12
4-nitrophenyl beta-D-glucopyranoside     9
filaggrin-L-arginine + H2O               6
ADP + D-glucose                          4
xylan                                    4
                                        ..
benzaldehyde                             1
L-lactate + NAD+                         1
oxaloacetate + NADH + H+                 1
(S)-malate + NAD+                        1
D-xylulose + NADH + H+                   1
Name: substrates_rev, Length: 98, dtype: int64
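
Note that `value_counts()` skips missing values by default, so the counts above only cover rows that already have a reviewed substrate. A quick way to see how many rows are still unreviewed is sketched below on a toy Series standing in for `data_multi.substrates_rev` (the values are illustrative only).

```python
import pandas as pd

# Toy stand-in for data_multi.substrates_rev; values are illustrative
substrates_rev = pd.Series(["xylan", "xylan", None, "inulin + H2O", None])

counts = substrates_rev.value_counts()  # missing values are excluded by default
counts_all = substrates_rev.value_counts(dropna=False)  # adds a bucket for missing values

n_reviewed = int(substrates_rev.notna().sum())
frac_reviewed = n_reviewed / len(substrates_rev)
print(f"{n_reviewed} of {len(substrates_rev)} rows reviewed ({frac_reviewed:.0%})")
```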

In [7]:
data_redox.substrates_rev.value_counts()

(+)-bornane-2,5-dione + FMNH2 + O2                        4
cellobiose                                                4
glutathione + dehydroascorbate                            3
pyruvate + NADH + H+                                      2
spermine                                                  2
reduced coenzyme F420 + NADP+                             2
pyruvate + NH3 + NADH                                     2
propanal                                                  2
methylmalonate-semialdehyde + CoA + NAD+                  2
2-acetolactate                                            2
glyoxylate                                                2
6-phospho-D-gluconate + NADP+                             2
(S)-malate + NADP+                                        2
2-pentanone                                               2
salutaridine + NADPH + H+                                 2
(7S)-salutaridinol + NADP+', 'salutaridinol + NADP+       2
scytalone + NADP+                       