#### Dependencies:

In [77]:
from kcat_prediction import *

Before using kcat prediction for the first time, we have to download the ESM-1b model:

In [78]:
import esm
import torch  # imported explicitly so this cell does not rely on the star import above

model, alphabet = torch.hub.load("facebookresearch/esm:v0.4.0", "esm1b_t33_650M_UR50S")

Using cache found in C:\Users\marle/.cache\torch\hub\facebookresearch_esm_v0.4.0


Example of a kcat prediction. You can enter multiple enzymes and substrates: enzyme 1 is paired with substrates 1/products 1, enzyme 2 with substrates 2/products 2, and so on.

Enter enzymes as amino acid sequences and substrates/products as SMILES strings, KEGG Compound IDs, or InChI strings. If the reaction has multiple substrates or products, separate them with a semicolon (";").
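The pairing rule above can be sketched in a few lines. This is only an illustration: `join_compounds`, `build_inputs`, and the `reactions` records are hypothetical names, and the two amino acid sequences are toy placeholders rather than real enzymes. The separator is a parameter because the validation cell in this notebook joins identifiers with "#" instead of ";".

```python
def join_compounds(compounds, sep=";"):
    """Join the identifiers of one reaction side into a single string."""
    return sep.join(str(c) for c in compounds)


def build_inputs(reactions, sep=";"):
    """Build three index-aligned lists: entry i of each list describes reaction i."""
    enzymes, substrates, products = [], [], []
    for reaction in reactions:
        enzymes.append(reaction["enzyme"])
        substrates.append(join_compounds(reaction["substrates"], sep))
        products.append(join_compounds(reaction["products"], sep))
    return enzymes, substrates, products


# Hypothetical records: toy sequences, KEGG Compound IDs and SMILES strings.
reactions = [
    {"enzyme": "MKTAYIAKQR", "substrates": ["C00001", "C00002"], "products": ["C00008", "C00009"]},
    {"enzyme": "MSLLTEVETY", "substrates": ["CCO"], "products": ["CC=O"]},
]

enzymes, substrates, products = build_inputs(reactions)
print(substrates)
print(products)
```

Building the three lists in one loop keeps them the same length, so enzyme i always lines up with substrates i and products i.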

In [79]:
import pandas as pd
from os.path import join

split = "full"
data = pd.read_pickle(join("../../data", "splits", split, "val_data.pkl"))

# Each entry is a collection of compound IDs for one reaction side;
# join them into a single "#"-separated string per reaction.
data["substrate_IDs"] = ["#".join(map(str, ids)) for ids in data["substrate_IDs"]]
data["product_IDs"] = ["#".join(map(str, ids)) for ids in data["product_IDs"]]

In [81]:
df = kcat_predicton(substrates=data["substrate_IDs"].tolist(),
                    products=data["product_IDs"].tolist(),
                    enzymes=data["Sequence"].tolist(),
                    values=np.log10(data["kcat"].tolist()),
                    RID=data["Reaction ID"].tolist(),
                    SID=data["Sequence ID"].tolist(),
                    EC=data["ECs"].tolist())

Step 1/3: Calculating numerical representations for all substrates and products.
Step 2/3: Calculating numerical representations for all enzymes.
.....2(a) Loading ESM-1b model.
.....2(b) Loading model parameters for task-specific model.
.....2(c) Calculating enzyme representations.
Step 3/3: Making predictions for kcat.
                                          substrates  \
0  InChI=1S/C10H10N2O/c13-12-6-5-8-7-11-10-4-2-1-...   
1  InChI=1S/C10H10N2O/c13-12-6-5-8-7-11-10-4-2-1-...   
2  InChI=1S/C10H17NO9S2/c1-2-3-6(11-20-22(16,17)1...   
3  InChI=1S/H2O/h1H2#InChI=1S/C12H15NO8/c14-5-8-9...   
4  InChI=1S/C3H7NO2S/c4-2(1-7)3(5)6/h2,7H,1,4H2,(...   
5  InChI=1S/C3H7NO2S/c4-2(1-7)3(5)6/h2,7H,1,4H2,(...   
6  InChI=1S/C3H7NO2S/c4-2(1-7)3(5)6/h2,7H,1,4H2,(...   
7  InChI=1S/C10H10N2O/c13-12-6-5-8-7-11-10-4-2-1-...   
8  InChI=1S/C10H10N2O/c13-12-6-5-8-7-11-10-4-2-1-...   

                                            products  \
0  InChI=1S/H2O/h1H2#InChI=1S/C13H15N3O3S/c14-10(...   
1  I

In [83]:
true_values = [0.367977,-0.053875,1.079181,0.863323,-1.014723]
predicted = df.head(5)["kcat [s^(-1)]"].tolist()

In [85]:
import sklearn.metrics as sk
from scipy import stats  # import the submodule explicitly instead of relying on scipy.stats being loaded
from math import sqrt

# Keep only the rows flagged as complete.
df = df[df["complete"] == True]

y_true = df["value"].tolist()
y_pred = df["kcat [s^(-1)]"].tolist()

# Metrics on the columns as stored ("value" was passed in as log10(kcat)).
# mean_squared_error returns the MSE by default, so sqrt gives the RMSE.
rms = sqrt(sk.mean_squared_error(y_true, y_pred))
R2 = sk.r2_score(y_true, y_pred)
Pearson = stats.pearsonr(y_true, y_pred)[0]

print(rms, R2, Pearson)

# Same metrics after applying 10**x to both columns.
lin_true = np.array((10**df["value"]).tolist())
lin_pred = np.array((10**df["kcat [s^(-1)]"]).tolist())

rms = sqrt(sk.mean_squared_error(lin_true, lin_pred))
R2 = sk.r2_score(lin_true, lin_pred)
Pearson = stats.pearsonr(lin_true, lin_pred)[0]
MAE = np.mean(np.abs(lin_true - lin_pred))
MedAE = np.median(np.abs(lin_true - lin_pred))

print(rms, R2, Pearson, MAE, MedAE)

2.5370512866253274 -9.919769516834709 -0.5445550976822251
961.407875587815 -61838.17550337661 -0.48324017501752337 570.9313425666096 43.09756088256836
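
The two printed lines report the same metrics before and after applying 10**x to both columns. The sketch below, which uses only NumPy and made-up toy numbers (not the notebook's data), shows how RMSE, R2, and the Pearson correlation are defined and why the back-transformed values can differ by orders of magnitude. `regression_metrics` is a hypothetical helper, not part of `kcat_prediction`.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, coefficient of determination (R2) and Pearson r."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))                  # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                              # negative when ss_res > ss_tot
    pearson = float(np.corrcoef(y_true, y_pred)[0, 1])
    return {"rmse": rmse, "r2": r2, "pearson": pearson}


# Toy values on a log10 scale (illustrative only).
log_true = np.array([0.0, 1.0, 2.0, 3.0])
log_pred = np.array([0.5, 0.5, 2.5, 2.5])

log_metrics = regression_metrics(log_true, log_pred)
lin_metrics = regression_metrics(10 ** log_true, 10 ** log_pred)

print(log_metrics)
print(lin_metrics)
```

An R2 below zero means the residual sum of squares exceeds the total sum of squares, so the predictions fit worse than a constant equal to the mean of the true values. On the back-transformed scale the largest values dominate every squared-error term, which is why the errors grow so sharply there.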
