Notebook purpose: evaluate how efficiently we could search for catalysts using the ML model under various constraints.

The most conspicuous constraint is to find a set number of active catalysts without running any unnecessary DFT calculations.
What counts as unnecessary? Ideally, 100% of the O2 binding calculations we run are on sites that actually bind.
So we can accept a model with lower overall accuracy as long as it produces no false positives; false negatives carry only a small penalty.
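
One way to make the "no false positives" requirement concrete is to look for the lowest probability cutoff above which every held-out prediction is a true binder. The sketch below is a minimal, dependency-free illustration of that idea; `zero_fp_threshold` and the toy scores are hypothetical and are not part of `ngcc_ml`.

```python
def zero_fp_threshold(probas, labels):
    """Return (threshold, n_selected): the lowest score such that every
    sample scoring at or above it is a true positive (label == 1).
    Returns (None, 0) if even the top-scoring group contains a negative."""
    ranked = sorted(zip(probas, labels), key=lambda pl: pl[0], reverse=True)
    threshold, n_selected = None, 0
    i = 0
    while i < len(ranked):
        # Samples with identical scores cannot be separated by a cutoff,
        # so they are accepted or rejected together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            j += 1
        if any(label != 1 for _, label in ranked[i:j]):
            break
        threshold, n_selected = ranked[i][0], j
        i = j
    return threshold, n_selected


# Toy held-out scores, made up purely for illustration.
toy_probas = [0.99, 0.95, 0.90, 0.80, 0.70, 0.40]
toy_labels = [1, 1, 1, 0, 1, 0]
threshold, n_selected = zero_fp_threshold(toy_probas, toy_labels)
print(threshold, n_selected)
```

In this toy case the true binder scored at 0.70 falls below the cutoff and is missed, which is the false-negative penalty we are willing to accept.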

Let's say we're only willing to run 5 DFT O2 binding calculations, and we want basically all of them to show that we found active sites. We'd probably want each of these to be on a different catalyst, to show that we've found 5 unique active catalysts. Assuming we're working with 10% of the data as a "test" set, that's about 27 catalysts, so we want to pick the ones for which the model is most confident that at least 1 site binds O2.

Really, this is a question of whether the sites the model flags as active for a set of catalysts are most likely to actually bind.
We can order by log-loss and take that as an estimate of uncertainty (is that a fair expectation?)
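
A minimal sketch of the per-catalyst selection described above, under two assumptions that are not confirmed anywhere in this notebook: each catalyst is scored by its single most confident site, and a site's log-loss is evaluated as if the site truly binds, which gives -log(p). Because -log(p) decreases monotonically with p, ordering by that loss gives the same ranking as ordering by predicted probability; it says nothing about whether the probabilities are well calibrated. The names `rank_catalysts` and `loss_if_binding` and the toy data are hypothetical.

```python
import math


def rank_catalysts(site_rows, n_pick=5):
    """site_rows: iterable of (catalyst_name, predicted_probability), one per site.
    Returns the n_pick catalysts with the highest best-site probability."""
    best = {}
    for name, proba in site_rows:
        # Score a catalyst by its most confident site (an assumption).
        best[name] = max(proba, best.get(name, 0.0))
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n_pick]


def loss_if_binding(proba):
    """Log-loss a site would incur if it truly binds (label 1). Requires proba > 0."""
    return -math.log(proba)


# Toy site-level predictions, made up purely for illustration.
toy_sites = [("catA", 0.20), ("catA", 0.97), ("catB", 0.60),
             ("catC", 0.91), ("catC", 0.35)]
picks = rank_catalysts(toy_sites, n_pick=2)
print(picks)
```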


In [1]:
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import GroupShuffleSplit

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix

In [2]:
from ngcc_ml import data_tools
from ngcc_ml import skl_tools

In [3]:
df = pd.read_csv("/home/nricke/work/ngcc_ml/DidItBindv5.csv")
df["Doesitbind"] = df["Doesitbind"].astype("int")

In [4]:
df.columns

Index(['Unnamed: 0', 'Atom Number', 'Catalyst Name', 'CatalystO2File',
       'Element', 'SpinDensity', 'ChElPGPositiveCharge', 'ChElPGNeutralCharge',
       'ChargeDifference', 'Doesitbind', 'BondLength', 'IonizedFreeEnergy',
       'IonizationEnergy', 'BindingEnergy', 'NeutralFreeEnergy', 'OrthoOrPara',
       'Meta', 'FartherThanPara', 'DistanceToN', 'AverageBondLength',
       'BondLengthRange', 'NumberOfHydrogens', 'AromaticSize', 'IsInRingSize6',
       'IsInRingSize5', 'NeighborSpinDensity', 'NeighborChElPGCharge',
       'NeighborChargeDifference', 'AromaticExtent', 'RingEdge',
       'NumNitrogens', 'NumHeteroatoms', 'ring_nitrogens',
       'atom_plane_deviation', 'ring_plane_deviation', 'charge'],
      dtype='object')

In [17]:
df

Unnamed: 0.1,Unnamed: 0,Atom Number,Catalyst Name,CatalystO2File,Element,SpinDensity,ChElPGPositiveCharge,ChElPGNeutralCharge,ChargeDifference,Doesitbind,...,NeighborChElPGCharge,NeighborChargeDifference,AromaticExtent,RingEdge,NumNitrogens,NumHeteroatoms,ring_nitrogens,atom_plane_deviation,ring_plane_deviation,charge
0,0,1,sf100x0,,C,-0.008245,-0.275350,-0.200227,0.075123,0,...,0.363881,-0.176308,0,0,1,5,0,0.000000e+00,0.000000,0
1,1,3,sf100x0,sf100x0O2-2_optsp_a0m2.out,C,0.555664,-0.064043,-0.339572,-0.275529,1,...,0.338356,-0.005618,18,2,1,5,1,6.723311e-02,0.263086,0
2,2,4,sf100x0,,C,-0.181519,0.037008,0.096249,0.059241,0,...,-0.680474,-0.444099,18,1,1,5,1,3.875031e-02,0.263086,0
3,3,5,sf100x0,,C,0.208580,-0.226453,-0.308850,-0.082397,0,...,0.449458,0.030624,18,2,1,5,1,1.263070e-05,0.263086,0
4,4,6,sf100x0,sf100x0O2-5_optsp_a0m2.out,C,-0.119560,0.176015,0.163894,-0.012121,0,...,-0.461894,-0.191920,18,2,1,5,1,6.831072e-02,0.263086,0
5,5,7,sf100x0,sf100x0O2-6_optsp_a0m2.out,C,0.221689,0.308617,0.215658,-0.092959,0,...,-0.416906,-0.059564,18,2,1,5,1,1.112409e-02,0.263086,0
6,6,8,sf100x0,,C,-0.138080,-0.337969,-0.357701,-0.019732,0,...,0.389078,-0.188655,18,2,1,5,1,1.082265e-02,0.263086,0
7,7,9,sf100x0,,C,0.243169,0.054121,-0.032052,-0.086173,0,...,-0.182024,0.039573,18,1,1,5,1,8.823929e-02,0.263086,0
8,8,10,sf100x0,,C,-0.012210,0.079364,0.079428,0.000064,0,...,-0.220231,-0.121254,18,1,1,5,1,6.307182e-03,0.263086,0
9,9,11,sf100x0,sf100x0O2-10_optsp_a0m2.out,C,-0.006535,0.093478,0.074304,-0.019174,0,...,0.029898,-0.063652,18,1,1,5,1,3.980934e-02,0.263086,0


In [5]:


feature_cols = {"SpinDensity", "ChElPGNeutralCharge", "ChargeDifference", "IonizationEnergy", "OrthoOrPara", "Meta", "FartherThanPara", "DistanceToN", "AverageBondLength",  "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "NeighborSpinDensity", 'NeighborChElPGCharge', 'NeighborChargeDifference', "AromaticExtent", "RingEdge", "NumNitrogens", "NumHeteroatoms", "charge", "atom_plane_deviation", "ring_plane_deviation", "ring_nitrogens"}
not_scaled_cols = {"OrthoOrPara", "Meta", "FartherThanPara", "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "RingEdge", "NumNitrogens", "NumHeteroatoms", "ring_nitrogens", "charge"}
df_scale = data_tools.process_data(df, scaledCols=list(feature_cols - not_scaled_cols))
rfc = RandomForestClassifier(n_estimators=1000, max_depth=100, class_weight={0:0.5, 1:0.5})
gkf_scores, df_gkf = skl_tools.group_kfold_evaluate(rfc, df_scale, feature_cols, target_col="Doesitbind")



Accuracy of RFC on test set: 0.96
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.94
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.96
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.96
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00
Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00
mean: 0.9526865859309659


In [6]:
df_gkf.columns

Index(['Unnamed: 0', 'Atom Number', 'Catalyst Name', 'CatalystO2File',
       'Element', 'SpinDensity', 'ChElPGPositiveCharge', 'ChElPGNeutralCharge',
       'ChargeDifference', 'Doesitbind', 'BondLength', 'IonizedFreeEnergy',
       'IonizationEnergy', 'BindingEnergy', 'NeutralFreeEnergy', 'OrthoOrPara',
       'Meta', 'FartherThanPara', 'DistanceToN', 'AverageBondLength',
       'BondLengthRange', 'NumberOfHydrogens', 'AromaticSize', 'IsInRingSize6',
       'IsInRingSize5', 'NeighborSpinDensity', 'NeighborChElPGCharge',
       'NeighborChargeDifference', 'AromaticExtent', 'RingEdge',
       'NumNitrogens', 'NumHeteroatoms', 'ring_nitrogens',
       'atom_plane_deviation', 'ring_plane_deviation', 'charge',
       'Doesitbind_pred', 'Doesitbind_predproba'],
      dtype='object')

In [10]:
df_gkf = df_gkf.sort_values(by="Doesitbind_predproba", ascending=False)[["Catalyst Name", "Doesitbind", "Doesitbind_pred", "Doesitbind_predproba"]]

In [22]:
df_gkf = df_gkf.merge(df[["Atom Number"]], left_index=True, right_index=True)

In [23]:
df_gkf.to_csv("df_classify_GroupKFold.csv")

Now that we've seen that this is relatively successful in this framework, the next step is a head-to-head search comparison.
For a set of C catalysts, search until a subset A is found that is active, with the goal of checking O2 binding for as few sites as possible.
This is quite similar to the above, but we want to keep track of slightly different metrics. For each ordering, we now want to ask how many O2 binding checks are needed before the target number of active catalysts is found.

In [12]:
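def search_for_active_catalysts(df_catalysts, order_col, feature_cols, target_col="Doesitbind", find_num=10):
    """
    Walk through sites in descending order of order_col until find_num active catalysts are found.

    df_catalysts (pandas dataframe): sites to search; needs "Catalyst Name", order_col and target_col
    order_col (str): column name to sort sites by. Expected to hold predict_proba or random values
    feature_cols: not used in the search; kept so existing calls keep working
    target_col (str): column holding the true binding label (1 = binds O2)
    find_num (int): number of unique active catalysts to find before stopping

    Returns (found_list, count): names of the active catalysts found and the number of sites checked
    """
    df_sort = df_catalysts.sort_values(by=order_col, ascending=False)
    found_list = []
    count = 0
    for index, row in df_sort.iterrows():
        # Sites on catalysts already known to be active are skipped and cost nothing
        if row["Catalyst Name"] not in found_list:
            if row[target_col] == 1:
                found_list.append(row["Catalyst Name"])
            # Every other site visited counts as one O2 binding calculation
            count += 1
            assert len(found_list) <= find_num
            if len(found_list) == find_num:
                break
    return found_list, count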

In [13]:
df_ts = df_gkf.copy()
df_ts

Unnamed: 0,Catalyst Name,Doesitbind,Doesitbind_pred,Doesitbind_predproba
2487,sf252x0,1,1,1.000
3319,sf45x0,1,1,1.000
2217,sf238x0,1,1,1.000
3697,sf64x0,1,1,1.000
967,sf158x0,1,1,1.000
3644,sf61x0,1,1,1.000
3903,sf80x0,1,1,1.000
1812,sf208x0,1,1,1.000
2316,sf243x0,1,1,0.999
704,sf146x0,1,1,0.999


In [14]:
df_ts = df_ts.assign(random_ordering=np.random.rand(df_ts.shape[0]))

In [15]:
df_ts.iloc[0].random_ordering

0.22637466711018506

In [None]:
l, c = search_for_active_catalysts(df_ts, order_col="random_ordering", feature_cols=feature_cols, find_num=100)
print(len(l))
print(c)

In [None]:
df_test_all = df_test_all.drop_duplicates()

In [None]:
df_test_all = df_test_all.assign(random_ordering=np.random.rand(df_test_all.shape[0]))

In [16]:
l_O2, c_O2 = search_for_active_catalysts(df_ts, order_col="random_ordering", feature_cols=feature_cols, find_num=100)
print(len(l_O2), c_O2)
l_t, c_t = search_for_active_catalysts(df_ts, order_col="Doesitbind_predproba", feature_cols=feature_cols, find_num=100)
print(len(l_t), c_t)

100 607
100 101
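
The random baseline above comes from a single draw of `random_ordering`, so its count will vary from run to run. A hedged sketch of how one might average it over many shuffles is given below. It is a dependency-free analogue of the counting rule in `search_for_active_catalysts`, not the notebook's own code; `count_checks`, `random_baseline`, the seed, the number of repeats and the toy data are all hypothetical.

```python
import random
import statistics


def count_checks(sites, find_num):
    """sites: ordered list of (catalyst_name, binds) tuples.
    A site costs one check unless its catalyst is already known to be active."""
    found = set()
    checks = 0
    for name, binds in sites:
        if name in found:
            continue
        if binds == 1:
            found.add(name)
        checks += 1
        if len(found) == find_num:
            break
    return len(found), checks


def random_baseline(sites, find_num, n_repeats=200, seed=0):
    """Mean and population standard deviation of checks over random orderings."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_repeats):
        shuffled = list(sites)
        rng.shuffle(shuffled)
        counts.append(count_checks(shuffled, find_num)[1])
    return statistics.mean(counts), statistics.pstdev(counts)


# Toy data: 20 catalysts with 5 sites each, exactly one binding site per catalyst.
toy_sites = [(f"cat{i}", 1 if j == 0 else 0) for i in range(20) for j in range(5)]
mean_checks, spread = random_baseline(toy_sites, find_num=5)
print(mean_checks, spread)
```

Reporting the mean and spread of the random count next to the model-ordered count makes the comparison less dependent on one lucky or unlucky shuffle.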
