1. Load mol objects and highlight sites that are being misclassified
2. A lot of the false positives are tertiary carbon atoms with high spin. Tertiary carbon atoms don't usually do as well, but the model can't tell the difference between T-carbons embedded in rings or connected to methyl groups. This could deserve its own feature
3. A lot of the false negatives are for small molecules with multiple nitrogen atoms. Correcting the aromatic ring size feature could potentially improve this. First check if the RDKit feature is trustworthy; if not, can just look for the full set of carbons with only 3 bonds, as these are necessarily sp2.
4. The aromatic ring size features appears to be completely wrong, and I found at least one counterexample for OrthoOrPara. DistanceToN may obviate OrthoOrPara anyway
5. sf189x0 has an active site at carbon 4 (binding energy -0.1eV), and a symmetric site at carbon 21 that is not active. This likely illustrates different conformers can subtely affect these energies
6. Catalyst sf41x0 lists all binding sites as inactive, but some actually are (and possibly most). If I move the correct files to the correct directory, can I just rerun Kunal's parsing scripts?

Score to beat: 93% with random forests
Since 80% of sites are not active, the threshold for success is 80%

In [1]:
import pandas as pd

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Data Analysis
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

In [2]:
from ngcc_ml import data_tools



In [3]:
def processData(df_in, oneHotCols=[], scaledCols=[]):
    df = df_in.copy()
    scaler = StandardScaler()
    oneHotEncoder = OneHotEncoder(categories = "auto", sparse = False)
    if scaledCols != []:
        df[scaledCols] = scaler.fit_transform(df[scaledCols])
    #if oneHotCols != []:
    #    df[encoded_columns] = oneHotEncoder.fit_transform(alldata[oneHotCols])
    return df
    

In [4]:
#df = pd.read_csv("/home/nricke/work/klodaya/notebooks_klodaya/DidItBindWithSubstructure.csv", index_col=0)
#df = pd.read_csv("/home/nricke/work/ngcc_ml/DidItBindv2.csv")
df = pd.read_csv("/home/nricke/work/ngcc_ml/DidItBindv4.csv")
df_aug = pd.read_json("/home/nricke/work/ngcc_ml/cat_aux_aromatic_flex.json")
df_aug["Atom Number"] = df_aug["Atom Number"] + 1
df_aug.drop(columns=["ring_edge", "aromatic_extent"], inplace=True)
df = df.merge(df_aug, on=["Atom Number", "Catalyst Name"])

In [5]:
#df.to_csv("/home/nricke/work/ngcc_ml/DidItBindv5.csv")

In [6]:
print(df.columns)

Index(['Atom Number', 'Catalyst Name', 'CatalystO2File', 'Element',
       'SpinDensity', 'ChElPGPositiveCharge', 'ChElPGNeutralCharge',
       'ChargeDifference', 'Doesitbind', 'BondLength', 'IonizedFreeEnergy',
       'IonizationEnergy', 'BindingEnergy', 'NeutralFreeEnergy', 'OrthoOrPara',
       'Meta', 'FartherThanPara', 'DistanceToN', 'AverageBondLength',
       'BondLengthRange', 'NumberOfHydrogens', 'AromaticSize', 'IsInRingSize6',
       'IsInRingSize5', 'NeighborSpinDensity', 'NeighborChElPGCharge',
       'NeighborChargeDifference', 'AromaticExtent', 'RingEdge',
       'NumNitrogens', 'NumHeteroatoms', 'ring_nitrogens',
       'atom_plane_deviation', 'ring_plane_deviation', 'charge'],
      dtype='object')


In [7]:
df = df[df["Catalyst Name"] != "sf7x0"]

In [23]:
feature_cols = {"SpinDensity", "ChElPGNeutralCharge", "ChargeDifference", "IonizationEnergy", "OrthoOrPara", "Meta", "FartherThanPara", "DistanceToN", "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "NeighborSpinDensity", 'NeighborChElPGCharge', 'NeighborChargeDifference', "AromaticExtent", "RingEdge", "NumNitrogens", "NumHeteroatoms", "charge", "atom_plane_deviation", "ring_plane_deviation", "ring_nitrogens"}
not_scaled_cols = {"OrthoOrPara", "Meta", "FartherThanPara", "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "RingEdge", "NumNitrogens", "NumHeteroatoms", "ring_nitrogens", "charge"}
df_scale = processData(df, scaledCols=list(feature_cols - not_scaled_cols))
X = df_scale[feature_cols]
y = df_scale["Doesitbind"].astype('int')

  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


In [24]:
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X, y, test_size=0.1, random_state=0)

In [25]:
train_inds, test_inds = next(GroupShuffleSplit(test_size=0.10, n_splits=2, random_state = 7).split(df, groups=df['Catalyst Name']))
train = df.iloc[train_inds]
test = df.iloc[test_inds]
X_train_group = train[feature_cols]
y_train_group = train["Doesitbind"].astype("int")
X_test_group = test[feature_cols]
y_test_group = test["Doesitbind"].astype("int")

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

In [20]:
rfc = RandomForestClassifier(n_estimators=100, max_depth=100, class_weight={0:0.5, 1:0.5})
rfc.fit(X_train, y_train)
print('Accuracy of RFC on test set: {:.2f}'.format(rfc.score(X_test, y_test)))
print('Accuracy of RFC on training set: {:.2f}'.format(rfc.score(X_train, y_train)))
scores = cross_val_score(rfc, X, y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy of RFC on test set: 0.96
Accuracy of RFC on training set: 1.00
Accuracy: 0.95 (+/- 0.03)


In [21]:
rfc = RandomForestClassifier(n_estimators=100, max_depth=100, class_weight={0:0.5, 1:0.5})
rfc.fit(X_train, y_train)
print('Accuracy of RFC on test set: {:.2f}'.format(rfc.score(X_test_poly, y_test_poly)))
print('Accuracy of RFC on training set: {:.2f}'.format(rfc.score(X_train_poly, y_train_poly)))
scores = cross_val_score(rfc, X_poly, y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00
Accuracy: 0.94 (+/- 0.04)


In [29]:
df.iloc[train_inds]

Unnamed: 0,Atom Number,Catalyst Name,CatalystO2File,Element,SpinDensity,ChElPGPositiveCharge,ChElPGNeutralCharge,ChargeDifference,Doesitbind,BondLength,...,NeighborChElPGCharge,NeighborChargeDifference,AromaticExtent,RingEdge,NumNitrogens,NumHeteroatoms,ring_nitrogens,atom_plane_deviation,ring_plane_deviation,charge
0,1,sf100x0,,C,-0.008245,-0.275350,-0.200227,0.075123,False,0.000000,...,0.363881,-0.176308,0,0,1,5,0,0.000000e+00,0.000000,0
1,3,sf100x0,sf100x0O2-2_optsp_a0m2.out,C,0.555664,-0.064043,-0.339572,-0.275529,True,1.535452,...,0.338356,-0.005618,18,2,1,5,1,6.723311e-02,0.263086,0
2,4,sf100x0,,C,-0.181519,0.037008,0.096249,0.059241,False,0.000000,...,-0.680474,-0.444099,18,1,1,5,1,3.875031e-02,0.263086,0
3,5,sf100x0,,C,0.208580,-0.226453,-0.308850,-0.082397,False,0.000000,...,0.449458,0.030624,18,2,1,5,1,1.263070e-05,0.263086,0
4,6,sf100x0,sf100x0O2-5_optsp_a0m2.out,C,-0.119560,0.176015,0.163894,-0.012121,False,1.701145,...,-0.461894,-0.191920,18,2,1,5,1,6.831072e-02,0.263086,0
5,7,sf100x0,sf100x0O2-6_optsp_a0m2.out,C,0.221689,0.308617,0.215658,-0.092959,False,1.619080,...,-0.416906,-0.059564,18,2,1,5,1,1.112409e-02,0.263086,0
6,8,sf100x0,,C,-0.138080,-0.337969,-0.357701,-0.019732,False,0.000000,...,0.389078,-0.188655,18,2,1,5,1,1.082265e-02,0.263086,0
7,9,sf100x0,,C,0.243169,0.054121,-0.032052,-0.086173,False,0.000000,...,-0.182024,0.039573,18,1,1,5,1,8.823929e-02,0.263086,0
8,10,sf100x0,,C,-0.012210,0.079364,0.079428,0.000064,False,0.000000,...,-0.220231,-0.121254,18,1,1,5,1,6.307182e-03,0.263086,0
9,11,sf100x0,sf100x0O2-10_optsp_a0m2.out,C,-0.006535,0.093478,0.074304,-0.019174,False,1.633351,...,0.029898,-0.063652,18,1,1,5,1,3.980934e-02,0.263086,0


In [30]:
df.iloc[test_inds]

Unnamed: 0,Atom Number,Catalyst Name,CatalystO2File,Element,SpinDensity,ChElPGPositiveCharge,ChElPGNeutralCharge,ChargeDifference,Doesitbind,BondLength,...,NeighborChElPGCharge,NeighborChargeDifference,AromaticExtent,RingEdge,NumNitrogens,NumHeteroatoms,ring_nitrogens,atom_plane_deviation,ring_plane_deviation,charge
48,1,sf102x0,,C,0.004268,-0.398224,-0.389284,0.008940,False,0.000000,...,0.931734,-0.030626,0,0,2,2,0,0.000000e+00,0.000000e+00,1
49,3,sf102x0,sf102x0O2-2_optsp_c1m2.out,C,0.210099,-0.275257,-0.383752,-0.108495,True,1.505260,...,0.859034,-0.031338,16,2,2,2,2,1.261000e-07,2.651000e-07,1
50,4,sf102x0,sf102x0O2-3_optsp_c1m2.out,C,0.015709,0.225917,0.204654,-0.021263,False,3.253027,...,-0.673833,-0.179364,16,1,2,2,2,2.300000e-09,2.651000e-07,1
51,5,sf102x0,sf102x0O2-4_optsp_c1m2.out,C,0.071579,-0.197178,-0.271850,-0.074672,False,1.575589,...,0.126852,-0.107606,16,2,2,2,2,6.510000e-07,2.651000e-07,1
52,6,sf102x0,sf102x0O2-5_optsp_c1m2.out,C,0.068855,-0.200452,-0.275572,-0.075120,False,1.576456,...,0.130110,-0.108109,16,2,2,2,2,4.759000e-07,2.651000e-07,1
53,7,sf102x0,sf102x0O2-6_optsp_c1m2.out,C,0.018490,0.225762,0.203310,-0.022452,False,3.245221,...,-0.666351,-0.175544,16,1,2,2,2,2.177000e-07,2.651000e-07,1
54,8,sf102x0,sf102x0O2-7_optsp_c1m2.out,C,0.205799,-0.273892,-0.379959,-0.106067,True,1.505522,...,0.853288,-0.034098,16,2,2,2,2,4.300000e-09,2.651000e-07,1
55,10,sf102x0,sf102x0O2-9_optsp_c1m2.out,C,0.240388,-0.230612,-0.344736,-0.114124,True,1.495640,...,0.808411,-0.036787,16,2,2,2,2,5.321000e-07,2.651000e-07,1
56,11,sf102x0,sf102x0O2-10_optsp_c1m2.out,C,0.015540,0.195531,0.170223,-0.025308,False,3.234220,...,-0.620300,-0.183944,16,1,2,2,2,1.168000e-07,2.651000e-07,1
57,12,sf102x0,sf102x0O2-11_optsp_c1m2.out,C,-0.015012,-0.016463,-0.010820,0.005643,False,3.213171,...,0.355302,-0.043957,16,1,2,2,2,1.880000e-07,2.651000e-07,1


In [31]:
rfc = RandomForestClassifier(n_estimators=1000, max_depth=100, class_weight={0:0.5, 1:0.5})
rfc.fit(X_train_group, y_train_group)
print('Accuracy of RFC on test set: {:.2f}'.format(rfc.score(X_test_group, y_test_group)))
print('Accuracy of RFC on training set: {:.2f}'.format(rfc.score(X_train_group, y_train_group)))
#scores = cross_val_score(rfc, X_, y, cv=10)
#print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy of RFC on test set: 0.95
Accuracy of RFC on training set: 1.00


In [36]:
y_pred = rfc.predict(X_test)

In [37]:
confusion_matrix(y_test, y_pred)

array([[334,   4],
       [ 15,  62]])

In [25]:
# Get indices of misclassified active sites
X_test_y = X_test.copy()
X_test_y["y_pred"] = y_pred
X_test_y["y_test"] = y_test

In [26]:
X_test_y

Unnamed: 0,SpinDensity,ChElPGNeutralCharge,ChargeDifference,IonizationEnergy,OrthoOrPara,Meta,FartherThanPara,DistanceToN,NumberOfHydrogens,IsInRingSize6,...,NeighborChargeDifference,AromaticExtent,RingEdge,NumNitrogens,NumHeteroatoms,charge,atom_plane_deviation,ring_plane_deviation,y_pred,y_test
3839,0.172929,0.087672,-0.066763,0.112169,1,0,0,3,0,1,...,-0.046795,10,1,1,2,0,2.894924e-04,8.253672e-02,0,0
1587,-0.094143,-0.045620,-0.004262,0.112467,1,0,0,3,1,1,...,-0.206556,22,2,1,1,0,1.246600e-06,6.195300e-06,0,1
3307,-0.001868,-0.113641,-0.046783,0.107378,1,0,0,3,1,1,...,-0.119995,14,2,1,2,0,1.558900e-06,5.014300e-06,0,0
2801,-0.016301,0.194077,0.018759,0.134957,1,0,0,3,0,1,...,-0.125213,26,1,2,2,1,1.257963e-01,3.325958e-02,0,0
690,-0.066284,-0.252968,-0.015700,0.116005,1,0,0,1,1,1,...,-0.112705,14,2,1,1,0,3.000000e-10,6.276000e-07,0,0
2151,0.131017,-0.146973,-0.058716,0.140242,0,1,0,2,1,1,...,-0.066581,22,2,2,2,1,6.190000e-08,2.141200e-06,0,0
45,0.103984,-0.306391,-0.070399,0.131547,1,0,0,3,1,1,...,-0.030891,26,2,2,2,1,1.487000e-07,2.599700e-06,0,0
1170,-0.083431,-0.178648,-0.028146,0.114104,1,0,0,3,1,1,...,-0.186119,14,2,1,1,0,4.286000e-07,8.880000e-08,0,0
4011,-0.004732,-0.224231,0.028031,0.112159,1,0,0,1,3,0,...,-0.105489,0,0,3,3,0,0.000000e+00,0.000000e+00,0,0
1388,-0.062734,0.346745,-0.000924,0.118019,0,0,1,4,0,1,...,-0.157521,9,2,1,1,0,4.273394e-04,8.318387e-02,0,0


In [27]:
dfy = df.merge(X_test_y[["y_pred", "y_test"]], how="inner", left_index=True, right_index=True)

In [28]:
dfy_miss = dfy[dfy["y_pred"] != dfy["y_test"]]
dfy_false_pos = dfy_miss[dfy_miss["y_pred"] == 1]
dfy_false_neg = dfy_miss[dfy_miss["y_pred"] == 0]

In [31]:
dfy_false_neg[["Atom Number", "Catalyst Name", "BindingEnergy", "IonizationEnergy", "SpinDensity", "AromaticExtent", "RingEdge", "DistanceToN"]].sort_values(by="Catalyst Name")

Unnamed: 0,Atom Number,Catalyst Name,BindingEnergy,IonizationEnergy,SpinDensity,AromaticExtent,RingEdge,DistanceToN,NumNitrogens,NumHeteroatoms
70,8,sf103x0,-0.148885,0.113956,-0.062347,16,2,1,2,2
398,12,sf130x0,-0.262186,0.116031,0.314554,18,1,1,1,1
685,8,sf145x0,-0.11233,0.116005,0.248376,14,2,2,1,1
905,8,sf155x0,-0.122158,0.115284,-0.059431,14,2,1,1,1
1068,3,sf163x0,-0.348189,0.109233,0.175958,18,2,3,1,1
1587,7,sf195x0,-0.176371,0.112467,-0.094143,22,2,3,1,1
1994,10,sf21x3,-0.312524,0.135329,0.021309,13,2,1,3,3
2201,6,sf237x0,-0.14172,0.101123,0.2309,10,2,2,1,1
2531,13,sf256x0,-0.180823,0.114215,-0.084342,18,2,1,1,1
2587,13,sf259x0,-0.108101,0.108938,0.278214,18,1,1,1,1


In [32]:
dfy_false_pos[["Atom Number", "Catalyst Name", "BindingEnergy", "IonizationEnergy", "SpinDensity", "Meta", "OrthoOrPara", "DistanceToN", "NumberOfHydrogens", "AromaticExtent", "RingEdge"]].sort_values(by="Catalyst Name")

Unnamed: 0,Atom Number,Catalyst Name,BindingEnergy,IonizationEnergy,SpinDensity,Meta,OrthoOrPara,DistanceToN,NumberOfHydrogens,AromaticExtent,RingEdge
304,9,sf126x0,0.0,0.104659,0.326079,0,1,3,0,17,2
2443,12,sf24x1,0.0,0.102481,0.217215,0,1,1,0,15,1
2953,4,sf276x0,-0.297833,0.101115,0.2494,0,1,1,0,17,1


In [83]:
mlp = MLPClassifier(max_iter=20000, hidden_layer_sizes = (400,400,200,100,100), alpha=0.1)
mlp.fit(X_train, y_train)
print('Accuracy of MLP classifier on test set: {:.2f}'.format(mlp.score(X_test, y_test)))
print('Accuracy of MLP classifier on training set: {:.2f}'.format(mlp.score(X_train, y_train)))
#scores = cross_val_score(mlp, X, y, cv=10)
#print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy of MLP classifier on test set: 0.95
Accuracy of MLP classifier on training set: 0.98


In [None]:
mlp = MLPClassifier(max_iter=15000, hidden_layer_sizes = (2048, 1024, 512, 15), alpha=0.1)
mlp.fit(X_train, y_train)
print('Accuracy of MLP classifier on test set: {:.2f}'.format(mlp.score(X_test, y_test)))
print('Accuracy of MLP classifier on training set: {:.2f}'.format(mlp.score(X_train, y_train)))
scores = cross_val_score(mlp, X, y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy of MLP classifier on test set: 0.96
Accuracy of MLP classifier on training set: 0.99
