1. Load mol objects and highlight the sites that are being misclassified.
2. Many of the false positives are tertiary carbon atoms with high spin density. Tertiary carbons usually do not bind as well, but the model cannot distinguish tertiary carbons embedded in rings from those connected to methyl groups. This may deserve its own feature.
3. Many of the false negatives are in small molecules with multiple nitrogen atoms. Correcting the aromatic ring size feature could improve this. First check whether the RDKit feature is trustworthy; if not, just look for the full set of carbons with exactly 3 bonds, since these are necessarily sp2.
4. The aromatic ring size feature appears to be completely wrong, and I found at least one counterexample for OrthoOrPara. DistanceToN may obviate OrthoOrPara anyway.
5. sf189x0 has an active site at carbon 4 (binding energy -0.1 eV) and a symmetric site at carbon 21 that is not active. This likely illustrates that different conformers can subtly affect these energies.
6. Catalyst sf41x0 lists all binding sites as inactive, but some (possibly most) are actually active. If I move the correct files to the correct directory, can I just rerun Kunal's parsing scripts?
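
The fallback in note 3 (carbons with exactly three bonds) can be sketched without RDKit. This is only a minimal sketch under an assumed representation: a list of element symbols plus a list of bonded index pairs with hydrogens explicit. The names `elements`, `bonds`, and `three_coordinate_carbons` are hypothetical and do not come from the notebook or from `ngcc_ml`.

```python
from collections import Counter

def three_coordinate_carbons(elements, bonds):
    """Return 0-based indices of carbon atoms bonded to exactly three atoms."""
    degree = Counter()
    for i, j in bonds:
        # Each bond contributes one neighbor to both of its atoms
        degree[i] += 1
        degree[j] += 1
    return [idx for idx, el in enumerate(elements)
            if el == "C" and degree[idx] == 3]

# Toy example: propene, CH2=CH-CH3, with explicit hydrogens
elements = ["C", "C", "C", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (1, 5), (2, 6), (2, 7), (2, 8)]
sp2_candidates = three_coordinate_carbons(elements, bonds)
print(sp2_candidates)  # [0, 1]
```

The indices here are 0-based, so they would need a +1 shift before being compared with a 1-based `Atom Number` column.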

Score to beat: 93% accuracy with random forests.
Since roughly 80% of sites are not active, a model that always predicts "inactive" already scores about 80%, so that is the minimum threshold for success.

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

In [2]:
from ngcc_ml import data_tools



In [10]:
df = pd.read_csv("/home/nricke/work/klodaya/notebooks_klodaya/DidItBindWithSubstructure.csv", index_col=0)
df_aug = pd.read_json("/home/nricke/work/klodaya/notebooks_klodaya/additional_aromatic_features.json")

In [12]:
df_aug["Atom Number"] = df_aug["Atom Number"] + 1

In [16]:
df

Unnamed: 0,Atom Number,Catalyst Name,CatalystO2File,Element,SpinDensity,ChElPGPositiveCharge,ChElPGNeutralCharge,ChargeDifference,Doesitbind,BondLength,...,Substructure3,Substructure4,Substructure5,Substructure6,Substructure7,Substructure8,Substructure9,Substructure10,aromatic_extent,ring_edge
0,1,sf100x0,,C,-0.008245,-0.275350,-0.200227,0.075123,False,0.000000,...,0,0,0,0,0,0,0,0,0,0
1,3,sf100x0,D:\Kunal\Documents\MIT\allfiles\O2_binding_dat...,C,0.555664,-0.064043,-0.339572,-0.275529,True,1.535452,...,0,0,0,0,0,0,0,0,18,2
2,4,sf100x0,,C,-0.181519,0.037008,0.096249,0.059241,False,0.000000,...,0,0,0,0,0,0,0,0,18,1
3,5,sf100x0,,C,0.208580,-0.226453,-0.308850,-0.082397,False,0.000000,...,0,0,0,0,0,0,0,0,18,2
4,6,sf100x0,D:\Kunal\Documents\MIT\allfiles\O2_binding_dat...,C,-0.119560,0.176015,0.163894,-0.012121,False,1.701145,...,0,0,0,0,0,0,0,0,18,2
5,7,sf100x0,D:\Kunal\Documents\MIT\allfiles\O2_binding_dat...,C,0.221689,0.308617,0.215658,-0.092959,False,1.619080,...,0,0,0,0,0,0,0,0,18,2
6,8,sf100x0,,C,-0.138080,-0.337969,-0.357701,-0.019732,False,0.000000,...,0,0,0,0,0,0,0,0,18,2
7,9,sf100x0,,C,0.243169,0.054121,-0.032052,-0.086173,False,0.000000,...,0,0,0,0,0,0,0,0,18,1
8,10,sf100x0,,C,-0.012210,0.079364,0.079428,0.000064,False,0.000000,...,0,0,0,0,0,0,0,0,18,1
9,11,sf100x0,D:\Kunal\Documents\MIT\allfiles\O2_binding_dat...,C,-0.006535,0.093478,0.074304,-0.019174,False,1.633351,...,0,0,0,0,0,0,0,0,18,1


In [14]:
df_aug

Unnamed: 0,aromatic_extent,ring_edge,Atom Number,Catalyst Name
0,14,2,1,sf224x0
1,14,2,2,sf224x0
10,14,2,13,sf224x0
100,6,2,7,sf123x0
1000,18,2,2,sf162x0
1001,18,2,3,sf162x0
1002,18,2,4,sf162x0
1003,18,2,5,sf162x0
1004,18,1,7,sf162x0
1005,18,2,8,sf162x0


In [15]:
df = df.merge(df_aug, on=["Atom Number", "Catalyst Name"])
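
The default inner merge silently drops any site that has no match in `df_aug`. A minimal sketch of how to check for that, using made-up stand-in frames (`sites` and `extra` are hypothetical, not the notebook's data): a left merge with `indicator=True` flags unmatched rows, and `validate="one_to_one"` raises if a key pair is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for df and df_aug; values are invented for illustration
sites = pd.DataFrame({
    "Atom Number": [1, 2, 3],
    "Catalyst Name": ["sfA", "sfA", "sfB"],
    "SpinDensity": [0.1, 0.5, -0.2],
})
extra = pd.DataFrame({
    "Atom Number": [1, 2],
    "Catalyst Name": ["sfA", "sfA"],
    "ring_edge": [2, 1],
})

keys = ["Atom Number", "Catalyst Name"]
# Keep every site and record in the "_merge" column whether a match was found
checked = sites.merge(extra, on=keys, how="left",
                      indicator=True, validate="one_to_one")
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "site(s) without augmented features")
```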

In [5]:
df.columns

Index(['Atom Number', 'Catalyst Name', 'CatalystO2File', 'Element',
       'SpinDensity', 'ChElPGPositiveCharge', 'ChElPGNeutralCharge',
       'ChargeDifference', 'Doesitbind', 'BondLength', 'IonizedFreeEnergy',
       'IonizationEnergy', 'BindingEnergy', 'NeutralFreeEnergy', 'OrthoOrPara',
       'Meta', 'FartherThanPara', 'DistanceToN', 'AverageBondLength',
       'BondLengthRange', 'NumberOfHydrogens', 'AromaticSize', 'IsInRingSize6',
       'IsInRingSize5', 'NeighborSpinDensity', 'NeighborChElPGCharge',
       'NeighborChargeDifference', 'Substructure1', 'Substructure2',
       'Substructure3', 'Substructure4', 'Substructure5', 'Substructure6',
       'Substructure7', 'Substructure8', 'Substructure9', 'Substructure10'],
      dtype='object')

In [17]:
df.Doesitbind.value_counts()

False    3455
True      697
Name: Doesitbind, dtype: int64
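
The class counts above fix the accuracy of a trivial model that always predicts "does not bind". A short worked calculation, using only the two counts printed by `value_counts()`:

```python
# Counts copied from the value_counts() output above
n_inactive = 3455
n_active = 697

# Accuracy of always predicting the majority class (inactive)
baseline_accuracy = n_inactive / (n_inactive + n_active)
print(f"Majority-class baseline: {baseline_accuracy:.3f}")  # 0.832
```

Any classifier should be judged against this number rather than against 50%.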

In [9]:
df[df["Catalyst Name"] == "sf41x0"][["Atom Number", "Doesitbind"]]

Unnamed: 0,Atom Number,Doesitbind
3271,2,False
3272,3,False
3273,4,False
3274,5,False
3275,6,False
3276,7,False
3277,8,False
3278,9,False
3279,11,False


In [26]:
feature_cols = ["SpinDensity", "ChElPGNeutralCharge", "ChargeDifference", "IonizationEnergy", "OrthoOrPara", "Meta", "FartherThanPara", "DistanceToN", "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "NeighborSpinDensity", 'NeighborChElPGCharge', 'NeighborChargeDifference', "aromatic_extent", "ring_edge"]
#feature_cols = ["SpinDensity", "ChElPGNeutralCharge", "ChargeDifference", "IonizationEnergy", "DistanceToN", "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "aromatic_extent", "ring_edge"]
X = df[feature_cols]
y = df["Doesitbind"].astype('int')

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [28]:
X_train

Unnamed: 0,SpinDensity,ChElPGNeutralCharge,ChargeDifference,IonizationEnergy,OrthoOrPara,Meta,FartherThanPara,DistanceToN,NumberOfHydrogens,IsInRingSize6,IsInRingSize5,NeighborSpinDensity,NeighborChElPGCharge,NeighborChargeDifference,aromatic_extent,ring_edge
3683,0.000056,-0.187646,0.003891,0.149713,0,0,1,5,2,0,0,1.690000e-21,0.169684,0.002991,0,0
688,-0.094708,-0.078123,-0.012205,0.116005,1,0,0,3,1,1,0,3.217150e-01,-0.367552,-0.204274,14,2
317,0.134705,-0.027722,-0.098825,0.112043,1,0,0,3,0,1,0,2.749800e-02,-0.114973,-0.037510,18,1
1853,-0.152223,-0.132049,0.012878,0.108346,0,1,0,2,1,1,0,8.390420e-01,-0.386341,-0.566982,6,2
794,0.038993,-0.267240,-0.027223,0.128954,0,0,1,4,1,1,0,-7.904300e-02,0.285712,0.009520,18,2
3177,-0.012113,-0.404354,-0.005856,0.129286,1,0,0,1,3,0,0,0.000000e+00,0.000000,0.000000,0,0
3969,0.065516,-0.407100,-0.073372,0.110611,0,0,1,4,1,1,0,7.328500e-02,0.485605,-0.092185,14,2
3213,0.460940,-0.264845,-0.198712,0.128019,1,0,0,3,1,1,0,-3.270260e-01,0.108945,0.007336,23,2
620,-0.077338,-0.195221,0.005302,0.111298,0,0,1,6,1,1,0,2.558460e-01,-0.024620,-0.138818,22,2
1767,0.135867,-0.178886,-0.140003,0.093784,1,0,0,1,1,1,0,3.005800e-02,0.286464,0.042051,6,2


In [29]:
# Note: {0: 0.5, 1: 0.5} weights both classes equally, so it does not compensate for the class imbalance
rfc = RandomForestClassifier(n_estimators=100, max_depth=100, class_weight={0:0.5, 1:0.5})
rfc.fit(X_train, y_train)
print('Accuracy of RFC on test set: {:.2f}'.format(rfc.score(X_test, y_test)))
print('Accuracy of RFC on training set: {:.2f}'.format(rfc.score(X_train, y_train)))
# 5-fold cross-validation on the full data set, reported as mean +/- 2 standard deviations
scores = cross_val_score(rfc, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy of RFC on test set: 0.93
Accuracy of RFC on training set: 1.00
Accuracy: 0.94 (+/- 0.04)
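
Because most sites are inactive, plain accuracy can look good even when active sites are missed. One possible variation, sketched here on synthetic data rather than the notebook's table, is to weight classes inversely to their frequency and score with balanced accuracy. The data set, the hyperparameters, and the names `X_demo`, `y_demo`, and `bal_acc` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (about 83% negatives); not the notebook's data
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.83], random_state=0)

# "balanced" reweights each class inversely to its frequency
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                             random_state=0)
# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
bal_acc = cross_val_score(clf, X_demo, y_demo, cv=cv,
                          scoring="balanced_accuracy")
print("Balanced accuracy: %0.2f (+/- %0.2f)" % (bal_acc.mean(), bal_acc.std() * 2))
```

Balanced accuracy averages the recall of each class, so a model that always predicts "inactive" scores 0.5 rather than about 0.8.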


In [None]:
y_pred = rfc.predict(X_test)

In [None]:
confusion_matrix(y_test, y_pred)
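
For reading the matrix: scikit-learn puts true labels on the rows and predicted labels on the columns, so for binary labels 0/1 the flattened order is TN, FP, FN, TP. A toy example with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative only):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration: 0 = does not bind, 1 = binds
y_true_demo = [0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 2 1 1 1
```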

In [None]:
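# Attach predictions and true labels to the test features (aligned on the DataFrame index)
# so that misclassified sites can be traced back to rows of df
X_test_y = X_test.copy()
X_test_y["y_pred"] = y_pred
X_test_y["y_test"] = y_test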

In [None]:
X_test_y

In [None]:
dfy = df.merge(X_test_y[["y_pred", "y_test"]], how="inner", left_index=True, right_index=True)

In [None]:
dfy_miss = dfy[dfy["y_pred"] != dfy["y_test"]]
dfy_false_pos = dfy_miss[dfy_miss["y_pred"] == 1]
dfy_false_neg = dfy_miss[dfy_miss["y_pred"] == 0]

In [None]:
dfy_miss

In [None]:
dfy_miss.loc[70]

In [None]:
686.93232484 - 686.81836881

In [None]:
dfy_false_neg[["Atom Number", "AromaticSize", "Catalyst Name", "BindingEnergy", "IonizationEnergy", "SpinDensity"]].sort_values(by="Catalyst Name")

In [None]:
dfy_false_pos[["Atom Number", "AromaticSize", "Catalyst Name", "BindingEnergy", "IonizationEnergy", "SpinDensity", "Meta", "OrthoOrPara", "DistanceToN", "NumberOfHydrogens"]].sort_values(by="Catalyst Name")

In [None]:
df[df["Catalyst Name"] == "sf41x0"][["Atom Number", "Doesitbind", "BindingEnergy"]]