1. Load mol objects and highlight sites that are being misclassified
2. Many of the false positives are tertiary carbon atoms with high spin density. Tertiary carbons usually don't bind as well, but the model can't distinguish tertiary carbons embedded in rings from those connected to methyl groups. This could deserve its own feature.
3. Many of the false negatives are in small molecules with multiple nitrogen atoms. Correcting the aromatic ring size feature could improve this. First check whether the RDKit feature is trustworthy; if not, I can just look for the full set of carbons with only 3 bonds, as these are necessarily sp2.
4. The aromatic ring size feature appears to be completely wrong, and I found at least one counterexample for OrthoOrPara. DistanceToN may obviate OrthoOrPara anyway.
5. sf189x0 has an active site at carbon 4 (binding energy -0.1 eV) and a symmetric site at carbon 21 that is not active. This likely illustrates that different conformers can subtly affect these energies.
6. Catalyst sf41x0 lists all binding sites as inactive, but some (possibly most) are actually active. If I move the correct files to the correct directory, can I just rerun Kunal's parsing scripts?
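
The fallback in note 3 (treating every carbon with exactly three bonded neighbors as sp2) could be sketched without RDKit roughly as follows. The `elements` and `bonds` inputs and the function name are hypothetical stand-ins for whatever the parsed geometry provides; they are not part of `ngcc_ml`. Indices here are 0-based, whereas "Atom Number" in the dataframe is 1-based.

```python
from collections import defaultdict


def three_coordinate_carbons(elements, bonds):
    """Return 0-based indices of carbon atoms bonded to exactly three neighbors."""
    neighbors = defaultdict(set)
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return [
        i for i, element in enumerate(elements)
        if element == "C" and len(neighbors[i]) == 3
    ]


# Hypothetical toy input: propene (CH2=CH-CH3) with explicit hydrogens.
elements = ["C", "C", "C", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (1, 5), (2, 6), (2, 7), (2, 8)]

sp2_like = three_coordinate_carbons(elements, bonds)
print(sp2_like)  # the two alkene carbons; the methyl carbon has four neighbors
```

This only counts neighbors, so it relies on the bond list including hydrogens.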

Score to beat: 93% accuracy with random forests.
Since 80% of sites are not active, always predicting "not active" already gives 80% accuracy, so 80% is the minimum a useful model has to exceed.
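
A minimal sketch of where that 80% floor comes from: the accuracy of a classifier that always predicts the most common class. The label list below is made up to match the 80/20 split mentioned above; it is not the real `Doesitbind` column.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (majority label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical labels: 80% inactive (0), 20% active (1).
labels = [0] * 80 + [1] * 20
label, baseline_accuracy = majority_baseline(labels)
print(label, baseline_accuracy)
```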

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

In [2]:
from ngcc_ml import data_tools



In [9]:
#df = pd.read_csv("/home/nricke/work/klodaya/notebooks_klodaya/DidItBindWithSubstructure.csv", index_col=0)
#df = pd.read_csv("/home/nricke/work/ngcc_ml/DidItBindv2.csv")
df = pd.read_csv("/home/nricke/work/ngcc_ml/DidItBindv4.csv")
#df_aug = pd.read_json("/home/nricke/work/klodaya/notebooks_klodaya/additional_aromatic_features.json")
#df_aug["Atom Number"] = df_aug["Atom Number"] + 1
#df = df.merge(df_aug, on=["Atom Number", "Catalyst Name"])

In [10]:
print(df.columns)

Index(['Atom Number', 'Catalyst Name', 'CatalystO2File', 'Element',
       'SpinDensity', 'ChElPGPositiveCharge', 'ChElPGNeutralCharge',
       'ChargeDifference', 'Doesitbind', 'BondLength', 'IonizedFreeEnergy',
       'IonizationEnergy', 'BindingEnergy', 'NeutralFreeEnergy', 'OrthoOrPara',
       'Meta', 'FartherThanPara', 'DistanceToN', 'AverageBondLength',
       'BondLengthRange', 'NumberOfHydrogens', 'AromaticSize', 'IsInRingSize6',
       'IsInRingSize5', 'NeighborSpinDensity', 'NeighborChElPGCharge',
       'NeighborChargeDifference', 'AromaticExtent', 'RingEdge',
       'NumNitrogens', 'NumHeteroatoms'],
      dtype='object')


In [24]:
df[["Catalyst Name", "CatalystO2File", "NumNitrogens", "NumHeteroatoms"]]

Unnamed: 0,Catalyst Name,CatalystO2File,NumNitrogens,NumHeteroatoms
0,sf100x0,,1,5
1,sf100x0,sf100x0O2-2_optsp_a0m2.out,1,5
2,sf100x0,,1,5
3,sf100x0,,1,5
4,sf100x0,sf100x0O2-5_optsp_a0m2.out,1,5
5,sf100x0,sf100x0O2-6_optsp_a0m2.out,1,5
6,sf100x0,,1,5
7,sf100x0,,1,5
8,sf100x0,,1,5
9,sf100x0,sf100x0O2-10_optsp_a0m2.out,1,5


In [15]:
feature_cols = ["SpinDensity", "ChElPGNeutralCharge", "ChargeDifference", "IonizationEnergy", "OrthoOrPara", "Meta", "FartherThanPara", "DistanceToN", "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "NeighborSpinDensity", 'NeighborChElPGCharge', 'NeighborChargeDifference', "AromaticExtent", "RingEdge", "NumNitrogens", "NumHeteroatoms"]
#feature_cols = ["SpinDensity", "ChElPGNeutralCharge", "ChargeDifference", "IonizationEnergy", "DistanceToN", "NumberOfHydrogens", "IsInRingSize6", "IsInRingSize5", "aromatic_extent", "ring_edge"]
X = df[feature_cols]
y = df["Doesitbind"].astype('int')

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

In [20]:
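# Note: equal class weights ({0: 0.5, 1: 0.5}) should behave like the default (no reweighting of the minority class)
rfc = RandomForestClassifier(n_estimators=100, max_depth=100, class_weight={0:0.5, 1:0.5})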
rfc.fit(X_train, y_train)
print('Accuracy of RFC on test set: {:.2f}'.format(rfc.score(X_test, y_test)))
print('Accuracy of RFC on training set: {:.2f}'.format(rfc.score(X_train, y_train)))
scores = cross_val_score(rfc, X, y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy of RFC on test set: 0.96
Accuracy of RFC on training set: 1.00
Accuracy: 0.95 (+/- 0.04)


In [25]:
y_pred = rfc.predict(X_test)

In [26]:
confusion_matrix(y_test, y_pred)

array([[337,   3],
       [ 12,  64]])
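
Because the classes are imbalanced, accuracy alone hides how the active class is doing. A short sketch that derives per-class metrics from the matrix above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`):

```python
# Confusion matrix copied from the cell output above.
cm = [[337, 3], [12, 64]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of the sites predicted active, how many are active
recall = tp / (tp + fn)      # of the truly active sites, how many were found
baseline = (tn + fp) / total  # accuracy of always predicting "not active"

print(f"accuracy  = {accuracy:.3f}")   # 0.964
print(f"precision = {precision:.3f}")  # 0.955
print(f"recall    = {recall:.3f}")     # 0.842
print(f"baseline  = {baseline:.3f}")   # 0.817
```

Most of the remaining error is in recall: 12 of the 76 active test sites are missed, against only 3 false positives.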

In [27]:
# Attach predictions and true labels to the test features so misclassified sites can be looked up by index
X_test_y = X_test.copy()
X_test_y["y_pred"] = y_pred
X_test_y["y_test"] = y_test

In [28]:
X_test_y

Unnamed: 0,SpinDensity,ChElPGNeutralCharge,ChargeDifference,IonizationEnergy,OrthoOrPara,Meta,FartherThanPara,DistanceToN,NumberOfHydrogens,IsInRingSize6,IsInRingSize5,NeighborSpinDensity,NeighborChElPGCharge,NeighborChargeDifference,AromaticExtent,RingEdge,NumNitrogens,NumHeteroatoms,y_pred,y_test
3983,0.098972,-0.302021,-0.065125,0.165386,0,1,0,2,1,0,1,0.103981,-0.008527,-0.169789,14,2,1,1,1,1
2816,-0.006381,-0.105725,-0.030465,0.111035,1,0,0,3,1,1,0,-0.007195,-0.150623,-0.030453,15,2,2,2,0,1
1021,-0.019794,-0.376019,-0.018956,0.106365,1,0,0,1,2,0,0,0.258707,0.968018,0.059035,0,0,1,1,0,0
3050,0.010031,-0.124953,-0.053670,0.128862,0,0,1,4,1,1,0,0.089869,-0.311567,-0.081164,21,2,2,2,0,0
3260,-0.084100,-0.121882,0.019673,0.112078,0,1,0,2,1,1,0,0.533958,-0.319428,-0.395693,14,2,1,1,0,0
1828,0.297703,-0.044029,-0.135487,0.114464,1,0,0,3,1,1,0,-0.229712,-0.258391,-0.005414,6,2,2,2,1,1
2513,0.208417,-0.151800,-0.061739,0.098780,1,0,0,1,1,1,0,-0.044998,0.161027,-0.082251,17,2,1,1,1,1
983,0.012926,-0.116644,-0.021900,0.106497,0,0,1,6,1,1,0,0.022355,-0.284113,-0.065247,18,2,1,1,0,0
3053,0.263084,-0.275052,-0.136534,0.128862,1,0,0,3,1,1,0,-0.199803,0.097995,0.015148,21,2,2,2,1,1
2618,0.204702,-0.108437,-0.138332,0.119088,1,0,0,1,1,1,0,-0.055856,0.050613,0.022151,10,2,2,2,1,1


In [29]:
dfy = df.merge(X_test_y[["y_pred", "y_test"]], how="inner", left_index=True, right_index=True)

In [30]:
dfy_miss = dfy[dfy["y_pred"] != dfy["y_test"]]
dfy_false_pos = dfy_miss[dfy_miss["y_pred"] == 1]
dfy_false_neg = dfy_miss[dfy_miss["y_pred"] == 0]

In [31]:
dfy_miss

Unnamed: 0,Atom Number,Catalyst Name,CatalystO2File,Element,SpinDensity,ChElPGPositiveCharge,ChElPGNeutralCharge,ChargeDifference,Doesitbind,BondLength,...,IsInRingSize5,NeighborSpinDensity,NeighborChElPGCharge,NeighborChargeDifference,AromaticExtent,RingEdge,NumNitrogens,NumHeteroatoms,y_pred,y_test
70,8,sf103x0,sf103x0O2-7_optsp_a0m2.out,C,-0.062347,0.328233,0.281964,-0.046269,True,1.525185,...,0,0.277207,-0.839831,-0.103196,16,2,2,2,0,1
304,9,sf126x0,,C,0.326079,0.219511,0.154627,-0.064884,False,0.0,...,0,-0.252743,-0.749409,-0.034507,17,2,2,2,1,0
398,12,sf130x0,sf130x0O2-11_optsp_a0m2.out,C,0.314554,0.272177,0.145634,-0.126543,True,1.600405,...,0,-0.178559,-0.214878,0.015018,18,1,1,1,0,1
545,15,sf138x0,sf138x0O2-14_optsp_a0m2.out,C,0.292932,0.103805,-0.080619,-0.184424,False,1.581486,...,0,-0.155924,0.389255,0.151319,18,1,1,1,1,0
1505,6,sf18x1,sf18x1O2-5_optsp_a0m2.out,C,0.054312,0.226299,0.083054,-0.143245,True,1.642008,...,0,0.206867,-0.04542,0.168694,13,1,2,2,0,1
1713,8,sf200x0,sf200x0O2-7_optsp_a0m2.out,C,0.242176,-0.248962,-0.393185,-0.144223,True,1.6069,...,0,-0.163826,0.39646,-0.052349,10,2,1,1,0,1
1764,1,sf205x0,sf205x0O2-0_optsp_c1m2.out,C,0.283434,0.034632,-0.099601,-0.134233,True,1.617366,...,0,-0.176707,-0.178069,-0.003648,6,2,2,2,0,1
1969,5,sf21x1,sf21x1O2-4_optsp_a0m2.out,C,0.207945,-0.118521,-0.253541,-0.13502,True,1.575613,...,0,-0.058889,-0.143143,-0.08209,13,2,3,3,0,1
2201,6,sf237x0,sf237x0O2-5_optsp_a0m2.out,C,0.2309,-0.113008,-0.229704,-0.116696,True,1.687505,...,0,-0.116452,-0.161595,-0.132929,10,2,1,1,0,1
2243,8,sf23x1,sf23x1O2-7_optsp_a0m2.out,C,0.292066,-0.328868,-0.444316,-0.115448,True,1.556999,...,0,-0.177536,0.541387,-0.039465,15,2,2,2,0,1


In [32]:
# Check whether atom numbering is 0-based or 1-based
df["Atom Number"].min()

1

In [42]:
dfy_false_neg[["Atom Number", "Catalyst Name", "BindingEnergy", "IonizationEnergy", "SpinDensity", "AromaticExtent", "RingEdge", "DistanceToN", "NumNitrogens", "NumHeteroatoms"]].sort_values(by="Catalyst Name")

Unnamed: 0,Atom Number,Catalyst Name,BindingEnergy,IonizationEnergy,SpinDensity,AromaticExtent,RingEdge,DistanceToN,NumNitrogens,NumHeteroatoms
70,8,sf103x0,-0.148885,0.113956,-0.062347,16,2,1,2,2
398,12,sf130x0,-0.262186,0.116031,0.314554,18,1,1,1,1
1505,6,sf18x1,-0.257201,0.105113,0.054312,13,1,1,2,2
1713,8,sf200x0,-0.210899,0.096401,0.242176,10,2,2,1,1
1764,1,sf205x0,-0.773228,0.093784,0.283434,6,2,3,2,2
1969,5,sf21x1,-0.136926,0.108858,0.207945,13,2,2,3,3
2201,6,sf237x0,-0.14172,0.101123,0.2309,10,2,2,1,1
2243,8,sf23x1,-0.114422,0.106642,0.292066,15,2,2,2,2
2509,7,sf254x0,-0.202684,0.09878,0.19092,17,2,4,1,1
2816,10,sf26x1,-0.589056,0.111035,-0.006381,15,2,3,2,2


In [36]:
dfy_false_pos[["Atom Number", "Catalyst Name", "BindingEnergy", "IonizationEnergy", "SpinDensity", "Meta", "OrthoOrPara", "DistanceToN", "NumberOfHydrogens", "AromaticExtent", "RingEdge"]].sort_values(by="Catalyst Name")

Unnamed: 0,Atom Number,Catalyst Name,BindingEnergy,IonizationEnergy,SpinDensity,Meta,OrthoOrPara,DistanceToN,NumberOfHydrogens,AromaticExtent,RingEdge
304,9,sf126x0,0.0,0.104659,0.326079,0,1,3,0,17,2
545,15,sf138x0,-0.091794,0.108152,0.292932,0,1,1,0,18,1
2953,4,sf276x0,-0.297833,0.101115,0.2494,0,1,1,0,17,1


In [40]:
df[df["Catalyst Name"] == "sf126x0"][["Atom Number", "CatalystO2File", "Doesitbind", "BindingEnergy"]]

Unnamed: 0,Atom Number,CatalystO2File,Doesitbind,BindingEnergy
297,1,,False,0.0
298,2,,False,0.0
299,3,,False,0.0
300,4,sf126x0O2-3_optsp_a0m2.out,False,2.813952
301,5,sf126x0O2-4_optsp_a0m2.out,True,-0.145957
302,7,sf126x0O2-6_optsp_a0m2.out,True,-0.867367
303,8,sf126x0O2-7_optsp_a0m2.out,False,3.507939
304,9,,False,0.0
305,10,,False,0.0
306,11,,False,0.0
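
In the table above, sites with no `CatalystO2File` are recorded as `Doesitbind = False` with a binding energy of exactly 0.0, so a site that was never computed looks the same as a confirmed non-binder (atom 9, one of the false positives, is such a row). A minimal sketch for flagging these rows, assuming they are available as plain dictionaries; the helper name is hypothetical and the sample rows are copied from the output above.

```python
def unverified_negatives(rows):
    """Return rows labelled non-binding that have no O2 output file behind them."""
    return [r for r in rows if not r["Doesitbind"] and not r["CatalystO2File"]]


# A few sf126x0 rows copied from the table above (None stands for a missing file).
rows = [
    {"Atom Number": 4, "CatalystO2File": "sf126x0O2-3_optsp_a0m2.out",
     "Doesitbind": False, "BindingEnergy": 2.813952},
    {"Atom Number": 5, "CatalystO2File": "sf126x0O2-4_optsp_a0m2.out",
     "Doesitbind": True, "BindingEnergy": -0.145957},
    {"Atom Number": 9, "CatalystO2File": None,
     "Doesitbind": False, "BindingEnergy": 0.0},
    {"Atom Number": 10, "CatalystO2File": None,
     "Doesitbind": False, "BindingEnergy": 0.0},
]

suspect_atoms = [r["Atom Number"] for r in unverified_negatives(rows)]
print(suspect_atoms)
```

On the dataframe itself the same filter would be a boolean mask along the lines of `df["CatalystO2File"].isna() & ~df["Doesitbind"]`.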


In [26]:
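# Note: this cell appears to be left over from an earlier run; "ring_edge" and "aromatic_extent" are not in the columns printed for DidItBindv4.csv, which has "RingEdge" and "AromaticExtent" instead
df[df["Catalyst Name"] == "sf133x0"][["Atom Number", "Doesitbind", "BindingEnergy", "ring_edge", "aromatic_extent"]]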

Unnamed: 0,Atom Number,Doesitbind,BindingEnergy,ring_edge,aromatic_extent
438,1,False,0.0,0,0
439,2,False,0.0,2,14
440,3,False,0.029823,2,14
441,4,False,0.0,2,14
442,5,False,0.0,1,14
443,6,False,0.0,2,14
444,7,False,0.0,1,14
445,8,False,0.142847,2,14
446,9,False,-0.0539,2,14
447,10,False,0.168937,2,14
