# Random Forest models

## DATASETS:
(a) Carbonic anhydrase II (CHEMBL205), a protein lyase,  
(b) Cyclin-dependent kinase 2 (CHEMBL301), a protein kinase,  
(c) ether-a-go-go-related gene potassium channel 1 (HERG) (CHEMBL240), a voltage-gated ion channel,  
(d) Dopamine D4 receptor (CHEMBL219), a monoamine GPCR,  
(e) Coagulation factor X (CHEMBL244), a serine protease,  
(f) Cannabinoid CB1 receptor (CHEMBL218), a lipid-like GPCR and  
(g) Cytochrome P450 19A1 (CHEMBL1978), a cytochrome P450.  
The activity classes were selected based on data availability and as representatives of therapeutically important target classes or as anti-targets.
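
For cells that loop over several targets, the list above can be kept as a small lookup table. The sketch below is only a convenience; the dictionary name `TARGETS` is an assumption and does not appear in the original notebook.

```python
# Hypothetical lookup table (the name `TARGETS` is not from the original notebook):
# ChEMBL target ID -> (target name, target class), copied from the list above.
TARGETS = {
    "CHEMBL205": ("Carbonic anhydrase II", "protein lyase"),
    "CHEMBL301": ("Cyclin-dependent kinase 2", "protein kinase"),
    "CHEMBL240": ("hERG potassium channel", "voltage-gated ion channel"),
    "CHEMBL219": ("Dopamine D4 receptor", "monoamine GPCR"),
    "CHEMBL244": ("Coagulation factor X", "serine protease"),
    "CHEMBL218": ("Cannabinoid CB1 receptor", "lipid-like GPCR"),
    "CHEMBL1978": ("Cytochrome P450 19A1", "cytochrome P450"),
}

for chembl_id, (name, target_class) in TARGETS.items():
    print(f"{chembl_id}: {name} ({target_class})")
```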

In [1]:
!nvidia-smi

Tue Apr 26 11:44:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.172.01   Driver Version: 450.172.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   42C    P0   229W / 300W |   4091MiB / 32505MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   41C    P0    39W / 300W |      9MiB / 32508MiB |      0%      Default |
|       

In [2]:
# Import
import pandas as pd
import numpy as np
from pathlib import Path

In [3]:
from rdkit import Chem
from rdkit.Chem import AllChem

[11:44:15] Enabling RDKit 2019.09.3 jupyter extensions


In [4]:
path = Path('../dataset/13321_2017_226_MOESM1_ESM/')
#df = pd.read_csv('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205_cl.csv', index_col=0)

In [5]:
#df.head()
list(path.iterdir())

[PosixPath('../dataset/13321_2017_226_MOESM1_ESM/mol_images'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL218'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL219'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL240'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL244'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL301'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL205'),
 PosixPath('../dataset/13321_2017_226_MOESM1_ESM/CHEMBL1978')]

# Run the functions on a file from dataset and store the results

In [6]:
dataset='CHEMBL205'

In [11]:
df = pd.read_csv(path/f'{dataset}/{dataset}_ecfp_1024_train_valid.csv')

In [59]:
df.head()

Unnamed: 0,CID,SMILES,ECFP4_1,ECFP4_2,ECFP4_3,ECFP4_4,ECFP4_5,ECFP4_6,ECFP4_7,ECFP4_8,...,ECFP4_1017,ECFP4_1018,ECFP4_1019,ECFP4_1020,ECFP4_1021,ECFP4_1022,ECFP4_1023,ECFP4_1024,Activity,is_valid
0,CHEMBL1589687,S1c2n(ncn2)C(O)=C1C([NH+]1CCc2c(C1)cccc2)c1ccc...,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,False
1,CHEMBL3092937,S(=O)(=O)(N)c1cc(ccc1)-c1nnn(c1)C1OC(COC(=O)C)...,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,1,False
2,CHEMBL325684,O=C1N(Cc2ccc(cc2)-c2ccccc2C(=O)[O-])C(=NC12CC2...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,False
3,CHEMBL488713,Fc1cc(F)c(F)cc1CC([NH3+])CC(=O)N1N=CCC1C(=O)Nc...,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,False
4,CHEMBL2069846,Fc1cc(F)c(F)cc1CC([NH3+])CC(=O)N1CCN(CC1)C(=O)...,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,False


# Split data

In [40]:
# Split into X,y
X, y = df.drop(["CID", "SMILES", "Activity", 'is_valid'], axis=1), df["Activity"]

In [41]:
# check info of dataframe
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10764 entries, 0 to 10763
Columns: 1024 entries, ECFP4_1 to ECFP4_1024
dtypes: int64(1024)
memory usage: 84.1 MB


In [60]:
X.head()

Unnamed: 0,ECFP4_1,ECFP4_2,ECFP4_3,ECFP4_4,ECFP4_5,ECFP4_6,ECFP4_7,ECFP4_8,ECFP4_9,ECFP4_10,...,ECFP4_1015,ECFP4_1016,ECFP4_1017,ECFP4_1018,ECFP4_1019,ECFP4_1020,ECFP4_1021,ECFP4_1022,ECFP4_1023,ECFP4_1024
0,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
4,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
# y is a pandas series
y.head(), y.size, type(y)

(0    0
 1    0
 2    0
 3    0
 4    1
 Name: Activity, dtype: int64,
 3589,
 pandas.core.series.Series)

# Train test split

In [62]:
from sklearn.model_selection import train_test_split, KFold

In [63]:
# regular train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# 5-Fold Cross Validation

In [64]:
#5-fold
kf = KFold(n_splits=5, shuffle=True, random_state=999)

In [65]:
# append to a list / could write to csv file to keep integrity
X_train_list, X_valid_list, y_train_list, y_valid_list = [], [], [], []

for train_index, valid_index in kf.split(X_train):
    X_train_list.append(X_train.iloc[train_index])
    X_valid_list.append(X_train.iloc[valid_index])
    y_train_list.append(y_train.iloc[train_index])
    y_valid_list.append(y_train.iloc[valid_index]) 

In [66]:
y_train_list[0].head()

1129    0
157     0
2975    0
1981    0
2916    1
Name: Activity, dtype: int64

In [67]:
X_train_list[1].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2297 entries, 1129 to 2284
Columns: 1024 entries, ECFP4_1 to ECFP4_1024
dtypes: int64(1024)
memory usage: 18.0 MB


In [68]:
# TODO: add splits to csv file
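
One possible way to address the TODO is to store, for every training row, the fold in which it is held out. The sketch below uses a toy frame in place of `X_train`; the file name `toy_folds.csv` and the column names `row` and `fold` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for X_train: 10 rows with non-contiguous index labels.
X_toy = pd.DataFrame({"ECFP4_1": np.zeros(10, dtype=int)}, index=range(100, 110))

kf = KFold(n_splits=5, shuffle=True, random_state=999)

# One entry per sample: the fold number in which that sample is used for validation.
fold_of = pd.Series(-1, index=X_toy.index, name="fold")
for fold, (_, valid_index) in enumerate(kf.split(X_toy)):
    fold_of.iloc[valid_index] = fold  # valid_index holds positions, hence iloc

# Hypothetical output location; the notebook does not define one.
out_file = Path(tempfile.gettempdir()) / "toy_folds.csv"
fold_of.to_csv(out_file, index_label="row")

# Reloading gives back the same assignment, keyed by the original index labels.
reloaded = pd.read_csv(out_file, index_col="row")["fold"]
print(reloaded)
```

Keeping the original index labels in the file means the folds can be rebuilt later with `X_train.loc[...]`, independent of the random state.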

# Random Forest

In [69]:
from sklearn.ensemble import RandomForestClassifier

In [70]:
# `auc` is not imported: it was unused and its name is reused inside train_rf
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

In [71]:
# Train a Random Forest on (X_train, y_train) and evaluate it on (X_test, y_test)
def train_rf(X_train, X_test, y_train, y_test, n_estimators=200, criterion='entropy', max_features='sqrt'):

    rf = RandomForestClassifier(n_estimators=n_estimators, criterion=criterion, min_samples_split=2,
                                max_features=max_features, max_leaf_nodes=None, bootstrap=False,
                                oob_score=False, n_jobs=-1, random_state=69)

    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    # Probability of the active class (label 1), used for ROC AUC
    y_score = rf.predict_proba(X_test)[:, 1]

    roc_auc = roc_auc_score(y_test, y_score)
    acc = accuracy_score(y_test, y_pred)
    mcc = matthews_corrcoef(y_test, y_pred)
    recall = recall_score(y_test, y_pred, pos_label=1)
    precision = precision_score(y_test, y_pred, pos_label=1)
    f1 = f1_score(y_test, y_pred, pos_label=1)

    return roc_auc, acc, mcc, recall, precision, f1, rf
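
The Matthews correlation coefficient returned by `train_rf` is more informative than accuracy when inactive compounds dominate. As a minimal sketch, the function below computes it from confusion-matrix counts; the helper name `mcc_from_counts` and the counts are made up for illustration.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up example: 1000 compounds, 90 of them active.
# A model that predicts "inactive" for everything is 91% accurate but has MCC 0.
always_inactive_acc = 910 / 1000
always_inactive_mcc = mcc_from_counts(tp=0, tn=910, fp=0, fn=90)

# A model that finds 70 of the 90 actives with 10 false positives.
useful_acc = (70 + 900) / 1000
useful_mcc = mcc_from_counts(tp=70, tn=900, fp=10, fn=20)

print(always_inactive_acc, always_inactive_mcc)
print(useful_acc, round(useful_mcc, 3))
```

The accuracies of the two models differ by only six points, while the MCC separates them clearly.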

In [74]:
def train_on_dataset(dataset, bits=1024, n_estimators=200, criterion='entropy', max_features='log2'):
    
    print(f'Training on dataset: {dataset} with {bits} bits fingerprint features')
    
    df = pd.read_csv(path/f'{dataset}/{dataset}_ecfp_{bits}_train_valid.csv')
    X, y = df.drop(["CID", "SMILES", "Activity", 'is_valid'], axis=1), df["Activity"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=666)
    kf = KFold(n_splits=5, shuffle=True, random_state=999)
    X_train_list, X_valid_list, y_train_list, y_valid_list = [], [], [], []

    for train_index, valid_index in kf.split(X_train):
        X_train_list.append(X_train.iloc[train_index])
        X_valid_list.append(X_train.iloc[valid_index])
        y_train_list.append(y_train.iloc[train_index])
        y_valid_list.append(y_train.iloc[valid_index]) 
    aucs, accs, mccs, recalls, precs, f1_scores = [], [], [], [], [], []
    for i in range(0,5):
        X_train = X_train_list[i]
        X_valid = X_valid_list[i]
        y_train = y_train_list[i]
        y_valid = y_valid_list[i]
        auc,acc2,mcc,Recall,Precision,F1_score, rf = train_rf(X_train, X_valid, y_train, y_valid, 
                                                      n_estimators=n_estimators, criterion=criterion, 
                                                              max_features=max_features)
        aucs.append(auc)
        accs.append(acc2)
        mccs.append(mcc)  # appended once per fold (the duplicate append was removed)
        recalls.append(Recall)
        precs.append(Precision)
        f1_scores.append(F1_score)
        
    print(f"Average ROCAUC of the folds: {np.mean(aucs)}")
    print(f"Average accuracy of the folds: {np.mean(accs)}")
    print(f"Average Matthews correlation of the folds: {np.mean(mccs)}")
    print(f"Average recall of the folds: {np.mean(recalls)}")
    print(f"Average precision of the folds: {np.mean(precs)}")
    print(f"Average f1 score of the folds: {np.mean(f1_scores)}")
    print()
    score = []
    score.append(np.mean(aucs))
    score.append(np.mean(accs))
    score.append(np.mean(mccs))
    score.append(np.mean(recalls))
    score.append(np.mean(precs))
    score.append(np.mean(f1_scores))
    score = np.mean(score)
    mean_mcc = np.mean(mccs)
    return score, mean_mcc, rf

In [75]:
_, _, rf = train_on_dataset(dataset)

Training on dataset: CHEMBL205 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9847830504372403
Average accuracy of the folds: 0.9698056082476205
Average Matthews correlation of the folds: 0.8080068522303454
Average recall of the folds: 0.75614996941153
Average precision of the folds: 0.8993212681002388
Average f1 score of the folds: 0.8195126946582214



# Test

In [76]:
df_test = pd.read_csv(path/f'{dataset}/{dataset}_ecfp_1024_test1.csv')
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3589 entries, 0 to 3588
Columns: 1027 entries, CID to Activity
dtypes: int64(1025), object(2)
memory usage: 28.1+ MB


In [77]:
# Split into X,y
X, y = df_test.drop(["CID", "SMILES", "Activity"], axis=1), df_test["Activity"]

In [83]:
y_pred= rf.predict(X)
len(y_pred)

3589

In [85]:
y_pred_prob=rf.predict_proba(X)
y_pred_prob

array([[0.975, 0.025],
       [0.99 , 0.01 ],
       [0.95 , 0.05 ],
       ...,
       [0.99 , 0.01 ],
       [0.3  , 0.7  ],
       [0.975, 0.025]])

In [90]:
preds = pd.DataFrame()
preds['class'] = y
preds['predictions'] = list(y_pred_prob)
preds.head()

Unnamed: 0,class,predictions
0,0,"[0.975, 0.025]"
1,0,"[0.99, 0.01]"
2,0,"[0.95, 0.05]"
3,0,"[0.995, 0.005]"
4,1,"[0.105, 0.895]"


In [91]:
preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3589 entries, 0 to 3588
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class        3589 non-null   int64 
 1   predictions  3589 non-null   object
dtypes: int64(1), object(1)
memory usage: 56.2+ KB


In [92]:
preds = preds[0:-1]
preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3588 entries, 0 to 3587
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class        3588 non-null   int64 
 1   predictions  3588 non-null   object
dtypes: int64(1), object(1)
memory usage: 56.2+ KB


In [93]:
preds.to_csv(path/f'{dataset}/{dataset}_predictions_RF.csv', index=False)

# Test for different parameters

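**Results:** The best results on dataset CHEMBL205 were obtained with n_estimators=200, criterion=entropy, max_features=log2 \
**Note:** The best setting depends on the dataset; the grid search on CHEMBL218 below selects criterion=entropy, max_features=sqrt and n_estimators=200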

In [12]:
from sklearn.model_selection import ParameterGrid

In [79]:
dataset = 'CHEMBL218'

In [80]:
param_grid = {
    'n_estimators': [100,200,300,700, 1000], 
    'criterion': ['gini', 'entropy'],
    'max_features': ['log2', 'sqrt']
             }

param_grid = ParameterGrid(param_grid)

In [81]:
mean_scores = []
mean_mccs = []
for setting in param_grid:
    print(f"Testing combination: {setting}")
    # train_on_dataset returns (score, mean_mcc, rf); the fitted model is not needed here
    score, mean_mcc, _ = train_on_dataset(dataset, 
                     bits=1024, 
                     n_estimators=setting['n_estimators'],
                     criterion=setting['criterion'],
                     max_features=setting['max_features']
                    )
    mean_scores.append(score)
    mean_mccs.append(mean_mcc)
for setting, mean_mcc in zip(param_grid, mean_mccs):
    #print(f'Mean score for {setting} is {mean_scores[i]}')
    print(f'Mean mcc for {setting} is {mean_mcc}')


Testing combination: {'criterion': 'gini', 'max_features': 'log2', 'n_estimators': 100}
Training on dataset: CHEMBL218 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9906456212905432
Average accuracy of the folds: 0.985263470213134
Average Matthews correlation of the folds: 0.9068313007508195
Average recall of the folds: 0.862871754615148
Average precision of the folds: 0.9695692834443944
Average f1 score of the folds: 0.9128635591089667

Testing combination: {'criterion': 'gini', 'max_features': 'log2', 'n_estimators': 200}
Training on dataset: CHEMBL218 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9923972977752629
Average accuracy of the folds: 0.9857414626638805
Average Matthews correlation of the folds: 0.9097014786781694
Average recall of the folds: 0.8642661642814689
Average precision of the folds: 0.9734985331742154
Average f1 score of the folds: 0.9153962726162158

Testing combination: {'criterion': 'gini', 'max_features': 'log2', 'n_es

Average ROCAUC of the folds: 0.9923290322458165
Average accuracy of the folds: 0.9856616861903242
Average Matthews correlation of the folds: 0.9102269705190531
Average recall of the folds: 0.8736827680279184
Average precision of the folds: 0.9644779414620579
Average f1 score of the folds: 0.9165346876256922

Testing combination: {'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 1000}
Training on dataset: CHEMBL218 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9922815098635283
Average accuracy of the folds: 0.9856617179232096
Average Matthews correlation of the folds: 0.910092985618809
Average recall of the folds: 0.8726793292771433
Average precision of the folds: 0.96532688238535
Average f1 score of the folds: 0.916312250419676

Mean mcc for {'criterion': 'gini', 'max_features': 'log2', 'n_estimators': 100} is 0.9068313007508195
Mean mcc for {'criterion': 'gini', 'max_features': 'log2', 'n_estimators': 200} is 0.9097014786781694
Mean mcc for {'criter

In [82]:
i = 0
for setting in param_grid:
    if mean_mccs[i] == np.amax(mean_mccs):
        print(f'Highest score is: {mean_mccs[i]} from: {setting}')
    i += 1

Highest score is: 0.9109375467409876 from: {'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 200}
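
The same selection can be written without a manual counter by pairing each setting with its score. The sketch below uses only the four settings whose mean MCC is visible in the output above, so it covers a subset of the full grid.

```python
# Settings and mean MCC values copied from the (partly truncated) output above.
settings = [
    {"criterion": "gini", "max_features": "log2", "n_estimators": 100},
    {"criterion": "gini", "max_features": "log2", "n_estimators": 200},
    {"criterion": "entropy", "max_features": "sqrt", "n_estimators": 1000},
    {"criterion": "entropy", "max_features": "sqrt", "n_estimators": 200},
]
mean_mccs = [
    0.9068313007508195,
    0.9097014786781694,
    0.910092985618809,
    0.9109375467409876,
]

# Pick the (setting, score) pair with the highest mean MCC.
best_setting, best_mcc = max(zip(settings, mean_mccs), key=lambda pair: pair[1])
print(f"Highest score is: {best_mcc} from: {best_setting}")
```

The spread between these four settings is about 0.004 MCC, so the ranking may change with a different random seed.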


# 1024 vs 512 bit fingerprint

**Results:** On CHEMBL205, 1024-bit fingerprints score slightly higher than 512-bit ones on every metric except precision, and usually take longer to train

In [71]:
dataset = 'CHEMBL205'

In [72]:
train_on_dataset(dataset, bits=512)
train_on_dataset(dataset, bits=1024)

Training on dataset: CHEMBL205 with 512 bits fingerprint features
Average ROCAUC of the folds: 0.986635283838058
Average accuracy of the folds: 0.9729657070535278
Average Matthews correlation of the folds: 0.8328577809532958
Average recall of the folds: 0.8205780032982327
Average precision of the folds: 0.8757370373103303
Average f1 score of the folds: 0.846947314852844

Training on dataset: CHEMBL205 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9874170906586827
Average accuracy of the folds: 0.9731515374932875
Average Matthews correlation of the folds: 0.8347839190949525
Average recall of the folds: 0.8277251965140863
Average precision of the folds: 0.8719397932002083
Average f1 score of the folds: 0.8490316339410597



0.8906748618170462

# Train on all datasets

In [13]:
datasets = ['CHEMBL205', 'CHEMBL301', 
            'CHEMBL240', 'CHEMBL219', 
            'CHEMBL244', 'CHEMBL218', 
            'CHEMBL1978']

In [14]:
top_mcc_scores = {
    
    'CHEMBL205': 0.862,
    'CHEMBL301': 0.926,
    'CHEMBL240': 0.884,
    'CHEMBL219': 0.887,
    'CHEMBL244': 0.983,
    'CHEMBL218': 0.941,
    'CHEMBL1978': 0.904}


In [101]:
mccs = []

In [102]:
for dataset in datasets:
    # train_on_dataset returns (score, mean_mcc, rf); only the mean MCC is kept
    _, mcc, _ = train_on_dataset(dataset, bits=512)
    mccs.append(mcc)

Training on dataset: CHEMBL205 with 512 bits fingerprint features
Average ROCAUC of the folds: 0.9876839623265227
Average accuracy of the folds: 0.9721297506548311
Average Matthews correlation of the folds: 0.8252921642404335
Average recall of the folds: 0.7932998688175008
Average precision of the folds: 0.8901113017236748
Average f1 score of the folds: 0.8387525257174839

Training on dataset: CHEMBL301 with 512 bits fingerprint features
Average ROCAUC of the folds: 0.9871230843054807
Average accuracy of the folds: 0.9825910398115102
Average Matthews correlation of the folds: 0.8894002637213945
Average recall of the folds: 0.8159172403162784
Average precision of the folds: 0.9888888888888889
Average f1 score of the folds: 0.8936870469862273

Training on dataset: CHEMBL240 with 512 bits fingerprint features
Average ROCAUC of the folds: 0.9759656896211462
Average accuracy of the folds: 0.9686147186147187
Average Matthews correlation of the folds: 0.7869141940048865
Average recall of the 

In [103]:
mccs = [round(num, 3) for num in mccs]
mcc_scores = dict(zip(datasets, mccs))
mcc_scores

{'CHEMBL205': 0.825,
 'CHEMBL301': 0.889,
 'CHEMBL240': 0.787,
 'CHEMBL219': 0.834,
 'CHEMBL244': 0.966,
 'CHEMBL218': 0.897,
 'CHEMBL1978': 0.865}

In [104]:
top_mcc_scores

{'CHEMBL205': 0.862,
 'CHEMBL301': 0.926,
 'CHEMBL240': 0.884,
 'CHEMBL219': 0.887,
 'CHEMBL244': 0.983,
 'CHEMBL218': 0.941,
 'CHEMBL1978': 0.904}
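
To compare the two dictionaries directly, the gap per target can be computed as below. This is a small sketch using the values printed above; the notebook does not say where the reference values in `top_mcc_scores` come from, and the Random Forest scores are those of the 512-bit run.

```python
# Values copied from the two outputs above.
top_mcc_scores = {
    "CHEMBL205": 0.862, "CHEMBL301": 0.926, "CHEMBL240": 0.884,
    "CHEMBL219": 0.887, "CHEMBL244": 0.983, "CHEMBL218": 0.941,
    "CHEMBL1978": 0.904,
}
mcc_scores = {
    "CHEMBL205": 0.825, "CHEMBL301": 0.889, "CHEMBL240": 0.787,
    "CHEMBL219": 0.834, "CHEMBL244": 0.966, "CHEMBL218": 0.897,
    "CHEMBL1978": 0.865,
}

# Gap between the reference MCC and the Random Forest MCC for each target.
gaps = {name: round(top_mcc_scores[name] - mcc_scores[name], 3) for name in mcc_scores}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: RF {mcc_scores[name]:.3f} vs reference {top_mcc_scores[name]:.3f} (gap {gap:.3f})")

largest = max(gaps, key=gaps.get)
smallest = min(gaps, key=gaps.get)
```

The Random Forest is below the reference on every target, with the largest gap on CHEMBL240 and the smallest on CHEMBL244.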

# RF with clustering

In [32]:
dataset = 'CHEMBL205'

In [58]:
df = pd.read_csv(path/f'{dataset}_cl_ECFP_1024_with_100_clusters.csv')

In [59]:
df.head()

Unnamed: 0,Name,SMILES,Cluster,ECFP4_1,ECFP4_2,ECFP4_3,ECFP4_4,ECFP4_5,ECFP4_6,ECFP4_7,...,ECFP4_1016,ECFP4_1017,ECFP4_1018,ECFP4_1019,ECFP4_1020,ECFP4_1021,ECFP4_1022,ECFP4_1023,ECFP4_1024,Activity
0,CHEMBL188002,S(=O)(=O)(N)c1cc(N/C(/S)=N\c2cc(C(=O)[O-])c(cc...,28,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,CHEMBL364127,Clc1ccc(cc1)C(=O)NC1Cc2cc(S(=O)(=O)N)ccc2C1,28,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
2,CHEMBL1683469,S(=O)(=O)(N)c1ccc(cc1)CNS(=O)(=O)CC12CCC(CC1=O...,93,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
3,CHEMBL52564,Oc1ccccc1\C=C\C(=O)[O-],14,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,CHEMBL21427,OB(O)c1ccc(OC)cc1,32,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [60]:
df.Cluster.unique()

array([28, 93, 14, 32, 69, 24, 95, 11, 77, 72, 75, 31, 23, 84, 34, 92, 47,
        1, 20, 13, 37, 15,  4, 65, 18, 16, 81, 94,  5, 12, 39, 62, 71, 43,
       44, 46,  7,  2, 86, 35, 73, 38, 48, 33, 64, 41, 22, 27, 50, 89, 61,
       80, 19, 67, 10, 91, 99, 26,  6, 53, 21, 68, 78, 74, 57, 63, 17, 55,
       98, 79, 58, 25,  3, 85, 88, 82,  8, 36, 83,  9, 97, 45, 56, 29, 90,
       66, 70, 51, 42, 49, 52, 40, 96, 60, 30, 87, 59, 76,  0, 54])

In [61]:
values = df.Cluster.value_counts(ascending=True)

In [62]:
values = values[values < 2].index

In [63]:
list(values)

[0, 66, 83, 87, 90, 97, 59, 65, 49, 76, 29, 45, 3]

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17941 entries, 0 to 17940
Columns: 1028 entries, Name to Activity
dtypes: int64(1026), object(2)
memory usage: 140.7+ MB


In [65]:
for i in list(values):
    
    df = pd.concat([*[df.loc[df.Cluster == i]]*2, 
                    *[df.loc[df.Cluster != i]]], 
                    ignore_index=True)
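
The loop above duplicates every cluster that has a single member, because `train_test_split(..., stratify=...)` needs at least two samples per stratum. As an alternative sketch, the same duplication can be done with one `concat` on a toy frame; note that the row order differs from the loop, which moves the duplicated rows to the front.

```python
import pandas as pd

# Toy frame: clusters 0 and 3 have a single member each.
df_toy = pd.DataFrame({
    "Cluster": [0, 1, 1, 2, 2, 2, 3],
    "Activity": [1, 0, 0, 1, 0, 0, 1],
})

counts = df_toy["Cluster"].value_counts()
singleton_ids = counts[counts < 2].index

# Append one extra copy of every row that belongs to a singleton cluster.
singletons = df_toy[df_toy["Cluster"].isin(singleton_ids)]
balanced = pd.concat([df_toy, singletons], ignore_index=True)

print(balanced["Cluster"].value_counts().sort_index())
```

Because a duplicated compound can land in both the training and the test split, scores for these very small clusters should be read with some caution.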


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17954 entries, 0 to 17953
Columns: 1028 entries, Name to Activity
dtypes: int64(1026), object(2)
memory usage: 140.8+ MB


In [67]:
df.reset_index(drop=True, inplace=True)

In [68]:
df.Cluster.value_counts()

77    1138
13    1018
32     957
71     947
5      827
      ... 
83       2
9        2
66       2
51       2
0        2
Name: Cluster, Length: 100, dtype: int64

In [69]:
X, y = df.drop(['Name', 'SMILES', 'Activity'], axis=1), df['Activity']

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=df['Cluster'], random_state=42)

In [71]:
X_train.head()

Unnamed: 0,Cluster,ECFP4_1,ECFP4_2,ECFP4_3,ECFP4_4,ECFP4_5,ECFP4_6,ECFP4_7,ECFP4_8,ECFP4_9,...,ECFP4_1015,ECFP4_1016,ECFP4_1017,ECFP4_1018,ECFP4_1019,ECFP4_1020,ECFP4_1021,ECFP4_1022,ECFP4_1023,ECFP4_1024
14473,69,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1802,5,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3429,61,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
2527,53,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1040,28,0,0,1,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [72]:
X_train.Cluster.value_counts() / len(X_train), X_test.Cluster.value_counts() / len(X_test)

(77    0.063430
 13    0.056696
 32    0.053288
 71    0.052789
 5     0.046055
         ...   
 45    0.000083
 54    0.000083
 78    0.000083
 86    0.000083
 0     0.000083
 Name: Cluster, Length: 100, dtype: float64,
 77    0.063291
 13    0.056709
 32    0.053333
 71    0.052658
 5     0.046076
         ...   
 65    0.000169
 49    0.000169
 45    0.000169
 29    0.000169
 0     0.000169
 Name: Cluster, Length: 100, dtype: float64)

In [73]:
y_train.value_counts() / len(y_train), y_test.value_counts() / len(y_test)

(0    0.908305
 1    0.091695
 Name: Activity, dtype: float64,
 0    0.910717
 1    0.089283
 Name: Activity, dtype: float64)

In [74]:
%%capture
X_train, X_test = X_train.drop(['Cluster'], axis=1), X_test.drop(['Cluster'], axis=1)

In [75]:
X_train.head()

Unnamed: 0,ECFP4_1,ECFP4_2,ECFP4_3,ECFP4_4,ECFP4_5,ECFP4_6,ECFP4_7,ECFP4_8,ECFP4_9,ECFP4_10,...,ECFP4_1015,ECFP4_1016,ECFP4_1017,ECFP4_1018,ECFP4_1019,ECFP4_1020,ECFP4_1021,ECFP4_1022,ECFP4_1023,ECFP4_1024
14473,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1802,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3429,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
2527,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1040,0,0,1,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [76]:
train_rf(X_train, X_test, y_train, y_test)

(0.9900108040542529,
 0.9738396624472574,
 0.837667473170147,
 0.8431001890359168,
 0.861003861003861,
 0.8519579751671442,
 RandomForestClassifier(bootstrap=False, criterion='entropy',
                        max_features='sqrt', n_estimators=200, n_jobs=-1,
                        random_state=69))

In [77]:
def train_on_dataset_with_cluster(dataset, n_clusters = 10, bits=1024, n_estimators=200, criterion='entropy', max_features='log2'):
    
    print(f'Training on dataset: {dataset} with {bits} bits fingerprint features')
    
    # Note: the file name is fixed to the 1024-bit fingerprints; `bits` only affects the message above
    df = pd.read_csv(path/f'{dataset}_cl_ECFP_1024_with_{n_clusters}_clusters.csv')
    values = df.Cluster.value_counts(ascending=True)
    values = values[values < 2].index
    for i in list(values):
        df = pd.concat([*[df.loc[df.Cluster == i]]*2, 
                        *[df.loc[df.Cluster != i]]], 
                        ignore_index=True)
    df.reset_index(drop=True, inplace=True)
    X, y = df.drop(["Name", "SMILES","Cluster", "Activity"], axis=1), df["Activity"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=df['Cluster'], random_state=42)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    X_train_list, X_valid_list, y_train_list, y_valid_list = [], [], [], []

    for train_index, valid_index in kf.split(X_train):
        X_train_list.append(X_train.iloc[train_index])
        X_valid_list.append(X_train.iloc[valid_index])
        y_train_list.append(y_train.iloc[train_index])
        y_valid_list.append(y_train.iloc[valid_index]) 
    aucs, accs, mccs, recalls, precs, f1_scores = [], [], [], [], [], []
    for i in range(0,5):
        X_train = X_train_list[i]
        X_valid = X_valid_list[i]
        y_train = y_train_list[i]
        y_valid = y_valid_list[i]
        auc,acc2,mcc,Recall,Precision,F1_score, rf = train_rf(X_train, X_valid, y_train, y_valid, 
                                                      n_estimators=n_estimators, criterion=criterion, 
                                                              max_features=max_features)
        aucs.append(auc)
        accs.append(acc2)
        mccs.append(mcc)  # appended once per fold (the duplicate append was removed)
        recalls.append(Recall)
        precs.append(Precision)
        f1_scores.append(F1_score)
        
    print(f"Average ROCAUC of the folds: {np.mean(aucs)}")
    print(f"Average accuracy of the folds: {np.mean(accs)}")
    print(f"Average Matthews correlation of the folds: {np.mean(mccs)}")
    print(f"Average recall of the folds: {np.mean(recalls)}")
    print(f"Average precision of the folds: {np.mean(precs)}")
    print(f"Average f1 score of the folds: {np.mean(f1_scores)}")
    print()
    score = []
    score.append(np.mean(aucs))
    score.append(np.mean(accs))
    score.append(np.mean(mccs))
    score.append(np.mean(recalls))
    score.append(np.mean(precs))
    score.append(np.mean(f1_scores))
    score = np.mean(score)
    mean_mcc = np.mean(mccs)
    return score, mean_mcc, rf

In [78]:
train_on_dataset_with_cluster(dataset)

Training on dataset: CHEMBL205 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9909652500963002
Average accuracy of the folds: 0.9745715475067872
Average Matthews correlation of the folds: 0.841955183890321
Average recall of the folds: 0.821246401487674
Average precision of the folds: 0.8918098488822744
Average f1 score of the folds: 0.8548926004352506



(0.8959068053831012,
 0.841955183890321,
 RandomForestClassifier(bootstrap=False, criterion='entropy',
                        max_features='log2', n_estimators=200, n_jobs=-1,
                        random_state=69))

In [79]:
train_on_dataset(dataset)

Training on dataset: CHEMBL205 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9881450356172132
Average accuracy of the folds: 0.9743593706024445
Average Matthews correlation of the folds: 0.8401407786107054
Average recall of the folds: 0.8135409249099645
Average precision of the folds: 0.8965677168370252
Average f1 score of the folds: 0.8528434688090117



(0.8942662158977276,
 0.8401407786107054,
 RandomForestClassifier(bootstrap=False, criterion='entropy',
                        max_features='log2', n_estimators=200, n_jobs=-1,
                        random_state=69))

# Try different number of clusters

### 100 clusters gives slightly better results than 10 on most metrics (MCC 0.857 vs 0.854), but 10 is also good

In [30]:
cluster = [10, 100]

In [31]:
for c in cluster: 
    train_on_dataset_with_cluster(dataset, n_clusters=c)

Training on dataset: CHEMBL205 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9899136094648535
Average accuracy of the folds: 0.9765869192950773
Average Matthews correlation of the folds: 0.8540500272395473
Average recall of the folds: 0.8350271171323802
Average precision of the folds: 0.8997390862172718
Average f1 score of the folds: 0.8661018444427098

Training on dataset: CHEMBL205 with 1024 bits fingerprint features
Average ROCAUC of the folds: 0.9913091353314061
Average accuracy of the folds: 0.9766416894295926
Average Matthews correlation of the folds: 0.8572071579443641
Average recall of the folds: 0.8429947747880597
Average precision of the folds: 0.8979989887512426
Average f1 score of the folds: 0.869321805116645

