## CHEM452 - Logistic Regression

OK, let's load the bace.csv dataset. Here is its description from the DeepChem source:

The BACE dataset provides quantitative IC50 and qualitative (binary label)
  binding results for a set of inhibitors of human beta-secretase 1 (BACE-1).
  All data are experimental values reported in scientific literature over the
  past decade, some with detailed crystal structures available. A collection
  of 1522 compounds is provided, along with the regression labels of IC50.
  Scaffold splitting is recommended for this dataset.
  The raw data CSV file contains the columns below (among others):
  - "mol" - SMILES representation of the molecular structure
  - "pIC50" - Negative log of the IC50 binding affinity
  - "Class" - Binary label for inhibitor (the column used as `df.Class` below)

  https://github.com/deepchem/deepchem/blob/master/deepchem/molnet/load_function/bace_datasets.py

In [None]:
# Install RDKit first (needed on Colab); newer releases are published on PyPI as `rdkit`
!pip install rdkit-pypi

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("bace.csv")
df.head()


Unnamed: 0,mol,CID,Class,Model,pIC50,MW,AlogP,HBA,HBD,RB,...,PEOE6 (PEOE6),PEOE7 (PEOE7),PEOE8 (PEOE8),PEOE9 (PEOE9),PEOE10 (PEOE10),PEOE11 (PEOE11),PEOE12 (PEOE12),PEOE13 (PEOE13),PEOE14 (PEOE14),canvasUID
0,O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2c...,BACE_1,1,Train,9.154901,431.56979,4.4014,3,2,5,...,53.205711,78.640335,226.85541,107.43491,37.133846,0.0,7.98017,0.0,0.0,1
1,Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(...,BACE_2,1,Train,8.853872,657.81073,2.6412,5,4,16,...,73.817162,47.1716,365.67694,174.07675,34.923889,7.98017,24.148668,0.0,24.663788,2
2,S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...,BACE_3,1,Train,8.69897,591.74091,2.5499,4,3,11,...,70.365707,47.941147,192.40652,255.75255,23.654478,0.230159,15.87979,0.0,24.663788,3
3,S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c...,BACE_4,1,Train,8.69897,591.67828,3.168,4,3,12,...,56.657166,37.954151,194.35304,202.76335,36.498634,0.980913,8.188327,0.0,26.385181,4
4,S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](...,BACE_5,1,Train,8.69897,629.71283,3.5086,3,3,11,...,78.945702,39.361153,179.71288,220.4613,23.654478,0.230159,15.87979,0.0,26.100143,5


OK, now we are going to use Morgan fingerprints as our featurization. These can be a little cumbersome to generate, so I provide the relevant code here. It builds the feature matrix X (one 1024-bit vector per molecule) and the label vector y (the binary Class column).

In [None]:
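from rdkit import DataStructs

# Parse each SMILES string into an RDKit molecule
mols = [Chem.MolFromSmiles(smiles) for smiles in df['mol']]

radius = 3    # neighborhood radius (in bonds) around each atom
nBits = 1024  # length of the folded bit vector
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=nBits) for mol in mols]

# Convert each RDKit bit vector into a NumPy array of 0s and 1s
X = []
for fp_object in fps:
    fp_vect = np.zeros((nBits,), dtype=float)
    DataStructs.ConvertToNumpyArray(fp_object, fp_vect)
    X.append(fp_vect)
X = np.array(X)  # shape: (n_molecules, nBits)
y = df.Class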
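
To get a feel for why every molecule ends up with exactly nBits features, here is a toy sketch of the "folding" idea in plain Python. It is only an illustration: the integer identifiers are made up (hypothetical stand-ins for the hashed atom environments that RDKit computes internally), and `fold_to_bits` is not an RDKit function.

```python
def fold_to_bits(environment_ids, n_bits=1024):
    """Fold arbitrary integer identifiers into a fixed-length 0/1 vector."""
    bits = [0] * n_bits
    for env_id in environment_ids:
        # Two different identifiers can land on the same position (a collision)
        bits[env_id % n_bits] = 1
    return bits

# Hypothetical identifiers; 5 and 1029 collide because 1029 % 1024 == 5
toy_ids = [5, 1029, 2000, 70000]
vec = fold_to_bits(toy_ids, n_bits=1024)
n_on = sum(vec)
print(len(vec), n_on)
```

A smaller nBits means more collisions, so distinct substructures are more likely to share a feature column.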


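Next, let's split our data into training and test sets. Note that `train_test_split` performs a random split, not the scaffold split recommended in the dataset description, so the test metrics below may be optimistic for structurally novel compounds.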

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
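
Since the dataset description recommends scaffold splitting, here is a rough sketch of what that could look like. This is not the code used in this notebook, and it is only one common heuristic: group molecules by scaffold, then assign whole groups to train or test so that no scaffold appears in both. The function name `scaffold_split` and the toy scaffold strings are hypothetical; in practice the scaffold strings could come from something like RDKit's `MurckoScaffold.MurckoScaffoldSmiles`.

```python
from collections import defaultdict

def scaffold_split(scaffolds, test_size=0.25):
    """Split indices so that no scaffold appears in both train and test."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest scaffold groups are considered first; ties broken by first index
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_train_target = (1 - test_size) * len(scaffolds)
    train_idx, test_idx = [], []
    for group in ordered:
        # Keep a whole group together: train if it fits, otherwise test
        if len(train_idx) + len(group) <= n_train_target:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx

# Hypothetical scaffold strings, one per molecule
toy_scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccncc1",
                 "c1ccncc1", "C1CCCCC1", "C1CCOC1", "c1ccccc1"]
train_idx, test_idx = scaffold_split(toy_scaffolds, test_size=0.25)
print(train_idx, test_idx)
```

The resulting index lists could then be used to select rows of X and y instead of calling `train_test_split`.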

Let's start with logistic regression. Create the model, fit it to the training data, and then use the model to predict the classification for the test data.

In [None]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
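
To see what the fitted model is doing under the hood, here is a minimal sketch of a binary logistic regression prediction in plain Python. The weights, bias, and input below are made-up toy values (for the real model they would correspond to `logreg.coef_` and `logreg.intercept_`), and the helper names are hypothetical.

```python
import math

def predict_proba_one(x, weights, bias):
    """Probability of class 1: the logistic (sigmoid) of a linear score."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_one(x, weights, bias, threshold=0.5):
    """Predict class 1 when the probability exceeds the threshold."""
    return int(predict_proba_one(x, weights, bias) > threshold)

# Toy values: three fingerprint bits instead of 1024
toy_weights = [1.5, -2.0, 0.5]
toy_bias = -0.25

p_active = predict_proba_one([1, 0, 1], toy_weights, toy_bias)
label_active = predict_one([1, 0, 1], toy_weights, toy_bias)
label_inactive = predict_one([0, 1, 0], toy_weights, toy_bias)
print(p_active, label_active, label_inactive)
```

Each bit with a positive weight pushes the prediction toward class 1 when it is present, and each bit with a negative weight pushes it toward class 0.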

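Now, let's quantify the prediction by creating the confusion matrix. In scikit-learn's convention, rows are the true classes and columns are the predicted classes. The diagonal elements are the correct classifications, the upper-right element counts false positives (type 1 errors), and the lower-left element counts false negatives (type 2 errors).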

In [None]:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[168,  33],
       [ 40, 138]])
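
As a sanity check on how to read this matrix, here is a small sketch that builds a 2x2 confusion matrix by hand for binary 0/1 labels, using the same layout (rows are true labels, columns are predicted labels). The toy label lists and the `confusion_counts` helper are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true label, column = predicted label
    return matrix

toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 0, 1, 0]
matrix = confusion_counts(toy_true, toy_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)
```

Reading the notebook's output the same way gives 168 true negatives, 33 false positives, 40 false negatives, and 138 true positives.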

We can also compute other metrics, like accuracy, recall, and precision.

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))

Accuracy: 0.8073878627968337
Precision: 0.8070175438596491
Recall: 0.7752808988764045
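
These three numbers follow directly from the confusion matrix above. A quick check in plain Python, using the counts from that output:

```python
# Counts read from the confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 168, 33, 40, 138

accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)                  # of the predicted inhibitors, how many are real
recall = tp / (tp + fn)                     # of the real inhibitors, how many were found

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

The recall is a bit lower than the precision here because there are more false negatives (40) than false positives (33).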
