Skip to content

Latest commit

 

History

History
231 lines (171 loc) · 7.88 KB

ch09_qsar.asciidoc

File metadata and controls

231 lines (171 loc) · 7.88 KB

Chapter 9: Basics of Quantitative Structure-Activity Relationship (QSAR)

jupyter

The correlation between chemical structure and biological activity is called Structure Activity Relationship (SAR) or Quantitative SAR (QSAR). In general, similar compounds are known to exhibit similar biological activities, and it is very important in drug discovery research to understand this correlation and apply it to drug design.

In addition, there are two types of problems such as classification problems to estimate which class a compound belongs to, such as cell death or toxicity, or toxicity, and regression problems to estimate continuous values such as % inhibition.

Consider the cause of no effect (classification problem)

Label the ones with an IC50 less than 1 uM with hERG inhibition and the others with no hERG inhibition using 73 data from ChEMBL hERG inhibition assay.

First, import the necessary libraries.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.ensemble import RandomForestClassifier

Processing of tab-delimited text downloaded with ChEMBL is almost the same as in Chapter 8, but this time I want liveness data, so I search for the column STANDARD_VALUE and retrieve the numerical value. If this value is less than 1000 nM, label it as POS, otherwise label it as NEG. At the end, I will make the label numpy array.

mols = []
labels = []
with open("ch09_compounds.txt") as f:
    header = f.readline()
    smiles_index = -1
    for i, title in enumerate(header.split("\t")):
        if title == "CANONICAL_SMILES":
            smiles_index = i
        elif title == "STANDARD_VALUE":
            value_index = i
    for l in f:
        ls = l.split("\t")
        mol = Chem.MolFromSmiles(ls[smiles_index])
        mols.append(mol)
        val = float(ls[value_index])
        if val < 1000:
            labels.append("POS")
        else:
            labels.append("NEG")

labels = np.array(labels)

Then convert the mol object into a fingerprint. From this fingerprint, create a model to predict the presence or absence of hERG inhibition.

fps = []
for mol in mols:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    fps.append(arr)
fps = np.array(fps)

Divide the data set into two of the training set test set. The test set will be used later to evaluate the accuracy of the created prediction model.

x_train, x_test, y_train, y_test = train_test_split(fps, labels)

To create a predictive model, just create an instance and train it with the fit method.

rf = RandomForestClassifier()
rf.fit(x_train, y_train)

Predict the test set you split up earlier.

y_pred = rf.predict(x_test)

Create a Confusion matrix.

What is confusion matrix?

Confusion matrix is a table that summarizes the results of class classification. It is possible to visualize clearly whether the class is classified correctly, and as TP and TN are many and FP and FN are small, it is possible to classify better.

Actual class

Positive

Negative

Predicted class

Positive

True Positive(TP)

False Positive(FP)

Negative

False Negative(FN)

True Negative(TN)

confusion_matrix(y_test, y_pred)
#array([[11,  1],[ 5,  2]])
11 1

5

2

Let’s look at the F1 score.

f1_score(y_test, y_pred, pos_label="POS")
#0.4

It is not very good.

Note
Because the train_test_split function randomly splits the training set and test set, the value of the confidence matrix, F1 score changes with each execution.
With F1 score
  • The ratio of what is truly correct among what is predicted to be correct is called accuracy rate precision = TP / (TP + FP)

  • The rate at which the correct thing is predicted to be correct is called the recall rate recall = TP / (TP + FN)

The F1 score is the harmonic mean of the precision rate and the recall rate

It is calculated by F1 = 2 * (precision * recall) / (precision + recall)

Predict the efficacy of drugs (regression problem)

Regression models, as discussed earlier, are models that predict continuous values. This time, create a regression model of RandomForest, and evaluate its accuracy with R2. Let’s use the data from hERG’s assay data used in classification problems. Import the required libraries first.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from math import log10

We labeled it for classification problems, but now we want to predict continuous values, so we convert it to pIC50. (We will supplement later on why it is convenient to use pIC50)

pIC50s = []
with open("ch09_compounds.txt") as f:
    header = f.readline()
    for i, title in enumerate(header.split("\t")):
        if title == "STANDARD_VALUE":
            value_index = i
    for l in f:
        ls = l.split("\t")
        val = float(ls[value_index])
        pIC50 = 9 - log10(val)
        pIC50s.append(pIC50)

pIC50s = np.array(pIC50s)

Divide the data set into two: training set and test set. The fingerprint uses what was created at the time of classification model.

x_train, x_test, y_train, y_test = train_test_split(fps, pIC50s)

I will train. In the case of Scikit-learn, this procedure is fit and predict with almost the same method in any method.

rf = RandomForestRegressor()
rf.fit(x_train, y_train)

Let’s predict.

y_pred = rf.predict(x_test)

Let’s put out the prediction accuracy with R2.

r2_score(y_test, y_pred)
#0.52

Is there anything like that?

With R2 score

It is often used as one of the evaluation indicators for the goodness of fit of regression, also called the determination coefficient.

Model applicability (applicability domain)

The method introduced here is a model generated based on the hypothesis that similar compounds exhibit similar biological activities. What is the prediction accuracy if there is no compound that is similar to the training set?

Of course, the predicted value is not reliable in that case. In other words, is the prediction likely to be that prediction? The degree of reliability always goes around. The extent to which such models can be trusted or applied is called the applicability domain. In this regard, the scope of application and model application Mr. Kaneko, Meiji University are detailed.

(Extra column) How reliable can the applicability domain be?

Long time ago, do the similar compounds of Dr. Hugo Kubinyi show similar activities? I remember the question that I was impressed by the fact that converting the estradiol OH group into a Methoxy group gave examples of the loss of activity.

The applicability domain is a method to measure the accuracy of the prediction from the similarity of the training set. Here comes the question of who the similarity is for. It is our hand that we think this compound and this compound are similar, but it is ultimately determined by the protein whether it is similar or not. Therefore, the activity can not always be predicted from the similarity, and the activity often disappears even if the similarity is extremely high. In particular, Activity Cliff, described in the context of MMP, gives such an event its name.