Chapter 6: Evaluating Compound Similarity

What does it mean that compounds are similar？

What does it mean that two compounds are similar? Somehow similar in shape?The expression is not scientific. Chemoinformatics evaluates whether or not they are more similar by quantitative measures such as similarity (generally 0-100% value) and dissimilarity (distance).

Here we mainly introduce two representative scales.

Descriptor

A parameter that expresses the overall characteristics of a molecule numerically is called a descriptor. Many descriptors have been proposed to date, including molecular weight, polar surface properties (PSA), and partition coefficient (logP). By evaluating the similarity of these descriptors, it is possible to express how similar the two molecules are. Also note that the features of the whole molecule are represented by a single number and are not local features.

Note	There are cases where commercial software is needed to calculate some descriptors.

Fingerprint

Another is fingerprinting. A fingerprint is a representation of the partial structure of a molecule in binary of 0 and 1, and the correspondence between the presence or absence of a partial structure and the bit on (1), off (0), representing a set of partial structures. This expresses the characteristics of the molecule. There are two types of fingerprints: fixed-length FP and variable-length FP. In the old days, fixed-length FPs (FPs whose partial structure and index were determined in advance) called MACCS keys were used, but now ECFP4, a variable length FP called Morgan2 is used.

As for the RDKit fingerprint, please read Developper of RDKit, Greg’s Slide for details.

This time, let’s evaluate the similarity using ECFP4 (Morgan2).

Difference between SMILES and fingerprint

SMILES represents the structure as an ASCII string, and the fingerprint is the binary representation of the presence or absence of a substructure. The difference is that the former is one of the structural expressions , while the latter is one of the feature expressions . Since only the presence or absence of partial structures is expressed, information such as the relationship between partial structures (how connected by positional relationship) is lost, and the original structure is not restored.

Some people call it Bag-of-Fragments because it corresponds to Bag-of-Words often used in text-mining.

calculate similarity

Let’s evaluate the similarity of toluene and chlorobenzene as simple molecules.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.Draw import IPythonConsole

Load molecules from SMILES.

mol1 = Chem.MolFromSmiles("Cc1ccccc1")
mol2 = Chem.MolFromSmiles("Clc1ccccc1")

Check it visually.

Draw.MolsToGridImage([mol1, mol2])

Generate a Morgan fingerprint of radius 2 equivalent to ECFP4.

fp1 = AllChem.GetMorganFingerprint(mol1, 2)
fp2 = AllChem.GetMorganFingerprint(mol2, 2)

Tanimoto coefficient is used for similarity evaluation.

DataStructs.TanimotoSimilarity(fp1, fp2)
# 0.5384615384615384

If you want to find the distance that is the complement of similarity (that is, 1-similarity), set returnDistance = True. In clustering described later, distance is more often used than similarity.

DataStructs.TanimotoSimilarity (fp1, fp2, returnDistance = True )
# 0.46153846153846156

Virtual screening

So far we have described how to evaluate the similarity of compounds. Selecting a specific compound group from many compounds using this similarity index is called virtual screening.

For example, if a compound that is likely to be a drug is published in a patent or a paper, or a compound that is likely to be promising is found in our assay system, similar compounds in the compound library database of our company or the database of commercially available compounds are more promising I want to find out if there is something like that. Here, it is possible to purchase an analog of influenza drug which is known as a neuraminidase inhibitor Inavir link:Find out using ZINC.

The molecular weight of Inavir was about 350,and ALogP was -2.92, a compartment of 3.4 million compounds with a ZINC molecular weight of 350-375 and a LogP of less than -1 was selected. This section is divided into 16 files, but download and use only the first set.

Note	How to download data is explained in Chapter 4.

We can perform shell command on jupyter notebook by starting from !. The following is an example of downloading ZINC data set with wget command on jupyter notebook

!wget http://files.docking.org/2D/EA/EAED.smi

Note	If you do not have the wget command on Windows etc., you can download the file by directly entering the URL in the browser. Move the downloaded file to the directory where jupyter is running.

Use SmilesMolSupplier function to read SMILES from a file and convert it to mol object. Think of it as the SMILES version of SDMolSupplier introduced in Chapter 5.

spl = Chem.rdmolfiles.SmilesMolSupplier( "EAED.smi" )
    len (spl)
# 195493

Next, prepare a function to check the degree of similarity with Inavir (Laninamivir).

laninamivir = Chem.MolFromSmiles("CO[C@H]([C@H](O)CO)[C@@H]1OC(=C[C@H](NC(=N)N)[C@H]1NC(=O)C)C(=O)O")
laninamivir_fp = AllChem.GetMorganFingerprint(laninamivir, 2)

def calc_laninamivir_similarity(mol):
    fp = AllChem.GetMorganFingerprint(mol, 2)
    sim = DataStructs.TanimotoSimilarity(laninamivir_fp, fp)
return sim

Check it.

similar_mols =[]
for mol in mols:
    sim = calc_laninamivir_similarity(mol)
    if sim > 0.2:
        similar_mols.append((mol, sim))

Sort the results in descending order of similarity and retrieve only the first ten.

similar_mols.sort(key=lambda x: x[1], reverse=True)
mols = [l[0] for l in similar_mols[:10]]

Let’s draw them.

Draw.MolsToGridImage(mols, molsPerRow=5)

As you can see if the similarity is confirmed, about 200,000 compounds examined this time can only find a compound with a maximum similarity is 23%. However, ZINC contains 750 million entries, so there should be many more similar compounds in it.

Clustering

For example, when creating a library by purchasing commercially available compounds, we want to have as much diversity as possible, so we group together similar compounds so that only similar compounds are not biased, and select a representative among them. If you want to organize compounds based on their structural similarity, use a technique called clustering.

Clustering 5614 hits from Novartis’s antimalarial assay

Import library for clustering and reading data.

from rdkit.ML.Cluster import Butina
mols = Chem.SDMolSupplier("ch06_nov_hts.sdf")

If for some reason RDKit cannot read the molecule, it will generate None instead of a mol object. An error will occur if this None is passed to the GetMorganFingerprintAsBitVect function, so the fingerprint will be generated while excluding None.

fps = []
valid_mols = []

for mol in mols:
    if mol is not None:
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        fps.append(fp)
        valid_mols.append(mol)

Generate distance matrix (distance matrix of lower triangle) from fingerprint. DataStructs.BulkTanimotoSimilarity can be used to calculate similarities and distances between multiple compounds as a reference compound.

distance_matrix = []
for i, fp in enumerate(fps):
    distance  = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i+1],returnDistance = True )
    distance_matrix.extend(distance)

Cluster compounds using the distance matrix. The third argument is the distance threshold. In this example, clustering is performed on compounds with a similarity of 0.2 or more than 80%.

clusters = Butina.ClusterData(distance_matrix, len(fps), 0.2, isDistData=True)

Check number of cluster.

len(clusters)
#2492

Visualize structures of first cluster.

mols_ =[valid_mols[i] for i in clusters[0]]
Draw.MolsToGridImage(mols_, molsPerRow=5)

In this case, clustering was performed using the library provided in RDKit, but some methods can be used with Scikit-learn. And in practice this method is often used.

Structure Based Drug Design(SBDD)

Here we evaluate the similarity of apixaban and rivaroxaban, which are marketed as anticoagulants.

apx = Chem.MolFromSmiles("COc1ccc(cc1)n2nc(C(=O)N)c3CCN(C(=O)c23)c4ccc(cc4)N5CCCCC5=O")
rvx = Chem.MolFromSmiles("Clc1ccc(s1)C(=O)NC[C@H]2CN(C(=O)O2)c3ccc(cc3)N4CCOCC4=O")

Draw.MolsToGridImage([apx, rvx], legends=["apixaban", "rivaroxaban"])

As you can see from the structure, they are somewhat similar. Both compounds are known to bind to the same pocket of the serine protease, FXa, in the same way and inhibit the function of proteins.

apx_fp = AllChem.GetMorganFingerprint(apx, 2)
rvx_fp = AllChem.GetMorganFingerprint(rvx, 2)

DataStructs.TanimotoSimilarity(apx_fp, rvx_fp)
# 0.40625

The similarity was about 40%. In fact, both apixaban and rivaroxaban have their complex crystal structures solved and were drawn using PyMOL.

Note	It does not explain how to use PyMOL because it exceeds the contents of this document, but if you are interested, Please refer to here.

As you can see from the figure, apixaban and rivaroxaban are beautifully overlapping in three dimensions. In particular, methoxyphenyl and chlorothiol are located in a site called S1 pocket and are said to have some kind of strong interaction. As the ligand binding sites (pockets) of proteins become clearer, it becomes easier for the medicinal chemist to develop a strategy for the next modification, and the success rate and progress rate of the project will increase.

An approach that optimizes the structure based on the shape of the protein determined by X-ray or cryo-electric testing is called Structure Based Drug Design (SBDD). One the pockets are know you can screen for compounds that physically bind to the pocket, which is called structure-based virtual screening (SBVS), and ligand-based virtual screening as you did in the previous chapter (LBVS).

History of Xa inhibitors and the importance of quantum chemistry calculation

Although it is far from the content of chemoinformatics in this book, it is quite useful in molecular design to trace the history of FXa inhibitors and understand what improvements have been made over the generations. Also, the interpretation of the interaction of the S1 pocket is extremely difficult by visual and classical mechanics, and can be interpreted only by quantum chemical calculations such as the Fragment Molecular Orbital Method.Therefore, quantum chemical calculations will be indispensable in future molecular design.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ch06_similarity.asciidoc

ch06_similarity.asciidoc

Chapter 6: Evaluating Compound Similarity

What does it mean that compounds are similar？

Descriptor

Fingerprint

calculate similarity

Virtual screening

Clustering

Structure Based Drug Design(SBDD)

Files

ch06_similarity.asciidoc

Latest commit

History

ch06_similarity.asciidoc

File metadata and controls

Chapter 6: Evaluating Compound Similarity

What does it mean that compounds are similar？

Descriptor

Fingerprint

calculate similarity

Virtual screening

Clustering

Structure Based Drug Design(SBDD)