# Building a Similarity Comparison Set, Revisited 2021

Goal: construct a set of molecular pairs that can be used to compare similarity methods to each other.

This is an update of http://rdkit.blogspot.com/2016/04/revisiting-similarity-comparison-set.html

The earlier version of this notebook (http://rdkit.blogspot.ch/2013/10/building-similarity-comparison-set-goal.html or https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Building%20A%20Similarity%20Comparison%20Set.ipynb) included a number of molecules that have counterions (from salts). Because this isn't really what we're interested in (and because the single-atom fragments that make up many salts triggered a bug in the RDKit's Morgan fingerprint implementation), I repeat the analysis here and restrict it to single-fragment molecules (those that do not include a `.` in the SMILES).

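The other big difference from the previous post is that an updated version of ChEMBL is used; the Python cells below connect to ChEMBL28 (database `chembl_28`), although the SQL transcripts still show a `chembl_21` prompt.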

I want to start with molecules that have some connection to each other, so I will pick pairs that meet a baseline similarity: a Tanimoto similarity of at least 0.7 using count-based Morgan0 fingerprints. I also create a second set of somewhat more closely related molecules, where the baseline is a Tanimoto similarity of at least 0.6 using Morgan1 fingerprints. Both thresholds were selected empirically.
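
To make the baseline concrete: for count-based fingerprints the Tanimoto similarity is commonly computed as the sum of the per-feature minimum counts divided by the sum of the per-feature maximum counts. Here is a minimal sketch of that in plain Python. The feature names and counts are made up, and `count_tanimoto` is an illustrative helper, not an RDKit function.

```python
from collections import Counter

def count_tanimoto(fp1, fp2):
    """Tanimoto similarity between two count fingerprints (feature -> count)."""
    keys = set(fp1) | set(fp2)
    shared = sum(min(fp1.get(k, 0), fp2.get(k, 0)) for k in keys)
    total = sum(max(fp1.get(k, 0), fp2.get(k, 0)) for k in keys)
    # two empty fingerprints are treated as dissimilar in this sketch
    return shared / total if total else 0.0

# Hypothetical feature counts for two small molecules (not real Morgan0 output)
fp_a = Counter({'C_aromatic': 6, 'C_aliphatic': 2, 'N': 1, 'O': 1})
fp_b = Counter({'C_aromatic': 6, 'C_aliphatic': 3, 'N': 1, 'Cl': 1})

sim = count_tanimoto(fp_a, fp_b)
print(f'{sim:.2f}')  # 9 shared counts out of 12 -> 0.75
```

With these toy counts the pair would pass the 0.7 cutoff used for the first set.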

**Note:** this notebook and the data it uses/generates can be found in the GitHub repo: https://github.com/greglandrum/rdkit_blog

I'm going to use ChEMBL as my data source, so I'll start by adding a table with Morgan0 fingerprints that only contains molecules with a monoisotopic molecular weight of at most 600 and a single fragment (recognizable because there is no `.` in the SMILES):

    chembl_21=# select molregno,morgan_fp(m,0) mfp0 into table rdk.tfps_smaller from rdk.mols 
    join compound_properties using (molregno) 
    join compound_structures using (molregno) 
    where mw_monoisotopic<=600 and canonical_smiles not like '%.%';
    SELECT 1372487
    chembl_21=# create index sfps_mfp0_idx on rdk.tfps_smaller using gist(mfp0);
    CREATE INDEX
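
The two filters in that `where` clause are simple enough to mirror in Python when working outside the database. This is only a sketch: the records and their approximate masses are hypothetical, and `is_single_fragment` and `MAX_MW` are names introduced here for illustration.

```python
def is_single_fragment(smiles):
    """True when the SMILES has no '.' separator, i.e. a single fragment."""
    return '.' not in smiles

# Hypothetical (SMILES, approximate monoisotopic mass) records
records = [
    ('CCO', 46.04),
    ('CC(=O)[O-].[Na+]', 82.00),  # a salt: two fragments, so it is dropped
    ('c1ccccc1', 78.05),
]

MAX_MW = 600  # same cutoff as in the SQL above
kept = [smi for smi, mw in records if mw <= MAX_MW and is_single_fragment(smi)]
print(kept)
```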
   

And now I'll build the set of pairs using Python. This is definitely doable in SQL, but my SQL-fu isn't that strong.

Start by getting a set of 35K random small molecules with MW<=600:

In [1]:
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
import time
print(time.asctime())

2021.03.1
Sun May 16 07:37:35 2021


In [36]:
import psycopg2
cn = psycopg2.connect(host='localhost',dbname='chembl_28')
curs = cn.cursor()
curs.execute("select chembl_id,m from rdk.mols join rdk.tfps_smaller using (molregno)"
             " join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND')"
             " order by random() limit 35000")
qs = curs.fetchall()

And now find one neighbor for each of those in the mfp0 table of smallish molecules, stopping once 25K pairs have been collected:

In [37]:
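# clear any open transaction before changing session settings
cn.rollback()
# similarity cutoff used by the cartridge's % operator
curs.execute('set rdkit.tanimoto_threshold=0.7')

keep=[]
for i,row in enumerate(qs):
    # %% is a literal % (the similarity operator); %s marks the query parameters
    curs.execute("select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller "
                 "join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') "
                 "where mfp0%%morgan_fp(%s,0) "
                 "and chembl_id!=%s limit 1) t2 using (molregno)",(row[1],row[0]))
    d = curs.fetchone()
    # skip query molecules that have no neighbor passing the threshold
    if not d: continue
    keep.append((row[0],row[1],d[0],d[1]))
    if len(keep)==25000: break
    if not i%1000: print('Done: %d'%i)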


Done: 0
Done: 1000
Done: 2000
Done: 3000
Done: 4000
Done: 5000
Done: 6000
Done: 7000
Done: 8000
Done: 9000
Done: 10000
Done: 11000
Done: 12000
Done: 13000
Done: 14000
Done: 15000
Done: 16000
Done: 17000
Done: 18000
Done: 19000
Done: 20000
Done: 21000
Done: 22000
Done: 23000
Done: 24000
Done: 25000
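
The loop above leans on the database index to do the similarity search. The selection logic itself can be sketched in plain Python: take each query in turn, keep the first other molecule that passes the threshold (as with `limit 1`, this is the first match, not the nearest one), skip queries with no neighbor, and stop once enough pairs are collected. Everything here is hypothetical: the ids, the toy count fingerprints, and the helpers `count_tanimoto` and `pick_pairs`. It does a linear scan per query and is only meant to show the control flow.

```python
def count_tanimoto(fp1, fp2):
    """Tanimoto similarity between two count fingerprints (feature -> count)."""
    keys = set(fp1) | set(fp2)
    shared = sum(min(fp1.get(k, 0), fp2.get(k, 0)) for k in keys)
    total = sum(max(fp1.get(k, 0), fp2.get(k, 0)) for k in keys)
    return shared / total if total else 0.0

def pick_pairs(queries, pool, threshold, max_pairs):
    """For each query keep the first pool entry (other than itself) that passes."""
    keep = []
    for qid, qfp in queries:
        for pid, pfp in pool:
            if pid != qid and count_tanimoto(qfp, pfp) >= threshold:
                keep.append((qid, pid))
                break  # one neighbor per query is enough
        if len(keep) == max_pairs:
            break
    return keep

# Toy pool: mol3 shares no features with the others, so it finds no neighbor
pool = [
    ('mol1', {1: 4, 2: 1}),
    ('mol2', {1: 4, 2: 2}),
    ('mol3', {3: 5}),
    ('mol4', {1: 3, 2: 1}),
]
pairs = pick_pairs(pool, pool, threshold=0.7, max_pairs=3)
print(pairs)
```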


Finally, write those out to a file so that we can use them elsewhere:

In [38]:
import gzip
# the context manager closes the file, which flushes the gzip stream
with gzip.open('../data/chembl28_25K.pairs.txt.gz','wt',encoding='UTF-8') as outf:
    for cid1,smi1,cid2,smi2 in keep:
        outf.write(f'{cid1}\t{smi1}\t{cid2}\t{smi2}\n')
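
Reading the file back elsewhere is the mirror image of the write. Here is a small sketch, assuming the four-column tab-separated layout written above. `read_pairs` is a hypothetical helper, and the round trip uses a temporary file so that it runs without the real data.

```python
import gzip
import os
import tempfile

def read_pairs(path):
    """Yield (id1, smiles1, id2, smiles2) tuples from a gzipped, tab-separated file."""
    with gzip.open(path, 'rt', encoding='utf-8') as inf:
        for line in inf:
            line = line.rstrip('\n')
            if not line:
                continue
            cid1, smi1, cid2, smi2 = line.split('\t')
            yield cid1, smi1, cid2, smi2

# Round trip on a one-row temporary file
demo = [('CHEMBL64560', 'Clc1ncc[nH]1', 'CHEMBL293391', 'Cc1ncc[nH]1')]
tmp_path = os.path.join(tempfile.mkdtemp(), 'demo.pairs.txt.gz')
with gzip.open(tmp_path, 'wt', encoding='utf-8') as outf:
    for row in demo:
        outf.write('\t'.join(row) + '\n')

loaded = list(read_pairs(tmp_path))
print(loaded)
```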


In [39]:
!zcat ../data/chembl28_25K.pairs.txt.gz | head

CHEMBL3448871	CN(C)C(=O)CN1CCC2(CC1)C[C@@H](O)[C@H](c1ccccc1)NC2=O	CHEMBL3547849	CN1C[C@@H](C(=O)N(C)C)[C@@]2(CCc3ccccc3C(=O)N2)C1
CHEMBL159874	CCN1CC(=O)N(c2ccc(C)cc2C)C1=S	CHEMBL3438192	CCN1CCC(=O)N(C)Cc2cc(F)ccc21
CHEMBL1495377	CCOC(=O)C(=O)Nc1nc(-c2ccc3c(c2)CCN3C(=O)C2CC2)c(C)s1	CHEMBL3409151	CCCCn1cc(C(=O)NC2CCCC2)c(=O)c2cccc(OC)c21
CHEMBL3099949	NC(=O)Nc1ccccc1OC[C@@H](O)CN1CCC2(CC1)Cc1cc(Cl)ccc1O2	CHEMBL3906956	Nc1ncnc(N2CCC(N)(C(=O)N[C@@H](CCO)c3ccc(Cl)cc3)CC2)c1Cl
CHEMBL488412	CC[C@H]1C[C@H]2C[C@@]3(C(=O)OC)c4[nH]c5ccccc5c4CCN(C2=O)[C@@H]13	CHEMBL3547849	CN1C[C@@H](C(=O)N(C)C)[C@@]2(CCc3ccccc3C(=O)N2)C1
CHEMBL4446195	CCN(CC)c1ccc(C=O)c(OCc2cn(CCCOc3ccc4ccc(=O)oc4c3)nn2)c1	CHEMBL3547120	CN(CCCCCN1C(=O)c2ccccc2C1=O)Cc1ccccc1
CHEMBL64560	Clc1ncc[nH]1	CHEMBL293391	Cc1ncc[nH]1
CHEMBL3487260	Cc1nccc(CN2CCC(c3nnsc3S(C)(=O)=O)CC2)n1	CHEMBL3466761	Cc1nc(S(=O)(=O)CCn2cccn2)n(C2CCCCC2)c1C
CHEMBL4547889	COC1NC(=N)CC(=O)N1	CHEMBL2229111	C[C@@H]1NC(=O)CNC1=O
CHEMBL3692987	COc1cc(S(

# Try molecules that are a bit more similar.
This time the pairs must have a Tanimoto similarity of at least 0.6 using Morgan1 (MFP1) fingerprints.

As above, start by adding a table with Morgan1 fingerprints for the smaller molecules:

    chembl_21=# select molregno,morgan_fp(m,1) mfp1 into table rdk.tfps1_smaller from rdk.mols 
    join compound_properties using (molregno) 
    join compound_structures using (molregno) 
    where mw_monoisotopic<=600 and canonical_smiles not like '%.%';
    SELECT 1372487
    chembl_21=# create index sfps_mfp1_idx on rdk.tfps1_smaller using gist(mfp1);
    CREATE INDEX
   

In [40]:
cn = psycopg2.connect(host='localhost',dbname='chembl_28')
curs = cn.cursor()
curs.execute("select chembl_id,m from rdk.mols join rdk.tfps1_smaller using (molregno)"
             " join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND')"
             " order by random() limit 35000")
qs = curs.fetchall()

In [41]:
# clear any open transaction, then set the cutoff for the Morgan1 search
cn.rollback()
curs.execute('set rdkit.tanimoto_threshold=0.6')

keep=[]
for i,row in enumerate(qs):
    # same query as before, but against the Morgan1 table and fingerprint
    curs.execute("select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps1_smaller "
                 "join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') "
                 "where mfp1%%morgan_fp(%s,1) "
                 "and chembl_id!=%s limit 1) t2 using (molregno)",(row[1],row[0]))
    d = curs.fetchone()
    # skip query molecules that have no neighbor passing the threshold
    if not d: continue
    keep.append((row[0],row[1],d[0],d[1]))
    if len(keep)==25000: break
    if not i%1000: print('Done: %d'%i)


Done: 0
Done: 1000
Done: 2000
Done: 3000
Done: 4000
Done: 5000
Done: 6000
Done: 7000
Done: 8000
Done: 9000
Done: 10000
Done: 11000
Done: 12000
Done: 13000
Done: 14000
Done: 15000
Done: 16000
Done: 17000
Done: 18000
Done: 19000
Done: 20000
Done: 21000
Done: 22000
Done: 23000
Done: 24000
Done: 25000


In [42]:
import gzip
# the context manager closes the file, which flushes the gzip stream
with gzip.open('../data/chembl28_25K.mfp1.pairs.txt.gz','wt',encoding='UTF-8') as outf:
    for cid1,smi1,cid2,smi2 in keep:
        outf.write(f'{cid1}\t{smi1}\t{cid2}\t{smi2}\n')

The same pairs are also written in a long format, one molecule per row, with a shared pair index:


In [43]:
with gzip.open('/scratch/cheminformatics_datasets/files/chembl28_25K.mfp1.pairs.txt.gz','wt+') as outf:
    outf.write('pair_index\tchembl_id\tsmiles\n')
    for i,(cid1,smi1,cid2,smi2) in enumerate(keep):
        outf.write(f'Pair{i+1}\t{cid1}\t{smi1}\n')
        outf.write(f'Pair{i+1}\t{cid2}\t{smi2}\n')
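
To go from that long format back to pairs, the rows can be grouped on `pair_index`. Here is a minimal sketch using an in-memory stand-in for the file. `group_pairs` is a hypothetical helper, and the two example rows are illustrative rather than taken from the Morgan1 set.

```python
import csv
import io
from collections import defaultdict

def group_pairs(handle):
    """Group rows of a pair_index/chembl_id/smiles table by pair_index."""
    pairs = defaultdict(list)
    reader = csv.DictReader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        pairs[row['pair_index']].append((row['chembl_id'], row['smiles']))
    return dict(pairs)

# Small in-memory stand-in for the gzipped file
text = ('pair_index\tchembl_id\tsmiles\n'
        'Pair1\tCHEMBL64560\tClc1ncc[nH]1\n'
        'Pair1\tCHEMBL293391\tCc1ncc[nH]1\n')
grouped = group_pairs(io.StringIO(text))
print(grouped['Pair1'])
```

For the real file, a handle from `gzip.open(path, 'rt')` could be passed in place of the `io.StringIO` object.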