# MaxMinPicker

This notebook investigates the improvements to MaxMinPicker that will be present in the 2017_09 RDKit release.
This work is described in Roger Sayle's talk at the 
[2017 UGM](https://github.com/rdkit/UGM_2017/blob/master/Presentations/Sayle_RDKitDiversity_Berlin17.pdf).

In [52]:
from rdkit import Chem
from rdkit.Chem import Draw,rdMolDescriptors,AllChem
from rdkit import SimDivFilters,DataStructs
import gzip, time, platform

In [53]:
print('Python Version', platform.python_version())
print('RDKit Version:', Chem.rdBase.rdkitVersion)

Python Version 3.6.2
RDKit Version: 2017.09.1.b1


We'll use Andrew Dalke's [benzodiazepine dataset](http://dalkescientific.com/writings/benzodiazepine.sdf.gz) as it's drug-like and of a useful size for these studies. Download the file and name it `benzodiazepine.sdf.gz`.
We need to generate fingerprints for all the molecules.

In [54]:
start = time.time()
benzodiazepines = []
inf = gzip.open('benzodiazepine.sdf.gz')
suppl = Chem.ForwardSDMolSupplier(inf)
for mol in suppl:
    if mol is None: continue  # skip records that RDKit could not parse
    benzodiazepines.append(rdMolDescriptors.GetMorganFingerprintAsBitVect(mol,2))    
inf.close()
end = time.time()
print('Read', len(benzodiazepines), 'molecules in', str(end - start)+' sec')

Read 12386 molecules in 12.944286823272705 sec


First we'll use the `LazyPick` function. This allows a function to be passed in that generates the distance between two fingerprints. Here is that function: the Tanimoto distance between the Morgan fingerprints.

We also create the picker.

In [55]:
def fn(i,j,fps=benzodiazepines):
    return 1.-DataStructs.TanimotoSimilarity(fps[i],fps[j])
mmp = SimDivFilters.MaxMinPicker()
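
To make the distance concrete, here is a minimal pure-Python sketch of the Tanimoto distance that `fn` returns. It treats a fingerprint as a plain set of on-bit indices rather than an RDKit bit vector; the name `tanimoto_distance` and the handling of two empty fingerprints are assumptions made for this illustration only.

```python
def tanimoto_distance(bits_a, bits_b):
    """Return 1 - |A & B| / |A | B| for two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Assumption for this sketch: two empty fingerprints count as identical.
        return 0.0
    return 1.0 - len(a & b) / union


fp1 = {1, 5, 9, 42}
fp2 = {1, 5, 10}
# 2 bits are shared out of 5 distinct bits, so the similarity is 2/5.
print(tanimoto_distance(fp1, fp2))
```

Identical fingerprints give a distance of 0 and fingerprints with no bits in common give 1, which is the scale the picker works on.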

LazyPick parameters:
1. the function that's used to generate the distance between 2 mols
2. the total number of molecules
3. the total number of molecules to return (including the initial seed molecules)
4. the indexes of the initial seed molecules (optional) e.g. [1,2,3,4 ...]

The return value holds the IDs of the initial seed molecules followed by the IDs of the newly picked molecules.
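
The idea behind MaxMin picking can be sketched in a few lines of pure Python. This is a naive illustration that takes its arguments in the same order as the list above; it is not RDKit's implementation (the optimised algorithm is the subject of the talk linked at the top). The function name `maxmin_pick` is hypothetical, and starting from index 0 when no seeds are given is an assumption of this sketch.

```python
def maxmin_pick(dist_fn, pool_size, pick_size, first_picks=()):
    """Naive MaxMin: repeatedly take the candidate farthest from its nearest pick."""
    picks = list(first_picks)
    if not picks:
        picks.append(0)  # assumption of this sketch: start from index 0
    picked = set(picks)
    # min_dist[i] is the distance from candidate i to its nearest picked item.
    min_dist = {
        i: min(dist_fn(i, p) for p in picks)
        for i in range(pool_size)
        if i not in picked
    }
    while len(picks) < pick_size and min_dist:
        best = max(min_dist, key=min_dist.get)
        picks.append(best)
        del min_dist[best]
        # Only the newly added pick can shrink a candidate's nearest distance.
        for i in min_dist:
            d = dist_fn(i, best)
            if d < min_dist[i]:
                min_dist[i] = d
    return picks


# Toy data: points on a line, with distance = absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 10.0, 10.1]
toy_picks = maxmin_pick(lambda i, j: abs(points[i] - points[j]), len(points), 3, [0])
print(toy_picks)
```

As with `LazyPick`, the seed index comes first in the result, followed by the new picks in the order they were chosen.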

In [56]:
start_with = 100
how_many_to_pick = 100
for i in 1,2,3:
    start = time.time()
    picks = mmp.LazyPick(fn, len(benzodiazepines), start_with + how_many_to_pick, list(range(start_with)))
    end = time.time()
    print('Picking', how_many_to_pick, 'from', len(benzodiazepines) - start_with, 'starting with', start_with, 'generated', len(picks) - start_with, 'picks')
    print('Picking took', str(end - start)+' sec')

Picking 100 from 12286 starting with 100 generated 100 picks
Picking took 1.3343174457550049 sec
Picking 100 from 12286 starting with 100 generated 100 picks
Picking took 1.3073861598968506 sec
Picking 100 from 12286 starting with 100 generated 100 picks
Picking took 1.287372350692749 sec


That's pretty good, but we can do better. `LazyBitVectorPick` is much faster as it does less to and fro between the C++ and Python layers. The one downside is that it assumes you want the Tanimoto distance between two bit vectors, but that's exactly what we were doing anyway with the distance function.

In [57]:
start_with = 100
how_many_to_pick = 100
for i in 1,2,3:
    start = time.time()
    picks = mmp.LazyBitVectorPick(benzodiazepines, len(benzodiazepines), start_with + how_many_to_pick, list(range(start_with)))
    end = time.time()
    print('Picking', how_many_to_pick, 'from', len(benzodiazepines) - start_with, 'starting with', start_with, 'generated', len(picks) - start_with, 'picks')
    print('Picking took', str(end - start)+' sec')

Picking 100 from 12286 starting with 100 generated 100 picks
Picking took 0.1678164005279541 sec
Picking 100 from 12286 starting with 100 generated 100 picks
Picking took 0.15154695510864258 sec
Picking 100 from 12286 starting with 100 generated 100 picks
Picking took 0.16774797439575195 sec


That's almost 10 times faster than the old algorithm for picking 100 given 100 seeds, and faster still for larger sets (you'll need to run with different versions of RDKit to check this).
You can investigate the timings yourself by changing the `start_with` and `how_many_to_pick` variables.
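
If you want to try several settings, the repeated timing loop can be factored into a small helper. This is only a convenience sketch: `time_call` is a hypothetical name, and it uses `time.perf_counter` rather than the `time.time` calls in the cells above.

```python
import time


def time_call(func, *args, repeats=3):
    """Call func(*args) `repeats` times; return (last result, list of elapsed seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        timings.append(time.perf_counter() - start)
    return result, timings


# Stand-in workload so the sketch runs without RDKit.
total, timings = time_call(sum, range(1000))
print(total, len(timings))
```

With the picker from this notebook, the equivalent call would be `time_call(mmp.LazyBitVectorPick, benzodiazepines, len(benzodiazepines), start_with + how_many_to_pick, list(range(start_with)))`.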

So let's continue using LazyBitVectorPick.

Let's see how it can work on a real example. We'll take the [NCI250 data](https://cactus.nci.nih.gov/download/nci/) set with ~250K molecules as our starting set (assume these are the compounds we already have in our collection) and we want to pick 1000 molecules from the benzodiazepine dataset to add to that set. Download the SMILES file and name it `NCI.smiles`.

So let's read in the NCI dataset.

In [58]:
start = time.time()
nciFps = []
suppl2 = Chem.SmilesMolSupplier('NCI.smiles', delimiter=' ', smilesColumn=1)
for mol in suppl2:
    if mol is None: continue
    nciFps.append(rdMolDescriptors.GetMorganFingerprintAsBitVect(mol,2))    
end = time.time()
print('Read', len(nciFps), 'molecules in', str(end - start)+' sec')

Read 247477 molecules in 79.72341656684875 sec


Now combine the NCI and benzodiazepine fingerprints, as we need a single list.

In [59]:
allFps = []
allFps.extend(nciFps)
allFps.extend(benzodiazepines)
len(allFps)

259863

In [60]:
how_many_to_pick = 1000
seed_count = len(nciFps)
seeds = list(range(seed_count))
fp_num = len(allFps)
for i in 1,2,3:
    start = time.time()
    picks = mmp.LazyBitVectorPick(allFps, fp_num, seed_count + how_many_to_pick, seeds)
    end = time.time()
    print('Picking', how_many_to_pick, 'from', len(allFps), 'starting with', seed_count, 'generated', len(picks) - seed_count, 'picks and took', str(end - start)+' sec')

Picking 1000 from 259863 starting with 247477 generated 1000 picks and took 1248.1252934932709 sec
Picking 1000 from 259863 starting with 247477 generated 1000 picks and took 1203.1685574054718 sec
Picking 1000 from 259863 starting with 247477 generated 1000 picks and took 1257.1654748916626 sec


So that's a reasonably representative selection of 1000 compounds done in about 20 minutes on a modestly specced laptop.
Not bad! You certainly couldn't do that with the old algorithm.
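
Because the seeds occupy the first `len(nciFps)` positions of `allFps`, the IDs returned by the picker refer to the combined list. A small helper can drop the seed IDs and re-base the rest onto the benzodiazepine list; `new_pick_indices` is a hypothetical name used only for this sketch.

```python
def new_pick_indices(picks, seed_count):
    """Drop seed IDs and shift the remaining IDs so they index the second list."""
    return [idx - seed_count for idx in picks if idx >= seed_count]


# Toy example: 3 seeds (IDs 0-2) followed by 3 newly picked IDs.
toy = [0, 1, 2, 7, 4, 3]
print(new_pick_indices(toy, 3))
```

In this notebook that would be `new_pick_indices(picks, seed_count)`, giving positions in `benzodiazepines`.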

Finally, let's look at a different approach. Instead of choosing how many to pick, we specify a similarity threshold and keep picking until we have picked all the available molecules that are at least that distance from any that have already been picked. This uses the new function `LazyBitVectorPickWithThreshold`, which returns the picks and the threshold of the last pick.

In [61]:
start = time.time()
picks, thresh = mmp.LazyBitVectorPickWithThreshold(benzodiazepines, len(benzodiazepines), 10000, 0.7)
end = time.time()
print('Picking generated', len(picks), 'picks, final threshold was', thresh)
print('Picking took', str(end - start)+' sec')

Picking generated 146 picks, final threshold was 0.7
Picking took 0.12487959861755371 sec


One thing to note here is that the last parameter seems to be a **dissimilarity** score, which is a bit unexpected. With a value of 0.7 you get 146 picks, but with 0.4 you get 2065, indicating a closer placement of the picks:

In [62]:
start = time.time()
picks, thresh = mmp.LazyBitVectorPickWithThreshold(benzodiazepines, len(benzodiazepines), 10000, 0.4)
end = time.time()
print('Picking generated', len(picks), 'picks, final threshold was', thresh)
print('Picking took', str(end - start)+' sec')

Picking generated 2065 picks, final threshold was 0.4
Picking took 4.998914480209351 sec


Note that in both cases we specify a stupidly high value for the number to pick so that this is not limiting.
The picker terminates once it can find no more to pick.
Of course you can specify a lower limit here so that it terminates when it can find no more to pick or has picked the required number, whichever comes first.
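
The two stopping conditions can be illustrated with a naive pure-Python sketch. It is not RDKit's implementation: the name `maxmin_pick_with_threshold`, the choice of index 0 as the first pick, and the use of "at least the threshold" as the acceptance test are all assumptions of this sketch, and unlike `LazyBitVectorPickWithThreshold` it returns only the picks.

```python
def maxmin_pick_with_threshold(dist_fn, pool_size, max_picks, threshold, first_pick=0):
    """Naive MaxMin that stops at max_picks or when no candidate is far enough away."""
    picks = [first_pick]
    # min_dist[i] is the distance from candidate i to its nearest picked item.
    min_dist = {i: dist_fn(i, first_pick) for i in range(pool_size) if i != first_pick}
    while len(picks) < max_picks and min_dist:
        best = max(min_dist, key=min_dist.get)
        if min_dist[best] < threshold:
            break  # every remaining candidate is too close to an existing pick
        picks.append(best)
        del min_dist[best]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], dist_fn(i, best))
    return picks


# Toy data: points on a line, with distance = absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 10.0, 10.1]
dist = lambda i, j: abs(points[i] - points[j])
print(len(maxmin_pick_with_threshold(dist, len(points), 100, 1.0)))
print(len(maxmin_pick_with_threshold(dist, len(points), 100, 0.15)))
print(len(maxmin_pick_with_threshold(dist, len(points), 2, 0.15)))
```

On the toy data the lower threshold yields more picks, mirroring the 0.7 versus 0.4 runs above, and the last call shows the pick limit taking over.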

Adios. Hats off to Roger and Greg!