# RDKit Experiments
***

## What is [RDKit](http://rdkit.org)?
RDkit is a collection of cheminformatics and machine learning software written in the Python and C++. In this notebook/blog, we will be using the language Python.

Below are some imports from RDKit

In [1]:
from rdkit import Chem

#### Creating molecules using RDKit
using **MolFromSmiles**

In [2]:
mol = Chem.MolFromSmiles('Cc1cccc1')

or from a sdf file (lots of mols): **SDMolSupplier**

In [3]:
suppl = Chem.SDMolSupplier('db/chembl_01.sdf')

for more, visit https://www.rdkit.org/docs/GettingStartedInPython.html

#### Displaying molecules
After knowing how to create molecules using RDKit, I was curious how I could visualize them. The function below allows you to draw a 2D SVG molecule. 

###### reference: [RDKit in Jupyter Notebook](http://davies-lee.com/index.php/2018/10/06/rdkit-in-jupyter-notebooks/)

In [4]:
# imports 
from IPython.display import SVG
from rdkit.Chem import rdDepictor as rdd
from rdkit.Chem.Draw import rdMolDraw2D as draw2d

def draw2D(x):
    m = x
    molSize = (450, 150)
    mc = Chem.Mol(m.ToBinary())
    if not mc.GetNumConformers():
        rdd.Compute2DCoords(mc)
    drawer = draw2d.MolDraw2DSVG(molSize[0],molSize[1])
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    display(SVG(svg.replace('svg:','')))

#### Similarity Search
Speaking of RDKit, lots of people use this collection to do similarity search between molecules. 
There are several kinds of [similarity searches](https://en.wikipedia.org/wiki/Similarity_search), including Tanimoto, Dice, Cosine, Sokal, Russel... and more.

In this section, we will be focusing on the Tanimoto Similarity using [RDKit](http://www.rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity)

First, let's import some modules we will need

In [5]:
from rdkit import DataStructs
from rdkit.Chem.Fingerprints.FingerprintMols import FingerprintMol

Before we start doing Tanimoto Similarity, let's import a [database from chEMBL](ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/)

Here, I'm downloading the sdf file and store it in the folder called db:

In [6]:
suppl = Chem.SDMolSupplier('db/chembl_01.sdf')

now, let's create a function that converts the suppl into a list of molecules (Mol type).

In [7]:
def suppl_to_lst(suppl, sample_size):
    # declare a list to store the mols
    ms = []
    # sample size: can be changed accordingly to the size of database
    for i in range(sample_size):
        ms.append(suppl[i])
    
    return ms

In [8]:
small_moldb100 = suppl_to_lst(suppl, 100)
small_moldb200 = suppl_to_lst(suppl, 200)
small_moldb500 = suppl_to_lst(suppl, 500)
small_moldb1000 = suppl_to_lst(suppl, 1000)

In [9]:
from gensim.models import Word2Vec
    
model = Word2Vec.load('db/model_300dim.pkl')

In [10]:
from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec
from mol2vec.helpers import depict_identifier, mol_to_svg, IdentifierTable, plot_2D_vectors

def clean_db(db, radius):
    not_in_model = []
    none_type = []
    
    # check for nonetype
    for i in range(len(db)):
        mol = db[i]
        # check for nonetype
        if type(mol) == type(None):
            none_type.append(mol)
        
    cleaned1 = list(set(db)-set(none_type))
    
    for i in range(len(cleaned1)):
        mol = cleaned1[i]
        # check if sentence in model or not
        sentences = mol2alt_sentence(mol, radius)
        for s in sentences:
            if s not in model:
                not_in_model.append(mol)
    
    cleaned = list(set(cleaned1)-set(not_in_model))
    return cleaned

In [11]:
radius = 1
c_small_moldb100 = clean_db(small_moldb100, radius)
c_small_moldb200 = clean_db(small_moldb200, radius)
c_small_moldb500 = clean_db(small_moldb500, radius)
c_small_moldb1000 = clean_db(small_moldb1000, radius)



After getting a list of molecules, let's create a function to find the Tanimoto similarity:

In [12]:
def tanimoto_similarity(ms, idx):
    # convert Mol to datastructure type using a list comprehension
    fps = [FingerprintMol(i) for i in ms]

    # set a query molecule: can be changed 
    query = fps[idx]
    # declare new list for similarities
    sim = []
    # loop through to find Tanimoto similarity
    for i in range(len(fps)):
        # tuple: (idx, similarity)
        sim.append((i, DataStructs.FingerprintSimilarity(query, fps[i])))
    
    # sort sim using the similarities
    sim.sort(key=lambda x:x[1], reverse=True)
    
    return sim

To better visualize the generated similarities, let's sort them in descending order and display them:

In [13]:
# get gt_ranking
for i in range(10):
    gt_rankings = tanimoto_similarity(c_small_moldb1000, i)
# top 10 of the list
gt_rankings[:10]

[(9, 1.0),
 (879, 0.8318425760286225),
 (665, 0.698905109489051),
 (374, 0.6984834968777877),
 (928, 0.6939338235294118),
 (201, 0.6903914590747331),
 (282, 0.687333923826395),
 (472, 0.6873315363881402),
 (52, 0.6857899382171226),
 (131, 0.6841637010676157)]

we can display all the top ones using the function draw2D created earlier

So that is the Tanimoto Similarity Search using RDKit directly. And now we are wondering whether we can convert a molecule to a graph somehow to perform the similarity search. 

## End of RDKit; Next up: Mol2vec
***