A primer on using [RDKit](https://www.rdkit.org/)
---

RDKit is a popular package for cheminformatics calculations and manipulations. The [documentation](https://rdkit.org/docs/index.html) contains many examples, recipes, and links to other resources, and a full history is given [here](https://rdkit.org/docs/Overview.html#open-source-toolkit-for-cheminformatics). Because of its popularity, many other open-source tools use RDKit and related or compatible data structures.

The [RDKit book](https://rdkit.org/docs/RDKit_Book.html) explains conventions, approaches, and features.

These notes focus on using RDKit from Python and cover only portions of the toolkit. The [cookbook](https://rdkit.org/docs/Cookbook.html) has many recipes, several of which are used below. RDKit provides attribution for the original examples, which may be modified here.

In [None]:
!pip install rdkit

In [None]:
import rdkit 
rdkit.__version__

# The Basics

There are two basic languages for describing molecules and molecular patterns: [SMILES](https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html) and [SMARTS](https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html).


## SMILES

* The original SMILES specification was developed by a company called Daylight; their description is [here](https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html).

* The [OpenSMILES](http://opensmiles.org/) specification is described [here](http://opensmiles.org/opensmiles.html).

* Wikipedia's description is [here](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system).

In [None]:
from rdkit import Chem
smi = "CN1C(NC2=NC=CC=C2)=CC=C1"
mol = Chem.MolFromSmiles(smi)
print(Chem.MolToSmiles(mol))

In [None]:
Chem.Kekulize(mol) # Kekule SMILES
print(Chem.MolToSmiles(mol, kekuleSmiles=True))

In [None]:
?Chem.Kekulize

In [None]:
# You can detect chemistry problems automatically
from rdkit import Chem
m = Chem.MolFromSmiles('CN(C)(C)C', sanitize=False)
problems = Chem.DetectChemistryProblems(m)

In [None]:
problems[0].GetType()

In [None]:
problems[0].GetAtomIdx()

In [None]:
problems[0].Message()

In [None]:
m = Chem.MolFromSmiles('CN(C)(C)C', sanitize=True)

In [None]:
type(m) # NoneType: sanitization failed, so MolFromSmiles returned None instead of an invalid molecule
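Because `Chem.MolFromSmiles` returns `None` when sanitization fails, it is worth checking for that before using the result. The following is a minimal sketch of one way to do so; the helper name `parse_smiles` is hypothetical and not part of RDKit.

```python
from rdkit import Chem

def parse_smiles(smi):
    """Return (mol, problems): a sanitized Mol or None, plus any problem messages.

    Hypothetical helper for illustration; not an RDKit function.
    """
    mol = Chem.MolFromSmiles(smi)  # None if parsing or sanitization fails
    if mol is not None:
        return mol, []
    # Re-parse without sanitization to find out what went wrong
    raw = Chem.MolFromSmiles(smi, sanitize=False)
    if raw is None:
        return None, ["SMILES could not be parsed"]
    return None, [p.Message() for p in Chem.DetectChemistryProblems(raw)]

# Neutral nitrogen with four bonds: rejected by sanitization
mol, problems = parse_smiles("CN(C)(C)C")
print(mol, problems)

# The charged form is valid
mol, problems = parse_smiles("C[N+](C)(C)C")
print(Chem.MolToSmiles(mol), problems)
```

Returning the problem messages alongside `None` keeps the reason for the failure available to the caller instead of only in RDKit's log output.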

## SMARTS
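Where a SMILES string describes one molecule, a SMARTS pattern describes a set of atoms and bonds to search for. The sketch below uses only `Chem.MolFromSmarts` and the substructure-matching methods that appear later in these notes; the example molecule (aspirin) and the two patterns are illustrative choices, not taken from the cookbook.

```python
from rdkit import Chem

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Carboxylic acid: three-connected carbon, double-bonded O, and a hydroxyl O
acid = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")
# Ester: same carbonyl, but the single-bonded O carries another carbon
ester = Chem.MolFromSmarts("[CX3](=O)[OX2][#6]")

# HasSubstructMatch answers yes/no; GetSubstructMatches returns atom indices
print(aspirin.HasSubstructMatch(acid))
print(aspirin.GetSubstructMatches(acid))
print(aspirin.GetSubstructMatches(ester))
```

Primitives such as `X3` (total connections) and `H1` (attached hydrogens) are what distinguish the acid from the ester here; a plain `C(=O)O` pattern would match both groups.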

# Visualization
Let's look at some existing molecules and how to manipulate them.

## Drawing a molecule

### Option 1: Map index to atom location directly

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True  #< set this to False if you want PNGs instead of SVGs

In [None]:
def mol_with_atom_index(mol):
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(atom.GetIdx())
    return mol

In [None]:
# Test on a kinase inhibitor
mol = Chem.MolFromSmiles("C1CC2=C3C(=CC=C2)C(=CN3C1)[C@H]4[C@@H](C(=O)NC4=O)C5=CNC6=CC=CC=C65")
# Default
mol

In [None]:
# With atom index
mol_with_atom_index(mol)

### Option 2: Using Jupyter display settings directly (better)

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.molSize = 300,300

In [None]:
mol = Chem.MolFromSmiles("C1CC2=C3C(=CC=C2)C(=CN3C1)[C@H]4[C@@H](C(=O)NC4=O)C5=CNC6=CC=CC=C65")
mol

In [None]:
IPythonConsole.drawOptions.addAtomIndices = False

## Display calculated features and annotations

In [None]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.molSize = 250,250

In [None]:
m = Chem.MolFromSmiles('c1ncncc1C(=O)[O-]')
AllChem.ComputeGasteigerCharges(m)
m

In [None]:
# The 'atomNote' property lets you add arbitrary annotations
m2 = Chem.Mol(m)
for at in m2.GetAtoms():
    lbl = '%.2f'%(at.GetDoubleProp("_GasteigerCharge"))
    at.SetProp('atomNote',lbl)
m2

In [None]:
# You can combine this with other annotations - by default they are separated by a comma
IPythonConsole.drawOptions.addAtomIndices = True
m2

In [None]:
IPythonConsole.drawOptions.addAtomIndices = False

## Stereochemistry

In [None]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = False
IPythonConsole.drawOptions.addStereoAnnotation = True

In [None]:
# Default Representation uses legacy FindMolChiralCenters() code
m1 = Chem.MolFromSmiles('C1CC1[C@H](F)C1CCC1')
m2 = Chem.MolFromSmiles('F[C@H]1CC[C@H](O)CC1')
Draw.MolsToGridImage((m1,m2), subImgSize=(250,250))

In [None]:
# new stereochemistry code with more accurate CIP labels, 2020.09 release
from rdkit.Chem import rdCIPLabeler
rdCIPLabeler.AssignCIPLabels(m1)
rdCIPLabeler.AssignCIPLabels(m2)
Draw.MolsToGridImage((m1,m2), subImgSize=(250,250))

## Highlighting substructures

In [None]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = True

In [None]:
m = Chem.MolFromSmiles('c1cc(C(=O)O)c(OC(=O)C)cc1')
substructure = Chem.MolFromSmarts('C(=O)O')

In [None]:
Draw.MolsToGridImage((m, substructure))

In [None]:
subs = m.GetSubstructMatches(substructure)
# ?m.GetSubstructMatches
# Returns tuples of the indices of the molecule's atoms that match a substructure query.

In [None]:
subs

In [None]:
m # The structure is colored in-place

In [None]:
Chem.MolFragmentToSmiles(m, subs[0])

In [None]:
Chem.MolFragmentToSmiles(m, subs[1])

In [None]:
# You can also manually set the atoms that should be highlighted:
m.__sssAtoms = [0,1,2,6,11,12]
m

In [None]:
IPythonConsole.drawOptions.addAtomIndices = False

## Highlighting differences between molecules

In [None]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdFMCS
from rdkit.Chem.Draw import rdDepictor
rdDepictor.SetPreferCoordGen(True)
IPythonConsole.drawOptions.minFontSize=20

In [None]:
mol1 = Chem.MolFromSmiles('FC1=CC=C2C(=C1)C=NN2')
mol2 = Chem.MolFromSmiles('CCC1=C2NN=CC2=CC(Cl)=C1')

In [None]:
Draw.MolsToGridImage([mol1, mol2])

In [None]:
# This function will find the maximum common substructure
# See documentation for more details: https://rdkit.org/docs/GettingStartedInPython.html#maximum-common-substructure

def view_difference(mol1, mol2):
    mcs = rdFMCS.FindMCS([mol1,mol2])
    mcs_mol = Chem.MolFromSmarts(mcs.smartsString)

    match1 = mol1.GetSubstructMatch(mcs_mol)
    target_atm1 = []
    for atom in mol1.GetAtoms():
        if atom.GetIdx() not in match1:
            target_atm1.append(atom.GetIdx())

    match2 = mol2.GetSubstructMatch(mcs_mol)
    target_atm2 = []
    for atom in mol2.GetAtoms():
        if atom.GetIdx() not in match2:
            target_atm2.append(atom.GetIdx())

    return Draw.MolsToGridImage([mol1, mol2], highlightAtomLists=[target_atm1, target_atm2])

In [None]:
view_difference(mol1,mol2)

In [None]:
rdDepictor.SetPreferCoordGen(False)

## Functional group abbreviations

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import rdAbbreviations

In [None]:
m = Chem.MolFromSmiles('COc1ccc(C(=O)[O-])cc1')
m

In [None]:
abbrevs = rdAbbreviations.GetDefaultAbbreviations()
nm = rdAbbreviations.CondenseMolAbbreviations(m,abbrevs)
nm

In [None]:
# abbreviations that cover more than 40% of the molecule won't be applied by default
m = Chem.MolFromSmiles('c1c[nH]cc1C(F)(F)F')
nm1 = rdAbbreviations.CondenseMolAbbreviations(m,abbrevs)
nm2 = rdAbbreviations.CondenseMolAbbreviations(m,abbrevs,maxCoverage=0.8)
Draw.MolsToGridImage((m,nm1,nm2),legends=('','default','maxCoverage=0.8'))

In [None]:
# See available abbreviations
abbrevs = rdAbbreviations.GetDefaultAbbreviations()
for a in abbrevs:
    print(a.label)

## Representing large molecules

> "Some molecules like macrocycles are not represented well using the default RDKit drawing code. As a result, it may be preferable to use the CoordGen integration."

You can set this preference globally like:
~~~python
from rdkit.Chem.Draw import rdDepictor
rdDepictor.SetPreferCoordGen(True)
~~~

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.molSize = 350,300
from rdkit.Chem import Draw

In [None]:
# default drawing
mol = Chem.MolFromSmiles("C/C=C/CC(C)C(O)C1C(=O)NC(CC)C(=O)N(C)CC(=O)N(C)C(CC(C)C)C(=O)NC(C(C)C)C(=O)N(C)C(CC(C)C)C(=O)NC(C)C(=O)NC(C)C(=O)N(C)C(CC(C)C)C(=O)N(C)C(CC(C)C)C(=O)N(C)C(C(C)C)C(=O)N1C")
mol

In [None]:
from rdkit.Chem import rdCoordGen
rdCoordGen.AddCoords(mol)
mol

## Integrating with matplotlib

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole

In [None]:
x = np.arange(0, 180, 1)     # angles in degrees
y = np.sin(np.deg2rad(x))    # np.sin expects radians

In [None]:
mol = Chem.MolFromSmiles('C1CNCCC1C(=O)C')
im = Chem.Draw.MolToImage(mol) # Convert to image

In [None]:
type(im)

In [None]:
fig = plt.figure(figsize=(10,5))
plt.plot(x, y)
plt.ylim(-1, 5)
ax = plt.axes([0.6, 0.47, 0.38, 0.38], frameon=True)
ax.imshow(im) # Use imshow to display
ax.axis('off')

# Identifying features
Let's look at what features a molecule has, including properties of the atoms that make up the molecule.

## Counting ring systems

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole

In [None]:
def GetRingSystems(mol, includeSpiro=False):
    ri = mol.GetRingInfo()
    systems = []
    for ring in ri.AtomRings():
        ringAts = set(ring)
        nSystems = []
        for system in systems:
            nInCommon = len(ringAts.intersection(system))
            if nInCommon and (includeSpiro or nInCommon>1):
                ringAts = ringAts.union(system)
            else:
                nSystems.append(system)
        nSystems.append(ringAts)
        systems = nSystems
    return systems

In [None]:
mol = Chem.MolFromSmiles('CN1C(=O)CN=C(C2=C1C=CC(=C2)Cl)C3=CC=CC=C3')
print(GetRingSystems(mol))

In [None]:
IPythonConsole.drawOptions.addAtomIndices = True
mol

In [None]:
IPythonConsole.drawOptions.addAtomIndices = False
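To see what `includeSpiro` changes, the same merging logic can be run on plain tuples of atom indices, without a molecule. This is a sketch only: `merge_rings` is a hypothetical stand-in for `GetRingSystems`, and the ring tuples are written by hand rather than taken from `GetRingInfo()`.

```python
def merge_rings(rings, include_spiro=False):
    """Merge rings (tuples of atom indices) into ring systems (sets)."""
    systems = []
    for ring in rings:
        ring_atoms = set(ring)
        kept = []
        for system in systems:
            n_common = len(ring_atoms & system)
            # Fused rings share 2+ atoms; spiro rings share exactly 1
            if n_common and (include_spiro or n_common > 1):
                ring_atoms |= system  # absorb the existing system
            else:
                kept.append(system)
        kept.append(ring_atoms)
        systems = kept
    return systems

fused = [(0, 1, 2, 3, 4, 5), (4, 5, 6, 7, 8, 9)]  # rings share atoms 4 and 5
spiro = [(0, 1, 2, 3, 4), (4, 5, 6, 7, 8)]        # rings share only atom 4

print(len(merge_rings(fused)))                      # 1
print(len(merge_rings(spiro)))                      # 2
print(len(merge_rings(spiro, include_spiro=True)))  # 1
```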

## Identify aromatic rings

See the [aromaticity section](https://rdkit.org/docs/RDKit_Book.html#aromaticity) of the RDKit Book for how RDKit defines aromaticity.

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.addBondIndices = True
m = Chem.MolFromSmiles('c1cccc2c1CCCC2')
m

In [None]:
ri = m.GetRingInfo()
# You can interrogate the RingInfo object to tell you the atoms that make up each ring:
print(ri.AtomRings())

In [None]:
# or the bonds that make up each ring:
print(ri.BondRings())

In [None]:
# To detect aromatic rings, loop over the bonds in each ring and
# flag the ring as aromatic if all bonds are aromatic:
def isRingAromatic(mol, bondRing):
    for bond_idx in bondRing:
        if not mol.GetBondWithIdx(bond_idx).GetIsAromatic():
            return False
    return True

In [None]:
print(isRingAromatic(m, ri.BondRings()[0]))

In [None]:
print(isRingAromatic(m, ri.BondRings()[1]))
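The per-ring check can also be applied to every ring at once. The sketch below is one possible compact form; it assumes that `AtomRings()` and `BondRings()` list the rings in the same order, so that the two tuples can be paired with `zip`.

```python
from rdkit import Chem

# Tetralin: one aromatic ring fused to one saturated ring
m = Chem.MolFromSmiles('c1cccc2c1CCCC2')
ri = m.GetRingInfo()

# Keep the atom indices of each ring whose bonds are all aromatic
# (assumes AtomRings() and BondRings() share the same ring ordering)
aromatic_rings = [
    atoms
    for atoms, bonds in zip(ri.AtomRings(), ri.BondRings())
    if all(m.GetBondWithIdx(idx).GetIsAromatic() for idx in bonds)
]

print(ri.NumRings(), len(aromatic_rings))
print(aromatic_rings)
```

The saturated ring contains the shared aromatic bond, but it is still rejected because its other bonds are not aromatic.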

## Identify atom properties

### Direct method

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.drawOptions.addAtomIndices = True

In [None]:
mol = Chem.MolFromSmiles("c1ccccc1C=CCC")
aromatic_carbon = Chem.MolFromSmarts("c")
print(mol.GetSubstructMatches(aromatic_carbon))

In [None]:
mol

In [None]:
# The RDKit includes a SMARTS extension that allows hybridization queries,
# here we query for SP2 aliphatic carbons:
olefinic_carbon = Chem.MolFromSmarts("[C^2]")
print(mol.GetSubstructMatches(olefinic_carbon))

In [None]:
IPythonConsole.drawOptions.addAtomIndices = False

### Using rdqueries

See the [rdqueries](https://rdkit.org/docs/source/rdkit.Chem.rdqueries.html?highlight=rdqueries) module for more examples of how to use these query tools.

In [None]:
from rdkit.Chem import rdqueries

In [None]:
mol = Chem.MolFromSmiles("c1ccccc1C=CCC")
q = rdqueries.IsAromaticQueryAtom()
print([x.GetIdx() for x in mol.GetAtomsMatchingQuery(q)])

In [None]:
q = rdqueries.HybridizationEqualsQueryAtom(Chem.HybridizationType.SP2)
print([x.GetIdx() for x in mol.GetAtomsMatchingQuery(q)])

In [None]:
qcombined = rdqueries.IsAliphaticQueryAtom()
qcombined.ExpandQuery(q)
print([x.GetIdx() for x in mol.GetAtomsMatchingQuery(qcombined)])

## Substructure matching

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = True

sucrose = "C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O[C@]2([C@H]([C@@H]([C@H](O2)CO)O)O)CO)O)O)O)O"
sucrose_mol = Chem.MolFromSmiles(sucrose)
primary_alcohol = Chem.MolFromSmarts("[CH2][OH1]")

sucrose_mol

In [None]:
primary_alcohol

In [None]:
print(sucrose_mol.GetSubstructMatches(primary_alcohol))

In [None]:
secondary_alcohol = Chem.MolFromSmarts("[CH1][OH1]")
print(sucrose_mol.GetSubstructMatches(secondary_alcohol))

In [None]:
# Macrocycles
erythromycin = Chem.MolFromSmiles("CC[C@@H]1[C@@]([C@@H]([C@H](C(=O)[C@@H](C[C@@]([C@@H]([C@H]([C@@H]([C@H](C(=O)O1)C)O[C@H]2C[C@@]([C@H]([C@@H](O2)C)O)(C)OC)C)O[C@H]3[C@@H]([C@H](C[C@H](O3)C)N(C)C)O)(C)O)C)C)O)(C)O")
erythromycin

In [None]:
# Define a SMARTS pattern for atoms in rings of size 12 or larger
# This is an RDKit SMARTS extension
macro = Chem.MolFromSmarts("[r{12-}]")
print(erythromycin.GetSubstructMatches(macro))

In [None]:
erythromycin

In [None]:
# Match only within a fragment

# Define a function where matches are contained in a single fragment
def fragsearch(m,p):
    matches = [set(x) for x in m.GetSubstructMatches(p)]
    frags = [set(y) for y in Chem.GetMolFrags(m)] # atom indices of each disconnected fragment
    for frag in frags:
        for match in matches:
            if match.issubset(frag):
                return match
    return False

In [None]:
p = Chem.MolFromSmarts('O.N')
m1 = Chem.MolFromSmiles('OCCCN.CCC') # Set of 2 molecules
m2 = Chem.MolFromSmiles('OCCC.CCCN') # Set of another 2 molecules

Draw.MolsToGridImage((p, m1, m2))

In [None]:
print(m1.HasSubstructMatch(p), fragsearch(m1, p))

In [None]:
print(m2.HasSubstructMatch(p), fragsearch(m2, p))

In [None]:
IPythonConsole.drawOptions.addAtomIndices = False

# Manipulating molecules
Now, let's look at how to make your own molecules and bonds from scratch.

## Creating a molecule
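One way to build a molecule from scratch is to add atoms and bonds to an editable `Chem.RWMol`. The following is a minimal sketch; the target molecule (ethanol) and the variable names are illustrative choices.

```python
from rdkit import Chem

# Build ethanol (CCO) atom by atom
rw = Chem.RWMol()
c1 = rw.AddAtom(Chem.Atom(6))  # AddAtom returns the index of the new atom
c2 = rw.AddAtom(Chem.Atom(6))
o = rw.AddAtom(Chem.Atom(8))
rw.AddBond(c1, c2, Chem.BondType.SINGLE)
rw.AddBond(c2, o, Chem.BondType.SINGLE)

mol = rw.GetMol()      # convert the editable molecule to a regular Mol
Chem.SanitizeMol(mol)  # compute valences and implicit hydrogens
print(Chem.MolToSmiles(mol))
```

Calling `Chem.SanitizeMol` after editing matters: until then the molecule carries no computed valence or implicit-hydrogen information.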

## Fragmenting a molecule

### Fragmentation tools

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole, MolsToGridImage
IPythonConsole.drawOptions.addBondIndices = True
IPythonConsole.drawOptions.addAtomIndices = True

# I have put explicit bonds in the SMILES definition to facilitate comprehension:
mol = Chem.MolFromSmiles("O-C-C-C-C-N")
mol1 = Chem.Mol(mol)
mol2 = Chem.Mol(mol)
mol1

In [None]:
# Chem.FragmentOnBonds() will fragment all specified bond indices at once, and return a single molecule
# with all specified cuts applied. By default, addDummies=True, so empty valences are filled with dummy atoms:
mol1_f = Chem.FragmentOnBonds(mol1, (0, 2, 4))
mol1_f

In [None]:
# This molecule can be split into individual fragments using Chem.GetMolFrags():
MolsToGridImage(Chem.GetMolFrags(mol1_f, asMols=True))

In [None]:
# Chem.FragmentOnSomeBonds() will fragment according to all combinations of numToBreak bonds at a time
# (numToBreak defaults to 1), and return a tuple of molecules with numToBreak cuts applied. By default,
# addDummies=True, so empty valences are filled with dummy atoms:
mol2_f_tuple = Chem.FragmentOnSomeBonds(mol2, (0, 2, 4), numToBreak=1)

In [None]:
Draw.MolsToGridImage(Chem.GetMolFrags(mol2_f_tuple[0], asMols=True), subImgSize=(250,250))

In [None]:
Draw.MolsToGridImage(Chem.GetMolFrags(mol2_f_tuple[1], asMols=True), subImgSize=(250,250))

In [None]:
Draw.MolsToGridImage(Chem.GetMolFrags(mol2_f_tuple[2], asMols=True), subImgSize=(250,250))

### Manually remove bonds

In [None]:
# Finally, you can manually cut bonds using Chem.RWMol.RemoveBond():
with Chem.RWMol(mol) as rwmol:
  for b_idx in [0, 2, 4]:
    b = rwmol.GetBondWithIdx(b_idx)
    rwmol.RemoveBond(b.GetBeginAtomIdx(), b.GetEndAtomIdx())
# And then call Chem.GetMolFrags() to get sanitized fragments where empty valences were filled with implicit hydrogens:
MolsToGridImage(Chem.GetMolFrags(rwmol, asMols=True))

In [None]:
IPythonConsole.drawOptions.addBondIndices = False
IPythonConsole.drawOptions.addAtomIndices = False

## Enumerating products / functionalized results

In [None]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem

from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = True

In [None]:
# Replace sidechains on a core and enumerate the combinations
core = '*c1c(C)cccc1(O)'
core_mol = Chem.MolFromSmiles(core)
chain = 'CN*'
chain_mol = Chem.MolFromSmiles(chain)

In [None]:
Draw.MolsToGridImage((core_mol, chain_mol), subImgSize=(250,250))

In [None]:
rxn = AllChem.ReactionFromSmarts('[c:1][#0].[#0][*:2]>>[c:1]-[*:2]')
reacts = (core_mol, chain_mol)
products = rxn.RunReactants(reacts) # tuple
print(len(products))

In [None]:
print(Chem.MolToSmiles(products[0][0]))

In [None]:
products[0][0]

In [None]:
# The above reaction-based approach is flexible, however if you can generate your
# sidechains in such a way that the atom you want to attach to the core
# is the first one (atom zero), there's a somewhat easier way to do this
# kind of simple replacement:

products = Chem.ReplaceSubstructs(core_mol, Chem.MolFromSmarts('[#0]'), chain_mol) # tuple
print(Chem.MolToSmiles(products[0]))

In [None]:
products[0]

In [None]:
# Here is an example in a loop for an imidazolium core with alkyl chains

core_mol = Chem.MolFromSmiles('*[n+]1cc[nH]c1')
chains = ['C','CC','CCC','CCCC','CCCCC','CCCCCC']
chainMols = [Chem.MolFromSmiles(chain) for chain in chains]

product_smi = []
for chainMol in chainMols:
    product_mol = Chem.ReplaceSubstructs(core_mol, Chem.MolFromSmarts('[#0]'), chainMol)
    product_smi.append(Chem.MolToSmiles(product_mol[0]))
print(product_smi)

In [None]:
# View the enumerated molecules:
Draw.MolsToGridImage([Chem.MolFromSmiles(smi) for smi in product_smi])

In [None]:
IPythonConsole.drawOptions.addAtomIndices = False

# Descriptors

## Molecule Hash Strings

In [None]:
from rdkit import Chem
from rdkit.Chem import rdMolHash
from rdkit.Chem.Draw import IPythonConsole

IPythonConsole.drawOptions.addAtomIndices = True

In [None]:
s = Chem.MolFromSmiles('CC(C(C1=CC(=C(C=C1)O)O)O)N(C)C(=O)OCC2=CC=CC=C2')
s

In [None]:
# View all of the MolHash hash function types via the names attribute.
molhashf = rdMolHash.HashFunction.names
molhashf

In [None]:
# Generate MolHashes for molecule 's' with all defined hash functions.
for i, j in molhashf.items():
    print(i, rdMolHash.MolHash(s, j))

## Wiener index

The [Wiener index](https://en.wikipedia.org/wiki/Wiener_index) is a topological index for a molecule, defined as the sum of the lengths of the shortest paths between all pairs of vertices in the chemical graph representing the non-hydrogen atoms in the molecule.

In [None]:
from rdkit import Chem
def wiener_index(m):
    res = 0
    amat = Chem.GetDistanceMatrix(m)
    num_atoms = m.GetNumAtoms()
    for i in range(num_atoms):
        for j in range(i+1,num_atoms):
            res += amat[i][j]
    return res

In [None]:
butane = Chem.MolFromSmiles('CCCC')
print(wiener_index(butane))

In [None]:
Chem.GetDistanceMatrix(butane)

In [None]:
butane

In [None]:
isobutane = Chem.MolFromSmiles('CC(C)C')
print(wiener_index(isobutane))

In [None]:
Chem.GetDistanceMatrix(isobutane)

In [None]:
isobutane
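As a cross-check that does not depend on RDKit, the same sum can be computed with a breadth-first search over a hand-written adjacency list of the heavy atoms. This is a sketch; the function name `wiener_from_adjacency` and the adjacency dictionaries are illustrative.

```python
from collections import deque

def wiener_from_adjacency(adj):
    """Sum of shortest-path lengths over all unordered pairs of nodes."""
    total = 0
    for start in adj:
        # Breadth-first search gives shortest path lengths from `start`
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # each pair was counted once from each end

# Heavy-atom graphs: a 4-atom chain and a 3-armed star
butane_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane_adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_from_adjacency(butane_adj))     # 10
print(wiener_from_adjacency(isobutane_adj))  # 9
```

The branched isomer has the smaller index because its atoms are, on average, closer together in the graph.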

## Tanimoto similarity
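The Tanimoto coefficient between two bit-vector fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows one common way to compute it in RDKit. The choice of Morgan fingerprints with radius 2 and 2048 bits, the example molecules, and the helper `tanimoto_from_bits` are assumptions made for illustration.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def tanimoto_from_bits(fp_a, fp_b):
    """Tanimoto coefficient computed by hand from the sets of on-bits."""
    a, b = set(fp_a.GetOnBits()), set(fp_b.GetOnBits())
    return len(a & b) / len(a | b)

smiles = {"butane": "CCCC", "isobutane": "CC(C)C", "benzene": "c1ccccc1"}

# Morgan (circular) fingerprints; radius and size are illustrative choices
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = {name: gen.GetFingerprint(Chem.MolFromSmiles(smi))
       for name, smi in smiles.items()}

# Compare every pair once, using RDKit's function and the manual version
names = list(fps)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = DataStructs.TanimotoSimilarity(fps[a], fps[b])
        by_hand = tanimoto_from_bits(fps[a], fps[b])
        print(f"{a} vs {b}: {sim:.2f} (by hand: {by_hand:.2f})")
```

The value depends on the fingerprint type and its parameters, so similarities are only comparable when computed with the same settings.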