In [15]:
!pip install useful_rdkit_utils mols2grid





In [4]:
import pandas as pd
from rdkit import Chem
import mols2grid
import useful_rdkit_utils as uru
from tqdm.auto import tqdm
from itertools import chain

In this notebook we'll analyze a set of marketed drugs from the ChEMBL database and find the most commonly occurring ring systems. To do this, we'll follow these steps.
1. Read the drugs as SMILES
2. Convert the SMILES to RDKit Molecules
3. Identify the ring systems in the molecules
4. Collect the individual ring systems and count their frequencies

This analysis is similar to the one performed in Taylor, R. D., MacCoss, M., & Lawson, A. D. (2014). [Rings in drugs: Miniperspective](https://pubs.acs.org/doi/10.1021/jm4017625), Journal of Medicinal Chemistry, 57(14), 5845-5859.

Enable progress_apply in Pandas so that the apply calls below display a progress bar

In [5]:
tqdm.pandas()

### 1. Read drugs from ChEMBL as SMILES
Read the ChEMBL drugs from a space-delimited SMILES file, with the SMILES in the first column and the drug name in the second

In [6]:
chembl_drugs_url = "https://raw.githubusercontent.com/PatWalters/datafiles/main/chembl_drugs.smi"
df = pd.read_csv(chembl_drugs_url,sep=" ",names=["SMILES","Name"])
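To see how `sep` and `names` map a line of the file onto dataframe columns, here is a minimal sketch that parses an in-memory file instead of the URL. The two records and the `demo_df` name are made up for illustration, and the layout (SMILES, one space, name) is an assumption based on the call above.

```python
import io

import pandas as pd

# Two hypothetical records in the assumed layout: SMILES, a single space, a name.
smi_text = "c1ccccc1 benzene\nc1ccncc1 pyridine\n"

# The file has no header row, so the column names are supplied explicitly.
demo_df = pd.read_csv(io.StringIO(smi_text), sep=" ", names=["SMILES", "Name"])
print(demo_df)
```

With a single-space separator, a name that itself contains a space would likely not parse into two clean columns, so it is worth spot-checking the dataframe after loading.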

### 2. Convert the SMILES to RDKit Molecules
Add a molecule column to the dataframe

In [7]:
df['mol'] = df.SMILES.progress_apply(Chem.MolFromSmiles)

  0%|          | 0/1203 [00:00<?, ?it/s]

### 3. Identify the ring systems in the molecules
Instantiate a RingSystemFinder object

In [8]:
ring_system_finder = uru.RingSystemFinder()

Find the ring systems in the ChEMBL drugs

In [9]:
df['ring_systems'] = df.mol.progress_apply(ring_system_finder.find_ring_systems)

  0%|          | 0/1203 [00:00<?, ?it/s]

### 4. Collect the individual ring systems and count their frequencies
The ring_systems column in **df** holds one list of ring systems per molecule, so the column as a whole is a list of lists. We need to flatten it so we can count the number of times each ring system occurs. The **chain** function in the itertools package provides a convenient way to do this.

In [10]:
ring_list = chain(*df.ring_systems.values)
ring_list

<itertools.chain at 0x7fee3203ba00>

The **chain** function used above returns an iterator. We can use that iterator to create a Pandas series.

In [11]:
ring_series = pd.Series(ring_list)
ring_series

0                           c1ccccc1
1                           c1ccncc1
2              O=C1CC(=O)NC(=O)[N-]1
3              O=C1C=CC(=O)c2ccccc21
4       O=c1[nH]c(=O)c2[nH]cnc2[nH]1
                    ...             
2453                        c1ccccc1
2454                        c1ccccc1
2455       O=c1[nH]c(=O)c2ccsc2[nH]1
2456                        c1ccnnc1
2457                        c1ccccc1
Length: 2458, dtype: object
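One consequence of working with an iterator is that it can only be consumed once, so after building the series the original `chain` object is empty. The sketch below shows this on a small made-up list of lists (the `per_molecule` data is hypothetical). It uses `chain.from_iterable`, which flattens the same way as `chain(*...)` without unpacking the outer list into arguments.

```python
from itertools import chain

# Hypothetical per-molecule ring-system lists; the empty list stands in
# for a molecule with no rings.
per_molecule = [["c1ccccc1", "c1ccncc1"], [], ["c1ccccc1"]]

flat = chain.from_iterable(per_molecule)

first_pass = list(flat)   # consumes the iterator
second_pass = list(flat)  # nothing left to iterate over

print(first_pass)
print(second_pass)
```

If the flattened values are needed more than once, it is probably simplest to materialize them with `list(...)` first, or to rebuild the iterator.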

Now that we have a Pandas series, we can use the value_counts method to count the occurrences of the different ring systems.

In [12]:
ring_series.value_counts()

c1ccccc1                                                       911
c1ccncc1                                                        94
C1CNCCN1                                                        83
C1CCNCC1                                                        74
C1CC1                                                           48
                                                              ... 
O=C1Nc2cccnc2Nc2ncccc21                                          1
c1nc2c(s1)CCCC2                                                  1
O=C1CCc2ccccc21                                                  1
C=C1C[C@H]2[C@@H]3CCC(=O)[C@H]3CC[C@@H]2[C@H]2C=CC(=O)C=C12      1
c1ccnnc1                                                         1
Length: 415, dtype: int64
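For readers who want to see the counting step without Pandas, the standard library's `collections.Counter` does a comparable tally. This is an alternative shown for illustration, not what the notebook uses, and the short `rings` list is made up.

```python
from collections import Counter

# Hypothetical flattened list of ring-system SMILES.
rings = [
    "c1ccccc1", "c1ccncc1", "c1ccccc1",
    "C1CC1", "c1ccncc1", "c1ccccc1",
]

counts = Counter(rings)

# most_common(n) returns the n most frequent items as (item, count) pairs,
# sorted from most to least frequent.
top_two = counts.most_common(2)
print(top_two)
```

Like `value_counts`, `most_common` orders the results by descending frequency.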

In order to make the **value_counts** output easier to work with, we'll convert it into a dataframe. 

In [13]:
ring_df = pd.DataFrame(ring_series.value_counts()).reset_index()
ring_df.columns = ["SMILES","Count"]
ring_df

| | SMILES | Count |
|---|---|---|
| 0 | c1ccccc1 | 911 |
| 1 | c1ccncc1 | 94 |
| 2 | C1CNCCN1 | 83 |
| 3 | C1CCNCC1 | 74 |
| 4 | C1CC1 | 48 |
| ... | ... | ... |
| 410 | O=C1Nc2cccnc2Nc2ncccc21 | 1 |
| 411 | c1nc2c(s1)CCCC2 | 1 |
| 412 | O=C1CCc2ccccc21 | 1 |
| 413 | C=C1C[C@H]2[C@@H]3CCC(=O)[C@H]3CC[C@@H]2[C@H]2... | 1 |
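The conversion from counts to a two-column dataframe can also be written as a single chained expression. The sketch below is one possible alternative on a small made-up series (`demo_series` and `demo_ring_df` are hypothetical names): `rename_axis` names the index, and `reset_index(name=...)` names the count column, so no separate column assignment is needed.

```python
import pandas as pd

# Hypothetical series of ring-system SMILES.
demo_series = pd.Series(["c1ccccc1", "c1ccncc1", "c1ccccc1"])

demo_ring_df = (
    demo_series.value_counts()   # counts, indexed by SMILES
    .rename_axis("SMILES")       # name the index so it becomes the SMILES column
    .reset_index(name="Count")   # move the index to a column and name the counts
)
print(demo_ring_df)
```

Naming both columns explicitly should make the result independent of the default names that `value_counts` assigns, which differ between Pandas versions.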


Now that we have our results in a dataframe, we can use mols2grid to display the chemical structures of the ring systems along with the associated counts. 

In [14]:
mols2grid.display(ring_df,smiles_col="SMILES",subset=["img","Count"],selection=False)

MolGridWidget()