# Property Distribution of the Mobley Database

_(30.09.19)_

## Aim:

- Introduction to Matplotlib and Pandas.
- Refresher for RDKit.
- Represent property distribution of the Mobley database using histograms.
- Visualise ligands for property extremes.

### Extracting database
Structure-data files (SDF) were downloaded from the FreeSolv GitHub repository (https://github.com/MobleyLab/FreeSolv) and the tar.gz archive was extracted.

In [None]:
!tar -xf sdffiles.tar.gz
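
The same extraction can be done from Python instead of the shell. The following is a minimal sketch using the standard-library `tarfile` module; the helper name `extract_archive` and its `dest_dir` argument are illustrative assumptions, not part of the original workflow.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir='.'):
    """Extract a tar archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode 'r:*' lets tarfile detect the compression (gzip here) itself.
    with tarfile.open(archive_path, 'r:*') as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling `extract_archive('sdffiles.tar.gz')` would then mirror the `tar -xf` command above, assuming the archive sits in the working directory.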

### Calculating descriptors
A dictionary was created with keys for the entry ID and each molecular descriptor, and with empty lists as values.

Mobley IDs and the following molecular descriptors were calculated and appended to the dictionary.
1. Molecular weight (MW)
1. Crippen atom-based octanol-water partition coefficient (AlogP)
1. Number of hydrogen bond acceptors (HBA)
1. Number of hydrogen bond donors (HBD)
1. Number of NH and OH groups (#NH_OH)
1. __Active groups__
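
As a quick sanity check of the RDKit calls behind these descriptors, the sketch below evaluates them for a single molecule built from a SMILES string. Ethanol is an arbitrary test case chosen here for illustration, and the helper name `describe` is an assumption; the notebook itself reads molecules from SDF files.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen, Lipinski, Fragments


def describe(smiles):
    """Return a few of the notebook's descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Could not parse SMILES: {smiles!r}')
    return {
        'MW (Da)': Descriptors.MolWt(mol),
        'AlogP': Crippen.MolLogP(mol),
        'HBA': Lipinski.NumHAcceptors(mol),
        'HBD': Lipinski.NumHDonors(mol),
        '#NH_OH': Lipinski.NHOHCount(mol),
        # One of the fragment counts that feed the active-group total.
        'Aliphatic OH': Fragments.fr_Al_OH(mol),
    }


# Ethanol: one hydroxyl group, so one donor and one acceptor are expected.
ethanol = describe('CCO')
print(ethanol)
```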

In [None]:
from glob import glob
from rdkit.Chem import SDMolSupplier, Descriptors, Crippen, Lipinski, Fragments

# Empty dictionary for ligand IDs and corresponding descriptors.
data = {'Entry ID': [], 'MW (Da)': [], 'AlogP': [], 
        'HBA': [], 'HBD': [], '#NH_OH': [], '#Active groups': []}


def num_active_groups(mol):
    '''Counts the number of active groups in a molecule according to 
    biologically active functional groups.'''
    
    counter = 0
    
    # Carboxylic acids
    counter += int(Fragments.fr_COO(mol))
    # Carbonyl excluding carboxylic acids
    counter += int(Fragments.fr_C_O_noCOO(mol))
    # Primary amines
    counter += int(Fragments.fr_NH2(mol))
    # Secondary amines
    counter += int(Fragments.fr_NH1(mol))
    # Thiols
    counter += int(Fragments.fr_SH(mol))
    # Sulfonamides
    counter += int(Fragments.fr_sulfonamd(mol))
    # Aromatic rings
    counter += int(Lipinski.NumAromaticRings(mol))
    # Aliphatic hydroxy groups
    counter += int(Fragments.fr_Al_OH(mol))
    # Aromatic hydroxy groups
    counter += int(Fragments.fr_Ar_OH(mol))
        
    return counter

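# Location of SDF files.
sdf_dr = '../sdffiles/'

for sdf in glob(sdf_dr + '*.sdf'):
    
    # Mobley ID = file name without the directory prefix and '.sdf' suffix.
    # (str.strip() removes a set of characters rather than a substring,
    # so slicing is used instead.)
    ID = sdf[len(sdf_dr):-len('.sdf')]
    
    suppl = SDMolSupplier(sdf)
    for mol in suppl:
        # Skip records RDKit could not parse so every column stays aligned.
        if mol is None:
            continue
        
        # add ligand Mobley ID and calculated descriptors
        data['Entry ID'].append(ID)
        data['MW (Da)'].append(Descriptors.MolWt(mol))
        data['AlogP'].append(Crippen.MolLogP(mol))
        data['HBA'].append(Lipinski.NumHAcceptors(mol))
        data['HBD'].append(Lipinski.NumHDonors(mol))
        data['#NH_OH'].append(Lipinski.NHOHCount(mol))
        data['#Active groups'].append(num_active_groups(mol))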
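
Deriving the entry ID from a file path deserves some care, because `str.strip()` removes any leading or trailing characters found in its argument rather than a fixed prefix or suffix. The sketch below shows one alternative based on `os.path`; the helper name `entry_id_from_path` and the second file name are hypothetical and used only for illustration.

```python
import os


def entry_id_from_path(sdf_path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(sdf_path))[0]


# A Mobley-style name survives both approaches.
mobley_id = entry_id_from_path('../sdffiles/mobley_1017962.sdf')
print(mobley_id)  # mobley_1017962

# A hypothetical name that starts and ends with characters from the strip set.
tricky = '../sdffiles/sample_dies.sdf'
stripped = tricky.strip('sdffiles/.')
print(stripped)                    # ample_
print(entry_id_from_path(tricky))  # sample_dies
```

The character-set approach happens to work for names of the form `mobley_<digits>`, since neither `m` nor a digit is in the stripped set, but it would silently truncate other names.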

Pandas was used to store the calculated descriptors and to simplify plotting with Matplotlib.

In [None]:
import pandas as pd

df = pd.DataFrame(data)
print(df)
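
Building a DataFrame from a dictionary of lists only works when every list has the same length, so it can help to check this before conversion. The following is a minimal sketch on toy values; the helper name `to_frame` and the three example rows are assumptions made for illustration, not data from the Mobley set.

```python
import pandas as pd


def to_frame(columns):
    """Convert a dict of equal-length lists to a DataFrame, checking lengths first."""
    lengths = {key: len(values) for key, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Unequal column lengths: {lengths}')
    return pd.DataFrame(columns)


# Toy values for illustration only.
toy = {'Entry ID': ['mol_a', 'mol_b', 'mol_c'],
       'MW (Da)': [16.04, 46.07, 78.11],
       'HBD': [0, 1, 0]}

toy_df = to_frame(toy)
# describe() summarises the numeric columns (count, mean, min, max, quartiles).
summary = toy_df.describe()
print(summary)
```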

#### Drug-like chemical space

In [None]:
from matplotlib import pyplot as plt

df.plot(kind='scatter', x='MW (Da)', 
        y='AlogP', color='black', edgecolor='black', s=1)

#### Descriptor histograms
A function was written to plot a histogram for any specified DataFrame column.

In [None]:
def plot_hist(column, bin_range):
    '''Plot a histogram of the df column named column, 
    using bin_range (e.g. a range()) as the bin edges and x-ticks.'''
    
    df[column].plot(kind='hist', bins=bin_range, color='white', 
                    edgecolor='black', xticks=bin_range)
    plt.xlabel(column)
    plt.show()
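
Because the sequence passed as `bin_range` is used as the bin edges, the exclusive upper bound of `range()` matters: `range(0, 500, 50)` ends at 450, and values above the last edge are not counted in the histogram. The sketch below builds edges that are guaranteed to cover a requested upper limit; the helper name `covering_edges` is an assumption, and integer steps are assumed.

```python
import math


def covering_edges(start, stop, step):
    """Return evenly spaced bin edges from start up to the first edge >= stop."""
    n_bins = math.ceil((stop - start) / step)
    return [start + i * step for i in range(n_bins + 1)]


plain = list(range(0, 500, 50))
edges = covering_edges(0, 500, 50)
print(plain[-1])  # 450
print(edges[-1])  # 500
```

A list such as `covering_edges(0, 500, 50)` could be passed wherever a `range()` is used as the bin range.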

A second function was written to display the 2D structure of the ith entry of a DataFrame sorted in ascending order by a specified descriptor.

In [None]:
from rdkit.Chem import Draw
from rdkit.Chem import AllChem

def draw_ith_structure(dataframe, descriptor, ith_position):
    '''Draw the 2D structure at ith_position after sorting dataframe 
    in ascending order by descriptor (0 = lowest, -1 = highest).'''
    
    # Sort DataFrame in ascending order by descriptor column.
    sorted_df = dataframe.sort_values(descriptor)
    sorted_df = sorted_df.reset_index(drop=True)
    
    # Mobley ID and descriptor value of the ith structure in the sorted DataFrame.
    ID = sorted_df['Entry ID'].iloc[ith_position]
    value = sorted_df[descriptor].iloc[ith_position]
    
    # Read the first molecule from the SDF file and generate 2D coordinates.
    suppl = SDMolSupplier(sdf_dr + ID + '.sdf')
    mol = suppl[0]
    AllChem.Compute2DCoords(mol)
    
    # Report the descriptor value of the ith structure.
    print(descriptor + ' = ' + str(round(value, 2)))
    print('Structure:')
    
    return Draw.MolToImage(mol)
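
The sort-then-index step can be tried on its own, without RDKit. The sketch below uses a toy DataFrame with hypothetical entry names and an assumed helper `ith_entry`; it also passes `kind='mergesort'`, a stable sort, so that entries with equal descriptor values keep their original relative order.

```python
import pandas as pd

# Toy values for illustration only.
toy_df = pd.DataFrame({
    'Entry ID': ['mol_a', 'mol_b', 'mol_c'],
    'MW (Da)': [78.11, 16.04, 46.07],
})


def ith_entry(dataframe, descriptor, ith_position):
    """Return (entry ID, value) at ith_position after an ascending sort."""
    ordered = dataframe.sort_values(descriptor, kind='mergesort')
    ordered = ordered.reset_index(drop=True)
    entry = ordered['Entry ID'].iloc[ith_position]
    value = ordered[descriptor].iloc[ith_position]
    return entry, value


lowest = ith_entry(toy_df, 'MW (Da)', 0)
highest = ith_entry(toy_df, 'MW (Da)', -1)
print(lowest, highest)
```

Position `0` gives the smallest value and `-1` the largest; for count-based descriptors with many tied values, several positions share the same extreme.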

#### 1. Molecular weight

In [None]:
plot_hist('MW (Da)', range(0, 500, 50))

In [None]:
print('Lowest MW')
draw_ith_structure(df, 'MW (Da)', 0)

In [None]:
print('Highest MW')
draw_ith_structure(df, 'MW (Da)', -1)

#### 2. AlogP

In [None]:
plot_hist('AlogP', range(-4, 10, 1))

In [None]:
print('Lowest AlogP')
draw_ith_structure(df, 'AlogP', 0)

In [None]:
print('Highest AlogP')
draw_ith_structure(df, 'AlogP', -1)

#### 3. H-bond acceptors

In [None]:
plot_hist('HBA', range(0, 10, 1))

In [None]:
print('Lowest HBA')
draw_ith_structure(df, 'HBA', 0)

In [None]:
print('Highest HBA')
draw_ith_structure(df, 'HBA', -1)

#### 4. H-bond donors

In [None]:
plot_hist('HBD', range(0, 7, 1))

In [None]:
print('Lowest HBD')
draw_ith_structure(df, 'HBD', 2)

In [None]:
print('Highest HBD')
draw_ith_structure(df, 'HBD', -1)

#### 5. Number of NH and OH

In [None]:
plot_hist('#NH_OH', range(0, 7, 1))

In [None]:
print('Lowest #NH_OH')
draw_ith_structure(df, '#NH_OH', 2)

In [None]:
print('Highest #NH_OH')
draw_ith_structure(df, '#NH_OH', -1)

#### 6. Number of biologically active functional groups

In [None]:
plot_hist('#Active groups', range(0, 7, 1))

In [None]:
print('Lowest #Active groups')
draw_ith_structure(df, '#Active groups', 3)

In [None]:
print('Highest #Active groups')
draw_ith_structure(df, '#Active groups', -1)

__Group 1__
- Linear alkanes
- Branched alkanes
- Linear alkenes
- Branched alkenes
- "with various polar functional groups"

__Group 2__ 
- Methoxyphenol
- Guaiacol
- Chlorinated derivatives
- "other rather similar compounds"

__Group 3__
- Cyclohexane derivatives
- "often also with an attached oxygen or hydroxyl"

__Group 4__
- Anthracene derivatives
- "many of which have attached polar functional groups"

__Group 5__
- Polyfunctional
- "other compounds"
- Largest compounds
- Most flexible compounds

"In some past SAMPLs, specific functional groups or classes of molecules have proven particularly challenging. In an anticipation that this might also be true here, we conceptually divided the set into five groups. Group 1 consists of linear or branched alkanes or alkenes with various polar functional groups. Group 2 consists of methoxyphenol and guaiacol, chlorinated derivatives, and other rather similar compounds. Group 3 consists of cyclohexane derivatives, often also with an attached oxygen or hydroxyl. Group 4 consists of anthracene derivatives, many of which have attached polar functional groups. And group 5 is polyfunctional or other compounds, and contains the largest and most flexible compounds in the set."

D. L. Mobley, K. L. Wymer, N. M. Lim, J. P. Guthrie, J. Comput. Aided Mol. Des., 2014, 28, 135–150.