# Part 7: Filter generated compounds by Lipinski's rule of five

The novel compounds should also be drug-like. The generated molecules are therefore marked according to whether they fulfill Lipinski's rule of five (Ro5), which indicates whether they are likely to be orally bioavailable. This is done in the same way as in part 2 with the ChEMBL data.

Import required libraries

In [None]:
from pathlib import Path
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import matplotlib.patches as mpatches
from rdkit import Chem
from rdkit.Chem import Descriptors, Draw, PandasTools

Set path to this notebook

In [None]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

## Investigate compliance with Ro5

Define a function to test whether a molecule fulfills Lipinski's rule of five

In [None]:
def calculate_ro5_properties(smiles):
    """
    Test if input molecule (SMILES) fulfills Lipinski's rule of five.

    Parameters
    ----------
    smiles : str
        SMILES for a molecule.

    Returns
    -------
    pandas.Series
        Molecular weight, number of hydrogen bond acceptors and donors, logP value,
        and Lipinski's rule of five compliance for the input molecule.
    """
    # RDKit molecule from SMILES
    molecule = Chem.MolFromSmiles(smiles)
    # Calculate Ro5-relevant chemical properties
    molecular_weight = Descriptors.ExactMolWt(molecule)
    n_hba = Descriptors.NumHAcceptors(molecule)
    n_hbd = Descriptors.NumHDonors(molecule)
    logp = Descriptors.MolLogP(molecule)
    # Check if Ro5 conditions fulfilled
    conditions = [molecular_weight <= 500, n_hba <= 10, n_hbd <= 5, logp <= 5]
    # Compliant if no more than one out of four conditions is violated
    ro5_fulfilled = sum(conditions) >= 3
    # Return the properties together with the Ro5 compliance flag
    return pd.Series(
        [molecular_weight, n_hba, n_hbd, logp, ro5_fulfilled],
        index=["molecular_weight", "n_hba", "n_hbd", "logp", "ro5_fulfilled"],
    )
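
The decision rule used above (at most one of the four conditions may be violated) can be illustrated without RDKit. The following is a minimal sketch on made-up property values; the names `RO5_LIMITS`, `violated_ro5_rules` and `is_ro5_compliant` are hypothetical helpers introduced here for illustration and are not part of the notebook.

```python
# Thresholds copied from the conditions in calculate_ro5_properties.
RO5_LIMITS = {"molecular_weight": 500, "n_hba": 10, "n_hbd": 5, "logp": 5}


def violated_ro5_rules(properties):
    """Return the names of the Ro5 properties that exceed their limit."""
    return [name for name, limit in RO5_LIMITS.items() if properties[name] > limit]


def is_ro5_compliant(properties):
    """Compliant if no more than one of the four conditions is violated."""
    return len(violated_ro5_rules(properties)) <= 1


# Made-up property values, for illustration only.
one_violation = {"molecular_weight": 520.0, "n_hba": 6, "n_hbd": 2, "logp": 3.1}
two_violations = {"molecular_weight": 520.0, "n_hba": 6, "n_hbd": 2, "logp": 5.8}

print(violated_ro5_rules(one_violation), is_ro5_compliant(one_violation))
print(violated_ro5_rules(two_violations), is_ro5_compliant(two_violations))
```

Counting violations with `<= 1` is equivalent to `sum(conditions) >= 3` in the function above. Listing the violated properties by name may also help when flagged compounds are inspected manually.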

Read the list of novel generated compounds

In [None]:
molecules = pd.read_csv(DATA / "generated.tsv", sep="\t")
print(molecules.shape)
molecules.head()

Apply function to all molecules in the dataset

In [None]:
ro5_properties = molecules["SMILES"].apply(calculate_ro5_properties)
ro5_properties.head()

In [None]:
molecules = pd.concat([molecules, ro5_properties], axis=1)
molecules.head()

Separate the compounds that violate the Ro5 from those that fulfill it

In [None]:
molecules_ro5_fulfilled = molecules[molecules["ro5_fulfilled"]]
molecules_ro5_violated = molecules[~molecules["ro5_fulfilled"]]

print(f"# compounds in unfiltered data set: {molecules.shape[0]}")
print(f"# compounds in filtered data set: {molecules_ro5_fulfilled.shape[0]}")
print(f"# compounds not compliant with the Ro5: {molecules_ro5_violated.shape[0]}")
print(f"Percentage of compounds not compliant with the Ro5: {molecules_ro5_violated.shape[0] / molecules.shape[0] * 100:.1f}%")

61 (6.1%) of the generated compounds violate more than one rule of the Ro5, which makes them less likely to be orally bioavailable. However, as discussed in part 2, the Ro5 is only a rule of thumb. For that reason, these compounds are not removed from the list but only marked. Further research can examine them manually.

Save the molecules, including the Ro5 compliance flag, to a CSV file

In [None]:
molecules.to_csv(DATA / "generated_lipinski.csv")
molecules.head()
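
Because `to_csv` is called with its default `index=True`, the row index is written as an extra, unnamed first column. The sketch below shows what that means when the file is read back. It uses a toy table with made-up values and an in-memory buffer instead of the real file; the variable names are assumptions for illustration.

```python
import io

import pandas as pd

# Toy stand-in for the annotated table; values are made up for illustration.
toy = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"], "ro5_fulfilled": [True, False]})

# Write to an in-memory buffer with the default index=True.
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# A plain read keeps the saved index as a column named "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))
# Passing index_col=0 restores the first column as the index instead.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))
print(list(restored.columns))
```

Either reading with `index_col=0` or writing with `index=False` avoids the extra column; which is preferable depends on whether the row index carries meaning downstream.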