# Ligand-based screening pipeline 

Getting Started   


## Task 1: Data set assembly

**Query [ChEMBL](https://www.ebi.ac.uk/chembl/)** 

You can either query the database for EGFR compounds manually or do this programmatically.

Since using the API takes quite a few steps, we will walk you through it using the TeachOpenCADD material ([T001: Compound data acquisition (ChEMBL)](https://projects.volkamerlab.org/teachopencadd/talktorials/T001_query_chembl.html)). This could become part of your pipeline in the future.

Additional material
* [T011 - Querying online api webservices](https://projects.volkamerlab.org/teachopencadd/talktorials/T011_query_online_api_webservices.html)
* [T013 - Query PubChem](https://projects.volkamerlab.org/teachopencadd/talktorials/T013_query_pubchem.html)

## Task 2: Some basic molecular data set computations 

#### Import all necessary libraries

In [None]:
# data handling
import numpy as np
import pandas as pd

# chemistry
from rdkit import Chem
from rdkit.Chem import Descriptors, Draw, PandasTools, DataStructs, rdFMCS
from rdkit.Chem.AllChem import GetMorganFingerprintAsBitVect
from rdkit.Chem.Draw import IPythonConsole, rdMolDraw2D
from rdkit.ML.Cluster import Butina

### 2.1: Read and prepare your data set

#### 2.1.1. Read the input data 

The data are stored in a CSV file (`./data/EGFR_compounds_ChEMBL27.csv`), which we read into a dataframe.

In [None]:
df=pd.read_csv('./data/EGFR_compounds_ChEMBL27.csv')
# df.drop(['IC50','units','Unnamed: 0'], axis=1, inplace=True)
df.head()

#### 2.1.2. Generate molecules and calculate fingerprints

##### We can do that stepwise in lists ...
* Generate molecules ...

In [None]:
mols = []
for entry in df['smiles']:
    mols.append(Chem.MolFromSmiles(entry))

* Calculate fingerprints ...

In [None]:
fps = []
for mol in mols:
    fps.append(GetMorganFingerprintAsBitVect(mol,2))

* and draw them ...

In [None]:
# add names for legend
names= []
for name in df["molecule_chembl_id"]:
    names.append(name)

In [None]:
Draw.MolsToGridImage(mols[:6], legends=names[:6])

##### And we can do the same directly on the dataframe
* Add molecule column to dataframe

In [None]:
PandasTools.AddMoleculeColumnToFrame(df,'smiles','molecule',includeFingerprints=True)

* Generate Morgan fingerprints for the compounds and add this column to the dataframe

In [None]:
df['morgan'] = df['molecule'].map(lambda x:GetMorganFingerprintAsBitVect(x,2))

* Draw the molecules

In [None]:
PandasTools.FrameToGridImage(df.head(6), column='molecule', legendsCol='molecule_chembl_id')

#### 2.1.3 Calculate other information on our dataset

In the same manner, we can compute other information on our dataset (this also exemplifies some pandas functionality).

* Add a column to the dataframe indicating the activity of the compounds.

    Set the activity cutoff to pIC50 = 6.3 (which corresponds to roughly 500 nM); the higher the pIC50 value, the more active the compound. Use the values 1 and 0 for active and inactive compounds, respectively.

In [None]:
# 1 = active (pIC50 above the cutoff), 0 = inactive
df['active'] = (df['pIC50'] > 6.3).astype(int)
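
To see why a pIC50 of 6.3 corresponds to roughly 500 nM, the conversion can be worked through in a few lines. This is a minimal sketch: the helper names `ic50_nM_to_pIC50` and `pIC50_to_ic50_nM` are made up for illustration and are not part of the dataset or of any library. It assumes the IC50 is given in nM, so that pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM).

```python
import math


def ic50_nM_to_pIC50(ic50_nM):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    # 1 nM = 1e-9 mol/L, hence the offset of 9
    return 9 - math.log10(ic50_nM)


def pIC50_to_ic50_nM(pIC50):
    """Convert a pIC50 value back to an IC50 in nM."""
    return 10 ** (9 - pIC50)


print(round(ic50_nM_to_pIC50(500), 2))  # 6.3
print(round(pIC50_to_ic50_nM(6.3)))     # 501
```

A lower IC50 gives a higher pIC50, which is why compounds above the cutoff are labeled active.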

* Add other molecular descriptors

In [None]:
# Note: ExactMolWt is the monoisotopic mass; Descriptors.MolWt gives the average molecular weight
df["molWt"] = df["molecule"].apply(Descriptors.ExactMolWt)
df.head()

* And we can also directly plot the data

In [None]:
df["molWt"].hist()

* Or we can filter the data ... and get some statistics on the values

In [None]:
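# Copy the filtered frame so that columns added later do not trigger a SettingWithCopyWarning
df = df[df['molWt'] < 900].copy()
df.describe()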

## Task 3: Similarity search

Calculate the similarity between the compounds and a selected known inhibitor (e.g. Gefitinib) using a **circular fingerprint** and the **Tanimoto similarity** metric. 

Helpful talktorial: [T004-Compound similarity](https://projects.volkamerlab.org/teachopencadd/talktorials/T004_compound_similarity.html)
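
As a reminder of what the Tanimoto similarity measures, here is a minimal pure-Python sketch that treats a fingerprint as the set of its on-bit indices. The function name `tanimoto` and the toy bit sets are illustrative assumptions; in the notebook itself the RDKit function `DataStructs.TanimotoSimilarity` does this work on bit vectors.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch only: two empty fingerprints score 0
        return 0.0
    # Shared on-bits divided by all on-bits seen in either fingerprint
    return len(a & b) / union


fp_query = {1, 4, 7, 9}
fp_other = {1, 4, 8}
print(tanimoto(fp_query, fp_other))  # 2 shared bits / 5 bits in total = 0.4
```

The value ranges from 0 (no shared bits) to 1 (identical bit sets).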

### 3.1. Select query compound

FDA-approved EGFR inhibitor Gefitinib: `COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1`

In [None]:
gefitinib = Chem.MolFromSmiles("COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1")

### 3.2. Let's define a function that can do the job ...

In [None]:
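def get_dataframe_with_x_most_similar_compounds_to_query(query, mol_df, molCol='molecule', x=10):
    """Return the x compounds in mol_df most similar to the query (uses the 'morgan' column)."""
    # Note: molCol is currently unused; similarities are computed from the precomputed fingerprints
    query_fp = GetMorganFingerprintAsBitVect(query, 2)
    # Work on a copy so that the caller's dataframe is neither modified nor reordered
    result = mol_df.copy()
    result['similarity'] = result['morgan'].map(lambda fp: DataStructs.TanimotoSimilarity(query_fp, fp))
    result = result.sort_values('similarity', ascending=False)
    return result.head(x)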

### 3.3. Now we call the function ...

In [None]:
sim_df = get_dataframe_with_x_most_similar_compounds_to_query(gefitinib, df)

In [None]:
PandasTools.FrameToGridImage(sim_df.head(6), column='molecule', legendsCol='similarity')

## Task 4: Cluster compounds

* Cluster your compounds using Butina clustering with Tanimoto **dis**similarity as the distance metric.
    * Calculate distance matrix
    * Do Butina clustering

Helpful talktorial: [T005-Compound clustering](https://projects.volkamerlab.org/teachopencadd/talktorials/T005_compound_clustering.html) 

Define a function to calculate the distance matrix

In [None]:
def tanimoto_distance_matrix(fp_list):
    """Calculate distance matrix for fingerprint list"""
    dissimilarity_matrix = []
    # Start at index 1: the first fingerprint has no earlier fingerprints to compare against,
    # and no fingerprint is ever compared with itself
    for i in range(1, len(fp_list)):
        # Compare the current fingerprint against all the previous ones in the list
        similarities = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
        # Since we need a distance matrix, calculate 1-x for every element in similarity matrix
        dissimilarity_matrix.extend([1 - x for x in similarities])
    return dissimilarity_matrix
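
The function above returns a flat list rather than a square matrix: it holds only the lower triangle, row by row. The following sketch shows how a pair of compound indices maps to a position in that list. The helper name `condensed_index` is hypothetical and only serves to illustrate the layout.

```python
def condensed_index(i, j):
    """Position of the pair (i, j) in the flat lower-triangle distance list."""
    if i == j:
        raise ValueError("The diagonal (distance to itself) is not stored")
    if i < j:
        i, j = j, i  # distances are symmetric, so order the pair as i > j
    # Rows 1 .. i-1 contribute 1 + 2 + ... + (i-1) = i*(i-1)/2 entries before row i
    return i * (i - 1) // 2 + j


n = 4
# Same iteration order as in tanimoto_distance_matrix
pairs = [(i, j) for i in range(1, n) for j in range(i)]
print(pairs)                                   # [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
print([condensed_index(i, j) for i, j in pairs])  # [0, 1, 2, 3, 4, 5]
```

For n fingerprints the list therefore has n*(n-1)/2 entries, which is why the number of points has to be passed to the clustering function separately.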

Prepare data for clustering

In [None]:
# Fingerprints as lists
fingerprints=df["morgan"].to_list()
# Calculate distance matrix
dist_matrix = tanimoto_distance_matrix(fingerprints)
# Define distance cut-off
cutoff=0.7

Cluster the data with the implemented Butina algorithm

In [None]:
clusters = Butina.ClusterData(dist_matrix, len(fingerprints), cutoff, isDistData=True)
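
To make the idea behind the clustering step more tangible, here is a simplified pure-Python sketch of a Butina-style procedure operating on the same flat distance list. It is an illustration under stated assumptions, not RDKit's implementation: the function name `butina_sketch` and the toy distances are invented, and details such as tie-breaking and the ordering of cluster members may differ from `Butina.ClusterData`.

```python
def butina_sketch(dists, n_points, cutoff):
    """Simplified Butina-style clustering on a flat lower-triangle distance list."""
    # 1) Build neighbor lists: two points are neighbors if their distance is within the cutoff
    neighbors = [[] for _ in range(n_points)]
    k = 0
    for i in range(1, n_points):
        for j in range(i):
            if dists[k] <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
            k += 1

    # 2) Visit candidate centroids with the most neighbors first
    order = sorted(range(n_points), key=lambda p: len(neighbors[p]), reverse=True)

    # 3) Each unassigned candidate becomes a centroid and claims its unassigned neighbors
    assigned = [False] * n_points
    clusters = []
    for center in order:
        if assigned[center]:
            continue
        assigned[center] = True
        members = [center]
        for nbr in neighbors[center]:
            if not assigned[nbr]:
                assigned[nbr] = True
                members.append(nbr)
        clusters.append(tuple(members))
    return clusters


# Toy example: points 0, 1 and 2 are close to each other, point 3 is far from all of them
toy_dists = [
    0.1,             # (1, 0)
    0.2, 0.15,       # (2, 0), (2, 1)
    0.9, 0.8, 0.95,  # (3, 0), (3, 1), (3, 2)
]
print(butina_sketch(toy_dists, 4, cutoff=0.3))  # [(0, 1, 2), (3,)]
```

The first element of each tuple is the cluster centroid. A larger distance cutoff merges more compounds into fewer, looser clusters; a smaller one yields more singletons.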

Sort the clusters by size

In [None]:
clusters = sorted(clusters, key=len, reverse=True)

In [None]:
# Molecules as list
mols = df["molecule"].to_list()

In [None]:
# Draw molecules
Draw.MolsToGridImage(
    [mols[i] for i in clusters[0][:10]],
    molsPerRow=5,
)

## Task 5: Maximum common substructure

[*if time allows*]

Identify and highlight the maximum common substructure (MCS) within a cluster.

Helpful talktorial: [T006-Maximum common substructures](https://projects.volkamerlab.org/teachopencadd/talktorials/T006_compound_maximum_common_substructures.html)

* Get molecules from the first cluster

In [None]:
subset = [mols[i] for i in clusters[0][:10]]

* Find MCS within subset

In [None]:
mcs = rdFMCS.FindMCS(subset)

In [None]:
print("MCS SMARTS string:", mcs.smartsString)

* Draw the substructure from the SMARTS string

In [None]:
mol_pattern = Chem.MolFromSmarts(mcs.smartsString)
Draw.MolToImage(mol_pattern, legend="MCS")