<b>[Author]</b> Nicolas Bosc
<br><b>[Year]</b> 2020

# Data extraction from ChEMBL
This notebook shows how to extract bioactivity data from the ChEMBL database and put them in a format suitable for model training. <br>
It uses the ChEMBL Python client library (`chembl_webresource_client`). <u>Therefore, it does not require a local installation of ChEMBL to run.</u>

To work, it only needs a protein name (COX-2 by default) or, alternatively, its ChEMBL identifier. If data are found, the relevant records can be exported to a CSV file (the export line is commented out by default) and are written to an annotated SDFile.

<b>Note</b>: there are several ways to achieve the same result and this notebook only shows one possibility. Further documentation and examples are available [here](https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services). For remarks and comments, please contact Nicolas Bosc <nbosc@ebi.ac.uk>.

In [1]:
# Tested with Python 3.7
# You can install the required packages if they are not already installed. Just uncomment the next three lines.
# Note: RDKit (imported in the next cell) is also required and is not installed by these lines.
# import sys
# !conda install --yes --prefix {sys.prefix} pandas ipywidgets
# !{sys.executable} -m pip install chembl-webresource-client

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client
from ipywidgets import interactive
from rdkit import Chem
from rdkit.Chem import PandasTools

##  Download activities for a given protein target

#### Step 1

#### You are looking for a target but you do not have its ChEMBL id (if you know the ChEMBL id, go to [step 2](#Step-2))

In [3]:
def f(protein):
    return protein
userArguments = interactive(f, protein='cox2')
userArguments


In [4]:
# create a target query
target = new_client.target
protein_name = userArguments.kwargs['protein']
# in this example we assume this is a 'single protein' present in the human species
response = target.filter(target_synonym__icontains=protein_name, organism='Homo sapiens', target_type='SINGLE PROTEIN')

pd.DataFrame(response)

Unnamed: 0,cross_references,organism,pref_name,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P35354', 'xref_name': None, 'xre...",Homo sapiens,Cyclooxygenase-2,False,CHEMBL230,"[{'accession': 'P35354', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'P00403', 'xref_name': None, 'xre...",Homo sapiens,Cytochrome c oxidase subunit 2,False,CHEMBL6174,"[{'accession': 'P00403', 'component_descriptio...",SINGLE PROTEIN,9606


From these results, it is clear that the protein of interest is the first one (Cyclooxygenase-2), whose ChEMBL id is <b>CHEMBL230</b>.
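When the search returns several candidates, the choice can also be made in code rather than by eye. The sketch below is only an illustration: `pick_target` is a hypothetical helper (not part of the ChEMBL client), and it assumes each record behaves like a dictionary with the `pref_name` and `target_chembl_id` keys shown in the table above. The two records are copied from that table.

```python
def pick_target(records, pref_name):
    """Return the ChEMBL id of the single record whose pref_name matches.

    Hypothetical helper: `records` is assumed to be an iterable of
    dict-like objects, as in the table of results above.
    """
    wanted = pref_name.lower()
    matches = [r for r in records if (r.get("pref_name") or "").lower() == wanted]
    if len(matches) != 1:
        # refuse to guess when the name is missing or ambiguous
        raise ValueError(f"expected exactly one match for {pref_name!r}, found {len(matches)}")
    return matches[0]["target_chembl_id"]


# the two hits returned for 'cox2' in this notebook
records = [
    {"pref_name": "Cyclooxygenase-2", "target_chembl_id": "CHEMBL230"},
    {"pref_name": "Cytochrome c oxidase subunit 2", "target_chembl_id": "CHEMBL6174"},
]
selected_id = pick_target(records, "Cyclooxygenase-2")
print(selected_id)
```

Raising an error on zero or several matches is a design choice: a synonym search such as 'cox2' can return unrelated proteins, so a silent default would be risky.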

#### Step 2

#### You are looking for a target and you have its ChEMBL id 

In [5]:
def f(chembl_id):
    return chembl_id
userArguments2 = interactive(f, chembl_id='CHEMBL230')
userArguments2


In [6]:
# Create an activity query
activities = new_client.activity
chembl_id = userArguments2.kwargs['chembl_id']
# Select only activities with a pchembl_value, i.e. -log10 of IC50, Ki, Kd, EC50, etc.
# We also use the ChEMBL flags to remove potential duplicates and records with a data validity comment
response = activities.filter(target_chembl_id=chembl_id, pchembl_value__isnull=False,
                             potential_duplicate=False, data_validity_comment__isnull=True)

# create a dataframe with the activity data
df_activities = pd.DataFrame(response)
assays = new_client.assay

# retrieve the assays in which these activities were measured
response = assays.filter(assay_chembl_id__in=list(df_activities.assay_chembl_id.unique()))

# create a dataframe with the assay data
df_assays = pd.DataFrame(response)

# keep only the assays where the link between the protein target and the assay is direct
df_assays = df_assays[df_assays.confidence_score==9]

df_activities = df_activities[df_activities.assay_chembl_id.isin(df_assays.assay_chembl_id)]

# print (df_activities)

# keep only the columns you need (copy so that later changes do not touch df_activities)
df_res = df_activities[['assay_description', 'molecule_chembl_id', 'molecule_pref_name', 'canonical_smiles',
                        'pchembl_value', 'standard_type', 'standard_relation', 'standard_value',
                        'standard_units', 'target_pref_name', 'target_organism']].copy()

# export the resulting data
# df_res.to_csv(f"{userArguments.kwargs['protein']}_chembl_data.csv", index=False)

#### Before using this dataset for training your model, you should check for any duplicate activities and decide what to do with them. Finally, you will have to describe the compounds using the features of your choice.
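One possible way to handle repeated measurements is to keep a single aggregated value per compound. The sketch below is an assumption, not a recommendation of this notebook: `aggregate_duplicates` is a hypothetical helper, the median is just one of several reasonable choices, and the compound ids in the toy table are placeholders. Values are converted with `pd.to_numeric` in case they arrive as strings.

```python
import pandas as pd


def aggregate_duplicates(df, id_col="molecule_chembl_id", value_col="pchembl_value"):
    """Collapse repeated measurements of a compound into one median value.

    Hypothetical helper; returns one row per compound with the median
    pchembl_value and the number of measurements behind it.
    """
    out = df.copy()
    # coerce to numbers; unparsable entries become NaN and are dropped
    out[value_col] = pd.to_numeric(out[value_col], errors="coerce")
    out = out.dropna(subset=[value_col])
    return out.groupby(id_col, as_index=False).agg(
        pchembl_median=(value_col, "median"),
        n_measurements=(value_col, "size"),
    )


# toy data with placeholder ids: CHEMBL_A was measured twice
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],
    "pchembl_value": ["8.0", "9.0", "6.5"],
})
deduplicated = aggregate_duplicates(toy)
print(deduplicated)
```

With the real data, the call would be `aggregate_duplicates(df_res)`. Keeping `n_measurements` makes it easy to inspect compounds whose values come from many assays before deciding whether the median is appropriate.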

In [7]:
df_res.head()

Unnamed: 0,assay_description,molecule_chembl_id,molecule_pref_name,canonical_smiles,pchembl_value,standard_type,standard_relation,standard_value,standard_units,target_pref_name,target_organism
11,Inhibition of PGE-2 production by arachidonic ...,CHEMBL91832,,CC1(C)C(=O)C(c2ccc(F)cc2)=C1c1ccc(S(C)(=O)=O)cc1,8.3,IC50,=,5.0,nM,Cyclooxygenase-2,Homo sapiens
12,Inhibition of PGE-2 production by arachidonic ...,CHEMBL91118,,C=C1CC(c2ccc(S(C)(=O)=O)cc2)=C1c1ccccc1,8.92,IC50,=,1.2,nM,Cyclooxygenase-2,Homo sapiens
13,Inhibition of PGE-2 production by arachidonic ...,CHEMBL92443,,C=C1C(c2ccccc2)=C(c2ccc(S(C)(=O)=O)cc2)C1(C)C,8.66,IC50,=,2.2,nM,Cyclooxygenase-2,Homo sapiens
20,Inhibition of PGE-2 production by arachidonic ...,CHEMBL328003,,CS(=O)(=O)c1ccc(C2=C(c3ccccc3)C(=O)C2)cc1,6.96,IC50,=,110.0,nM,Cyclooxygenase-2,Homo sapiens
23,Inhibition of PGE-2 production by arachidonic ...,CHEMBL330516,,CC1(C)C(c2ccc(S(C)(=O)=O)cc2)=C(c2ccccc2)/C1=N/O,7.21,IC50,=,61.0,nM,Cyclooxygenase-2,Homo sapiens
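
The `pchembl_value` column is the negative base-10 logarithm of the molar activity, so it can be checked against `standard_value` when `standard_units` is nM. The sketch below recomputes it for the five rows shown above; `pchembl_from_nm` is a hypothetical helper, and rounding to two decimals is an assumption made to match the displayed values.

```python
import math


def pchembl_from_nm(value_nm):
    """Return -log10 of a concentration given in nM, rounded to two decimals."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return round(9 - math.log10(value_nm), 2)


# standard_value (nM) and pchembl_value pairs taken from the rows above
rows = [(5.0, 8.3), (1.2, 8.92), (2.2, 8.66), (110.0, 6.96), (61.0, 7.21)]
for value_nm, reported in rows:
    print(value_nm, reported, pchembl_from_nm(value_nm))
```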


#### The function 'write_sdf' is in charge of dumping the result of our search into an annotated SDFile

In [10]:
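def write_sdf(data, smiles_column, id_column, output_name):
    """Write the rows of `data` to an annotated SDFile, one record per compound."""
    # build RDKit molecules from the SMILES; this adds a 'ROMol' column to data in place
    PandasTools.AddMoleculeColumnToFrame(data, smiles_column)

    # Uncomment the two lines below if a NoneType error appears when executing WriteSDF
    # no_mol = data[data['ROMol'].isna()]
    # data.drop(no_mol.index, axis=0, inplace=True)

    # Uncomment the line below to add explicit hydrogens
    # data.loc[:,'ROMol'] = [Chem.AddHs(x) for x in data.loc[:,'ROMol'].values.tolist()]

    # the remaining columns are stored as properties of each SDF record
    PandasTools.WriteSDF(data, output_name, molColName='ROMol', properties=list(data.columns), idName=id_column)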

## Write the SDFile

Adapted by Eric Marc and Manuel Pastor (UPF), 2021

#### Remove the rows of this table that contain compounds without a structure (where "canonical_smiles" is NA), then write the SDFile

In [11]:
# drop rows without a structure; work on a copy to avoid modifying a slice of df_activities in place
df_res = df_res.dropna(subset=['canonical_smiles']).copy()
write_sdf(df_res, 'canonical_smiles', 'molecule_chembl_id', 'chembl_data.sdf')