<b>[Author]</b> Nicolas Bosc
<br><b>[Year]</b> 2022

# Data extraction from ChEMBL
This notebook shows how to extract bioactivity data from the ChEMBL database and put them in a format suitable for model training. <br>
It uses the ChEMBL Python client library (`chembl_webresource_client`). <u>Therefore, it does not require a local installation of ChEMBL to run.</u>

To work, it only needs a protein name (by default COX-2) or, alternatively, its ChEMBL identifier. If data are found, it writes an SD file (`chembl_data.sdf`) with the relevant data.

<b>Note</b>: there are several ways to achieve the same result and this notebook shows only one of them. Further documentation and examples are available [here](https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services). For remarks and comments, please contact Nicolas Bosc <nbosc@ebi.ac.uk>.

In [1]:
# Tested with Python 3.7
# You can install the required packages if they are not already installed. Just uncomment the next three lines.
# import sys
# !conda install --yes --prefix {sys.prefix} pandas ipywidgets
# !{sys.executable} -m pip install chembl-webresource-client rdkit-pypi

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client
import ipywidgets as w
#from IPython.display import display, Javascript
from rdkit.Chem import PandasTools

def find_target_in_chembl(widget_args, species='Homo sapiens'):
    protein_name = widget_args.kwargs['protein']
    # create a target query
    target = new_client.target
    # assume this is a 'single protein' present in the user-defined species
    response = target.filter(target_synonym__icontains=protein_name, organism=species, target_type='SINGLE PROTEIN')
    df_res = pd.DataFrame(response)
    return(df_res[['pref_name','target_chembl_id','organism']])

def find_activity_data(target_selection, argument):
    '''
    input: target selected from the list, or ChEMBL ID entered in the field
    Look for all the bioactivities in ChEMBL for this target. Restricted to data with pchembl values (-log10 of the molar IC50, Ki, Kd, EC50...)
    Apply several sanity filters to keep only high-confidence data
    output: dataframe with all the activities that pass the checks
    '''
    
    if argument.kwargs['chembl_id'] == '':
        target_id = target_selection.value
    else:
        target_id = argument.kwargs['chembl_id']
    
    # Create an activity query
    activities = new_client.activity

    # Select only activities with a pchembl_value (-log(IC50, Ki, Kd, EC50...)).
    # We also use the ChEMBL flags to remove the duplicates and the records where there is a validity comment
    response = activities.filter(target_chembl_id=target_id, pchembl_value__isnull=False,
                                 potential_duplicate=False, data_validity_comment__isnull=True)

    # create a dataframe with the activity data
    df_activities = pd.DataFrame(response)

    # create an assay query
    assays = new_client.assay
    # select assays.
    response = assays.filter(assay_chembl_id__in=list(df_activities.assay_chembl_id.unique()))

    # create a dataframe with the assay data
    df_assays = pd.DataFrame(response)

    # keep only the assays where the link between the protein target and the assay is direct
    df_assays = df_assays[df_assays.confidence_score==9]

    df_activities = df_activities[df_activities.assay_chembl_id.isin(df_assays.assay_chembl_id)]
    df_activities = df_activities.astype({'pchembl_value':float, 'standard_value':float})

    # keep only the columns you need
    df_res = df_activities[['assay_chembl_id','assay_description','parent_molecule_chembl_id','molecule_chembl_id','canonical_smiles','pchembl_value',\
                   'standard_type','standard_relation','standard_value','standard_units','target_pref_name',\
                   'target_chembl_id', 'target_organism']]
    print(f'{df_res.shape[0]} data points were found')
    return df_res

def remove_duplicates(df, do):
    '''
    if do is True, merge duplicated data points:
    based on all the values available for a given compound on a given target,
    if the standard deviation does not exceed 1, keep one row with the median value (pchembl_median)
    else discard the compound
    '''
    if do:
        grouped = df.groupby('parent_molecule_chembl_id')['pchembl_value']
        # transform() keeps the row order, so each row gets the median of its own compound
        df_res = df.assign(pchembl_median=grouped.transform('median'))
        # std is NaN for a compound with a single value, so such compounds are kept
        df_res = df_res[~(grouped.transform('std') > 1)]
        df_res = df_res.drop_duplicates(['parent_molecule_chembl_id']).drop('pchembl_value', axis=1)
        return df_res
    else:
        return df

def assay_summary(df):
    # count the data points whose assay description mentions each keyword (case-insensitive);
    # na=False treats a missing description as "no match" instead of raising an error
    keywords = ['affinity', 'displacement', 'inhibition']
    counts = [int(df.assay_description.str.contains(k, case=False, na=False).sum()) for k in keywords]
    return pd.DataFrame({'assay_type': keywords, 'data': counts})

def write_sdf(data, smiles_column, id_column, output_name):
    PandasTools.AddMoleculeColumnToFrame(data, smiles_column)

    # Uncomment the two lines below if a NoneType error appears when executing WriteSDF
    #     no_mol = data[data['ROMol'].isna()]
    #     data.drop(no_mol.index, axis=0, inplace=True)

    # add H (this line also needs `from rdkit import Chem`)
    # data.loc[:,'ROMol'] = [Chem.AddHs(x) for x in data.loc[:,'ROMol'].values.tolist()]

    PandasTools.WriteSDF(data, output_name, molColName='ROMol', properties=list(data.columns), idName=id_column)
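
The docstring of `find_activity_data` describes the `pchembl_value` as the negative logarithm of a potency measurement (IC50, Ki, Kd, EC50...). As a small worked example of that definition, the sketch below converts a value given in nM into the corresponding -log10 of the molar concentration. The helper name `pchembl_from_nm` and the input values are illustrative assumptions, not part of the ChEMBL client.

```python
import math

def pchembl_from_nm(value_nm):
    """Hypothetical helper: -log10 of a concentration in mol/L, given in nM.

    1 nM = 1e-9 mol/L, hence -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    """
    if value_nm <= 0:
        raise ValueError('the concentration must be positive')
    return 9 - math.log10(value_nm)

# made-up example values: 10 nM and 1 uM (1000 nM)
for nm in (10.0, 1000.0):
    print(f'{nm} nM -> {pchembl_from_nm(nm):.2f}')
```

A tenfold drop in potency therefore lowers the value by one unit, which is why the duplicate filter further down compares standard deviations on this log scale.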

##  Download activities for a given protein target

### Step 1: Looking for a target without having its ChEMBL id (if the ChEMBL id is known, go to [step 2](#Step-2))

In [5]:
def f(protein):
    return protein
target_argument = w.interactive(f, protein='')
target_argument

interactive(children=(Text(value='', description='protein'), Output()), _dom_classes=('widget-interact',))

In [26]:
targets = find_target_in_chembl(target_argument, species='Homo sapiens')
targets

| | pref_name | target_chembl_id | organism |
|---|---|---|---|
| 0 | Serotonin 2b (5-HT2b) receptor | CHEMBL1833 | Homo sapiens |


In [27]:
target_selection = w.Select(
    options=[val for val in zip(targets['pref_name'],targets['target_chembl_id'])],
    description='Targets',
    disabled=False
)
print('Select the protein of interest from the list below')
target_selection

Select the protein of interest from the list below


Select(description='Targets', options=(('Serotonin 2b (5-HT2b) receptor', 'CHEMBL1833'),), value='CHEMBL1833')

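### Step 2:  Looking for a target by ChEMBL id (written CHEMBL1234)
If you followed step 1, still run the cell below but leave the field empty: step 3 uses the ChEMBL id typed here when there is one, and falls back to the step 1 selection otherwise.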

In [8]:
def f(chembl_id):
    return chembl_id
chemblid_argument = w.interactive(f, chembl_id='')
chemblid_argument

interactive(children=(Text(value='', description='chembl_id'), Output()), _dom_classes=('widget-interact',))
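
The precedence between the two inputs can be sketched without widgets. The helper below is a hypothetical illustration (`resolve_target_id` does not exist in the notebook): it mirrors the rule used in `find_activity_data`, where a typed ChEMBL id overrides the selection from step 1. The whitespace trimming, upper-casing and format check are extra assumptions added here, not something the notebook does.

```python
import re

def resolve_target_id(selected_id, typed_id):
    """Hypothetical helper: a typed ChEMBL id wins over the selected one."""
    typed_id = typed_id.strip().upper()      # assumption: tolerate spaces and lower case
    if typed_id == '':
        return selected_id                   # nothing typed: use the step 1 selection
    if not re.fullmatch(r'CHEMBL\d+', typed_id):
        raise ValueError(f'not a ChEMBL identifier: {typed_id!r}')
    return typed_id

print(resolve_target_id('CHEMBL1833', ''))              # falls back to the selection
print(resolve_target_id('CHEMBL1833', ' chembl1234 '))  # the typed id takes precedence
```

Checking the format before querying is a design choice: it turns a typo into an immediate error rather than a request that returns no data.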

### Step 3: Retrieve the activity data in ChEMBL 

In [32]:
df_activities = find_activity_data(target_selection, chemblid_argument)

HttpApplicationError: Error for url https://www.ebi.ac.uk/chembl/api/data/activity.json, server response: "Something has gone wrong with our web server. Our web server says this is a 500 internal server error: the request cannot be carried out by the server. This problem means that the service you are trying to access is currently unavailable."

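The traceback above reports a 500 internal server error, which is a failure on the server side, so running the cell again later may be enough. A minimal sketch of an automatic retry with exponential backoff follows. `with_retries` is a hypothetical helper, not part of `chembl_webresource_client`, and catching every `Exception` is a simplifying assumption that can be narrowed to the client's own error class.

```python
import time

def with_retries(func, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**i seconds and try again.

    The last failure is re-raised so that the original traceback is not lost.
    """
    for i in range(attempts):
        try:
            return func()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)

# Offline demonstration with a fake call that fails twice before succeeding
state = {'calls': 0}

def flaky_query():
    state['calls'] += 1
    if state['calls'] < 3:
        raise RuntimeError('500 internal server error')
    return 'data'

result = with_retries(flaky_query, sleep=lambda seconds: None)  # no real waiting in the demo
print(result, 'after', state['calls'], 'calls')
```

In the notebook, the call could then be wrapped as `with_retries(lambda: find_activity_data(target_selection, chemblid_argument))`.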

#### Data endpoints available 

In [10]:
# name the counts column explicitly; the default column name differs between pandas versions
df_endpoints = df_activities.standard_type.value_counts().rename('data points').to_frame()
df_endpoints

| | data points |
|---|---|
| IC50 | 2401 |
| Ki | 16 |
| Kd | 7 |
| EC50 | 1 |


### Step 4: Based on the data retrieved, select the activity endpoint to use in the model.
Multiple values can be selected with <kbd>shift</kbd> and/or <kbd>ctrl</kbd> (or <kbd>command</kbd>) pressed and mouse clicks or arrow keys.

In [11]:
endpoint_selection = w.SelectMultiple(
    options=df_endpoints.index,
    description='Endpoints',
    disabled=False
)
endpoint_selection

SelectMultiple(description='Endpoints', options=('IC50', 'Ki', 'Kd', 'EC50'), value=())

In [12]:
df_activities = df_activities[df_activities.standard_type.isin(endpoint_selection.value)]
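
One thing to keep in mind with the filter above: if no endpoint is selected in the widget, `endpoint_selection.value` is an empty tuple and `isin` keeps no row at all. The sketch below reproduces this on a toy dataframe with made-up values. `keep_endpoints` is a hypothetical helper that fails loudly instead of returning an empty table.

```python
import pandas as pd

# toy data, made up for the illustration
toy = pd.DataFrame({
    'standard_type': ['IC50', 'Ki', 'IC50', 'Kd'],
    'pchembl_value': [6.1, 7.3, 5.8, 6.9],
})

def keep_endpoints(df, selected):
    """Hypothetical helper: same isin filter, but reject an empty selection."""
    if len(selected) == 0:
        raise ValueError('select at least one endpoint')
    return df[df.standard_type.isin(selected)]

subset = keep_endpoints(toy, ('IC50', 'Ki'))
print(len(subset), 'rows kept')

# with an empty selection, the plain filter silently returns an empty dataframe
print(toy[toy.standard_type.isin(())].empty)
```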

### Step 5: Should the duplicated values be removed?

In [13]:
duplicate_selection = w.Select(
    options=[('Yes',True),('No',False)],
#     value=['CHEMBL1862'],
    description='remove duplicates?',
    disabled=False,
    style= {'description_width': 'initial'}
)
duplicate_selection

Select(description='remove duplicates?', options=(('Yes', True), ('No', False)), style=DescriptionStyle(descri…

In [14]:
df_activities = remove_duplicates(df_activities, duplicate_selection.value)
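
To make the duplicate rule concrete, here is a small sketch on made-up data. Compounds `A`, `B` and `C` are placeholders, not ChEMBL identifiers. For each compound it computes the number of measurements, their standard deviation and their median, then discards the compounds whose standard deviation exceeds 1, which is the criterion described in `remove_duplicates`.

```python
import pandas as pd

# made-up measurements: A is consistent, B is not, C has a single value
toy = pd.DataFrame({
    'parent_molecule_chembl_id': ['A', 'A', 'B', 'B', 'C'],
    'pchembl_value': [6.0, 6.4, 5.0, 8.0, 7.1],
})

stats = (toy.groupby('parent_molecule_chembl_id')['pchembl_value']
            .agg(['count', 'std', 'median']))

# the std of a single value is NaN, and NaN > 1 is False, so C is kept
kept = stats[~(stats['std'] > 1)]
print(kept)
```

Since the values are on a log scale, a standard deviation of 1 corresponds to measurements that differ by roughly one order of magnitude.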

### Step 6: Select the type of assays
By default, assays are divided into 3 categories depending on whether their description contains certain words:
- affinity assay
- displacement assay
- inhibition assay

In [15]:
df_assays = assay_summary(df_activities)
df_assays

| | assay_type | data |
|---|---|---|
| 0 | affinity | 38 |
| 1 | displacement | 0 |
| 2 | inhibition | 2269 |


In [16]:
assay_selection = w.SelectMultiple(
    options=df_assays.assay_type,
    description='Assay types',
    disabled=False
)
assay_selection

SelectMultiple(description='Assay types', options=('affinity', 'displacement', 'inhibition'), value=())

In [17]:
df_activities = df_activities[df_activities.assay_description.str.contains('|'.join(assay_selection.value), case=False, na=False)]

In [18]:
df_activities.shape

(38, 13)
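
The assay filter joins the selected keywords with `|` and uses the result as a case-insensitive regular expression. The sketch below shows the same matching with the standard `re` module on made-up descriptions. It also shows a side effect worth knowing: with no keyword selected, the pattern is the empty string, which matches every description, so nothing is filtered out.

```python
import re

# made-up assay descriptions
descriptions = [
    'Binding affinity to the target',
    'Inhibition of the recombinant enzyme',
    'Displacement of a radioligand',
]

selected = ('affinity', 'inhibition')
pattern = re.compile('|'.join(selected), flags=re.IGNORECASE)
matches = [d for d in descriptions if pattern.search(d)]
print(matches)

# empty selection: '|'.join(()) == '' and the empty pattern matches everything
match_all = [d for d in descriptions if re.search('|'.join(()), d)]
print(len(match_all))
```

Because the match is a plain substring search, a description that contains several keywords is counted in several categories of the summary table.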

### Step 7: Export data for Flame 

Adapted by Eric Marc and Manuel Pastor (UPF), 2021
<br>Remove from the table all the rows for compounds without a structure (`canonical_smiles` is NA), then write the SD file.

In [24]:
# drop the rows without SMILES; copy() so that write_sdf adds its molecule column to an independent dataframe
df_activities = df_activities.dropna(subset=['canonical_smiles']).copy()
write_sdf(df_activities, 'canonical_smiles', 'molecule_chembl_id', 'chembl_data.sdf')