<a href="https://colab.research.google.com/github/nunososorio/bhs/blob/main/NSO_PracticalClass_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical Training: The Role of Databases in Drug Discovery

By **Nuno S. Osório** 🖋️

👋 Welcome to this tutorial! We will explore how to access and use drug-related databases, focusing on retrieving data from **ChEMBL** with its official Python client and then analyzing the retrieved data. This tutorial is designed to run in Jupyter notebook environments and includes exercises that involve running Python code. 🐍💻

You can open an interactive cloud version of this notebook with the Colab badge at the top of the page. Let's dive in! 🏊‍♂️


## Introduction

Databases are crucial in the **Target-to-Hit** and **Hit-to-Lead** steps of drug discovery. 🎯💊 They provide a wealth of information about potential drug targets and the compounds that could interact with them. Accessing and analyzing this data can help identify potential new drugs. 🧪🔬

These databases can be accessed via their respective websites. However, for reproducible and large-scale analysis, accessing them programmatically is more efficient. In this tutorial, we will guide you through how to do this. 🖥️📚

The **ChEMBL** database is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. 🧬💡

The `chembl_webresource_client` is the official Python client library for accessing ChEMBL data. 🐍📦


## Setup
First, we need to install the necessary Python package. Run the following command in your notebook environment:


In [1]:
!pip install chembl_webresource_client


Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.0-py3-none-any.whl (61 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-23.2.3-py3-none-any.whl (57 kB)

Now, import the library:

In [2]:
from chembl_webresource_client.new_client import new_client


## Explore

Let's start by learning which types of data we can retrieve from the ChEMBL database using `chembl_webresource_client`. You can list the available data entities using the following code:

In [5]:
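# List every attribute of the client, including Python internals (names starting with '_').
available_resources = dir(new_client)
# Uncomment the next line to hide the names that start with an underscore:
#available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]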

available_resources


['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'activity',
 'activity_supplementary_data_by_activity',
 'assay',
 'assay_class',
 'atc_class',
 'binding_site',
 'biotherapeutic',
 'cell_line',
 'chembl_id_lookup',
 'compound_record',
 'compound_structural_alert',
 'description',
 'document',
 'document_similarity',
 'drug',
 'drug_indication',
 'go_slim',
 'image',
 'mechanism',
 'metabolism',
 'molecule',
 'molecule_form',
 'official',
 'organism',
 'protein_classification',
 'similarity',
 'source',
 'substructure',
 'target',
 'target_component',
 'target_relation',
 'tissue',
 'xref_source']

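In Python, attributes that start with an underscore are typically used for internal purposes and are not meant to be accessed directly. The remaining names are the ones of interest here, mostly ChEMBL data entities such as `activity`, `molecule`, and `target`.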
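
The commented-out line in the cell above applies this convention. A minimal sketch of the same idea as a reusable helper follows; `public_attributes` and the `DemoClient` stand-in are hypothetical names introduced for illustration, so the snippet runs without a network connection.

```python
def public_attributes(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class DemoClient:
    """Hypothetical stand-in for new_client, used only for illustration."""
    molecule = "molecule endpoint"
    target = "target endpoint"
    _cache = {}


# Dunder names and '_cache' are dropped; only the public names remain.
print(public_attributes(DemoClient()))
```

In the notebook, the analogous call would be `public_attributes(new_client)`.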

Let's peek into the `molecule` data entity in the ChEMBL database:

In [12]:
new_client.molecule



As you can see, the output lists molecule records, each of which is a dictionary containing a wealth of information about a specific molecule in the ChEMBL database. The keys in each dictionary represent different attributes of the molecule, and the associated values hold the specific information for those attributes.

Here's a brief explanation of some of the keys in the dictionary:

- **'atc_classifications'**: The Anatomical Therapeutic Chemical (ATC) classification system codes for the molecule.
- **'availability_type'**: The availability type of the molecule.
- **'biotherapeutic'**: Information about the biotherapeutic properties of the molecule.
- **'black_box_warning'**: Indicates if there is a black box warning for the molecule.
- **'molecule_chembl_id'**: The ChEMBL ID of the molecule.
- **'molecule_hierarchy'**: The hierarchy of the molecule in the ChEMBL database.
- **'molecule_properties'**: Various properties of the molecule, such as its molecular weight, number of hydrogen bond acceptors and donors, etc.
- **'molecule_structures'**: The structures of the molecule in various formats, such as SMILES, InChI, and molfile.
- **'molecule_type'**: The type of the molecule (e.g., 'Small molecule').
- **'structure_type'**: The type of the structure (e.g., 'MOL').
- **'pref_name'**: Short for "preferred name", the name ChEMBL uses by default for the molecule. A value of None means that no preferred name has been assigned or is available for this molecule.
- **'molecule_synonyms'**: A list of synonyms, that is, other names that refer to the same molecule, such as names used in other databases, common names, or scientific names. An empty list means that no synonyms have been recorded for this molecule.
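
Because some of these values are themselves dictionaries, reading a single property takes two lookups. The sketch below shows one cautious way to do that. `summarize_molecule` is a hypothetical helper, the record is hand-written rather than fetched, and the nested key names `full_mwt` and `canonical_smiles` are assumptions to check against a real record.

```python
def summarize_molecule(record):
    """Pull a few commonly used fields out of one molecule record (a dict)."""
    # Nested entries may be None for some molecules, so fall back to empty dicts.
    properties = record.get("molecule_properties") or {}
    structures = record.get("molecule_structures") or {}
    return {
        "chembl_id": record.get("molecule_chembl_id"),
        "name": record.get("pref_name"),
        "molecular_weight": properties.get("full_mwt"),
        "smiles": structures.get("canonical_smiles"),
    }


# Hand-written, abbreviated record for illustration only (not fetched from ChEMBL).
example_record = {
    "molecule_chembl_id": "CHEMBL25",
    "pref_name": "ASPIRIN",
    "molecule_type": "Small molecule",
    "molecule_properties": {"full_mwt": "180.16"},
    "molecule_structures": {"canonical_smiles": "CC(=O)Oc1ccccc1C(=O)O"},
}

print(summarize_molecule(example_record))
```

Using `.get` with a fallback keeps the helper from raising an error when a record lacks properties or structures.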



If we want to find a molecule by preferred name we can use:

In [13]:
# Create a 'molecule' object that allows you to access the 'molecule' data entity in the ChEMBL database.
molecule = new_client.molecule

# Use the 'filter' method of the 'molecule' object to retrieve all molecules whose preferred name is exactly 'aspirin'.
# The 'iexact' lookup is used to perform case-insensitive exact match.
mols = molecule.filter(pref_name__iexact='aspirin')

# 'mols' now holds the matching molecules; the client fetches them lazily when the object is displayed or iterated.
mols
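
The double-underscore suffix in `pref_name__iexact` names the comparison to apply. To make the meaning of a case-insensitive exact match concrete, here is a local sketch that mimics it on a hand-written list. `filter_iexact` is a hypothetical helper; it does not call ChEMBL and is not the client's implementation.

```python
def filter_iexact(records, field, value):
    """Keep records whose `field` equals `value`, ignoring letter case."""
    wanted = value.casefold()
    return [
        record for record in records
        if isinstance(record.get(field), str) and record[field].casefold() == wanted
    ]


# Hand-written names for illustration only (not fetched from ChEMBL).
records = [
    {"pref_name": "ASPIRIN"},
    {"pref_name": "Aspirin"},
    {"pref_name": "ASPIRIN-LIKE COMPOUND"},
    {"pref_name": None},
]

matches = filter_iexact(records, "pref_name", "aspirin")
print([record["pref_name"] for record in matches])
```

Only the first two entries match. The third contains the search term but is not an exact match, and the record without a name is skipped.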





Not easy to read... Don't worry, you can convert the `mols` object into a pandas DataFrame for easier reading and manipulation. Here's how you can do it:

In [None]:
import pandas as pd
mols_df = pd.DataFrame.from_records(mols)
mols_df
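
Some columns of `mols_df`, such as `molecule_properties`, hold a nested dictionary in a single cell. One possible way to spread these into separate columns is `pd.json_normalize`. The sketch below uses a hand-written stand-in for the records so it runs offline, and its nested key names are assumptions to check against real output.

```python
import pandas as pd

# Hand-written, abbreviated records for illustration only (not fetched from ChEMBL).
records = [
    {
        "molecule_chembl_id": "CHEMBL25",
        "pref_name": "ASPIRIN",
        "molecule_properties": {"full_mwt": "180.16"},
        "molecule_structures": {"canonical_smiles": "CC(=O)Oc1ccccc1C(=O)O"},
    },
]

# json_normalize flattens nested dicts into dotted column names,
# e.g. 'molecule_properties.full_mwt'.
flat_df = pd.json_normalize(records)
print(sorted(flat_df.columns))
```

In the notebook, the analogous call should be `pd.json_normalize(list(mols))`, which gives each nested property its own column for sorting and filtering.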
