<h1>Dataset Collection from ChEMBL</h1>

<hr>

You can import the required Python libraries with the following commands:

In [1]:
import pandas as pd
from chembl_webresource_client.new_client import new_client # to download a molecule dataset from ChEMBL

We will collect chemical data from the ChEMBL database using [the official Python ChEMBL webresource client library](https://github.com/chembl/chembl_webresource_client), which is supported by the ChEMBL group. The design of the client is based on the Django QuerySet interface. Follow the link to the [ChEMBL database schema](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/) to understand what information you need to download.

To gather a dataset of small molecules, you first need to download the ChEMBL IDs of the protein targets of interest. Together with the protein IDs, we will collect the target organism and the target name.

In [2]:
# gather the ChEMBL ID of the protein target of interest from ChEMBL
def download_targetid_chembl(target_name):
    target = new_client.target

    # case-insensitive exact match on the preferred target name
    cyp = target.filter(pref_name__iexact=target_name).only(
        ['target_chembl_id', 'target_organism', 'target_pref_name'])

    return cyp

The download_mols_chembl() function takes a target record and uses its ChEMBL ID to return the activity records whose standard type is 'IC50', 'EC50', 'Ki', or 'Kd' and whose standard units are 'mM', 'nm', 'nM', or 'uM'.

In [3]:
def download_mols_chembl(t):
    activity = new_client.activity
    
    # here we specify by which parameters we will filter compounds
    mols = activity.filter(target_chembl_id=t['target_chembl_id'], 
                           standard_type__in=['IC50', 'EC50', 'Ki', 'Kd'],
                           standard_units__in=['mM', 'nm', 'nM', 'uM']
                          ).only(
    
    # list of columns which we will extract
    [
        'molecule_chembl_id', 
        'standard_type', 'standard_relation', 'standard_value', 'standard_units', 
        'canonical_smiles',
        'assay_description', 
        'document_journal', 'document_year'
    ])
    
    return mols

It is convenient to work with data in a dataframe format, so we need to convert the data downloaded from ChEMBL into a dataframe. To work with the data we will use pandas. <b>Pandas</b> is an easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.

The data can be converted to a pandas dataframe with the following command
<i>df = pd.DataFrame(data=mols)</i>
, but this takes a long time.
Therefore, we split the data into parts, convert each part, and combine the parts into one dataframe.

queryset2df() converts input data from the QuerySet format to a pandas dataframe.

In [4]:
def queryset2df(mols, chunk_size=10000):
    # materialise the QuerySet once, then convert it chunk by chunk
    mols = list(mols)
    chunks = [pd.DataFrame(mols[i:i + chunk_size])
              for i in range(0, len(mols), chunk_size)]

    # an empty input gives an empty dataframe
    if not chunks:
        return pd.DataFrame()

    # combine all chunks in a single concatenation
    return pd.concat(chunks, sort=False)
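
Because each part is converted separately, the row labels restart at 0 in every part, so the combined dataframe can contain repeated index values. The sketch below illustrates this with made-up records (the `fake_records` list and the chunk size of 2 are illustrative assumptions, not ChEMBL data) and shows one possible way to obtain a continuous index with `reset_index(drop=True)`.

```python
import pandas as pd

# Made-up records standing in for ChEMBL activity dictionaries (assumption)
fake_records = [{'molecule_chembl_id': f'FAKE{k}', 'standard_value': str(k)}
                for k in range(5)]

chunk_size = 2  # tiny chunk size, for illustration only
chunks = [pd.DataFrame(fake_records[i:i + chunk_size])
          for i in range(0, len(fake_records), chunk_size)]

# row labels restart in every chunk, so they repeat after concatenation
combined = pd.concat(chunks, sort=False)
print(list(combined.index))

# drop the old labels and number the rows continuously
clean = combined.reset_index(drop=True)
print(list(clean.index))
```

Whether to reset the index is a design choice: it matters mostly when rows are later selected by label or when the index is written to a file.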

<h2>Extracting compounds with documented activities against Cytochrome P450 3A4 from ChEMBL</h2>

<hr>

In [5]:
# We will extract ChEMBL target id by the target-protein name using download_targetid_chembl()
trg_cyp = download_targetid_chembl('Cytochrome P450 3A4')

In [6]:
# print the results
for el in trg_cyp:
    print(el)

{'organism': 'Homo sapiens', 'pref_name': 'Cytochrome P450 3A4', 'target_chembl_id': 'CHEMBL340'}


We extracted one protein target ID for Cytochrome P450 3A4 from ChEMBL. This protein belongs to the organism <i>Homo sapiens</i>.

In [7]:
# download compounds with known activities against CYP 3A4
mols_cyp = download_mols_chembl(trg_cyp[0])

In [8]:
print(len(mols_cyp), 'compounds with documented activities to CYP 3A4')

10290 compounds with documented activities to CYP 3A4
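
Note that `len(mols_cyp)` counts the records returned by the activity query, and the same molecule may appear in more than one record (for example, when it was measured in several assays). A minimal sketch of counting distinct molecules, using made-up records in place of the real QuerySet (the `FAKE…` IDs are illustrative assumptions):

```python
# Made-up activity records; real ones would come from download_mols_chembl()
records = [
    {'molecule_chembl_id': 'FAKE1', 'standard_type': 'IC50'},
    {'molecule_chembl_id': 'FAKE1', 'standard_type': 'Ki'},
    {'molecule_chembl_id': 'FAKE2', 'standard_type': 'IC50'},
]

n_records = len(records)
# a set keeps each molecule ID only once
n_molecules = len({r['molecule_chembl_id'] for r in records})

print(n_records, 'activity records for', n_molecules, 'distinct molecules')
```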


In [9]:
# Let's look at the downloaded data, using the first record as an example
mols_cyp[0]

{'assay_description': 'Inhibition of cytochrome P450 3A4 of isolated guinea pig heart',
 'canonical_smiles': 'Cc1nc2cc(OC[C@H](O)CN3CCN(CC(=O)Nc4cccc(-c5ccccc5)c4)CC3)ccc2s1',
 'document_journal': 'Bioorg. Med. Chem. Lett.',
 'document_year': 2004,
 'molecule_chembl_id': 'CHEMBL152968',
 'relation': '=',
 'standard_relation': '=',
 'standard_type': 'IC50',
 'standard_units': 'nM',
 'standard_value': '37000.0',
 'type': 'IC50',
 'units': 'uM',
 'value': '37.0'}
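
In this record `standard_value` is stored as a string, and the originally reported value (37.0 uM) appears alongside its standardised form (37000.0 nM). Since the filter in download_mols_chembl() admits several units, it can help to bring values onto one scale before comparing them. The helper below is a hypothetical sketch and not part of the ChEMBL client; the name `to_nanomolar` and the decision to treat 'nm' as a spelling variant of 'nM' are assumptions.

```python
# Conversion factors to nanomolar. Treating 'nm' as 'nM' is an assumption.
FACTORS_TO_NM = {'mM': 1e6, 'uM': 1e3, 'nM': 1.0, 'nm': 1.0}

def to_nanomolar(value, units):
    """Convert a value (number or numeric string) to nM; None if not convertible."""
    if value is None or units not in FACTORS_TO_NM:
        return None
    return float(value) * FACTORS_TO_NM[units]

# Fields copied from the example record shown above
record = {'value': '37.0', 'units': 'uM',
          'standard_value': '37000.0', 'standard_units': 'nM'}

reported = to_nanomolar(record['value'], record['units'])
standard = to_nanomolar(record['standard_value'], record['standard_units'])

# both routes should give the same concentration in nM
print(reported, standard)
```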

In [10]:
# Let's convert data into pandas dataframe format
%time df_cyp = queryset2df(mols_cyp)

CPU times: user 865 ms, sys: 85.4 ms, total: 950 ms
Wall time: 1.27 s


In [11]:
# Run the code below to save the collected data
df_cyp.to_csv('Cytochrome_P450_3A4-chembl31.csv', sep='\t')
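
The file is written with tab separators despite its `.csv` extension, so the same separator has to be passed when the file is read back. A small sketch with a made-up dataframe and a temporary file (both are illustrative assumptions, not the real dataset):

```python
import os
import tempfile

import pandas as pd

# Made-up dataframe standing in for df_cyp
df_demo = pd.DataFrame({'molecule_chembl_id': ['FAKE1', 'FAKE2'],
                        'standard_value': [37000.0, 120.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv')
    df_demo.to_csv(path, sep='\t')
    # index_col=0 restores the row labels that to_csv wrote as the first column
    df_back = pd.read_csv(path, sep='\t', index_col=0)

print(df_back.shape)
```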

<h2>Task 1</h2>

<hr>

57 CYP isoforms have been documented in humans. How many CYP protein structures are available in ChEMBL?

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

#
# Your solution here
#

<i>Hint</i>: to solve Task 1, you need to read more about the ChEMBL webresource client.

<a href="A02_Data_Collection_Solution.ipynb">Click for our solution</a>