## Computational Drug Discovery - Bioactivity Data

In this Jupyter notebook, I build a machine learning model using ChEMBL bioactivity data.

**Part 1** covers data collection and pre-processing from the ChEMBL Database.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds.

## Installing libraries

In [3]:
# Install the ChEMBL web service package to download bioactivity data directly from the ChEMBL Database.
! pip install chembl_webresource_client



## Importing libraries

In [5]:
# necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for Target protein

### Target search for Beta Secretase 1
I will be choosing a single protein for further investigation.

In [273]:
# target search for Beta-secretase
target = new_client.target
target_query = target.search('Beta-secretase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Beta-secretase (BACE),20.0,False,CHEMBL2111390,"[{'accession': 'Q9Y5Z0', 'component_descriptio...",PROTEIN FAMILY,9606.0
1,[],Rattus norvegicus,Beta-secretase 2,19.0,False,CHEMBL2331066,"[{'accession': 'Q6IE75', 'component_descriptio...",SINGLE PROTEIN,10116.0
2,"[{'xref_id': 'Q9Y5Z0', 'xref_name': None, 'xre...",Homo sapiens,Beta secretase 2,18.0,False,CHEMBL2525,"[{'accession': 'Q9Y5Z0', 'component_descriptio...",SINGLE PROTEIN,9606.0
3,"[{'xref_id': 'Beta-secretase_1', 'xref_name': ...",Homo sapiens,Beta-secretase 1,18.0,False,CHEMBL4822,"[{'accession': 'P56817', 'component_descriptio...",SINGLE PROTEIN,9606.0
4,"[{'xref_id': 'P56818', 'xref_name': None, 'xre...",Mus musculus,Beta-secretase 1,18.0,False,CHEMBL4593,"[{'accession': 'P56818', 'component_descriptio...",SINGLE PROTEIN,10090.0
...,...,...,...,...,...,...,...,...,...
1160,[],Homo sapiens,Caspase,1.0,False,CHEMBL3831289,"[{'accession': 'P49662', 'component_descriptio...",PROTEIN FAMILY,9606.0
1161,[],Homo sapiens,mTORC1,1.0,False,CHEMBL4296661,"[{'accession': 'P42345', 'component_descriptio...",PROTEIN COMPLEX,9606.0
1162,[],Homo sapiens,mTORC2,1.0,False,CHEMBL4523999,"[{'accession': 'P42345', 'component_descriptio...",PROTEIN COMPLEX,9606.0
1163,"[{'xref_id': 'C3TDZ2', 'xref_name': None, 'xre...",Escherichia coli,3-oxoacyl-[acyl-carrier-protein] synthase 3,0.0,False,CHEMBL1795135,"[{'accession': 'C3TDZ2', 'component_descriptio...",SINGLE PROTEIN,562.0


## Select and retrieve bioactivity data for Beta-secretase 1 (fourth entry)

I will assign the fourth entry (index 3, the human Beta-secretase 1 target) to **selected_target**.

In [274]:
selected_target = targets.target_chembl_id[3]
selected_target

'CHEMBL4822'
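
Picking the row by position works for this result set, but the order of search hits could change between runs. As a minimal sketch, assuming the columns shown in the search output above, the same ID can be selected by name, organism, and target type. `targets_demo` is a hypothetical stand-in built from the first five rows of that output, not the live query result.

```python
import pandas as pd

# Hypothetical stand-in for `targets`: first five rows of the search output above.
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Rattus norvegicus", "Homo sapiens",
                 "Homo sapiens", "Mus musculus"],
    "pref_name": ["Beta-secretase (BACE)", "Beta-secretase 2", "Beta secretase 2",
                  "Beta-secretase 1", "Beta-secretase 1"],
    "target_chembl_id": ["CHEMBL2111390", "CHEMBL2331066", "CHEMBL2525",
                         "CHEMBL4822", "CHEMBL4593"],
    "target_type": ["PROTEIN FAMILY", "SINGLE PROTEIN", "SINGLE PROTEIN",
                    "SINGLE PROTEIN", "SINGLE PROTEIN"],
})

# Keep only the human single-protein entry named "Beta-secretase 1".
mask = (
    (targets_demo["pref_name"] == "Beta-secretase 1")
    & (targets_demo["organism"] == "Homo sapiens")
    & (targets_demo["target_type"] == "SINGLE PROTEIN")
)
selected_id = targets_demo.loc[mask, "target_chembl_id"].iloc[0]
print(selected_id)  # CHEMBL4822
```

Filtering on the organism matters here because the mouse entry (CHEMBL4593) carries the same preferred name.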

Here, I retrieve only the bioactivity records for Beta-secretase 1 (CHEMBL4822) that are reported as IC50 measurements. The `standard_value` column used later holds these values in nM (nanomolar).

In [269]:
activity = new_client.activity
# restrict to a single standard type (IC50) so the dataset is uniform (no mixture of different bioactivity measurement types)
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [271]:
df = pd.DataFrame.from_dict(res)

In [272]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,78857,[],CHEMBL653511,Inhibitory activity against Beta-secretase 1 w...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,nM,UO_0000065,,413.0
1,,391560,[],CHEMBL653332,Compound was tested for its inhibitory activit...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,0.002
2,,391983,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,0.46
3,,395858,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,9.0
4,,395859,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,5.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10426,,23311823,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,1.5
10427,,23311824,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,4.18
10428,Not Determined,23311825,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,,,,
10429,Not Determined,23311826,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,,,,
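
In the table above, the `value` column is reported in mixed units (nM and uM). The sketch below converts the first five rows to nM; `raw` and `TO_NM` are hypothetical names introduced only for illustration. The converted numbers agree with the `standard_value` entries shown for the same rows later in this notebook, which is why the later steps rely on `standard_value` rather than `value`.

```python
import pandas as pd

# First five (value, units) pairs copied from the output above.
raw = pd.DataFrame({
    "value": [413.0, 0.002, 0.46, 9.0, 5.6],
    "units": ["nM", "uM", "uM", "uM", "uM"],
})

# Conversion factors to nanomolar (assumption: only nM and uM occur in this toy data).
TO_NM = {"nM": 1.0, "uM": 1_000.0}

# Multiply each value by the factor for its unit; round to hide float noise.
raw["value_nM"] = (raw["value"] * raw["units"].map(TO_NM)).round(6)
print(raw)
```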


In [275]:
df.standard_type.unique()

array(['IC50'], dtype=object)

In [276]:
df.to_csv('bioactivity_data.csv', index=False)

## Handling missing data

In [277]:
# Drop rows with a missing standard_value or canonical_smiles
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,78857,[],CHEMBL653511,Inhibitory activity against Beta-secretase 1 w...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,nM,UO_0000065,,413.0
1,,391560,[],CHEMBL653332,Compound was tested for its inhibitory activit...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,0.002
2,,391983,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,0.46
3,,395858,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,9.0
4,,395859,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,5.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10423,,23300103,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4841886,Binding affinity to human recombinant BACE-1 (...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,nM,UO_0000065,,0.6
10424,,23311821,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,10.8
10425,,23311822,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,2.71
10426,,23311823,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,1.5


In [278]:
len(df2.canonical_smiles.unique())

7234

In [279]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,78857,[],CHEMBL653511,Inhibitory activity against Beta-secretase 1 w...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,nM,UO_0000065,,413.0
1,,391560,[],CHEMBL653332,Compound was tested for its inhibitory activit...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,0.002
2,,391983,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,0.46
3,,395858,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,9.0
4,,395859,[],CHEMBL653512,Inhibition of human Beta-secretase 1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,5.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10419,,23300099,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4841885,Inhibition of recombinant BACE-1 (unknown orig...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,4.4
10420,,23300100,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4841885,Inhibition of recombinant BACE-1 (unknown orig...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,9.2
10421,,23300101,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4841885,Inhibition of recombinant BACE-1 (unknown orig...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,10.0
10424,,23311821,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4844257,Inhibition of human recombinant BACE-1 express...,B,,,BAO_0000190,BAO_0000019,...,Homo sapiens,Beta-secretase 1,9606,,,IC50,uM,UO_0000065,,10.8
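
For reference, `drop_duplicates(['canonical_smiles'])` keeps the first row for each SMILES string and discards the rest, so any later measurements of the same compound are ignored. The toy data below is made up for illustration (it is not from ChEMBL), and the median aggregation is shown only as one possible alternative, not as the approach used in this notebook.

```python
import pandas as pd

# Made-up example: the first compound was measured twice.
demo = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": [100.0, 300.0, 5000.0],
})

# Same call as in the notebook: the first occurrence of each SMILES is kept.
first_kept = demo.drop_duplicates(["canonical_smiles"])
print(first_kept)

# Possible alternative: one row per SMILES with the median of all its measurements.
median_per_smiles = (
    demo.groupby("canonical_smiles", as_index=False)["standard_value"].median()
)
print(median_per_smiles)
```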


## Data pre-processing of the bioactivity data

### Combine molecule_chembl_id, canonical_smiles, and standard_value into a DataFrame

In [280]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3  

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL406146,CC(C)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H](N)CCC(=O...,413.0
1,CHEMBL78946,CC(C)C[C@H](NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](N...,2.0
2,CHEMBL324109,CCC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(C)=O)[C@@H]...,460.0
3,CHEMBL114147,CC(=O)NCC(=O)N[C@@H](Cc1ccccc1)[C@@H](O)CC(=O)...,9000.0
4,CHEMBL419949,CC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1ccccc1...,5600.0
...,...,...,...
10419,CHEMBL4849016,CC[C@H](C)[C@H](NC(=O)[C@@H](N)CCC(=O)O)C(=O)N...,4400.0
10420,CHEMBL4872824,CC[C@H](C)[C@H](NC(=O)[C@@H](N)CCC(=O)O)C(=O)N...,9200.0
10421,CHEMBL4853052,CC[C@H](C)[C@H](NC(=O)[C@@H](N)CCC(=O)O)C(=O)N...,10000.0
10424,CHEMBL4862716,CC1=CC2Cc3nc4cc(Cl)ccc4c(NCCCCCCCCCNS(=O)(=O)c...,10800.0


In [281]:
df3.to_csv('beta_secretase_data_preprocessed.csv')

### Labeling compounds as either being active, inactive, or intermediate

The bioactivity values are IC50 measurements in nM. Compounds with values of 1,000 nM or less are considered **active**, values between 1,000 and 10,000 nM are referred to as **intermediate**, and values of 10,000 nM or more are considered **inactive**.

In [282]:
# Read back the file saved above; its saved index reappears as an 'Unnamed: 0' column.
df4 = pd.read_csv('beta_secretase_data_preprocessed.csv')

In [283]:
bioactivity_threshold = []
for i in df4.standard_value:
    if float(i) >= 10000:
        bioactivity_threshold.append("inactive")
    elif float(i) <= 1000:
        bioactivity_threshold.append("active")
    else:
        bioactivity_threshold.append("intermediate")

In [284]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,0,CHEMBL406146,CC(C)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H](N)CCC(=O...,413.0,active
1,1,CHEMBL78946,CC(C)C[C@H](NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](N...,2.0,active
2,2,CHEMBL324109,CCC(C)C[C@H](NC(=O)[C@H](CC(C)C)NC(C)=O)[C@@H]...,460.0,active
3,3,CHEMBL114147,CC(=O)NCC(=O)N[C@@H](Cc1ccccc1)[C@@H](O)CC(=O)...,9000.0,intermediate
4,4,CHEMBL419949,CC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1ccccc1...,5600.0,intermediate
...,...,...,...,...,...
7229,10419,CHEMBL4849016,CC[C@H](C)[C@H](NC(=O)[C@@H](N)CCC(=O)O)C(=O)N...,4400.0,intermediate
7230,10420,CHEMBL4872824,CC[C@H](C)[C@H](NC(=O)[C@@H](N)CCC(=O)O)C(=O)N...,9200.0,intermediate
7231,10421,CHEMBL4853052,CC[C@H](C)[C@H](NC(=O)[C@@H](N)CCC(=O)O)C(=O)N...,10000.0,inactive
7232,10424,CHEMBL4862716,CC1=CC2Cc3nc4cc(Cl)ccc4c(NCCCCCCCCCNS(=O)(=O)c...,10800.0,inactive
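
The labeling loop above can also be written as a small vectorized function, which makes the boundary behavior easy to check. This is a sketch under the same cut-offs as the loop; `label_ic50` is a hypothetical helper name, and the test values are taken from the `standard_value` column shown above plus the 1,000 nM boundary.

```python
import numpy as np
import pandas as pd

def label_ic50(values_nm):
    """Label IC50 values (in nM) with the same cut-offs as the loop above."""
    values = pd.Series(values_nm, dtype=float)
    labels = np.select(
        [values >= 10000, values <= 1000],  # conditions are checked in this order
        ["inactive", "active"],
        default="intermediate",
    )
    return pd.Series(labels, index=values.index, name="class")

# 1000 and 10000 sit exactly on the boundaries.
demo_labels = label_ic50([413.0, 1000.0, 5600.0, 10000.0, 10800.0])
print(demo_labels.tolist())
# ['active', 'active', 'intermediate', 'inactive', 'inactive']
```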


In [285]:
df5.to_csv('beta_secretase_bioactivity_data_curated.csv', index=False)