Installing libraries
Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Importing libraries

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

Search for Target protein
Target search for coronavirus

In [None]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets
     

Select and retrieve bioactivity data for SARS coronavirus 3C-like proteinase (fifth entry)
We will assign the fifth entry (index 4, because pandas indexing starts at zero), which corresponds to the target protein SARS coronavirus 3C-like proteinase, to the selected_target variable.

In [None]:
selected_target = targets.target_chembl_id[4]
selected_target
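Selecting by position depends on the order in which the search returns its results, and that order may change over time. As a minimal sketch, assuming the results contain a `pref_name` column alongside `target_chembl_id`, the target could be selected by name instead. The table below is a hand-made stand-in with placeholder IDs, not real ChEMBL output, and the names `example_targets` and `selected_example` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` DataFrame; the IDs are placeholders.
example_targets = pd.DataFrame({
    "pref_name": [
        "Coronavirus",
        "SARS coronavirus 3C-like proteinase",
        "Feline coronavirus",
    ],
    "target_chembl_id": ["CHEMBL_X1", "CHEMBL_X2", "CHEMBL_X3"],
})

# Match on the protein name rather than on the row position.
mask = example_targets["pref_name"].str.contains(
    "3C-like proteinase", case=False, na=False
)
matches = example_targets.loc[mask, "target_chembl_id"]

# Fail loudly if the name matches no row or more than one row.
if len(matches) != 1:
    raise ValueError(f"Expected exactly one match, found {len(matches)}")

selected_example = matches.iloc[0]
print(selected_example)
```

With the real `targets` DataFrame, it is still worth checking the matched row by eye before retrieving bioactivity data.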

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)
df.head(3)

In [None]:
df.to_csv('bioactivity_data_raw.csv', index=False)

Handling missing data
If any compound has a missing value in the standard_value column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2
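An equivalent way to write this step is `dropna` with the `subset` argument. The sketch below uses made-up rows (the molecule IDs are placeholders) and also converts the column to numbers, on the assumption that the values may arrive as strings, which the later `float(i)` calls suggest.

```python
import pandas as pd

# Made-up rows standing in for the retrieved bioactivity data.
example = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "standard_value": ["7200.0", None, "390.0"],
})

# Drop rows without a standard_value and renumber the rows from zero.
cleaned = example.dropna(subset=["standard_value"]).reset_index(drop=True)

# Convert the remaining values from strings to floats.
cleaned["standard_value"] = pd.to_numeric(cleaned["standard_value"])

print(cleaned)
```

Renumbering the rows here means later steps that pair rows by position do not have to account for gaps in the index.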

Data pre-processing of the bioactivity data
Labeling compounds as either being active, inactive or intermediate
The bioactivity values are IC50 measurements, reported in nM. Compounds with an IC50 of 1,000 nM or less are considered active, and those with an IC50 of 10,000 nM or more are considered inactive. Compounds with values between 1,000 and 10,000 nM are labeled intermediate.

In [None]:
# Label each compound by its IC50 value (nM)
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    # Keep one label per row so the list stays the same length as df2
    bioactivity_class.append("intermediate")
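The same labeling can be written without an explicit loop. The following is a sketch, not the notebook's method: it uses `numpy.select` with the same thresholds as the loop (1,000 nM and 10,000 nM), and the function name `classify_ic50` is hypothetical.

```python
import numpy as np
import pandas as pd

def classify_ic50(values_nm):
    """Label IC50 values (nM) as active, inactive, or intermediate."""
    v = pd.to_numeric(pd.Series(values_nm), errors="coerce")
    # Conditions are checked in order; anything matching neither gets the default.
    # Note: NaN matches neither condition, so drop missing values beforehand.
    labels = np.select(
        [v <= 1000, v >= 10000],
        ["active", "inactive"],
        default="intermediate",
    )
    return pd.Series(labels, index=v.index, name="bioactivity_class")

example_labels = classify_ic50(["390.0", "7200", "10000", "1000", "25000"])
print(example_labels.tolist())
```

Because the returned Series reuses the index of its input, passing a DataFrame column keeps each label attached to the row it came from.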

Combine the three columns (molecule_chembl_id, canonical_smiles, standard_value) with bioactivity_class into a DataFrame

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

In [None]:
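bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
# Reset the index so rows pair up by position; df3 keeps the gaps left by the earlier filtering
df4 = pd.concat([df3.reset_index(drop=True), bioactivity_class], axis=1)
df4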
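One detail to watch in this step: `pd.concat` with `axis=1` pairs rows by index label, not by position. After rows have been filtered out, the DataFrame index can have gaps, while a newly built Series is numbered from zero. The sketch below uses made-up data (placeholder molecule IDs) to show the effect and one way to avoid it.

```python
import pandas as pd

# A filtered frame whose index has a gap (row 1 was dropped earlier).
filtered = pd.DataFrame(
    {"molecule_chembl_id": ["MOL_A", "MOL_C"], "standard_value": [7200.0, 390.0]},
    index=[0, 2],
)

# A freshly built Series is indexed 0, 1.
labels = pd.Series(["intermediate", "active"], name="bioactivity_class")

# Index labels 0, 1, 2 are unioned, so rows no longer pair up.
misaligned = pd.concat([filtered, labels], axis=1)

# Resetting the index first pairs the rows by position.
aligned = pd.concat([filtered.reset_index(drop=True), labels], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned result, the label meant for the second compound lands on a row of missing values, which is easy to overlook once the table has been written to CSV.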

In [None]:
df4.to_csv('bioactivity_data_preprocessed.csv', index=False)

In [None]:
! ls -l