<a href="https://colab.research.google.com/github/kithmini-wijesiri/SMILES/blob/main/bioactivity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Part 1**
Data collection and pre-processing from the ChEMBL database

1) Installing *libraries*

This command will download and install the ChEMBL web resource client library along with its dependencies.

In [None]:
! pip install chembl_webresource_client

2) Importing libraries


The first line imports the pandas library under the alias pd.
The second line imports the new_client object from the chembl_webresource_client.new_client module.

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

3) Search for the target protein


The code below first creates a reference to the target resource of the ChEMBL web resource client.
It then searches for targets related to coronavirus using the search method of that resource.
The search result, a collection of dictionary-like records, is converted into a pandas DataFrame.
The last line displays the DataFrame with the information about the coronavirus-related targets.

In [None]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets
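
The conversion step can be tried offline with a minimal sketch like the one below. The two records are made-up placeholders, and the field names are only assumed to resemble what a target search returns; the real result has more fields and rows.

```python
import pandas as pd

# Hypothetical records standing in for a target search result.
# The values are placeholders, not real ChEMBL data.
example_records = [
    {"pref_name": "Example target A", "target_chembl_id": "CHEMBL_EXAMPLE_A", "target_type": "SINGLE PROTEIN"},
    {"pref_name": "Example target B", "target_chembl_id": "CHEMBL_EXAMPLE_B", "target_type": "ORGANISM"},
]

# Each dictionary becomes one row; the dictionary keys become the column names.
example_targets = pd.DataFrame(example_records)

print(example_targets.shape)
print(list(example_targets.columns))
```

Passing the list of records straight to the DataFrame constructor is used here for simplicity; the notebook's own cell uses pd.DataFrame.from_dict on the query result.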

4) Select and retrieve bioactivity data for SARS coronavirus 3C-like proteinase

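Select the target in the row with index 6 of the targets table. Note that 6 is a row index, not a ChEMBL ID; the ChEMBL ID of that row is shown in the output.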

In [None]:
selected_target = targets.target_chembl_id[6]
selected_target

'CHEMBL3927'
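
Selecting by row index depends on the order of the search results, which may change over time. A possible alternative, sketched below on a small made-up table, is to select the row by target name. The pref_name column and the exact name string are assumptions taken from the step heading above; only the ID CHEMBL3927 comes from this notebook's output.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` table. Only CHEMBL3927 is taken
# from this notebook; the other rows are placeholders.
example_targets = pd.DataFrame({
    "pref_name": [
        "Example target A",
        "SARS coronavirus 3C-like proteinase",
        "Example target B",
    ],
    "target_chembl_id": ["CHEMBL_EXAMPLE_A", "CHEMBL3927", "CHEMBL_EXAMPLE_B"],
})

# Build a boolean mask on the target name, then take the first matching ID.
mask = example_targets["pref_name"] == "SARS coronavirus 3C-like proteinase"
selected_by_name = example_targets.loc[mask, "target_chembl_id"].iloc[0]

print(selected_by_name)
```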

5) Retrieve bioactivity data for coronavirus 3C-like proteinase (CHEMBL3927) that are reported as IC50 values. The lower the IC50, the more potent the compound, i.e. the more effectively it inhibits the target activity.

This code uses the ChEMBL web resource client to retrieve the bioactivity records for the coronavirus 3C-like proteinase (CHEMBL3927) whose standard_type is IC50. The query filters by target and by standard type only; it does not filter on units, and the later steps treat standard_value as nanomolar (nM). The filtered result is stored in the variable res. The records are then converted into a pandas DataFrame named df, and df.head(3) displays its first 3 rows.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df.head(3)

To see which unique standard types are present in the data:

In [None]:
df.standard_type.unique()

array(['IC50'], dtype=object)
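
The units can be checked in the same way. The sketch below uses a small made-up table and assumes the activity records carry a standard_units column; the values are placeholders chosen only to show how rows in other units could be filtered out.

```python
import pandas as pd

# Hypothetical activity records; the values are placeholders, not ChEMBL data.
example_df = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
    "standard_value": ["7200.0", "9400.0", "3.5"],
})

# List the distinct units, then keep only the rows reported in nM.
print(example_df["standard_units"].unique())
nm_only = example_df[example_df["standard_units"] == "nM"]

print(len(nm_only))
```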

6) Save the resulting bioactivity data to a CSV file, bioactivity_data_raw.csv.

In [None]:
df.to_csv('bioactivity_data_raw.csv', index=False)

7) Copy files to Google Drive

The drive module provides functions to mount and interact with Google Drive. This code imports the drive module from the google.colab package and mounts your Google Drive into the Colab environment. The force_remount=True option forces Colab to remount the drive even if it has already been mounted.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


8) Create a data folder in the Colab Notebooks folder on Google Drive

In [None]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! cp bioactivity_data_raw.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
!ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 67
-rw------- 1 root root 68403 Feb  4 01:36 bioactivity_data_raw.csv


9) Check which files are in the working directory so far

In [None]:
! ls

bioactivity_data_raw.csv  gdrive  sample_data


10) Preview the first lines of the .csv file we just created

In [None]:
! head bioactivity_data_raw.csv

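11) Handling missing data

Keep only the rows that have a value in the standard_value column; rows where it is missing are dropped.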

In [None]:
df2 = df[df.standard_value.notna()]
df2
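
notna() only removes entries that are actually missing. If the column could also contain text that is not a number, one option is to convert it with pd.to_numeric first, as in the sketch below. The table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical records: one valid value, one missing value, one non-numeric string.
example_df = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "standard_value": ["7200.0", None, "not reported"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_value = pd.to_numeric(example_df["standard_value"], errors="coerce")

# Keep only the rows whose value parsed successfully.
example_clean = example_df[numeric_value.notna()]

print(len(example_clean))
```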

**Part 2: Data pre-processing**

1) Labeling the compounds as active, inactive or intermediate


 IC50 <= 1,000 nM = active
 IC50 >= 10,000 nM = inactive
 1,000 nM < IC50 < 10,000 nM = intermediate

In [None]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
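
The same labeling can be written without an explicit loop. The sketch below applies the thresholds from the loop above to a few made-up IC50 values using np.select; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical IC50 values in nM, stored as strings and converted to float.
example_values = pd.Series(["500", "1000", "5000", "10000", "25000"]).astype(float)

# Conditions are checked in order; values matching neither get the default label.
conditions = [example_values >= 10000, example_values <= 1000]
labels = ["inactive", "active"]
example_class = np.select(conditions, labels, default="intermediate")

print(list(example_class))
```

The boundary values 1000 and 10000 fall into the active and inactive classes respectively, matching the >= and <= comparisons in the loop.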

2) Collect the *molecule_chembl_id* values into a list

This code iterates through each value in the 'molecule_chembl_id' column of the DataFrame df2 and appends each value to the list mol_cid. After running this code, mol_cid will contain all the 'molecule_chembl_id' values present in the DataFrame.

In [None]:
df2.molecule_chembl_id

In [None]:
mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

In [None]:
mol_cid

SMILES (Simplified Molecular-Input Line-Entry System) is a widely used notation for representing chemical structures as text. A canonical SMILES is a standardized SMILES string, so that a given molecule is always written the same way. This code iterates through each value in the 'canonical_smiles' column of the DataFrame df2 and appends each value to the list canonical_smiles.

In [None]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

In [None]:
canonical_smiles

The same is done for the IC50 values in the 'standard_value' column.

In [None]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)


In [None]:
standard_value
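
Each of the three loops above copies a column into a list one element at a time. A shorter way to get the same lists, sketched here on a made-up two-row table, is the tolist() method of a pandas Series.

```python
import pandas as pd

# Hypothetical table with the three columns used above; the values are placeholders.
example_df = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B"],
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "standard_value": ["7200.0", "9400.0"],
})

# tolist() returns the column values as a plain Python list, in row order.
example_ids = example_df["molecule_chembl_id"].tolist()
example_smiles = example_df["canonical_smiles"].tolist()
example_ic50 = example_df["standard_value"].tolist()

print(example_ids)
```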

**Alternative simple method**

In [None]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]

In [None]:
df3

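The selected columns are combined with the bioactivity class in the following code. pd.concat with axis=1 lines rows up by index label, and df2 keeps its original row labels after the missing-value filter, so the index is reset first to match the positions of the class labels. The result is assigned back to df3 so that the class column is included when the file is saved.

In [None]:
df3 = pd.concat([df2[selection].reset_index(drop=True), pd.Series(bioactivity_class, name='bioactivity_class')], axis=1)
df3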

Create a .csv file for the pre-processed data

In [None]:
df3.to_csv('bioacivity_preprocessed_data.csv', index=False)

In [None]:
! ls -l

Copy to Google Drive

In [None]:
! cp bioacivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioacivity_preprocessed_data.csv  bioactivity_data_raw.csv
