# Helpful Links
Tutorial provided by Professor Chanin Nantasenamat. 

[Data Professor YouTube Channel](https://www.youtube.com/dataprofessor/)

[Link to YouTube Tutorial](https://www.youtube.com/watch?v=jBlTQjcKuaY&t=640s)


*First modified: 20 May 2022*

*Last modified:* 	

*copyright (c) 2022 - Samantha Martell*



# Background
Aim: perform exploratory data analysis and build a machine learning model to gain data-driven insights for drug discovery.

# Data Collection
The "molecular solubility value" is an important physicochemical property of a chemical that describes the extent to which it can be solubilised in water.

First, retrieve the data from the [ChEMBL database](https://www.ebi.ac.uk/chembl/).

## Virtual Environment Set Up
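In the terminal (Windows), create and activate the virtual environment:

        python -m venv new-env
        new-env\Scripts\activate.bat

Then, in a Python session, check that the module search path points into `new-env`:

        import sys
        sys.path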
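
The `sys.path` check can be complemented by a direct test of whether the interpreter is running inside a virtual environment. The following is a minimal sketch, assuming Python 3.3 or later; the helper name `in_virtual_environment` is made up for illustration and is not part of the original tutorial.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

If the second line prints `False`, the environment was probably not activated in the terminal that launched Python.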

## Installing libraries
Install the ChEMBL web service package so data can be retrieved from the database.

To set up a venv on Windows with VS Code, see [this tutorial](https://www.youtube.com/watch?v=fBBAGXjg2fk) by [42technoman](https://www.youtube.com/c/42technoman).

Run the following command in the terminal:

        pip3 install chembl_webresource_client

In [3]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [7]:
# Search for a target protein
target = new_client.target
target_query = target.search("coronavirus")
# convert the query results to a DataFrame
targets = pd.DataFrame.from_dict(target_query)

In [12]:
# explore data frame
targets.head()
# targets.info()

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],SARS coronavirus,SARS coronavirus,15.0,False,CHEMBL612575,[],ORGANISM,227859
2,[],Feline coronavirus,Feline coronavirus,15.0,False,CHEMBL612744,[],ORGANISM,12663
3,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137
4,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859


In [13]:
# filter by target type
single_proteins = targets.loc[targets["target_type"] == "SINGLE PROTEIN"]
single_proteins.head()

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
4,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859
6,"[{'xref_id': 'P0C6X7', 'xref_name': None, 'xre...",SARS coronavirus,Replicase polyprotein 1ab,4.0,False,CHEMBL5118,"[{'accession': 'P0C6X7', 'component_descriptio...",SINGLE PROTEIN,227859
7,[],Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,4.0,False,CHEMBL4523582,"[{'accession': 'P0DTD1', 'component_descriptio...",SINGLE PROTEIN,2697049
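
The target ID can also be looked up by name rather than copied by hand from the table. The sketch below is illustrative only: it rebuilds a small stand-in for `single_proteins` from the values shown in the output above so that it runs without a network connection, and the variable names `wanted_name` and `looked_up_target` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Stand-in for the filtered frame, using values copied from the output above.
single_proteins = pd.DataFrame(
    {
        "pref_name": [
            "SARS coronavirus 3C-like proteinase",
            "Replicase polyprotein 1ab",
            "Replicase polyprotein 1ab",
        ],
        "target_chembl_id": ["CHEMBL3927", "CHEMBL5118", "CHEMBL4523582"],
    },
    index=[4, 6, 7],
)

wanted_name = "SARS coronavirus 3C-like proteinase"

# Select the ChEMBL IDs of all rows whose preferred name matches exactly.
matches = single_proteins.loc[
    single_proteins["pref_name"] == wanted_name, "target_chembl_id"
]
if matches.empty:
    raise ValueError(f"No target named {wanted_name!r} in the search results")

# Take the first match by position, since the row labels are not 0-based.
looked_up_target = matches.iloc[0]
print(looked_up_target)
```

Matching on `pref_name` alone is ambiguous for names that appear more than once (for example "Replicase polyprotein 1ab"), so a real lookup may also need to filter on `organism`.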


## Select and retrieve bioactivity data for *SARS coronavirus 3C-like proteinase*

In [23]:
selected_target = "CHEMBL3927"

activity = new_client.activity
# apply two filters: the selected target, and IC50 measurements only
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

# check the type
print(type(res))

# res is a QuerySet, so convert it to a DataFrame
df = pd.DataFrame.from_dict(res)

df.head()

<class 'chembl_webresource_client.query_set.QuerySet'>


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1480935,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,7.2
1,,1480936,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,9.4
2,,1481061,[],CHEMBL830868,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,13.5
3,,1481065,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,13.11
4,,1481066,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,2.0


In [24]:
# check IC50 is the only standard_type present
df.standard_type.unique()

array(['IC50'], dtype=object)

In [25]:
# write the dataframe to a .csv
df.to_csv("data/bioactivity_data.csv", index = False)
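
Note that `to_csv` does not create missing parent directories, so the call above fails if the `data` folder does not exist yet. One way to guard against this is sketched below; the helper `ensure_parent_dir` is a hypothetical name introduced here, not something from the tutorial or from pandas.

```python
from pathlib import Path


def ensure_parent_dir(file_path: str) -> Path:
    """Create the parent directory of file_path if needed and return the path."""
    path = Path(file_path)
    # parents=True creates intermediate folders; exist_ok=True makes
    # the call a no-op when the directory is already there.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


output_path = ensure_parent_dir("data/bioactivity_data.csv")
# The DataFrame from the cell above can then be written with:
# df.to_csv(output_path, index=False)
```

Running this once before the `to_csv` cell is enough; repeated runs leave an existing `data` folder untouched.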