<h1>Bioinformatics Project - Computational Drug Discovery</h1>

ChEMBL Database
The ChEMBL Database contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 84,000 documents and 1.2 million assays, and the data span 14,855 targets, 1,800 cells, and 33,000 indications. [Data as of March 25, 2020; ChEMBL version 26].

In [2]:
!pip install chembl_webresource_client



In [3]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [4]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],SARS coronavirus,SARS coronavirus,15.0,False,CHEMBL612575,[],ORGANISM,227859
2,[],Feline coronavirus,Feline coronavirus,15.0,False,CHEMBL612744,[],ORGANISM,12663
3,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137
4,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859
5,[],Middle East respiratory syndrome-related coron...,Middle East respiratory syndrome-related coron...,9.0,False,CHEMBL4296578,[],ORGANISM,1335626
6,"[{'xref_id': 'P0C6X7', 'xref_name': None, 'xre...",SARS coronavirus,Replicase polyprotein 1ab,4.0,False,CHEMBL5118,"[{'accession': 'P0C6X7', 'component_descriptio...",SINGLE PROTEIN,227859
7,[],Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,4.0,False,CHEMBL4523582,"[{'accession': 'P0DTD1', 'component_descriptio...",SINGLE PROTEIN,2697049


Select and retrieve bioactivity data for Severe acute respiratory syndrome coronavirus 2 (the eighth entry, index 7)

In [7]:
selected_target = targets.target_chembl_id[7]
selected_target

'CHEMBL4523582'
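Selecting by position works only as long as the search results keep the same order. As a hedged alternative, the sketch below picks the target by its `organism` and `target_type` values. It runs on a small stand-in table (`targets_demo`, a hypothetical name) built from three rows of the output above, so it does not need network access.

```python
import pandas as pd

# Toy stand-in for the `targets` table above (only the columns used here).
targets_demo = pd.DataFrame({
    "organism": [
        "SARS coronavirus",
        "SARS coronavirus",
        "Severe acute respiratory syndrome coronavirus 2",
    ],
    "target_type": ["ORGANISM", "SINGLE PROTEIN", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL612575", "CHEMBL5118", "CHEMBL4523582"],
})

# Keep single-protein targets whose organism is SARS-CoV-2.
mask = (
    (targets_demo["organism"] == "Severe acute respiratory syndrome coronavirus 2")
    & (targets_demo["target_type"] == "SINGLE PROTEIN")
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the search result is not what we expect.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

The same mask could be applied to the real `targets` DataFrame, assuming the column names stay as shown in the output above.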

We will retrieve bioactivity data for Severe acute respiratory syndrome coronavirus 2 (CHEMBL4523582) that are reported as IC50 values in nM.

What is IC50? (via Wikipedia)
The half maximal inhibitory concentration (IC50) is a measure of the potency of a substance in inhibiting a specific biological or biochemical function. IC50 is a quantitative measure that indicates how much of a particular inhibitory substance (e.g. drug) is needed to inhibit, in vitro, a given biological process or biological component by 50%. The biological component could be an enzyme, cell, cell receptor or microorganism. IC50 values are typically expressed as molar concentration.
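Because IC50 values span several orders of magnitude, they are often compared on a negative log scale, pIC50 = -log10(IC50 in mol/L). The sketch below is a minimal illustration of that conversion for values given in nM; the function name `ic50_nm_to_pic50` is an assumption made for this example. The `pchembl_value` column in the bioactivity table further down appears consistent with this formula (390.0 nM is listed with 6.41).

```python
import math

def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L
    return -math.log10(ic50_nm * 1e-9)

# A lower IC50 (more potent compound) gives a higher pIC50.
print(round(ic50_nm_to_pic50(390.0), 2))
print(round(ic50_nm_to_pic50(80.0), 2))
```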

In [8]:
activity = new_client.activity
res = activity.filter(target_chembl_id = selected_target).filter(standard_type="IC50")

In [9]:
df = pd.DataFrame.from_dict(res)

In [29]:
pd.set_option('display.max_columns', None)
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,,,CHEMBL4495564,,2020,,CHEMBL480,LANSOPRAZOLE,CHEMBL480,6.41,0,http://www.openphacts.org/units/Nanomolar,3341963,=,52,1,=,,IC50,nM,,390.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(-c2cnccn2)ssc1=S,,,CHEMBL4495564,,2020,,CHEMBL178459,OLTIPRAZ,CHEMBL178459,6.68,0,http://www.openphacts.org/units/Nanomolar,3341991,=,52,1,=,,IC50,nM,,210.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,,,CHEMBL4495564,,2020,,CHEMBL3545157,TIDEGLUSIB,CHEMBL3545157,7.10,0,http://www.openphacts.org/units/Nanomolar,3342067,=,52,1,=,,IC50,nM,,80.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,,,CHEMBL4495564,,2020,,CHEMBL297453,EPIGALOCATECHIN GALLATE,CHEMBL297453,5.80,0,http://www.openphacts.org/units/Nanomolar,3342156,=,52,1,=,,IC50,nM,,1580.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C1C=Cc2cc(Br)ccc2C1=O,,,CHEMBL4495564,,2020,,CHEMBL4303595,,CHEMBL4303595,7.40,0,http://www.openphacts.org/units/Nanomolar,3342307,=,52,1,=,,IC50,nM,,40.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112,Dtt Insensitive,19964311,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,C=CC(=O)c1ccc2ccccc2c1,,,CHEMBL4495564,,2020,,CHEMBL154580,,CHEMBL154580,5.91,0,http://www.openphacts.org/units/Nanomolar,3350392,=,52,1,=,,IC50,nM,,1240.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.24
113,Dtt Insensitive,19964312,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],,,CHEMBL4495564,,2020,,CHEMBL354349,EUFLAVINE,CHEMBL1184529,5.30,0,http://www.openphacts.org/units/Nanomolar,3350497,=,52,1,=,,IC50,nM,,4980.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.98
114,Dtt Insensitive,19964313,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],,,CHEMBL4495564,,2020,,CHEMBL1382627,"SULFADIAZINE, SILVER",CHEMBL439,6.12,0,http://www.openphacts.org/units/Nanomolar,3350585,=,52,1,=,,IC50,nM,,750.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.75
115,Dtt Insensitive,19964314,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,,,,CHEMBL4495564,,2020,,CHEMBL4303664,,CHEMBL4303664,6.06,0,http://www.openphacts.org/units/Nanomolar,3350604,=,52,1,=,,IC50,nM,,880.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.88


In [11]:
df.standard_type.unique()

array(['IC50'], dtype=object)

We will save the resulting bioactivity data to a CSV file

In [30]:
df.to_csv('bioactivity_data.csv', index=False)

<h2>Handling missing data</h2>
If a compound has a missing value in the standard_value or canonical_smiles column, we drop it.
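As a small illustration of this step, the sketch below drops incomplete rows from a toy table in a single call to `dropna(subset=...)`. The molecule IDs and the `toy` and `clean` names are made up for the example; only the column names are taken from the bioactivity table.

```python
import pandas as pd

# Toy data: MOL_B lacks a SMILES string, MOL_C lacks a standard_value.
toy = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C", "MOL_D"],  # hypothetical IDs
    "canonical_smiles": ["CCO", None, "CCN", "CCC"],
    "standard_value": [390.0, 210.0, None, 80.0],
})

# Drop every row that is missing either of the two required fields.
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])

print(len(toy), len(clean))
print(list(clean["molecule_chembl_id"]))
```

Passing both columns in `subset` keeps the filter in one place, so the two conditions cannot drift apart.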

In [35]:
# Keep only rows that have both a standard_value and a SMILES string
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,,,CHEMBL4495564,,2020,,CHEMBL480,LANSOPRAZOLE,CHEMBL480,6.41,0,http://www.openphacts.org/units/Nanomolar,3341963,=,52,1,=,,IC50,nM,,390.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(-c2cnccn2)ssc1=S,,,CHEMBL4495564,,2020,,CHEMBL178459,OLTIPRAZ,CHEMBL178459,6.68,0,http://www.openphacts.org/units/Nanomolar,3341991,=,52,1,=,,IC50,nM,,210.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,,,CHEMBL4495564,,2020,,CHEMBL3545157,TIDEGLUSIB,CHEMBL3545157,7.10,0,http://www.openphacts.org/units/Nanomolar,3342067,=,52,1,=,,IC50,nM,,80.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,,,CHEMBL4495564,,2020,,CHEMBL297453,EPIGALOCATECHIN GALLATE,CHEMBL297453,5.80,0,http://www.openphacts.org/units/Nanomolar,3342156,=,52,1,=,,IC50,nM,,1580.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C1C=Cc2cc(Br)ccc2C1=O,,,CHEMBL4495564,,2020,,CHEMBL4303595,,CHEMBL4303595,7.40,0,http://www.openphacts.org/units/Nanomolar,3342307,=,52,1,=,,IC50,nM,,40.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,Dtt Insensitive,19964310,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(C...,,,CHEMBL4495564,,2020,,CHEMBL376488,BEDAQUILINE,CHEMBL376488,5.36,0,http://www.openphacts.org/units/Nanomolar,3350188,=,52,1,=,,IC50,nM,,4360.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.36
112,Dtt Insensitive,19964311,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,C=CC(=O)c1ccc2ccccc2c1,,,CHEMBL4495564,,2020,,CHEMBL154580,,CHEMBL154580,5.91,0,http://www.openphacts.org/units/Nanomolar,3350392,=,52,1,=,,IC50,nM,,1240.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.24
113,Dtt Insensitive,19964312,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],,,CHEMBL4495564,,2020,,CHEMBL354349,EUFLAVINE,CHEMBL1184529,5.30,0,http://www.openphacts.org/units/Nanomolar,3350497,=,52,1,=,,IC50,nM,,4980.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.98
114,Dtt Insensitive,19964313,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],,,CHEMBL4495564,,2020,,CHEMBL1382627,"SULFADIAZINE, SILVER",CHEMBL439,6.12,0,http://www.openphacts.org/units/Nanomolar,3350585,=,52,1,=,,IC50,nM,,750.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.75


In [36]:
len(df2.canonical_smiles.unique())

102

In [38]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,Dtt Insensitive,19964199,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,,,CHEMBL4495564,,2020,,CHEMBL480,LANSOPRAZOLE,CHEMBL480,6.41,0,http://www.openphacts.org/units/Nanomolar,3341963,=,52,1,=,,IC50,nM,,390.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.39
1,Dtt Insensitive,19964200,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Cc1c(-c2cnccn2)ssc1=S,,,CHEMBL4495564,,2020,,CHEMBL178459,OLTIPRAZ,CHEMBL178459,6.68,0,http://www.openphacts.org/units/Nanomolar,3341991,=,52,1,=,,IC50,nM,,210.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.21
2,Dtt Insensitive,19964201,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,,,CHEMBL4495564,,2020,,CHEMBL3545157,TIDEGLUSIB,CHEMBL3545157,7.10,0,http://www.openphacts.org/units/Nanomolar,3342067,=,52,1,=,,IC50,nM,,80.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.08
3,Dtt Insensitive,19964202,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,,,CHEMBL4495564,,2020,,CHEMBL297453,EPIGALOCATECHIN GALLATE,CHEMBL297453,5.80,0,http://www.openphacts.org/units/Nanomolar,3342156,=,52,1,=,,IC50,nM,,1580.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.58
4,Dtt Insensitive,19964203,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,O=C1C=Cc2cc(Br)ccc2C1=O,,,CHEMBL4495564,,2020,,CHEMBL4303595,,CHEMBL4303595,7.40,0,http://www.openphacts.org/units/Nanomolar,3342307,=,52,1,=,,IC50,nM,,40.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,Dtt Insensitive,19964310,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(C...,,,CHEMBL4495564,,2020,,CHEMBL376488,BEDAQUILINE,CHEMBL376488,5.36,0,http://www.openphacts.org/units/Nanomolar,3350188,=,52,1,=,,IC50,nM,,4360.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.36
112,Dtt Insensitive,19964311,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,C=CC(=O)c1ccc2ccccc2c1,,,CHEMBL4495564,,2020,,CHEMBL154580,,CHEMBL154580,5.91,0,http://www.openphacts.org/units/Nanomolar,3350392,=,52,1,=,,IC50,nM,,1240.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,1.24
113,Dtt Insensitive,19964312,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],,,CHEMBL4495564,,2020,,CHEMBL354349,EUFLAVINE,CHEMBL1184529,5.30,0,http://www.openphacts.org/units/Nanomolar,3350497,=,52,1,=,,IC50,nM,,4980.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,4.98
114,Dtt Insensitive,19964313,[],CHEMBL4495583,SARS-CoV-2 3CL-Pro protease inhibition IC50 de...,F,,,BAO_0000190,BAO_0000019,assay format,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],,,CHEMBL4495564,,2020,,CHEMBL1382627,"SULFADIAZINE, SILVER",CHEMBL439,6.12,0,http://www.openphacts.org/units/Nanomolar,3350585,=,52,1,=,,IC50,nM,,750.0,CHEMBL4523582,Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,2697049,,,IC50,uM,UO_0000065,,0.75


<h2> Data pre-processing of the bioactivity data </h2>

Labelling compounds as active, inactive, or intermediate

The bioactivity data are IC50 values in nM. Compounds with values below 1,000 nM are considered active, and those with values above 10,000 nM are considered inactive. Compounds with values between 1,000 and 10,000 nM are referred to as intermediate.
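The thresholds above can be sketched as a small function. This is one possible reading of the rule, not necessarily the notebook's own implementation: the name `label_bioactivity` is hypothetical, and values of exactly 1,000 or 10,000 nM are assumed to count as intermediate, since the text only says "less than" and "greater than".

```python
def label_bioactivity(ic50_nm) -> str:
    """Label a compound from its IC50 in nM using the thresholds above."""
    # float() also accepts numeric strings, in case values arrive as text.
    value = float(ic50_nm)
    if value < 1000:
        return "active"
    if value > 10000:
        return "inactive"
    # Assumption: the boundary values 1,000 and 10,000 fall here.
    return "intermediate"

# IC50 values taken from the table above, plus one hypothetical weak binder.
for ic50 in (390.0, 1580.0, 4980.0, 25000.0):
    print(ic50, label_bioactivity(ic50))
```

With pandas, the function could be applied to a column with something like `df3['standard_value'].apply(label_bioactivity)`, assuming the column holds numbers or numeric strings.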

In [39]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,390.0
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,210.0
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,80.0
3,CHEMBL297453,O=C(O[C@@H]1Cc2c(O)cc(O)cc2O[C@@H]1c1cc(O)c(O)...,1580.0
4,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,40.0
...,...,...,...
111,CHEMBL376488,COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(C...,4360.0
112,CHEMBL154580,C=CC(=O)c1ccc2ccccc2c1,1240.0
113,CHEMBL354349,C[n+]1c2cc(N)ccc2cc2ccc(N)cc21.[Cl-],4980.0
114,CHEMBL1382627,Nc1ccc(S(=O)(=O)[N-]c2ncccn2)cc1.[Ag+],750.0


In [None]:
df4 = pd.read_csv(