# Bioinformatics Project - Computational Drug Discovery - Influenza virus A matrix protein M2  
Michael Bahchevanov  
***

## Data Collection & Preprocessing 🔨  
This notebook looks into the data source, the **ChEMBL** database, and the protein of interest, the matrix protein M2 of the Influenza A virus. This protein maintains pH homeostasis: its recognized function is to equilibrate pH across the viral membrane during cell entry and across the trans-Golgi membrane of infected cells during viral maturation.

### 1. What is the ChEMBL Database  
The <a href="https://www.ebi.ac.uk/chembl/">*ChEMBL Database*</a> is a manually curated, freely available database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. It contains information on more than 1.8 million compounds and over 15 million records of their effects on biological systems. [Data as of June 11, 2021; <a href="https://www.ebi.ac.uk/training/online/courses/chembl-quick-tour/what-is-chembl">*Reference*</a>]  
<img src="assets/Data_included_in_ChEMBL.png">

### 2. Installing libraries and tooling 🔧  
We need to install the *ChEMBL* web client package in order to retrieve the bioactivity data from the database.

In [1]:
!pip install chembl_webresource_client



We will also be installing *pandas* for data wrangling.

In [2]:
!pip install pandas



#### 2.1 Importing libraries

We import the libraries and remove the limits *pandas* places on the number of displayed rows and columns, so that the available data can be viewed in full. We also silence the chained-assignment warning.

In [3]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None

### 3. Searching and retrieving the target protein

#### 3.1 Target search for Influenza virus A matrix protein M2

In [4]:
# Search the ChEMBL targets for entries matching 'influenza'
target = new_client.target
query = target.search('influenza')
targets = pd.DataFrame.from_dict(query)
print(targets.shape)
targets

(12, 9)


Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],unidentified influenza virus,unidentified influenza virus,12.0,False,CHEMBL613128,[],ORGANISM,11309
1,[],Influenza B virus,Influenza B virus,12.0,False,CHEMBL613129,[],ORGANISM,11520
2,[],Influenza A virus,Influenza A virus,12.0,False,CHEMBL613740,[],ORGANISM,11320
3,[],Influenza C virus,Influenza C virus,12.0,False,CHEMBL612783,[],ORGANISM,11552
4,"[{'xref_id': 'P03438', 'xref_name': None, 'xre...",Influenza A virus (strain A/X-31 H3N2),Influenza A virus Hemagglutinin,10.0,False,CHEMBL4918,"[{'accession': 'P03438', 'component_descriptio...",SINGLE PROTEIN,132504
5,[],Influenza A virus (H5N1),Influenza A virus (H5N1),10.0,False,CHEMBL613845,[],ORGANISM,102793
6,[],Influenza A virus H3N2,Influenza A virus H3N2,10.0,False,CHEMBL2366902,[],ORGANISM,41857
7,[],Unidentified Influenza A virus (H1N2),Unidentified Influenza A virus (H1N2),9.0,False,CHEMBL2367089,[],ORGANISM,1323429
8,"[{'xref_id': 'P63231', 'xref_name': None, 'xre...",Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,8.0,False,CHEMBL2052,"[{'accession': 'P0DOF8', 'component_descriptio...",SINGLE PROTEIN,381517
9,[],Influenza B virus (B/Lee/40),Influenza B virus (B/Lee/40),8.0,False,CHEMBL612452,[],ORGANISM,107412


The retrieved data consists of all the entries matching *influenza*, together with their *target_chembl_id* and *target_type*, which are the main points of interest.  
The protein we are looking for is at index 8, so we retrieve its *target_chembl_id* for further querying.

#### 3.2 Select and retrieve bioactivity data for *Influenza virus A matrix protein M2*

In [5]:
# Selecting the entry at index 8 of the queried data
selected_target = targets.target_chembl_id[8]
selected_target

'CHEMBL2052'
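Selecting by row position depends on the order in which the search returns its results. As a minimal sketch of an alternative, the target can be matched on *pref_name* and *target_type* instead. The small `targets` frame below is a stand-in built from three rows of the search output above, so the example runs without a network connection; it is not the full query result.

```python
import pandas as pd

# Stand-in for the `targets` DataFrame: three rows copied from the search output above.
targets = pd.DataFrame(
    {
        "pref_name": [
            "Influenza A virus",
            "Influenza A virus Hemagglutinin",
            "Influenza virus A matrix protein M2",
        ],
        "target_type": ["ORGANISM", "SINGLE PROTEIN", "SINGLE PROTEIN"],
        "target_chembl_id": ["CHEMBL613740", "CHEMBL4918", "CHEMBL2052"],
    }
)

# Match on the preferred name and target type instead of a row position.
mask = (targets["pref_name"] == "Influenza virus A matrix protein M2") & (
    targets["target_type"] == "SINGLE PROTEIN"
)
matches = targets.loc[mask, "target_chembl_id"]

# Fail loudly if the search returned zero or several matching rows.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target = matches.iloc[0]
print(selected_target)
```

With the real query result, the same mask could be applied to `targets` directly; the explicit length check is a design choice that surfaces a changed search result instead of silently picking the wrong row.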

Here, we retrieve only the bioactivity data for the *Influenza virus A matrix protein M2*, filtered on the *EC50* measurement type. The *EC50* (half-maximal effective concentration) is the concentration of a compound needed to achieve 50% of its maximum response. The more potent a compound, the smaller its *EC50*.

In [6]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type='EC50')
df = pd.DataFrame.from_dict(res)
print(df.shape)
df.head(100)

(128, 45)


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18753172,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccccc1F,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '17.07', 'le': '0.36', 'lle': '2.55', ...",CHEMBL4287824,,CHEMBL4287824,5.54,False,http://www.openphacts.org/units/Nanomolar,3115886,=,1,True,=,,EC50,nM,,2900.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,2.9
1,,18753173,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1cccc(F)c1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '20.95', 'le': '0.44', 'lle': '3.81', ...",CHEMBL4277167,,CHEMBL4277167,6.8,False,http://www.openphacts.org/units/Nanomolar,3115887,=,1,True,=,,EC50,nM,,160.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.16
2,,18753174,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(F)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '21.13', 'le': '0.45', 'lle': '3.86', ...",CHEMBL4284419,,CHEMBL4284419,6.85,False,http://www.openphacts.org/units/Nanomolar,3115888,=,1,True,=,,EC50,nM,,140.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.14
3,,18753175,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Cl)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '18.11', 'le': '0.40', 'lle': '2.66', ...",CHEMBL4292316,,CHEMBL4292316,6.17,False,http://www.openphacts.org/units/Nanomolar,3115889,=,1,True,=,,EC50,nM,,670.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.67
4,,18753176,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Br)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '16.55', 'le': '0.41', 'lle': '2.76', ...",CHEMBL4294990,,CHEMBL4294990,6.38,False,http://www.openphacts.org/units/Nanomolar,3115890,=,1,True,=,,EC50,nM,,420.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.42
5,,18753177,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,Cc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '19.51', 'le': '0.41', 'lle': '3.09', ...",CHEMBL4286741,,CHEMBL4286741,6.25,False,http://www.openphacts.org/units/Nanomolar,3115891,=,1,True,=,,EC50,nM,,560.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.56
6,,18753178,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,COc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '15.19', 'le': '0.32', 'lle': '2.25', ...",CHEMBL4285916,,CHEMBL4285916,5.11,False,http://www.openphacts.org/units/Nanomolar,3115892,=,1,True,=,,EC50,nM,,7730.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,7.73
7,,18753179,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C#N)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '24.43', 'le': '0.50', 'lle': '5.38', ...",CHEMBL4277990,,CHEMBL4277990,8.1,False,http://www.openphacts.org/units/Nanomolar,3115893,=,1,True,=,,EC50,nM,,8.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.008
8,,18753180,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C(F)(F)F)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '13.73', 'le': '0.29', 'lle': '1.27', ...",CHEMBL4282531,,CHEMBL4282531,5.14,False,http://www.openphacts.org/units/Nanomolar,3115894,=,1,True,=,,EC50,nM,,7210.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,7.21
9,,18753181,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc([N+](=O)[O-])cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '14.16', 'le': '0.30', 'lle': '2.22', ...",CHEMBL4293132,,CHEMBL4293132,4.98,False,http://www.openphacts.org/units/Nanomolar,3115895,=,1,True,=,,EC50,nM,,10530.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,10.53


Finally, we will be saving the resulting bioactivity data to a *.csv* file.

In [7]:
df.to_csv('./data/influenza_virus_A_matrix_M2_protein_01_bioactivity_data_raw.csv', index=False)
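`to_csv` does not create missing directories, so the call above assumes that `./data` already exists. The sketch below shows one way to create the output directory first, using `pathlib`. The frame `df_demo`, the shortened file name, and the temporary directory are hypothetical stand-ins that keep the example self-contained; in the notebook the directory would be `Path('./data')`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the bioactivity DataFrame `df`.
df_demo = pd.DataFrame(
    {"molecule_chembl_id": ["CHEMBL4287824"], "standard_value": [2900.0]}
)

# A temporary directory keeps the sketch self-contained.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    out_dir.mkdir(parents=True, exist_ok=True)

    out_file = out_dir / "bioactivity_data_raw.csv"
    df_demo.to_csv(out_file, index=False)

    # Read the file back to confirm that it was written.
    file_was_written = out_file.exists()
    round_trip = pd.read_csv(out_file)

print(file_was_written, round_trip.shape)
```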

### 4. Handling missing data  
If a compound is missing a value in either the *standard_value* or the *canonical_smiles* column, we drop it. Both columns are required: *standard_value* holds the measured activity of the compound, and *canonical_smiles* holds its **SMILES** (Simplified Molecular-Input Line-Entry System) notation.

In [8]:
df_filtered = df[df.standard_value.notna()]
df_filtered = df_filtered[df_filtered.canonical_smiles.notna()]
print(df_filtered.shape)
df_filtered

(48, 45)


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18753172,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccccc1F,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '17.07', 'le': '0.36', 'lle': '2.55', ...",CHEMBL4287824,,CHEMBL4287824,5.54,False,http://www.openphacts.org/units/Nanomolar,3115886,=,1,True,=,,EC50,nM,,2900.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,2.9
1,,18753173,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1cccc(F)c1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '20.95', 'le': '0.44', 'lle': '3.81', ...",CHEMBL4277167,,CHEMBL4277167,6.8,False,http://www.openphacts.org/units/Nanomolar,3115887,=,1,True,=,,EC50,nM,,160.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.16
2,,18753174,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(F)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '21.13', 'le': '0.45', 'lle': '3.86', ...",CHEMBL4284419,,CHEMBL4284419,6.85,False,http://www.openphacts.org/units/Nanomolar,3115888,=,1,True,=,,EC50,nM,,140.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.14
3,,18753175,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Cl)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '18.11', 'le': '0.40', 'lle': '2.66', ...",CHEMBL4292316,,CHEMBL4292316,6.17,False,http://www.openphacts.org/units/Nanomolar,3115889,=,1,True,=,,EC50,nM,,670.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.67
4,,18753176,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Br)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '16.55', 'le': '0.41', 'lle': '2.76', ...",CHEMBL4294990,,CHEMBL4294990,6.38,False,http://www.openphacts.org/units/Nanomolar,3115890,=,1,True,=,,EC50,nM,,420.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.42
5,,18753177,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,Cc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '19.51', 'le': '0.41', 'lle': '3.09', ...",CHEMBL4286741,,CHEMBL4286741,6.25,False,http://www.openphacts.org/units/Nanomolar,3115891,=,1,True,=,,EC50,nM,,560.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.56
6,,18753178,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,COc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '15.19', 'le': '0.32', 'lle': '2.25', ...",CHEMBL4285916,,CHEMBL4285916,5.11,False,http://www.openphacts.org/units/Nanomolar,3115892,=,1,True,=,,EC50,nM,,7730.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,7.73
7,,18753179,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C#N)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '24.43', 'le': '0.50', 'lle': '5.38', ...",CHEMBL4277990,,CHEMBL4277990,8.1,False,http://www.openphacts.org/units/Nanomolar,3115893,=,1,True,=,,EC50,nM,,8.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.008
8,,18753180,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C(F)(F)F)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '13.73', 'le': '0.29', 'lle': '1.27', ...",CHEMBL4282531,,CHEMBL4282531,5.14,False,http://www.openphacts.org/units/Nanomolar,3115894,=,1,True,=,,EC50,nM,,7210.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,7.21
9,,18753181,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc([N+](=O)[O-])cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '14.16', 'le': '0.30', 'lle': '2.22', ...",CHEMBL4293132,,CHEMBL4293132,4.98,False,http://www.openphacts.org/units/Nanomolar,3115895,=,1,True,=,,EC50,nM,,10530.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,10.53
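The two `notna()` filters above can also be written as a single `dropna` call with a `subset` argument. The following is a minimal sketch on a hypothetical four-row frame (`df_demo`, with made-up identifiers and values), not on the ChEMBL data itself; it checks that both forms keep the same rows.

```python
import pandas as pd

# Hypothetical four-row stand-in for `df`; only the first row is complete.
df_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["A", "B", "C", "D"],
        "canonical_smiles": ["CCO", None, "CCN", None],
        "standard_value": [2900.0, 160.0, None, None],
    }
)

# Drop a row when either of the two required columns is missing.
df_demo_filtered = df_demo.dropna(subset=["standard_value", "canonical_smiles"])

# The chained notna() filters used in the cell above, for comparison.
chained = df_demo[df_demo.standard_value.notna()]
chained = chained[chained.canonical_smiles.notna()]

print(df_demo_filtered.shape)
print(df_demo_filtered.equals(chained))
```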


Next, we check for molecules with duplicated *canonical_smiles*. If any exist, we keep only the first occurrence and drop the rest.

In [9]:
len(df_filtered.canonical_smiles.unique())

47

In [10]:
df_filtered = df_filtered.drop_duplicates(['canonical_smiles'])
print(df_filtered.shape)
df_filtered

(47, 45)


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18753172,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccccc1F,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '17.07', 'le': '0.36', 'lle': '2.55', ...",CHEMBL4287824,,CHEMBL4287824,5.54,False,http://www.openphacts.org/units/Nanomolar,3115886,=,1,True,=,,EC50,nM,,2900.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,2.9
1,,18753173,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1cccc(F)c1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '20.95', 'le': '0.44', 'lle': '3.81', ...",CHEMBL4277167,,CHEMBL4277167,6.8,False,http://www.openphacts.org/units/Nanomolar,3115887,=,1,True,=,,EC50,nM,,160.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.16
2,,18753174,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(F)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '21.13', 'le': '0.45', 'lle': '3.86', ...",CHEMBL4284419,,CHEMBL4284419,6.85,False,http://www.openphacts.org/units/Nanomolar,3115888,=,1,True,=,,EC50,nM,,140.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.14
3,,18753175,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Cl)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '18.11', 'le': '0.40', 'lle': '2.66', ...",CHEMBL4292316,,CHEMBL4292316,6.17,False,http://www.openphacts.org/units/Nanomolar,3115889,=,1,True,=,,EC50,nM,,670.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.67
4,,18753176,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Br)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '16.55', 'le': '0.41', 'lle': '2.76', ...",CHEMBL4294990,,CHEMBL4294990,6.38,False,http://www.openphacts.org/units/Nanomolar,3115890,=,1,True,=,,EC50,nM,,420.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.42
5,,18753177,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,Cc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '19.51', 'le': '0.41', 'lle': '3.09', ...",CHEMBL4286741,,CHEMBL4286741,6.25,False,http://www.openphacts.org/units/Nanomolar,3115891,=,1,True,=,,EC50,nM,,560.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.56
6,,18753178,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,COc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '15.19', 'le': '0.32', 'lle': '2.25', ...",CHEMBL4285916,,CHEMBL4285916,5.11,False,http://www.openphacts.org/units/Nanomolar,3115892,=,1,True,=,,EC50,nM,,7730.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,7.73
7,,18753179,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C#N)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '24.43', 'le': '0.50', 'lle': '5.38', ...",CHEMBL4277990,,CHEMBL4277990,8.1,False,http://www.openphacts.org/units/Nanomolar,3115893,=,1,True,=,,EC50,nM,,8.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,0.008
8,,18753180,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C(F)(F)F)cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '13.73', 'le': '0.29', 'lle': '1.27', ...",CHEMBL4282531,,CHEMBL4282531,5.14,False,http://www.openphacts.org/units/Nanomolar,3115894,=,1,True,=,,EC50,nM,,7210.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,7.21
9,,18753181,[],CHEMBL4264733,Inhibition of wild type Influenza A virus (A/c...,B,,,BAO_0000188,BAO_0000019,assay format,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc([N+](=O)[O-])cc1,,,CHEMBL4261607,Eur J Med Chem,2018,"{'bei': '14.16', 'le': '0.30', 'lle': '2.22', ...",CHEMBL4293132,,CHEMBL4293132,4.98,False,http://www.openphacts.org/units/Nanomolar,3115895,=,1,True,=,,EC50,nM,,10530.0,CHEMBL2052,Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,381517,,,EC50,uM,UO_0000065,,10.53
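Before dropping duplicates, it can help to look at the rows that share a SMILES string, since `drop_duplicates` keeps the first occurrence and discards the other measurements. The sketch below uses a hypothetical three-row frame (`df_demo`, with made-up identifiers and values) to show which rows are flagged and which value survives.

```python
import pandas as pd

# Hypothetical stand-in: the first SMILES appears twice with different EC50 values.
df_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["X1", "X1", "X2"],
        "canonical_smiles": ["CCO", "CCO", "CCN"],
        "standard_value": [100.0, 300.0, 50.0],
    }
)

# keep=False marks every member of a duplicated group, not only the later copies.
duplicated_rows = df_demo[df_demo.duplicated("canonical_smiles", keep=False)]
print(duplicated_rows)

# drop_duplicates keeps the first occurrence by default (keep="first").
deduplicated = df_demo.drop_duplicates(["canonical_smiles"])
print(deduplicated["standard_value"].tolist())
```

In this toy example the 300.0 nM measurement is discarded. Whether keeping the first value is appropriate for the real data is a judgment call that this sketch does not settle.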


We are left with 47 unique compounds, each with an *EC50* value expressed in nM ($10^{-9}$ M).
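Because the *EC50* values span several orders of magnitude, they are often easier to compare on a negative logarithmic scale, $pEC50 = -\log_{10}(EC50\ \text{in M})$. The sketch below is an illustration of that unit conversion from nM; the helper name `pec50_from_nm` is hypothetical and is not part of the notebook or of the ChEMBL client.

```python
import math


def pec50_from_nm(ec50_nm: float) -> float:
    """Convert an EC50 in nanomolar to pEC50 = -log10(EC50 in molar)."""
    if ec50_nm <= 0:
        raise ValueError("EC50 must be positive")
    # 1 nM = 1e-9 M
    return -math.log10(ec50_nm * 1e-9)


# standard_value entries (nM) taken from the rows displayed above.
for ec50_nm in (2900.0, 160.0, 8.0):
    print(ec50_nm, round(pec50_from_nm(ec50_nm), 2))
```

For these three rows the rounded results (5.54, 6.8, 8.1) agree with the *pchembl_value* column shown in the output above: a smaller *EC50* gives a larger value on the logarithmic scale.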

### 5. Data pre-processing of the bioactivity data  

#### 5.1 Filter on the columns of interest (molecule_chembl_id, canonical_smiles, standard_value)

In [11]:
target_features = ['molecule_chembl_id','canonical_smiles','standard_value']
# .copy() makes df_features independent of df_filtered, so adding a column later is safe
df_features = df_filtered[target_features].copy()
df_features

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL4287824,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccccc1F,2900.0
1,CHEMBL4277167,N#CCCN(Cc1cccs1)S(=O)(=O)c1cccc(F)c1,160.0
2,CHEMBL4284419,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(F)cc1,140.0
3,CHEMBL4292316,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Cl)cc1,670.0
4,CHEMBL4294990,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Br)cc1,420.0
5,CHEMBL4286741,Cc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,560.0
6,CHEMBL4285916,COc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,7730.0
7,CHEMBL4277990,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C#N)cc1,8.0
8,CHEMBL4282531,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C(F)(F)F)cc1,7210.0
9,CHEMBL4293132,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc([N+](=O)[O-])cc1,10530.0


Finally, we will be saving the dataframe to a *.csv* file.

In [12]:
df_features.to_csv('./data/influenza_virus_A_matrix_M2_protein_02_bioactivity_data_preprocessed.csv', index=False)

### 6. Labeling compounds based on their activity

The bioactivity data is expressed as *EC50* values in nM. The labels are assigned as follows:
* *standard_value* <= 1000 nM - **active**
* 1000 nM < *standard_value* < 10,000 nM - **intermediate**
* *standard_value* >= 10,000 nM - **inactive**

In [13]:
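def label_activity(standard_value):
    # standard_value is an EC50 in nM; float() also handles values stored as strings
    value = float(standard_value)
    if value >= 10000:
        return "inactive"
    elif value <= 1000:
        return "active"
    else:
        return "intermediate"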

In [14]:
df_features['class'] = df_features['standard_value'].apply(label_activity)
df_features

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL4287824,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccccc1F,2900.0,intermediate
1,CHEMBL4277167,N#CCCN(Cc1cccs1)S(=O)(=O)c1cccc(F)c1,160.0,active
2,CHEMBL4284419,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(F)cc1,140.0,active
3,CHEMBL4292316,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Cl)cc1,670.0,active
4,CHEMBL4294990,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(Br)cc1,420.0,active
5,CHEMBL4286741,Cc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,560.0,active
6,CHEMBL4285916,COc1ccc(S(=O)(=O)N(CCC#N)Cc2cccs2)cc1,7730.0,intermediate
7,CHEMBL4277990,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C#N)cc1,8.0,active
8,CHEMBL4282531,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc(C(F)(F)F)cc1,7210.0,intermediate
9,CHEMBL4293132,N#CCCN(Cc1cccs1)S(=O)(=O)c1ccc([N+](=O)[O-])cc1,10530.0,inactive
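The same thresholds can be applied to the whole column at once. The sketch below is one possible vectorized variant using `numpy.select`, which is swapped in here for illustration and is not the method used in the notebook. The `values` series holds four *standard_value* entries from the output above plus the two boundary values, to make the edge cases visible.

```python
import numpy as np
import pandas as pd

# standard_value entries (nM) from the output above, plus the two boundary values.
values = pd.Series([2900.0, 160.0, 8.0, 10530.0, 1000.0, 10000.0])

# Conditions are checked in order; anything matching neither is "intermediate".
conditions = [values >= 10000, values <= 1000]
choices = ["inactive", "active"]
labels = pd.Series(
    np.select(conditions, choices, default="intermediate"),
    index=values.index,
)

print(labels.tolist())
# Class balance of the labeled values.
print(labels.value_counts().to_dict())
```

Exactly 1000 nM is labeled active and exactly 10,000 nM is labeled inactive, matching the `<=` and `>=` comparisons in `label_activity`.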


Finally, we will be saving the data as a *.csv* file.

In [15]:
df_features.to_csv('./data/influenza_virus_A_matrix_M2_protein_03_bioactivity_data_curated.csv', index=False)

***
### Overview  
In this notebook, we have extracted, loaded, and pre-processed the data. The next steps we are going to take are:  
* Calculating Molecular Descriptors  
* Exploratory Data Analysis  
* Statistical Analysis  
* Data Visualization