<a href="https://colab.research.google.com/github/leventdusunceli/QSAR_Model_P.aeruginosa/blob/main/QSAR_Modelling_P_aeruginosa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#QSAR Modelling for bioactive compounds against β-lactamase produced by *Pseudomonas aeruginosa* 

In this notebook we're going to build and compare a number of quantitative structure-activity relationship (QSAR) models for predicting chemically active compounds against beta-lactamase

In order to build our models we'll follow these steps: 


1.   Import previously identified active compounds' bioactivity data from ChEMBL database
2.   Calculate molecular descriptors using SMILES notation 
3. Calculate molecular fingerprints using SMILES notation and PaDEL software
4. Build regression models based on the features calculated and bioactivity data imported from ChEMBL database


*Note: SMILES (Simplified Molecular Input Entry System) notation is a way of representing the chemical structure of a molecule that allows to be interpreted by computers. [Read more...](https://archive.epa.gov/med/med_archive_03/web/html/smiles.html#:~:text=SMILES%20(Simplified%20Molecular%20Input%20Line,learn%20a%20handful%20of%20rules.)*

#Importing Bioactivity Data from ChEMBL

We'll install the ChEMBL web service package to search for chemical compounds active against beta-lactamase and import chemical structure and bioactivity data.

In [4]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
import os 
os.chdir("/content/drive/MyDrive/Online_Lecture_Notes/Bioinformatic_Project_from_Scratch/Portfolio Project")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
! pip install chembl_webresource_client

from chembl_webresource_client.new_client import new_client

###Target protein search for ß-lactamase

Following code will query a target search for beta-lactamase produced by all organisms.  

new_client.target.search("target protein") is used for target searching

In [None]:
target = new_client.target
target_query = target.search('beta lactamase')
targets = pd.DataFrame.from_dict(target_query)
targets.head(5)

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Pseudomonas aeruginosa,Beta Lactamase,20.0,False,CHEMBL1293246,"[{'accession': 'Q932Y6', 'component_descriptio...",SINGLE PROTEIN,287.0
1,"[{'xref_id': 'P52700', 'xref_name': None, 'xre...",Stenotrophomonas maltophilia,Beta-lactamase L1,18.0,False,CHEMBL3326,"[{'accession': 'P52700', 'component_descriptio...",SINGLE PROTEIN,40324.0
2,"[{'xref_id': 'P26918', 'xref_name': None, 'xre...",Aeromonas hydrophila,Beta-lactamase,18.0,False,CHEMBL1169593,"[{'accession': 'P26918', 'component_descriptio...",SINGLE PROTEIN,644.0
3,[],Chryseobacterium indologenes,Metallo-beta-lactamase IND-6,18.0,False,CHEMBL1667689,"[{'accession': 'Q08I33', 'component_descriptio...",SINGLE PROTEIN,253.0
4,[],Chryseobacterium indologenes,IND-like metallo-beta-lactamase,18.0,False,CHEMBL1667690,"[{'accession': 'Q6JE29', 'component_descriptio...",SINGLE PROTEIN,253.0


Let's just look at the targets produced by *Pseudomonas aeruginosa*

In [None]:
targets.loc[targets["organism"]=="Pseudomonas aeruginosa"]

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Pseudomonas aeruginosa,Beta Lactamase,20.0,False,CHEMBL1293246,"[{'accession': 'Q932Y6', 'component_descriptio...",SINGLE PROTEIN,287.0
14,"[{'xref_id': 'Q8KRJ3', 'xref_name': None, 'xre...",Pseudomonas aeruginosa,Beta-lactamase VIM-4,17.0,False,CHEMBL6146,"[{'accession': 'Q8KRJ3', 'component_descriptio...",SINGLE PROTEIN,287.0
15,"[{'xref_id': 'Q9XAY4', 'xref_name': None, 'xre...",Pseudomonas aeruginosa,Beta-lactamase VIM-1,17.0,False,CHEMBL1287601,"[{'accession': 'Q9XAY4', 'component_descriptio...",SINGLE PROTEIN,287.0
19,[],Pseudomonas aeruginosa,Metallo beta-lactamase,17.0,False,CHEMBL3832957,"[{'accession': 'Q6TUJ4', 'component_descriptio...",SINGLE PROTEIN,287.0
23,"[{'xref_id': 'P14489', 'xref_name': None, 'xre...",Pseudomonas aeruginosa,Beta-lactamase OXA-10,16.0,False,CHEMBL5482,"[{'accession': 'P14489', 'component_descriptio...",SINGLE PROTEIN,287.0
25,"[{'xref_id': 'Q9K2N0', 'xref_name': None, 'xre...",Pseudomonas aeruginosa,Beta-lactamase VIM-2,16.0,False,CHEMBL5798,"[{'accession': 'Q9K2N0', 'component_descriptio...",SINGLE PROTEIN,287.0
27,[],Pseudomonas aeruginosa,Metallo-b-lactamase,16.0,False,CHEMBL1287605,"[{'accession': 'Q840P9', 'component_descriptio...",SINGLE PROTEIN,287.0
30,[],Pseudomonas aeruginosa,Beta-lactamase PSE-4,16.0,False,CHEMBL1744489,"[{'accession': 'P16897', 'component_descriptio...",SINGLE PROTEIN,287.0
32,[],Pseudomonas aeruginosa,Beta-lactamase IMP-1,16.0,False,CHEMBL3562178,"[{'accession': 'Q8G9Q0', 'component_descriptio...",SINGLE PROTEIN,287.0
34,[],Pseudomonas aeruginosa,Beta-lactamase,16.0,False,CHEMBL3885668,"[{'accession': 'D2SSQ3', 'component_descriptio...",SINGLE PROTEIN,287.0


It seems that there are lots of redundant entries.  We'll use the very first entry for as our target protein since it has the highest score

###Retrieving bioactivity data for selected target

In [None]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL1293246'

In [None]:
#Filtering chemical compounds with IC50 as bioactivity measurement

res = new_client.activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
bioactivity_data = pd.DataFrame.from_dict(res)
bioactivity_data.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,inactive,7461627,[],CHEMBL1794438,PUBCHEM_BIOASSAY: Counterscreen for inhibitors...,B,,,BAO_0000190,BAO_0000219,...,Pseudomonas aeruginosa,Beta Lactamase,287,,,IC50,uM,UO_0000065,,12.055
1,active,7461628,[],CHEMBL1794438,PUBCHEM_BIOASSAY: Counterscreen for inhibitors...,B,,,BAO_0000190,BAO_0000219,...,Pseudomonas aeruginosa,Beta Lactamase,287,,,IC50,uM,UO_0000065,,4.712
2,active,7461629,[],CHEMBL1794438,PUBCHEM_BIOASSAY: Counterscreen for inhibitors...,B,,,BAO_0000190,BAO_0000219,...,Pseudomonas aeruginosa,Beta Lactamase,287,,,IC50,uM,UO_0000065,,7.828


#Data Pre-processing

###Handle missing data

In [None]:
#Removing compounds with missing IC50 value 
bioactivity_data.dropna(subset=["standard_value","canonical_smiles"],inplace=True)
df2 = bioactivity_data

We have lots of data in the dataframe but we only need the chembl_id, canonical_smiles and standard_value (IC50 value) 

In [None]:
df3 = df2[['molecule_chembl_id', 'canonical_smiles', 'standard_value']]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL1555532,N#Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1S(=O)(=...,12055.0
1,CHEMBL61559,COc1ccc(C(=O)/C=C/c2ccc(N(C)C)cc2)c(OC)c1,4712.0
2,CHEMBL1494120,COc1ccc(NC(=O)/C(Cl)=C(/Cl)[S+]([O-])Cc2ccc(Cl...,7828.0
3,CHEMBL1698008,Cc1ccc(S(=O)(=O)c2ccc([N+](=O)[O-])o2)cc1,11190.0
4,CHEMBL1964993,O=[N+]([O-])c1ccc(N/N=C/C=C/c2ccco2)cc1,4186.0
...,...,...,...
811,CHEMBL1345758,COc1ccccc1-c1nnc(S)n1Cc1ccco1,20448.0
812,CHEMBL1464372,COc1ccc(-c2nnc(S)n2-c2ccc3c(c2)OCCO3)cc1OC,18033.0
813,CHEMBL1565378,CCCC(=O)Nc1nnc(SCC(=O)NCc2cccs2)s1,16334.0
814,CHEMBL1551022,Cc1cc(C)n(-c2nc(SCC(=O)O)c3c4c(sc3n2)COC(C)(C)...,24828.0


###Calculate pIC50 values

We'll convert our IC50 values to pIC50 values because pIC50 creates a more normal distribution and is more commonly used in drug discovery studies due to the logarithmic nature of dose-dependent inhibition.  More info can be found [here.](https://www.collaborativedrug.com/why-using-pic50-instead-of-ic50-will-change-your-life/)

First we'll fix IC50 values higher than 100,000,000 to 100,000,000 in order to avoid pIC50 values becoming negative after conversion.  

Then we'll write a function for converting IC50 to pIC50 in any ChEMBL imported dataset

In [None]:
df3.standard_value.describe()
#have to convert IC50 to float format for further processing

count         807
unique        638
top       59640.0
freq           42
Name: standard_value, dtype: object

In [None]:
#converting to float
df3["standard_value"] =df3["standard_value"].astype(float);

#fix IC50 at 100000000
for i in df3.standard_value:
  if i > 100000000:
    df3["standard_value"] = df3["standard_value"].replace([i],100000000)

#checking
df3.standard_value.describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


count      807.000000
mean     22736.480025
std      20102.893398
min         33.940000
25%       5443.500000
50%      13642.000000
75%      49730.000000
max      62430.000000
Name: standard_value, dtype: float64

In [None]:
#pIC50 calculator function 

def pIC50_calculator(input):
  import numpy as np

  for i in input.standard_value:
    pIC50 = -np.log10(i*(10**-9))
    df3["standard_value"] = df3["standard_value"].replace([i],pIC50)

  input.rename({"standard_value":"pIC50"},axis=1,inplace=True)
  return(input.head(3))

#convert our dataset
pIC50_calculator(df3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,molecule_chembl_id,canonical_smiles,pIC50
0,CHEMBL1555532,N#Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1S(=O)(=...,4.918833
1,CHEMBL61559,COc1ccc(C(=O)/C=C/c2ccc(N(C)C)cc2)c(OC)c1,5.326795
2,CHEMBL1494120,COc1ccc(NC(=O)/C(Cl)=C(/Cl)[S+]([O-])Cc2ccc(Cl...,5.106349


In [None]:
df3.to_csv("pic50_nodescriptor3.csv",index=False)

#Calculating Molecular Descriptors 

Molecular descriptors are calculated properties of molecules from chemical notations.  They're used to provide information about chemical structures in model building.  In our case we'll be using these molecular descriptor values within the matrix of features during model building. Learn more about molecular descriptors from [here](https://en.wikipedia.org/wiki/Molecular_descriptor)

We'll be calculating 4 molecular descriptors using [RDKit library](https://www.rdkit.org/docs/index.html)

* Molecular weight
* Octanol-water partition coefficient (LogP)
* Hydrogen bond donors
* Hydrogen bond acceptors



###Install RDKit usind Conda

In [None]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
! conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

###Calculate Molecular Descriptors

The function below takes SMILES notation as input and follows these two steps for calculating descriptor values: 
1. Construct molecules from SMILES notations in our data with *Chem.MolFromSmiles* 
2. Calculate descriptors from each constructed molecule and add them into a dataframe


In [None]:
from rdkit import Chem 
from rdkit.Chem import Descriptors 
from rdkit.Chem import Lipinski
from rdkit.Chem import AllChem


df4=pd.read_csv("pic50_nodescriptor3.csv")



In [None]:
# Reference: https://codeocean.com/explore/capsules?query=tag:data-curation

def moldesc_calculator(smiles,verbose = False):
  import numpy as np

  #construct molecules from SMILES and add to a list 
  mols = []
  
  for x in smiles: 
    molecule = Chem.MolFromSmiles(x)
    mols.append(molecule)

  #calculate descriptors from molecules in the list 
  #will add each descriptor to an array and stack each array on top of each other 
  

  array = np.arange(1,1) #create empty array to add descriptors 
  y = 0 
  for i in mols: 
     MolWt = Descriptors.MolWt(i)
     MolLogP = Descriptors.MolLogP(i)
     NumHDonors = Lipinski.NumHDonors(i)
     NumHAcceptors = Lipinski.NumHAcceptors(i)

     row = np.array([MolWt,MolLogP,NumHDonors,NumHAcceptors])
     
     #stacking rows 
     if (y == 0):
       array = row 

     else:
       array = np.vstack((array,row))
     
     y = y+1   
  #converting array to dataframe 
  columnNames = ["Molecular Weight", "Partition Coef.(LogP)","Num_HDonors",
                 "Num_HAcceptors"]
  
  descriptors = pd.DataFrame(data = array,columns=columnNames)

  return descriptors

In [None]:
moldesc_df = moldesc_calculator(df4.canonical_smiles)
moldesc_df.head(8)

Unnamed: 0,Molecular Weight,Partition Coef.(LogP),Num_HDonors,Num_HAcceptors
0,347.308,2.34858,0.0,7.0
1,311.381,3.6659,0.0,4.0
2,419.717,4.2777,1.0,4.0
3,267.262,2.32902,0.0,5.0
4,257.249,3.2989,1.0,5.0
5,545.265,3.8394,6.0,2.0
6,418.25,4.95266,0.0,5.0
7,287.098,2.772,0.0,4.0


Add molecular descriptors to the main data frame 

In [None]:
df5 = pd.concat ([df4,moldesc_df],axis=1)
df5.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,pIC50,Molecular Weight,Partition Coef.(LogP),Num_HDonors,Num_HAcceptors
0,CHEMBL1555532,N#Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1S(=O)(=...,4.918833,347.308,2.34858,0.0,7.0
1,CHEMBL61559,COc1ccc(C(=O)/C=C/c2ccc(N(C)C)cc2)c(OC)c1,5.326795,311.381,3.6659,0.0,4.0
2,CHEMBL1494120,COc1ccc(NC(=O)/C(Cl)=C(/Cl)[S+]([O-])Cc2ccc(Cl...,5.106349,419.717,4.2777,1.0,4.0
3,CHEMBL1698008,Cc1ccc(S(=O)(=O)c2ccc([N+](=O)[O-])o2)cc1,4.95117,267.262,2.32902,0.0,5.0
4,CHEMBL1964993,O=[N+]([O-])c1ccc(N/N=C/C=C/c2ccco2)cc1,5.378201,257.249,3.2989,1.0,5.0


In [None]:
df5.to_csv("bioactivity_w_descriptors.csv",index= False)

#Calculating PubChem Fingerprints 

PubChem Fingerprint system checks SMILES notations for the presence of certain molecular characteristics such as the presence of an atom pairing, a type of a ring system, a single atom, etc. Then generates a binary code for each of the characteristic for presence or absence. Learn more from [here](https://pubchemdocs.ncbi.nlm.nih.gov/data-specification).  We generated the PubChem Fingerprints for the molecules in our database so that our model can work with more information and generate more accurate results.



We'll be using [PaDEL-Descriptor](https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21707) software for calculating PubChem fingerprints. 

Software requires SMILES to be inputted as .smi file in the following format.

In [None]:
import numpy as np

#Install padelpy, Python wrapper of PaDEL
!pip install padelpy

#Format SMILES notation 
df5 = pd.read_csv("bioactivity_w_descriptors.csv")
df7 = df5[["canonical_smiles","molecule_chembl_id"]]
df7.to_csv("molecule.smi",sep="\t",index=False,header=False)

Collecting padelpy
  Downloading padelpy-0.1.11-py2.py3-none-any.whl (20.9 MB)
[K     |████████████████████████████████| 20.9 MB 1.3 MB/s 
[?25hInstalling collected packages: padelpy
Successfully installed padelpy-0.1.11


We'll use *padeldescriptor* package from *padelpy*. This package requires type of descriptor/fingerprint to be specified in *descriptortypes=* variable.  

https://github.com/dataprofessor has created an .xml containing all types of descriptors/fingerprints, which we'll select PubChem Fingerprints from. 

In [None]:
! wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
! unzip fingerprints_xml.zip

PubchemFingerprinter.xml is the file containing information required to calculate PubChem Fingerprints

In [None]:
from padelpy import padeldescriptor

padeldescriptor(mol_dir="molecule.smi",d_file="PubChem_fingerprints.csv",
                descriptortypes="PubchemFingerprinter.xml", detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

fingerprints_df = pd.read_csv("PubChem_fingerprints.csv")
fingerprints_df.head()

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1555532,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL61559,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL1698008,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL1494120,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL1964993,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Lastly, combining PubChem Fingerprints data with molecular descriptors 

In [None]:
fingerprints_df = fingerprints_df.loc[:,["Pub"in i for i in fingerprints_df.columns]]
df8 = pd.concat ([df5,fingerprints_df],axis=1)
df8.to_csv("bioac_descriptors_fingerprints.csv",index=False)
df8.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,pIC50,Molecular Weight,Partition Coef.(LogP),Num_HDonors,Num_HAcceptors,PubchemFP0,PubchemFP1,PubchemFP2,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1555532,N#Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1S(=O)(=...,4.918833,347.308,2.34858,0.0,7.0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL61559,COc1ccc(C(=O)/C=C/c2ccc(N(C)C)cc2)c(OC)c1,5.326795,311.381,3.6659,0.0,4.0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL1494120,COc1ccc(NC(=O)/C(Cl)=C(/Cl)[S+]([O-])Cc2ccc(Cl...,5.106349,419.717,4.2777,1.0,4.0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL1698008,Cc1ccc(S(=O)(=O)c2ccc([N+](=O)[O-])o2)cc1,4.95117,267.262,2.32902,0.0,5.0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL1964993,O=[N+]([O-])c1ccc(N/N=C/C=C/c2ccco2)cc1,5.378201,257.249,3.2989,1.0,5.0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df8.isnull().values.any()

False

#Model Building 

We'll build 4 regression models and compare R-squared scores to evaluate fitness of models.

In [None]:
df9= pd.read_csv("bioac_descriptors_fingerprints.csv")
df9

In [56]:
x = df9.iloc[:,3:].values
y = df9.iloc[:,2].values

#splitting training and test sets 
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)


Regression Models:
*   Random Forest Regressor
*   SVR
* Decision Tree Regressor 
* Linear Regressor



In [57]:
from sklearn.ensemble import RandomForestRegressor
regressor_rfr = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor_rfr.fit(x_train, y_train)

y_pred_rfr = regressor_rfr.predict(x_test)
from sklearn.metrics import r2_score
rfr_r2 = r2_score(y_test, y_pred_rfr)


from sklearn.svm import SVR
regressor_svr = SVR(kernel = 'rbf')
regressor_svr.fit(x_train, y_train)

y_pred_svr = regressor_svr.predict(x_test)
svr_r2 = r2_score (y_test,y_pred_svr)

from sklearn.tree import DecisionTreeRegressor
regressor_dt = DecisionTreeRegressor(random_state = 0)
regressor_dt.fit(x_train, y_train)

y_pred_dt = regressor_dt.predict(x_test)
dt_r2 = r2_score(y_test,y_pred_dt)

from sklearn.linear_model import LinearRegression
regressor_lr = LinearRegression()
regressor_lr.fit(x_train, y_train)

y_pred_lr = regressor_lr.predict(x_test)
lr_r2 = r2_score(y_test,y_pred_lr) 

print( "Random Forest Regressor r2 Score:", rfr_r2, "\n",
      "SVR r2 Score:", svr_r2, "\n",
      "Decision Tree Regressor r2 Score:", dt_r2,"\n",
      "Linear Regression r2 Score:", lr_r2)


Random Forest Regressor r2 Score: 0.014524510530207424 
 SVR r2 Score: -0.2905682123536324 
 Decision Tree Regressor r2 Score: -0.39422263342170116 
 Linear Regression r2 Score: -61.00704093152416


#Discussion & Conclusions 

In this project we aimed to build QSAR models for predicting the pIC50 values of chemical compounds against β-lactamase molecule produced in Pseudomonas aeruginosa bacteria.  
Firstly, we imported bioactivity data of compounds that showed activity towards β-lactamase from the ChEMBL database.  Then we calculated 4 different molecular descriptor value and PubChem fingerprints from the SMILES notation of compounds.  Lastly we built 4 regression models to predict pIC50 values by using the calculated molecular descriptors and PubChem fingerprints. 


The highest r2 score was 0.014 which indicates that none of our models are close to being good.  One can conclude that regression modeling isn't a suitable method for predicting the pIC50 value of compounds towards β-lactamase molecule produced in *Pseudomonas aeruginosa* 