<a href="https://colab.research.google.com/github/kthuang20/BetaLactamaseCNN/blob/main/Beta_Lactamase_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Click the link above to view this code notebook in Google Colab.

In [None]:
# download necessary packages
!pip install rdkit



In [None]:
# import data manipulation tools
import zipfile
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw

# import visualization tool
import plotly.express as px

# import modeling tools
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten

# import metrics to evaluate model
from tensorflow.keras.metrics import Precision, Recall, BinaryAccuracy

# ML and AI Final Project

Antibiotics are compounds that kill bacteria directly or inhibit their growth. For instance, penicillin inhibits an enzyme involved in cell wall synthesis. This weakens the bacterial cell wall, making the bacteria more susceptible to changes in osmotic pressure and ultimately causing cell lysis **[CITE]**. Although antibiotics are effective against bacterial infections, some bacteria produce β-lactamase, an enzyme that breaks down and thereby inactivates these antibiotics **[CITE]**. This allows the bacteria to keep proliferating in the presence of antibiotics, leading to antibiotic resistance. Inhibiting β-lactamase may therefore be a viable way to counter antibiotic resistance.

The goal is to develop an approach that accelerates the discovery of β-lactamase inhibitors that effectively combat antibiotic resistance. **[Talk about QSAR and CNN]**
Here, a convolutional neural network (CNN) is trained on the chemical structures of compounds known to bind to β-lactamase in order to predict whether a new compound would be a strong candidate for inhibiting β-lactamase.

## 1. Generate Training Dataset

A total of 136 CSV files, one for each of 136 variants of the β-lactamase protein, were retrieved from the ChEMBL database (version 29).

In [None]:
# download the file
! gdown --id 1HvDDqoBJdNnFg3i14raMes1oedgC_BFs

Downloading...
From: https://drive.google.com/uc?id=1HvDDqoBJdNnFg3i14raMes1oedgC_BFs
To: /content/beta_lactamase_CHEMBL29.zip
100% 1.42M/1.42M [00:00<00:00, 15.1MB/s]


In [None]:
# name of the zip file containing all 136 csv files
file_path = "beta_lactamase_CHEMBL29.zip"
# open the archive holding the data for all 136 variants of β-lactamase
zf = zipfile.ZipFile(file_path, "r")
# combine all the compounds that are known to interact with each variant into one dataframe
beta_lactamase_data = pd.concat((pd.read_csv(zf.open(f)) for f in zf.namelist()))
beta_lactamase_data.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_relation,standard_value,standard_units,standard_type,pchembl_value,target_pref_name,bao_label
0,CHEMBL1730,CO/N=C(\C(=O)N[C@@H]1C(=O)N2C(C(=O)O)=C(COC(C)...,=,10.0,/mM/s,Kcat/Km,,Gil1,assay format
1,CHEMBL996,CO[C@@]1(NC(=O)Cc2cccs2)C(=O)N2C(C(=O)O)=C(COC...,,,,Kcat/Km,,Gil1,assay format
2,CHEMBL617,CC(=O)OCC1=C(C(=O)O)N2C(=O)[C@@H](NC(=O)Cc3ccc...,=,598.0,/mM/s,Kcat/Km,,Gil1,assay format
3,CHEMBL702,CCN1CCN(C(=O)N[C@@H](C(=O)N[C@@H]2C(=O)N3[C@@H...,=,3400.0,/mM/s,Kcat/Km,,Gil1,assay format
4,CHEMBL1449,CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](C(=O)O)c3ccsc3...,=,10000.0,/mM/s,Kcat/Km,,Gil1,assay format
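
To make the loading step concrete, here is a minimal sketch of the same idea using only the standard library: every CSV inside a zip archive is read and its rows are appended to one collection, which is the role `pd.concat` plays above. The archive is built in memory, and the file names and compound ids are hypothetical stand-ins, not entries from the ChEMBL download.

```python
import csv
import io
import zipfile

# Build a small in-memory zip holding two CSV files (hypothetical stand-ins
# for the per-variant exports).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("variant_a.csv",
                "molecule_chembl_id,pchembl_value\nCHEMBL_X1,5.2\n")
    zf.writestr("variant_b.csv",
                "molecule_chembl_id,pchembl_value\nCHEMBL_X2,6.7\nCHEMBL_X1,5.4\n")

# Read every member of the archive and collect its rows in one list.
rows = []
with zipfile.ZipFile(buffer, "r") as zf:
    for name in zf.namelist():
        with zf.open(name) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            rows.extend(csv.DictReader(text))

# The same compound can appear in more than one file, which is why
# duplicates have to be handled after concatenation.
ids = [row["molecule_chembl_id"] for row in rows]
print(len(rows), ids.count("CHEMBL_X1"))
```

Because the per-file row indices are simply stacked, a compound measured against several variants shows up several times in the combined table.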


In [None]:
# keep only measurements whose bioactivity is reported as an exact value
train_data = beta_lactamase_data[beta_lactamase_data['standard_relation'] == '=']
# remove samples without a pChEMBL value
train_data = train_data[train_data['pchembl_value'].notna()]

# create a boolean series stating whether the standard deviation of pChEMBL values for each compound is less than 2
# (the standard deviation of a single measurement is NaN, and NaN < 2 evaluates to False)
low_pchembl_std = train_data.groupby('molecule_chembl_id')['pchembl_value'].std() < 2
# store a list of the compounds that had small standard deviations
cps = low_pchembl_std[low_pchembl_std].index.tolist()
# keep only those compounds and drop the columns that are no longer needed
cols = ['standard_relation', 'standard_type', 'target_pref_name', 'bao_label']
train_data = train_data.loc[train_data['molecule_chembl_id'].isin(cps)].drop(columns=cols)

# define aggregation functions to collapse duplicates: average the numeric columns, keep the first entry otherwise
remove_dup = {'molecule_chembl_id': 'first',
                'canonical_smiles': 'first',
                'standard_value': 'mean',
                'standard_units': 'first',
                'pchembl_value': 'mean'}

# remove duplicates
train_data = train_data.groupby('molecule_chembl_id').agg(remove_dup).reset_index(drop=True)
train_data

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,standard_units,pchembl_value
0,CHEMBL104,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,27500.000000,nM,4.580000
1,CHEMBL1089781,O=S(=O)(NCB(O)O)c1cc2c(Cl)ccc(Cl)c2s1,1997.500000,nM,5.905000
2,CHEMBL1091,CC(=O)OCC(=O)[C@@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=...,84217.950000,nM,4.100000
3,CHEMBL109227,OB(O)c1ccc(-c2ccc(B(O)O)cc2)cc1,200.000000,nM,6.700000
4,CHEMBL1126,CC1(C)S[C@@H]2[C@H](NC(=O)Cc3ccccc3)C(=O)N2[C@...,5400.000000,nM,5.290000
...,...,...,...,...,...
791,CHEMBL87686,O=C(O)[C@H](S)Cc1ccc2oc3ccccc3c2c1,4961.505000,nM,6.320000
792,CHEMBL87719,CC1(C)[C@H](C(=O)O)N2C(=O)[C@]3(C[C@@H]3OC3CCC...,270.000000,nM,6.905000
793,CHEMBL891,Cc1onc(-c2ccccc2Cl)c1C(=O)N[C@@H]1C(=O)N2[C@@H...,4343.333333,nM,6.956667
794,CHEMBL9306,O=C([O-])[C@H]1/C(=C/CO)O[C@@H]2CC(=O)N21.[Li+],234.000000,nM,6.785000
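
For reference, ChEMBL defines pChEMBL as the negative base-10 logarithm of the molar activity value. The short sketch below applies that definition to a value in nM and shows that averaging on the log scale differs from taking the log of the average, so the two averaged columns above need not correspond row by row. The helper name `pchembl_from_nM` and the two-value example are hypothetical.

```python
import math

def pchembl_from_nM(value_nM):
    """Return -log10 of an activity value given in nM, after converting to molar."""
    return -math.log10(value_nM * 1e-9)

# 200 nM corresponds to a pChEMBL of about 6.70 (compare CHEMBL109227 above)
single = pchembl_from_nM(200.0)
print(round(single, 2))

# two hypothetical measurements of one compound
values_nM = [100.0, 10000.0]
mean_of_pchembl = sum(pchembl_from_nM(v) for v in values_nM) / len(values_nM)
pchembl_of_mean = pchembl_from_nM(sum(values_nM) / len(values_nM))
print(round(mean_of_pchembl, 2), round(pchembl_of_mean, 2))
```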


In [None]:
# save a csv file for future use
train_data.to_csv('processed_data.csv')

In [None]:
# show the summary statistics of the pchembl values
sum_stats = train_data['pchembl_value'].describe()
print('Summary Statistics and Quartiles of the pChEMBL Values:')
sum_stats

Summary Statistics and Quartiles of the pChEMBL Values:


count    796.000000
mean       5.757514
std        1.081195
min        2.946667
25%        4.949167
50%        5.480000
75%        6.530250
max        8.800000
Name: pchembl_value, dtype: float64

In [None]:
# create a histogram to show the distribution of pChEMBL values
fig = px.histogram(train_data, x='pchembl_value')

# add title, axis labels
fig.update_layout(title = 'Figure 1. Distribution of pChEMBL Values of Compounds',
                  title_x = 0.5,
                  xaxis_title = 'pChEMBL Value',
                  yaxis_title = 'Number of Compounds',
                  bargap = 0.2)

# show the histogram
fig.show()

Based on the summary statistics, I will use the median pChEMBL value to create 2 classes:
* at or below the median (0-50th percentile): *inactive*
* above the median (50-100th percentile): *active*

An alternative scheme based on the quartiles would create 3 classes:
* 0-25th percentile: *inactive*
* 25-75th percentile: *intermediate*
* 75-100th percentile: *active*

The rest of this notebook uses the 2-class scheme.

In [None]:
### function to classify bioactivity of compound
def classify_bioactivity(bioactivity, threshold):
    ## if the compound has a bioactivity above this threshold,
    if bioactivity > threshold:
        # label it as an active compound
        return 1
    ## otherwise
    else:
        # it is an inactive compound
        return 0
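
As a quick check of the boundary behaviour, the sketch below applies the same rule to a handful of made-up pChEMBL values (not taken from the dataset). Because the comparison is strict, values exactly equal to the median are labelled inactive, so the two classes are not guaranteed to be exactly the same size.

```python
import statistics

def classify_bioactivity(bioactivity, threshold):
    # same rule as the cell above: strictly greater than the threshold is active
    return 1 if bioactivity > threshold else 0

# hypothetical pChEMBL values with a tie at the median
toy_values = [4.1, 4.9, 5.5, 5.5, 6.3, 7.0]
toy_threshold = statistics.median(toy_values)
toy_labels = [classify_bioactivity(v, toy_threshold) for v in toy_values]
print(toy_threshold, toy_labels)
```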

In [None]:
# define the threshold for classifying a compound as active/inactive as the median
threshold = sum_stats.loc['50%']
# add a column with the class label (1 = active, 0 = inactive)
train_data['active'] = train_data['pchembl_value'].apply(classify_bioactivity, threshold=threshold)
train_data

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,standard_units,pchembl_value,active
0,CHEMBL104,Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1,27500.000000,nM,4.580000,0
1,CHEMBL1089781,O=S(=O)(NCB(O)O)c1cc2c(Cl)ccc(Cl)c2s1,1997.500000,nM,5.905000,1
2,CHEMBL1091,CC(=O)OCC(=O)[C@@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=...,84217.950000,nM,4.100000,0
3,CHEMBL109227,OB(O)c1ccc(-c2ccc(B(O)O)cc2)cc1,200.000000,nM,6.700000,1
4,CHEMBL1126,CC1(C)S[C@@H]2[C@H](NC(=O)Cc3ccccc3)C(=O)N2[C@...,5400.000000,nM,5.290000,0
...,...,...,...,...,...,...
791,CHEMBL87686,O=C(O)[C@H](S)Cc1ccc2oc3ccccc3c2c1,4961.505000,nM,6.320000,1
792,CHEMBL87719,CC1(C)[C@H](C(=O)O)N2C(=O)[C@]3(C[C@@H]3OC3CCC...,270.000000,nM,6.905000,1
793,CHEMBL891,Cc1onc(-c2ccccc2Cl)c1C(=O)N[C@@H]1C(=O)N2[C@@H...,4343.333333,nM,6.956667,1
794,CHEMBL9306,O=C([O-])[C@H]1/C(=C/CO)O[C@@H]2CC(=O)N21.[Li+],234.000000,nM,6.785000,1


## 2. Preprocess the Data

In [None]:
### function to generate a 2D image of the compound
def gen_image(smiles):
    ## build the molecule from this SMILES string
    mol = Chem.MolFromSmiles(smiles)
    ## convert this molecule into an image with a standardized size
    img = Draw.MolToImage(mol, size=(256,256))
    ## convert the image into a numpy array of pixels
    img_px = np.array(img)
    return img_px

In [None]:
# return a list of the images of the compounds
mols = train_data['canonical_smiles'].apply(gen_image)
# combine all the numpy array representations of the chemical compounds as a single tensor
stacked_imgs = tf.stack(mols.tolist())
# create a tensorflow dataset from the stacked tensor
dataset = tf.data.Dataset.from_tensor_slices((stacked_imgs, train_data['active']))

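# scale pixel values from 0-255 to 0-1
dataset = dataset.map(lambda x, y: (x/255, y))
# shuffle the dataset once; reshuffle_each_iteration=False keeps the order fixed
# so that the take/skip splits below contain the same examples on every pass
dataset = dataset.shuffle(buffer_size=len(mols), reshuffle_each_iteration=False)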
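
One point worth keeping in mind when a shuffled `tf.data` pipeline is later split with `take` and `skip`: by default `Dataset.shuffle` uses `reshuffle_each_iteration=True`, so each new pass over the dataset draws a new order unless that argument is set to `False`. The pure-Python sketch below imitates this with `random.shuffle`. It is an illustration under that assumption rather than TensorFlow code, and the helper `take_skip_split` and the example ids are hypothetical.

```python
import random

def take_skip_split(items, train_size, rng):
    """Mimic one pass over a reshuffling dataset followed by take/skip."""
    order = items[:]       # copy so the original list is untouched
    rng.shuffle(order)     # a fresh shuffle, as happens on each new pass
    return set(order[:train_size]), set(order[train_size:])

rng = random.Random(0)
items = list(range(100))   # hypothetical example ids

# pass 1: the split seen while training
train_ids, _ = take_skip_split(items, 70, rng)
# pass 2: the split seen later, when the held-out part is iterated
_, test_ids = take_skip_split(items, 70, rng)
overlap = train_ids & test_ids
print(len(overlap) > 0)

# shuffling once and reusing that order keeps the two parts disjoint
order = items[:]
rng.shuffle(order)
fixed_train, fixed_test = set(order[:70]), set(order[70:])
print(len(fixed_train & fixed_test))
```

When the order changes between passes, examples used for training can reappear in the held-out part, which makes held-out metrics look better than they should.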

In [None]:
### define a function to split the dataset into training, validation and testing sets
def gen_datasets(dataset, batch_size, train_split, val_split, test_split):
    # create batches based on batch size
    batched_dataset = dataset.batch(batch_size=batch_size)
    # store the total number of batches
    nbatches = len(batched_dataset)

    # define the number of batches in each dataset from the requested fractions
    train_size = int(nbatches * train_split)
    val_size = int(nbatches * val_split)
    test_size = int(nbatches * test_split) + 1

    ## generate the datasets
    train = batched_dataset.take(train_size)
    val = batched_dataset.skip(train_size).take(val_size)
    test = batched_dataset.skip(train_size + val_size).take(test_size)

    return train, val, test

In [None]:
# split the data into three datasets: training, validation, and testing datasets
train, val, test = gen_datasets(dataset, 64, 0.7, 0.2, 0.1)
print('Number of batches in training dataset: ', str(len(train)))
print('Number of batches in validation dataset: ', str(len(val)))
print('Number of batches in testing dataset: ', str(len(test)))

Number of batches in training dataset:  9
Number of batches in validation dataset:  2
Number of batches in testing dataset:  2
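
The batch counts above can be reproduced with plain arithmetic. This sketch assumes 796 compounds (the count reported in the summary statistics) and the batch size of 64 used in the call.

```python
import math

n_examples = 796   # count from the summary statistics
batch_size = 64

n_batches = math.ceil(n_examples / batch_size)
train_size = int(n_batches * 0.7)
val_size = int(n_batches * 0.2)
test_size = int(n_batches * 0.1) + 1
print(n_batches, train_size, val_size, test_size)

# the final batch is only partly filled
last_batch = n_examples - (n_batches - 1) * batch_size
print(last_batch)
```

Since the split is made on whole batches, the last batch, which lands in the testing dataset, holds fewer than 64 examples.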


## 3. Generate the CNN

In [None]:
### function to create the model
def gen_model():
  ## initialize a sequential model
  model = Sequential()
  ## add convolutional layers
  model.add(Conv2D(16, (3,3), 1, activation='relu', input_shape=(256, 256, 3)))
  model.add(MaxPooling2D())

  model.add(Conv2D(32, (3,3), 1, activation='relu'))
  model.add(MaxPooling2D())

  model.add(Conv2D(16, (3,3), 1, activation='relu'))
  model.add(MaxPooling2D())

  ## add flatten layer
  model.add(Flatten())

  ## add dense layers
  model.add(Dense(256, activation='relu'))
  model.add(Dense(1, activation='sigmoid'))

  ## compile model
  model.compile('adam', loss=tf.losses.BinaryCrossentropy(), metrics=['accuracy'])

  ## show model summary (with architecture of model)
  model.summary()

  return model

In [None]:
# create the architecture of the CNN
model = gen_model()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_6 (Conv2D)           (None, 254, 254, 16)      448       
                                                                 
 max_pooling2d_6 (MaxPoolin  (None, 127, 127, 16)      0         
 g2D)                                                            
                                                                 
 conv2d_7 (Conv2D)           (None, 125, 125, 32)      4640      
                                                                 
 max_pooling2d_7 (MaxPoolin  (None, 62, 62, 32)        0         
 g2D)                                                            
                                                                 
 conv2d_8 (Conv2D)           (None, 60, 60, 16)        4624      
                                                                 
 max_pooling2d_8 (MaxPoolin  (None, 30, 30, 16)       
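
The output shapes and parameter counts in the summary follow from simple arithmetic. The sketch below assumes unpadded 3x3 convolutions with stride 1 and 2x2 max pooling, which matches the layer arguments and Keras defaults used in `gen_model`. The helper names are hypothetical.

```python
def conv_output(size, kernel=3, stride=1):
    """Spatial size after an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def conv_params(in_channels, filters, kernel=3):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size, channels = 256, 3
layers = []
for filters in (16, 32, 16):
    size = conv_output(size)                 # 3x3 convolution, stride 1
    params = conv_params(channels, filters)
    layers.append(("conv", size, filters, params))
    size = size // 2                         # 2x2 max pooling
    channels = filters
    layers.append(("pool", size, channels, 0))

for layer in layers:
    print(layer)

# number of values handed to the first dense layer after flattening
flat = size * size * channels
print(flat)
```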

## 4. Train the CNN

In [None]:
# set up a log directory on local drive to store how model performed at each epoch
logdir = 'logs'
tensorboard_callbacks = tf.keras.callbacks.TensorBoard(log_dir=logdir)

# train the model
hist = model.fit(train, epochs=20, validation_data=val, callbacks=[tensorboard_callbacks])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
# show a dataframe of the results
hist_df = pd.DataFrame(hist.history)
# rename columns
hist_df.columns = ['Training Loss', 'Training Accuracy', 'Validation Loss', 'Validation Accuracy']
hist_df

Unnamed: 0,Training Loss,Training Accuracy,Validation Loss,Validation Accuracy
0,1.755424,0.508681,0.69856,0.523438
1,0.692754,0.541667,0.691342,0.5
2,0.700421,0.482639,0.685554,0.484375
3,0.683521,0.550347,0.678464,0.539062
4,0.662611,0.569444,0.628585,0.625
5,0.587651,0.701389,0.543502,0.710938
6,0.550284,0.701389,0.525964,0.742188
7,0.513817,0.755208,0.439075,0.796875
8,0.475915,0.777778,0.379697,0.859375
9,0.43424,0.809028,0.390554,0.835938


In [None]:
### function to compare metric between training and validation dataset
def compare_metric(metric_results, metric, fig_num):
  ## create a line plot comparing the training and validation values of the metric at each epoch
  fig = px.line(metric_results,
                x = metric_results.index,
                y = ['Training '+ metric, 'Validation ' + metric],
                markers = True)

  ## add title, axis labels
  fig.update_layout(title = 'Figure ' + str(fig_num) + '. Training and Validation ' + metric,
                    title_x = 0.5,
                    xaxis_title = 'Epoch',
                    yaxis_title = metric,
                    legend_title_text = 'Dataset')

  ## show figure
  fig.show()

In [None]:
# compare loss between training and validation datasets
compare_metric(hist_df, 'Loss', 3)

In [None]:
# compare accuracies between training and validation dataset
compare_metric(hist_df, 'Accuracy', 4)

## 5. Evaluate Performance of CNN

In [None]:
# initialize the metrics
precision = Precision()
recall = Recall()
acc = BinaryAccuracy()

In [None]:
### iterate through each batch of testing dataset
for batch in test.as_numpy_iterator():
  ## unpack the images and true labels in this batch
  X, yactual = batch
  ## store the model's predictions for this batch
  ypred = model.predict(X)
  ## update the metrics with the true labels and predictions for this batch
  precision.update_state(yactual, ypred)
  recall.update_state(yactual, ypred)
  acc.update_state(yactual, ypred)

### show results
print(f'Precision: {precision.result()}')
print(f'Recall: {recall.result()}')
print(f'Accuracy: {acc.result()}')

Precision: 0.9775910377502441
Recall: 0.9748603105545044
Accuracy: 0.976902186870575
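
For readers unfamiliar with these metrics, the sketch below shows how precision, recall and accuracy are obtained from predicted probabilities, assuming a decision threshold of 0.5 (the Keras default for these metric classes). The function `binary_metrics` and the toy labels and probabilities are hypothetical and unrelated to the results above.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall and accuracy for probabilities cut at a threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# toy example: 4 active and 4 inactive compounds
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4]
toy_precision, toy_recall, toy_accuracy = binary_metrics(y_true, y_prob)
print(toy_precision, toy_recall, toy_accuracy)
```

Precision is the fraction of predicted actives that are truly active, while recall is the fraction of true actives that the model recovers.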


## 6. Save the Model

In [None]:
# import necessary package
import os

In [None]:
# save the model for future use
model.save(os.path.join('models', 'BetaLactmaseCNN.h5'))


You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.




