## Importing Libraries

In [None]:
# Import basic data science packages
import numpy as np
import pandas as pd  # needed later for pd.DataFrame(history.history)

# Import TensorFlow packages
import tensorflow as tf
import tensorflow_datasets as tfds  # used by tfds.load in the Colab section
from tensorflow.test import gpu_device_name

# Import various keras tools
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPool2D, Flatten, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

#### Run this cell when working locally. When using Colab, the import of this library is handled in the next section.

In [None]:
# pcamlib.py is my library of helper functions
import pcamlib

---

## Google Colab Setup. When running locally, skip these cells

These cells are adapted from code located [here](https://towardsdatascience.com/google-drive-google-colab-github-dont-just-read-do-it-5554d5824228)

In [None]:
device_name = gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

In [None]:
# Mount Google Drive
from google.colab import drive # import drive from google colab

ROOT = "/content/drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)

drive.mount(ROOT)           # we mount the google drive at /content/drive

In [None]:
# Change to capstone folder and print directory to confirm
%cd /content/drive/MyDrive/BrainStation\ Capstone\ Project/capstone
%pwd

In [None]:
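# Import pcamlib into Google Colab by executing the file inside a fresh module object
# (the `imp` module is deprecated and was removed in Python 3.12, so use `types` instead)
import types
pcamlib = types.ModuleType('pcamlib')
exec(open("./pcamlib.py").read(), pcamlib.__dict__)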

In [None]:
# Attempt to load PCam directly from a folder in Google Drive.
# This works locally but doesn't currently work on Colab; it may be a version issue
directory = '/content/drive/My Drive/BrainStation Capstone Project/tensorflow_datasets'
pcam, pcam_info = tfds.load("patch_camelyon", data_dir=directory, with_info=True, download=False)

---

## Data Set-up

To get started with this dataset, I adapted the code from this [article](https://geertlitjens.nl/post/getting-started-with-camelyon/) written by Geert Litjens, one of the authors of the dataset.

I used his code for the `train_pipeline`, `valid_pipeline`, and `test_pipeline`, which load the train, validation, and test sets and prepare them for modelling. I also make use of his function `convert_sample`. This function extracts each image and its corresponding label from the dataset, converts each image to a TensorFlow `tf.float32` datatype, then performs one-hot encoding on the labels and converts them to `tf.float32` as well.
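
As a rough sketch of what `convert_sample` does, based only on the description above: the authoritative version is in Geert Litjens' article and in `pcamlib`, so the constant `NUM_CLASSES` and the fake sample used to exercise the function are illustrative assumptions. Note that `tf.image.convert_image_dtype` also rescales `uint8` pixel values into the range [0, 1] when converting to `tf.float32`.

```python
import tensorflow as tf

NUM_CLASSES = 2  # assumption: binary labels (class 0 and class 1)

def convert_sample(sample):
    """Split a dataset sample into a float32 image and a one-hot float32 label."""
    image, label = sample["image"], sample["label"]
    # uint8 [0, 255] -> float32 [0, 1]
    image = tf.image.convert_image_dtype(image, tf.float32)
    # integer label -> one-hot vector, e.g. 1 -> [0., 1.]
    label = tf.one_hot(label, NUM_CLASSES, dtype=tf.float32)
    return image, label

# Hypothetical stand-in for one dataset sample: an all-white 96x96 RGB patch with label 1
fake_sample = {
    "image": tf.constant(255, dtype=tf.uint8, shape=(96, 96, 3)),
    "label": tf.constant(1, dtype=tf.int64),
}
image, label = convert_sample(fake_sample)
print(image.dtype, image.shape, label.numpy())
```

In a pipeline, a function like this would typically be applied with `dataset.map(convert_sample)` before batching.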

In [None]:
# Load dataset and dataset info 
pcam, pcam_info = pcamlib.load_pcam()

In [None]:
# Create generator "pipelines" for train, validation and test sets.
# Default batch sizes of 64 for the train set and 128 for validation and test sets to speed up calculations
train_pipeline, valid_pipeline, test_pipeline = pcamlib.build_pipelines(pcam)

---

## Modelling

### If you are not training this model and are loading it from a file, skip ahead to Loading the Model

I also used Geert Litjens' CNN layer architecture as a starting point. It resembles a VGG16 architecture: three blocks of two Convolutional layers followed by a single Max Pooling layer, then a Flattening layer and two Dense layers before the final Dense layer, which outputs the class predictions. I kept the layer parameters the same as in his example.

I changed the optimizer from `SGD` to `Adam` simply because he provided multiple hyperparameters for `SGD`, and I wanted to experiment with the optimizer on my own. I also added Dropout layers after each convolutional layer, because the first iteration of the model started overfitting soon after the first epoch and the validation accuracy didn't improve beyond 80%.
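
To make the effect of the three conv-conv-pool blocks concrete, here is a small worked calculation of the feature-map size at each stage. It assumes 96x96 px input patches, 3x3 kernels with `'valid'` padding and stride 1, and 2x2 max pooling with stride 2; the helper names `conv_out` and `pool_out` are made up for this illustration. Dropout layers are omitted because they do not change tensor shapes.

```python
def conv_out(size, kernel=3, stride=1):
    """Output width/height of a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1

def pool_out(size, pool=2, stride=2):
    """Output width/height of a max pooling layer with 'valid' padding."""
    return (size - pool) // stride + 1

size = 96          # input patches are 96x96 px
trace = [size]     # spatial size after each layer
for filters in (16, 32, 64):
    size = conv_out(size)   # first convolution in the block
    trace.append(size)
    size = conv_out(size)   # second convolution in the block
    trace.append(size)
    size = pool_out(size)   # max pooling halves the size (rounding down)
    trace.append(size)

flattened = size * size * 64  # the last block has 64 filters
print(trace)       # [96, 94, 92, 46, 44, 42, 21, 19, 17, 8]
print(flattened)   # 4096
```

Under these assumptions, the Flatten layer hands 4,096 features to the first Dense layer.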

In [None]:
# Instantiate model object
cnn = Sequential()

# Images are 96x96 px, in RGB so there are 3 channels
image_shape = (96, 96, 3)

# Add convolutional layers to the model
# It was important to add dropout layers after each convolutional layer to reduce overfitting
cnn.add(Conv2D(16, kernel_size=(3, 3), activation='relu', padding='valid', input_shape=image_shape))
cnn.add(Dropout(0.25))
cnn.add(Conv2D(16, kernel_size=(3, 3), activation='relu', padding='valid'))
cnn.add(Dropout(0.25))

# Add a max pool layer to reduce the dimensions of the feature maps
cnn.add(MaxPool2D(pool_size=(2, 2), strides=(2,2)))

# Repeating this architecture two more times
cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='valid'))
cnn.add(Dropout(0.25))
cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='valid'))
cnn.add(Dropout(0.25))
cnn.add(MaxPool2D(pool_size=(2, 2), strides=(2,2)))
     
cnn.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='valid'))
cnn.add(Dropout(0.25))
cnn.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='valid'))
cnn.add(Dropout(0.25))
cnn.add(MaxPool2D(pool_size=(2, 2), strides=(2,2)))

# Flatten the data to prepare for dense layers
cnn.add(Flatten())
        
cnn.add(Dense(256, activation='relu'))
cnn.add(Dropout(0.25))

cnn.add(Dense(128, activation='relu'))
cnn.add(Dropout(0.25))

# Extra Dense layer added for model version cnn1.2
cnn.add(Dense(64, activation='relu'))
cnn.add(Dropout(0.25))

# Final Dense layer to make class predictions
cnn.add(Dense(2, activation='softmax'))
        
cnn.summary()

In [None]:
# For comparison, this commented line is the original optimizer used in the article:
# sgd_opt = SGD(lr=0.01, momentum=0.9, decay=0.0, nesterov=True)
cnn.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Add early stop callback to prevent the model from overfitting, or running too long
early_stop = EarlyStopping(monitor='val_accuracy', min_delta=0.01, patience=3, verbose=1)
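
To illustrate how `min_delta=0.01` and `patience=3` interact, here is a simplified pure-Python simulation of the stopping rule. It is a sketch of the logic rather than the Keras implementation, the function name `stopping_epoch` is made up, and the validation accuracies are invented numbers, not results from this model. An epoch only counts as an improvement if it beats the best value so far by more than `min_delta`; training stops after `patience` consecutive epochs without such an improvement.

```python
def stopping_epoch(val_accuracies, min_delta=0.01, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0  # consecutive epochs without a sufficient improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc - min_delta > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies: gains after epoch 2 are all smaller than 0.01
made_up_history = [0.78, 0.80, 0.805, 0.808, 0.809]
print(stopping_epoch(made_up_history))  # 5
```

In this made-up run the accuracy is still creeping upward, but each gain after epoch 2 is below one percentage point, so training stops at epoch 5.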

In [None]:
%%time
history = cnn.fit(train_pipeline,
                   validation_data=valid_pipeline,
                   verbose=1, epochs=30, steps_per_epoch=4096, validation_steps=256,
                   callbacks=[early_stop])

# Save the history of the model to a pandas dataframe
hist_df = pd.DataFrame(history.history)
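
The values of `steps_per_epoch` and `validation_steps` follow from the split sizes and batch sizes. A quick check, assuming the default batch sizes mentioned earlier (64 for training, 128 for validation) and the standard PCam split sizes of 262,144 training and 32,768 validation images:

```python
TRAIN_BATCH_SIZE = 64    # assumed default from pcamlib.build_pipelines
VALID_BATCH_SIZE = 128   # assumed default from pcamlib.build_pipelines

steps_per_epoch = 4096
validation_steps = 256

train_images_per_epoch = steps_per_epoch * TRAIN_BATCH_SIZE
valid_images_per_epoch = validation_steps * VALID_BATCH_SIZE

print(train_images_per_epoch)  # 262144
print(valid_images_per_epoch)  # 32768
```

With these settings, each epoch makes one full pass over the training set and one full pass over the validation set. If the batch sizes are changed, the step counts need to be adjusted to match.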

---

## Saving Model and History

In [None]:
# Save the fitted model to a file
cnn.save('data/models/cnn1')

In [None]:
# Save the history of the model to a csv
pcamlib.save_history(hist_df, 'data/models/history/cnn1.2_history.csv')

---

## Loading the Model

In [None]:
# Load the model from a file
cnn = tf.keras.models.load_model("data/models/cnn1.1")

In [None]:
# Load the model training history from a file
hist_df = pcamlib.load_history('data/models/history/cnn1.1_history.csv')

In [None]:
# Load y_proba from file if the model is not saved. For some larger models, I only save y_proba because the files are too large to track using git
y_proba = pcamlib.load_y_proba('data/y_proba/cnn1.1_y_proba.csv')

---

## Analysis

In [None]:
# Plot the training and validation Accuracy and Loss
pcamlib.plot_history(hist_df)

In [None]:
%%time

# Generate y_proba
y_proba = pcamlib.generate_y_proba(cnn, test_pipeline, class_1=False, save=True, filepath='data/y_proba/cnn1.1_y_proba.csv')

In [None]:
%%time

# Get predictions from y_proba. Default threshold of 0.5, meaning predicts positive class if >= 50% certainty of class 1 
y_pred = pcamlib.generate_y_pred(y_proba)
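
The thresholding step can be sketched in a few lines of NumPy. This is not the `pcamlib` implementation: it assumes `y_proba` is a one-dimensional array holding the predicted probability of class 1 for each image, and the function name `threshold_predictions` and the sample probabilities are made up for illustration.

```python
import numpy as np

def threshold_predictions(y_proba, threshold=0.5):
    """Predict class 1 wherever the class-1 probability is >= threshold, else class 0."""
    return (np.asarray(y_proba) >= threshold).astype(int)

# Hypothetical class-1 probabilities for four images
example_proba = [0.10, 0.50, 0.49, 0.93]

default_pred = threshold_predictions(example_proba)
strict_pred = threshold_predictions(example_proba, threshold=0.9)
print(default_pred)  # [0 1 0 1]
print(strict_pred)   # [0 0 0 1]
```

Raising the threshold makes the model more conservative about predicting the positive class, which trades recall for precision.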

In [None]:
%%time

# Create a list of the true labels for the test set
y_true = pcamlib.generate_y_true(pcam)

In [None]:
%%time

# Calculate accuracy of the predictions on the test set
pcamlib.print_test_accuracy(y_true, y_pred)

In [None]:
# Plot the confusion matrix
pcamlib.plot_cf_matrix(y_true, y_pred, normalize=True)

In [None]:
# Print the classification report to see precision, recall, and f1 score
pcamlib.print_classification_report(y_true, y_pred)

In [None]:
# Plot the receiver operating characteristic curve
pcamlib.plot_roc_curve(y_true, y_proba)

In [None]:
# Show a sample of images that were misclassified
pcamlib.plot_misclassified_images(pcam, y_true, y_pred)

## Summary

| Model   | Description                                                                                                        | Test Accuracy | Training Time | Prediction Time |
|---------|--------------------------------------------------------------------------------------------------------------------|---------------|---------------|-----------------|
| CNN 1.0 | Base model adapted from Geert Litjens' topology. Added 20% Dropout layers after each Convolutional and Dense layer | 84.4%         |               |                 |
| CNN 1.1 | Increased dropout to 25%                                                                                           | 85.4%         |               |                 |