# Problem Description

This notebook develops a solution to the [Histopathologic Cancer Detection](https://www.kaggle.com/c/histopathologic-cancer-detection/overview) competition on Kaggle. The task is to identify metastatic cancer in small image patches, some of which contain cancer cells and some of which do not. The notebook compares several pre-trained Convolutional Neural Network (CNN) architectures on this task.

## Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings to keep cell output readable
import keras
import tensorflow as tf

# Exploratory Data Analysis (EDA)

## Data Load

In [3]:
# labels
labels = pd.read_csv('/kaggle/input/histopathologic-cancer-detection/train_labels.csv')
labels.head()

Unnamed: 0,id,label
0,f38a6374c348f90b587e046aac6079959adf3835,0
1,c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
2,755db6279dae599ebb4d39a9123cce439965282d,0
3,bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
4,068aba587a4950175d04c680d38943fd488d6a9d,0


In [4]:
labels['filename'] = labels['id'] + '.tif'
labels['label'] = labels['label'].astype(str)
labels.head()

Unnamed: 0,id,label,filename
0,f38a6374c348f90b587e046aac6079959adf3835,0,f38a6374c348f90b587e046aac6079959adf3835.tif
1,c18f2d887b7ae4f6742ee445113fa1aef383ed77,1,c18f2d887b7ae4f6742ee445113fa1aef383ed77.tif
2,755db6279dae599ebb4d39a9123cce439965282d,0,755db6279dae599ebb4d39a9123cce439965282d.tif
3,bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0,bc3f0c64fb968ff4a8bd33af6971ecae77c75e08.tif
4,068aba587a4950175d04c680d38943fd488d6a9d,0,068aba587a4950175d04c680d38943fd488d6a9d.tif


In [5]:
from sklearn.model_selection import train_test_split

train_labels, val_labels = train_test_split(labels, test_size = 0.2, random_state = 42)

In [6]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_images = ImageDataGenerator(rescale = 1/255).flow_from_dataframe(dataframe=train_labels,
    directory='/kaggle/input/histopathologic-cancer-detection/train',
    x_col='filename',
    y_col='label',
    target_size=(96, 96),
    batch_size=96,
    class_mode='categorical')

val_images = ImageDataGenerator(rescale = 1/255).flow_from_dataframe(dataframe=val_labels,
    directory='/kaggle/input/histopathologic-cancer-detection/train',
    x_col='filename',
    y_col='label',
    target_size=(96, 96),
    batch_size=96,
    class_mode='categorical')

Found 176020 validated image filenames belonging to 2 classes.
Found 44005 validated image filenames belonging to 2 classes.
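
The image counts reported above determine how many batches make up one epoch. As a quick sanity check, assuming the generators emit a final partial batch (so the count is a ceiling division), the arithmetic below reproduces the 1834 steps per epoch that appear in the training logs. The helper name `steps_per_epoch` is introduced here for illustration only and is not part of the notebook.

```python
import math

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once; the last batch may be partial."""
    return math.ceil(num_samples / batch_size)

BATCH_SIZE = 96          # batch_size passed to flow_from_dataframe
N_TRAIN = 176020         # "Found 176020 validated image filenames"
N_VAL = 44005            # "Found 44005 validated image filenames"

print(steps_per_epoch(N_TRAIN, BATCH_SIZE))  # 1834
print(steps_per_epoch(N_VAL, BATCH_SIZE))    # 459
```

The same calculation applied to the 57458 test images gives 599 batches, which matches the progress bar printed by `predict` in the submission section.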


# Model Architecture

This project uses transfer learning with several pre-trained architectures. The first is VGG19, one of the architectures mentioned in class; it is compared with Xception and ResNet152V2 below.

In [7]:
from keras import layers
from keras import models

In [8]:
base_model = tf.keras.applications.VGG19(
    input_shape = (96,96,3),
    include_top = False,
    weights = 'imagenet'
)

base_model.trainable = False
base_model.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg19/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5
80134624/80134624 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [8]:
model = models.Sequential([
    base_model,
    layers.Flatten(),  # flattens the base model's feature map; no input_shape is needed after the base
    layers.Dense(units=96, activation='relu'),
    layers.Dense(units=32, activation='relu'),
    layers.Dense(2, activation = 'softmax')
])
model.summary()

In [11]:
epochs = 10
opt = tf.keras.optimizers.Adam(0.0001)
model.compile(loss='categorical_crossentropy', optimizer = opt, metrics=['accuracy'])
out = model.fit(train_images, epochs = epochs, validation_data = val_images)


Epoch 1/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 1785s 963ms/step - accuracy: 0.8202 - loss: 0.3972 - val_accuracy: 0.8550 - val_loss: 0.3345
Epoch 2/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 562s 306ms/step - accuracy: 0.8541 - loss: 0.3340 - val_accuracy: 0.8631 - val_loss: 0.3175
Epoch 3/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 415s 226ms/step - accuracy: 0.8646 - loss: 0.3135 - val_accuracy: 0.8630 - val_loss: 0.3195
Epoch 4/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 430s 234ms/step - accuracy: 0.8729 - loss: 0.2989 - val_accuracy: 0.8686 - val_loss: 0.3042
Epoch 5/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 452s 246ms/step - accuracy: 0.8770 - loss: 0.2892 - val_accuracy: 0.8735 - val_loss: 0.2949
Epoch 6/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 408s 222ms/step - accuracy: 0.8830 - loss: 0.2774 - val_accuracy: 0.8742 - val_loss

# Results and Analysis

## Comparison of Different Architectures

In [None]:
base_model_x = tf.keras.applications.Xception(
    input_shape = (96,96,3),
    include_top = False,
    weights = 'imagenet'
)

base_model_x.trainable = False

base_model_x.summary()

In [12]:
model = models.Sequential([
    base_model_x,
    layers.Flatten(),  # flattens the base model's feature map; no input_shape is needed after the base
    layers.Dense(units=96, activation='relu'),
    layers.Dense(units=32, activation='relu'),
    layers.Dense(2, activation = 'softmax')
])
model.summary()

epochs = 10
opt = tf.keras.optimizers.Adam(0.0001)
model.compile(loss='categorical_crossentropy', optimizer = opt, metrics=['accuracy'])
out = model.fit(train_images, epochs = epochs, validation_data = val_images)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/xception/xception_weights_tf_dim_ordering_tf_kernels_notop.h5
83683744/83683744 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


Epoch 1/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 621s 331ms/step - accuracy: 0.8203 - loss: 0.4005 - val_accuracy: 0.8520 - val_loss: 0.3388
Epoch 2/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 453s 246ms/step - accuracy: 0.8581 - loss: 0.3277 - val_accuracy: 0.8587 - val_loss: 0.3269
Epoch 3/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 438s 238ms/step - accuracy: 0.8682 - loss: 0.3040 - val_accuracy: 0.8562 - val_loss: 0.3338
Epoch 4/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 511s 278ms/step - accuracy: 0.8762 - loss: 0.2894 - val_accuracy: 0.8663 - val_loss: 0.3091
Epoch 5/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 542s 295ms/step - accuracy: 0.8847 - loss: 0.2718 - val_accuracy: 0.8686 - val_loss: 0.3084
Epoch 6/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 551s 300ms/step - accuracy: 0.8908 - loss: 0.2591 - val_accuracy: 0.8682 - val_loss:

In [None]:
base_model_r = tf.keras.applications.ResNet152V2(
    input_shape = (96,96,3),
    include_top = False,
    weights = 'imagenet'
)

base_model_r.trainable = False

base_model_r.summary()

In [13]:
model = models.Sequential([
    base_model_r,
    layers.Flatten(),  # flattens the base model's feature map; no input_shape is needed after the base
    layers.Dense(units=96, activation='relu'),
    layers.Dense(units=32, activation='relu'),
    layers.Dense(2, activation = 'softmax')
])
model.summary()
epochs = 10
opt = tf.keras.optimizers.Adam(0.0001)
model.compile(loss='categorical_crossentropy', optimizer = opt, metrics=['accuracy'])
out = model.fit(train_images, epochs = epochs, validation_data = val_images)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet152v2_weights_tf_dim_ordering_tf_kernels_notop.h5
234545216/234545216 ━━━━━━━━━━━━━━━━━━━━ 1s 0us/step


Epoch 1/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 486s 251ms/step - accuracy: 0.8113 - loss: 0.4195 - val_accuracy: 0.8464 - val_loss: 0.3519
Epoch 2/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 425s 231ms/step - accuracy: 0.8673 - loss: 0.3100 - val_accuracy: 0.8525 - val_loss: 0.3415
Epoch 3/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 517s 281ms/step - accuracy: 0.8934 - loss: 0.2573 - val_accuracy: 0.8561 - val_loss: 0.3398
Epoch 4/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 583s 317ms/step - accuracy: 0.9151 - loss: 0.2121 - val_accuracy: 0.8554 - val_loss: 0.3615
Epoch 5/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 592s 322ms/step - accuracy: 0.9346 - loss: 0.1711 - val_accuracy: 0.8507 - val_loss: 0.3914
Epoch 6/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 581s 316ms/step - accuracy: 0.9500 - loss: 0.1378 - val_accuracy: 0.8495 - val_loss:

## Hyperparameter Tuning

In [10]:
def model_builder(hp):
    # Tune the number of units in the first Dense layer
    # Candidate values: 32, 64, or 96
    hp_units = hp.Int('units', min_value=32, max_value=96, step=32)

    model = models.Sequential([
        base_model,  # frozen VGG19 base defined above
        layers.Flatten(),
        layers.Dense(units=hp_units, activation='relu'),
        layers.Dense(units=32, activation='relu'),
        layers.Dense(2, activation='softmax')
    ])

    # Tune the learning rate for the optimizer
    # Candidate values: 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model



In [11]:
import keras_tuner as kt

tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3)

### Performance Tweaks

In [10]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
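
With `patience=5` and runs of at most 10 epochs, this callback only ends training if `val_loss` fails to improve for five epochs in a row. The sketch below is a simplified pure-Python model of that rule, not Keras's implementation; `first_stop_epoch` is a hypothetical helper, and the inputs are the first five `val_loss` values from the ResNet152V2 run earlier in this notebook.

```python
def first_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: stop once `patience` consecutive epochs have passed
    without a new best (lowest) validation loss.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# First five val_loss values logged for the ResNet152V2 model above.
resnet_val_loss = [0.3519, 0.3415, 0.3398, 0.3615, 0.3914]

print(first_stop_epoch(resnet_val_loss, patience=5))  # None
print(first_stop_epoch(resnet_val_loss, patience=2))  # 5
```

Under this rule a smaller patience would have cut the overfitting ResNet152V2 run short at epoch 5, whereas `patience=5` cannot trigger before epoch 6 of any run.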

### Parameter Scan

In [13]:
tuner.search(train_images, epochs = 10, validation_data = val_images, callbacks=[stop_early])

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('units')} and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")

tuner.search_space_summary(extended=True)

Trial 9 Complete [00h 09m 18s]
val_accuracy: 0.8599931597709656

Best val_accuracy So Far: 0.8650835156440735
Total elapsed time: 01h 40m 31s

The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is 96 and the optimal learning rate for the optimizer
is 0.001.

Search space summary
Default search space size: 2
units (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 96, 'step': 32, 'sampling': 'linear'}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}
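
The search space summarized above is small enough to list in full. The sketch below enumerates it in plain Python, assuming the `max_value` of `hp.Int` is inclusive; the variable names are illustrative. Hyperband does not evaluate this grid exhaustively: it samples configurations and allocates training epochs adaptively, so the listing only shows which combinations were available.

```python
from itertools import product

# hp.Int('units', min_value=32, max_value=96, step=32)
units_options = list(range(32, 96 + 1, 32))

# hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
learning_rates = [1e-2, 1e-3, 1e-4]

search_space = list(product(units_options, learning_rates))
for units, lr in search_space:
    print(f"units={units:>2}  learning_rate={lr}")

print(len(search_space))  # 9
```

The configuration reported as best, 96 units with a learning rate of 0.001, is one of these nine combinations.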


## Submission

In [11]:
best_model = models.Sequential([
    base_model,
    layers.Flatten(),  # flattens the base model's feature map; no input_shape is needed after the base
    layers.Dense(units=96, activation='relu'),
    layers.Dense(units=32, activation='relu'),
    layers.Dense(2, activation = 'softmax'),
])
best_model.compile(loss='categorical_crossentropy', optimizer = tf.keras.optimizers.Adam(0.001), metrics=['accuracy'])
out = best_model.fit(train_images, epochs = 10, validation_data = val_images, callbacks=[stop_early])

Epoch 1/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 1160s 624ms/step - accuracy: 0.8246 - loss: 0.3879 - val_accuracy: 0.8495 - val_loss: 0.3406
Epoch 2/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 289s 157ms/step - accuracy: 0.8564 - loss: 0.3303 - val_accuracy: 0.8631 - val_loss: 0.3172
Epoch 3/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 274s 149ms/step - accuracy: 0.8690 - loss: 0.3059 - val_accuracy: 0.8584 - val_loss: 0.3257
Epoch 4/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 285s 155ms/step - accuracy: 0.8738 - loss: 0.2959 - val_accuracy: 0.8671 - val_loss: 0.3135
Epoch 5/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 284s 154ms/step - accuracy: 0.8791 - loss: 0.2832 - val_accuracy: 0.8755 - val_loss: 0.2946
Epoch 6/10
1834/1834 ━━━━━━━━━━━━━━━━━━━━ 272s 148ms/step - accuracy: 0.8844 - loss: 0.2744 - val_accuracy: 0.8732 - val_loss

In [12]:
# load test images
submission = pd.read_csv('/kaggle/input/histopathologic-cancer-detection/sample_submission.csv')
submission['filename'] = submission['id'] + '.tif'
submission.head()

Unnamed: 0,id,label,filename
0,0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5,0,0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5.tif
1,95596b92e5066c5c52466c90b69ff089b39f2737,0,95596b92e5066c5c52466c90b69ff089b39f2737.tif
2,248e6738860e2ebcf6258cdc1f32f299e0c76914,0,248e6738860e2ebcf6258cdc1f32f299e0c76914.tif
3,2c35657e312966e9294eac6841726ff3a748febf,0,2c35657e312966e9294eac6841726ff3a748febf.tif
4,145782eb7caa1c516acbe2eda34d9a3f31c41fd6,0,145782eb7caa1c516acbe2eda34d9a3f31c41fd6.tif


In [14]:
test_images = ImageDataGenerator(rescale=1/255).flow_from_dataframe(
    dataframe = submission,
    directory = '/kaggle/input/histopathologic-cancer-detection/test',
    x_col = 'filename',
    batch_size = 96,
    shuffle = False,
    class_mode = None,
    target_size = (96,96)
)

Found 57458 validated image filenames.


In [15]:
# predict test labels using fitted model
test_labels = best_model.predict(test_images)
submission.label = test_labels[:,1]
submission.head()

599/599 ━━━━━━━━━━━━━━━━━━━━ 307s 512ms/step


Unnamed: 0,id,label,filename
0,0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5,0.024745,0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5.tif
1,95596b92e5066c5c52466c90b69ff089b39f2737,0.023142,95596b92e5066c5c52466c90b69ff089b39f2737.tif
2,248e6738860e2ebcf6258cdc1f32f299e0c76914,0.000655,248e6738860e2ebcf6258cdc1f32f299e0c76914.tif
3,2c35657e312966e9294eac6841726ff3a748febf,0.019055,2c35657e312966e9294eac6841726ff3a748febf.tif
4,145782eb7caa1c516acbe2eda34d9a3f31c41fd6,0.021113,145782eb7caa1c516acbe2eda34d9a3f31c41fd6.tif


In [17]:
# save labels to a folder
submission.to_csv('submission.csv', header = True, index = False)

# Conclusion

As the results above show, ResNet152V2 and Xception, which are deeper architectures, performed somewhat worse at transfer learning than VGG19 and overfit more readily; by epoch 6, ResNet152V2 had reached 95% training accuracy while its validation accuracy stayed near 85%. This is likely due to the difference in the number of parameters that must be fit to adapt each model to binary classification. Overall, however, the effectiveness of this type of learning was roughly constant across architectures and hyperparameter values, which suggests that differences between the architectures may not translate into large performance differences across tasks: validation accuracy plateaued at around 87 to 88% for the best tuned models.
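
Because every base model is frozen, the parameters actually fit during training are those of the dense head, and their number depends on the size of the flattened feature map. The sketch below counts them under a stated assumption: each base reduces a 96×96 input to a 3×3 map, with 512 channels for VGG19 and 2048 for Xception and ResNet152V2. These shapes are not shown in the notebook output and should be checked against `model.summary()`; `dense_head_params` is a helper introduced here for illustration.

```python
from math import prod

def dense_head_params(feature_shape, layer_units=(96, 32, 2)):
    """Count weights and biases of Flatten followed by a stack of Dense layers."""
    n_in = prod(feature_shape)          # size of the flattened feature vector
    total = 0
    for units in layer_units:
        total += n_in * units + units   # weight matrix plus bias vector
        n_in = units
    return total

# ASSUMED output shapes of the frozen bases for 96x96x3 inputs (not from the notebook).
assumed_feature_shapes = {
    "VGG19": (3, 3, 512),
    "Xception": (3, 3, 2048),
    "ResNet152V2": (3, 3, 2048),
}

for name, shape in assumed_feature_shapes.items():
    print(f"{name}: {dense_head_params(shape):,} trainable head parameters")
# VGG19: 445,634 trainable head parameters
# Xception: 1,772,738 trainable head parameters
# ResNet152V2: 1,772,738 trainable head parameters
```

If the assumed shapes hold, the Xception and ResNet152V2 heads have roughly four times as many trainable parameters as the VGG19 head, which is consistent with the stronger overfitting seen in their training logs.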

# References

1. https://www.kaggle.com/code/prashant111/comprehensive-guide-to-cnn-with-keras
2. https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
3. https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-022-00793-7
4. https://keras.io/guides/transfer_learning/
5. https://keras.io/api/applications/
6. VGG: https://arxiv.org/abs/1409.1556
7. ResNet: https://arxiv.org/abs/1603.05027
8. Xception: https://arxiv.org/abs/1610.02357
9. https://keras.io/keras_tuner/api/hypermodels/
10. https://www.tensorflow.org/tutorials/keras/keras_tuner