# Cancer detection with deep learning

The aim of this project is to create and train a deep neural network that detects cancer in images with high accuracy. The GitHub repository for the project is https://github.com/mohosb/cancer_detection

The data is from the Kaggle competition "Histopathologic Cancer Detection" (https://www.kaggle.com/competitions/histopathologic-cancer-detection/overview).

In this project I will use PyTorch, a popular deep learning library for Python, together with a package called "nn_utils" that contains many useful helper functions. I wrote and maintain this package myself (https://github.com/szegedai/nn_utils).

## Data preprocessing and loading

In [1]:
import torch
from torch import nn
import torchvision
from torchvision.transforms.functional import normalize as standardize
import pandas as pd
import os
from PIL import Image
import matplotlib.pyplot as plt
from nn_utils.misc import split_dataset, create_data_loaders, RollingStatistics
from nn_utils.models import count_parameters
from nn_utils.models.resnet_v2 import ResNetV2
from nn_utils.training import train_classifier, CLILoggerCallback, LRSchedulerCallback

In [2]:
data_path = '.'

In [None]:
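try:
    os.makedirs(data_path + '/train/0', exist_ok=True)
    os.makedirs(data_path + '/train/1', exist_ok=True)

    # Move every image into a subfolder named after its label, as ImageFolder expects
    df = pd.read_csv('train_labels.csv', sep=',')
    for img_id, label in zip(df['id'], df['label']):
        os.rename(f'{data_path}/train/{img_id}.tif', f'{data_path}/train/{label}/{img_id}.tif')
    del df
except FileNotFoundError:
    # The images were already moved by an earlier run (or the CSV is missing)
    pass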

In [28]:
class UnlabeledImageFolder(torch.utils.data.Dataset):
    def __init__(self, root, transform, return_img_id=False):
        self.root = root
        self.transform = transform
        self.all_images = sorted(os.listdir(root))
        self.return_img_id = return_img_id

    def __len__(self):
        return len(self.all_images)

    def __getitem__(self, idx):
        image = Image.open(os.path.join(self.root, self.all_images[idx])).convert('RGB')
        if self.return_img_id:
            return self.transform(image), self.all_images[idx].split('.')[0]
        return self.transform(image)

In [42]:
train_transforms = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.RandomVerticalFlip(),
    torchvision.transforms.ToTensor()
])
test_transforms = torchvision.transforms.ToTensor()
train_ds = torchvision.datasets.ImageFolder(data_path + '/train', train_transforms)
test_ds = UnlabeledImageFolder(data_path + '/test', test_transforms, return_img_id=True)
train_ds, val_ds = split_dataset(train_ds, split=0.1, seed=42)
train_loader, val_loader = create_data_loaders(
    [train_ds, val_ds],
    batch_size=512, shuffle=True,
    num_workers=4, pin_memory=True,
    multiprocessing_context='spawn', persistent_workers=False
)
test_loader, = create_data_loaders(
    [test_ds], batch_size=512, shuffle=False,
    num_workers=0, multiprocessing_context=None, persistent_workers=False
)

## EDA

Let us look at the data to understand it better. The following cells show:
- The sizes of the training and validation datasets
- Some sample images from the training data and their labels
- The statistics of the RGB channels over the training data (the models will use this later to standardize the input)
- The number of labels for the two classes to see how unbalanced it is

In [None]:
len(train_ds), len(val_ds)

In [None]:
labels = []
fig, ax = plt.subplots(2, 5)
for i in range(10):
    img, label = train_ds[i]
    labels.append(label)
    current_ax = ax[i // 5, i % 5]
    current_ax.imshow(img.permute(1, 2, 0).numpy())
    current_ax.axis('off')
print('img size:', list(img.shape))
print('labels:', labels)

In [None]:
# Running this cell is not recommended: it takes a long time even on a fast SSD, and far longer on an HDD.
# Its result is hardcoded in the next cell, so just use those values.
stats = RollingStatistics((3, 96 * 96))
label_sum = 0
for images, labels in train_loader:
    stats.update(images.permute(1, 0, 2, 3).flatten(1, -1).numpy())
    label_sum += int(labels.sum())  # plain int, so the counts print as numbers rather than tensors

means = stats.mean.tolist()
stds = stats.std.tolist()
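As a cross-check on the idea behind the cell above, here is a minimal sketch of accumulating per-channel statistics with plain PyTorch. The function name `channel_stats` and the synthetic batches are assumptions made for illustration, and the sketch uses the population standard deviation, which may or may not match what `RollingStatistics` computes.

```python
import torch

def channel_stats(batches):
    """Accumulate per-channel mean and std over an iterable of (N, C, H, W) batches."""
    count = 0
    total = None
    total_sq = None
    for images in batches:
        images = images.to(torch.float64)  # float64 keeps the running sums accurate
        n, c, h, w = images.shape
        if total is None:
            total = torch.zeros(c, dtype=torch.float64)
            total_sq = torch.zeros(c, dtype=torch.float64)
        total += images.sum(dim=(0, 2, 3))
        total_sq += (images ** 2).sum(dim=(0, 2, 3))
        count += n * h * w
    mean = total / count
    # Population variance: E[x^2] - E[x]^2 (clamped to guard against rounding below zero)
    std = (total_sq / count - mean ** 2).clamp(min=0).sqrt()
    return mean, std

# Synthetic stand-in for the data loader: 4 batches of 8 random 3x96x96 images
torch.manual_seed(0)
fake_batches = [torch.rand(8, 3, 96, 96) for _ in range(4)]
mean, std = channel_stats(fake_batches)
print('means:', mean.tolist())
print('stds:', std.tolist())
```

Only two running sums per channel are kept, so memory use does not grow with the size of the dataset.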

In [5]:
means = [0.7024619 , 0.5462504 , 0.69643368]
stds = [0.23888772, 0.28208111, 0.21622823]
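To make the role of these values concrete, the following sketch applies per-channel standardization by hand. It is the same arithmetic, `(x - mean) / std`, that `torchvision.transforms.functional.normalize` applies per channel. The helper name `standardize_manual` and the tiny test batch are assumptions made for illustration.

```python
import torch

means = [0.7024619, 0.5462504, 0.69643368]
stds = [0.23888772, 0.28208111, 0.21622823]

def standardize_manual(x, means, stds):
    """Per-channel (x - mean) / std for a batch shaped (N, 3, H, W)."""
    mean = torch.tensor(means, dtype=x.dtype).view(1, -1, 1, 1)
    std = torch.tensor(stds, dtype=x.dtype).view(1, -1, 1, 1)
    return (x - mean) / std

# A batch whose pixels all equal the channel means should map to zeros
x = torch.tensor(means).view(1, 3, 1, 1).expand(2, 3, 4, 4)
z = standardize_manual(x, means, stds)
print(z.abs().max().item())
```

After this step each input channel has roughly zero mean and unit variance over the training data, which is the form the models below receive.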

In [None]:
print('label 0 count:', len(train_ds) - label_sum)
print('label 1 count:', label_sum)
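The label counts are easier to interpret next to the accuracy of a classifier that always predicts the majority class. Here is a minimal sketch of that calculation; the helper name `class_balance` and the toy labels are hypothetical and are not taken from the real dataset.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts, per-class fractions, and the majority-class baseline accuracy."""
    counts = Counter(labels)
    total = sum(counts.values())
    fractions = {cls: n / total for cls, n in counts.items()}
    majority_baseline = max(fractions.values())
    return counts, fractions, majority_baseline

# Hypothetical toy labels, NOT the real dataset
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
counts, fractions, baseline = class_balance(toy_labels)
print('counts:', dict(counts))
print('majority-class baseline accuracy:', baseline)
```

Any validation accuracy reported later should be read against this baseline rather than against 50%.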

## Model training and hyperparameter tuning

In this section we will set up some model configurations to train different models and try to select the best hyperparameters.
The plan for the selection is the following:
1) pick some reasonable learning rate and weight decay that we can fix for now
2) pick some models with different architectures
3) try to increase the width (number of filters per convolutional layer) and the depth (number of convolutional layers) to see which one gives us better performance
4) select the best model architecture and try a few other learning rate and weight decay combinations

(In other situations I would do a full hyperparameter search, but that would take weeks to run, so I will just pick some arbitrary values off the top of my head.)

Our models will use 3x3 convolutional layers (for computational efficiency) paired with 2D batch normalization layers (for stabilizing training performance) for feature extraction and a fully connected final layer for the actual classification.
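The building block described above can be sketched as follows. This is a simplified illustration, not one of the models trained in this notebook: the helper `conv_bn_relu` and the chosen channel counts are assumptions, and the input is a random batch with the 96x96 image size used in this project.

```python
import torch
from torch import nn

def conv_bn_relu(in_ch, out_ch, stride):
    """3x3 convolution followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

features = nn.Sequential(
    conv_bn_relu(3, 16, stride=1),   # 96x96 -> 96x96
    conv_bn_relu(16, 32, stride=2),  # 96x96 -> 48x48
    conv_bn_relu(32, 64, stride=2),  # 48x48 -> 24x24
)
classifier = nn.Linear(64, 2)        # fully connected layer on the pooled features

x = torch.rand(4, 3, 96, 96)
feature_maps = features(x)
pooled = feature_maps.mean(dim=(2, 3))  # global average pooling over height and width
logits = classifier(pooled)
print(feature_maps.shape, pooled.shape, logits.shape)
```

With `padding=1`, a 3x3 convolution keeps the spatial size at stride 1 and halves it at stride 2, so depth can be added without changing the classifier's input size.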

In [6]:
dtype = torch.float32
#dtype = torch.float16  # Use this for faster training and lower memory usage!

if torch.cuda.is_available():
    device = torch.device('cuda')       # For Nvidia GPUs if cuda is installed
elif torch.backends.mps.is_available():
    device = torch.device('mps')        # For Apple silicon (M1, M2)
else:
    device = torch.device('cpu')        # If nothing else is available use the CPU (not recommended!)
print('using:', device)

using: mps


In [7]:
def fit(model, base_lr, wd, num_epochs=3, verbose=False):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=wd)
    def lr_fn(epoch):
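        if epoch < 2:  # epochs 0 and 1 (LambdaLR counts from 0): full base learning rate
            return 1.
        if epoch < 3:  # epoch 2: base learning rate / 10
            return 1e-1
        return 1e-2  # epoch 3 and later: base learning rate / 100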
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_fn)

    loss_fn = nn.CrossEntropyLoss()

    callbacks = [
        LRSchedulerCallback(scheduler)
    ]
    if verbose:
        callbacks.append(CLILoggerCallback())

    final_metrics = train_classifier(
        model, loss_fn, optimizer, train_loader, val_loader,
        callbacks=callbacks, num_epochs=num_epochs
    )
    return final_metrics
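To see which learning rates this schedule produces, here is a minimal sketch that steps a `LambdaLR` scheduler once per epoch on a dummy parameter. It assumes that `LRSchedulerCallback` also steps the scheduler once per epoch, which is not verified here. The dummy parameter and the base values are for illustration only.

```python
import torch

def lr_fn(epoch):
    # Multiplier applied to the base learning rate
    if epoch < 2:
        return 1.0
    if epoch < 3:
        return 1e-1
    return 1e-2

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, never actually trained
optimizer = torch.optim.AdamW([param], lr=0.005, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_fn)

lrs = []
for epoch in range(4):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()   # stepping the optimizer first avoids PyTorch's ordering warning
    scheduler.step()   # assumed to happen once per epoch
print(lrs)
```

Under that assumption, a three-epoch run uses the base learning rate for the first two epochs and a tenth of it for the last one. The 1e-2 multiplier only applies if training runs longer.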

In [8]:
class BasicCNN(nn.Module):
    def __init__(self, num_classes, depth_factor=5, width_factor=1, means=(0., 0., 0.), stds=(1., 1., 1.)):
        super().__init__()

        num_channels = [16] + [2 ** (4 + i) * width_factor for i in range(depth_factor)]
        self.conv_layers = [
            nn.Conv2d(3, num_channels[0], kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(num_channels[0]),
            nn.ReLU(True)
        ]
        for i in range(1, depth_factor + 1):
            self.conv_layers += [
                nn.Conv2d(num_channels[i - 1], num_channels[i], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(num_channels[i]),
                nn.Conv2d(num_channels[i], num_channels[i], kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(num_channels[i]),
                nn.ReLU(True)
            ]
        self.conv_layers = nn.ModuleList(self.conv_layers)
        self.classifier = nn.Linear(num_channels[-1], num_classes)

        self.means = means
        self.stds = stds

    def forward(self, x):
        x = standardize(x, self.means, self.stds)
        for layer in self.conv_layers:
            x = layer(x)
        x = torch.mean(x, (2, 3))
        x = self.classifier(x)
        return x

In the following cell, uncomment the model architecture and the configuration you would like to use.

In [13]:
#model = BasicCNN(2, depth_factor=3, width_factor=1, means=means, stds=stds)
#model = BasicCNN(2, depth_factor=3, width_factor=2, means=means, stds=stds)
#model = BasicCNN(2, depth_factor=4, width_factor=1, means=means, stds=stds)
#model = ResNetV2(2, [2, 2, 2], width_factor=1, means=means, stds=stds)
#model = ResNetV2(2, [2, 2, 2], width_factor=2, means=means, stds=stds)
model = ResNetV2(2, [2, 2, 2, 2], width_factor=1, means=means, stds=stds)

#base_learning_rate, weight_decay = 0.001, 0.0001
#base_learning_rate, weight_decay = 0.0005, 0.0001
#base_learning_rate, weight_decay = 0.005, 0.0001
base_learning_rate, weight_decay = 0.005, 0.0005

model.to(device=device, dtype=dtype)
print(next(model.modules()))
print('number of parameters:', count_parameters(model))

ResNetV2(
  (head): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (groups): Sequential(
    (0): Sequential(
      (0): BasicBlock(
        (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation_fn): ReLU(inplace=True)
        (conv1): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (1): BasicBlock(
        (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation_fn): ReLU(inplace=True)
        (conv1): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(16, 16, kernel_size=(3, 3),

In [14]:
fit(model, base_learning_rate, weight_decay, num_epochs=3, verbose=True)

1/3 epoch:

  1/387 3.75s - train_loss: 0.6953 train_acc: 0.6133
  2/387 4.15s - train_loss: 0.7191 train_acc: 0.6377
  3/387 4.53s - train_loss: 0.7645 train_acc: 0.5553
  4/387 4.92s - train_loss: 0.7269 train_acc: 0.5796
  5/387 5.31s - train_loss: 0.6928 train_acc: 0.6109
  6/387 5.71s - train_loss: 0.6582 train_acc: 0.6426
  7/387 6.11s - train_loss: 0.6312 train_acc: 0.6610
  8/387 6.49s - train_loss: 0.6098 train_acc: 0.6782
  9/387 6.87s - train_loss: 0.5941 train_acc: 0.6914
  10/387 7.26s - train_loss: 0.5810 train_acc: 0.7006
  11/387 7.65s - train_loss: 0.5743 train_acc: 0.7061
  12/387 8.03s - train_loss: 0.5676 train_acc: 0.7109
  13/387 8.42s - train_loss: 0.5554 train_acc: 0.7207
  14/387 8.81s - train_loss: 0.5441 train_acc: 0.7292
  15/387 9.19s - train_loss: 0.5355 train_acc: 0.7366
  16/387 9.58s - train_loss: 0.5301 train_ac

{'train_loss': 0.19716130146992608,
 'train_acc': 0.9234583861470638,
 'std_val_loss': 0.0003686105364637476,
 'std_val_acc': 0.9276883919643669}

## Results and conclusion

| model type  | learning rate | weight decay | number of filters per conv layer  | number of parameters | validation accuracy |
|-------------|---------------|--------------|-----------------------------------|----------------------|---------------------|
| basic model | 0.001         | 0.0001       | 16, 2x16, 2x32, 2x64              | 75 010               | 90.63%              |
| basic model | 0.001         | 0.0001       | 16, 2x32, 2x64, 2x128             | 292 386              | 91.28%              |
| basic model | 0.001         | 0.0001       | 16, 2x16, 2x32, 2x64, 2x128       | 297 090              | 91.54%              |
| resnet      | 0.001         | 0.0001       | 16, 4x16, 4x32, 4x64              | 174 546              | 91.62%              |
| resnet      | 0.001         | 0.0001       | 16, 4x32, 4x64, 4x128             | 690 642              | 91.37%              |
| resnet      | 0.001         | 0.0001       | 16, 4x16, 4x32, 4x64, 4x128       | 699 986              | 91.95%              |
| resnet      | 0.0005        | 0.0001       | 16, 4x16, 4x32, 4x64, 4x128       | 699 986              | 90.76%              |
| resnet      | 0.005         | 0.0001       | 16, 4x16, 4x32, 4x64, 4x128       | 699 986              | 92.69%              |
| resnet      | 0.005         | 0.0005       | 16, 4x16, 4x32, 4x64, 4x128       | 699 986              | 92.77%              |

(Some filter counts are written compactly, for example 4x16, which means 4 separate convolutional layers with 16 filters each.)
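The parameter counts of the basic models in the table can be reproduced by hand from the `BasicCNN` definition. The sketch below mirrors its layer layout in plain Python. The helper names are made up for illustration, and it assumes that `count_parameters` counts only learnable parameters (convolution weights and biases, batch normalization scale and shift, and the final linear layer).

```python
def conv_params(in_ch, out_ch, kernel=3):
    return in_ch * out_ch * kernel * kernel + out_ch  # weights + biases

def bn_params(ch):
    return 2 * ch  # learnable scale and shift

def basic_cnn_param_count(num_classes, depth_factor, width_factor):
    """Analytic parameter count that follows the layer layout of BasicCNN."""
    channels = [16] + [2 ** (4 + i) * width_factor for i in range(depth_factor)]
    total = conv_params(3, channels[0]) + bn_params(channels[0])
    for i in range(1, depth_factor + 1):
        # Each stage: a stride-2 convolution, then a stride-1 convolution, each with batch norm
        total += conv_params(channels[i - 1], channels[i]) + bn_params(channels[i])
        total += conv_params(channels[i], channels[i]) + bn_params(channels[i])
    total += channels[-1] * num_classes + num_classes  # final linear layer
    return total

counts = {
    (3, 1): basic_cnn_param_count(2, depth_factor=3, width_factor=1),
    (3, 2): basic_cnn_param_count(2, depth_factor=3, width_factor=2),
    (4, 1): basic_cnn_param_count(2, depth_factor=4, width_factor=1),
}
for (depth, width), n in counts.items():
    print(f'depth_factor={depth}, width_factor={width}: {n} parameters')
```

The three results match the first three rows of the table (75 010, 292 386 and 297 090). Most of the parameters sit in the last, widest stage, so doubling the width and adding one more stage cost about the same.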

Based on this small hyperparameter search, making the models deeper seems to help more than widening them by adding more filters to each convolutional layer. Also, roughly quadrupling the number of parameters adds less than one percentage point of validation accuracy in the table above, while it significantly increases the training time and the amount of memory used. It is probably possible to further improve the model by adding more convolutional layers, but I did not have the resources to run bigger models at the time of this project.

In [49]:
all_ids = []
all_predictions = []
model.eval()
with torch.no_grad():  # gradients are not needed for inference
    for images, ids in test_loader:
        logits = model(images.to(device=device, dtype=dtype))  # match the model's device and dtype
        all_predictions += logits.argmax(dim=1).to(dtype=torch.int).tolist()
        all_ids += ids
submission = pd.DataFrame({'id': all_ids, 'target': all_predictions})
submission.to_csv('submission.csv', sep=',', header=True, index=False)