# Lung and Colon Cancer Detection
This notebook demonstrates the process of building a machine learning model to detect lung and colon cancer using histopathological images. The dataset contains labeled images of cancerous and non-cancerous (<em>healthy</em>) tissues. 

There are five classes in this dataset: 
- Lung benign tissue (<em>healthy</em>)
- Lung adenocarcinoma
- Lung squamous cell carcinoma
- Colon adenocarcinoma
- Colon benign tissue (<em>healthy</em>)

The goal is to compare different Convolutional Neural Networks (CNNs), explore the strengths and weaknesses of the architectures, and understand which ones perform best on this dataset. Ultimately, the aim is to identify the most robust classifier, one that accurately identifies cancerous lung or colon tissue in the given sample images.

For a fair comparison of the different CNNs, the following rules apply:
- The dataset has to be properly preprocessed.
- The same training parameters are used:
  - Learning rate
  - Batch size
  - Number of epochs
- The same optimizer is used.
  - This isolates the effect of the CNN architecture on performance.

These rules ensure that the comparison is consistent, controlled, and fair.
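
One way to enforce such rules in code is to keep all shared hyperparameters in a single immutable object that every architecture is trained with. The following is a minimal sketch, not part of the original notebook: the name `TrainingConfig`, the model names, and all values are hypothetical placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters shared by every CNN in the comparison (placeholder values)."""
    learning_rate: float = 1e-3   # placeholder, not taken from the notebook
    batch_size: int = 64          # placeholder
    num_epochs: int = 10          # placeholder
    optimizer_name: str = "adam"  # placeholder
    seed: int = 42                # placeholder


shared_config = TrainingConfig()

# Every architecture receives the same settings, so differences in
# performance can be attributed to the architecture itself.
architectures = ["cnn_a", "cnn_b"]  # hypothetical model names
runs = {name: asdict(shared_config) for name in architectures}
```

Because the dataclass is frozen, an accidental change of a hyperparameter between two runs raises an error instead of silently skewing the comparison.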

The dataset used in this notebook is sourced from Kaggle: https://www.kaggle.com/datasets/andrewmvd/lung-and-colon-cancer-histopathological-images/data

## Required Packages

In [1]:
import os
import pandas as pd
import random
import torch
import torch.nn as nn
from PIL import Image
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

## Dataset handling
### Import Dataset
The dataset contains 25,000 images.

In [2]:
lung_dataset = '../lung_colon_image_set/lung_image_sets'
colon_dataset = '../lung_colon_image_set/colon_image_sets'

if not os.path.exists(lung_dataset):
    raise FileNotFoundError(f"Dataset path '{lung_dataset}' does not exist!")
if not os.path.exists(colon_dataset):
    raise FileNotFoundError(f"Dataset path '{colon_dataset}' does not exist!")

# Keep only class directories; hidden files such as '.DS_Store' are not classes
# and would otherwise shift the label indices derived from the class list.
lung_classes = sorted(d for d in os.listdir(lung_dataset) if os.path.isdir(os.path.join(lung_dataset, d)))
colon_classes = sorted(d for d in os.listdir(colon_dataset) if os.path.isdir(os.path.join(colon_dataset, d)))

print(f"Classes found in lung dataset: {lung_classes}")
print(f"Classes found in colon dataset: {colon_classes}")

lc_classes = sorted(set(lung_classes + colon_classes))
print(f"Lung and Colon classes combined: {lc_classes}")

Classes found in lung dataset: ['lung_aca', 'lung_n', 'lung_scc']
Classes found in colon dataset: ['colon_aca', 'colon_n']
Lung and Colon classes combined: ['colon_aca', 'colon_n', 'lung_aca', 'lung_n', 'lung_scc']


### Split Dataset
The dataset is split into the following three subsets with predefined percentages:
- Training data (<em>80 %</em>)
- Validation data (<em>10 %</em>)
- Testing data (<em>10 %</em>)

In [3]:
def prepare_splits(data_directories, split_ratios=(0.8, 0.1, 0.1)):
    """Split the image paths of every class into training, validation, and testing sets."""
    # Collect all image paths per class (one subdirectory per class).
    all_data = {}
    for data_directory in data_directories:
        for class_name in sorted(os.listdir(data_directory)):
            class_directory = os.path.join(data_directory, class_name)
            if os.path.isdir(class_directory):
                all_data.setdefault(class_name, []).extend(
                    os.path.join(class_directory, file_name)
                    for file_name in os.listdir(class_directory)
                )

    for class_name, files in all_data.items():
        print(f"Class '{class_name}' has {len(files)} images")

    dataset_splits = {'training': [], 'validation': [], 'testing': []}

    # Split each class separately so that every subset keeps the same class balance.
    for class_name, files in all_data.items():
        # Stage 1: separate the training data from the held-out remainder.
        training_dataset, temporary_dataset = train_test_split(files, test_size=(1 - split_ratios[0]), random_state=42)

        # Stage 2: divide the remainder between validation and testing.
        validation_dataset, testing_dataset = train_test_split(temporary_dataset, test_size=split_ratios[2]/(split_ratios[1] + split_ratios[2]), random_state=42)

        dataset_splits['training'].extend(training_dataset)
        dataset_splits['validation'].extend(validation_dataset)
        dataset_splits['testing'].extend(testing_dataset)

    return dataset_splits

dataset_directories = [lung_dataset, colon_dataset]
dataset_splits = prepare_splits(dataset_directories)

print(f"Training dataset has {len(dataset_splits['training'])} images")
print(f"Validation dataset has {len(dataset_splits['validation'])} images")
print(f"Testing dataset has {len(dataset_splits['testing'])} images")

Class 'lung_aca' has 5000 images
Class 'lung_n' has 5000 images
Class 'lung_scc' has 5000 images
Class 'colon_aca' has 5000 images
Class 'colon_n' has 5000 images
Training dataset has 20000 images
Validation dataset has 2500 images
Testing dataset has 2500 images
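
The counts above follow from the two-stage split: the second `test_size` is relative to the held-out 20 %, not to the full dataset. The sketch below reproduces the arithmetic without scikit-learn. The helper name `split_sizes` is hypothetical, and the rounding rule (held-out size rounded up, remainder kept) is an assumption about how a float `test_size` is handled.

```python
import math


def split_sizes(n_samples, split_ratios=(0.8, 0.1, 0.1)):
    """Return (train, validation, test) counts for a two-stage split."""
    train_ratio, val_ratio, test_ratio = split_ratios

    # Stage 1: hold out everything that is not training data.
    n_holdout = math.ceil((1 - train_ratio) * n_samples)
    n_train = n_samples - n_holdout

    # Stage 2: divide the held-out part between validation and testing.
    # The test share is relative to the held-out part: 0.1 / (0.1 + 0.1) = 0.5.
    relative_test = test_ratio / (val_ratio + test_ratio)
    n_test = math.ceil(relative_test * n_holdout)
    n_val = n_holdout - n_test

    return n_train, n_val, n_test


per_class = split_sizes(5000)             # one class with 5000 images
total = tuple(5 * n for n in per_class)   # five classes of equal size
print(per_class, total)
```

For 5000 images per class this gives 4000 / 500 / 500 per class and 20000 / 2500 / 2500 in total, matching the printed output.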


### Dataset Class
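Defines a custom PyTorch `Dataset` that loads images from a list of file paths. The label of each image is the index of its parent directory name in the list of classes.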

In [4]:
class CancerDetectionDataset(Dataset):
    def __init__(self, file_paths, all_classes, transform=None):
        self.file_paths = file_paths
        self.all_classes = all_classes
        self.labels = [all_classes.index(os.path.basename(os.path.dirname(file_path))) for file_path in file_paths]
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)
    
    def __getitem__(self, idx):
        image_path = self.file_paths[idx]
        label = self.labels[idx]
        image = Image.open(image_path).convert('RGB')

        if self.transform:
            image = self.transform(image)
        return image, label
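
The label lookup in `__init__` can be illustrated in isolation. This is a small sketch with a hypothetical helper name (`label_from_path`) and a hypothetical file name. It also shows why the class list should contain only real class directories: any extra entry shifts the indices.

```python
import os

classes = ["colon_aca", "colon_n", "lung_aca", "lung_n", "lung_scc"]


def label_from_path(file_path, all_classes):
    """Return the class index encoded by the image's parent directory."""
    class_name = os.path.basename(os.path.dirname(file_path))
    return all_classes.index(class_name)


# Hypothetical file name; only the parent directory matters for the label.
example_path = os.path.join("lung_image_sets", "lung_n", "example.jpeg")
label = label_from_path(example_path, classes)

# A stray entry such as '.DS_Store' at the front of the list shifts every index by one.
shifted = label_from_path(example_path, [".DS_Store"] + classes)
print(label, shifted)
```

With five classes, labels are expected to lie in the range 0 to 4. A shifted list produces the label 5 for `lung_scc`, which a five-output classifier cannot represent.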

## Mean and Standard Deviation
Overall, normalization and standardization help to stabilize and speed up the training of machine learning models.

These are the main reasons:
- Subtracting the mean from each image centers the data around zero.
- Dividing by the standard deviation scales the data to have unit variance.
- Normalized inputs can lead to faster and improved convergence during training because the gradients are more stable and the optimization is more efficient.
- Normalization ensures that all input features are within the same, consistent range.

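Calculating the mean and standard deviation of this specific dataset, rather than reusing standard values, has several advantages:
- Dataset specificity: the statistics reflect the actual color distribution of the images.
- Improved model performance: the inputs are actually centered and scaled.
- Avoiding bias: statistics from a different image domain could leave the channels off-center.
- Consistency: the same values are applied to every image.

Computing the mean and standard deviation from the dataset itself ensures that the normalization matches the data, which tends to improve model performance and make the results more reliable.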
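
Written out, the centering and scaling described above are applied per color channel. The notation is introduced here for illustration only: $x_{c,i}$ is pixel $i$ of channel $c$ (values in $[0, 1]$ after `ToTensor`), $N$ is the number of pixels per channel over which the statistics are taken, and $\mu_c$ and $\sigma_c$ are the channel mean and standard deviation.

$$
\mu_c = \frac{1}{N} \sum_{i=1}^{N} x_{c,i}, \qquad
\sigma_c = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_{c,i} - \mu_c \right)^2}, \qquad
\hat{x}_{c,i} = \frac{x_{c,i} - \mu_c}{\sigma_c}
$$

After this transformation, each channel has approximately zero mean and unit variance over the data used to compute $\mu_c$ and $\sigma_c$.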

In [None]:
def get_image_paths(dataset_splits):
    keys = ['training', 'validation', 'testing']
    image_paths = []

    for key in keys:
        image_paths.extend(dataset_splits[key])
    
    return image_paths

# Transform used only for computing the statistics (no normalization yet).
# The random augmentations match the final transform; note that rotation
# fills the image corners with black pixels, which enters the statistics.
stats_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=(-45, 45)),
    transforms.ToTensor()
])

def calculate_mean_std(loader):
    """Average the per-image channel means and standard deviations."""
    mean = torch.zeros(3)
    std = torch.zeros(3)
    total_images_count = 0
    for images, _ in loader:
        batch_samples = images.size(0)
        # Flatten height and width: (batch, channels, pixels).
        images = images.view(batch_samples, images.size(1), -1)
        mean += images.mean(2).sum(0)
        std += images.std(2).sum(0)
        total_images_count += batch_samples

    mean /= total_images_count
    std /= total_images_count

    return mean, std

if __name__ == '__main__':
    dataset = CancerDetectionDataset(get_image_paths(dataset_splits), lc_classes, transform=stats_transform)
    loader = DataLoader(dataset, batch_size=64, shuffle=False)

    mean, std = calculate_mean_std(loader)
    print(f'Mean: {torch.round(mean, decimals=3)}')
    print(f'Std: {torch.round(std, decimals=3)}')

Mean: tensor([0.6440, 0.5290, 0.7740])
Std: tensor([0.2620, 0.2520, 0.2800])
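
The function above averages the standard deviations of the individual images, which approximates the standard deviation over all pixels of the dataset. A possible alternative, sketched here under assumptions, accumulates the sum and the sum of squares per channel and derives the dataset-wide values from them. The name `channel_stats` and the synthetic input are hypothetical, and the values printed above come from the original approach, not from this sketch.

```python
import torch


def channel_stats(batches):
    """Per-channel mean and std over all pixels of all images.

    `batches` yields float tensors of shape (batch, channels, height, width).
    """
    pixel_sum = None
    pixel_sq_sum = None
    pixel_count = 0
    for images in batches:
        images = images.to(torch.float64)  # reduce rounding error in the sums
        if pixel_sum is None:
            channels = images.size(1)
            pixel_sum = torch.zeros(channels, dtype=torch.float64)
            pixel_sq_sum = torch.zeros(channels, dtype=torch.float64)
        pixel_sum += images.sum(dim=(0, 2, 3))
        pixel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))
        pixel_count += images.size(0) * images.size(2) * images.size(3)

    mean = pixel_sum / pixel_count
    # Var[x] = E[x^2] - E[x]^2 (population variance); clamp guards against
    # tiny negative values caused by floating-point rounding.
    std = (pixel_sq_sum / pixel_count - mean ** 2).clamp(min=0).sqrt()
    return mean, std


# Synthetic stand-in for image batches (random values, not the real dataset).
torch.manual_seed(0)
fake_batches = [torch.rand(4, 3, 8, 8) for _ in range(3)]
mean, std = channel_stats(fake_batches)
```

With a real `DataLoader` that yields `(images, labels)` pairs, the images could be passed as `channel_stats(images for images, _ in loader)`.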


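## Transforms
The transform resizes each image to 256 × 256 pixels, applies random flips and rotations as data augmentation, converts the image to a tensor, and normalizes it with the mean and standard deviation calculated above.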

In [70]:
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=(-45, 45)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.644, 0.529, 0.774], std=[0.262, 0.252, 0.280])
])