# Prerequisites

Run `01_data_exploration.ipynb` before this notebook, which depends on the output of the data exploration step.

## Dataset Requirements

- `metadata_updated.csv` must be stored in the `data` folder in your project root directory
- Images must be stored in the `data/raw_dataset/images` folder
- Expected structure:
  - `data/metadata_updated.csv` - Updated metadata file
  - `data/raw_dataset/images/` - Contains all dermatological images

## Notebook Structure

This notebook follows a structured workflow:
- Loading required libraries
- Defining constants and paths
- Splitting data into training (60%), validation (20%), and testing (20%) sets using stratified sampling
- Creating a custom PyTorch dataset class for skin lesion images
- Implementing data transformation pipelines for training and validation/testing
- Applying data augmentation techniques to increase dataset diversity
- Creating and storing processed datasets as PyTorch tensors for efficient model training

## Outputs

This notebook provides the following outputs:
- Augmented images inside the `data/augmented_images` folder
- Processed datasets `train_dataset.pt`, `val_dataset.pt` and `test_dataset.pt` inside the `data/processed` folder

# Libraries

In [1]:
# System libraries
import os
from pathlib import Path

# Third-party libraries
import torch
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Local imports
from scd.utils.common import get_test_transforms


# Constants

In [2]:
random_state = 42

# Define paths
root_dir = Path.cwd().parent
data_dir = root_dir / 'data'
image_path = data_dir / 'raw_dataset' / 'images'
augmented_image_path = data_dir / 'augmented_images'
metadata_path = data_dir / 'metadata_updated.csv'
processed_data_dir = data_dir / 'processed'

# Load Metadata

In [3]:
# Load metadata
metadata = pd.read_csv(metadata_path)
metadata.head()

| Unnamed: 0.1 | Unnamed: 0 | DDI_ID | image_id | skin_tone | malignant | disease | strata |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 000001.png | 56 | 1 | melanoma-in-situ | 56_1 |
| 1 | 1 | 2 | 000002.png | 56 | 1 | melanoma-in-situ | 56_1 |
| 2 | 2 | 3 | 000003.png | 56 | 1 | mycosis-fungoides | 56_1 |
| 3 | 3 | 4 | 000004.png | 56 | 1 | squamous-cell-carcinoma-in-situ | 56_1 |
| 4 | 4 | 5 | 000005.png | 12 | 1 | basal-cell-carcinoma | 12_1 |


# Split Training, Validation and Testing Set

We split the dataset into training (60%), validation (20%) and testing (20%) sets using stratified sampling on the `malignant` label, so that each subset keeps the same proportion of malignant and benign cases as the full dataset. This matters for both training and evaluation, especially with imbalanced datasets. Note that the split stratifies on `malignant` only; to also balance skin tones across the subsets, the `strata` column (which appears to combine skin tone and label) would have to be passed to `stratify` instead.

We first split the data into train (60%) and a temporary set (40%), then further divide the temporary set into validation and test sets of equal size.

In [4]:
# First split: 60% training, 40% temporary hold-out set
train_df, temp_df = train_test_split(
  metadata,
  test_size=0.4,
  stratify=metadata['malignant'],
  random_state=random_state
)

# Second split: divide the hold-out set equally into validation and test sets
val_df, test_df = train_test_split(
  temp_df,
  test_size=0.5,
  stratify=temp_df['malignant'],
  random_state=random_state
)
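
The effect of stratification can be checked on a small synthetic table. The sketch below is not part of the original pipeline: the toy DataFrame, its 30% malignant ratio and the `split_report` name are assumptions made for illustration. It applies the same two-stage split and reports the size and malignant ratio of each subset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the metadata: 1000 rows, 30% malignant (hypothetical values)
toy = pd.DataFrame({
    "image_id": [f"{i:06d}.png" for i in range(1000)],
    "malignant": [1] * 300 + [0] * 700,
})

# Same two-stage split as in the notebook: 60% train, then 20% / 20%
train, temp = train_test_split(
    toy, test_size=0.4, stratify=toy["malignant"], random_state=42
)
val, test = train_test_split(
    temp, test_size=0.5, stratify=temp["malignant"], random_state=42
)

# Collect (number of rows, malignant ratio) for each subset
split_report = {
    name: (len(part), round(float(part["malignant"].mean()), 3))
    for name, part in [("train", train), ("val", val), ("test", test)]
}
for name, (size, ratio) in split_report.items():
    print(f"{name}: {size} rows, malignant ratio {ratio}")
```

If the ratios differ noticeably between subsets on real data, the stratification column is probably missing or contains classes with very few members.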

# Prepare Dataset

We create a custom PyTorch dataset class (`SkinDataset`) to efficiently load and preprocess skin lesion images for our deep learning model. The dataset class handles:

1. Loading images from file paths using data stored in our metadata DataFrame
2. Applying provided transformations to the images
3. Pairing each image with its corresponding label (malignant or benign) and filename

In [5]:
class SkinDataset(Dataset):
    """
    Custom Dataset for loading skin images and their labels.

    """
    def __init__(self, dataframe: pd.DataFrame, data_dir: Path, transform: callable):
        """
        Initializes the SkinDataset with a DataFrame, data directory, and transformations.

        Parameters
        ----------
        dataframe : pandas.DataFrame
            DataFrame containing image file names (`image_id`) and labels (`malignant`).
        data_dir : pathlib.Path
            Root data directory containing `raw_dataset/images` and `augmented_images`.
        transform : callable
            Albumentations-style transformation, called as `transform(image=...)`.
        """
        self.dataframe = dataframe.reset_index(drop=True)
        self.raw_dataset_path = data_dir / 'raw_dataset' / 'images'
        self.aug_dataset_path = data_dir / 'augmented_images'
        self.transform = transform

    def __len__(self) -> int:
        """ 
        Returns the number of samples in the dataset. 

        Returns
        -------
        int
            Number of samples in the dataset.
        """
        return len(self.dataframe)

    def __getitem__(self, idx: int) -> tuple:
        """
        Retrieves an image and its label by index.

        Parameters
        ----------
        idx : int
            Index of the sample to retrieve.

        Returns
        -------
        tuple
            A tuple containing the transformed image, its label, and its filename.
        """
        # Create image path with image directory and filename
        filename = str(self.dataframe.loc[idx, 'image_id'])
        if 'aug' in filename:
            img_path = os.path.join(self.aug_dataset_path, filename) # use augmented images path
        else:
            img_path = os.path.join(self.raw_dataset_path, filename) # use raw images path
        
        # Retrieve the label for the image
        label = self.dataframe.loc[idx, 'malignant']

        # Load the image, convert to RGB, and apply transformations
        image = Image.open(img_path).convert("RGB")
        image_np = np.array(image)
        image = self.transform(image=image_np)['image']

        # Return the transformed image and its label and filename
        return image, label, filename
    
    def get_labels(self) -> np.ndarray:
        """
        Returns the labels of the dataset.

        Returns
        -------
        numpy.ndarray
            Array of labels.
        """
        return self.dataframe['malignant'].values

# Data Transformation

The `transformations()` function builds one of two transformation pipelines:

1. **Training Transformations:** Apply various random modifications to training images to help the model learn more robust features:
   - Resize images to the provided size
   - Random horizontal and vertical flips to simulate different orientations
   - Random affine transformations (rotation, translation, scaling) to provide positional variance
   - Color jitter to simulate lighting variations
   - Gaussian blur to simulate focus variations in dermatoscopic images
   - Coarse dropout (random erasing) to help the model learn to identify lesions even with partial occlusions
   - Normalisation with ImageNet mean and standard deviation values, followed by conversion to a PyTorch tensor

2. **Augmentation Transformations:** Similar to training transformations but without normalisation and tensor conversion, specifically designed for generating new training examples:
   - All the same random transformations as the training pipeline
   - Outputs numpy arrays instead of tensors for direct saving to disk

In [6]:
def transformations(resize: tuple, for_augmentation: bool = False) -> A.Compose:
    """
    Creates a set of transformations for training or augmentation.

    Parameters
    ----------
    resize : tuple
        Target size in (height, width) format.
    for_augmentation : bool
        If True, omits normalisation and tensor conversion so that the output
        stays a numpy array that can be saved to disk.

    Returns
    -------
    albumentations.Compose
        The augmentation or training transformations.
    """
    # Steps shared by the training and augmentation pipelines
    steps = [
        A.Resize(height=resize[0], width=resize[1]),
        A.HorizontalFlip(p=0.2),
        A.VerticalFlip(p=0.2),
        A.Affine(rotate=(-20, 20), translate_percent=(0.1, 0.1), scale=(0.9, 1.1), p=0.8),
        A.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.0, p=0.5),
        A.GaussianBlur(blur_limit=(5, 5), sigma_limit=(0.1, 2.0), p=0.3),
        A.CoarseDropout(
            num_holes_range=(1, 8),
            hole_height_range=(int(resize[0]*0.05), int(resize[0]*0.1)),
            hole_width_range=(int(resize[1]*0.05), int(resize[1]*0.1)),
            fill=0,
            p=0.2
        ),
    ]

    # The training pipeline additionally normalises with ImageNet statistics
    # and converts the image to a PyTorch tensor
    if not for_augmentation:
        steps += [
            A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            ToTensorV2(),
        ]

    return A.Compose(steps)
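
To make the normalisation step of the training pipeline concrete, here is a minimal numpy sketch of the arithmetic, assuming pixel values are first scaled by 255 (the default `max_pixel_value` of `A.Normalize`). The helper names `normalise` and `denormalise` are hypothetical and exist only for this illustration. The image is kept in height x width x channel layout, as it is before tensor conversion.

```python
import numpy as np

# ImageNet channel statistics, as used in the training pipeline
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalise(image_uint8: np.ndarray) -> np.ndarray:
    """Scale an HxWx3 uint8 image to [0, 1], then standardise each channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD


def denormalise(image_float: np.ndarray) -> np.ndarray:
    """Invert normalise(), for example to display a processed image."""
    restored = (image_float * IMAGENET_STD + IMAGENET_MEAN) * 255.0
    return np.clip(np.rint(restored), 0, 255).astype(np.uint8)


# Random 4x4 RGB image as a stand-in for a skin lesion photo
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

normalised = normalise(image)
roundtrip = denormalise(normalised)
print(normalised.dtype, normalised.shape)
```

The inverse function is useful when inspecting stored tensors, since normalised images cannot be displayed directly.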

# Data Augmentation

Data augmentation is a crucial technique for improving model generalisation and performance, especially when working with limited datasets. Our augmentation function applies various transformations to the training data to artificially increase the diversity of the training set.

In [7]:
def augmentation(df: pd.DataFrame, img_path: Path, output_path: Path, transform: A.Compose, num_augmented: int = 3) -> pd.DataFrame:
    """
    Applies Albumentations transform multiple times to each image, 
    saves the augmented versions and combines the new metadata with the original.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with 'image_id' (filename) and 'malignant' (label) columns.
    img_path : Path
        Path to images folder.
    output_path : Path
        Path to save augmented images.
    transform : albumentations.Compose
        The transform pipeline for data augmentation.
    num_augmented : int
        Number of augmented versions per image.

    Returns
    -------
    pd.DataFrame
        The rows of `df` combined with one new row per augmented image.
    """
    # Ensure the output directory exists
    os.makedirs(output_path, exist_ok=True)
    new_records = []

    # Iterate through each row in the DataFrame
    for _, row in df.iterrows():
        filename = row['image_id']
        label = row['malignant']

        image = Image.open(f"{img_path}/{filename}").convert("RGB")
        image_np = np.array(image)

        # Apply the transformation multiple times to create augmented images
        for i in range(num_augmented):
            augmented = transform(image=image_np)['image']

            new_filename = f"{os.path.splitext(filename)[0]}_aug{i}.png"
            save_path = os.path.join(output_path, new_filename)
            Image.fromarray(augmented).save(save_path)

            new_records.append({'image_id': new_filename, 'malignant': label})
    
    # Create a DataFrame from the new records
    augmented_df = pd.DataFrame(new_records)

    print("Sample of augmented data:")
    print(augmented_df.head())
    print('\n\n')

    # Return the input rows combined with the augmented rows
    # (uses the `df` argument rather than the global `train_df`)
    return pd.concat([df, augmented_df], ignore_index=True)
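
The augmented filenames and the folder lookup in `SkinDataset.__getitem__` depend on each other: augmented files are named `<stem>_aug<i>.png`, and any filename containing `aug` is read from the augmented folder. The sketch below mirrors that convention using only the standard library. The helper names `augmented_name` and `resolve_image_path` and the relative `data` directory are assumptions for illustration, not part of the notebook.

```python
from pathlib import Path


def augmented_name(filename: str, index: int) -> str:
    """Mirror the naming scheme used above: '<stem>_aug<index>.png'."""
    return f"{Path(filename).stem}_aug{index}.png"


def resolve_image_path(data_dir: Path, filename: str) -> Path:
    """Pick the source folder the same way SkinDataset.__getitem__ does."""
    if "aug" in filename:
        return data_dir / "augmented_images" / filename
    return data_dir / "raw_dataset" / "images" / filename


data_dir = Path("data")  # hypothetical project-relative data directory

# One original image followed by its three augmented versions
names = ["000005.png"] + [augmented_name("000005.png", i) for i in range(3)]
paths = [resolve_image_path(data_dir, name) for name in names]
for path in paths:
    print(path.as_posix())
```

Because the check is a plain substring match, a raw image whose name happened to contain `aug` would be looked up in the wrong folder; matching on the `_aug<i>` suffix would be a stricter alternative.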

# Create Datasets

The `create_datasets()` function augments the training data and creates `SkinDataset` objects for training, validation, and testing, each with its own transformation pipeline.

In [8]:
def create_datasets(dataframes: tuple, data_dir: Path, augmented_image_path: Path, resize: tuple) -> tuple:
    """
    Create datasets for training, validation, and testing. Uses various transformations as defined in this project.

    Parameters
    ----------
    dataframes : tuple
        A tuple containing three DataFrames: (train_df, val_df, test_df).
    data_dir : Path
        Path to the data directory.
    augmented_image_path : Path
        Path to the directory where augmented images are stored.
    resize : tuple
        Tuple specifying the size to which images should be resized (height, width).
    
    Returns
    -------
    tuple of SkinDataset
        A tuple containing the training, validation, and test datasets.
    """
    train_df, val_df, test_df = dataframes

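    # Initialise transformations
    train_transform = transformations(resize=resize)
    test_transform = get_test_transforms(resize=resize)
    augment_transform = transformations(resize=resize, for_augmentation=True)

    # Augment the training images only; derive the raw image folder from
    # `data_dir` instead of relying on the global `image_path`
    raw_image_path = data_dir / 'raw_dataset' / 'images'
    combined_train_df = augmentation(train_df, raw_image_path, augmented_image_path, augment_transform)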

    # Create datasets
    train_dataset = SkinDataset(combined_train_df, data_dir, transform=train_transform)
    val_dataset = SkinDataset(val_df, data_dir, transform=test_transform)
    test_dataset = SkinDataset(test_df, data_dir, transform=test_transform)

    # Return the datasets
    return train_dataset, val_dataset, test_dataset

# Store Datasets

We store the datasets as PyTorch tensors to avoid redundant preprocessing during model training and evaluation. Each dataset (training, validation, and testing) is stored with:
- Image tensors (already transformed and normalised)
- Label tensors (malignant or benign)
- Filenames for traceability and explainability

In [9]:
# Store as PyTorch tensors
def store_dataset(dataset: Dataset, name: str, output_dir: Path) -> None:
  """
  Store a PyTorch dataset as tensors in a specified directory.

  Parameters
  ----------
  dataset : Dataset
      The dataset to be stored.
  name : str
      Name of the dataset to be saved.
  output_dir : Path
      Directory where the dataset will be saved.
  """

  images = []
  labels = []
  filenames = []

  for i in range(len(dataset)):
    image, label, filename = dataset[i]
    images.append(image)
    labels.append(label)
    filenames.append(filename)
  
  # Convert to tensors
  images = torch.stack(images)
  labels = torch.tensor(labels)
  filenames = np.array(filenames)
  
  # Save tensors
  torch.save({
    'images': images,
    'labels': labels,
    'filenames': filenames
  }, output_dir / f'{name}_dataset.pt')
  
  print(f"Saved {name} dataset with {len(dataset)} samples")
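
For completeness, here is a hedged sketch of how a file with this layout might be read back in a later notebook. The payload is synthetic and the file name `demo_dataset.pt` is hypothetical; only the three keys match what `store_dataset()` writes. Because the file contains a numpy array of filenames alongside the tensors, recent PyTorch releases need `weights_only=False` to unpickle it, which should only be used for files you trust.

```python
import tempfile
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Tiny stand-in with the same keys that store_dataset() writes (synthetic values)
payload = {
    "images": torch.zeros(4, 3, 8, 8),
    "labels": torch.tensor([0, 1, 0, 1]),
    "filenames": np.array(["a.png", "b.png", "c.png", "d.png"]),
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_dataset.pt"
    torch.save(payload, path)
    # weights_only=False because the file holds a numpy array, not only tensors
    loaded = torch.load(path, weights_only=False)

# Wrap the tensors so they can be batched without any further preprocessing
dataset = TensorDataset(loaded["images"], loaded["labels"])
loader = DataLoader(dataset, batch_size=2, shuffle=False)

first_images, first_labels = next(iter(loader))
print(tuple(first_images.shape), first_labels.tolist())
```

The filenames stay outside the `TensorDataset` here; they can be indexed separately whenever a prediction needs to be traced back to its source image.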

# Execute Create and Store Dataset

In [10]:
resize = (384, 384)

# Create datasets
train_dataset, val_dataset, test_dataset = create_datasets(
  (train_df, val_df, test_df),
  data_dir,
  augmented_image_path,
  resize=resize
)

# Create the output directory for the processed datasets if it doesn't exist
os.makedirs(processed_data_dir, exist_ok=True)

# Save all datasets
store_dataset(train_dataset, 'train', processed_data_dir)
store_dataset(val_dataset, 'val', processed_data_dir)
store_dataset(test_dataset, 'test', processed_data_dir)

Sample of augmented data:
          image_id  malignant
0  000005_aug0.png          1
1  000005_aug1.png          1
2  000005_aug2.png          1
3  000188_aug0.png          0
4  000188_aug1.png          0



Saved train dataset with 1572 samples
Saved val dataset with 131 samples
Saved test dataset with 132 samples
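
The sample counts in the log can be reproduced with a little arithmetic. The sketch below assumes a total of 656 images, a figure inferred from the log (1572 / 4 = 393 training originals, plus 131 and 132) rather than stated in the notebook, and assumes that each held-out part is rounded up, as scikit-learn does for a fractional `test_size`. The function `split_sizes` and its parameter names are hypothetical.

```python
import math


def split_sizes(n_total: int, holdout_frac: float = 0.4, test_frac: float = 0.5) -> tuple:
    """Sizes of a two-stage split where each held-out part is rounded up."""
    n_holdout = math.ceil(n_total * holdout_frac)
    n_train = n_total - n_holdout
    n_test = math.ceil(n_holdout * test_frac)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


n_total = 656        # inferred from the log above, not stated in the notebook
num_augmented = 3    # default of augmentation()

n_train, n_val, n_test = split_sizes(n_total)

# Each training image is stored once as-is plus num_augmented augmented copies
n_train_stored = n_train * (1 + num_augmented)
print(n_train, n_val, n_test, n_train_stored)
```

Under these assumptions the stored training set is four times the size of the original training split, while validation and test sets contain original images only.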
