# Introduction
This notebook presents a solution to the Histopathologic Cancer Detection challenge on Kaggle, focusing on the detection of cancer in small patches from larger digital pathology scans, utilizing deep learning techniques.

## Understanding the Problem
The objective here is to develop a deep learning algorithm capable of identifying metastatic cancer from small image patches extracted from extensive digital pathology scans. Histopathology refers to the examination of disease signs through microscopic analysis of biopsied or surgically removed tissue specimens, which are stained and placed on glass slides for examination under a microscope.

Lymph Nodes are vital in this context as they are small glands filtering lymph fluid and often the initial site for the spread of breast cancer. The histological analysis of lymph node metastases is crucial in the TNM Classification, the global standard for assessing cancer spread. Due to the extensive area of tissue needing examination and the potential to overlook small metastases, leveraging Machine Learning offers a promising alternative to enhance both accuracy and efficiency in diagnostics.

## Understanding the Data
The dataset for this project is divided into training and testing sets.
The training set comprises roughly 220,000 images, while the test set includes roughly 57,500 images.

Note that this dataset is a subset of the larger PCam dataset. The PCam dataset is known for its probabilistic sampling, which has led to duplicate images, but the Kaggle version has been refined to remove these duplicates for more effective training and evaluation.






## Understanding the Images
In this competition, you're tasked with predicting labels for images in the test folder. The presence of a positive label signifies that the center 32x32px region of a patch contains at least one pixel of tumor tissue. It's important to note that tumor tissue outside this central region doesn't affect the label. The inclusion of the outer region enables the use of fully-convolutional models without zero-padding, ensuring consistent behavior when applied to a whole-slide image.

This problem can be defined as a binary image classification task, where the goal is to differentiate between patches containing tumor tissue and those that do not.
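
The labeling rule above can be sketched in a few lines. This is only an illustration: the Kaggle data ships labels rather than pixel-level tumor masks, so `tumor_mask`, `center_bounds`, and `patch_label` are hypothetical names, and the sketch assumes the 32x32 window is centered exactly in the 96x96 patch.

```python
PATCH_SIZE = 96    # full patch edge length in pixels
CENTER_SIZE = 32   # edge length of the region that decides the label

def center_bounds(patch_size=PATCH_SIZE, center_size=CENTER_SIZE):
    """Return (start, stop) indices of the central window along one axis."""
    start = (patch_size - center_size) // 2
    return start, start + center_size

def patch_label(tumor_mask):
    """Return 1 if any pixel inside the central window is tumor, else 0.

    tumor_mask is a square list of rows holding 0/1 values (hypothetical input).
    """
    start, stop = center_bounds(len(tumor_mask), CENTER_SIZE)
    return int(any(any(row[start:stop]) for row in tumor_mask[start:stop]))

# A tumor pixel in the corner lies outside the central window, so the label stays 0
mask = [[0] * PATCH_SIZE for _ in range(PATCH_SIZE)]
mask[0][0] = 1
label_outside = patch_label(mask)

# A single tumor pixel at the patch center is enough to flip the label to 1
mask[48][48] = 1
label_inside = patch_label(mask)
```

Under these assumptions the central window spans pixels 32 to 63 on each axis, which is why tumor tissue near the border does not change the label.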

## Understanding the Evaluation Metric
The evaluation metric used is the Area Under the Receiver Operating Characteristic (ROC) Curve, often abbreviated as AU-ROC or AUC. This metric is crucial for assessing the performance of classification models.

The ROC curve is a graphical representation of the model's performance across various threshold settings. It plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity). The AUC score quantifies the model's ability to distinguish between the classes, with higher values indicating better performance.

In essence, a high AUC suggests that the model is proficient at correctly identifying positive cases as positive and negative cases as negative. Thus, the AUC serves as a measure of the model's discriminatory power, crucial for tasks like distinguishing between diseased and healthy individuals.
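
One way to make the metric concrete: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. The function name `roc_auc` and the toy scores are illustrative assumptions, not part of the competition code, and a library routine would be preferable on real data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
```

Because only the ordering of the scores matters, the AUC does not depend on any particular decision threshold, and a model that assigns the same score to everything lands at 0.5.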







### Step 1 : Adding Dependencies

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import numpy as np
import pandas as pd
import keras
import shutil
import time
import itertools
from keras import layers
from tensorflow import data as tf_data
import matplotlib.pyplot as plt
import tensorflow as tf

# Import useful sklearn functions
import sklearn
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Import Tensorflow functions
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

### Step 2 : Setting GPU Memory Consumption Growth

This code makes sure that TensorFlow does not claim all available GPU memory up front and instead allocates only what it needs.

In [None]:
# Avoid OOM errors by setting GPU Memory Consumption Growth
# Grab all the GPUs available on the machine
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus: #looping through every potential GPUs here
    tf.config.experimental.set_memory_growth(gpu, True)

### Step 3 : Loading Dataset from Kaggle

We will download the raw zip archive using the opendatasets library, so let's install that first:

In [None]:
!pip install opendatasets -q #It will do the installation in quiet mode

These files and folders will be downloaded in our local instance:

In [None]:
base_dir = '../input/histopathologic-cancer-detection/'
print(os.listdir(base_dir))

FileNotFoundError: [Errno 2] No such file or directory: '../input/histopathologic-cancer-detection/'

Make sure you key in your Kaggle credentials before downloading. Your Kaggle username and API key will be required to proceed.

In [None]:
import opendatasets as od

od.download("https://www.kaggle.com/c/histopathologic-cancer-detection/data")
print("Kaggle dataset was successfully downloaded on local instance..............................")

Now let's load the labels in pandas dataframe:

In [None]:
full_train_df = pd.read_csv("../input/histopathologic-cancer-detection/train_labels.csv")
full_train_df.head()

### Step 4 : Exploratory Data Analysis (EDA)

Let's start with the count first. How many images do we have in training as well as test datasets?

In [None]:
print("Train Size: {}".format(len(os.listdir('../input/histopathologic-cancer-detection/train/'))))
print("Test Size: {}".format(len(os.listdir('../input/histopathologic-cancer-detection/test/'))))

As mentioned on the Kaggle website, we have 220,025 images in the training folder and 57,458 images in the test folder. When we build our model, we will use only the training data; the model should not see the test data at all. So let's check the class distribution of our training dataset first!

#### Check the Class Distribution

In [None]:
labels_count = full_train_df.label.value_counts()

# Plot a pie chart to visualize label distribution
plt.figure(figsize=(8, 8))
plt.pie(labels_count, labels=['Healthy', 'Cancer'], startangle=180,
        autopct='%1.1f', colors=['#00ff99', '#FF96A7'], shadow=False)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Distribution of Labels')
plt.show()

The classes are split roughly 60:40 (healthy to cancer).

In [None]:
print(full_train_df.shape)

#### Set Hyperparameters

This step is crucial, and I've experimented with various hyperparameters through multiple iterations to gain insight into my model's performance under different conditions. Given the substantial size of our training data, it's impractical to run the model on the entire dataset each time, especially at the outset. Even with GPUs on Kaggle notebooks, it can take several hours. To mitigate this, I've opted to sample a smaller subset of images: 10,000 from cancer patients and another 10,000 from healthy patients.

This sampling strategy ensures a perfectly balanced dataset for both labels, which often leads to improved model performance. Once we're confident in our model's behavior and performance with this smaller dataset, we can then proceed to train the model on larger chunks of data or even the entire dataset if time permits. This staged approach allows us to iterate efficiently and make informed decisions about hyperparameters while managing computational resources effectively.







In [None]:
SAMPLE_SIZE = 10000  # number of images sampled from each of the two classes
IMAGE_SIZE = 96  # pixel dimension of the provided images; fixed by the dataset, so not a hyperparameter
EPOCHS = 20  # number of passes over the training data; more epochs help as long as the model is not overfitting
BATCH_SIZE = 32  # number of samples processed before the model parameters are updated
LEARNING_RATE = 0.0003  # step size the optimizer uses when minimizing the loss
LR_REDUCE_FACTOR = 0.5  # factor the learning rate is multiplied by once learning stagnates (0.5 halves it)

#### Create the Training and Validation Sets

Here we are building our own dataframe, which is perfectly balanced and also does not contain the entire dataset.

In [None]:
#Take a random sample of class-0 with size equal to number of samples
df_0 = full_train_df[full_train_df['label'] == 0].sample(SAMPLE_SIZE, random_state = 101)
#Take a random sample of class-1 with size equal to number of samples
df_1 = full_train_df[full_train_df['label'] == 1].sample(SAMPLE_SIZE, random_state = 101)

#Concat both the dataframes
sampled_df = pd.concat([df_0, df_1], axis=0).reset_index(drop=True)
#Shuffle them
sampled_df = shuffle(sampled_df)
#Check the class distribution
sampled_df['label'].value_counts()

In [None]:
print(sampled_df.shape)

#### Create a Directory Structure for Images

We'll establish a folder structure for organizing our sample image files. First, we'll create a parent folder named "PatientImages". Within this parent folder, we'll create two subfolders: "cancer_images" and "healthy_images". This setup separates the sample images by class, which is what the directory-based image loader expects.



In [None]:
# Create a new directory
base_dir = 'PatientImages'

# Check whether the specified path exists or not
isExist = os.path.exists(base_dir)
if isExist:
    #Delete PatientImages if already there
    shutil.rmtree('PatientImages')

# Create a new directory because it does not exist
os.mkdir(base_dir)

In [None]:
# create a path to 'base_dir' to which we will join the names of the new folders
# cancer_images
cancer_dir = os.path.join(base_dir, 'cancer_images')
os.mkdir(cancer_dir)

# no_cancer_images
no_cancer_dir = os.path.join(base_dir, 'healthy_images')
os.mkdir(no_cancer_dir)

In [None]:
# check that the sub-folders have been created as expected
os.listdir('PatientImages')

#### Transfer Images into Folders

In [None]:
df_0_arr = np.array(df_0['id'])
df_1_arr = np.array(df_1['id'])

In [None]:
import shutil
import os

# Iterate over each image ID in df_0_arr
for image in df_0_arr:
    # Construct the filename by adding the ".tif" extension
    fname = image + '.tif'

    # Define the source and destination paths
    src = os.path.join('../input/histopathologic-cancer-detection/train/', fname)
    dst = os.path.join('PatientImages/healthy_images', fname)

    # Copy the image from the source to the destination
    shutil.copyfile(src, dst)


In [None]:
# Transfer the cancer images
for image in df_1_arr:
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif'
    # source path to image
    src = os.path.join('../input/histopathologic-cancer-detection/train/', fname)
    # destination path to image
    dst = os.path.join('PatientImages/cancer_images', fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)

In [None]:
# Check how many images were copied into each class folder
print(len(os.listdir('PatientImages/cancer_images')))
print(len(os.listdir('PatientImages/healthy_images')))

# thats a ton

#### Get Rid of Bad Images

It's essential to identify and remove any corrupt images from our dataset to ensure the quality of our training data. This process is a crucial step in data cleaning for image datasets. Below is how we can accomplish this:



In [None]:
import cv2
import imghdr  # note: deprecated since Python 3.11 and removed in Python 3.13

data_dir = 'PatientImages'
image_exts = ['tiff', 'tif']
for image_class in os.listdir(data_dir):
    for image in os.listdir(os.path.join(data_dir, image_class)):
        image_path = os.path.join(data_dir, image_class, image)
        try:
            # cv2.imread returns None instead of raising when a file cannot be decoded
            img = cv2.imread(image_path)
            tip = imghdr.what(image_path)
            if img is None or tip not in image_exts:
                print('Removing unreadable or non-TIFF image {}'.format(image_path))
                os.remove(image_path)
        except Exception as e:
            print('Issue with image {}: {}'.format(image_path, e))

So no more dodgy images so far!

#### Create Dataset for Image Processing

To generate a new dataset use the following:

In [None]:
file_path = 'PatientImages'
num_train_samples = SAMPLE_SIZE * 2
train_batch_size = BATCH_SIZE
val_batch_size = BATCH_SIZE

train_steps = None
val_steps = None

In [None]:
datagen = ImageDataGenerator(rescale=1.0/255) #Data Scaling
alldata_gen = datagen.flow_from_directory(file_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='categorical')

In [None]:
# Get the labels that are associated with each index
print(alldata_gen.class_indices)
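
As far as the Keras documentation describes it, `flow_from_directory` assigns class indices in alphanumeric order of the subfolder names. The sketch below mimics that mapping without TensorFlow; `directory_class_indices` is a hypothetical helper written for illustration, not a Keras function.

```python
def directory_class_indices(subfolders):
    """Mimic the alphanumeric folder-to-index mapping of flow_from_directory."""
    return {name: index for index, name in enumerate(sorted(subfolders))}

# The two class folders created earlier in this notebook
class_indices = directory_class_indices(['healthy_images', 'cancer_images'])

# Position of the tumor class in each one-hot label and in each softmax output row
tumor_column = class_indices['cancer_images']
```

With these folder names the tumor class lands at index 0 and the healthy class at index 1, which matters whenever a prediction column is interpreted as "probability of tumor".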

### Step 5 : Data Visualization - Visualizing Some Training Images

Let's see what the images look like.

In [None]:
imgpath ="histopathologic-cancer-detection/train/" # training data is stored in this folder
malignant = full_train_df.loc[full_train_df['label']==1]['id'].values    # get the ids of malignant cases
normal = full_train_df.loc[full_train_df['label']==0]['id'].values       # get the ids of the normal cases

In [None]:
from PIL import Image, ImageDraw

def plot_fig(ids,title,nrows=5,ncols=15):

    fig,ax = plt.subplots(nrows,ncols,figsize=(18,6))
    plt.subplots_adjust(wspace=0, hspace=0)
    for i,j in enumerate(ids[:nrows*ncols]):
        fname = os.path.join(imgpath ,j +'.tif')
        img = Image.open(fname)
        idcol = ImageDraw.Draw(img)
        idcol.rectangle(((0,0),(95,95)),outline='white')
        plt.subplot(nrows, ncols, i+1)
        plt.imshow(np.array(img))
        plt.axis('off')

    plt.suptitle(title, y=0.94)

In [None]:
plot_fig(malignant,'Cancer Cases')

In [None]:
plot_fig(normal,'Non-Malignant Cases')

### Step 6 : Loading the Test data

To create a test folder directory structure and copy all test images from the source folder into the 'test_images' subfolder, we can use the following code:



In [None]:
# create test_dir
test_dir = 'test_dir'

# Check whether the folder exists or not
isExist = os.path.exists(test_dir)
if isExist:
    shutil.rmtree('test_dir')
# Create a new directory because it does not exist
os.mkdir(test_dir)

# create test_images inside test_dir
test_images = os.path.join(test_dir, 'test_images')
# Create a new directory because it does not exist
os.mkdir(test_images)

test_path = 'histopathologic-cancer-detection/test'

# Transfer the test images into image_dir
for image in os.listdir(test_path):
    # source path to image
    src = os.path.join(test_path, image)
    # destination path to image
    dst = os.path.join('test_dir/test_images', image)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)

In [None]:
# check how many test images we have in the folder
print(len(os.listdir('test_dir/test_images')))

Now load the test data into image data generators.

In [None]:
datagen = ImageDataGenerator(rescale=1.0/255) #Data Scaling
# Note: shuffle=False keeps the test images in file order, so predictions line up with the filenames
test_gen = datagen.flow_from_directory('test_dir',
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='categorical',
                                        shuffle=False)

### Step 7 : Create the Model Architecture

We will be building our own deep Convolutional Neural Network from scratch. I am not using any pre-trained models for image classification in this project.

In [None]:
kernel_size = (3,3)
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128

dropout_conv = 0.3
dropout_dense = 0.3

model = Sequential()
model.add(Conv2D(first_filters, kernel_size, activation = 'relu', input_shape = (96, 96, 3)))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(2, activation = "softmax"))

model.summary()
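
To sanity-check the summary, the spatial sizes can be traced by hand. The sketch below assumes the Keras defaults used in the model above: unpadded ('valid') 3x3 convolutions with stride 1, and 2x2 max pooling with stride 2. The helper names `conv_valid` and `max_pool` are illustrative only.

```python
def conv_valid(size, kernel=3):
    """Output edge length of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def max_pool(size, pool=2):
    """Output edge length of max pooling with stride equal to the pool size."""
    return size // pool

size = 96
trace = [size]                      # edge length at the input and after each block
for filters in (32, 64, 128):
    for _ in range(3):              # three 3x3 convolutions per block
        size = conv_valid(size)
    size = max_pool(size)           # one 2x2 max pooling per block
    trace.append(size)

flatten_units = size * size * 128          # features entering the first Dense layer
dense_params = flatten_units * 256 + 256   # weights plus biases of Dense(256)
print(trace, flatten_units, dense_params)
```

Under these assumptions the feature map shrinks from 96 to 45, 19, and finally 6 pixels per side, so the first dense layer receives 4,608 features and holds most of the network's parameters.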

### Step 8 : Train-Validation Split Data (80% : 20%)

In [None]:
train_size = int(num_train_samples*.8)
val_size = int(num_train_samples*.2)

print(f'The Training data size is : {train_size}')
print(f'The Validation data size is : {val_size}')

# Cast to int because Keras expects whole numbers of steps
train_steps = int(np.ceil(train_size / train_batch_size))
val_steps = int(np.ceil(val_size / val_batch_size))

print(f'The number of Training Steps in each epoch will be : {train_steps}')
print(f'The number of Validation Steps in each epoch will be : {val_steps}')

In [None]:
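# Split at the image level so training and validation never share files
split_datagen = ImageDataGenerator(rescale=1.0/255, validation_split=0.2)
flow_args = dict(target_size=(IMAGE_SIZE, IMAGE_SIZE), batch_size=BATCH_SIZE, class_mode='categorical')
train = split_datagen.flow_from_directory(file_path, subset='training', **flow_args)
val = split_datagen.flow_from_directory(file_path, subset='validation', **flow_args)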

### Step 9 : Train the CNN Model

We will be using Adam optimiser for our model.

In [None]:
model.compile(Adam(learning_rate=LEARNING_RATE), loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

We will be creating a folder for our model checkpoint, wherein the best model weights will be saved in that folder.

In [None]:
# Create a new directory
mdl_ckpt = 'Model_Checkpoint'

# Check whether the specified path exists or not
isExist = os.path.exists(mdl_ckpt)
if not isExist:
  # Create a new directory because it does not exist
  os.mkdir(mdl_ckpt)
  print("The new directory is created!")

In [None]:
model_checkpoint_callback = ModelCheckpoint(mdl_ckpt, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

Implementing a callback to reduce the learning rate when a metric has stopped improving can significantly enhance model training. Typically, models experience performance plateaus, where further training may not yield substantial improvements. In such cases, reducing the learning rate by a factor of 2-10 can help the model navigate these plateaus and continue learning effectively. This callback continuously monitors a specified metric during training and, if no improvement is observed for a defined number of epochs (patience), it adjusts the learning rate accordingly.



In [None]:
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', factor=LR_REDUCE_FACTOR, patience=2, verbose=1, mode='max', min_lr=0.00001)
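
The behaviour of this callback can be sketched without TensorFlow. The function below is a simplified stand-in, not the Keras implementation: it ignores the `min_delta` and `cooldown` options, and the accuracy values are made up for illustration.

```python
def simulate_reduce_on_plateau(val_accuracy, lr=0.0003, factor=0.5,
                               patience=2, min_lr=0.00001):
    """Return the learning rate in effect after each epoch."""
    best = float('-inf')
    wait = 0
    history = []
    for acc in val_accuracy:
        if acc > best:
            # Improvement: remember the new best value and reset the counter
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: shrink the learning rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation accuracies for seven epochs
accs = [0.80, 0.84, 0.83, 0.84, 0.86, 0.85, 0.85]
schedule = simulate_reduce_on_plateau(accs)
```

In this toy run the rate is halved after epochs 4 and 7, each time following two consecutive epochs without a new best validation accuracy.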

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
history = model.fit(train,
                    steps_per_epoch=train_steps,
                    validation_data=val,
                    validation_steps=val_steps,
                    epochs=EPOCHS, verbose=1,
                    callbacks=[model_checkpoint_callback, reduce_lr])

We observed a validation accuracy of 89.675%, a strong result given that the model trained for only 20 epochs on a subset of 20,000 images. Training on all available data and increasing the number of epochs to 50 or more would likely improve on this.

With the entire dataset and a longer training run, the model has more examples to learn from and should generalize better to unseen data. The gains are not guaranteed, though, so the validation metrics should still be watched for signs of overfitting.







### Step 10 : Plot Model Performance

In [None]:
fig = plt.figure()
plt.plot(history.history['loss'], color='brown', label='Training Loss')
plt.plot(history.history['val_loss'], color='green', label='Validation Loss')
fig.suptitle('Model Loss Vs Epoch', fontsize=20)
plt.legend(loc="upper right")
plt.show()

In [None]:
fig = plt.figure()
plt.plot(history.history['accuracy'], color='brown', label='Training Accuracy')
plt.plot(history.history['val_accuracy'], color='green', label='Validation Accuracy')
fig.suptitle('Model Accuracy Vs Epoch', fontsize=20)
plt.legend(loc="upper right")
plt.show()

### Step 11 : Make a Prediction on Test Data

Let's load the best epoch's weights and make the predictions.

In [None]:
# make sure we are using the best epoch
model.load_weights(mdl_ckpt)

In [None]:
num_test_images = 57458
predictions = model.predict(test_gen, steps=num_test_images, verbose=1)

# Is the number of predictions correct? It should be 57458.
print(f'Total number of predictions = {len(predictions)}')

### Step 12 : Create Submission File

In [None]:
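# Put the predictions into a dataframe.
# flow_from_directory orders classes alphanumerically, so column 0 is
# 'cancer_images' and column 1 is 'healthy_images' (see class_indices above)
df_preds = pd.DataFrame(predictions, columns=['has_tumor_tissue', 'no_tumor_tissue'])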

# This outputs the file names in the sequence in which
# the generator processed the test images.
test_filenames = test_gen.filenames

# add the filenames to the dataframe
df_preds['file_names'] = test_filenames

# Create an id column. We will extract the id as shown below
# A file name now has this format:
# test_images/00006537328c33e284c973d7b39d340809f7271b.tif

def extract_id(x):
    # split into a list
    a = x.split('/')
    # split into a list
    b = a[1].split('.')
    extracted_id = b[0]

    return extracted_id

df_preds['id'] = df_preds['file_names'].apply(extract_id)
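
The id extraction above splits on `/`, which assumes forward-slash separators in the generator's filenames. A separator-agnostic alternative can be sketched with `pathlib`; the name `extract_id_from_path` is hypothetical and is not used elsewhere in this notebook.

```python
from pathlib import PurePath

def extract_id_from_path(file_name):
    """Return the file name without its folder and without its extension."""
    return PurePath(file_name).stem

# Example with the file name format shown in the comments above
example_id = extract_id_from_path('test_images/00006537328c33e284c973d7b39d340809f7271b.tif')
```

`PurePath` uses the path rules of the operating system it runs on, so the same function handles whichever separator the generator produces on that machine.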

In [None]:
# Get the predicted labels.
# We were asked to predict a probability that the image has tumor tissue
y_pred = df_preds['has_tumor_tissue']

# get the id column
image_id = df_preds['id']

submission = pd.DataFrame({'id':image_id,  'label':y_pred, }).set_index('id')
submission.to_csv('Final_Predictions.csv', columns=['label'])

In [None]:
submission.head()

### Step 13 : Final Results

When I trained my model on 160,000 sample images, I got a public score of 0.8870 on Kaggle. This is what I got on Kaggle:

<img src="https://github.com/GVworkds/DTSA-5511-Introduction-to-Deep-Learning/blob/main/Leaderboard%20Score-1.png?raw=true">

### Step 14 : Conclusion

This was a huge dataset for image classification, and a deep convolutional network classified the images well. Sampling was a great way to tune our hyperparameters, as every iteration took quite a while to train.

<img src="https://github.com/jamesthesnake/Kaggle-CNN-CU-MSC/blob/main/Screen%20Shot%202024-03-20%20at%2011.23.08%20AM.png?raw=true">

With the current model, we've achieved an AUC score of ~0.89 in detecting metastatic cancer in lymph node patches, well above the 0.5 that random guessing would score. However, there's always room for improvement. Here are a few tweaks we could consider to potentially enhance the model's performance further:

Feature Engineering: Analyze the existing features and consider engineering new features that could provide more discriminative information for the model to learn from. This might involve domain expertise or experimentation with various transformations of the existing features.

Model Architecture: Experiment with different architectures of the model. Perhaps a deeper or wider neural network could capture more complex patterns in the data. Alternatively, consider using more advanced architectures such as attention mechanisms or graph neural networks if the dataset warrants it.

Hyperparameter Tuning: Fine-tune the hyperparameters of the model such as learning rate, batch size, and regularization strength. This can be done through techniques like grid search or random search over a predefined range of values.

Data Augmentation: If the dataset is limited in size, consider augmenting it with techniques such as rotation, scaling, or flipping of images. This can help the model generalize better to unseen data.

Ensemble Methods: Combine multiple models trained on different subsets of the data or using different algorithms. Ensemble methods often lead to improved performance by leveraging the diversity of individual models.

Regularization Techniques: Implement regularization techniques such as dropout or batch normalization to prevent overfitting and improve generalization.

Advanced Preprocessing: Explore advanced preprocessing techniques such as feature scaling, normalization, or handling missing values in a more sophisticated manner.

Domain-Specific Knowledge: Consult with domain experts to incorporate domain-specific knowledge into the model. This could involve incorporating relevant medical literature or insights from practitioners in the field of breast cancer diagnosis.

By carefully implementing these tweaks and iteratively evaluating the model's performance, we can potentially enhance its reliability and predictive power, bringing it closer to achieving an AUC score of 1.
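
Of the ideas above, data augmentation is the easiest to illustrate. The sketch below assumes that tissue patches have no canonical orientation, so rotating or mirroring a patch leaves its label unchanged (the central region stays central under these transforms). It operates on plain nested lists to stay dependency-free; the helper names are hypothetical, and a real pipeline would more likely use the augmentation options of an image library.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in image]

def dihedral_variants(image):
    """Return the 8 rotation/flip variants of a square grid."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)                   # rotation by 0, 90, 180, 270 degrees
        variants.append(flip_horizontal(current))  # the mirrored version of that rotation
        current = rotate90(current)
    return variants

# Tiny 2x2 stand-in for one image channel
patch = [[1, 2],
         [3, 4]]
variants = dihedral_variants(patch)
```

Each training patch then yields up to eight label-preserving views, and the same variants could be averaged at prediction time as a simple form of test-time augmentation.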






