# Lymphoma Diagnosis in Histopathology Images

## Introduction

This Jupyter notebook explores a Convolutional Neural Network (CNN) based approach for diagnosing lymphoma in histopathology images. Lymphoma, a type of cancer that originates in the lymphatic system, can be challenging to diagnose accurately. Histopathology images, which provide microscopic views of tissue samples, are crucial for the identification of cancerous cells.

In this project, we leverage the power of deep learning techniques, specifically employing Convolutional Neural Networks (CNNs), to automate and enhance the process of lymphoma diagnosis. Additionally, we incorporate Neural Architecture Search (NAS) as an optimization technique. NAS allows us to automatically discover optimal neural network architectures, potentially improving both accuracy and efficiency.

## Objectives

- Develop a CNN-based model for accurate lymphoma diagnosis in histopathology images.
- Utilize Neural Architecture Search (NAS) to automatically discover an optimized neural network architecture.
- Evaluate the model's performance on a dataset of histopathology images, considering factors such as accuracy, precision, recall, and F1-score.
- Provide insights into the potential benefits of employing NAS in optimizing deep learning models for medical image analysis.

## Dataset

    For the dataset, we are using “Multi Cancer Dataset” based on a publication of the IEEE Engineering in Medicine and
    Biology Society: “Automatic Classification of Lymphoma Images With Transform-Based Global Features” by Orlov,
    Nikita and Chen, Wayne and Eckley, David and Macura, Tomasz and Shamir, Lior and Jaffe, Elaine and Goldberg, Ilya
    (2010)
This dataset contains:
* 20 000 images of Acute Lymphoblastic Leukemia
* 15 000 images of Brain Cancer
* 10 000 images of Breast Cancer
* 25 000 images of Cervical Cancer
* 10 000 images of Kidney Cancer
* 25 000 images of Lung and Colon Cancer
* 10 000 images of Oral Cancer
* 15 000 images of Lymphoma


#### We are working on those 15 000 images of Lymphoma and they are divided into 3 subclasses as follows:
* 5 000 images of “Chronic Lymphocytic Leukemia”
* 5 000 images of “Follicular Lymphoma”
* 5 000 images of “Mantle Cell Lymphoma”

##### Here is an example on each subclass:
![](https://i.imgur.com/nBKrFie.jpeg)
![](https://i.imgur.com/0P6CkHF.jpeg)
![](https://i.imgur.com/uPQCYpC.jpeg)

######   (a) Chronic Lymphocytic Leukemia  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;     (b) Follicular Lymphoma  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  (c) Mantle Cell Lymphoma
## Methodology

1. **Data Preprocessing**: Prepare and preprocess the histopathology images to make them suitable for training the CNN.
2. **Model Architecture**: Design and implement a CNN architecture for lymphoma diagnosis.
3. **Neural Architecture Search (NAS)**: Apply NAS to automatically search for an optimized neural network architecture.
4. **Model Training**: Train the CNN model on the preprocessed dataset, utilizing the NAS-discovered architecture.
5. **Evaluation**: Evaluate the model's performance using various metrics to assess its accuracy and effectiveness in lymphoma diagnosis.

By the end of this notebook, we aim to present an efficient and accurate deep learning model for automating lymphoma diagnosis, showcasing the potential improvements achieved through the integration of Neural Architecture Search.


### 1. Data

#### 1.1 Path Setup

In [None]:
data_path_cll = "/kaggle/input/multi-cancer/Multi Cancer/Lymphoma/lymph_cll"
data_path_fl = "/kaggle/input/multi-cancer/Multi Cancer/Lymphoma/lymph_fl"
data_path_mcl = "/kaggle/input/multi-cancer/Multi Cancer/Lymphoma/lymph_mcl"

#### 1.2 Loading Data

In [None]:
import os
import cv2
import numpy as np

def load_images_with_labels(folder_path: str, label: int, img_size: tuple = (128, 128)) -> tuple[np.ndarray, np.ndarray]:
    """
    Load images from a specified folder, resize them, and assign labels.

    Parameters:
    - folder_path (str): The path to the folder containing images.
    - label (int): The label to assign to the loaded images.
    - img_size (tuple): The target size for the images after resizing (default: (128, 128)).

    Returns:
    - Tuple of NumPy arrays: (images, labels)
      - images (np.ndarray): Array of resized and normalized images.
      - labels (np.ndarray): Array of corresponding labels.
    """
    images = []
    labels = []

    for filename in os.listdir(folder_path):
        img_path = os.path.join(folder_path, filename)
        print(f"Loading image: {img_path}")

        img = cv2.imread(img_path)

        if img is None:
            print(f"Error loading image: {img_path}")
            continue

        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, img_size)
        images.append(img)
        labels.append(label)

    return np.array(images) / 255.0, np.array(labels)

In [None]:
labels_dict = { 0 : 'Chronic Lymphocytic Leukemia',1 : 'Follicular Lymphoma', 2 : 'Mantle Cell Lymphoma'}

In [None]:
lymph_cll = load_images_with_labels(folder_path = data_path_cll, label = 0)
lymph_fl  = load_images_with_labels(folder_path = data_path_fl , label = 1)
lymph_mcl = load_images_with_labels(folder_path = data_path_mcl, label = 2)

In [None]:
lymph_cll_images, lymph_cll_labels = lymph_cll[0], lymph_cll[1]
lymph_fl_images,  lymph_fl_labels  = lymph_fl[0],  lymph_fl[1]
lymph_mcl_images, lymph_mcl_labels = lymph_mcl[0], lymph_mcl[1]

#### 1.3 Split data into Train, test and validate

In [None]:
from sklearn.model_selection import train_test_split

##### 1.3.1 Train and test Split

In [None]:
#Chronic Lymphocytic Leukemia
X_train_validate_cll, X_test_cll, y_train_validate_cll, y_test_cll = train_test_split(lymph_cll_images, lymph_cll_labels, test_size=0.2, random_state=42)
#Follicular Lymphoma
X_train_validate_fl,  X_test_fl,  y_train_validate_fl,  y_test_fl  = train_test_split(lymph_fl_images,  lymph_fl_labels,  test_size=0.2, random_state=42)
#Mantle Cell Lymphoma
X_train_validate_mcl, X_test_mcl, y_train_validate_mcl, y_test_mcl = train_test_split(lymph_mcl_images, lymph_mcl_labels, test_size=0.2, random_state=42)


##### 1.3.2 Train and validate split

In [None]:
#Chronic Lymphocytic Leukemia
X_train_cll, X_val_cll, y_train_cll, y_val_cll = train_test_split(X_train_validate_cll, y_train_validate_cll, test_size=0.2, random_state=42)
#Follicular Lymphoma
X_train_fl,  X_val_fl,  y_train_fl,  y_val_fl  = train_test_split(X_train_validate_fl,  y_train_validate_fl,  test_size=0.2, random_state=42)
#Mantle Cell Lymphoma
X_train_mcl, X_val_mcl, y_train_mcl, y_val_mcl = train_test_split(X_train_validate_mcl, y_train_validate_mcl, test_size=0.2, random_state=42)


##### 1.3.3 Concatenate Data

In [None]:
X_train = np.concatenate((X_train_cll, X_train_fl,X_train_mcl), axis=0)
X_test  = np.concatenate((X_test_cll,  X_test_fl ,X_test_mcl ), axis=0)
X_val   = np.concatenate((X_val_cll,   X_val_fl  ,X_val_mcl  ), axis=0)

In [None]:
y_train = np.concatenate((y_train_cll, y_train_fl,y_train_mcl), axis=0)
y_test  = np.concatenate((y_test_cll,  y_test_fl ,y_test_mcl ), axis=0)
y_val   = np.concatenate((y_val_cll,   y_val_fl  ,y_val_mcl  ), axis=0)

##### 1.3.4 Shapes Check 

In [None]:
print(f'X_train : {X_train.shape} ,  y_train :  {y_train.shape}')
print(f'X_val   : {X_val.shape} ,  y_val   :  {y_val.shape}  ')
print(f'X_test  : {X_test.shape} ,  y_test  :  {y_test.shape} ')


    9600 Images train, 2400 images validate and 3000 images test

##### 1.3.5 Shuffle

In [None]:
# Generate an array of indices and shuffle them
indices_train = np.arange(X_train.shape[0])
indices_val   = np.arange(X_val.shape[0])
indices_test  = np.arange(X_test.shape[0])

In [None]:
np.random.shuffle(indices_train)
np.random.shuffle(indices_val)
np.random.shuffle(indices_test)

In [None]:
# Use the shuffled indices to shuffle both X_train and y_train
X_train_shuffled = X_train[indices_train]
y_train_shuffled = y_train[indices_train]

In [None]:
# Use the shuffled indices to shuffle both X_val and y_val
X_val_shuffled = X_val[indices_val]
y_val_shuffled = y_val[indices_val]

In [None]:
# Use the shuffled indices to shuffle both X_test and y_test
X_test_shuffled = X_test[indices_test]
y_test_shuffled = y_test[indices_test]

In [None]:
# Free Some memory !
del X_train, X_val, X_test, y_train, y_test, y_val, X_train_cll, X_val_cll, y_train_cll, y_val_cll, X_train_fl,  X_val_fl,  y_train_fl,  y_val_fl, X_train_mcl, X_val_mcl, y_train_mcl, y_val_mcl, X_train_validate_mcl, X_test_mcl, y_train_validate_mcl, y_test_mcl, X_train_validate_fl,  X_test_fl,  y_train_validate_fl,  y_test_fl,X_train_validate_cll, X_test_cll, y_train_validate_cll, y_test_cll 
del lymph_cll_images, lymph_cll_labels, lymph_fl_images,  lymph_fl_labels, lymph_mcl_images, lymph_mcl_labels, lymph_cll, lymph_fl, lymph_mcl

### 1.4 EDA (Exploratory Data Analysis)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

#### 1.4.1 Visualize Class Distribution

In [None]:
y_mapped = [labels_dict[label] for label in y_train_shuffled]

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x=y_mapped)
plt.title('Class Distribution')
plt.show()

#### 1.4.2 Display Sample Images

In [None]:
def visualize_images(images, labels, class_names=None, num_samples=4):
    num_rows = 1
    num_cols = num_samples
    plt.figure(figsize=(16, 16))

    for i in range(num_samples):
        
        plt.subplot(num_rows, num_cols, i + 1)
        plt.imshow(images[i])
        if class_names:
            plt.title(class_names[labels_dict[labels[i]]])
        else:
            plt.title(f"Label: {labels_dict[labels[i]]}")
        plt.axis('off')

    plt.show()

In [None]:
# Display 5 random samples
visualize_images(X_train_shuffled, y_train_shuffled)

#### 1.4.3 Explore Color Channels

In [None]:
# Explore color channels
def plot_color_channels(img):
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 4, 1)
    plt.imshow(img)
    plt.title('Original Image')
    plt.axis('off')

    for i, channel in enumerate(['Red', 'Green', 'Blue']):
        plt.subplot(1, 4, i + 2)
        plt.imshow(img[:, :, i], cmap='gray')
        plt.title(f'{channel} Channel')
        plt.axis('off')

    plt.show()

In [None]:
random_index = np.random.randint(0, len(X_train_shuffled))
random_image = X_train_shuffled[random_index]
plot_color_channels(random_image)


#### 1.4.4 Pixel Intensity Distribution

In [None]:
# Explore pixel intensity distribution
def plot_pixel_intensity_distribution(img):
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.imshow(img)
    plt.title('Original Image')
    plt.axis('off')

    plt.subplot(1, 2, 2)
    plt.hist(img.ravel(), bins=256, color='gray', histtype='step')
    plt.title('Pixel Intensity Distribution')
    plt.xlabel('Pixel Intensity')
    plt.ylabel('Frequency')

    plt.show()

In [None]:
plot_pixel_intensity_distribution(random_image)

#### 1.4.5 Average Pixel Intensity per Class

In [None]:
# Calculate and visualize average pixel intensity per class
def average_pixel_intensity_per_class(X, y):
    unique_classes = np.unique(y)
    avg_intensity_per_class = []

    for label in unique_classes:
        class_indices = np.where(y == label)[0]
        class_images = X[class_indices]
        avg_intensity = np.mean(class_images)
        avg_intensity_per_class.append(avg_intensity)

    plt.bar(unique_classes, avg_intensity_per_class)
    plt.title('Average Pixel Intensity per Class')
    plt.xlabel('Class')
    plt.ylabel('Average Pixel Intensity')
    plt.show()

In [None]:
# Visualize average pixel intensity per class
average_pixel_intensity_per_class(X_train_shuffled, y_train_shuffled)

#### 1.4.6 Correlation Between Channels

In [None]:
# Explore correlation between color channels
def plot_channel_correlation(img):
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 4, 1)
    plt.imshow(img)
    plt.title('Original Image')
    plt.axis('off')
    
    plt.subplot(1, 4, 2)
    plt.scatter(img[:, :, 0].ravel(), img[:, :, (1) % 3].ravel(), s=2, alpha=0.5)
    plt.title(f'Correlation: Red vs Green')
    plt.xlabel(f'Red Channel')
    plt.ylabel(f'Green Channel')
    
    plt.subplot(1, 4, 3)
    plt.scatter(img[:, :, 1].ravel(), img[:, :, (2) % 3].ravel(), s=2, alpha=0.5)
    plt.title(f'Correlation: Green vs Blue')
    plt.xlabel(f'Green Channel')
    plt.ylabel(f'Blue Channel')
    
    plt.subplot(1, 4, 4)
    plt.scatter(img[:, :, 2].ravel(), img[:, :, (3) % 3].ravel(), s=2, alpha=0.5)
    plt.title(f'Correlation: Blue vs Green')
    plt.xlabel(f'Blue Channel')
    plt.ylabel(f'Green Channel')


    plt.show()

In [None]:
plot_channel_correlation(random_image)

## 2. Model

### 2.1 Use Naive ResNet

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.utils import to_categorical

#### 2.1.1 Load Pre-Trained ResNet-50 Version

In [None]:
# Load pre-trained ResNet50 model (excluding top classification layer)
base_resnet = ResNet50(weights='imagenet', include_top=False, input_shape=(128, 128, 3))

In [None]:
# Freeze the layers of the pre-trained ResNet model
for layer in base_resnet.layers:
    layer.trainable = False

#### 2.1.2 Modeling on top of PreTrained Base ResNet-50

In [None]:
# Create a new model on top of the pre-trained ResNet model
ResNet50 = models.Sequential()
ResNet50.add(base_resnet)
ResNet50.add(layers.GlobalAveragePooling2D())
ResNet50.add(layers.Dense(256, activation='relu'))
ResNet50.add(layers.Dropout(0.5))
ResNet50.add(layers.Dense(3, activation='softmax'))

#### 2.1.3 One Hot Encoding

In [None]:
y_train_one_hot = to_categorical(y_train_shuffled, 3)
y_val_one_hot   = to_categorical(y_val_shuffled, 3)
y_test_one_hot  = to_categorical(y_test_shuffled, 3)

In [None]:
ResNet50.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

#### 2.1.4 Train

In [None]:
res_net_history = ResNet50.fit(X_train_shuffled, y_train_one_hot, batch_size=32, epochs=10, validation_data=(X_val_shuffled, y_val_one_hot))

#### 2.1.5 Evaluation

##### 2.1.5.1 Plot Learning Curve

In [None]:
def plot_learning_curves(history):
    plt.figure(figsize=(12, 6))

    # Plot training & validation accuracy values
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend(['Train', 'Validate'], loc='upper left')

    # Plot training & validation loss values
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(['Train', 'Validate'], loc='upper left')

    plt.tight_layout()
    plt.show()

In [None]:
plot_learning_curves(res_net_history)

##### 2.1.5.2  Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_pred_pretrained = ResNet50.predict(X_test_shuffled)
y_pred_classes_pretrained = np.argmax(y_pred_pretrained, axis=1)
y_test_classes_pretrained = np.argmax(y_test_one_hot, axis=1)

In [None]:
confusion_mtx_resnet = confusion_matrix(y_pred_classes_pretrained, y_test_classes_pretrained)
print("Confusion Matrix:")
print(confusion_mtx_resnet)

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(confusion_mtx_resnet, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix Using Pretrained ResNet-50')
plt.colorbar()
tick_marks = np.arange(3)
plt.xticks(tick_marks, [labels_dict[0], labels_dict[1], labels_dict[2]])
plt.yticks(tick_marks, [labels_dict[0], labels_dict[1], labels_dict[2]])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

##### 2.1.5.3 Classification Report

In [None]:
from sklearn.metrics import classification_report

In [None]:
class_report_resnet = classification_report(y_test_classes_pretrained, y_pred_classes_pretrained)
print("Classification Report Using Pretrained ResNet-50:")
print(class_report_resnet)

### 2.2 Using NAS (Neural Architecture Search) With Convolution Blocks

In [None]:
!pip install autokeras

In [None]:
import autokeras as ak

#### 2.2.1 NAS Modeling

In [None]:
clf = ak.ImageClassifier(overwrite=True, max_trials=3)

#### 2.2.2 Searching

In [None]:
nas = clf.fit(X_train_shuffled, y_train_one_hot, batch_size=32, epochs=10, validation_data=(X_val_shuffled, y_val_one_hot))

#### 2.2.3 Evaluation

##### 2.2.3.1 Learning Curve 

In [None]:
plot_learning_curves(nas)

##### 2.2.3.2 Confusion Matrix

In [None]:
y_pred_pretrained = clf.predict(X_test_shuffled)
y_pred_classes_pretrained = np.argmax(y_pred_pretrained, axis=1)
y_test_classes_pretrained = np.argmax(y_test_one_hot, axis=1)

In [None]:
confusion_mtx_nas = confusion_matrix(y_pred_classes_pretrained, y_test_classes_pretrained)
print("Confusion Matrix:")
print(confusion_mtx_nas)

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(confusion_mtx_nas, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix Using NAS')
plt.colorbar()
tick_marks = np.arange(3)
plt.xticks(tick_marks, [labels_dict[0], labels_dict[1], labels_dict[2]])
plt.yticks(tick_marks, [labels_dict[0], labels_dict[1], labels_dict[2]])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

##### 2.1.5.3 Classification Report

In [None]:
class_report_nas = classification_report(y_test_classes_pretrained, y_pred_classes_pretrained)
print("Classification Report Using Pretrained NAS")
print(class_report_nas)

## Conclusion

The results of our experiments highlight the remarkable effectiveness of the Neural Architecture Search (NAS) optimization technique in the context of lymphoma diagnosis in histopathology images. While the pre-trained ResNet-50 model struggled with an accuracy of 50% and exhibited limitations in precision, recall, and F1-score, the NAS-optimized model achieved a perfect accuracy of 100%.

The NAS approach, with its ability to automatically discover optimal neural network architectures for a given dataset, proved to be a powerful tool for enhancing the performance of the model. The perfect precision, recall, and F1-score across all classes demonstrate the robustness and reliability of the NAS-optimized model in accurately identifying different types of lymphomas.

In conclusion, the NAS optimization technique emerges as a promising avenue for further research in medical image classification tasks. Its ability to adapt and tailor neural network architectures to specific datasets showcases its potential for improving diagnostic accuracy in histopathology images. Future work may involve exploring the application of NAS on larger datasets and investigating its generalizability across different medical imaging domains.