## **Breast Cancer Detection via Convolutional Neural Network**

## **Abstract**

The CNN model stacks two convolutional layers and two fully connected layers. Each convolutional layer is followed by max-pooling and batch normalization, which helps the network learn hierarchical features in breast ultrasound images. The architecture is built to classify these images as benign or malignant. The normal class is excluded because its image composition differs greatly from the other two classes, which caused errors when running the convolutions. The model is compiled with binary cross-entropy loss and an accuracy metric, which suits this two-class classification task. The model summary gives an overview of the structure and layer configurations, and the hyperparameters, including the number of convolutional layers and their filter counts, the kernel size, and the max-pooling size, are specified explicitly so that the configuration behind the model's performance on medical images is easy to inspect [Rimal].
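
The pairing of binary cross-entropy with a two-unit softmax output and one-hot labels can be checked by hand. The sketch below is a plain-Python illustration, not the Keras implementation: the helper names and the logits are hypothetical, and the small clipping constant Keras applies inside its losses is ignored. Under those assumptions, the two losses agree for a single sample.

```python
import math

def softmax2(z0, z1):
    """Two-class softmax, shifted by the larger logit for numerical stability."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def categorical_crossentropy(y, p):
    """-sum_k y_k * log(p_k) for one sample."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))

def binary_crossentropy(y, p):
    """Mean over output units of the per-unit binary cross-entropy."""
    terms = [-(yk * math.log(pk) + (1 - yk) * math.log(1 - pk)) for yk, pk in zip(y, p)]
    return sum(terms) / len(terms)

p = softmax2(1.2, -0.4)   # hypothetical logits for (benign, malignant)
y = (1.0, 0.0)            # one-hot label for the first class
bce = binary_crossentropy(y, p)
cce = categorical_crossentropy(y, p)
print(abs(bce - cce) < 1e-12)
```

Because the two softmax outputs sum to one, each unit contributes the same term to the binary loss, so its mean equals the categorical loss. This holds only for two classes.
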

## **Data Preprocessing**

In [None]:
import os
from PIL import Image
import random
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import optimizers
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import re
from PIL import ImageOps
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#data path for madison
#data_dir = "/content/drive/MyDrive/fall23/dsc201/Dataset_BUSI_with_GT/"

In [None]:
#data path for sofia
#data_dir = "/content/drive/MyDrive/DSC201/Project3/Dataset_BUSI_with_GT/"

In [None]:
# Data Path for Anthony
data_dir = "/content/drive/MyDrive/DSC201/Images/Dataset_BUSI_with_GT/"

In [None]:
# Data Path for Dr. Rimal
#data_dir = "/content/drive/MyDrive/Breast_Cancer/"

In [None]:
def load_and_resize_image(image_path, desired_width, desired_height):
    im = Image.open(image_path)

    width, height = im.size

    left = 0
    right = 2 * width / 3
    bottom = 2 * height / 3

    im1 = im.crop((left, 0, right, bottom))

    im1 = ImageOps.grayscale(im1)

    return im1.resize((desired_width, desired_height))

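The crop in `load_and_resize_image` keeps the top-left two-thirds of each image before resizing. A minimal sketch of that box arithmetic follows, assuming a hypothetical 600x450 input; `crop_box` is an illustrative helper and not part of the notebook.

```python
def crop_box(width, height):
    """Return the (left, upper, right, lower) box covering the top-left two-thirds."""
    right = round(2 * width / 3)
    lower = round(2 * height / 3)
    return (0, 0, right, lower)

# Hypothetical image size, for illustration only
width, height = 600, 450
box = crop_box(width, height)
kept_fraction = (box[2] * box[3]) / (width * height)
print(box)            # (0, 0, 400, 300)
print(kept_fraction)  # about 0.444, i.e. 4/9 of the original area
```

The cropped region is then resized to 300x300, so images whose cropped aspect ratio is not square are stretched.
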
In [None]:
desired_width = 300
desired_height = 300

# Keep only PNG files whose names contain 'mask'; the normal class is excluded
image_files = [file for file in os.listdir(data_dir) if file.endswith('.png') and 'mask' in file and 'normal' not in file]

# The label is the leading run of characters before the first digit or parenthesis
label_pattern = re.compile(r'[^0-9()]+')
labels = [label_pattern.match(image_file).group().strip() for image_file in image_files]

dataset = []

for image_file, label in zip(image_files, labels):
    image_path = os.path.join(data_dir, image_file)
    resized_image = load_and_resize_image(image_path, desired_width, desired_height)
    if resized_image is not None:
        dataset.append({'image': np.array(resized_image), 'label': label})

random.shuffle(dataset)

if not dataset:
    raise ValueError("Dataset is empty. Please check the loading and resizing of images.")

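The label extraction relies on the file-name prefix. The sketch below shows what the pattern `[^0-9()]+` captures, assuming file names of the form `benign (12)_mask.png`; the names listed are hypothetical examples of that assumed scheme, and `label_from_filename` is an illustrative helper.

```python
import re

label_pattern = re.compile(r'[^0-9()]+')

def label_from_filename(file_name):
    """Return the class prefix of a file name, without surrounding whitespace."""
    match = label_pattern.match(file_name)
    if match is None:
        raise ValueError(f"No label prefix in {file_name!r}")
    return match.group().strip()

# Hypothetical file names following the assumed naming scheme
names = ["benign (12)_mask.png", "malignant (3)_mask.png", "normal (7)_mask.png"]
kept = [n for n in names if n.endswith('.png') and 'mask' in n and 'normal' not in n]
labels = [label_from_filename(n) for n in kept]
print(labels)  # ['benign', 'malignant']
```

Without the `strip()` call the raw match would include the space before the opening parenthesis, for example `'benign '`.
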
In [None]:
# Create a figure to display the images
fig, axs = plt.subplots(3, 4, figsize=(18, 12))

# Display the first twelve images
for i in range(3):
    for j in range(4):
        index = i * 4 + j
        if index < len(dataset):
            image_data = dataset[index]['image']
            label = dataset[index]['label']

            # Display the grayscale image using imshow
            axs[i, j].imshow(image_data, cmap='gray')
            axs[i, j].set_title(label)
            axs[i, j].axis('off')

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

### **Normalization**

In [None]:
# Separate images and labels from the dataset (images are already grayscale)
images = np.array([data['image'] for data in dataset])
# Add an explicit channel axis so the shape matches input_shape=(300, 300, 1)
images = np.expand_dims(images, axis=-1)
labels = np.array([data['label'] for data in dataset])

# Convert labels to categorical format
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
categorical_labels = to_categorical(encoded_labels)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(images, categorical_labels, test_size=0.2, random_state=42)

# Normalize pixel values to a range between 0 and 1
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

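The encoding and scaling steps above can be traced on a tiny example. This is a plain-Python sketch of the same idea, not the scikit-learn or Keras code: `encode_labels` and `scale_pixels` are hypothetical helpers, and the class order is assumed to be alphabetical, as `LabelEncoder` sorts its classes.

```python
def encode_labels(labels):
    """Map string labels to integer codes in sorted class order, then one-hot encode."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    one_hot = [[1.0 if index[label] == k else 0.0 for k in range(len(classes))]
               for label in labels]
    return classes, one_hot

def scale_pixels(row):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return [value / 255.0 for value in row]

classes, one_hot = encode_labels(["malignant", "benign", "benign"])
print(classes)                     # ['benign', 'malignant']
print(one_hot)                     # [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
print(scale_pixels([0, 51, 255]))  # [0.0, 0.2, 1.0]
```

With this ordering, column 0 of the one-hot labels corresponds to benign and column 1 to malignant.
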
In [None]:
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

## **Building the CNN Model**

In [None]:
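def Build_CNN_Model(conv_layers, kernel_size, max_pool_size, optimizer='Adam', learning_rate=0.001):
    # conv_layers: filters per convolutional layer; kernel_size applies to the first layer only.
    # max_pool_size is accepted but not used below: pooling is fixed at (2, 2).
    model = Sequential()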

    # 1st Convolutional layer
    model.add(Conv2D(conv_layers[0], kernel_size, activation='relu', strides=(6, 6), input_shape=(300, 300, 1)))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(BatchNormalization())

    # 2nd Convolutional layer
    model.add(Conv2D(conv_layers[1], (5, 5), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(BatchNormalization())

    # # 3rd Convolutional layer
    # model.add(Conv2D(conv_layers[2], (3, 3), activation='relu', padding='same'))
    # model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

    # 4th Convolutional layer
    # model.add(Conv2D(conv_layers[3], (3, 3), activation='relu', padding='same'))

    # 5th Convolutional layer
    # model.add(Conv2D(conv_layers[4], (3, 3), activation='relu', padding='same'))
    # model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    # model.add(BatchNormalization())

    # Flatten layer
    model.add(Flatten())

    # Fully connected layer
    model.add(Dense(16, activation='relu'))
    # model.add(Dropout(0.5))
    # model.add(Dense(4, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))

    if optimizer == 'Adam':
        opt = optimizers.Adam(learning_rate=learning_rate)
    elif optimizer == 'Adagrad':
        opt = optimizers.Adagrad(learning_rate=learning_rate)
    elif optimizer == 'Nadam':
        opt = optimizers.Nadam(learning_rate=learning_rate)
    elif optimizer == 'Adadelta':
        opt = optimizers.Adadelta(learning_rate=learning_rate)
    elif optimizer == 'Rmsprop':
        opt = optimizers.RMSprop(learning_rate=learning_rate)
    else:
        # Fail early: compiling without a valid optimizer would raise a NameError below
        raise ValueError("No optimizer found in the list(['Adam', 'Adagrad', 'Nadam', 'Adadelta', 'Rmsprop'])! "
                         "Please apply your optimizer manually...")

    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

    #print(model.summary())
    return model

In [None]:
optimizer_names = ['Adam', 'Adagrad']
learning_rate = 0.001
conv_layers = [64, 32]
kernel_size = (5,5)
max_pool_size = (2,2)
Build_CNN_Model(conv_layers, kernel_size, max_pool_size, optimizer = optimizer_names[0], learning_rate= learning_rate).summary()

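The layer shapes reported by `summary()` can be cross-checked with the usual output-size arithmetic. The sketch below assumes the configuration used above (300x300 input, `conv_layers = [64, 32]`, a 5x5 first kernel with stride 6 and default 'valid' padding); `conv_out` is a hypothetical helper written for this check.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of one spatial dimension for a convolution or pooling layer."""
    if padding == 'same':
        return -(-size // stride)            # ceil(size / stride)
    return (size - kernel) // stride + 1      # 'valid' padding

size = 300
after_conv1 = conv_out(size, kernel=5, stride=6, padding='valid')        # Conv2D, 5x5, stride 6
after_pool1 = conv_out(after_conv1, kernel=2, stride=2, padding='valid')  # MaxPooling2D, 2x2
after_conv2 = conv_out(after_pool1, kernel=5, stride=1, padding='same')   # Conv2D, 5x5, 'same'
after_pool2 = conv_out(after_conv2, kernel=2, stride=2, padding='valid')  # MaxPooling2D, 2x2

flattened = after_pool2 * after_pool2 * 32   # 32 filters in the second layer
dense_params = flattened * 16 + 16           # weights plus biases of Dense(16)

print(after_conv1, after_pool1, after_conv2, after_pool2)  # 50 25 25 12
print(flattened, dense_params)                             # 4608 73744
```

The stride of 6 in the first layer does most of the downsampling, which keeps the flattened vector and the first dense layer small.
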
In [None]:
model = Build_CNN_Model(conv_layers, kernel_size, (2, 2), 'Adam', learning_rate=0.01)
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
model.fit(x_train, y_train, batch_size=16, epochs=2, validation_data=(x_test, y_test), callbacks=[callback])

## **Hyperparameter Tuning**

In [None]:
def write_dic_to_file(dic_name, file_name):
    with open(file_name, 'w') as file:
        file.write(str(dic_name))

In [None]:
def CNN_Hyper_Parameter_Tuning(conv_layers, kernel_size, max_pool_shape, optimizers_names,
                               learning_rates, batch_sizes, epochs, num_replicates=2):

    best_avg_accuracy = 0.0
    collect_accuracy = []
    all_avg_accuracy = np.zeros((len(optimizers_names), len(learning_rates), len(batch_sizes)))

    best_hyper_parameters = {"model": conv_layers,
                           "max_pool_shape": None,
                           "optimizer": None,
                           "learning_rate": None,
                           "batch_size": None,
                           "best_avg_accuracy": None}

    for opt in range(len(optimizers_names)):
        for lr in range(len(learning_rates)):
            for bs in range(len(batch_sizes)):
                # Reset per configuration so earlier runs do not leak into this average
                collect_accuracy = []
                for i in range(num_replicates):
                    print("Running for " + optimizers_names[opt] + " optimizer " + str(learning_rates[lr]) + \
                          " learning_rate " + str(batch_sizes[bs]) + " batch_size and " + str(i) + " replicate " + "\n")

                    model = Build_CNN_Model(conv_layers, kernel_size, max_pool_shape, optimizers_names[opt],
                                            learning_rate=learning_rates[lr])
                    callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
                    history = model.fit(x_train, y_train, batch_size=batch_sizes[bs], epochs=epochs,
                                        validation_data=(x_test, y_test), callbacks=[callback])
                    collect_accuracy.append(history.history['accuracy'])

                avg_accuracy = np.mean(np.array(collect_accuracy))
                print("Average accuracy for this model: ", avg_accuracy)
                all_avg_accuracy[opt][lr][bs] = avg_accuracy

                if avg_accuracy > best_avg_accuracy:
                    best_avg_accuracy = avg_accuracy
                    best_hyper_parameters = {"model": conv_layers,
                                           "max_pool_shape": max_pool_shape,
                                           "optimizer": optimizers_names[opt],
                                           "learning_rate": learning_rates[lr],
                                           "batch_size": batch_sizes[bs],
                                           "best_avg_accuracy": best_avg_accuracy}

    output_dictionary = {
        "best_hyper_parameters": best_hyper_parameters,
        "all_avg_accuracy": all_avg_accuracy
    }

    # writing output dictionary to a file
    file_name = "cnn-" + str(conv_layers[0]) + "-hyperparameter_tuning_results" + ".txt"
    write_dic_to_file(output_dictionary, file_name)

    print("Best_hyper_parameters(CNN): \n", output_dictionary['best_hyper_parameters'])
    print("all_avg_accuracy(CNN): \n", output_dictionary['all_avg_accuracy'])

    return output_dictionary['best_hyper_parameters']


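The nested loops in the tuning function amount to a grid search with replicates. A compact sketch of that pattern follows, using `itertools.product`; `grid_search` and its `evaluate` callback are hypothetical stand-ins for building and fitting a model, and the grid values mirror the ones used in this notebook.

```python
from itertools import product

optimizer_names = ['Adam', 'Nadam']
learning_rates = [0.001, 0.01]
batch_sizes = [32, 64]
num_replicates = 10

# Every (optimizer, learning rate, batch size) combination, in loop order
grid = list(product(optimizer_names, learning_rates, batch_sizes))
total_fits = len(grid) * num_replicates
print(len(grid), total_fits)  # 8 80

def grid_search(evaluate, grid, num_replicates):
    """Average a score over replicates for each configuration and pick the best."""
    results = {}
    for config in grid:
        # A fresh list per configuration keeps the averages independent
        scores = [evaluate(*config, replicate=i) for i in range(num_replicates)]
        results[config] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results
```

Here `evaluate` would wrap one model fit and return a single score, such as the final validation accuracy. With the grid above, the search costs 80 fits.
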
In [None]:
conv_layers = [64, 32]
kernel_size = (5, 5)
max_pool_size = (4, 4)
#kernel_size = (3, 3)
#max_pool_size = (2, 2)
optimizer_names = ['Adam', 'Nadam']
learning_rates = [0.001, 0.01]
batch_sizes = [32, 64]
epochs = 5
num_replicates = 10

In [None]:
alexnet_best_hyper_parameters = CNN_Hyper_Parameter_Tuning(conv_layers, kernel_size, max_pool_size,
                                                           optimizer_names, learning_rates, batch_sizes,
                                                           epochs=epochs, num_replicates=num_replicates)
alexnet_best_hyper_parameters

## **Best CNN Model**

In [None]:
def Final_CNN_Model(conv_layers, kernel_size, max_pool_shape, hyper_parameters, epochs=5, num_replicates=10):
    # Arrays for collecting performance scores
    accuracy_array = np.zeros(num_replicates)
    elapsed_time_array = np.zeros(num_replicates)

    models_history = []

    for i in range(num_replicates):
        print("Program is running for %d replicate ----->\n" % i)

        model = Build_CNN_Model(conv_layers, kernel_size, max_pool_shape, optimizer=hyper_parameters["optimizer"],
                                    learning_rate=hyper_parameters["learning_rate"])
        callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

        start = time.time()
        history = model.fit(x_train, y_train, batch_size=hyper_parameters["batch_size"], epochs=epochs,
                            validation_data=(x_test, y_test), callbacks=[callback])
        end = time.time()
        elapsed_time = end - start

        models_history.append(history)

        # Calculating performance scores
        accuracy_array[i] = history.history['accuracy'][-1]
        elapsed_time_array[i] = elapsed_time

    avg_accuracy = np.mean(accuracy_array)
    avg_elapsed_time = np.mean(elapsed_time_array)

    # Collecting important results
    performance_metrics = {
        'scores': {'accuracy': accuracy_array, 'elapsed_time': elapsed_time_array},
        'avg_scores': {'accuracy': avg_accuracy, 'elapsed_time': avg_elapsed_time},
        'stds': {'accuracy': np.std(accuracy_array), 'elapsed_time': np.std(elapsed_time_array)},
        'maximums': {'accuracy': np.max(accuracy_array), 'elapsed_time': np.max(elapsed_time_array)}
    }

    model_with_best_accuracy = {
        'replicate': np.argmax(accuracy_array),
        'accuracy': np.max(accuracy_array),
        'elapsed_time': elapsed_time_array[np.argmax(accuracy_array)],
        'history': models_history[np.argmax(accuracy_array)].history
    }

    # Collecting all the outputs together
    output_dictionary = {
        'best_model': model_with_best_accuracy,
        'hyper_parameters': hyper_parameters,
        'performance_metrics': performance_metrics,
        'models_history': models_history
    }

    print("Progress: All works are done successfully, congratulations!!\n")
    return output_dictionary

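The performance dictionary reduces the per-replicate scores to a mean, standard deviation, maximum, and best replicate index. A standard-library sketch of the same reductions is shown below; `summarize` is a hypothetical helper and the accuracy values are made up for illustration. `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
import statistics

def summarize(values):
    """Mean, population standard deviation, maximum, and index of the first maximum."""
    best_index = max(range(len(values)), key=values.__getitem__)
    return {
        'mean': statistics.fmean(values),
        'std': statistics.pstdev(values),
        'max': values[best_index],
        'argmax': best_index,
    }

# Made-up accuracies for three replicates, for illustration only
summary = summarize([0.70, 0.75, 0.72])
print(summary['argmax'], summary['max'])  # 1 0.75
```

The index returned here plays the role of `np.argmax(accuracy_array)` when selecting the best replicate.
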
In [None]:
best_hyperparameters = {'model': [64, 32],
                        'max_pool_shape': (4, 4),
                        'optimizer': 'Adam',
                        'learning_rate': 0.001,
                        'batch_size': 64,
                        'best_avg_accuracy': 0.7302443587779999}

epochs = 10
num_replicates = 20

cnn_output = Final_CNN_Model(conv_layers=best_hyperparameters['model'],
                                      kernel_size=(5, 5),
                                      max_pool_shape=best_hyperparameters['max_pool_shape'],
                                      hyper_parameters=best_hyperparameters,
                                      epochs=epochs,
                                      num_replicates=num_replicates)

cnn_output

## **Confusion Matrix**

In [None]:
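# Final_CNN_Model returns training histories rather than fitted models, so refit
# one model with the selected hyperparameters and use it for the predictions.
best_model = Build_CNN_Model(best_hyperparameters['model'], (5, 5),
                             best_hyperparameters['max_pool_shape'],
                             optimizer=best_hyperparameters['optimizer'],
                             learning_rate=best_hyperparameters['learning_rate'])
best_model.fit(x_train, y_train, batch_size=best_hyperparameters['batch_size'], epochs=epochs,
               validation_data=(x_test, y_test))

y_pred = best_model.predict(x_test)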

y_true = np.argmax(y_test, axis=1)
y_pred = np.argmax(y_pred, axis=1)

class_labels = label_encoder.classes_
class_mapping = {i: class_labels[i] for i in range(len(class_labels))}
y_true_labels = [class_mapping[label] for label in y_true]
y_pred_labels = [class_mapping[label] for label in y_pred]

conf_matrix = confusion_matrix(y_true_labels, y_pred_labels, labels=class_labels)

plt.rcParams.update({'font.size': 16})

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens', xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


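The precision and recall values in a classification report follow directly from the confusion-matrix counts. The sketch below works through that arithmetic on a toy example; the labels are invented for illustration and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred, positive):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels, for illustration only
y_true_toy = ['benign', 'benign', 'malignant', 'malignant', 'benign']
y_pred_toy = ['benign', 'malignant', 'malignant', 'benign', 'benign']
print(confusion_counts(y_true_toy, y_pred_toy, 'malignant'))  # (1, 1, 1, 2)
print(precision_recall(y_true_toy, y_pred_toy, 'malignant'))  # (0.5, 0.5)
```

For a screening task, recall on the malignant class is usually the figure to watch, since a false negative corresponds to a missed malignant case.
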
In [None]:
class_report = classification_report(y_true_labels, y_pred_labels, target_names=class_labels)
print("Classification Report:\n", class_report)