# Breast Cancer Detection

**Author: Ogo Ndugba**

## Overview

Cancer is a disease that occurs when cells grow and multiply uncontrollably. There are more than 100 different types of cancer. Breast cancer occurs when cells in a person's breast tissue start growing out of control. Cancer can be very deadly: some cancers have a 5-year survival rate of less than 10%, while others have a 5-year survival rate of over 90%. Breast cancer in particular has a 5-year survival rate of over 90% when it is diagnosed very early, in stage 0 or stage 1. This survival rate falls to 22% if it is diagnosed late, in stage 4, when the cancer has spread to other organs.

What's interesting is the fact that cancer is more prevalent in developed countries.

![](Images/share-of-population-with-cancer.png)


## Business Understanding

Cancer is very prevalent in the US: we have the highest incidence rate in the developed world. Cancer is the second leading cause of death in the United States. It is responsible for more than 1 in 4 deaths. More than 1.6 million Americans have been diagnosed with cancer each year since 2016. As of 2019, there were more than 16,000,000 Americans living with some form of cancer.

Breast cancer is the most prevalent cancer in the world, with some 19 million people living with it. Research shows that early diagnosis and treatment lead to better survival odds. Yet out of more than 100 cancers, we typically screen for only 4 to 6, depending on the country: breast, prostate, cervical, lung, colorectal, and skin cancer.

The current gold standard for breast cancer screening is the mammogram.

Mammograms are recommended every year or two based on a person's risk factors.

However, mammograms aren't perfect. According to several studies, they have a false positive rate of anywhere from 10-20%. They also have a false negative rate of about 15%.

These error rates are higher for women who have dense breast tissue, are younger, or are women of color.

The goal of this project is to develop a machine learning classification algorithm for the National Institutes of Health that is more reliable at finding instances of breast cancer while also minimizing the false negative and false positive rates.

![](Images/number-of-people-with-cancer-by-type.png)


## Data Understanding

The data for this project comes from the [BreakHis dataset on Kaggle](https://www.kaggle.com/datasets/ambarish/breakhis). It contains 7,909 images, of which 2,480 are benign (no cancer) and 5,429 are malignant (cancer). These images were collected from "82 patients using different magnifying factors (40X, 100X, 200X, and 400X)". About 69% of the images in our dataset (5,429 of 7,909) are malignant.


![](Images/Distribution-of-Mammogram.png)

## Data Preparation

In [None]:

# importing Packages

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import splitfolders
import itertools
from matplotlib.image import imread
from PIL import Image

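# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# a local plot_confusion_matrix helper is defined later in this notebook
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, ConfusionMatrixDisplay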
from sklearn.metrics import precision_score, classification_report 
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import models, layers, optimizers, metrics, regularizers, losses
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.wrappers import scikit_learn
from tensorflow.keras.layers import Dense, Dropout, Flatten

import warnings
warnings.filterwarnings('ignore')
np.random.seed(2004)
%matplotlib inline



In [None]:
# installing split folders package
#pip install split-folders

In [None]:
# using the splitfolders package to split the images into train, validation, and test sets.
# this is commented out so that a new folder isn't created each time the notebook is run


# splitfolders.ratio("CancerData", output="Data",
# seed=42, ratio=(.64, .16, .2), group_prefix=None, move=True)

In [None]:
# using ImageDataGenerator to rescale all images 
train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)
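
To make the effect of `rescale=1./255` concrete, the sketch below applies the same multiplication to a tiny array of 8-bit pixel values using NumPy alone. The array `pixels` is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical 2x2 grayscale patch with 8-bit intensities (0-255)
pixels = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Same multiplication that ImageDataGenerator(rescale=1./255) applies to each image
scaled = pixels.astype("float32") * (1.0 / 255)

print(scaled.min(), scaled.max())
```

After rescaling, every pixel lies between 0 and 1, which keeps the inputs to the network on a small, consistent scale.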

In [None]:
train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        'Data/train',
        # All images will be resized to 150x150
        target_size=(150, 150),
        batch_size=5061,
        color_mode='grayscale',
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')


In [None]:
validation_generator = val_datagen.flow_from_directory('Data/val',
                                                        target_size=(150, 150),
                                                        batch_size=1264,
                                                        color_mode='grayscale',
                                                        class_mode='binary')
test_generator = test_datagen.flow_from_directory('Data/test',
                                                  target_size=(150, 150),
                                                  batch_size=1584,
                                                  color_mode='grayscale',
                                                  class_mode='binary')

In [None]:
# Creating the augmented data
# rescale added so augmented images share the [0, 1] range of the other generators
aug_train_images = ImageDataGenerator(rescale=1./255,
                                   rotation_range=30, 
                                   width_shift_range=0.25, 
                                   height_shift_range=0.25, 
                                   shear_range=0.25, 
                                   zoom_range=0.25, 
                                   horizontal_flip=True,
                                   vertical_flip=True)

train_aug = aug_train_images.flow_from_directory('Data/train',
                                                  target_size=(150, 150),
                                                  batch_size=3747,
                                                  color_mode='grayscale',
                                                  class_mode='binary')
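
As a rough illustration of two of the augmentations configured above, the sketch below mirrors a tiny made-up array with NumPy. It only shows what a flip does to pixel positions; Keras applies flips at random per image, and the rotation, shift, shear, and zoom settings are not reproduced here.

```python
import numpy as np

# Hypothetical 2x3 "image" so the effect of each flip is easy to read
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

horizontal = np.fliplr(image)  # mirror left-right, as horizontal_flip may do
vertical = np.flipud(image)    # mirror top-bottom, as vertical_flip may do

print(horizontal)
print(vertical)
```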

In [None]:
#getting images and labels for models
train_data, train_labels = next(train_generator)
test_data, test_labels = next(test_generator)
val_data, val_labels = next(validation_generator)

In [None]:
#reshaping for our Simple Model - dimension needs to be 2D 
train_data = train_data.reshape(train_data.shape[0], -1)
test_data = test_data.reshape(test_data.shape[0], -1)
val_data = val_data.reshape(val_data.shape[0], -1)
train_data.shape
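
The reshape above can be checked on a small stand-in array. This sketch assumes a hypothetical batch of 4 images with the same 150x150x1 layout the generators produce, and shows where the 22,500 input features come from.

```python
import numpy as np

# Hypothetical batch of 4 grayscale images, shaped like the generator output
batch = np.zeros((4, 150, 150, 1), dtype="float32")

# Keep the sample axis and collapse height, width, and channel into one axis
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)  # (4, 22500)
```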

In [None]:
# Function to show confusion matrix
# (adapted from a scikit-learn documentation example)
def plot_confusion_matrix(cm, classes,
                        normalize=False,
                        title='Confusion matrix',
                        cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize first so the image and the cell text show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell value, switching text color for contrast with the background
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

### Dummy Model
I will use a dummy model classifier as the baseline model. This model will predict the majority class. Since the majority class in our data is malignant, this model will predict all images are malignant.
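
To set expectations for this baseline, the sketch below works out the metrics a most-frequent classifier would get on the full dataset, using the class counts quoted in Data Understanding. It assumes malignant is the positive class (label 1), and the test split may have slightly different proportions, so treat the numbers as approximate.

```python
# Class counts quoted in the Data Understanding section (full dataset)
benign, malignant = 2480, 5429
total = benign + malignant

# A most-frequent baseline labels every image as malignant (assumed positive class)
true_positives = malignant
false_positives = benign
false_negatives = 0

accuracy = true_positives / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Because the baseline never predicts benign, its recall is perfect while its accuracy and precision simply equal the share of malignant images.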

In [None]:
# instantiate our DummyModel and fit to train dataset
dummy_model =  DummyClassifier(strategy='most_frequent')
dummy_model.fit(train_data, train_labels)

In [None]:
# creating predictions to evaluate model
y_pred = dummy_model.predict(test_data)

In [None]:
# getting metrics for model
dummy_acc = dummy_model.score(test_data, test_labels)
dummy_rec = recall_score(test_labels,y_pred)
dummy_pre = precision_score(test_labels,y_pred)

print(f"Dummy Model accuracy: {dummy_acc}")
print(f"Dummy Model recall: {dummy_rec}")
print(f"Dummy Model precision: {dummy_pre}")

In [None]:
# creating confusion matrix
cm = confusion_matrix(y_true= test_labels, y_pred=y_pred) 
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

### Simple Model 1
The first model is a minimal fully connected network: one hidden layer of 12 ReLU units followed by a single sigmoid output.

In [None]:
# instantiating neural network model
simple_model = models.Sequential()

In [None]:
# giving input and output layers
simple_model.add(layers.Dense(12, activation='relu', input_shape=(22500,)))
simple_model.add(layers.Dense(1, activation='sigmoid')) 

In [None]:
# compiling model and printing summary
simple_model.compile(optimizer='SGD',
                       loss='binary_crossentropy',
                       metrics=['accuracy', metrics.Precision(name='precision'), metrics.Recall(name='recall')])
simple_model.summary()
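
As a sanity check on the summary, the trainable parameter count of this network can be worked out by hand. The helper `dense_params` below is a hypothetical name introduced only for this sketch.

```python
# Parameters of a Dense layer: one weight per input-unit pair plus one bias per unit
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

hidden = dense_params(150 * 150, 12)  # flattened 150x150 image -> 12 units
output = dense_params(12, 1)          # 12 units -> 1 sigmoid output

print(hidden, output, hidden + output)
```

Almost all of the parameters sit in the first layer, because every one of the 22,500 pixels connects to every hidden unit.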

In [None]:
# training our simple model and validating using our subset of validation data
simple_model_history = simple_model.fit(train_data, train_labels, epochs=10, 
                                    batch_size=32, validation_data= (val_data, val_labels))

In [None]:
#visualizing metrics
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(15,8))

loss = simple_model_history.history['loss']
accuracy = simple_model_history.history['accuracy']
precision = simple_model_history.history['precision']
recall = simple_model_history.history['recall']

validation_loss = simple_model_history.history['val_loss']
validation_accuracy = simple_model_history.history['val_accuracy']
validation_precision = simple_model_history.history['val_precision']
validation_recall = simple_model_history.history['val_recall']

# x and y are passed by keyword; each validation curve plots its own series
sns.lineplot(x=simple_model_history.epoch, y=loss, ax=ax1, label='loss')
sns.lineplot(x=simple_model_history.epoch, y=validation_loss, ax=ax1, label='val_loss')


sns.lineplot(x=simple_model_history.epoch, y=accuracy, ax=ax2, label='accuracy')
sns.lineplot(x=simple_model_history.epoch, y=validation_accuracy, ax=ax2, label='val_accuracy')


sns.lineplot(x=simple_model_history.epoch, y=precision, ax=ax3, label='precision')
sns.lineplot(x=simple_model_history.epoch, y=validation_precision, ax=ax3, label='val_precision')


sns.lineplot(x=simple_model_history.epoch, y=recall, ax=ax4, label='recall')
sns.lineplot(x=simple_model_history.epoch, y=validation_recall, ax=ax4, label='val_recall');



In [None]:
# creating predictions to test model metrics against validation data
y_pred = (simple_model.predict(val_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= val_labels, y_pred=y_pred) 
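
The thresholding step above turns sigmoid outputs into class labels. A minimal sketch with made-up probabilities (not model output) shows how the 0.5 cutoff behaves, including at the boundary:

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like model.predict returns
probabilities = np.array([[0.12], [0.50], [0.73], [0.98]])

# Strictly greater than 0.5 counts as malignant (1); 0.5 itself stays benign (0)
labels = (probabilities > 0.5).astype("int32")

print(labels.ravel())  # [0 0 1 1]
```

Lowering the cutoff would trade more false positives for fewer false negatives, which may matter in a screening setting.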

In [None]:
# visualizing confusion matrix
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

In [None]:
# evaluating model
results = simple_model.evaluate(val_data, val_labels)

In [None]:
# getting model metrics
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

### Simple Model 2
The second model puts a `Flatten` layer in front of the network and widens the hidden layer from 12 to 32 units.

In [None]:
simple_model2 = models.Sequential([
    # the input is already flattened to 22,500 values, so Flatten passes it through
    layers.Flatten(input_shape=(22500,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')])

# compiling model and printing summary
simple_model2.compile(optimizer='SGD',
                       loss='binary_crossentropy',
                       metrics=['accuracy', metrics.Precision(name='precision'), metrics.Recall(name='recall')])
simple_model2.summary()

In [None]:
# training our simple model and validating using our subset of validation data
simple_model2_history = simple_model2.fit(train_data, train_labels, epochs=20, 
                                    batch_size=None, validation_data= (val_data, val_labels))

In [None]:
#visualizing metrics
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(15,8))

loss = simple_model2_history.history['loss']
accuracy = simple_model2_history.history['accuracy']
precision = simple_model2_history.history['precision']
recall = simple_model2_history.history['recall']

validation_loss = simple_model2_history.history['val_loss']
validation_accuracy = simple_model2_history.history['val_accuracy']
validation_precision = simple_model2_history.history['val_precision']
validation_recall = simple_model2_history.history['val_recall']

# x and y are passed by keyword; each validation curve plots its own series
sns.lineplot(x=simple_model2_history.epoch, y=loss, ax=ax1, label='loss')
sns.lineplot(x=simple_model2_history.epoch, y=validation_loss, ax=ax1, label='val_loss')


sns.lineplot(x=simple_model2_history.epoch, y=accuracy, ax=ax2, label='accuracy')
sns.lineplot(x=simple_model2_history.epoch, y=validation_accuracy, ax=ax2, label='val_accuracy')


sns.lineplot(x=simple_model2_history.epoch, y=precision, ax=ax3, label='precision')
sns.lineplot(x=simple_model2_history.epoch, y=validation_precision, ax=ax3, label='val_precision')


sns.lineplot(x=simple_model2_history.epoch, y=recall, ax=ax4, label='recall')
sns.lineplot(x=simple_model2_history.epoch, y=validation_recall, ax=ax4, label='val_recall');



In [None]:
# creating predictions to test model metrics against validation data
y_pred = (simple_model2.predict(val_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= val_labels, y_pred=y_pred) 

In [None]:
# visualizing confusion matrix
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

In [None]:
# evaluating model
results = simple_model2.evaluate(val_data, val_labels)

In [None]:
# getting model metrics
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

### CNN Models 
Next I built a Convolutional Neural Network (CNN) model and iterated on it several times, adding different types of layers and then a regularizer.

In [None]:

#recreating data sets for our CNN models  - dimension needs to be 4D 
train_data, train_labels = next(train_generator)
test_data, test_labels = next(test_generator)
val_data, val_labels = next(validation_generator)

In [None]:
cnn1_model = models.Sequential([
    layers.Conv2D(64, (3, 3), activation='relu',
                    input_shape=(150, 150, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
cnn1_model.compile(optimizer="adam",
                        loss='binary_crossentropy',
                        metrics=['accuracy', metrics.Precision(name='precision'), metrics.Recall(name='recall')])

cnn1_model.summary()
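
The shapes and parameter counts in the summary can be reproduced by hand. This sketch assumes the Keras defaults used above (stride 1 and `'valid'` padding for `Conv2D`, non-overlapping pooling); `conv_out` and `pool_out` are hypothetical helper names introduced for illustration.

```python
def conv_out(size, kernel):
    # 'valid' padding with stride 1, the Conv2D defaults
    return size - kernel + 1

def pool_out(size, pool):
    # Non-overlapping max pooling with 'valid' padding drops any remainder
    return size // pool

side = pool_out(conv_out(150, 3), 2)      # 150 -> 148 -> 74
flat_features = side * side * 64          # 64 feature maps after Flatten

conv_params = (3 * 3 * 1) * 64 + 64       # kernel weights plus one bias per filter
dense_params = flat_features * 16 + 16    # Flatten -> Dense(16)
output_params = 16 * 1 + 1                # Dense(16) -> Dense(1)

print(side, flat_features, conv_params + dense_params + output_params)
```

The convolution itself is tiny; nearly all of the weights come from connecting the flattened feature maps to the first dense layer.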

In [None]:
cnn1_history = cnn1_model.fit(train_data,
               train_labels,
               batch_size=10,
               epochs=20,
               validation_data=(val_data, val_labels))

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(15,8))

# x and y are passed by keyword; every curve reads from cnn1_history
sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['loss'], ax=ax1, label='loss')
sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['val_loss'], ax=ax1, label='val_loss')

sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['accuracy'], ax=ax2, label='accuracy')
sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['val_accuracy'], ax=ax2, label='val_accuracy')

sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['precision'], ax=ax3, label='precision')
sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['val_precision'], ax=ax3, label='val_precision')

sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['recall'], ax=ax4, label='recall')
sns.lineplot(x=cnn1_history.epoch, y=cnn1_history.history['val_recall'], ax=ax4, label='val_recall');


In [None]:
y_pred = (cnn1_model.predict(val_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= val_labels, y_pred=y_pred)  

In [None]:
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

In [None]:
results = cnn1_model.evaluate(validation_generator)

In [None]:
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

### CNN Model 2

In [None]:
cnn2_model = models.Sequential( [layers.Conv2D(64, (3, 3), activation='relu', 
                            input_shape=(150,150,1)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),      
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(32, activation='relu'),
    layers.Dense(12, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid') ])

cnn2_model.compile(optimizer="adam",
                        loss='binary_crossentropy',
                        metrics=['accuracy', metrics.Precision(name='precision'), metrics.Recall(name='recall')])

cnn2_model.summary()

In [None]:
cnn2_history = cnn2_model.fit(train_data,
               train_labels,
               batch_size=20,
               epochs=20,
               validation_data=(val_data, val_labels))

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(15,8))

# x and y are passed by keyword; every curve reads from cnn2_history
sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['loss'], ax=ax1, label='loss')
sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['val_loss'], ax=ax1, label='val_loss')

sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['accuracy'], ax=ax2, label='accuracy')
sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['val_accuracy'], ax=ax2, label='val_accuracy')

sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['precision'], ax=ax3, label='precision')
sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['val_precision'], ax=ax3, label='val_precision')

sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['recall'], ax=ax4, label='recall')
sns.lineplot(x=cnn2_history.epoch, y=cnn2_history.history['val_recall'], ax=ax4, label='val_recall');

In [None]:
y_pred = (cnn2_model.predict(val_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= val_labels, y_pred=y_pred)  

In [None]:
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

In [None]:
results = cnn2_model.evaluate(validation_generator)

In [None]:
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

### CNN Model 3

In [None]:
cnn3_model = models.Sequential( [layers.Conv2D(64, (4, 4), activation='relu', 
                            input_shape=(150,150,1)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (4,4), activation='relu'),      
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(32, activation='relu'),
    layers.Dense(12, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid') ])

# compile and summarize cnn3_model (not cnn2_model) so it can be trained below
cnn3_model.compile(optimizer="adam",
                        loss='binary_crossentropy',
                        metrics=['accuracy', metrics.Precision(name='precision'), metrics.Recall(name='recall')])

cnn3_model.summary()

In [None]:
cnn3_history = cnn3_model.fit(train_data,
               train_labels,
               batch_size=20,
               epochs=20,
               validation_data=(val_data, val_labels))

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(15,8))

# x and y are passed by keyword; every curve reads from cnn3_history
sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['loss'], ax=ax1, label='loss')
sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['val_loss'], ax=ax1, label='val_loss')

sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['accuracy'], ax=ax2, label='accuracy')
sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['val_accuracy'], ax=ax2, label='val_accuracy')

sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['precision'], ax=ax3, label='precision')
sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['val_precision'], ax=ax3, label='val_precision')

sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['recall'], ax=ax4, label='recall')
sns.lineplot(x=cnn3_history.epoch, y=cnn3_history.history['val_recall'], ax=ax4, label='val_recall');

In [None]:
y_pred = (cnn3_model.predict(val_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= val_labels, y_pred=y_pred)  

In [None]:
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

In [None]:
results = cnn3_model.evaluate(validation_generator)

In [None]:
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

### CNN Model 4

In [None]:
cnn4_model = models.Sequential()
# L2 factor passed positionally, which works regardless of the keyword name (l vs l2)
cnn4_model.add(layers.Conv2D(64, (4, 4), activation='relu',
                       input_shape=(150, 150, 1), kernel_regularizer=regularizers.l2(0.05)))
cnn4_model.add(layers.MaxPooling2D((2, 2)))
cnn4_model.add(layers.Conv2D(32, (3, 3), activation='relu', 
                                     kernel_regularizer=regularizers.l2(0.05)))
cnn4_model.add(layers.MaxPooling2D((2,2)))
cnn4_model.add(layers.Flatten())
cnn4_model.add(layers.Dense(16, activation='relu'))
cnn4_model.add(layers.Dense(1, activation='sigmoid'))

cnn4_model.compile(optimizer="adam",
                          loss='binary_crossentropy',
                          metrics=['accuracy', metrics.Precision(name='precision'), metrics.Recall(name='recall')])
cnn4_model.summary()
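
The `kernel_regularizer` used in this model adds a penalty to the loss that grows with the size of the layer's weights. A minimal sketch of that penalty, computed with NumPy on a made-up 2x2 kernel rather than the model's real weights:

```python
import numpy as np

# Hypothetical 2x2 kernel; real Conv2D kernels have many more weights
kernel = np.array([[0.5, -1.0],
                   [2.0,  0.0]])
l2_factor = 0.05  # same factor passed to regularizers.l2 in the model above

# L2 penalty: factor times the sum of squared weights, added to the loss
penalty = l2_factor * np.sum(kernel ** 2)

print(penalty)
```

Large weights are penalized quadratically, which nudges training toward smaller weights and can reduce overfitting.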

In [None]:
cnn4_history = cnn4_model.fit(train_data,
              train_labels,
              batch_size=32,
              epochs=10,
              validation_data=(val_data, val_labels))

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(15,8))

# x and y are passed by keyword; every curve reads from cnn4_history
sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['loss'], ax=ax1, label='loss')
sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['val_loss'], ax=ax1, label='val_loss')

sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['accuracy'], ax=ax2, label='accuracy')
sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['val_accuracy'], ax=ax2, label='val_accuracy')

sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['precision'], ax=ax3, label='precision')
sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['val_precision'], ax=ax3, label='val_precision')

sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['recall'], ax=ax4, label='recall')
sns.lineplot(x=cnn4_history.epoch, y=cnn4_history.history['val_recall'], ax=ax4, label='val_recall');

In [None]:
y_pred = (cnn4_model.predict(val_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= val_labels, y_pred=y_pred)  

In [None]:
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

In [None]:
results = cnn4_model.evaluate(validation_generator)

In [None]:
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

### CNN Model 5

In [None]:
cnn5_model = models.Sequential()
# L2 factor passed positionally, which works regardless of the keyword name (l vs l2)
cnn5_model.add(layers.Conv2D(64, (4, 4), activation='relu',
                       input_shape=(150, 150, 1), kernel_regularizer=regularizers.l2(0.05)))
cnn5_model.add(layers.MaxPooling2D((2, 2)))
cnn5_model.add(layers.Conv2D(32, (3, 3), activation='relu', 
                                     kernel_regularizer=regularizers.l2(0.05)))
cnn5_model.add(layers.MaxPooling2D((2,2)))
cnn5_model.add(layers.Flatten())
cnn5_model.add(layers.Dense(16, activation='relu'))
cnn5_model.add(layers.Dropout(0.5))
cnn5_model.add(layers.Dense(1, activation='sigmoid'))

cnn5_model.compile(optimizer="adam",
                          loss='binary_crossentropy',
                          metrics=['accuracy', metrics.Precision(name='precision'), metrics.Recall(name='recall')])

cnn5_model.summary()
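
The only change from CNN Model 4 is the `Dropout(0.5)` layer. The sketch below imitates what dropout does during training, using NumPy and made-up activations; the seed and the all-ones input are assumptions chosen for readability, and at inference time the layer passes its input through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5  # same rate as the Dropout layer above

# Hypothetical activations from the Dense(16) layer for one sample
activations = np.ones(16)

# Training: zero each unit with probability `rate`, scale survivors by 1/(1-rate)
keep_mask = rng.random(16) >= rate
dropped = activations * keep_mask / (1.0 - rate)

print(dropped)
```

Scaling the surviving units keeps the expected sum of activations the same with and without dropout.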

In [None]:
cnn5_history = cnn5_model.fit(train_data,
               train_labels,
               batch_size=32,
               epochs=10,
               validation_data=(val_data, val_labels))

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2,figsize=(15,8))

# x and y are passed by keyword; every curve reads from cnn5_history
sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['loss'], ax=ax1, label='loss')
sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['val_loss'], ax=ax1, label='val_loss')

sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['accuracy'], ax=ax2, label='accuracy')
sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['val_accuracy'], ax=ax2, label='val_accuracy')

sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['precision'], ax=ax3, label='precision')
sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['val_precision'], ax=ax3, label='val_precision')

sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['recall'], ax=ax4, label='recall')
sns.lineplot(x=cnn5_history.epoch, y=cnn5_history.history['val_recall'], ax=ax4, label='val_recall');

In [None]:
y_pred = (cnn5_model.predict(val_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= val_labels, y_pred=y_pred)  

In [None]:
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

In [None]:
results = cnn5_model.evaluate(validation_generator)

In [None]:
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

Overall, all of the CNN models had similar metrics. I chose CNN Model 3 as the best-performing model because it had a higher accuracy score than the others.

## Final Model
The final model is CNN Model 3, evaluated here on the held-out test data.

In [None]:
results = cnn3_model.evaluate(test_generator)

In [None]:
print(f"Model loss:  {results[0]}")
print(f"Model accuracy: {results[1]}")
print(f"Model precision: {results[2]}")
print(f"Model recall: {results[3]}")

In [None]:
y_pred = (cnn3_model.predict(test_data) > 0.5).astype("int32")
cm = confusion_matrix(y_true= test_labels, y_pred=y_pred) 

In [None]:
cm_labels = ['Benign','Malignant']
plot_confusion_matrix(cm=cm, classes=cm_labels, title='Confusion Matrix');

## Modeling and Results
In this project, I tried to build a model that would beat the currently accepted mammogram error rates: a false positive rate of 10-20% and a false negative rate of 15%. I built several models for the classification, trained them on the training data, and validated them on the validation data.

Based on the validation metrics, I chose the model with the best overall metrics and ran it on the test data. I built 8 models in total, including the dummy model. Unfortunately, the results weren't great: I wasn't able to hit 85% on any of my metrics (accuracy, precision, or recall).
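
For comparing a model against the mammogram benchmarks, the false positive and false negative rates can be read directly off a confusion matrix. The counts below are made up for illustration, chosen so the rates land on the benchmark values quoted above. They are not results from this project.

```python
# Hypothetical confusion matrix counts, laid out like sklearn's confusion_matrix:
# rows are true labels, columns are predictions, order [benign, malignant]
cm = [[80, 20],
      [15, 85]]

tn, fp = cm[0]
fn, tp = cm[1]

false_positive_rate = fp / (fp + tn)  # benign images flagged as malignant
false_negative_rate = fn / (fn + tp)  # malignant images missed
recall = tp / (tp + fn)               # equals 1 - false negative rate

print(false_positive_rate, false_negative_rate, recall)
```

Since recall is one minus the false negative rate, beating a 15% false negative rate is the same as reaching a recall above 85%.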


## Conclusion & Next Steps
In conclusion, my best model did not do well at classifying images as benign or malignant. 

As for potential next steps, these images were taken from a fairly small sample of 82 patients. There was no accompanying clinical information: whether the patient had dense breast tissue and, for patients with cancer, what stage the image was from or what specific type of breast cancer it showed. There was also no demographic information provided.

Increasing the sample size substantially and providing some demographic and clinical information could lead to better results.

There is also some research suggesting that other methods could be more accurate at diagnosing breast cancer: MRIs, cell-free DNA, and circulating tumor DNA seem to hold some promise as well.
