# **Title : Histopathologic Cancer Detection**

The goal is to create an algorithm that identifies metastatic cancer in small image patches taken from larger digital pathology scans.
This dataset was provided by Bas Veeling, with additional input from Babak Ehteshami Bejnordi, Geert Litjens, and Jeroen van der Laak.

# 1. Prepare the Environment and Load Data

Import the libraries, set the data paths, and load the CSV file of labels.

In [None]:
# Libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import cv2
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop

In [None]:
#Setting up the Path
test_path = "../input/histopathologic-cancer-detection/test/"
train_path = "../input/histopathologic-cancer-detection/train/"
path = "../input/histopathologic-cancer-detection/"
train_files       = os.listdir(train_path)
test_files        = os.listdir(test_path)

In [None]:
# Load csv file
labels = pd.read_csv(path+"train_labels.csv")
labels

# 2. EDA

Explore the data to understand how it is organized.

In [None]:
labels['label'].value_counts()

**The label indicates whether an image patch contains cancer: 1 is cancer and 0 is not cancer.**

According to the labels, 89,117 images are classified as cancer and 130,908 are not.

In [None]:
cancer_labels = ["No Cancer", "Cancer"]
# sort_index() orders the counts as label 0, then label 1, to match cancer_labels
values = labels.label.value_counts().sort_index()

chart_donut = go.Figure(data=[go.Pie(labels=cancer_labels, values=values, hole=.5, marker_colors=["green", "purple"])])
chart_donut.show()

In the training data, 40.5% of the images are labeled as cancer.
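
The 40.5% figure follows directly from the two counts reported above. A minimal check, with the counts typed in by hand rather than read from train_labels.csv:

```python
# Counts copied from the value_counts() output reported above.
n_cancer = 89_117
n_no_cancer = 130_908

total = n_cancer + n_no_cancer
cancer_share = n_cancer / total
no_cancer_share = n_no_cancer / total

print(f"Total images:    {total}")
print(f"Cancer share:    {cancer_share:.1%}")
print(f"No-cancer share: {no_cancer_share:.1%}")
```

On the full label file, a classifier that always predicts "no cancer" would already be right about 59.5% of the time, so that is a useful baseline when reading the accuracy curves later on.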

In [None]:
number_images = 15
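fig, axs = plt.subplots(1, number_images, figsize = (20, 2))
for idx, ax in enumerate(axs):
    # OpenCV reads images as BGR; convert to RGB so matplotlib shows the true colors
    img = cv2.imread(train_path + labels.id[idx] + ".tif")
    ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    ax.set_title("Label: " + str(labels.label[idx]))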

Showing each image with its label gives a rough idea of what cancerous and non-cancerous patches look like.

In [None]:
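def img_prep(directory, files, start = 0, end = -1, test=False):
    """Load images into a NumPy array; also return the labels unless test=True."""
    if end == -1:
        end = len(files)
    X = []
    if test:
        # files is a list of file names (extension included); start/end are ignored
        for image in files:
            img = cv2.imread(directory + image)
            img = cv2.resize(img, (96, 96))
            X.append(img)
        print("Image shape: ", X[0].shape)
        X = np.array(X)
        return X
    else:
        # files is the labels DataFrame; its ids carry no file extension
        for image in files.id[start:end]:
            img = cv2.imread(directory + image + ".tif")
            img = cv2.resize(img, (96, 96))
            X.append(img)
        print("Image shape: ", X[0].shape)
        X = np.array(X)
        # Convert to a NumPy array so the labels line up with X by position
        y = files.label[start:end].to_numpy()
        return X, y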

Resize every image to 96x96. A CNN with a fixed input shape needs all images to be the same size.
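
The same-size requirement can be seen when the list of images is turned into a single array. The sketch below is only an illustration: `stack_images` is a hypothetical helper that is not part of this notebook, and the patches are random pixels rather than real data.

```python
import numpy as np

def stack_images(images):
    """Stack a list of H x W x C arrays into one (N, H, W, C) batch.

    Raises ValueError if the images do not all share the same shape.
    """
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        raise ValueError(f"Images have mixed shapes: {sorted(shapes)}")
    return np.stack(images)

# Synthetic stand-ins for image patches (random pixels, not real data).
rng = np.random.default_rng(0)
patches = [rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8) for _ in range(4)]

batch = stack_images(patches)
print(batch.shape)  # (4, 96, 96, 3)
```

A batch of shape (N, 96, 96, 3) is what a model declared with `input_shape=(96, 96, 3)` expects.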

In [None]:
X_train, y_train = img_prep(train_path, labels, start=10_000, end = 120_000)

In [None]:
test = img_prep(test_path, test_files, test=True)

# 3. Build CNN Model

The CNN models are built with TensorFlow. Layers are added step by step while checking the accuracy, and dropout is added to reduce overfitting.

**Create a Model**

> **Model 1**

In [None]:
model = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(96, 96, 3)),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
            ])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
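
To get a feel for the size of Model 1, the layer output sizes can be traced by hand. The sketch below assumes the Keras defaults used above ('valid' padding and stride 1 for Conv2D, stride equal to the pool size for MaxPooling2D); `conv_out` and `pool_out` are hypothetical helper names introduced only for this illustration.

```python
def conv_out(size, kernel=3):
    """Output width of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width of max pooling with stride equal to the pool size."""
    return size // pool

size = 96
for _ in range(2):              # Model 1 has two Conv2D + MaxPooling2D pairs
    size = pool_out(conv_out(size))

filters = 32                    # filters in the last Conv2D layer
flat = size * size * filters    # length of the Flatten output
dense_params = flat * 512 + 512 # weights + biases of Dense(512)

print(size, flat, dense_params)  # 22 15488 7930368
```

Nearly all of the model's parameters sit in the first Dense layer, which is worth keeping in mind when comparing the training and validation curves.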

**Train a Model**

In [None]:
history = model.fit(X_train, y_train, epochs=20, validation_split=.2)

**Check accuracy values**

In [None]:
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

**Model result**

The training accuracy is high and the training loss is very low, but the validation results are noticeably worse. This gap indicates severe overfitting.
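
One way to put a number on the gap between training and validation curves is to compare the two losses per epoch. This is a minimal sketch: `generalization_gap` is a hypothetical helper, and the history values are made up for illustration. They are not results from this notebook.

```python
def generalization_gap(history):
    """Return per-epoch (val_loss - loss) and the epoch index with the lowest val_loss."""
    gaps = [v - t for t, v in zip(history["loss"], history["val_loss"])]
    val = history["val_loss"]
    best_epoch = min(range(len(val)), key=val.__getitem__)
    return gaps, best_epoch

# Made-up numbers for illustration only; NOT results from this notebook.
example_history = {
    "loss":     [0.50, 0.35, 0.20, 0.10, 0.05],
    "val_loss": [0.48, 0.42, 0.45, 0.60, 0.80],
}

gaps, best_epoch = generalization_gap(example_history)
print([round(g, 2) for g in gaps])
print("Lowest validation loss at epoch index", best_epoch)
```

With a real run, `history.history` can be passed in place of `example_history`. A gap that keeps growing after the best epoch is the pattern described as overfitting here.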

> **Model 2**

Model 2 adds one more convolutional layer and adjusts the number of filters.

In [None]:
model2 = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(12, (3, 3), activation='relu', input_shape=(96, 96, 3)),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
            ])
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
history2 = model2.fit(X_train, y_train, epochs=20, validation_split=.2)

In [None]:
# Plot training & validation accuracy values
plt.plot(history2.history['accuracy'])
plt.plot(history2.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

# Plot training & validation loss values
plt.plot(history2.history['loss'])
plt.plot(history2.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

**Model result**

It seems to have improved compared to Model 1, but overfitting is still severe.

> **Model 3**

Model 3 adds dropout and a second Dense layer.
The optimizer is changed from Adam to RMSprop.

In [None]:
model3 = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(96, 96, 3)),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Dropout(0.5),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation='relu'),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
            ])
model3.compile(optimizer=RMSprop(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])
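
What a dropout layer does during training can be sketched in a few lines of NumPy. This is an illustration of the general "inverted dropout" technique, not the Keras source code; the `dropout` function and its arguments are hypothetical names.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of x and rescale the rest."""
    if not training or rate == 0.0:
        return x  # at inference time the input passes through unchanged
    keep_mask = rng.random(x.shape) >= rate
    # Dividing by (1 - rate) keeps the expected value of each unit the same.
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((4, 8))

dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

With `rate=0.5` and an input of ones, every entry of the output is either 0 or 2. Because different units are silenced on every batch, the network cannot rely on any single unit, which is why dropout tends to reduce overfitting.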

In [None]:
history3 = model3.fit(X_train, y_train, epochs=20, validation_split=.2)

In [None]:
# Plot training & validation accuracy values
plt.plot(history3.history['accuracy'])
plt.plot(history3.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

# Plot training & validation loss values
plt.plot(history3.history['loss'])
plt.plot(history3.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

**Model result**

It is better than Model 2, but the performance is still very poor. Neither accuracy nor loss appears to have converged, and the loss remains very high.

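> **Model 4**

Model 4 applies Dropout(0.25) after each convolutional layer and again after the last pooling layer, and removes the second Dense layer. The optimizer is still RMSprop.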

In [None]:
model4 = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(96, 96, 3)),
            tf.keras.layers.Dropout(0.25),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
            tf.keras.layers.Dropout(0.25),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
            tf.keras.layers.Dropout(0.25),
            tf.keras.layers.MaxPooling2D(2, 2),
            tf.keras.layers.Dropout(0.25),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
            ])
model4.compile(optimizer=RMSprop(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
history4 = model4.fit(X_train, y_train, epochs=20, validation_split=.2)

In [None]:
# Plot training & validation accuracy values
plt.plot(history4.history['accuracy'])
plt.plot(history4.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

# Plot training & validation loss values
plt.plot(history4.history['loss'])
plt.plot(history4.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='best')
plt.show()

**Model result**

Model 4 is a clear improvement over the previous models: both accuracy and loss are converging.

**Predict Labels**

In [None]:
pred_test = model4.predict(test)
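
Because the last layer is a single sigmoid unit, `predict` returns one probability per image in an array of shape (n, 1). The sketch below uses made-up values in place of `pred_test` and shows how to flatten them and, if hard labels are wanted, threshold them. The 0.5 threshold is an assumption for illustration and is not taken from this notebook.

```python
import numpy as np

# Made-up probabilities standing in for model4.predict(test); shape (n, 1).
example_pred = np.array([[0.91], [0.08], [0.50], [0.37]])

probabilities = example_pred.ravel()              # shape (n,)
hard_labels = (probabilities >= 0.5).astype(int)  # 1 = cancer, 0 = no cancer

print(probabilities.shape)
print(hard_labels)
```

Whether to submit the probabilities or the hard labels depends on how the predictions are scored; this notebook writes the raw model outputs.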

# 4. Save Results

Save the predictions to a submission.csv file.

In [None]:
#Prepare Submission.csv file
# Strip the file extension (".tif") from each file name to get the image id
lst = [os.path.splitext(item)[0] for item in test_files]

test_df = pd.DataFrame(lst)
test_df.head()

In [None]:
#Create Submission.csv file
# model.predict returns shape (n, 1); flatten it to one value per image
predictions = np.array(pred_test).ravel()
test_df["label"] = predictions
test_df.columns = ["id", "label"]
submission = test_df

print(submission.head())
submission.to_csv("submission.csv", index = False, header = True)