# What is included in this Kernel?

* Problem Statement and Analysis of the Problem Statement
* Data Understanding
* Designing the Model
* Validation And Analysis
    * Metrics
    * Prediction and Activation Visualizations
    * ROC AND AUC
* Submission


# 1 

# a). Problem Statement

> ### Task - This is a BINARY IMAGE CLASSIFICATION problem. The goal is to identify the presence of metastases in 96px * 96px digital histopathology images.

> ### Metric Evaluation - Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target. 

<img src='https://i.stack.imgur.com/kqxaJ.png' style="width:500px;height:300px;">
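
To make the metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from its ranking definition: the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are my own assumptions for illustration, they are not part of the competition data.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via the pairwise-ranking definition.

    AUC equals the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one
    (ties count as half a win).
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

Because only the ordering of the scores matters, the metric rewards well-ranked probabilities rather than a particular decision threshold. The double loop is quadratic, so this is only meant for small examples.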


# b). Analysis of the Problem Statement

> ## What exactly does the problem statement convey?
> ### 1. The problem is a binary classification of images with a shape of 96px * 96px. It involves identifying metastases in these digital histopathology images.

> ### 2. One key challenge is that the metastases can be as small as single cells in a large area of tissue.


### The Histopathological Images:

### About the Domain: 
I do not know much about biology, so I made some notes on the following terms:
* Histopathology
* Lymphocytes
* Lymph Nodes

### So, let us see some of the biological terminologies involved 

### 1. Histopathology - Histopathology is the diagnosis and study of diseases of the tissues, and involves examining tissues and/or cells under a microscope. Histopathologists are responsible for making tissue diagnoses and helping clinicians manage a patient's care.

<img src = 'https://www.news-medical.net/image.axd?picture=2018%2F12%2FBy_Vshivkova-1.jpg' style="width:500px;height:300px;">


#

### 2. Lymphocytes - Lymphocytes are white blood cells that are also one of the body's main types of immune cells. They are made in the bone marrow and found in the blood and lymph tissue. The immune system is a complex network of cells known as immune cells that include lymphocytes.


<img src = 'https://healthmattersio.files.wordpress.com/2018/05/lymphocytes-healthmatters-io.png?w=1600&h=1200&crop=1' style="width:500px;height:300px;">


#
### 3. Lymph Nodes - Lymph nodes are small lumps of tissue that contain white blood cells, which fight infection. They filter lymph fluid, which is composed of fluid and waste products from your body tissues. Lymph nodes also help activate your immune system if you have an infection.




### So, now let us dive into the domain which involves Data Collection:


* #### The data that is provided to us for classification are the histopathological images. These images are glass slide microscope images of lymph nodes that are stained with hematoxylin and eosin (H&E). 
* #### Hematoxylin and eosin (H&E) is the most widely used stain in histology and allows localization of nuclei and extracellular proteins. Hematoxylin is not a dye itself; its oxidation product, hematein, binds (together with a mordant) to nuclear material, which makes nuclei appear blue. 
* #### Typically, nuclei are stained blue, whereas cytoplasm and extracellular parts are stained in various shades of pink.

* #### Lymph nodes are small glands that filter the fluid in the lymphatic system and they are the first place a breast cancer is likely to spread. 

* #### Histological assessment of lymph node metastases is part of determining the stage of breast cancer in TNM classification which is a globally recognized standard for classifying the extent of spread of cancer. 

### Links for Reference

* <a href='https://www.cancer.net/navigating-cancer-care/cancer-basics/what-metastasis'>What is Metastasis?</a>
* <a href='https://en.wikipedia.org/wiki/H%26E_stain'>Hematoxylin and eosin (H&E) Staining</a>
* <a href='https://www.medicalnewstoday.com/articles/320987#:~:text=Lymphocytes%20are%20white%20blood%20cells,immune%20cells%20that%20include%20lymphocytes.'>Lymphocytes</a>
* <a href='https://www.healthdirect.gov.au/lymph-nodes'>Lymph Nodes</a>
* <a href='https://en.wikipedia.org/wiki/TNM_staging_system'>TNM Classification</a>

# 2.  Data Understanding

* The dataset contains histopathological images; each image is 96px * 96px. 

* A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image.

* Kaggle says that :
                    'The original PCam dataset contains duplicate images due to its probabilistic sampling,
                    however, the version presented on Kaggle does not contain duplicates. We have otherwise 
                    maintained the same data and splits as the PCam benchmark.'
                   
* Also, the problem description states that the training data contains a **50/50** split between the two labels. On analysis, however, the split turned out to be closer to **60/40**, which we will take into account when designing the model.
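
The labelling rule for the center region can be sketched as follows. This is only an illustration: the competition data ships labels, not pixel-level tumor masks, so `tumor_mask` and `label_from_mask` are hypothetical names that I made up to show how a 32x32 center window sits inside a 96x96 patch.

```python
import numpy as np

PATCH = 96                      # patch side length in pixels
CENTER = 32                     # side length of the labelled center region
START = (PATCH - CENTER) // 2   # 32
END = START + CENTER            # 64


def label_from_mask(tumor_mask):
    """Return 1 if any tumor pixel lies in the central 32x32 region, else 0."""
    center = tumor_mask[START:END, START:END]
    return int(center.any())


# Hypothetical masks: one tumor pixel outside the center, one inside it
outer_only = np.zeros((PATCH, PATCH), dtype=bool)
outer_only[10, 10] = True

inside_center = np.zeros((PATCH, PATCH), dtype=bool)
inside_center[40, 50] = True

print(label_from_mask(outer_only), label_from_mask(inside_center))
```

The first mask gives a negative label even though the patch contains tumor tissue, which is exactly the case the description warns about.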




* ### **IS DATA RELEVANT TO THE PROBLEM ?**
> This dataset is a combination of two independent datasets collected in Radboud University Medical Center (Nijmegen, the Netherlands), and the University Medical Center Utrecht (Utrecht, the Netherlands). The slides are produced by routine clinical practices and a trained pathologist would examine similar images for identifying metastases.

* ### So, now let us move forward to design our model

# 3. Designing the Model  (Coding Part)

In [22]:
# Importing  Libraries
from numpy.random import seed
seed(101)

import pandas as pd
import numpy as np


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

import os
import cv2

from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import itertools
import shutil
import matplotlib.pyplot as plt
%matplotlib inline
tf.random.set_seed(101)

In [23]:
# Setting Some Pre-Requisites
IMAGE_SIZE=96
IMAGE_CHANNELS=3
SAMPLE_SIZE=80000         # We will sample 80,000 images from each label

In [24]:
# So, what are the files which are available?

os.listdir('../input/histopathologic-cancer-detection')

['sample_submission.csv', 'train_labels.csv', 'test', 'train']

In [25]:
# So, how many images are there in the train and test folders?

print(len(os.listdir('../input/histopathologic-cancer-detection/train')))
print(len(os.listdir('../input/histopathologic-cancer-detection/test')))

220025
57458


In [26]:
# Creating a dataframe of all the training images

df_data = pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv')

# removing this image because it caused a training error previously
df_data = df_data[df_data['id'] != 'dd6dfed324f9fcb6f93f46f32fc800f2ec196be2']

# removing this image because it's black
df_data = df_data[df_data['id'] != '9369c7278ec8bcc6c880d99194de09fc2bd4efbe']


print(df_data.shape)

(220023, 2)


In [27]:
print(df_data.head())

                                         id  label
0  f38a6374c348f90b587e046aac6079959adf3835      0
1  c18f2d887b7ae4f6742ee445113fa1aef383ed77      1
2  755db6279dae599ebb4d39a9123cce439965282d      0
3  bc3f0c64fb968ff4a8bd33af6971ecae77c75e08      0
4  068aba587a4950175d04c680d38943fd488d6a9d      0


In [28]:
print(df_data['id'])

0         f38a6374c348f90b587e046aac6079959adf3835
1         c18f2d887b7ae4f6742ee445113fa1aef383ed77
2         755db6279dae599ebb4d39a9123cce439965282d
3         bc3f0c64fb968ff4a8bd33af6971ecae77c75e08
4         068aba587a4950175d04c680d38943fd488d6a9d
                            ...                   
220020    53e9aa9d46e720bf3c6a7528d1fca3ba6e2e49f6
220021    d4b854fe38b07fe2831ad73892b3cec877689576
220022    3d046cead1a2a5cbe00b2b4847cfb7ba7cf5fe75
220023    f129691c13433f66e1e0671ff1fe80944816f5a2
220024    a81f84895ddcd522302ddf34be02eb1b3e5af1cb
Name: id, Length: 220023, dtype: object


In [29]:
df_data['label'].value_counts()

0    130907
1     89116
Name: label, dtype: int64
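
As a quick check of the 60/40 observation, the class shares can be computed directly from the counts printed above. This is just arithmetic on those two numbers, nothing else is assumed.

```python
# Label counts copied from the value_counts() output above
counts = {0: 130907, 1: 89116}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
print(shares)
```

The minority class has 89,116 images, so drawing 80,000 samples per label is possible without replacement.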

In [None]:
# source: https://www.kaggle.com/gpreda/honey-bee-subspecies-classification

def draw_category_images(col_name, figure_cols, df, IMAGE_PATH):
    
    """
    Given a column in a dataframe,
    this function takes a sample of each class and displays that
    sample on one row. The sample size is the same as figure_cols which
    is the number of columns in the figure.
    Because this function takes a random sample, each time the function is run it
    displays different images.
    """
    
    # sorted list of the distinct classes in the column
    categories = sorted(df[col_name].unique())
    f, ax = plt.subplots(nrows=len(categories), ncols=figure_cols, 
                         figsize=(4*figure_cols, 4*len(categories))) # adjust size here
    # draw a number of images for each class
    for i, cat in enumerate(categories):
        sample = df[df[col_name]==cat].sample(figure_cols) # figure_cols is also the sample size
        for j in range(0, figure_cols):
            file = IMAGE_PATH + sample.iloc[j]['id'] + '.tif'
            im = cv2.imread(file)
            # OpenCV loads images as BGR, matplotlib expects RGB
            im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
            ax[i, j].imshow(im)
            ax[i, j].set_title(cat, fontsize=16)  
    plt.tight_layout()
    plt.show()

In [None]:
IMAGE_PATH = '../input/histopathologic-cancer-detection/train/' 

draw_category_images('label',4, df_data, IMAGE_PATH)

In [30]:
# Create the Train and Validation Sets

df_0 = df_data[df_data['label']==0].sample(SAMPLE_SIZE,random_state=101)
df_1 = df_data[df_data['label']==1].sample(SAMPLE_SIZE,random_state=101)

# concat the dataframes
df_data = pd.concat([df_0, df_1], axis=0).reset_index(drop=True)
# shuffle
df_data = shuffle(df_data)

df_data['label'].value_counts()

1    80000
0    80000
Name: label, dtype: int64

In [11]:
from pathlib import Path


dir_path = Path('./base_dir')

try:
    dir_path.rmdir()
except OSError as e:
    print("Error: %s : %s" % (dir_path, e.strerror))


Error: base_dir : No such file or directory


In [31]:
# Now, for the train-test split

# stratify=y creates a balanced validation set.
y = df_data['label']

df_train, df_val = train_test_split(df_data, test_size=0.10, random_state=101, stratify=y)

print(df_train.shape)
print(df_val.shape)

(144000, 2)
(16000, 2)
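
To show what `stratify=y` does, here is a small pure-Python sketch of a stratified split. It is not the scikit-learn implementation; `stratified_split` is a hypothetical helper that splits each class separately so that the validation set keeps the class proportions of the full data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.10, seed=101):
    """Return (train_idx, val_idx) so that each class keeps its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Small balanced toy label list (the real data has 80,000 per class)
labels = [0] * 800 + [1] * 800
train_idx, val_idx = stratified_split(labels)

print(len(train_idx), len(val_idx))
print(sum(labels[i] for i in val_idx))  # positives in the validation set
```

With balanced input, half of the validation samples are positive, which mirrors the 8,000/8,000 validation split we expect from the 160,000 sampled images.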


In [32]:
# Create a new directory structure so that we can use the ImageDataGenerator
base_dir='base_dir'
os.mkdir(base_dir)

# now we create 2 folders inside 'base_dir':

# train_dir
    # a_no_tumor_tissue
    # b_has_tumor_tissue

# val_dir
    # a_no_tumor_tissue
    # b_has_tumor_tissue



# create a path to 'base_dir' to which we will join the names of the new folders
# train_dir
train_dir = os.path.join(base_dir, 'train_dir')

os.mkdir(train_dir)

# val_dir
val_dir = os.path.join(base_dir, 'val_dir')

os.mkdir(val_dir)



# [CREATE FOLDERS INSIDE THE TRAIN AND VALIDATION FOLDERS]
# Inside each folder we create separate folders for each class

# create new folders inside train_dir
no_tumor_tissue = os.path.join(train_dir, 'a_no_tumor_tissue')
os.mkdir(no_tumor_tissue)
has_tumor_tissue = os.path.join(train_dir, 'b_has_tumor_tissue')
os.mkdir(has_tumor_tissue)


# create new folders inside val_dir
no_tumor_tissue = os.path.join(val_dir, 'a_no_tumor_tissue')
#os.rmdir(no_tumor_tissue)
os.mkdir(no_tumor_tissue)
has_tumor_tissue = os.path.join(val_dir, 'b_has_tumor_tissue')
#os.rmdir(has_tumor_tissue)
os.mkdir(has_tumor_tissue)

FileExistsError: [Errno 17] File exists: 'base_dir'

In [33]:
# check that the folders have been created
os.listdir('base_dir/train_dir')

['b_has_tumor_tissue', 'a_no_tumor_tissue']

In [34]:
# Set the id as the index in df_data
df_data.set_index('id', inplace=True)

In [46]:
train_list = list(df_train['id'])
print(train_list)

['9d2f6bd5281f3aa2031057480f704f06c72a226d', '92848d6f956db07bece65d3c44cf9cb3f6237bea', '7544ffaef9ad0067543a029ff772c6920668e45b', '86a66236fe9c50c1e2b0595be256f90b87ce7cc7', 'a13b945354ec8b51679f6f93aba3624f6ad56b88', 'a1a518daf720c991d70ab15c5076976206612eda', '2c53117b3bf4fbb5491f71b033b164db08a9a593', '4269600ab0db6ca009fc28c11ad4f830e9794f4d', '8eabc80fce3a231915b93651f5e9238447d3db62', '2010414d2e2801ca8a92173e03de73665a88adda', '06c08558b6acdc3b3eaccd01c529b3107093d91c', 'b1bc51cd86df1d35dcabb61e60745d1d24a51890', '82baac45e9f5591d7e5dbcde93b684dd5515aea2', '54ad0c98de936aac6b492f27624eda2a8a329155', '4d4ae0809e14de58c28dc74538fe77ac0a381072', 'dc1a660e2d4814e069cda5565717ac969ef2c7b1', '82c8b4df3b6c709a12b6fd89bf4a34bc3b528599', 'a457722f0479538800196e24face566570e23c7f', '645b0c89c465e035195b34ed8da40333297dd1b7', 'a82286dfe44aa49a1220d25e35324fcf1a6f8103', '768da851dbed0c5a3e4f9ccab4c82e411adb09cd', '908ea67f330bf3bb18d8783fcfdc146e7f915efb', '0fd48d48366cdf78c946d970ce8fc4

In [48]:
# Get a list of train and val images
train_list = list(df_train['id'])
val_list = list(df_val['id'])

for image in train_list[:10]:
    fname = image+'.tif'
    target = df_data.loc[image]
    print(target)
#     if target == 0:
#         label = 'a_no_tumor_tissue'
#     if target == 1:
#         label = 'b_has_tumor_tissue'

label    0
Name: 9d2f6bd5281f3aa2031057480f704f06c72a226d, dtype: int64
label    0
Name: 92848d6f956db07bece65d3c44cf9cb3f6237bea, dtype: int64
label    0
Name: 7544ffaef9ad0067543a029ff772c6920668e45b, dtype: int64
label    1
Name: 86a66236fe9c50c1e2b0595be256f90b87ce7cc7, dtype: int64
label    1
Name: a13b945354ec8b51679f6f93aba3624f6ad56b88, dtype: int64
label    0
Name: a1a518daf720c991d70ab15c5076976206612eda, dtype: int64
label    0
Name: 2c53117b3bf4fbb5491f71b033b164db08a9a593, dtype: int64
label    1
Name: 4269600ab0db6ca009fc28c11ad4f830e9794f4d, dtype: int64
label    1
Name: 8eabc80fce3a231915b93651f5e9238447d3db62, dtype: int64
label    0
Name: 2010414d2e2801ca8a92173e03de73665a88adda, dtype: int64


In [None]:
# Get a list of train and val images
train_list = list(df_train['id'])
val_list = list(df_val['id'])


countt=1
# Transfer the train images

for image in train_list:
    
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif'
    # get the label for a certain image
    target = df_data.loc[image,'label'] # 0 or 1
    
    print(countt)
    # these must match the folder names
    if target == 0:
        label = 'a_no_tumor_tissue'
    if target == 1:
        label = 'b_has_tumor_tissue'
    
    # source path to image
    src = os.path.join('../input/histopathologic-cancer-detection/train', fname)
    # destination path to image
    dst = os.path.join(train_dir, label, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)
    countt=countt+1


# Transfer the val images
count=1
for image in val_list:
    
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif'
    # get the label for a certain image
    target = df_data.loc[image,'label']
    print(count)
    
    # these must match the folder names
    if target == 0:
        label = 'a_no_tumor_tissue'
    if target == 1:
        label = 'b_has_tumor_tissue'
    

    # source path to image
    src = os.path.join('../input/histopathologic-cancer-detection/train', fname)
    # destination path to image
    dst = os.path.join(val_dir, label, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)
    count=count+1
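
The two copy loops above follow the same pattern, so they could be folded into one helper. The sketch below shows the idea on throwaway files in a temporary directory; `copy_into_class_folders`, `CLASS_FOLDERS` and the image ids are names I made up for the example.

```python
import os
import shutil
import tempfile

# Folder names must match the class folders used by the generators
CLASS_FOLDERS = {0: 'a_no_tumor_tissue', 1: 'b_has_tumor_tissue'}


def copy_into_class_folders(id_to_label, src_dir, dst_dir):
    """Copy '<id>.tif' from src_dir into dst_dir/<class folder>/<id>.tif."""
    for image_id, label in id_to_label.items():
        class_dir = os.path.join(dst_dir, CLASS_FOLDERS[label])
        os.makedirs(class_dir, exist_ok=True)
        fname = image_id + '.tif'
        shutil.copyfile(os.path.join(src_dir, fname),
                        os.path.join(class_dir, fname))


# Demo on placeholder files; the ids below are made up
id_to_label = {'img_a': 0, 'img_b': 1, 'img_c': 1}
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, 'train')
    dst_dir = os.path.join(tmp, 'base_dir', 'train_dir')
    os.makedirs(src_dir)
    for image_id in id_to_label:
        with open(os.path.join(src_dir, image_id + '.tif'), 'wb') as f:
            f.write(b'placeholder')
    copy_into_class_folders(id_to_label, src_dir, dst_dir)
    layout = {d: sorted(os.listdir(os.path.join(dst_dir, d)))
              for d in sorted(os.listdir(dst_dir))}

print(layout)
```

Using `os.makedirs(..., exist_ok=True)` also means the helper can be re-run without hitting a `FileExistsError`.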

In [None]:
# check how many train images we have in each folder

print(len(os.listdir('base_dir/train_dir/a_no_tumor_tissue')))
print(len(os.listdir('base_dir/train_dir/b_has_tumor_tissue')))

In [None]:
# check how many val images we have in each folder

print(len(os.listdir('base_dir/val_dir/a_no_tumor_tissue')))
print(len(os.listdir('base_dir/val_dir/b_has_tumor_tissue')))

In [None]:
# Set up the generators
train_path = 'base_dir/train_dir'
valid_path = 'base_dir/val_dir'
test_path = '../input/histopathologic-cancer-detection/test'

num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


# Keras expects integer step counts, so cast the result of np.ceil
train_steps = int(np.ceil(num_train_samples / train_batch_size))
val_steps = int(np.ceil(num_val_samples / val_batch_size))
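
For reference, this is the arithmetic behind the step counts, using the split sizes printed earlier (144,000 training and 16,000 validation images). The helper name `steps_per_epoch` is my own.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


train_steps_example = steps_per_epoch(144000, 10)
val_steps_example = steps_per_epoch(16000, 10)
print(train_steps_example, val_steps_example)

# If the batch size does not divide the sample count,
# the last (smaller) batch still needs its own step.
print(steps_per_epoch(16000, 7))
```

Rounding up instead of down makes sure the last partial batch is not dropped.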

In [None]:
datagen = ImageDataGenerator(rescale=1.0/255)

train_gen = datagen.flow_from_directory(train_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='categorical')

val_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=val_batch_size,
                                        class_mode='categorical')

# Note: test_gen reads the validation folder again; shuffle=False keeps the
# predictions in the same order as test_gen.classes
test_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='categorical',
                                        shuffle=False)

### The model that I have chosen for this problem is taken from <a href = 'https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb'>Baseline Keras CNN</a>

In [None]:
kernel_size = (3,3)
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128

dropout_conv = 0.3
dropout_dense = 0.3


model = Sequential()
model.add(Conv2D(first_filters, kernel_size, activation = 'relu', input_shape = (96, 96, 3)))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(MaxPooling2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(2, activation = "softmax"))

model.summary()
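
To see where the feature-map sizes in the summary come from, the spatial dimensions can be traced by hand. The sketch below assumes the Keras defaults used above: 'valid' padding and stride 1 for the 3x3 convolutions, and 2x2 max pooling that rounds down. The helper names are my own.

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output size of max pooling with stride equal to the pool size."""
    return size // pool


size = 96
sizes_after_block = []
for filters in (32, 64, 128):
    for _ in range(3):          # three conv layers per block
        size = conv_out(size)
    size = pool_out(size)       # one pooling layer per block
    sizes_after_block.append((filters, size))

flat_features = size * size * 128
dense_params = flat_features * 256 + 256   # weights + biases of Dense(256)

print(sizes_after_block)
print(flat_features, dense_params)
```

So the 96x96 input shrinks to 6x6x128 before `Flatten()`, and most of the trainable parameters sit in the first dense layer.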

In [None]:
# `learning_rate` replaces the deprecated `lr` argument in TF 2.x
model.compile(Adam(learning_rate=0.0001), loss='binary_crossentropy', 
              metrics=['accuracy'])
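
A note on the loss: the model ends in a 2-unit softmax with one-hot labels, yet it is compiled with `binary_crossentropy`. As far as I can tell this gives the same value as categorical cross-entropy in the two-class case, because the two softmax outputs sum to 1. The sketch below checks this with plain Python on one made-up prediction; it mimics the per-sample formulas and is not the Keras code itself.

```python
import math


def categorical_crossentropy(y_onehot, probs):
    """-sum(y * log(p)) over the output units."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs))


def binary_crossentropy(y_onehot, probs):
    """Element-wise binary cross-entropy, averaged over the output units."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for y, p in zip(y_onehot, probs)]
    return sum(terms) / len(terms)


# One made-up sample: true class is index 1, softmax output sums to 1
y = [0, 1]
p = [0.3, 0.7]

cce = categorical_crossentropy(y, p)
bce = binary_crossentropy(y, p)
print(cce, bce)
```

Both values equal `-log(0.7)` here. With more than two classes the two losses differ, so `categorical_crossentropy` would be the safer choice there.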

In [None]:
# Get the labels that are associated with each index
print(val_gen.class_indices)

In [None]:
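filepath = "model.h5"
# With metrics=['accuracy'] in TF 2.x the logged key is 'val_accuracy'
# (the same key is read from history.history further down).
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, 
                             save_best_only=True, mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', factor=0.5, patience=2, 
                                   verbose=1, mode='max', min_lr=0.00001)

callbacks_list = [checkpoint, reduce_lr]

# Note: fit_generator is deprecated in recent TF 2.x releases,
# where model.fit accepts generators directly.
history = model.fit_generator(train_gen, steps_per_epoch=train_steps, 
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    epochs=10, verbose=1,
                   callbacks=callbacks_list)

In [None]:
# get the metric names so we can use evaluate_generator
model.metrics_names

In [None]:
# Load the weights of the best epoch saved by ModelCheckpoint,
# otherwise the weights of the last epoch would be evaluated.
model.load_weights('model.h5')

val_loss, val_acc = \
model.evaluate_generator(test_gen, 
                        steps=len(df_val))

print('val_loss:', val_loss)
print('val_acc:', val_acc)

In [None]:
# Save the model (it now holds the best-epoch weights)
model.save('model.h5')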
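
The `ReduceLROnPlateau` callback used during training halves the learning rate when the monitored metric stops improving. Here is a simplified simulation of that rule with the settings from above (start at 0.0001, `factor=0.5`, `patience=2`, `min_lr=0.00001`). It ignores details such as `min_delta` and `cooldown`, the function name is my own, and the validation accuracies are made up.

```python
def simulate_reduce_lr(val_scores, lr=1e-4, factor=0.5, patience=2, min_lr=1e-5):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float('-inf')
    wait = 0
    lrs = []
    for score in val_scores:
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs


# Made-up validation accuracies: improvement stalls for two epochs
val_scores = [0.80, 0.85, 0.84, 0.83, 0.86]
lrs = simulate_reduce_lr(val_scores)
print(lrs)
```

In this toy run the rate drops from 0.0001 to 0.00005 after the second epoch without improvement and stays there once the score improves again.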

In [None]:
# display the loss and accuracy curves

import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.figure()

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()

# 4. Validation and Analysis 

* ### Metrics
* ### Prediction and Activation Visualizations
* ### ROC and AUC

In [None]:
# make a prediction
predictions = model.predict_generator(test_gen, steps=len(df_val), verbose=1)

In [None]:
predictions.shape

In [None]:
#Save the last model
#model.save('../input/model.h5')

In [None]:
# This is how to check what index keras has internally assigned to each class. 
test_gen.class_indices

In [None]:
# Put the predictions into a dataframe.
# The columns need to be ordered to match the output of the previous cell

df_preds = pd.DataFrame(predictions, columns=['no_tumor_tissue', 'has_tumor_tissue'])

df_preds.head()


In [None]:
# Get the true labels
y_true = test_gen.classes

# Get the predicted labels as probabilities
y_pred = df_preds['has_tumor_tissue']

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_true, y_pred)

In [None]:
# Get the labels of the test images.

test_labels = test_gen.classes
test_labels.shape

In [None]:
# argmax returns the index of the max value in a row
cm = confusion_matrix(test_labels, predictions.argmax(axis=1))
# Print the label associated with each class
test_gen.class_indices
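
Once the confusion matrix is available, the usual metrics follow from its four cells. The sketch below uses the scikit-learn layout (rows are true labels, columns are predicted labels) and a made-up matrix for a balanced validation set of 16,000 images. These numbers are not results from this notebook.

```python
# Hypothetical confusion matrix, NOT a result of this notebook:
# rows = true label (0, 1), columns = predicted label (0, 1)
cm_example = [[7400, 600],
              [800, 7200]]

tn, fp = cm_example[0]
fn, tp = cm_example[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # how many predicted tumors are real
recall = tp / (tp + fn)         # how many real tumors were found
specificity = tn / (tn + fp)    # how many healthy patches were recognised

print(accuracy, round(precision, 4), recall, specificity)
```

For a cancer-detection task, recall deserves particular attention because a false negative means a missed metastasis.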

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# Delete base_dir and its subfolders to free up disk space.

shutil.rmtree('base_dir')

# [CREATE A TEST FOLDER DIRECTORY STRUCTURE]

# We will be feeding test images from a folder into predict_generator().
# Keras requires that the path should point to a folder containing images and not
# to the images themselves. That is why we are creating a folder (test_images) 
# inside another folder (test_dir).

# test_dir
    # test_images

# create test_dir
test_dir = 'test_dir'
os.mkdir(test_dir)
    
# create test_images inside test_dir
test_images = os.path.join(test_dir, 'test_images')
os.mkdir(test_images)
# check that the directory we created exists
os.listdir('test_dir')

In [None]:
# Transfer the test images into test_images

test_list = os.listdir('../input/histopathologic-cancer-detection/test')

for image in test_list:
    
    fname = image
    
    # source path to image
    src = os.path.join('../input/histopathologic-cancer-detection/test', fname)
    # destination path to image
    dst = os.path.join(test_images, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)
# check that the images are now in the test_images
# Should now be 57458 images in the test_images folder

len(os.listdir('test_dir/test_images'))

In [None]:
test_path ='test_dir'


# Here we change the path to point to the test_images folder.

test_gen = datagen.flow_from_directory(test_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='categorical',
                                        shuffle=False)

In [None]:
num_test_images = 57458



predictions = model.predict_generator(test_gen, steps=num_test_images, verbose=1)

In [None]:
# Are the number of predictions correct?
# Should be 57458.

len(predictions)

In [None]:
# Put the predictions into a dataframe

df_preds = pd.DataFrame(predictions, columns=['no_tumor_tissue', 'has_tumor_tissue'])

df_preds.head()

In [None]:
# This outputs the file names in the sequence in which 
# the generator processed the test images.
test_filenames = test_gen.filenames

# add the filenames to the dataframe
df_preds['file_names'] = test_filenames

df_preds.head()

In [None]:
# Create an id column

# A file name now has this format: 
# test_images/00006537328c33e284c973d7b39d340809f7271b.tif

# This function will extract the id:
# 00006537328c33e284c973d7b39d340809f7271b


def extract_id(x):
    
    # split into a list
    a = x.split('/')
    # split into a list
    b = a[1].split('.')
    extracted_id = b[0]
    
    return extracted_id

df_preds['id'] = df_preds['file_names'].apply(extract_id)

df_preds.head()
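
As an alternative to splitting the string by hand, the id can also be extracted with `os.path`, which does not depend on how many folders the path contains. This is a sketch of the same idea, and the name `extract_id_from_path` is my own.

```python
import os


def extract_id_from_path(path):
    """Return the file name without its folder and extension."""
    return os.path.splitext(os.path.basename(path))[0]


example = 'test_images/00006537328c33e284c973d7b39d340809f7271b.tif'
print(extract_id_from_path(example))
```

It can be applied in the same way as before, for example with `df_preds['file_names'].apply(extract_id_from_path)`.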

In [None]:
# Get the predicted labels.
# We were asked to predict a probability that the image has tumor tissue
y_pred = df_preds['has_tumor_tissue']

# get the id column
image_id = df_preds['id']


# 5. Submission

In [None]:
submission = pd.DataFrame({'id':image_id, 
                           'label':y_pred, 
                          }).set_index('id')

submission.to_csv('patch_preds.csv', columns=['label']) 
submission.head()

In [None]:
# Delete the test_dir directory we created to prevent a Kaggle error.
# Kaggle allows a max of 500 files to be saved.

shutil.rmtree('test_dir')

# Confusion Matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

ax = plt.subplot()
# annot=True writes the count in each cell; fmt='d' prints it as an integer
sns.heatmap(cm, annot=True, fmt='d', ax=ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

## I hope that you found my notebook useful. It took me 2 weeks to analyze the problem and then perform the coding. So, if you enjoyed the notebook, please leave an upvote. Thanks a lot for reading!!