
# CNN Cancer Detection Kaggle Mini-Project

The goal of this project is to classify images into two groups (cancer and non-cancer), or to predict the probability that each image shows cancer.

GitHub: https://github.com/hrkmr-tech/kaggle_cancer_detection

## Importing Libraries
First, we will import the libraries used in this project.

In [1]:
from glob import glob 
import numpy as np
import pandas as pd
import keras, cv2, os

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, BatchNormalization, Activation
from keras.layers import Conv2D, MaxPool2D

from tqdm import tqdm_notebook
import matplotlib.pyplot as plt

import gc

## Loading the Data

Now, we will collect the paths of the training images. The id of each sample is taken from its filename.

In [2]:
train_path = '../input/train/'
test_path = '../input/test/'
label_path = '../input/train_labels.csv'

df = pd.DataFrame({'path': glob(os.path.join(train_path,'*.tif'))})
df['id'] = df.path.map(lambda x: x.split('/')[3].split(".")[0])
labels = pd.read_csv(label_path)

print("the number of samples is: {}".format(df.shape[0]))
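
Taking the id from a fixed position in the split path only works while `train_path` keeps the same directory depth. As a minimal sketch, assuming the id is simply the file name without its extension, the same result can be obtained with `os.path`. `path_to_id` is a hypothetical helper and the sample path is made up.

```python
import os

def path_to_id(path):
    """Return the file name without directory or extension (hypothetical helper)."""
    return os.path.splitext(os.path.basename(path))[0]

# Made-up path for illustration only.
print(path_to_id('../input/train/abc123.tif'))
```

It could be used as `df['id'] = df.path.map(path_to_id)`.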

## Merging the labels and ids

The labels and ids are combined here so that we can use them to train a model.

In [3]:
# merge all id, path and label
df = df.merge(labels, on = "id")
df.head()
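
`merge` performs an inner join by default, so a file without a label row (or a label without a file) is dropped silently. The sketch below, with made-up ids, shows one way to check for such rows using an outer join with `indicator=True`. It is an optional sanity check, not a step this notebook performs.

```python
import pandas as pd

# Made-up data: 'c' has no label and 'd' has no file.
files = pd.DataFrame({'id': ['a', 'b', 'c'],
                      'path': ['a.tif', 'b.tif', 'c.tif']})
labels_demo = pd.DataFrame({'id': ['a', 'b', 'd'], 'label': [0, 1, 1]})

# Inner join (the default) keeps only ids present on both sides.
merged = files.merge(labels_demo, on='id')

# Outer join with an indicator column exposes the rows that did not match.
check = files.merge(labels_demo, on='id', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']

print(len(merged), sorted(unmatched['id']))
```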

## Loading the images

Now, we will load some images to explore the data. 

In [4]:
# Load 10,000 images
N=10000

X = np.zeros([N,96,96,3],dtype=np.uint8) 
# as_matrix() was removed in pandas 1.0; .values returns the same array
y = df['label'].values[0:N]
for i, row in tqdm_notebook(df.iterrows(), total=N):
    if i == N:
        break
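    # reading images (cv2 loads BGR; convert to RGB so that plots and
    # the R/G/B histograms below use the expected channel order)
    X[i] = cv2.cvtColor(cv2.imread(row['path']), cv2.COLOR_BGR2RGB)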
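
OpenCV's `imread` returns channels in BGR order, while matplotlib's `imshow` and the R/G/B labels in the plots assume RGB. Here is a minimal sketch of what the channel swap does, using a made-up one-pixel image and plain NumPy instead of `cv2.cvtColor`.

```python
import numpy as np

# A 1x1 "image" holding a pure red pixel as OpenCV would store it (B, G, R).
bgr = np.array([[[0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis turns BGR into RGB.
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # [255, 0, 0]
```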

## Exploratory Data Analysis (EDA)

We will do three things for EDA. 

1. Showing some images
1. Checking the ratio of classes
1. Exploring the characteristics

### Showing some images

Some images are shown to check what kinds of images are in the data.

In [5]:
def convert_to_label(val):
    return 'cancer' if val == 1 else 'not cancer'

# pick up some images
idx = [0, 100, 200, 300, 400]
for i in idx:
    plt.imshow(X[i])
    plt.title(convert_to_label(y[i]))
    plt.show()

### Checking the ratio of classes

We will check the ratio of classes. If the classes are severely imbalanced, we have to correct for that in order to train a better model.

In [6]:
# I found this nice shorthand here. https://stackoverflow.com/a/37060037
positive_numbers = (y==1).sum()
negative_numbers = (y==0).sum()


plt.bar([0,1], [negative_numbers, positive_numbers]) #plot a bar chart of the label frequency
plt.xticks([0,1],["Negative (N={})".format(negative_numbers),"Positive (N={})".format(positive_numbers)])
plt.ylabel("the number of samples")

The ratio is around 60:40. This is not much of a problem because the number of samples at hand is relatively large (220,025 samples).
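
If the imbalance were more severe, one common correction is to weight each class inversely to its frequency. The sketch below uses made-up labels with a 60:40 split. `class_weights` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import numpy as np

def class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = np.bincount(y, minlength=2)
    return {cls: len(y) / (2.0 * n) for cls, n in enumerate(counts)}

# Made-up labels: 6 negatives and 4 positives.
y_demo = np.array([0] * 6 + [1] * 4)
weights = class_weights(y_demo)

print(weights)
```

A dictionary of this shape is what Keras accepts through the `class_weight` argument of `fit`.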

### Exploring the characteristics

We will split the data into the two classes to compare the characteristics of each class.

In [7]:
positive_samples = X[y == 1]
negative_samples = X[y == 0]

Now, we will compare the distribution of pixel values in each of the R, G, and B channels.

In [8]:
nr_of_bins = 256
fig,axs = plt.subplots(3,2,sharey=True,figsize=(8,8),dpi=150)

#RGB channels
axs[0,0].hist(negative_samples[:,:,:,0].flatten(),bins=nr_of_bins,density=True)
axs[0,1].hist(positive_samples[:,:,:,0].flatten(),bins=nr_of_bins,density=True)
axs[1,0].hist(negative_samples[:,:,:,1].flatten(),bins=nr_of_bins,density=True)
axs[1,1].hist(positive_samples[:,:,:,1].flatten(),bins=nr_of_bins,density=True)
axs[2,0].hist(negative_samples[:,:,:,2].flatten(),bins=nr_of_bins,density=True)
axs[2,1].hist(positive_samples[:,:,:,2].flatten(),bins=nr_of_bins,density=True)

#Set image labels
axs[0,0].set_title("Negative samples (N = {})".format(negative_samples.shape[0]))
axs[0,1].set_title("Positive samples (N = {})".format(positive_samples.shape[0]))
axs[0,1].set_ylabel("R",rotation='horizontal',labelpad=35,fontsize=12)
axs[1,1].set_ylabel("G",rotation='horizontal',labelpad=35,fontsize=12)
axs[2,1].set_ylabel("B",rotation='horizontal',labelpad=35,fontsize=12)
for i in range(3):
    axs[i,0].set_ylabel("frequency")
axs[2,1].set_xlabel("the value of color element")
axs[2,0].set_xlabel("the value of color element")
fig.tight_layout()

From the histograms, we can see that:
* the negative samples have a higher frequency of large values in each of the color channels.

In [9]:
nr_of_bins = 64 #we use a bit fewer bins to get a smoother image
fig,axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
axs[0].hist(np.mean(negative_samples,axis=(1,2,3)),bins=nr_of_bins,density=True)
axs[1].hist(np.mean(positive_samples,axis=(1,2,3)),bins=nr_of_bins,density=True)
axs[0].set_title("Negative samples")
axs[1].set_title("Positive samples")
axs[0].set_xlabel("mean brightness")
axs[1].set_xlabel("mean brightness")
axs[0].set_ylabel("frequency")
axs[1].set_ylabel("frequency")

There is a pretty obvious difference between the positive and negative samples:
* the distribution of the mean brightness of the negative samples has two peaks, while that of the positive samples has one peak
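
The "mean brightness" of an image is the average over its height, width, and channel axes, which leaves one value per image. Here is a small sketch on synthetic data (random dark and bright images, not the real dataset).

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-ins: 4 dark and 4 bright images of shape 96x96x3.
dark = rng.randint(0, 100, size=(4, 96, 96, 3), dtype=np.uint8)
bright = rng.randint(150, 256, size=(4, 96, 96, 3), dtype=np.uint8)
images = np.concatenate([dark, bright])

# Averaging over axes 1, 2, 3 keeps only the image axis.
brightness = images.mean(axis=(1, 2, 3))

print(brightness.shape)
```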

In [10]:
# memory release
positive_samples = None
negative_samples = None
gc.collect()

## Creating the model

### Defining the Model
Here is the definition of our model. We'll try to achieve 80% accuracy on the validation data.

In [11]:
# the definition of hyperparameters
kernel_size = (3,3)
pool_size= (2,2)
filter_1 = 32
filter_2 = 64
filter_3 = 128
dropout_conv = 0.3
dropout_dense = 0.5

# creating the model
model = Sequential()

# 1st convolution layer
model.add(Conv2D(filter_1, kernel_size, input_shape = (96, 96, 3)))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

# 2nd convolution layer
model.add(Conv2D(filter_2, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

# 3rd convolution layer
model.add(Conv2D(filter_3, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

# dense layer
model.add(Flatten())
model.add(Dense(256, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(dropout_dense))

# output: probability of the positive class (between 0 and 1)
model.add(Dense(1, activation = "sigmoid"))

model.compile(loss=keras.losses.binary_crossentropy,
              optimizer=keras.optimizers.Adam(0.001), 
              metrics=['accuracy'])

model.summary()
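
To see where the size of the flattened layer comes from, the spatial size can be traced by hand. This sketch assumes the Keras defaults used in the model definition (`'valid'` padding and stride 1 for `Conv2D`, stride equal to `pool_size` for `MaxPool2D`). `conv_out` and `pool_out` are hypothetical helpers.

```python
def conv_out(size, kernel=3):
    """Output size of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of non-overlapping max pooling (floor division)."""
    return size // pool

size = 96
for filters in (32, 64, 128):
    size = pool_out(conv_out(size))
    print("after block with {} filters: {}x{}".format(filters, size, size))

# The last block has 128 feature maps of size x size each.
flat = size * size * 128
print("flattened features:", flat)
```

The sizes go 96 → 47 → 22 → 10, so `Flatten` produces 10 × 10 × 128 = 12,800 features.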

### Loading all the train data
We will train a model with all the data. As we checked before, the imbalance between the classes is not a big problem because of the large number of samples.

In [12]:
N = df.shape[0] 

X = np.zeros([N,96,96,3],dtype=np.uint8) 
# as_matrix() was removed in pandas 1.0; .values returns the same array
y = df['label'].values[0:N]
for i, row in tqdm_notebook(df.iterrows(), total=N):
    if i == N:
        break
    X[i] = cv2.imread(row['path'])
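
Holding every image in a single `uint8` array takes a fair amount of memory. Here is a rough estimate, assuming the 220,025 samples mentioned earlier and 96×96×3 images.

```python
n_images = 220025               # sample count quoted earlier in the notebook
bytes_per_image = 96 * 96 * 3   # one byte per uint8 channel value

total_bytes = n_images * bytes_per_image
total_gib = total_bytes / 2 ** 30

print("{} bytes, about {:.2f} GiB".format(total_bytes, total_gib))
```

The same array stored as `float32` would need four times as much, which is one reason to keep the raw data as `uint8`.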

### Training the Model
Finally, we will train the model.

* batch_size: `50`
* epochs: `3`
* validation_split: `0.2`

In [13]:
batch_size = 50
epochs = 3
validation_split = 0.2

# hold out the last 20% of the samples for validation
val_count = int(np.round(validation_split * y.shape[0]))
train_count = y.shape[0] - val_count

for epoch in range(epochs):
    # train only on the first 80% of the samples
    iterations = train_count // batch_size
    loss = 0
    acc = 0
    for i in range(iterations):
        start_idx = i * batch_size
        x_batch = X[start_idx:start_idx+batch_size]
        y_batch = y[start_idx:start_idx+batch_size]
        
        # training a model
        stat = model.train_on_batch(x_batch, y_batch)

        loss = loss + stat[0]
        acc = acc + stat[1]

    # average over the batches of this epoch
    print("train loss: {}".format(loss / iterations))
    print("train acc: {}".format(acc / iterations))

# validate on the held-out samples only, starting after the training range
iterations = val_count // batch_size
loss = 0
acc = 0
for i in range(iterations):
    start_idx = train_count + i * batch_size
    x_batch = X[start_idx:start_idx+batch_size]
    y_batch = y[start_idx:start_idx+batch_size]
    
    # test
    stat = model.test_on_batch(x_batch, y_batch)

    loss = loss + stat[0]
    acc = acc + stat[1]
        
print("validation loss: {}".format(loss / iterations))
print("validation acc: {}".format(acc / iterations))
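
For the validation numbers to be meaningful, training and validation batches must come from disjoint index ranges. The sketch below generates such ranges, assuming the last `validation_split` fraction of the samples is held out. `batch_ranges` is a hypothetical helper and the sample count is made up.

```python
def batch_ranges(start, stop, batch_size):
    """Yield (begin, end) index pairs of full batches within [start, stop)."""
    for begin in range(start, stop - batch_size + 1, batch_size):
        yield begin, begin + batch_size

n_samples = 1000                       # made-up dataset size
val_count = int(round(0.2 * n_samples))
train_count = n_samples - val_count

train_batches = list(batch_ranges(0, train_count, 50))
val_batches = list(batch_ranges(train_count, n_samples, 50))

print(len(train_batches), len(val_batches))
print(train_batches[-1], val_batches[0])
```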

In [14]:
# memory release
X = None
y = None
gc.collect()

## Submission
Finally, we will predict the labels of the test data and write the result to a submission file.

In [15]:
test_files = glob(os.path.join(test_path,'*.tif'))

batch_size = 3000
max_idx = len(test_files)

submission = pd.DataFrame()
for idx in range(0, max_idx, batch_size):
    test_data = pd.DataFrame({'path': test_files[idx:idx+batch_size]})
    # get id from the filename
    test_data['id'] = test_data['path'].map(lambda x: x.split('/')[3].split(".")[0])
    test_data['image'] = test_data['path'].map(cv2.imread)
    K_test = np.stack(test_data["image"].values)
    # prediction
    # predict returns shape (n, 1); flatten it to one value per row
    test_data['label'] = model.predict(K_test,verbose = 1).ravel()
    submission = pd.concat([submission, test_data[["id", "label"]]])

# showing the result
submission.head()

In [16]:
submission.to_csv("submission.csv", index = False, header = True)
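
The chunked prediction pattern can be tried out without the model or the images. In this sketch `fake_predict` is a made-up stand-in for `model.predict` that returns one probability per row with shape `(n, 1)`, and the ids are invented.

```python
import numpy as np
import pandas as pd

def fake_predict(batch):
    """Stand-in for model.predict: constant probability, shape (n, 1)."""
    return np.full((len(batch), 1), 0.5)

demo_ids = ['id{}'.format(i) for i in range(7)]   # invented ids
chunk_size = 3

parts = []
for start in range(0, len(demo_ids), chunk_size):
    chunk = demo_ids[start:start + chunk_size]
    probs = fake_predict(chunk).ravel()           # flatten (n, 1) to (n,)
    parts.append(pd.DataFrame({'id': chunk, 'label': probs}))

# Concatenating once at the end avoids growing a DataFrame inside the loop.
demo_submission = pd.concat(parts, ignore_index=True)
print(demo_submission.shape)
```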