# Histopathologic Cancer Detection

### CSE4020-Machine Learning<br>Faculty: Dr. Syed Ibrahim<br>Slot: C1
### By:&ensp;Prishita Kapoor-17EC035

### Dataset Description
In this dataset, we are provided with a large number of small pathology images to classify. Files are named with an image id. The train_labels.csv file provides the ground truth for the images in the train folder. We are predicting the labels for the images in the test folder. A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image.
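
To make the labelling rule concrete, the snippet below is a minimal sketch of how the labelled center region could be cut out of a patch. It assumes 96x96px, 3-channel patches (the shape allocated later in this notebook); the helper name `center_region` and the all-zero stand-in image are illustrative only and are not part of the original kernel.

```python
import numpy as np

PATCH_SIZE = 96   # assumed full patch edge length in pixels
CENTER_SIZE = 32  # edge length of the labelled center region

def center_region(patch, size=CENTER_SIZE):
    """Return the central size x size window of an (H, W, C) image array."""
    h, w = patch.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return patch[top:top + size, left:left + size]

# Hypothetical stand-in for one image; real patches come from the .tif files.
patch = np.zeros((PATCH_SIZE, PATCH_SIZE, 3), dtype=np.uint8)
patch[32, 32] = 255  # mark the top-left pixel of the center region
center = center_region(patch)
print(center.shape)  # (32, 32, 3)
```

Only pixels inside this window decide the label; the surrounding 32px border gives the network context.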

### Methodology
We approach this problem with Convolutional Neural Networks. A Convolutional Neural Network (ConvNet/CNN) is a deep learning model that takes an input image, assigns importance (learnable weights and biases) to various aspects or objects in the image, and learns to differentiate one from another. By applying learned filters, a ConvNet captures the spatial dependencies between neighbouring pixels in an image. Compared with a fully connected network, the architecture fits image data better because each filter's weights are reused at every position, which greatly reduces the number of parameters involved.
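
The effect of weight sharing can be illustrated with a little arithmetic. The sketch below compares a 3x3 convolution with 64 filters on a 3-channel input against a hypothetical fully connected layer that maps the same 96x96x3 input to the same 94x94x64 output. The helper names are ours, and the dense layer is only a thought experiment, not something used in this notebook.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters, use_bias=True):
    """Trainable parameters of a 2D convolution: one small kernel per filter."""
    weights = kernel_h * kernel_w * in_channels * filters
    return weights + (filters if use_bias else 0)

def dense_params(n_inputs, n_outputs, use_bias=True):
    """Trainable parameters of a fully connected layer: one weight per input/output pair."""
    return n_inputs * n_outputs + (n_outputs if use_bias else 0)

conv = conv2d_params(3, 3, 3, 64)
dense = dense_params(96 * 96 * 3, 94 * 94 * 64)

print(conv)           # 1792
print(dense // conv)  # how many times more parameters the dense layer would need
```

The convolution needs 1792 parameters regardless of image size, whereas the dense layer's count grows with the product of input and output sizes.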


In [1]:
#Load the modules
from glob import glob 
import numpy as np
import pandas as pd
import keras,cv2,os
from keras.callbacks import TensorBoard
import gc

Using TensorFlow backend.


In [2]:
path = "../input/"
train_path = path + 'train/'
test_path = path + 'test/'

df = pd.DataFrame({'path': glob(os.path.join(train_path,'*.tif'))}) # load the filenames
df['id'] = df.path.map(lambda x: x.split('/')[3].split(".")[0]) # keep only the file names in 'id'
labels = pd.read_csv(path+"train_labels.csv") # read the provided labels
df = df.merge(labels, on = "id") # merge labels and filepaths
df.head(3) # print the first three entries

Unnamed: 0,path,id,label
0,../input/train/f46f19fc90347d350431da5bfcf955d...,f46f19fc90347d350431da5bfcf955d9c1418b43,1
1,../input/train/330c56d7a3a1a808d711386c136b874...,330c56d7a3a1a808d711386c136b874a87081526,0
2,../input/train/b7b8babd812d5edbad7dd9b155ee29f...,b7b8babd812d5edbad7dd9b155ee29fbede4ab81,0
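
The id extraction above relies on the path having exactly three `/` separators before the file name. As a hedged alternative, the sketch below derives the id with `os.path`, which does not depend on the directory depth. The helper name `image_id` is hypothetical; the example path is built from the first id shown in the table.

```python
import os

def image_id(path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]

example = "../input/train/f46f19fc90347d350431da5bfcf955d9c1418b43.tif"
print(image_id(example))  # f46f19fc90347d350431da5bfcf955d9c1418b43
```

It could be dropped into the notebook as `df.path.map(image_id)` if the input folder is ever moved.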


<br>A function **_load_data_** is created to load the images as a NumPy array.<br>
We use OpenCV to convert each image into its matrix representation.
The function cv2.imread() reads an image from disk and returns it as an array.<br><br>

In [3]:
from tqdm import tqdm_notebook # progress bar; imported here so the function can run on its own

def load_data(N,df):
    # allocate a numpy array for the images (N, 96x96px, 3 channels, values 0 - 255)
    X = np.zeros([N,96,96,3],dtype=np.uint8) 
    # convert the labels to a numpy array too (.values replaces the deprecated as_matrix)
    y = df['label'].values[0:N]
    # read images one by one; tqdm_notebook displays a progress bar
    for i, row in tqdm_notebook(df.iterrows(), total=N):
        if i == N:
            break
        X[i] = cv2.imread(row['path'])
    return X,y
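
To see why memory matters here, the following back-of-the-envelope sketch estimates the size of the image array allocated by `load_data`. It assumes all 220025 training images are loaded (176020 training plus 44005 validation samples, as reported in the training log further down); the helper name `array_bytes` is ours.

```python
def array_bytes(n_images, height=96, width=96, channels=3, bytes_per_value=1):
    """Size in bytes of a dense image array; uint8 uses 1 byte per value."""
    return n_images * height * width * channels * bytes_per_value

n_images = 220025  # assumed: every labelled training image is loaded
total = array_bytes(n_images)
print(total)                      # 6083251200
print(round(total / 1024**3, 2))  # 5.67 (GiB)
```

Storing the same data as float32 would take four times as much, which is one reason to keep the array as `uint8`.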

<br>We use **gc** to release the dataframe held in RAM.<br>
This frees up memory; otherwise a _MemoryError_ can occur on systems with little RAM.<br><br>

In [4]:
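from tqdm import tqdm_notebook,trange
N = df["path"].size # get the number of images in the training data set
X, y = load_data(N, df) # load images and labels (call inferred: X and y are used in the cells below)
df = None # drop the reference to the dataframe so it can be collected
gc.collect(); # garbage collector for memory management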


In [5]:
#Importing necessary libraries
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

<br>We split the training data  **_80:20_**  into a training set and a validation set. <br>
Before splitting, we shuffle the data randomly so that the order of the files does not bias either set.
<br><br>

In [6]:
from sklearn.utils import shuffle
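X, y = shuffle(X, y, random_state=0)
train_pct_index = int(0.8 * len(X)) # index separating the first 80% (train) from the last 20% (validation)
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]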
X=None
gc.collect()
y=None
gc.collect()

0
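
The shuffle-then-slice idea can be checked on toy data. The sketch below uses a NumPy permutation instead of `sklearn.utils.shuffle` (a swapped-in but equivalent technique) and verifies that images and labels stay aligned after shuffling. The function name `shuffle_and_split` and the demo arrays are illustrative assumptions.

```python
import numpy as np

def shuffle_and_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle X and y with the same permutation, then slice into train/validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    split = int(train_fraction * len(X))
    return X[:split], X[split:], y[:split], y[split:]

# Toy data: each "image" is a single number, its label is that number modulo 2.
X_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10) % 2
X_tr, X_val, y_tr, y_val = shuffle_and_split(X_demo, y_demo)
print(len(X_tr), len(X_val))  # 8 2
```

Because one permutation indexes both arrays, every label still belongs to its image after the split.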

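<br>Here we define the kernel size, the pool size and the number of filters used in each layer.<br>
**kernel size:** the height and width of the convolutional filter<br>
**pool size:** the height and width of the window over which max pooling takes the maximum<br><br>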

In [7]:
kernel_size = (3,3)
pool_size= (2,2)
first_filters = 64
second_filters = 128
third_filters = 256

### Dropout Rate
Here we define the dropout rate for the convolutional layers as well as for the dense layer.<br>
<br>
Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.
<br><br>
This in turn results in a network that is capable of better generalization and is **less likely to overfit the training data**.<br>

In [8]:
dropout_conv = 0.2 #dropout rate in convolution layer
dropout_dense = 0.2 #dropout rate in dense layer
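
As an illustration of what a dropout rate of 0.2 does during training, here is a minimal NumPy sketch of "inverted" dropout, where the surviving activations are scaled up by 1/(1 - rate) so that their expected value is unchanged. This is a simplified stand-in for the Keras layer, not its actual implementation, and the function name is ours.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero a random fraction `rate` of x and rescale the rest; identity at inference."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate  # True for units that survive
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((4, 5))
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)  # entries are either 0.0 (dropped) or 1.25 (kept and rescaled)
```

At prediction time the layer passes its input through unchanged, which the `training=False` branch mimics.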

### Creating a model and adding layers to the model<br>
**Conv2D:** This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs.<br><br>
**Batch Normalization:** Normalize the activations of the previous layer at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.<br>
<br>
**Activation:** Define the activation function for the Layer<br>
<br>
**MaxPool2D:** Max pooling operation for spatial data. It downsamples each feature map by keeping only the maximum of every pooling window, which reduces the computation required in the following layers. <br>
<br>
**Dropout:** Applies Dropout to the input. Dropout consists of randomly setting a fraction (the rate) of input units to 0 at each update during training, which helps prevent overfitting.<br>
<br>
**Flatten:** A flatten operation on a tensor reshapes the tensor to have a shape that is equal to the number of elements contained in the tensor. This is the same thing as a 1d-array of elements.<br>
<br>
**Dense:** A Dense layer feeds all outputs from the previous layer to all its neurons, each neuron providing one output to the next layer. For example, a Dense(512) has 512 neurons.
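
Two of the layers described above are easy to reproduce by hand. The sketch below implements 2x2 max pooling on a single feature map and a bare-bones batch normalization (without the learnable scale and shift, and without the moving averages a real layer keeps for inference). Both functions are simplified illustrations under those assumptions; the value `eps=1e-3` mirrors the Keras default.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array; an odd trailing row/column is dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def batch_norm(x, eps=1e-3):
    """Normalize each feature (column) over the batch axis to roughly zero mean, unit variance."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[ 5  7]
#  [13 15]]

batch = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
normed = batch_norm(batch)
print(normed.mean(axis=0))  # close to 0 for every feature
```

Pooling halves each spatial dimension, while batch normalization leaves the shape untouched and only rescales the values.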


In [9]:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, BatchNormalization, Activation
from keras.layers import Conv2D, MaxPool2D
model = Sequential()

#now add layers to it

#conv block 1
model.add(Conv2D(first_filters, kernel_size, input_shape = (96, 96, 3)))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(first_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(first_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

#conv block 2
model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

#conv block 3
model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

#a fully connected (also called dense) layer at the end
model.add(Flatten())
model.add(Dense(256, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(dropout_dense))

#finally convert to values of 0 to 1 using the sigmoid activation function
model.add(Dense(1, activation = "sigmoid"))
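
The spatial sizes produced by this stack can be traced by hand. The sketch below assumes the Keras defaults used above (no padding and stride 1 for `Conv2D`, stride equal to the pool size for `MaxPool2D`) and walks a 96px edge through three blocks of three 3x3 convolutions followed by one 2x2 pooling. The helper names are ours.

```python
def conv_out(size, kernel=3):
    """Output edge length of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output edge length of non-overlapping pooling; any remainder is discarded."""
    return size // pool

size = 96
trace = []
for block in range(3):      # three conv blocks
    for _ in range(3):      # three 3x3 convolutions per block
        size = conv_out(size)
        trace.append(size)
    size = pool_out(size)   # one 2x2 max pooling per block
    trace.append(size)

print(trace)              # [94, 92, 90, 45, 43, 41, 39, 19, 17, 15, 13, 6]
print(size * size * 256)  # 9216 values reach the Flatten layer
```

The first three entries agree with the 94, 92 and 90 reported by `model.summary()`.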

### Compiling the Model

The model is compiled with binary cross-entropy as the loss function. The optimizer used in the model is **_Adam_** with a learning rate of **_0.001_**.

In [10]:
model.compile(loss=keras.losses.binary_crossentropy,optimizer=keras.optimizers.Adam(0.001),metrics=['accuracy'])
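
For intuition about the loss, binary cross-entropy averages -[y·log(p) + (1 - y)·log(1 - p)] over the samples, where y is the true label and p the predicted probability. The plain-Python sketch below is a simplified illustration of that formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid log(0).

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

confident = binary_crossentropy([1, 0], [0.9, 0.1])
wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident, 4))  # 0.1054
print(wrong > confident)    # True: confident mistakes are penalized heavily
```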

In [11]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 94, 94, 64)        1792      
_________________________________________________________________
batch_normalization_1 (Batch (None, 94, 94, 64)        256       
_________________________________________________________________
activation_1 (Activation)    (None, 94, 94, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 92, 92, 64)        36864     
_________________________________________________________________
batch_normalization_2 (Batch (None, 92, 92, 64)        256       
_________________________________________________________________
activation_2 (Activation)    (None, 92, 92, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 90, 90, 64)        36864     
__________

In [12]:
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0,write_graph=True, write_images=False)
# train the model, validating on the held-out 20% and logging to TensorBoard
model.fit(X_train, y_train,
          epochs=10,
          validation_data=(X_test, y_test),
          shuffle=True,
          callbacks=[tensorboard])

Train on 176020 samples, validate on 44005 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10

In [13]:
base_test_dir = path + 'test/' #specify test data folder
test_files = glob(os.path.join(base_test_dir,'*.tif')) #find the test file names
submission = pd.DataFrame() #create a dataframe to hold results
file_batch = 5000 #we will predict 5000 images at a time
max_idx = len(test_files) #last index to use
for idx in range(0, max_idx, file_batch): #iterate over test image batches
    print("Indexes: %i - %i"%(idx, idx+file_batch))
    test_df = pd.DataFrame({'path': test_files[idx:idx+file_batch]}) #add the filenames to the dataframe
    test_df['id'] = test_df.path.map(lambda x: x.split('/')[3].split(".")[0]) #add the ids to the dataframe
    test_df['image'] = test_df['path'].map(cv2.imread) #read the batch
    K_test = np.stack(test_df["image"].values) #convert to numpy array
    predictions = model.predict(K_test,verbose = 1) #predict the labels for the test data
    test_df['label'] = predictions #store them in the dataframe
    submission = pd.concat([submission, test_df[["id", "label"]]])

Indexes: 0 - 5000
Indexes: 5000 - 10000
Indexes: 10000 - 15000
Indexes: 15000 - 20000
Indexes: 20000 - 25000
Indexes: 25000 - 30000
Indexes: 30000 - 35000
Indexes: 35000 - 40000
Indexes: 40000 - 45000
Indexes: 45000 - 50000
Indexes: 50000 - 55000
Indexes: 55000 - 60000
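
The batching pattern used above can be isolated into a small generator. In the sketch below the file names are hypothetical stand-ins for the test files; the point is that slicing past the end of a list is safe in Python, so the last batch is simply shorter than the others.

```python
def batches(items, batch_size):
    """Yield (start_index, chunk) pairs; the final chunk may be shorter than batch_size."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

files = ["img_%d.tif" % i for i in range(12)]  # hypothetical file names
summary = []
for start, chunk in batches(files, 5):
    summary.append((start, start + len(chunk), len(chunk)))
    print(start, start + len(chunk), len(chunk))
# 0 5 5
# 5 10 5
# 10 12 2
```

Printing `start + len(chunk)` rather than `start + batch_size` reports the true end of the last batch, which the log above overstates whenever the number of files is not a multiple of the batch size.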


<br>
The code in the cell below is used to run TensorBoard on Kaggle, exposing it through an ngrok tunnel<br>
<br>

In [14]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
LOG_DIR = './logs' # Here you have to put your log directory
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)
get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

--2019-03-19 15:39:29--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 34.206.130.40, 52.55.191.55, 34.232.40.183, ...
Connecting to bin.equinox.io (bin.equinox.io)|34.206.130.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14910739 (14M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2019-03-19 15:39:30 (41.4 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [14910739/14910739]

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   
https://6ad22fd5.ngrok.io


<br>
Exporting the data as CSV<br>
<br>

In [15]:
submission.to_csv("submission.csv", index = False, header = True) #create the submission file