# Intel Cervical Cancer Screening
### April 21, 2017
## Satchel Grant

### Overview
The goal of this notebook is to classify a woman's cervix type into one of three classes from medical imaging data. This classification assists in determining cancer diagnoses and treatments.

### Initial Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
from sklearn.utils import shuffle
import scipy.misc as sci
import random
import time
from PIL import Image

%matplotlib inline

def show_img(img):
    plt.imshow(img)
    plt.show()

### Reading in the Data
The images are stored as jpg files in folders corresponding to their classification. I read in the image file paths so that they can be converted to images later in batches, and I store their classifications in a parallel array.

In [4]:
root_path = './train/'

def read_paths(path, no_labels=False):
    # Walk `path`; each subdirectory name is treated as a class label.
    file_paths = []
    labels = []
    labels_to_nums = dict()
    n_labels = 0   # stays 0 if `path` has no subdirectories
    type_ = None
    for dir_name, subdir_list, file_list in os.walk(path):
        if len(subdir_list) > 0:
            # Sort in place so label numbers are the same on every run.
            subdir_list.sort()
            n_labels = len(subdir_list)
            for i,subdir in enumerate(subdir_list):
                labels_to_nums[subdir] = i
        else:
            type_ = os.path.basename(dir_name)
        for img_file in file_list:
            if '.jpg' in img_file.lower():
                file_paths.append(os.path.join(dir_name,img_file))
                if no_labels: labels.append(img_file)
                else: labels.append(labels_to_nums[type_])
    return file_paths, labels, n_labels
    

image_paths, labels, n_labels = read_paths(root_path)
image_paths, labels = shuffle(image_paths, labels)

print("Number of data samples: " + str(len(image_paths)))
print("Number of Classes: " + str(n_labels))

Number of data samples: 1481
Number of Classes: 3


This is a relatively small number of samples to use for deep learning... Luckily Kaggle provided more samples than just those in the train set. I will read those in as well after initial prototyping is finished.

### Data Augmentation
The following cells add rotations and translations to the dataset. This increases the number of training samples, which helps the model generalize and reduces overfitting to the training set.

In [5]:
def rotate(image, angle, ones, color_range):
    rot_image = sci.imrotate(image, angle).astype(np.float32)
    edge_filler = np.random.random(rot_image.shape).astype(np.float32)*color_range
    rot_image[ones[:,:,:]!=1] = edge_filler[ones[:,:,:]!=1]
    return rot_image

def translate(image, row_shift, col_shift, color_range):
    # Start from random noise so the border uncovered by the shift is noise-filled.
    trans_image = np.random.random(image.shape).astype(np.float32)*color_range
    if row_shift > 0:
        if col_shift > 0:
            trans_image[row_shift:,col_shift:] = image[:-row_shift,:-col_shift]
        elif col_shift < 0:
            trans_image[row_shift:,:col_shift] = image[:-row_shift,-col_shift:]
        else:
            trans_image[row_shift:,:] = image[:-row_shift,:]
    elif row_shift < 0:
        if col_shift > 0:
            trans_image[:row_shift,col_shift:] = image[-row_shift:,:-col_shift]
        elif col_shift < 0:
            trans_image[:row_shift,:col_shift] = image[-row_shift:,-col_shift:]
        else:
            trans_image[:row_shift,:] = image[-row_shift:,:]
    else:
        if col_shift > 0:
            trans_image[:,col_shift:] = image[:,:-col_shift]
        elif col_shift < 0:
            trans_image[:,:col_shift] = image[:,-col_shift:]
        else:
            # No shift requested in either direction.
            return image.copy()
    return trans_image

def add_augmentations(paths, rot_angles=(10,-10), row_shift=15, col_shift=15):
    for path in paths:
        img = mpimg.imread(path)
        for angle in rot_angles:
            # Build the edge mask per image, since the images differ in size.
            ones = sci.imrotate(np.ones_like(img), angle)
            add_augmentation(img,path,angle,row_shift,col_shift,ones)

def add_augmentation(img,path,angle,row_shift,col_shift,ones):
    color_range = 255
    new_img = rotate(img,angle,ones, color_range)
    new_img = translate(new_img,random.randint(-row_shift,row_shift),random.randint(-col_shift,col_shift), color_range)
    new_img = new_img.astype(np.uint8)
    split_path = path.split('/')
    i = 1
    if angle < 0: i = 2
    split_path[-1] = 'augmented_'+ str(i)+"_"+ split_path[-1]
    new_path = '/'.join(split_path)
    jpeg = Image.fromarray(new_img)
    jpeg.save(new_path)


def one_hot_encode(labels, n_classes):
    one_hots = []
    for label in labels:
        one_hot = [0]*n_classes
        one_hot[label] = 1
        one_hots.append(one_hot)
    return np.array(one_hots,dtype=np.float32)
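
The rotation and translation helpers above rely on `scipy.misc.imrotate`, which was removed from later SciPy releases. The following is a minimal sketch of the same idea (rotate, shift, and fill the uncovered border with random noise) using `scipy.ndimage.rotate` and `scipy.ndimage.shift` as substitutes. The function `augment` and its `rng` parameter are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import numpy as np
from scipy import ndimage

def augment(image, angle, row_shift, col_shift, rng):
    """Rotate then shift an (rows, cols, channels) uint8 image.

    Pixels uncovered by the transform are filled with random noise.
    `rng` is a numpy Generator (hypothetical parameter for this sketch).
    """
    img = image.astype(np.float32)
    # A mask of ones goes through the same transforms; wherever it ends
    # up 0, the output pixel was not covered by the original image.
    mask = np.ones(image.shape, dtype=np.float32)

    rot = ndimage.rotate(img, angle, reshape=False, order=1, cval=0.0)
    rot_mask = ndimage.rotate(mask, angle, reshape=False, order=0, cval=0.0)

    # Shift rows and columns only; the channel axis is left alone.
    offsets = (row_shift, col_shift, 0)
    out = ndimage.shift(rot, offsets, order=0, cval=0.0)
    out_mask = ndimage.shift(rot_mask, offsets, order=0, cval=0.0)

    noise = rng.random(image.shape).astype(np.float32) * 255
    filled = np.where(out_mask > 0.5, out, noise)
    return np.clip(np.rint(filled), 0, 255).astype(np.uint8)
```

A call might look like `augment(img, 10, 5, -5, np.random.default_rng(0))`. Passing the generator in explicitly keeps the noise reproducible, which is a design choice of this sketch rather than something the notebook does.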


### Split into Training and Validation Sets
It is important to set aside images for validation. This is how you can determine if your model is overfitting or underfitting during training.

Since I am completing this notebook over the course of multiple days, I save the training and validation paths, along with their classifications, into separate csv files. This is essentially a checkpoint step: the same split can be reloaded in later sessions, which makes it easy to repeatedly save and restore the weights of the model later in the process.

In [7]:
training_percentage = .75
total_samples = len(image_paths)
split_index = int(training_percentage*total_samples)

X_train_paths, y_train = image_paths[:split_index], labels[:split_index]
X_valid_paths, y_valid = image_paths[split_index:], labels[split_index:]

In [8]:
print("Number of Training Samples: " + str(len(y_train)))
print("Number of Validation Samples: " + str(len(y_valid)))

Number of Training Samples: 1110
Number of Validation Samples: 371
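
The sample counts above follow from truncating `0.75 * 1481 = 1110.75` to 1110, which leaves 371 samples for validation. As a small sketch, the same split can be wrapped in a helper. `train_valid_split` is a hypothetical name introduced here, and it assumes the paths and labels were already shuffled together.

```python
def train_valid_split(paths, labels, training_percentage=0.75):
    """Split parallel lists at int(training_percentage * len(paths))."""
    split_index = int(training_percentage * len(paths))
    train = (paths[:split_index], labels[:split_index])
    valid = (paths[split_index:], labels[split_index:])
    return train, valid
```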


In [9]:
def save_paths(file_name, paths, labels):
    with open(file_name, 'w') as csv_file:
        for path,label in zip(paths,labels):
            csv_file.write(path + ',' + str(label) + '\n')

save_paths('train_set.csv', X_train_paths, y_train)
save_paths('valid_set.csv', X_valid_paths, y_valid)
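
To restore a saved split in a later session, the csv files can be read back into parallel lists. The following is a minimal sketch. `load_paths` is a hypothetical helper that is not part of the original notebook, and it assumes the `path,label` row format written by `save_paths` with integer labels.

```python
def load_paths(file_name):
    """Read rows of `path,label` back into parallel lists."""
    paths, labels = [], []
    with open(file_name, 'r') as csv_file:
        for line in csv_file:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            # Split on the last comma so commas inside a path are preserved.
            path, label = line.rsplit(',', 1)
            paths.append(path)
            labels.append(int(label))
    return paths, labels
```

For example, `X_train_paths, y_train = load_paths('train_set.csv')` would recover the training split written above.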

### Generator and Image Reader
To conserve memory, the images for both testing and training can be read in batches. This increases the number of images that can be trained on in a single epoch, which helps the model generalize. In most cases, more training data is better for deep learning.

In [21]:
def convert_images(paths,resize_dims=None):
    images = []
    for path in paths:
        img = mpimg.imread(path).astype(np.float32)
        print(img.shape)
        if resize_dims:
            img = sci.imresize(img, resize_dims)
        images.append(img)
    return np.array(images, dtype=np.float32)

def image_generator(paths, labels, batch_size, resize_dims=None):
    while True:
        for batch in range(0,len(paths),batch_size):
            # Pass resize_dims through so every image in a batch can share one shape.
            image_batch = convert_images(paths[batch:batch+batch_size], resize_dims)
            label_batch = labels[batch:batch+batch_size]
            yield image_batch, label_batch

Notebook on pause to learn about recurrent neural networks.

The cervical images are of different sizes, and I'm currently unsure what image size to use for resizing. RNNs can be used to find specific objects within a picture. Potentially I could run an RNN to find the appropriate cropping dimensions for each image and then resize appropriately.

In [22]:
image_gen = image_generator(image_paths[:10], labels[:10], 2)
for i in range(5):
    # Use a separate name so the global `labels` list is not overwritten.
    imgs, batch_labels = next(image_gen)
    show_img(imgs[0])
    print(batch_labels[0])

(3264, 2448, 3)
(4128, 3096, 3)


ValueError: setting an array element with a sequence.
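
The error is consistent with the two printed shapes: `np.array` cannot stack a `(3264, 2448, 3)` image and a `(4128, 3096, 3)` image into one `float32` array, so each batch needs a common image size first. The following is a minimal NumPy-only sketch of that step using nearest-neighbour sampling. `resize_nearest` and `stack_batch` are hypothetical names introduced for illustration. In practice a library resize such as PIL's `Image.resize` would likely give better image quality.

```python
import numpy as np

def resize_nearest(img, out_rows, out_cols):
    """Nearest-neighbour resize of an (rows, cols, channels) array."""
    # Map each output row/column index back to a source index.
    rows = np.arange(out_rows) * img.shape[0] // out_rows
    cols = np.arange(out_cols) * img.shape[1] // out_cols
    return img[rows[:, None], cols]

def stack_batch(images, out_rows, out_cols):
    """Resize every image to one shape, then stack into a float32 batch."""
    resized = [resize_nearest(img, out_rows, out_cols) for img in images]
    return np.array(resized, dtype=np.float32)
```

With every image resized to the same `(out_rows, out_cols)`, the batch has shape `(batch_size, out_rows, out_cols, channels)` and can be built without the error.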