# CNN with Brain Cell Images

Nicholas Larsen
Steven Larsen

This data came from real world microscopic images.  Each image is a blood smear from a patient that was then placed on a slide for imaging.  This data was collected with the intention of classifing Acute Lymphoblastic Leukemia (ALL).  This can be a difficult task in, due to the differences between healthy and cells with leukemia being extremely small.  Each image from the data set was analyzed by an expert oncologist.  


# Load images, show a few examples

In [44]:
from PIL import Image
from os import listdir
import numpy as np
from matplotlib import pyplot as plt
from skimage.feature import daisy
from sklearn.metrics.pairwise import pairwise_distances
from skimage.io import imshow
from ipywidgets import widgets  # make this interactive!
from ipywidgets import fixed
import copy
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import make_scorer, accuracy_score,precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

In [7]:
def gray_sacle(data):
    return np.dot(data[...,:3], [0.299, 0.587, 0.114])

def read_images(directories, grey_scale=False, verb = False):
    """Reads in the all and rem directoires under each directory in the list directories"""
    X = []
    y = []
    for direct in dirs:
        if verb:
            print(f"Reading {direct}")
        direct_all = f"{direct}\\all"
        for file in listdir(direct_all):
            if verb:
                print(f"Reading file: {file}")
            image = Image.open(f"{direct_all}\\{file}")
            data = np.asarray(image)
            if grey_scale:
                data = gray_sacle(data)
            #data = data.ravel()
            X.append(data)
            y.append(1)
                
        direct_rem = f"{direct}\\hem"
        for file in listdir(direct_rem):
            if verb:
                print(f"Reading file: {file}")
            image = Image.open(f"{direct_rem}\\{file}")
            data = np.asarray(image)
            if grey_scale:
                data = gray_sacle(data)
            #data = data.ravel()
            X.append(data)
            y.append(0)
                
    return np.asarray(X), np.asarray(y)

def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

In [9]:
%%time
dirs = [
#    r'..\archive\C-NMC_Leukemia\training_data\fold_0',
#    r'..\archive\C-NMC_Leukemia\training_data\fold_1',
#    r'..\archive\C-NMC_Leukemia\training_data\fold_2'
    r'..\archive\C-NMC_Leukemia\training_data\fold_small'
]
X, y = read_images(dirs, verb=False)
print(X.shape)

(522, 450, 450, 3)
Wall time: 9.88 s


# Preparation

## Explain Metrics

We are very interested int the recall score.  Simply put this is because our data is operating in the medical field.  Our algorithm giving the OK to a patient that does have the disease we are trying to predict would be a very bad outcome.  We are still interested in the accuracy in general, since our algorithm will likely be used supplementary to an expert's opinion.  If our recall score is high enough we will be able to reduce the number of images doctors will have to sift through.

## Define splitting Techniques (why is this realistic in practice)

Our data is going to be split into training and testing (80 / 20).  On the 80 we will perform stratified K folds.  This will just be used to compare the different models produced in this lab.

In [13]:
#Split the data
X_train, X_test, y_train, y_test =\
    train_test_split(X, y, test_size=0.2, stratify=y)
X_train.shape

(417, 450, 450, 3)

In [25]:
X_train_mlp = X_train.reshape(X_train.shape[0],X_train.shape[1]*X_train.shape[2]*X_train.shape[3])
X_train_mlp.shape
X_test_mlp = X_test.reshape(X_test.shape[0],X_test.shape[1]*X_test.shape[2]*X_test.shape[3])

In [33]:
%%time
#Standard MLP for comparison
#from sklearn import veresion as sklearn_version

#print(sklearn_version)
# these values have been hand tuned
def MLP_create():
    
    
    clf = MLPClassifier(hidden_layer_sizes=(50, 25, 12), 
                        activation='relu', # compare to sigmoid
                        solver='adam', 
                        alpha=1e-4, # L2 penalty
                        batch_size=128, # min of 200, num_samples
                        learning_rate='adaptive', # decrease rate if loss goes up
                        #learning_rate_init=0.1, # only SGD
                        #power_t=0.5,    # only SGD with inverse scaling
                        max_iter=20, 
                        shuffle=True, 
                        random_state=1, 
                        tol=1e-9, # for stopping
                        verbose=False, 
                        warm_start=False, 
                        #momentum=0.9, # only SGD
                        #nesterovs_momentum=True, # only SGD
                        early_stopping=False, 
                        validation_fraction=0.1, # only if early_stop is true
                        beta_1=0.9, # adam decay rate of moment
                        beta_2=0.999, # adam decay rate of moment
                        epsilon=1e-08) # adam numerical stabilizer
    return clf
clf = MLP_create()
clf.fit(X_train_mlp,y_train)
yhat = clf.predict(X_test_mlp)
print('Validation recall:',recall_score(yhat,y_test))
print('Validation Acc:',accuracy_score(yhat,y_test))



Validation recall: 0.2962962962962963
Validation Acc: 0.7428571428571429
Wall time: 3min 4s


In [48]:
#https://medium.com/@literallywords/stratified-k-fold-with-keras-e57c487b1416
def stratifiedKFoldRuns(K=2, scorer=recall_score,model_create=MLP_create):
    cv = StratifiedKFold(n_splits=K, shuffle=True)
    my_scorer = make_scorer(scorer)

    scores = [] 
    for index, (train_indices, val_indices) in enumerate(cv.split(X_train, y_train)):
        print(f"Training on fold {index+1}/{K}...")

        _X_train = X_train_mlp[train_indices]
        _X_test = X_train_mlp[val_indices]
        _y_train = y_train[train_indices]
        _y_test = y_train[val_indices]
        
        
        
        model = model_create()
        model.fit(_X_train,_y_train)
        yhat = clf.predict(_X_test)
        score = scorer(yhat,_y_test)
    
        scores.append(score)
        print("Last training score: ",score)
    return scores
stratifiedKFoldRuns()

Training on fold 1/2...
Last training score:  0.4878048780487805
Training on fold 2/2...




Last training score:  0.3953488372093023


[0.4878048780487805, 0.3953488372093023]

____

# Modeling

## Set up Data Expansion in Keras. 
### Options from town hall
* Data augmentation he showed an example of. Tends to be slow
* Go through and a couple of passes of expansion'
* Use expansion for a couple of epic at the end
### Reasoning


## Create Convolutional Neural Network using Keras. 
* Investigate different parameters on at least two different network architectures
* Architectural Differences
 * Number of layers
 * Whether or not using residual paths
 * Seperable convolutions
 
Need a total of 4 models

In [5]:
# Code here

## Visualize the final Results
* Visualize
* Compare statistically
* Compare the performance to a standard ML_P using the receiver operating characteristic and the area under the curve
This includes:
* Which one is the best
* Which one you should choose
* How might you deploy it
* All of the things you might be interested in

In [6]:
# Code here

# Use transfer learning to pre-train weights of your initial layers of CNN
* Compare to best other model
* There is an exmaple in his notebook. Use Img Net weights, VGG. Compare from scratch from above

In [None]:
# Code here