# Tissue Classification using Neural Networks
In this lab we will explore the use of neural networks to classify tissue images by their texture. The dataset we will be using is available here: http://dx.doi.org/10.5281/zenodo.53169.

![alt text](https://www.researchgate.net/profile/Jakob_Kather/publication/303998214/figure/fig7/AS:391073710002224@1470250646407/Representative-images-from-our-dataset-Here-the-first-10-images-of-every-tissue-class.png)

The above figure shows the 8 different classes of tissue we will be trying to identify. 

In [0]:
# Imports
from __future__ import print_function
import os
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import MaxPool2D
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical
import tensorflow as tf

## Step 1
* Load the data (done for you)
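 * The "data" variable stores 5000 images. The source images are 150x150 RGB, but the loading code below keeps only the top-left 64x64 crop of the first color channel. This means data has shape (5000, 64, 64), and each image is treated as a single-channel (grayscale) image.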
 * The "labels" variable stores 5000 labels (0-7). This means "labels" has shape (5000,)
* Split data into training and testing subsets (left up to you)
 * Check out the sklearn function train_test_split from sklearn.model_selection
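
After splitting, it can be worth checking that every class is still represented in similar proportions in both subsets. The sketch below uses only NumPy and a hypothetical, perfectly balanced label array (`demo_labels`) in place of the lab's `labels`; the helper name `class_fractions` is also an assumption made for illustration. If the proportions drift apart on your real split, `train_test_split` accepts a `stratify` argument that keeps them aligned.

```python
import numpy as np

def class_fractions(labels, num_classes):
    """Return the fraction of samples that belong to each class index."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

# Hypothetical integer labels standing in for the lab's `labels` array.
rng = np.random.default_rng(0)
demo_labels = np.repeat(np.arange(8), 625)   # 8 classes x 625 samples = 5000
rng.shuffle(demo_labels)

# A plain 80/20 split after shuffling, used here only for illustration.
split = int(0.8 * len(demo_labels))
train_labels, test_labels = demo_labels[:split], demo_labels[split:]

print(class_fractions(train_labels, 8))
print(class_fractions(test_labels, 8))
```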

In [2]:
! git clone https://github.com/BeaverWorksMedlytics/Week3_public.git

# Build the path to the data folder. No need to change directories
# The cells below load a total of 5 files (rgb01.npz through rgb05.npz)
data_dir = os.path.join( os.getcwd(), 'Week3_public', 'data', 'crc')

fatal: destination path 'Week3_public' already exists and is not an empty directory.


In [3]:
# Load data and split into training, testing sets
y = np.load(os.path.join(data_dir, 'rgb01.npz'))
labels = y['labels']
data = y['rgb_data']
data = data[:,0:64,0:64,0]
label_str = y['label_str']
label_str = label_str.tolist() # this is to convert label_str back to a dictionary
y = []

print(data.shape)
for ii in range(2,6):
    filename = os.path.join(data_dir, 'rgb0' + str(ii) + '.npz')
    print('loading ', filename)
    y = np.load(filename)
    labels = np.append(labels, y['labels'], axis=0)
    data = np.append(data, y['rgb_data'][:,0:64,0:64,0], axis=0)
    print(data.shape)
    y = []


print( data.shape )
print( labels.shape )

(1000, 64, 64)
loading  /content/Week3_public/data/crc/rgb02.npz
(2000, 64, 64)
loading  /content/Week3_public/data/crc/rgb03.npz
(3000, 64, 64)
loading  /content/Week3_public/data/crc/rgb04.npz
(4000, 64, 64)
loading  /content/Week3_public/data/crc/rgb05.npz
(5000, 64, 64)
(5000, 64, 64)
(5000,)


In [0]:
num_images, nrows, ncols = data.shape

# convert the labels from 1-D arrays to categorical (one-hot) type
labels = to_categorical(labels)

# split into training and testing sets (80% training, 20% testing)
train_data, test_data, y_train, y_test = train_test_split(data, labels, test_size=0.2)
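
To make the categorical conversion concrete, here is a minimal NumPy sketch of one-hot encoding and of how `np.argmax` recovers the original integer labels. The function `one_hot` and the `demo` labels are hypothetical stand-ins written for illustration; they are not the Keras implementation of `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as rows of a one-hot matrix."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

demo = np.array([0, 3, 7, 3])            # hypothetical integer labels
encoded = one_hot(demo, num_classes=8)   # shape (4, 8), one 1.0 per row
recovered = np.argmax(encoded, axis=1)   # back to integer labels

print(encoded.shape)
print(recovered)
```

The same `argmax` step is what turns one-hot test labels, or predicted class probabilities, back into class indices.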


## Normalize data
All images should be normalized to the range 0-1 by dividing by 255.

#### Note
* Using the L\*a\*b\* colorspace : If you convert your images to the L\*a\*b\* colorspace, the scaling factor will change. Each channel in this colorspace has a different range, so normalization involves scaling each channel separately. Additionally, the a\* and b\* channels can take negative values. This also needs to be taken into account.
* Using the HSV/HSI colorspace : Similar considerations apply if you are using the HSV/HSI colorspace. The only difference is that the HSV/HSI colorspace will have all positive values.
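
As a rough sketch of per-channel scaling, the helper below maps each L\*a\*b\* channel to 0-1 independently. The channel ranges in `LAB_RANGES` are assumed nominal values (L\* in 0 to 100, a\* and b\* in -128 to 127), and the names `LAB_RANGES` and `normalize_lab` are made up for this example. Check the ranges your conversion library actually produces before relying on them.

```python
import numpy as np

# Assumed nominal (min, max) per channel; verify against your converter.
LAB_RANGES = ((0.0, 100.0), (-128.0, 127.0), (-128.0, 127.0))

def normalize_lab(lab_images, ranges=LAB_RANGES):
    """Scale each channel of (..., 3) L*a*b* data to 0-1 independently."""
    lab_images = np.asarray(lab_images, dtype="float64")
    out = np.empty_like(lab_images)
    for channel, (low, high) in enumerate(ranges):
        # Shift by the channel minimum so negative a*/b* values map to >= 0.
        out[..., channel] = (lab_images[..., channel] - low) / (high - low)
    return out

demo = np.array([[[50.0, -128.0, 127.0]]])   # one hypothetical pixel
normalized = normalize_lab(demo)
print(normalized)
```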

In [5]:
# Assuming we are using the RGB colorspace
# Normalize all images so that they are 0-1
num_images, nrows, ncols = train_data.shape
train_data = train_data.astype('float')/255
train_data = train_data.reshape(num_images,nrows*ncols)

num_images2, nrows2, ncols2 = test_data.shape
test_data = test_data.astype('float')/255
test_data = test_data.reshape(num_images2,nrows2*ncols2)

print(train_data.shape)


(4000, 4096)


## Step 2
At this point, the data has been split into training and testing sets and normalized. We will now design a convolutional neural network (CNN) for texture classification.


![alt text](http://adventuresinmachinelearning.com/wp-content/uploads/2017/04/CNN-example-block-diagram.jpg)


( Image from http://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/ )

When designing a convolutional network for classification, we have several decisions to make.

**Network Architecture**
* How many layers will our network have ?
* How many convolutional filters per layer ?
    * What is an appropriate filter size ? 
* What is an appropriate batch size, learning rate and number of training epochs ?
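
When choosing layers and filter sizes, it helps to track how the spatial size of the feature maps changes from layer to layer. The sketch below applies the usual output-size rules for `'same'` and `'valid'` padding along one axis. The helper name `conv_output_size` is an assumption for illustration, and the pooling stride of 2 reflects the Keras default of using the pool size as the stride.

```python
import math

def conv_output_size(size, kernel, stride=1, padding="same"):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if padding == "same":
        # 'same' pads the input so the output depends only on the stride.
        return math.ceil(size / stride)
    if padding == "valid":
        # 'valid' uses no padding, so the kernel must fit inside the input.
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

size = 64
size = conv_output_size(size, kernel=3, stride=1, padding="same")    # 3x3 conv
size = conv_output_size(size, kernel=2, stride=2, padding="valid")   # 2x2 pool
print(size)
```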

**Data input**
* Do we use the raw data ?
    * RGB or just gray channel ?
* Does the use of different colorspaces lead to better results for a given network architecture ?
* Can we use any of the texture features from the previous lab as inputs to this model ?
* How does data augmentation affect the results ? 
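
One simple form of data augmentation is random flips and quarter-turn rotations, which may suit tissue patches because they arguably have no canonical orientation. The following is a minimal NumPy sketch under that assumption; `augment_batch` and the random `batch` are hypothetical stand-ins, and the function assumes square images so that rotation preserves the shape.

```python
import numpy as np

def augment_batch(images, rng):
    """Randomly flip and rotate each square (H, W, C) image in a batch."""
    out = np.empty_like(images)
    for i, img in enumerate(images):
        if rng.random() < 0.5:
            img = np.flip(img, axis=1)           # horizontal flip
        k = int(rng.integers(0, 4))              # 0 to 3 quarter turns
        out[i] = np.rot90(img, k=k, axes=(0, 1))
    return out

rng = np.random.default_rng(0)
batch = rng.random((4, 64, 64, 1))   # hypothetical stand-in for train_data
augmented = augment_batch(batch, rng)
print(augmented.shape)
```

Flips and rotations only rearrange pixels, so the intensity values and the labels stay valid.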

Other considerations that we will not be exploring :
* What is the trade-off between input data sizes and batch size ?
* Is the GPU always the appropriate platform for training ?
* How does hardware influence inputs and batch sizes for a given desired accuracy ?

In [0]:
# Define the data shapes based on your decision to use rgb or grayscale or other colorspaces or texture features or 
# some combination of these inputs
num_classes = 8 
input_shape = nrows, ncols, 1
train_data = train_data.reshape(train_data.shape[0], nrows, ncols, 1)
test_data = test_data.reshape(test_data.shape[0], nrows, ncols, 1)

## Step 3
Design your network here using Keras

In [7]:
# Create your network
model = Sequential()

# Add convolutional layers. Only the first layer needs input_shape.
# See keras.io for Conv2D, MaxPool2D, Dropout documentation
model.add(Conv2D(32, kernel_size=3, strides=1, padding='same',
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(32, kernel_size=3, strides=1, padding='same',
                 activation='relu'))

# Downsample by 2 in each spatial dimension
model.add(MaxPool2D(pool_size=(2, 2), padding='valid'))

# Randomly drop 25% of activations during training to reduce overfitting
model.add(Dropout(0.25))

model.add(Conv2D(32, kernel_size=3, strides=1, padding='same',
                 activation='relu'))

model.add(MaxPool2D(pool_size=(2, 2), padding='valid'))
model.add(Flatten())

# Add final output layer - This should have as many neurons as the number
# of classes we are trying to identify
model.add(Dense(num_classes, activation='softmax'))

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 64, 64, 32)        320       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 64, 64, 32)        9248      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 32, 32, 32)        0         
_________________________________________________________________
dropout (Dropout)            (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 32, 32, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 16, 16, 32)        0         
_________________________________________________________________
flatten (Flatten)            (None, 8192)              0         
__________
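
The parameter counts in a summary like this can be checked by hand: a convolutional layer has one weight per kernel element per input channel, plus one bias, for every filter, and a dense layer has one weight per input plus one bias for every unit. The sketch below is a plain-Python check with hypothetical helper names (`conv2d_params`, `dense_params`). The dense-layer figure is computed from the network definition, since that row does not appear in the output above.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus biases for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus biases for a fully connected layer."""
    return (inputs + 1) * units

first_conv = conv2d_params(kernel=3, in_channels=1, filters=32)    # 320
later_conv = conv2d_params(kernel=3, in_channels=32, filters=32)   # 9248
output_layer = dense_params(inputs=16 * 16 * 32, units=8)          # 65544

print(first_conv, later_conv, output_layer)
```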

## Step 4
Compile the model you designed. Compiling a Keras model configures it for training by setting the loss function, the optimizer, and the metrics to report.

In [0]:
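# tf.train.RMSPropOptimizer only exists in TensorFlow 1.x; the Keras optimizer is used here instead
model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001), metrics=['accuracy'])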
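
To give a feel for what the `categorical_crossentropy` loss measures, here is a small NumPy sketch of the same quantity. This is an illustrative reimplementation rather than the Keras code, and the function name and `eps` clipping value are assumptions. It shows that a model guessing uniformly over 8 classes scores ln(8), roughly 2.08, which is a useful baseline when reading the training log.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

num_classes = 8
y_true = np.eye(num_classes)[[0, 3]]                      # two one-hot targets
uniform = np.full((2, num_classes), 1.0 / num_classes)    # uniform guess
baseline_loss = categorical_crossentropy(y_true, uniform)
print(baseline_loss)
```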

In [10]:
print(train_data.shape)
print(test_data.shape)

(4000, 64, 64, 1)
(1000, 64, 64, 1)


## Step 5
Train model

In [0]:
y = model.fit(train_data, y_train, epochs = 100, validation_data = (test_data,y_test))

Train on 4000 samples, validate on 1000 samples
Epoch 1/100
Epoch 2/100
 416/4000 [==>...........................] - ETA: 49s - loss: 1.4981 - acc: 0.3365

## Step 6
See how your model performs by using it for inference.
* What is the accuracy of classification ?
* Change your model, re-compile and test. Can you improve the accuracy of the model ?


In [0]:
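# predict labels - use the test set for prediction
# model.predict returns class probabilities; argmax picks the most likely class
pred_labels = np.argmax(model.predict(test_data), axis=1)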

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# y_test is one-hot encoded, so we convert it back into a vector of class
# indices in order to calculate the confusion matrix and accuracy
true_labels = np.argmax(y_test, axis=1)
mat = confusion_matrix(true_labels, pred_labels)
acc = accuracy_score(true_labels, pred_labels)
print(acc)
print(mat)
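
Overall accuracy can hide classes that the model handles poorly. As a minimal sketch, the helper below derives per-class recall from a confusion matrix whose rows are true labels and whose columns are predicted labels, which is the layout `confusion_matrix` returns. The name `per_class_recall` and the 3-class `demo_mat` are hypothetical and used only for illustration.

```python
import numpy as np

def per_class_recall(conf_mat):
    """Fraction of each true class (row) that was predicted correctly."""
    conf_mat = np.asarray(conf_mat, dtype="float64")
    row_totals = conf_mat.sum(axis=1)
    # Classes with no true samples get recall 0 instead of a division error.
    return np.divide(np.diag(conf_mat), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)

# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
demo_mat = np.array([[8, 2, 0],
                     [1, 9, 0],
                     [0, 5, 5]])

recall = per_class_recall(demo_mat)
overall_accuracy = np.trace(demo_mat) / demo_mat.sum()
print(recall)
print(overall_accuracy)
```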

In [0]:
plt.figure(figsize=(8,6))
plt.imshow(mat, cmap='hot', interpolation='nearest')
plt.grid(False)
plt.colorbar()
# confusion_matrix puts true labels on rows (y-axis) and predictions on columns (x-axis)
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()

## Assignment
* In Step 3 design your own network
* Does the model perform better if you use all three RGB channels ?
* How does the performance change when using the L\*a\*b\* colorspace ?


In [0]:

# Load data as RGB
y = np.load(os.path.join(data_dir, 'rgb01.npz'))
labels = y['labels']
data_rgb = y['rgb_data']
label_str = y['label_str']
label_str = label_str.tolist() # this is to convert label_str back to a dictionary
y = []

print(data_rgb.shape)
for ii in range(2,6):
    filename = os.path.join(data_dir, 'rgb0' + str(ii) + '.npz')
    print('loading ', filename)
    y = np.load(filename)
    labels = np.append(labels, y['labels'], axis=0)
    # axis=0 stacks along the image dimension instead of flattening the arrays
    data_rgb = np.append(data_rgb, y['rgb_data'], axis=0)
    print(data_rgb.shape)
    y = []

data_rgb = data_rgb.astype('float')

print( data_rgb.shape )
print( labels.shape )

num_images, nrows, ncols, dims = data_rgb.shape