# Tissue Classification using Neural Networks
In this lab we will use a fully connected neural network to classify tissue images by their texture. The dataset we will be using is available here: http://dx.doi.org/10.5281/zenodo.53169.

![alt text](https://www.researchgate.net/profile/Jakob_Kather/publication/303998214/figure/fig7/AS:391073710002224@1470250646407/Representative-images-from-our-dataset-Here-the-first-10-images-of-every-tissue-class.png)

The above figure shows the 8 different classes of tissue we will be trying to identify. 

In [0]:
# Imports
from __future__ import print_function
import os
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical

## Step 1
* Load the data (done for you)
 * The "data" variable stores 2500 RGB images of shape 150x150x3 (the first 500 images from each of 5 files). This means data has shape (2500, 150, 150, 3).
 * The "labels" variable stores 2500 labels (0-7). This means "labels" has shape (2500,)
* Split data into training and testing subsets (left up to you)
 * Check out the sklearn function train_test_split from sklearn.model_selection
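
As a minimal sketch of the split, the example below runs `train_test_split` on small synthetic arrays. The names `demo_data` and `demo_labels`, the array sizes, and the choice of `stratify` and `random_state` are assumptions made for illustration, not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real arrays (hypothetical, kept tiny for speed)
rng = np.random.default_rng(0)
demo_data = rng.integers(0, 256, size=(80, 4, 4, 3), dtype=np.uint8)
demo_labels = np.repeat(np.arange(8), 10)  # 10 images for each of the 8 classes

# stratify keeps the class proportions the same in both subsets;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_data, demo_labels, test_size=0.2, stratify=demo_labels, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te, minlength=8))  # number of test images per class
```

Stratifying is optional, but with only a few hundred images per class it helps keep every tissue class represented in the test set.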

In [83]:
! git clone https://github.com/BeaverWorksMedlytics/Week3_public.git

# Build the path to the data folder. No need to change directories
# There are a total of 6 files you will have to load
data_dir = os.path.join( os.getcwd(), 'Week3_public', 'data', 'crc')

fatal: destination path 'Week3_public' already exists and is not an empty directory.


In [84]:
# Load data and split into training, testing sets
y = np.load(os.path.join(data_dir, 'rgb01.npz'))
labels = y['labels'][0:500]
data = y['rgb_data']
data = data[0:500,:,:,]
label_str = y['label_str']
label_str = label_str.tolist() # this is to convert label_str back to a dictionary
y = []

print(data.shape)
for ii in range(2,6):
    filename = os.path.join(data_dir, 'rgb0' + str(ii) + '.npz')
    print('loading ', filename)
    y = np.load(filename)
    labels = np.append(labels, y['labels'][0:500], axis=0)
    data = np.append(data, y['rgb_data'][0:500,:,:,], axis=0)
    print(data.shape)
    y = []


print( data.shape )
print( labels.shape )

(500, 150, 150, 3)
loading  /content/Week3_public/data/crc/rgb02.npz
(1000, 150, 150, 3)
loading  /content/Week3_public/data/crc/rgb03.npz
(1500, 150, 150, 3)
loading  /content/Week3_public/data/crc/rgb04.npz
(2000, 150, 150, 3)
loading  /content/Week3_public/data/crc/rgb05.npz
(2500, 150, 150, 3)
(2500, 150, 150, 3)
(2500,)


In [0]:
num_images, nrows, ncols, ncolors = data.shape

# Convert the integer labels (0-7) to one-hot vectors. The encoding is applied
# to each label independently, so it gives the same values before or after the split.
labels = to_categorical(labels, num_classes=8)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
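
To make the one-hot conversion concrete, here is a small NumPy-only sketch of the same encoding and of the way back to integer labels. The label values in `demo_labels` are made up, and `np.eye` indexing is used here only as an illustrative equivalent of the conversion step.

```python
import numpy as np

demo_labels = np.array([0, 3, 7, 3])  # made-up integer class ids in 0-7

# Row i of the 8x8 identity matrix is the one-hot vector for class i
one_hot = np.eye(8, dtype=np.float32)[demo_labels]
print(one_hot.shape)

# argmax along the class axis recovers the integer labels
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

The `argmax` step is the same conversion needed later when computing the confusion matrix.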


In [0]:
print(X_train)

## Normalize and Reshape Data
All images should be normalized to the range 0-1 by dividing by 255.

Additionally, because this is an ANN, not a CNN, we need to reshape the data to be one-dimensional. In the training and test data, collapse the row, column, and color dimensions of each image into one dimension using reshape().
#### Note
* Using the La\*b colorspace : If you convert your images to the La\*b colorspace, the scaling factor will change. Each channel in this colorspace will have a different range and normalization of each space will involve scaling each channel separately. Additionally, the a\* channel can have a negative range. This also needs to be taken into account. 
* Using the HSV/HSI colorspace : Similar considerations apply if you are using the HSV/HSI colorspace. The only difference is that the HSV/HSI colorspace will have all positive values.
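
The per-channel scaling described in the note can be sketched as follows. The function name `normalize_lab` and the nominal channel ranges (L in [0, 100], a\* and b\* in [-128, 127]) are assumptions for illustration; the actual ranges depend on the conversion routine used (for example `skimage.color.rgb2lab`, which this notebook does not import).

```python
import numpy as np

def normalize_lab(lab):
    """Scale an (..., 3) Lab array channel by channel into [0, 1].

    Assumed nominal ranges: L in [0, 100], a* and b* in [-128, 127].
    """
    lab = np.asarray(lab, dtype=np.float32)
    lo = np.array([0.0, -128.0, -128.0], dtype=np.float32)  # per-channel minimum
    hi = np.array([100.0, 127.0, 127.0], dtype=np.float32)  # per-channel maximum
    # Subtracting lo shifts the negative a*/b* values to start at zero
    return (lab - lo) / (hi - lo)

# One 1x2 "image" with hand-picked Lab values
demo_lab = np.array([[[50.0, -128.0, 127.0], [100.0, 0.0, 0.0]]])
out = normalize_lab(demo_lab)
print(out)
```

Broadcasting applies the three offsets and scales along the last axis, so the same function works for a single image or a whole batch.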

In [86]:
# Assuming we are using the RGB colorspace
# Normalize all images so that they are in the range 0-1
X_train = X_train.astype('float') / 255.
X_test = X_test.astype('float') / 255.

# Reshape the data: flatten each image into a single row.
# reshape returns a new array, so the result must be assigned back.
X_train = X_train.reshape(X_train.shape[0], 150*150*3)
X_test = X_test.reshape(X_test.shape[0], 150*150*3)

# Display the flattened test set
X_test

array([[0.26666667, 0.13333333, 0.32941176, ..., 0.49019608, 0.26666667,
        0.4745098 ],
       [0.50588235, 0.2       , 0.38823529, ..., 0.70980392, 0.34117647,
        0.48627451],
       [0.94901961, 0.87058824, 0.83529412, ..., 0.89803922, 0.72156863,
        0.73333333],
       ...,
       [0.74509804, 0.48627451, 0.6745098 , ..., 0.70196078, 0.44313725,
        0.62352941],
       [0.43529412, 0.17254902, 0.34117647, ..., 0.60784314, 0.38431373,
        0.55294118],
       [0.94509804, 0.94509804, 0.94509804, ..., 0.96862745, 0.95294118,
        0.95686275]])
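
One detail of this step is easy to miss: `np.reshape` returns a new array and leaves its input unchanged. The sketch below shows this on a tiny made-up array (`demo` is hypothetical and stands in for the image data).

```python
import numpy as np

# Two tiny 3x3 "RGB images" with made-up pixel values
demo = np.arange(2 * 3 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3, 3)
scaled = demo.astype("float32") / 255.0  # values now lie in [0, 1]

# The return value is discarded here, so scaled keeps its 4-D shape
np.reshape(scaled, (scaled.shape[0], -1))
print(scaled.shape)

# Keeping the returned array gives one flat row per image
flat = scaled.reshape(scaled.shape[0], -1)
print(flat.shape)
```

Passing `-1` lets NumPy infer the flattened length, which avoids hard-coding the image size.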

## Step 2
At this point, the data has been split into training and testing sets and normalized. We will now design a fully connected neural network for texture classification. 

<img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg" width="50%"></img>

( Image from http://cs231n.github.io/convolutional-networks/ )

When designing a fully connected network for classification, we have several decisions to make.

**Network Architecture**
* How many layers will our network have ?
* How many neurons per layer ?
* What is an appropriate batch size, learning rate and number of training epochs ?

**Data input**
* Do we use the raw data ?
    * RGB or just gray channel ?
* Does the use of different colorspaces lead to better results for a given network architecture ?
* Can we use any of the texture features from the previous lab as inputs to this model ?
* How does data augmentation affect the results ? 
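
As one possible starting point for the augmentation question, the sketch below enlarges a batch with flips and a 90-degree rotation. The helper `augment_flips_rot90` is hypothetical, and it assumes square images (as with the 150x150 tiles here) so that rotated copies keep the same shape.

```python
import numpy as np

def augment_flips_rot90(images, labels):
    """Return the originals plus flipped and rotated copies.

    images: array of shape (n, rows, cols, channels) with rows == cols.
    labels: array whose first axis has length n (integer or one-hot).
    """
    variants = [
        images,
        images[:, :, ::-1, :],               # horizontal flip (reverse columns)
        images[:, ::-1, :, :],               # vertical flip (reverse rows)
        np.rot90(images, k=1, axes=(1, 2)),  # rotate each image by 90 degrees
    ]
    aug_images = np.concatenate(variants, axis=0)
    aug_labels = np.concatenate([labels] * len(variants), axis=0)
    return aug_images, aug_labels

demo_images = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)
demo_labels = np.array([1, 5])
aug_x, aug_y = augment_flips_rot90(demo_images, demo_labels)
print(aug_x.shape, aug_y)
```

Augmentation should be applied to the training set only, after the train/test split, so that transformed copies of test images never leak into training.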

Other considerations that we will not be exploring:
* What is the trade-off between input data sizes and batch size ?
* Is the GPU always the appropriate platform for training ?
* How does hardware influence inputs and batch sizes for a given desired accuracy ?

In [0]:
# Define the data shapes based on your decision to use RGB, grayscale, other colorspaces, texture features, or
# some combination of these inputs
num_classes = 8 
input_shape = nrows*ncols*ncolors

## Step 3
Design your network here using Keras

In [88]:
# Create your network
model = Sequential()

# Add input layer: each sample is a flat vector of length input_shape.
# The number of images (batch dimension) is not part of input_shape.
model.add(Dense(32, activation='relu', input_shape=(input_shape,)))
# Add fully connected layers 
model.add(Dense(256, activation='relu'))
# See Dense : https://keras.io/layers/core/#dense

# Add final output layer - This should have as many neurons as the number
# of classes we are trying to identify
model.add(Dense(num_classes, activation='softmax'))

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 32)                2160032   
_________________________________________________________________
dense_9 (Dense)              (None, 256)               8448      
_________________________________________________________________
dense_10 (Dense)             (None, 8)                 2056      
Total params: 2,170,536
Trainable params: 2,170,536
Non-trainable params: 0
_________________________________________________________________
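
The parameter counts in the summary can be checked by hand: a Dense layer with `n_in` inputs and `n_out` neurons has `n_in * n_out` weights plus `n_out` biases. The helper `dense_param_count` below is a hypothetical name used only for this check.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for each layer in a stack of fully connected layers."""
    return [
        (n_in + 1) * n_out  # +1 accounts for the bias of each neuron
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

# Flattened 150x150x3 input, then layers of 32, 256 and 8 neurons
counts = dense_param_count([150 * 150 * 3, 32, 256, 8])
print(counts, sum(counts))
```

Almost all of the parameters sit in the first layer, because every one of its 32 neurons connects to all 67,500 input values.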


## Step 4
Compile the model you designed. Compiling a Keras model configures it for training by setting the loss function, the optimizer, and any metrics to track.

In [0]:
model.compile(loss='mean_squared_error', optimizer='sgd')
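
The cell above uses mean squared error. For one-hot classification targets, categorical cross-entropy (available in Keras as the loss string `'categorical_crossentropy'`) is a common alternative. The NumPy sketch below compares the two losses on made-up predictions; the helper functions and the example values are illustrative assumptions, not Keras internals.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean over samples of -log(probability assigned to the true class)
    return float(-np.mean(np.sum(y_true * np.log(np.clip(y_prob, eps, 1.0)), axis=1)))

def mean_squared_error(y_true, y_prob):
    return float(np.mean((y_true - y_prob) ** 2))

y_true = np.eye(8)[[2, 5]]                           # two samples, classes 2 and 5
confident_right = softmax(10.0 * y_true)             # both samples predicted correctly
confident_wrong = softmax(10.0 * np.eye(8)[[3, 5]])  # first sample confidently wrong

for name, probs in [("right", confident_right), ("wrong", confident_wrong)]:
    print(name,
          "cross-entropy:", categorical_crossentropy(y_true, probs),
          "mse:", mean_squared_error(y_true, probs))
```

Because the squared error between probabilities is bounded, a confidently wrong prediction costs far less under MSE than under cross-entropy, which is one reason cross-entropy is often preferred for classification.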

## Step 5
Train model

In [92]:
# X_train must be the flattened array of shape (num_images, 150*150*3)
history = model.fit(X_train, y_train, epochs=10)

## Step 6
See how your model performs by using it for inference.
* What is the accuracy of classification ?
* Change your model, re-compile and test. Can you improve the accuracy of the model ?


In [0]:
# predict labels - use the test set for prediction
pred_labels = model.predict(X_test)

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# We need to convert the categorical arrays y_test and pred_labels into vectors
# in order to use them in the calculation of the confusion matrix (i.e. convert from one-hot to integers)
true_classes = np.argmax(y_test, axis=1)
pred_classes = np.argmax(pred_labels, axis=1)
mat = confusion_matrix(true_classes, pred_classes)
acc = accuracy_score(true_classes, pred_classes)
print(acc)
print(mat)
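
Overall accuracy can hide weak classes, so it may help to read per-class results off the confusion matrix. The sketch below uses a made-up three-class example; the label arrays are hypothetical and not taken from the tissue data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up integer labels for a three-class toy problem
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Fraction of each true class that was predicted correctly (recall)
per_class_recall = cm.diagonal() / cm.sum(axis=1)

# Correct predictions lie on the diagonal
overall_accuracy = cm.trace() / cm.sum()

print(cm)
print(per_class_recall, overall_accuracy)
```

The same two lines for `per_class_recall` and `overall_accuracy` can be applied to the 8x8 matrix computed for the tissue classes.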

In [0]:
plt.figure(figsize=(8,6))
plt.imshow(mat, cmap='hot', interpolation='nearest')
plt.grid(False)
plt.colorbar()
# confusion_matrix puts true labels on rows (y axis) and predictions on columns (x axis)
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()

## Assignment
* In Step 3 design your own network
* Does the model perform better if you use all three RGB channels ?
* How does the performance change when using the La\*b colorspace ?
