## Assignment 2

Let's again do something very similar to the lab.

**Hand in:** This notebook and a PDF of this notebook. No written answers to the questions are required; they are only here to help you learn.

**You are free to discuss the general concepts with other groups, but not code specifics.**

NOTE: Concepts introduced in this lab may appear on the exam, in which case there will be general questions. Try therefore to understand what you are doing in the assignment, and why. Teachers will be available during the scheduled times, and you can also ask questions in the discussion forums.

## The assignment

You have access to 6 glass slides of cells from patients with tumors (sick), and 4 slides from healthy patients. If you do not understand what a glass slide is, simply think of it as a collection of images from one patient. 

Your task is to make as good a model as you can* to predict whether a new glass slide contains cells from a tumor or healthy cells, thus aiding in the diagnosis of patients. 

This is the same dataset as the one used in this master's thesis, where you can read more about the data: http://uu.diva-portal.org/smash/get/diva2:1119167/FULLTEXT02.pdf. The thesis resulted in this significantly shorter article: https://openaccess.thecvf.com/content_ICCV_2017_workshops/papers/w1/Wieslander_Deep_Convolutional_Neural_ICCV_2017_paper.pdf

\* Within reason. Specific instructions will be given.

## First we need to import all of the packages we need

import numpy as np
import tensorflow as tf
import pandas as pd
from PIL import Image
import IPython
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
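from sklearn.metrics import confusion_matrix, classification_report  # classification_report is used in valid_evaluate
import h5py
import itertools  # used in plot_confusion_matrix
import os, shutil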

from tensorflow.keras.preprocessing.image import ImageDataGenerator

from tifffile import imsave
import cnn_helper


In [None]:
    
def plot_history(model_history, model_name):
    fig = plt.figure(figsize=(15, 5), facecolor='w')
    ax = fig.add_subplot(131)
    ax.plot(model_history.history['loss'])
    ax.plot(model_history.history['val_loss'])
    ax.set(title=model_name + ': Model loss', ylabel='Loss', xlabel='Epoch')
    ax.legend(['train', 'valid'], loc='upper right')
    
    ax = fig.add_subplot(132)
    ax.plot(np.log(model_history.history['loss']))
    ax.plot(np.log(model_history.history['val_loss']))
    ax.set(title=model_name + ': Log model loss', ylabel='Log loss', xlabel='Epoch')
    ax.legend(['train', 'valid'], loc='upper right')

    ax = fig.add_subplot(133)
    ax.plot(model_history.history['accuracy'])
    ax.plot(model_history.history['val_accuracy'])
    ax.set(title=model_name + ': Model accuracy', ylabel='Accuracy', xlabel='Epoch')
    ax.legend(['train', 'valid'], loc='upper right')
    plt.show()
    plt.close()

def plot_confusion_matrix(cm, classes, model_name,
                          cmap=plt.cm.Blues):
    title = model_name + ': Confusion Matrix'
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    
def valid_evaluate(y_true, y_pred, model_name):
    
    # Two classes in this assignment; the order is assumed to match the label encoding of y_true / y_pred
    class_names = ['healthy', 'tumor']
    cnf_matrix = confusion_matrix(y_true, y_pred)
    np.set_printoptions(precision=2)
    plt.figure(figsize=(15,5), facecolor='w')
    plot_confusion_matrix(cnf_matrix, classes=class_names, model_name=model_name)
    plt.show()
    plt.close()
    
    print('')
    print('classification report for validation data:')
    print(classification_report(y_true, y_pred, digits=3))

# Set up the data, look at it

In [None]:
## Set up where to find our data
data_directory = "./LabData/HPV_slides/"



### The image data 
The image data comes in HDF5 format, which means that we have to do some fun things to make this work. You can read more about the format here: https://support.hdfgroup.org/HDF5/whatishdf5.html

In [None]:
slides_healthy = ['glass3', 'glass4', 'glass5', 'glass6', 'glass7', 'glass8']
slides_tumor = ['glass12', 'glass36', 'glass37', 'glass38']

Let's look at one slide first to get a feeling for the data format

In [None]:
## My code

slide = slides_healthy[0]
f = h5py.File(data_directory + slide + '.hdf5' , 'r')

# Read only the second image; equivalent to the chained form f['data'][0:2][1:][:][:]
dset = f['data'][1:2]
print(dset.shape)

In [None]:

dset = f['data']
print(dset.shape)

In [None]:

## Let's look at the images - always a good start to the project

figure, ax = plt.subplots(2, 3, figsize=(14, 10))
figure.suptitle("Examples of images", fontsize=20)
axes = ax.ravel()

for i in range(len(axes)):  # fill all 6 panels of the 2 x 3 grid
    image = f['data'][i]
    image = image.reshape(80, 80)
    axes[i].set_title("image" + str(i)) 
    axes[i].imshow(image)
    axes[i].set_axis_off()
    
plt.subplots_adjust(wspace=0.05, hspace=0.05)
plt.show()
plt.close()

In [None]:
## What happens if we crop the images?
## You can read more about squeeze at https://numpy.org/doc/stable/reference/generated/numpy.squeeze.html

# overall image size
im_x = 80
im_y = 80

center_xy = [int(im_x/2), int(im_y/2)]

# offset = 24 -> tile size of 48 x 48 
# (divides nicely by two for when max pooling, captures most of all cells, will learn faster than 80 x 80 original images) ## This was a note from someone who has worked with the dataset
offset = 24
xydim = offset * 2

        

figure, ax = plt.subplots(2, 3, figsize=(14, 10))
figure.suptitle("Examples of cropped images", fontsize=20)
axes = ax.ravel()

for i in range(len(axes)):  # fill all 6 panels of the 2 x 3 grid
    image = f['data'][i]
    
    image = image.reshape(80, 80)
    crop = image[(center_xy[0] - offset):(center_xy[0] + offset), (center_xy[1] - offset):(center_xy[1] + offset)]
    print(crop.shape)
    image_new = crop.reshape(xydim, xydim)
    print(image_new.shape)
    axes[i].set_title("image" + str(i)) 
    axes[i].imshow(image_new)
    axes[i].set_axis_off()
    
plt.subplots_adjust(wspace=0.05, hspace=0.05)
plt.show()
plt.close()
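
The center crop used above can also be wrapped in a small helper, which makes it easier to reuse when you prepare the full dataset. This is only a sketch: the function name `center_crop` and the synthetic test image are assumptions made for illustration and are not part of the lab code.

```python
import numpy as np

def center_crop(image, offset):
    """Return the central (2*offset) x (2*offset) tile of a 2-D image."""
    center_row = image.shape[0] // 2
    center_col = image.shape[1] // 2
    return image[center_row - offset:center_row + offset,
                 center_col - offset:center_col + offset]

# Synthetic 80 x 80 array standing in for one cell image (not real slide data).
example_image = np.arange(80 * 80, dtype=np.float32).reshape(80, 80)
example_crop = center_crop(example_image, offset=24)
print(example_crop.shape)  # (48, 48)
```

With `offset = 24` the tile covers rows and columns 16 to 63 of the 80 x 80 image, which is the same region as the slicing in the cell above.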


## Divide into training, validation and test set


Mix the data so that images from one slide may end up in the training, validation *or* test set, or in several of them. Simply select the images for each set completely at random.
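
One possible way to make such a random selection is to shuffle the image indices and cut the shuffled list into three parts. The sketch below is an illustration only: the function name `random_split_indices`, the 15 % fractions, and the image count of 1000 are all assumptions, so pick values that suit your data.

```python
import numpy as np

def random_split_indices(n_images, valid_fraction=0.15, test_fraction=0.15, seed=0):
    """Randomly assign image indices to train, validation and test sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_images)
    n_test = round(n_images * test_fraction)
    n_valid = round(n_images * valid_fraction)
    # Sorted, because h5py expects index lists in increasing order.
    test_idx = np.sort(shuffled[:n_test])
    valid_idx = np.sort(shuffled[n_test:n_test + n_valid])
    train_idx = np.sort(shuffled[n_test + n_valid:])
    return train_idx, valid_idx, test_idx

# Hypothetical number of images; use the real count from dset.shape[0].
train_idx, valid_idx, test_idx = random_split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so a later run compares models on the same images.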

In [None]:
 
# For these images it may be easier to use the flow_from_directory command; see https://keras.io/api/preprocessing/image/


In [None]:
# Write your code for the train, valid and test generators here. You may add more cells above this one if you like. 

# Transfer learning

Read more about transfer learning here: https://keras.io/guides/transfer_learning/
Share extra resources in the discussion forums.

Transfer learning is a vital part of most deep learning within life sciences. Can you think of why? 

Hint: Understanding the general concepts of why transfer learning is useful may be useful on the exam. Maybe. 

## VGG16

In [None]:
# TODO: Load a VGG16 model, remove the last layer, and then train the model


In [None]:
## Compile model
## Don't worry about the details here yet
model.compile(optimizer=keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
## Actually train model
## Don't worry about the details here yet

epochs = 5
# fit_generator is deprecated in TensorFlow 2; model.fit accepts generators directly
history = model.fit(train_generator,
                    steps_per_epoch=train_steps,
                    validation_data=valid_generator,
                    validation_steps=validation_steps,
                    epochs=epochs)

In [None]:
## Plot results
plot_history(history, "test_Name")

In [None]:
# plot confusion matrix
plot_confusion_matrix_from_generator(model, valid_generator)



### TODO: Change the code above so that you get at least 80% accuracy


## ResNet50

In [None]:
## TODO: Make a transfer learning model using ResNet50 trained on ImageNet. 


### Q: Are the initial results better or worse than your VGG16 results? Why do you think this is?

### TODO: Vary the model above using at least 3 different parameters/architectures. Show the results for each change, and then combine every parameter/architecture that improved the model.

## Unknown model

Pick another model to try, and try to reach a higher accuracy than your initial VGG16 results. 

# Data augmentation

Pick your best model above, and introduce at least 3 different data augmentations, one at a time. Keep the ones that improve the network (if any) for your final test.
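
To get a feeling for what an augmentation does to an image, it can help to apply a few by hand first. The sketch below uses plain NumPy on a synthetic array; the function name `augment_variants` is an assumption for illustration. It is not the Keras pipeline itself: in the notebook you would more likely switch on the corresponding `ImageDataGenerator` arguments (for example `horizontal_flip`, `vertical_flip` or `rotation_range`).

```python
import numpy as np

def augment_variants(image):
    """Return a few simple label-preserving variants of a 2-D image."""
    return {
        "original": image,
        "flip_horizontal": np.fliplr(image),  # mirror left-right
        "flip_vertical": np.flipud(image),    # mirror top-bottom
        "rotate_90": np.rot90(image),         # rotate 90 degrees counterclockwise
    }

# Synthetic 48 x 48 array standing in for one cropped cell image.
example_image = np.arange(48 * 48).reshape(48, 48)
variants = augment_variants(example_image)
for name, variant in variants.items():
    print(name, variant.shape)
```

Introducing one augmentation at a time, as asked above, lets you attribute any change in validation accuracy to that single augmentation.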

# Final Test

Finally, take the best model you have chosen and test it on the test set. (So far you have only validated your model.)

### Q: If a clinician requires 95% accuracy from their models, would you recommend the model you have generated? Why or why not? What would your next steps be to generate a better neural network?

# A different dataset

### Q: In the previous exercise we mixed all the patients together. What is the main drawback of this type of data mixing?

Using your best model from above, make a test set that includes 2 patients, and where the training and validation sets do not contain the same patients. In short, divide the data so that you can evaluate how well your model would behave with a new patient. Do not be surprised if the model suddenly performs much worse. 
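
A split by patient can be sketched by assigning whole slides, rather than individual images, to each set. The example below is one possible division and makes several assumptions: the function name `split_by_slide` is hypothetical, it holds out one healthy and one tumor slide for the test set, and it does the same for validation. Other divisions may be equally valid.

```python
import random

# Slide names as defined earlier in this notebook.
slides_healthy = ['glass3', 'glass4', 'glass5', 'glass6', 'glass7', 'glass8']
slides_tumor = ['glass12', 'glass36', 'glass37', 'glass38']

def split_by_slide(healthy, tumor, seed=0):
    """Assign whole slides to train, validation and test so no slide is shared."""
    rng = random.Random(seed)
    healthy = healthy[:]  # copy so the original lists are left untouched
    tumor = tumor[:]
    rng.shuffle(healthy)
    rng.shuffle(tumor)
    test = [healthy[0], tumor[0]]    # assumption: one slide per class in the test set
    valid = [healthy[1], tumor[1]]   # assumption: one slide per class for validation
    train = healthy[2:] + tumor[2:]
    return train, valid, test

train_slides, valid_slides, test_slides = split_by_slide(slides_healthy, slides_tumor)
print("train:", train_slides)
print("valid:", valid_slides)
print("test: ", test_slides)
```

Because every image from a slide follows its slide into a single set, the test accuracy then reflects performance on patients the model has never seen.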

### Q: This second model is likely to perform worse. Why? And why is that not guaranteed?