# Image classification using convolutional neural networks

## Introduction - the dataset

Image classification has historically been one of the most challenging tasks in computer vision.
Convolutional neural networks (CNNs) have brought an unprecedented improvement in the accuracy of image classification.

For this workshop we will use a modified version of the [ISIC 2020 Challenge dataset](https://challenge2020.isic-archive.com/). The original version of this dataset contains 33126 images of benign and malignant skin lesions from over 2000 patients. The dataset is particularly challenging to work with (well, it's from a challenge, after all!) because the training set is very imbalanced: it contains fewer than 600 images of malignant lesions.

To simplify matters, and allow you to run this in a reasonable time frame, I have created a balanced version of the dataset, with 584 benign and 584 malignant images. Images in the original dataset are quite large (some are up to 4000 x 6000 pixels) and of different sizes; for simplicity I have resized them to 500 x 500 pixels, which should be sufficient for this workshop.

Finally, note that all of the images used in this workshop come from the training set of the challenge. This is so that we have the ground truth for each image (which is obviously unavailable for the test set in the original data).

## Learning objectives

At the end of this workshop you should be able to:

- Create a CNN classifier using Keras
- Use regularization to avoid overfitting
- Use data augmentation to avoid overfitting and improve accuracy

We start by importing our _usual_ libraries, such as keras, matplotlib, numpy and pandas.

In [3]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import keras

Make sure you have the `ISIC2020_Small_metadata.csv` and `ISIC2020_Small.zip` files in the same directory as this notebook. Unzip the `ISIC2020_Small.zip` into a folder called "ISIC2020_Small" in the same directory as this notebook.

The `ISIC2020_Small_metadata.csv` file contains information about all the images in the dataset. Let's open it and see what we find.

Use the `value_counts` function to count the number of images in each class (benign vs malignant). 

How many are coming from men and how many from women?

<details>
<summary style="cursor: pointer;">Click here to reveal a hint.</summary>
Try using

<code>
metadata['column_name'].value_counts()
</code>

You can even pass a list of columns instead of a single string!
</details>
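
As a minimal sketch of what `value_counts` returns, here is a toy table. The column names `benign_malignant` and `sex`, and all of the rows, are made up for illustration; check `metadata.columns` to find the real names in the CSV file.

```python
import pandas as pd

# Toy stand-in for the metadata table; column names and rows are made up
toy = pd.DataFrame({
    "benign_malignant": ["benign", "malignant", "benign", "malignant", "benign"],
    "sex": ["female", "male", "male", "female", "female"],
})

# Counts for a single column (a Series indexed by the unique values)
class_counts = toy["benign_malignant"].value_counts()
print(class_counts)

# Counts for each combination of two columns (note the double brackets)
combined_counts = toy[["benign_malignant", "sex"]].value_counts()
print(combined_counts)
```

The second call is what the hint means by passing a list of columns: it counts every (class, sex) pair in one go.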

In [None]:
image_dir = 'ISIC2020_Small/'
metadata = pd.read_csv('ISIC2020_Small_metadata.csv')

# Print the metadata and the summary counts
# Your code here

Because this is a very small dataset, it will easily fit into memory, so we could read all the images at once.

This quickly becomes unmanageable for larger datasets, but luckily Keras has a built-in function to help with this.

We are going to use the [image_dataset_from_directory](https://keras.io/api/preprocessing/image/) function (imported from `keras.utils` in the cell below) to create a dataset that loads images from disk in batches, as they are needed.

The function can infer labels directly from the subdirectory names, which makes life a lot easier!

We can also provide a batch size, which is the number of images we want to load at once. We can set it to 64, which should be easy enough to handle, but you might want to experiment with that.

Note that we call the same function twice, once for the training set and once for the validation set. We use the `validation_split` parameter to split the dataset into training and validation sets (80% and 20%, respectively) and, **very importantly**, we set the `seed` parameter to the same number (you can use any number you like) to ensure that the same images are used for training and validation in both cases! This also ensures reproducibility.

In [None]:
from keras.utils import image_dataset_from_directory

training_dataset = image_dataset_from_directory(
    image_dir, 
    labels="inferred",   
    batch_size = 64,
    subset='training',
    validation_split=0.2,
    seed=12345)

test_dataset = image_dataset_from_directory(
    image_dir,
    labels="inferred",
    batch_size = 64,
    subset='validation',
    validation_split=0.2,
    seed=12345) 
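
To see why the seed has to be the same in both calls, here is a conceptual sketch of a seeded shuffle followed by an 80/20 cut. It is not Keras's actual implementation, and the helper `split_indices` is hypothetical; it only illustrates that the two subsets are guaranteed to be disjoint when they come from the same shuffle.

```python
import numpy as np

n_images = 584 + 584          # balanced dataset described above
validation_split = 0.2
batch_size = 64

def split_indices(n, split, seed):
    """Shuffle indices 0..n-1 with a fixed seed, then cut off the last `split` fraction."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n)
    n_val = int(split * n)
    return shuffled[:-n_val], shuffled[-n_val:]

# Same seed in both "calls": the two subsets come from the same shuffle
train_idx, _ = split_indices(n_images, validation_split, seed=12345)
_, val_idx = split_indices(n_images, validation_split, seed=12345)
overlap_same_seed = len(set(train_idx) & set(val_idx))

# Different seeds: the subsets come from different shuffles and can overlap
_, val_other = split_indices(n_images, validation_split, seed=54321)
overlap_diff_seed = len(set(train_idx) & set(val_other))

n_train_batches = int(np.ceil(len(train_idx) / batch_size))
print(len(train_idx), len(val_idx), n_train_batches)
print(overlap_same_seed, overlap_diff_seed)
```

With different seeds, some validation images would also have been seen during training, which would make the validation accuracy look better than it really is.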

We can now use the `take` method of the dataset to load some images for display.
We can pass 1 to `take` to get a single batch of images, then iterate through them.

`take` returns a TensorFlow dataset, which is an iterable object, so we can use a for loop to iterate through it.

Alternatively, `next(iter(training_dataset))` returns the next batch of (images, labels) from the dataset.

Note that the images are returned as TensorFlow tensors, which are not directly displayable (but are very efficient for training!). We can use the `numpy()` method to convert them to numpy arrays, which can be displayed using `matplotlib`.

<details>
<summary style="cursor: pointer;">Click here to reveal a hint.</summary>
Try using

```
# Note this returns *the whole batch*
for images, labels in training_dataset.take(1):
    for im in images:
        ...  # <display each image here>
```

</details>
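
One possible way to lay out a grid of images is sketched below. The helper `show_grid` is hypothetical, and the batch here is random noise standing in for real images, so that the example runs without the dataset. The class names `benign` and `malignant` are also an assumption; the real ones are available as `training_dataset.class_names`.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_grid(images, labels, class_names, n=9):
    """Show the first n images of a batch in a roughly square grid."""
    n_cols = int(np.ceil(np.sqrt(n)))
    n_rows = int(np.ceil(n / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, squeeze=False,
                             figsize=(2.5 * n_cols, 2.5 * n_rows))
    for ax, im, lab in zip(axes.ravel(), images[:n], labels[:n]):
        # Pixel values are floats in [0, 255]; imshow expects uint8 (or floats in [0, 1])
        ax.imshow(np.asarray(im).astype("uint8"))
        ax.set_title(class_names[int(lab)])
        ax.axis("off")
    return fig

# Random stand-in for one batch of images and integer labels
rng = np.random.default_rng(0)
fake_images = rng.uniform(0, 255, size=(9, 64, 64, 3)).astype("float32")
fake_labels = rng.integers(0, 2, size=9)
fig = show_grid(fake_images, fake_labels, class_names=["benign", "malignant"])
```

With the real data you would pass `images.numpy()` and `labels.numpy()` from one batch instead of the random arrays.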

In [None]:
# Display the first 9 images
# Your code here

Ok, we have our images and labels. Let's create a model!

Create a convolutional neural network (CNN) model with the following architecture:

- 3 modules of 3x3 convolutional layers with 32, 64, and 128 filters respectively followed by a max pooling layer
- 2 dense layers with 512 and 64 units respectively
- a final dense layer with 1 unit for the output

Use ReLU activation for all layers except the last one where you can use a sigmoid.

Compile the model. 

**What optimizer and loss function do you want to use?**
**What metrics do you want to calculate?**
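
If you get stuck, the sketch below shows one way the architecture above could be written. Several details are assumptions rather than part of the exercise: the input shape of 256 x 256 (the default `image_size` that `image_dataset_from_directory` resizes to), the `Rescaling` layer, 2 x 2 max pooling, and the choice of Adam with binary cross-entropy. The function name `build_cnn` is hypothetical.

```python
import keras
from keras import layers

def build_cnn(input_shape=(256, 256, 3)):
    """Three Conv2D + MaxPooling2D modules followed by three Dense layers."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Rescaling(1.0 / 255),              # assumption: scale pixels to [0, 1]
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),    # probability of class 1
    ])
    # Assumed choices for a two-class problem with a single sigmoid output
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_cnn()` with the default input shape gives a model with about 59 million parameters, almost all of them in the first dense layer (30 x 30 x 128 inputs times 512 units). Keep that in mind when we look at overfitting later.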

In [None]:
model = keras.models.Sequential()

# Build your model here

model.summary()  # summary() prints the table itself and returns None

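We can now proceed to train our model. We will train for 50 epochs.

So, every epoch all of the training images are used to train the network, one batch at a time. Note that the batch size was fixed when we created the datasets (64 images): when `fit` is given a TensorFlow dataset it uses the dataset's own batches, and `batch_size` should not be passed to it. To train with batches of 128 images, as in the `batch_size` variable below, re-create the datasets with `batch_size=128`.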

Note that we use our test set for validation, so we have no proper validation set for this example.

The `fit` method returns a history object, whose `history` attribute contains the loss and metrics for each epoch of training. This is important to check how well the model is working.

Since this is a fairly time-consuming process, you might want to run this on Google Colab, using a GPU.

In [None]:
batch_size = 128
epochs_num = 50

res = model.fit(
    # your code here
    )

We can now plot loss and accuracy for our model.

**What can you tell from the plot?** (you might want to zoom in on the y axis)
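
A possible plotting helper is sketched below. It assumes the model was compiled with `metrics=["accuracy"]`, so that `res.history` has the keys `loss`, `val_loss`, `accuracy` and `val_accuracy`. The function name `plot_history` is hypothetical and the numbers are made up, purely to exercise the function.

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot loss and accuracy curves from a dict shaped like `res.history`."""
    epochs = range(1, len(history["loss"]) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    for ax, key in ((ax_loss, "loss"), (ax_acc, "accuracy")):
        ax.plot(epochs, history[key], label="training")
        ax.plot(epochs, history["val_" + key], label="validation")
        ax.set_xlabel("epoch")
        ax.set_ylabel(key)
        ax.legend()
    return fig

# Made-up numbers, only to exercise the function
fake_history = {
    "loss": [0.70, 0.55, 0.40, 0.25],
    "val_loss": [0.69, 0.60, 0.58, 0.65],
    "accuracy": [0.50, 0.70, 0.82, 0.91],
    "val_accuracy": [0.52, 0.68, 0.74, 0.75],
}
fig = plot_history(fake_history)
```

After training you would call `plot_history(res.history)` instead.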

In [None]:
# Plot loss and accuracy for training and validation sets

Optionally, you can save the model for later use, so you don't have to train it again.

In [None]:
model.save("mymodels/melanoma_model_01")
# Note: Keras 3 requires a file extension here, e.g. "mymodels/melanoma_model_01.keras"
# You can load back the model at any time using
# model = keras.models.load_model('mymodels/melanoma_model_01')

We have slightly more than 75% accuracy, which is OK, but can definitely be improved!

What do you think is happening here?

<details>
<summary style="cursor: pointer;">Click here to reveal the solution.</summary>
Our model seems to be **overfitting**! 
**How do we tell that?**

1. The training loss continues to decrease, but the validation loss goes up
2. Training accuracy goes up, but validation accuracy reaches a plateau
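
The same two signs can also be read off the history numerically. The sketch below uses a hypothetical helper, `summarize_overfitting`, and made-up numbers in place of a real `res.history`.

```python
def summarize_overfitting(history):
    """Report where validation loss was lowest and how far things drifted afterwards."""
    val_loss = history["val_loss"]
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best + 1,  # 1-based epoch with the lowest validation loss
        "val_loss_rise": val_loss[-1] - val_loss[best],
        "final_accuracy_gap": history["accuracy"][-1] - history["val_accuracy"][-1],
    }

# Made-up numbers, only to exercise the function
made_up = {
    "loss": [0.70, 0.55, 0.40, 0.25, 0.10],
    "val_loss": [0.69, 0.60, 0.58, 0.65, 0.80],
    "accuracy": [0.50, 0.70, 0.82, 0.91, 0.97],
    "val_accuracy": [0.52, 0.68, 0.74, 0.75, 0.75],
}
summary = summarize_overfitting(made_up)
print(summary)
```

A validation loss that keeps rising after its best epoch, together with a large gap between training and validation accuracy, points to overfitting.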
</details>


There are several things that we can improve.

1. We can regularize the model by adding **dropout**. This would be especially important for the dense layers.

2. We can add regularization (e.g. **L2 regularization**) to the model

3. Part of the issue is that the training set is somewhat limited. This is partly because I have given you only a small number of images; however, keep in mind that having a very large amount of image data is not always a given. Actually, it is quite uncommon!
As we saw in the lectures, we can use **data augmentation** to increase the size of our dataset.
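
As an illustration of data augmentation, the sketch below applies random flips, rotations and zooms to a batch using Keras preprocessing layers. The particular layers and ranges are illustrative choices, not values prescribed by this workshop, and the batch is random noise standing in for real images.

```python
import numpy as np
import keras
from keras import layers

# Random transformations; the specific layers and ranges are illustrative choices
augmentation = keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),   # up to +/- 10% of a full turn (36 degrees)
    layers.RandomZoom(0.1),       # zoom in or out by up to 10%
])

# Random stand-in for a batch of 4 images
rng = np.random.default_rng(0)
batch = rng.uniform(0, 255, size=(4, 64, 64, 3)).astype("float32")

# training=True switches the random transformations on
augmented = augmentation(batch, training=True)
print(augmented.shape)
```

The output has the same shape as the input, so these layers can be placed at the start of a model. They are only active during training, so every epoch sees slightly different versions of the same images.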

Let's train a new model exactly in the same way, but adding dropout and regularization.

1. Add 40% Dropout layers after each Dense layer (not the last one, obviously!)
2. Add L2 regularization to the Conv2D layers
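
The two changes could be sketched as follows. The L2 factor of `1e-4`, the input shape, the `Rescaling` layer and the optimizer are assumptions, and `build_regularized_cnn` is a hypothetical name; treat this as a starting point to experiment with rather than the reference solution.

```python
import keras
from keras import layers, regularizers

def build_regularized_cnn(input_shape=(256, 256, 3), l2_factor=1e-4, dropout_rate=0.4):
    """Same architecture as before, with L2 on the Conv2D kernels and Dropout after Dense layers."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Rescaling(1.0 / 255),  # assumption: scale pixels to [0, 1]
    ])
    for n_filters in (32, 64, 128):
        model.add(layers.Conv2D(n_filters, 3, activation="relu",
                                kernel_regularizer=regularizers.l2(l2_factor)))
        model.add(layers.MaxPooling2D())
    model.add(layers.Flatten())
    for n_units in (512, 64):
        model.add(layers.Dense(n_units, activation="relu"))
        model.add(layers.Dropout(dropout_rate))  # randomly zeroes 40% of activations in training
    model.add(layers.Dense(1, activation="sigmoid"))  # no dropout after the output layer
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Dropout adds no trainable parameters, so the parameter count is unchanged; the L2 penalty is added to the loss that is minimized during training.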

In [None]:
model2 = keras.models.Sequential()

#  Build a regularized version of the model here

model2.summary()  # summary() prints the table itself and returns None

As before, repeat the training.

In [None]:
batch_size = 128
epochs_num = 50

res2 = model2.fit(
    # Your code here
    ) 

# Save model
model2.save("mymodels/melanoma_02")
# You can also save the history!
# (load it back with np.load('mymodels/melanoma_02_history.npy', allow_pickle=True).item())
np.save('mymodels/melanoma_02_history.npy', res2.history)

Plot the loss and accuracy for the regularized model.

In [None]:
# Plot loss and accuracy for training and validation sets

Now, that's much better!

The model is now regularized, so it does not overfit anymore (or not as badly). You can play with the hyperparameters to see if you can improve the model even further.

Finally, we can use the model to predict the class of some images. We are going to use the validation set for this, although it would be better to have a separate test set.
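
The model outputs a probability per image (from the sigmoid), so it has to be turned into a class label before it can be compared with the ground truth. The sketch below shows the idea on made-up numbers; the array `probabilities` stands in for the output of `model.predict(images)`, and the 0.5 threshold is an assumption you may want to tune.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth labels (0 or 1)
true_labels = np.array([0, 0, 1, 1, 1, 0])
# Stand-in for model.predict(images): sigmoid outputs with shape (n_images, 1)
probabilities = np.array([[0.10], [0.70], [0.80], [0.40], [0.90], [0.20]])

# Flatten to shape (n_images,) and threshold at 0.5 to get class labels
predicted_labels = (probabilities.ravel() > 0.5).astype(int)

# Rows are the true classes, columns are the predicted classes
conf_matr = confusion_matrix(true_labels, predicted_labels)
print(conf_matr)
```

Which class is 0 and which is 1 depends on the subdirectory names, since labels are assigned in alphabetical order; check `test_dataset.class_names` to be sure.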

In [None]:
from sklearn.metrics import confusion_matrix

images, labels = next(iter(test_dataset))

predictions = ______________

conf_matr = ______________

print(conf_matr)

How did your model do? Can you improve it further?


And that is the end of workshop 6! Hope you enjoyed it and keep (deep) learning!