<a href="https://colab.research.google.com/github/piziomo/Machine-Learning/blob/main/CNN/AI_in_10_Lines_of_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Scientists Hate Him: Train A Tissue Classifier In 4 Simple Steps

Image classification is a common task in machine learning and computational pathology. Image classification is predicting a label (AKA class) for a given image. E.g. is this an image of a dog or a cat?

## Which of These are Cats and Which are Dogs?

![Cats and dogs](https://blogs.rstudio.com/tensorflow/posts/2017-12-14-image-classification-on-small-datasets/images/cats_vs_dogs_samples.jpg)
(RStudio blog)

We intuitively know which of these images are of cats and which are of dogs. How would you tell a computer how to do this? Defining how to do this is hard!

Instead of describing the process by hand, we can use a large dataset of images with the labels/classes assigned by a human. From this dataset we can find patterns in the images which correlate with cats and patterns which correlate with dogs. In practice, this often works quite well.

This process can also be used to classify patches of a slide into classes, e.g. tumour, simple stroma, background, mucosa etc.

![Tissue Patches from Kather 2016](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fsrep27988/MediaObjects/41598_2016_Article_BFsrep27988_Fig1_HTML.jpg?as=webp)

*(a) tumour epithelium, (b) simple stroma, (c) complex stroma (stroma that contains single tumour cells and/or single immune cells), (d) immune cell conglomerates, (e) debris and mucus, (f) mucosal glands, (g) adipose tissue, (h) background.*

(Kather et al. 2016)

## Hands On: Training Your Own Model

Training most machine learning models involves a few main steps:

1. Acquiring / preparing data.
2. Loading data.
3. Training a model (AKA fitting the model to the data).
4. Making predictions with the model (AKA inference) and evaluating them.


Data is usually split into two sets: training and validation.

- Training data is used to iteratively make small changes (AKA updates) to the model so that it more closely relates the data to their respective labels (AKA classes).
- Validation data is not used to update the model during training. It is used to assess the performance of the model during training.
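
To make the two roles concrete, here is a toy sketch in plain Python (it does not use fast.ai). It fits a one-parameter model `y = w * x` with gradient descent: the training pairs drive the updates, while the validation pairs are only used to measure the error. The data, the learning rate and the function name are all made up for illustration.

```python
# Toy example: fit y = w * x. All numbers here are illustrative assumptions.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (x, y) pairs used for updates
valid = [(4.0, 8.0), (5.0, 10.0)]              # (x, y) pairs held back for checking


def mean_squared_error(w, pairs):
    """Average squared difference between the prediction w * x and the label y."""
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)


w = 0.0               # the model's single parameter, before training
learning_rate = 0.05  # size of each update

for epoch in range(20):  # one epoch = one pass through the training data
    for x, y in train:
        gradient = 2 * (w * x - y) * x   # slope of the squared error w.r.t. w
        w -= learning_rate * gradient    # small update, using training data only
    # Validation data measures performance but never changes w
    valid_loss = mean_squared_error(w, valid)

print(f"w = {w:.3f}, validation loss = {valid_loss:.6f}")
```

A real network has millions of parameters rather than one, but the division of labour between the two sets is the same.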

The rest of this notebook will walk you through those steps.

---

## Step 0: Before We Begin: Make a Copy and Enable a GPU

### Make a Copy

To run the code you need to make a copy of this notebook and sign into a Google account.

1. Click 'Open in Playground'
2. Click 'Copy to Drive'

### Enable GPU

(Optional, but much faster than CPU only!)

1. Click 'Runtime' from the menu above ☝️
2. Select 'Change runtime type'
3. Under 'Hardware Accelerator' Select the 'GPU' option
4. Congratulations! You have enabled the GPU!
5. Run the cell below to double check 👇
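
The check cell itself is not shown in this copy of the notebook. As a stand-in, here is a minimal sketch which assumes that the NVIDIA `nvidia-smi` command line tool is on the path whenever a GPU runtime is attached. The function name `gpu_summary` is made up for illustration.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the output of `nvidia-smi -L` (one line per GPU), or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # tool not found, so most likely no GPU runtime
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    output = result.stdout.strip()
    return output if result.returncode == 0 and output else None


summary = gpu_summary()
print(summary if summary else "No GPU detected: training will run on the CPU.")
```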


## Step 1: Download The Data (1 Minute)

The following two lines will

1. Download a zip of the dataset.
2. Unzip the data.

> ▶️ Run the cell below¹ to download and unzip the data 👇

¹ Click inside the cell then press the play button to the side or press ⌘/Ctrl+Enter.

In [None]:
# Use wget to download the dataset
# Short URL: https://tinyurl.com/kather-5000
! wget https://zenodo.org/record/53169/files/Kather_texture_2016_image_tiles_5000.zip
# Unzip it
# (-o = overwrite if needed, -q = be quiet i.e. don't print out lots of unnecessary things)
! unzip -o -q Kather_texture_2016_image_tiles_5000.zip

## Step 2: Import The fast.ai Module & Load The Dataset (1 Second)

The [fast.ai](https://www.fast.ai/) Python module simplifies training a basic model by doing a lot of the 'heavy lifting' for us. By importing fast.ai we can get to training a model very quickly!

The following lines will

1. Import the fast.ai module.
2. Load the dataset from a folder, setting aside 20% for validation.
3. Print out a summary of the data.

Note that `seed=123`. This is the random seed used for reproducibility. We set the seed to a constant value here so that if we run the same code again, the same 'random' (pseudorandom) 20% of the dataset is used for validation.

The data is loaded as $x$ and $y$ pairs. The $x$ here is an image and the $y$ is the value we want to predict from the image AKA the label.
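
The effect of a fixed seed can be sketched with Python's standard `random` module. This is only an illustration of the idea, not how fast.ai implements the split internally; the function name `split_indices` is made up.

```python
import random


def split_indices(n_items, valid_pct, seed):
    """Shuffle item indices with a fixed seed, then hold back valid_pct of them."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # same seed -> same shuffle
    n_valid = int(n_items * valid_pct)
    return indices[n_valid:], indices[:n_valid]   # (training, validation)


train_idx, valid_idx = split_indices(5000, valid_pct=0.2, seed=123)
print(len(train_idx), len(valid_idx))

# Running it again with the same seed gives exactly the same split
assert split_indices(5000, 0.2, 123) == (train_idx, valid_idx)
```

Changing the seed gives a different, but equally repeatable, 20% of the data.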

> ▶️ Run the cell below to load the data and see a short summary 👇

In [None]:
from fastai.vision import *
data = ImageDataBunch.from_folder('Kather_texture_2016_image_tiles_5000', valid_pct=0.2, seed=123)
data

ImageDataBunch;

Train: LabelList (4000 items)
x: ImageList
Image (3, 150, 150),Image (3, 150, 150),Image (3, 150, 150),Image (3, 150, 150),Image (3, 150, 150)
y: CategoryList
03_COMPLEX,03_COMPLEX,03_COMPLEX,03_COMPLEX,03_COMPLEX
Path: Kather_texture_2016_image_tiles_5000;

Valid: LabelList (1000 items)
x: ImageList
Image (3, 150, 150),Image (3, 150, 150),Image (3, 150, 150),Image (3, 150, 150),Image (3, 150, 150)
y: CategoryList
05_DEBRIS,01_TUMOR,04_LYMPHO,04_LYMPHO,03_COMPLEX
Path: Kather_texture_2016_image_tiles_5000;

Test: None

Above we can see that 80% (4000/5000) of our data is in the training set, and 20% (1000/5000) is in the validation set.

The `x:` and `y:` lines show a preview of the first 5 items in each split of the dataset.

We can see that both splits contain images which are 150px by 150px square with three colour channels for red, green and blue (RGB). This is shown next to `x:` by `Image (3, 150, 150)`.

The labels corresponding to each image are shown next to `y:` e.g. `05_DEBRIS,01_TUMOR,04_LYMPHO,04_LYMPHO,03_COMPLEX` (the validation labels in the output above).

> ▶️ Run the cell below to see a preview of the images and their labels 👇

In [None]:
data.show_batch(rows=4, figsize=(6,6))

## Step 3: Train The Model (2 Minutes)

Next, we load a model (here a convolutional neural network (CNN) called ResNet) and then train on (or 'fit to') the data.

The following lines will

1. Load the model, in this case ResNet-18.
2. Use accuracy as a metric to measure the model performance.¹
3. Tell the model to plot a graph of loss² over time using a 'callback'.³
4. Train the model, AKA fit to the data.

¹Note that fastai metrics are calculated on the validation set.

²Loss is a measure of the difference between what the model predicted and what it, ideally, should have predicted. This could be simply the average of the difference between the predicted and the true image labels, or it could be something more complicated e.g. to penalise larger differences more harshly.

³A callback is something which the training function can reference after each epoch (pass through all the training data). Here the callback is a function which can plot a graph.
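
The two kinds of loss described in note ² can be sketched in a few lines of plain Python. The numbers are made up for illustration. Classification models commonly use a different loss (cross-entropy); the two below are shown only because they match the description in the note.

```python
# Illustrative numbers only: true values and a model's predictions
y_true = [1.0, 0.0, 1.0, 1.0]
y_pred = [0.9, 0.2, 0.8, 0.1]   # the last prediction is badly wrong

differences = [abs(p - t) for p, t in zip(y_pred, y_true)]

# Simple loss: the average absolute difference
mean_absolute_error = sum(differences) / len(differences)

# Harsher loss: squaring makes large differences count for more
mean_squared_error = sum(d ** 2 for d in differences) / len(differences)

# Share of each loss that comes from the single worst prediction
worst = max(differences)
share_absolute = worst / sum(differences)
share_squared = worst ** 2 / sum(d ** 2 for d in differences)

print(f"MAE = {mean_absolute_error:.3f}, MSE = {mean_squared_error:.3f}")
print(f"worst prediction's share: {share_absolute:.0%} (absolute) vs {share_squared:.0%} (squared)")
```

With squaring, the single bad prediction accounts for a much larger share of the total loss, which is what 'penalising larger differences more harshly' means in practice.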


> ▶️ Run the cell below to train the model 👇

In [None]:
learn = cnn_learner(data, models.resnet18, metrics=accuracy, callback_fns=ShowGraph)
learn.fit(5)  # Train for 5 epochs (passes through all of the training data)

The table above shows a summary of the training process after each epoch with the training loss, validation loss, (validation) accuracy and time taken.

## Step 4: Plot A Confusion Matrix (1 Minute)

The accuracy alone from the previous step tells us how well the model performed **on average** across all of the images. This **doesn't tell us how accurately it can predict each type of image** e.g. tumour. One way to find out more about the model's performance on each label or class is to plot a confusion matrix.

> ▶️ Run the cell below to plot a confusion matrix 👇

In [None]:
preds, y, losses = learn.get_preds(with_loss=True)
interp = ClassificationInterpretation(learn, preds, y, losses)
print(accuracy(preds, y)) # Same as accuracy in table from previous cell
interp.plot_confusion_matrix()

Here you can see that most of the images lie on the diagonal which is good. This means that the predictions line up well with the actual labels/classes.

Note that there are a few stray high values off the diagonal e.g. actual debris and predicted stroma. The debris and stroma images often look quite similar to each other. Therefore, it makes sense that the model struggles to separate these classes more than others.

Additionally, complex and stroma appear to be sometimes confused. Does this also make sense?
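
To see what the plot is counting, here is a small sketch which builds a confusion matrix by hand in plain Python. The labels are made up for illustration and are not output from the trained model.

```python
from collections import Counter

# Made-up labels for illustration; not output from the trained model
actual    = ["TUMOR", "TUMOR", "STROMA", "STROMA", "DEBRIS", "DEBRIS", "DEBRIS"]
predicted = ["TUMOR", "TUMOR", "STROMA", "DEBRIS", "STROMA", "DEBRIS", "DEBRIS"]

classes = sorted(set(actual))
counts = Counter(zip(actual, predicted))   # (actual, predicted) -> number of images

# Rows are actual classes, columns are predicted classes
matrix = [[counts[(a, p)] for p in classes] for a in classes]

overall_accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
# Per-class accuracy: diagonal entry divided by the row total
per_class = {a: matrix[i][i] / sum(matrix[i]) for i, a in enumerate(classes)}

print(classes)
for name, row in zip(classes, matrix):
    print(f"{name:>7}", row)
print(f"overall accuracy = {overall_accuracy:.2f}", per_class)
```

Here the overall accuracy hides the fact that one class is predicted perfectly while another is right only half of the time, which is exactly the kind of detail the confusion matrix exposes.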

# Congratulations!

You have trained a neural network image classifier and evaluated its predictions in just 10 lines of Python code!

![Congratulations](https://media2.giphy.com/media/pAaROqrcFT5Ze/giphy.gif?cid=790b761123ced419de2444db29c2657385c36ec5944f6e96&rid=giphy.gif)

# Extra Bits To Play With

Here are some extra bits which you can play around with.

## Show An Image & Its Prediction

In [None]:
n = 0
img = learn.data.valid_ds[n][0] # Get the nth image (here the first) from the validation set
img.show() # Show the image
print(learn.predict(img)[0]) # Make a prediction and display it

## Show Top Losses

Plot the images which resulted in the highest loss, i.e. the images which the network predicted incorrectly or was least sure about.

In [None]:
interp.plot_top_losses(9, figsize=(14, 7))
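
The idea behind ranking by loss can be sketched in plain Python. This assumes a cross-entropy loss, where the loss for one image is the negative log of the probability the model gave to the true class. The image names and probabilities are made up for illustration.

```python
import math

# Made-up predictions: (image name, probability the model gave to the true class)
predictions = [("img_a", 0.95), ("img_b", 0.40), ("img_c", 0.05), ("img_d", 0.70)]

# Cross-entropy loss for one image: -log(probability assigned to the true class)
losses = [(name, -math.log(p_true)) for name, p_true in predictions]

# Sort from highest to lowest loss and keep the top 2
top_losses = sorted(losses, key=lambda item: item[1], reverse=True)[:2]

for name, loss in top_losses:
    print(f"{name}: loss = {loss:.2f}")
```

The lower the probability given to the true class, the higher the loss, so the images at the top of the list are the ones the model got most wrong.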

# Where to Go from Here

There are many resources on the web to help develop an understanding of machine learning methods. Below I have linked just a small selection of accessible resources to get started.

## Videos

There are many accessible videos which can give a good background to machine learning and its sub-topics.

### Computerphile (YouTube)

This is a YouTube channel run by Brady John Haran which features accessible videos on computer science topics and is the sister channel of 'Numberphile'.

- [Inside a Neural Network](https://youtu.be/BFdMrDOx_CM)
- [How Blurs & Filters Work (A core concept behind CNNs)](https://youtu.be/C_zFhWdM4ic)
- [Neural Network that Changes Everything (CNNs)](https://youtu.be/py5byOOHZM8)
- [Deep Learning with CNNs](https://youtu.be/TJlAxW-2nmI)
- [Deep Learning (2017)](https://youtu.be/l42lr8AlrHk)
- [Deep Learning (2019)](https://youtu.be/TJlAxW-2nmI)
- [Data Analysis with Dr Mike Pound (Playlist)](https://www.youtube.com/playlist?list=PLzH6n4zXuckpfMu_4Ff8E7Z1behQks5ba)

### CGP Grey (YouTube)

- [How Machines Learn](https://youtu.be/R9OHn5ZF4Uo)

## Online Interactive Demos

- [A Neural Network Playground - Tensorflow](https://playground.tensorflow.org/)
- [Distill - An Interactive Online Journal](https://distill.pub/)

## Glossaries

- [Machine learning glossary of important terms - Microsoft](https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/glossary)
- [Glossary of Machine Learning Terms - Semantic Bits](https://semanti.ca/blog/?glossary-of-machine-learning-terms)


## Books

- Hello World: How to be Human in the Age of the Machine by Hannah Fry, Associate Professor in mathematics at University College London.
  - A tour of the ethical implications which arise when trusting machines to make decisions for us. Topics include crime and punishment, medical interventions, and self-driving cars.