# Welcome

This Python notebook is a step-by-step guide to training a neural network on Google Colab using the *MISO* particle classification library.

The MISO library is a set of Python scripts that simplify creating, training and saving a convolutional neural network (CNN), primarily for particle images such as foraminifera.

The MISO library can be used to train common CNN topologies such as ResNet. It also includes two custom full CNN designs, "base_cyclic" and "resnet_cyclic", that were developed at CEREGE and give good results with quick training time, and a ResNet50 transfer learning method that is extremely fast to train.

Training assumes single particle images with the particles roughly centred in the image, and saved in jpeg, bmp, tiff or png format.

**Note:** The notebook is interactive - you run the code inside it. Code cells are coloured light grey. To run a cell, hover the mouse above it and click the play arrow in the top-left corner.

***

***

# Getting Started

## Save a copy (recommended)

If you want to keep a copy of this notebook and any changes you make, please make a copy:

*   Click *File* -> *Save a copy in Drive...* to save this notebook in your Google Drive.
*   A new Google Colab tab will open up with a copy of the notebook.
*   Click *File* -> *Rename...* to give the notebook a more memorable name.

## Enabling the GPU (Important)

For fast training we need to enable the GPU. 

In the menu bar of the Google Colab webpage, click *Runtime* -> *Change runtime type* and in the dialog that pops up, change *Hardware accelerator* to *GPU*. Click save to restart Google Colab with a GPU.

**Note:** You can check which GPU has been enabled using the `!nvidia-smi` command. The `T4` is 3-4 times faster than the `K80`.

In [None]:
!nvidia-smi

## MISO library

The MISO python library contains the code for creating and training the neural networks. It uses **Tensorflow version 1.14 / 1.15.X**.

Run the cell below to set the Tensorflow version and install the latest version of MISO from its GitHub repository.

**Note:** Google Colab will prompt you (at the bottom of the cell) to restart the runtime if you have already installed the library this session.

In [None]:
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

!pip install -U git+https://www.github.com/microfossil/particle-classification.git

***

***

# Dataset

## Preparation

The training set consists of images of the classes to classify, sorted into directories by class name.

For detailed instructions on constructing a training set please see [this tutorial][1]. _ParticleTrieur_ is a cross-platform Java app that helps create training sets; it can be downloaded from [here][2].

[1]: https://particle-classification.readthedocs.io/en/latest/tutorial/dataset_creation.html
[2]: https://particle-classification.readthedocs.io

**Structure**

The dataset must consist of 1) a single root folder with the dataset name, containing 2) a subfolder for each class, with the same name as the class, and 3) all the images for that class inside it.
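As a quick sanity check of this layout, the sketch below counts the images in each class subfolder. It is not part of the MISO library: the function name `count_images_per_class` is hypothetical, and the extension list is simply taken from the image formats mentioned in the introduction.

```python
import os

# Extensions assumed from the formats listed in the introduction.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".png")


def count_images_per_class(root_dir):
    """Return {class_name: image_count} for a root/class/image layout."""
    counts = {}
    for entry in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, entry)
        if not os.path.isdir(class_dir):
            continue  # stray files in the root folder are not classes
        counts[entry] = sum(
            1
            for name in os.listdir(class_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts
```

Calling it on the root folder of a dataset returns a dictionary of class names and image counts, which makes empty or misnamed class folders easy to spot before training.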

## Make available to Google Colab

To use an image dataset for training, it must be accessible to Google Colab.

There are three methods to do this:

1.   Zip the dataset and upload it to a sharing site that supports direct links, such as Dropbox, OneDrive, OwnCloud, Nuage, etc. *Google Drive does not work*
2.   Upload the dataset to Google Drive
3.   Zip the dataset and upload it to Google Colab (not recommended)


**Before continuing**

If not already open, click the small arrow under the Colab icon (CO) in the upper left corner of the webpage to open the left pane, and then go to the *Files* tab.

## Method 1: Upload to Dropbox / OneDrive / OwnCloud / Nuage etc

This method is quick, doesn't require Google Drive installed, and you can share the link to the training set with other people.

It does not work with Google Drive, as it is currently impossible to get a direct download link for large datasets.

### 1. Create a zip file

We must create a zip file of the dataset.

Make sure the name of the root folder is a unique name that will not conflict with any other dataset.

Compress the root folder of the dataset into a .zip file:

Windows: Right-click the root folder, then click *Send to --> Compressed (zipped) folder*

Mac: Right-click the root folder, then click *Compress items*
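If you prefer to create the zip file from Python on your own computer, the following sketch does the same thing with the standard library. The helper name `zip_dataset` is hypothetical; `shutil.make_archive` is used so that the root folder itself is the top-level entry inside the archive, as with the menu commands above.

```python
import os
import shutil


def zip_dataset(root_dir):
    """Zip root_dir so that the archive contains the root folder itself."""
    root_dir = os.path.abspath(root_dir)
    parent, folder_name = os.path.split(root_dir)
    # The archive is written next to the root folder as <folder_name>.zip.
    # base_dir keeps the root folder as the top-level entry in the archive.
    return shutil.make_archive(root_dir, "zip", root_dir=parent, base_dir=folder_name)
```

The function returns the path of the created zip file, which can then be uploaded as described in the next step.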

### 2. Upload

Upload the zip file to your favourite cloud sharing provider, e.g. Dropbox or OneDrive.

### 3. Create a sharing link

Once the file has finished uploading or syncing, create a sharing link for the file.

For example, in Dropbox in Windows, right-click your zip file in the Dropbox folder and select *Copy Dropbox Link*.

If you have the option to set permissions for the link (e.g. with OneDrive) make sure they are read-only so that you can share the data without others being able to edit it.

### 4. Convert the link into a direct download link

For some sharing services such as Dropbox and Onedrive, the sharing link points to a *website* where you can download the file. We need a link that goes directly to the file instead.

Thus, the link must be changed from the website link to the direct download link.

**Dropbox:**

Change www.dropbox.com to dl.dropboxusercontent.com:

For example,

https://www.dropbox.com/s/wlxcp29u8t0z9yw/DummyFile.TXT?dl=0

becomes

https://dl.dropboxusercontent.com/s/wlxcp29u8t0z9yw/DummyFile.TXT?dl=0

**OneDrive:**

Change the ms in the first part of the link to ws.

For example,

https://1drv.ms/u/s!AiQM7sVIv7fah4IZlw0GmHAwmOT9DY

becomes

https://1drv.ws/u/s!AiQM7sVIv7fah4IZlw0GmHAwmOT9DY

**Nuage / Owncloud:**

Add `/download` to the end of the link.

For example,

https://nuage.osupytheas.fr/s/crYrKyXdQqAR5E6

becomes

https://nuage.osupytheas.fr/s/crYrKyXdQqAR5E6/download

**Other services:**

For other services, check if the sharing link is a direct download link by pasting it into the address bar of your internet browser and pressing enter.

If the zip file starts downloading, it is a direct download link already, and nothing needs to be changed. If it goes to a website, you may need to search how to create a direct download link for your particular service. Note that for Google Drive files there is currently no way to create a direct download link for large files (it will go to an anti-virus confirmation website).
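The three rewrite rules above can also be expressed as a small helper. This is only an illustrative sketch: the function `to_direct_link` and its `service` argument are hypothetical and not part of MISO, and the rules are exactly the ones listed above for Dropbox, OneDrive and Nuage / OwnCloud.

```python
from urllib.parse import urlsplit, urlunsplit


def to_direct_link(share_url, service):
    """Apply the per-service rewrite rule to a sharing link."""
    parts = urlsplit(share_url)
    if service == "dropbox":
        # www.dropbox.com -> dl.dropboxusercontent.com
        parts = parts._replace(netloc="dl.dropboxusercontent.com")
    elif service == "onedrive":
        # 1drv.ms -> 1drv.ws
        parts = parts._replace(netloc=parts.netloc.replace("1drv.ms", "1drv.ws"))
    elif service == "owncloud":
        # Nuage / OwnCloud: append /download to the path
        parts = parts._replace(path=parts.path.rstrip("/") + "/download")
    else:
        raise ValueError("unknown service: " + service)
    return urlunsplit(parts)


print(to_direct_link("https://nuage.osupytheas.fr/s/crYrKyXdQqAR5E6", "owncloud"))
# https://nuage.osupytheas.fr/s/crYrKyXdQqAR5E6/download
```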

## Method 2: Upload to Google Drive

Google Drive is a sharing service that gives you about 15GB of storage. Everyone with a Google account automatically has a Google Drive account.

### 1. Upload

There is no need to create a zip file with this method.

Instead, create a `datasets` directory on your Google Drive and save the root folder to it. This can either be done online, or by syncing with Google Drive on your computer.

To download Google Drive, go to https://drive.google.com/drive/my-drive, click the settings icon (gear icon in top right) and then click *Get Backup and Sync for Windows*.

### 2. Mount

Run the cell below to mount your Google Drive on Google Colab. Once mounted, all your files on Google Drive are accessible from the *Files* tab on the left under `/content/drive`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


## Method 3: Transfer from your computer

You can transfer the images directly from your computer and save them on Google Colab. 

The disadvantages of this method are that uploading the data can be slow and that the data is deleted from Google Colab when the session ends.

### 1. Create a zip file

We must create a zip file of the dataset as for the first method.

Make sure the name of the root folder is a unique name that will not conflict with any other dataset.

Compress the root folder of the dataset into a .zip file:

Windows: Right-click the root folder, then click *Send to --> Compressed (zipped) folder*

Mac: Right-click the root folder, then click *Compress items*

### 2. Create directory

Now we create a directory called `datasets` in Google Colab by running the following cell:

In [None]:
import os
os.makedirs("/content/datasets/", exist_ok=True)

Click *Refresh* in the *Files* tab in the left pane of the Google Colab screen. You should see a new, empty directory called `datasets`. The dataset will be uploaded into it in the next step.

### 3. Upload

In the *Files* tab in the left pane of Google Colab, right-click the newly created *datasets* directory and click *Upload*.

Choose the zip file and start the upload. The progress is shown at the bottom of the *Files* tab.

### 4. Unzip

Once the file has uploaded, right-click the zip filename in the *Files* pane, and click *Copy path*.

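Paste the copied path in the cell below (replacing PASTE_HERE) and run it. The `-d` option extracts the archive into the `datasets` directory:

In [None]:
!unzip PASTE_HERE -d /content/datasets/

You should see a new folder with the name of your dataset inside *datasets* in the *Files* tab. If not, click *Refresh*.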




***

***

# Training

We are now ready to configure the network and begin training!

We shall use the simple training interface provided by the MISO python library.

It allows us to train on the dataset using a variety of pre-made neural network topologies. The results and the trained neural network are saved on Google Colab for download.

**Note:** The directories on Google Colab are cleared after each session. Remember to download the results before quitting!

## Configuration

### 1. Setup

First we load the training method and create a dictionary to hold the configuration parameters.

The `default_params` function is used to initialise the parameters.

In [None]:
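import ssl
from miso.training.model_trainer import train_image_classification_model
from miso.training.model_params import default_params

# Skip HTTPS certificate verification for downloads made through urllib
ssl._create_default_https_context = ssl._create_unverified_context

# Start from the library defaults and print them for reference
params = default_params()
print(params)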



Add a short name and description for the network.

*   The short name will be used to identify the network and construct the output save directory.
*   The description can be a more in-depth summary of the network and dataset. Set to *None* to have the description automatically generated.




In [None]:
params['name'] = 'google_colab_example'
params['description'] = None

### 2. Input Source

Configure the location of the training set according to the method used to upload previously:

**Zip file on Dropbox / OneDrive / OwnCloud / Nuage etc.:**

Set the `input_source` parameter to the direct download link URL. (The training script will download the file and unzip it to a directory when run.) Enclose the address in quotes with an `r` at the front. For example:

`params['input_source'] = r'https://nuage.osupytheas.fr/s/crYrKyXdQqAR5E6/download'`

**Folder on Google Drive**

Use the path to the folder on Google Drive. To easily get the path, navigate to the base folder of the dataset in the *Files* tab, right-click the folder and select *Copy Path*. For example:

`params['input_source'] = r'/content/drive/My Drive/datasets/DATASET_NAME'`

**Folder transferred from computer**

Use the path to the folder.

In [None]:
# Uncomment (remove the hash symbol) the line corresponding to the method used
# and comment out the others:

# Zip file from cloud storage:
params['input_source'] = r'https://1drv.ws/u/s!AiQM7sVIv7fah4MN2gWCXDWX_DT0OA?e=Eu3lZh'

# Folder on Google Drive
# params['input_source'] = r'/content/drive/My Drive/datasets/DATASET_NAME'

# Folder transferred from computer
# params['input_source'] = r'/content/datasets/DATASET_NAME'

### 3. Input Options

Other input parameters are:

#### Minimum count per class

For training to work well there should be a minimum number of examples in each class. We recommend at least 40, and ideally 200, but it depends on how variable the images are in the dataset. Setting `data_min_count` excludes any classes which have fewer than that many images.

If desired, the images from classes with fewer than `data_min_count` examples can be collected into a single _others_ class by setting `data_map_others` to `True`.
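To make the effect of these two options concrete, here is a minimal sketch that works on class counts only. It is an illustration rather than MISO's actual code: the function `apply_min_count` and the example class names are hypothetical, and it assumes the pooled _others_ class is kept regardless of its size.

```python
def apply_min_count(class_counts, min_count, map_others=False):
    """Return the class counts that would remain for training."""
    # Classes with at least min_count images are kept as they are.
    kept = {name: n for name, n in class_counts.items() if n >= min_count}
    if map_others:
        # Images from the small classes are pooled into one "others" class.
        pooled = sum(n for n in class_counts.values() if n < min_count)
        if pooled > 0:
            kept["others"] = kept.get("others", 0) + pooled
    return kept


counts = {"globigerina": 250, "orbulina": 120, "rare_a": 12, "rare_b": 30}
print(apply_min_count(counts, 40))
# {'globigerina': 250, 'orbulina': 120}
print(apply_min_count(counts, 40, map_others=True))
# {'globigerina': 250, 'orbulina': 120, 'others': 42}
```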

#### Test / train split

A random proportion of the dataset is set aside for testing / validation. That is, it is not used in training and is instead used to evaluate the accuracy of the network. The proportion of test images is usually around 20% and is set with `data_split`. The split between test and train is normally random, but you can set the random `seed` to an integer to ensure the same random ordering is used each time.

When using N-fold cross-validation we can also set `data_split_offset`. This chooses which fold is held out as the test data. E.g. if the split is 0.2 (20%) then we may select 0, 1, 2, 3, or 4 to use the respective 20% as the test data.

We do NOT use a separate validation set: training is stopped based on the training loss!
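The relationship between `data_split`, `data_split_offset` and `seed` can be sketched as follows. This is an assumed illustration of fold selection, not the MISO implementation: the function `split_indices` is hypothetical, and it assumes the dataset is shuffled once with the seed and then cut into consecutive folds.

```python
import random


def split_indices(n_images, data_split, data_split_offset=0, seed=None):
    """Return (train_idx, test_idx) for one fold of a shuffled dataset."""
    order = list(range(n_images))
    # The same integer seed always gives the same shuffled order.
    random.Random(seed).shuffle(order)
    fold_size = int(round(n_images * data_split))
    start = data_split_offset * fold_size
    test_idx = order[start:start + fold_size]
    train_idx = order[:start] + order[start + fold_size:]
    return train_idx, test_idx


# With a split of 0.2 there are five folds, selected by offsets 0 to 4.
train_idx, test_idx = split_indices(100, 0.2, data_split_offset=3, seed=0)
print(len(train_idx), len(test_idx))
# 80 20
```

With a fixed seed, running the five offsets in turn uses every image as test data exactly once, which is what N-fold cross-validation requires.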

#### Class weights

If the dataset is heavily unbalanced, with lots of images in only a few classes, training may give good accuracy on those classes at the expense of accuracy for the classes with few images. To account for this, we can weight the importance of the classes according to their counts by setting `use_class_weights` to `True`.
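As an illustration of weighting by class count, the sketch below uses the common "balanced" heuristic, where each class weight is `total / (number_of_classes * class_count)`. This notebook does not state the exact formula MISO uses, so treat the formula and the function name `balanced_class_weights` as assumptions.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its number of images."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * n) for name, n in class_counts.items()}


weights = balanced_class_weights({"common": 900, "rare": 100})
print(weights["rare"])
# 5.0
```

In this example the rare class receives a weight of 5.0 and the common class a weight of about 0.56, so an error on a rare image counts roughly nine times as much as an error on a common one.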

In [None]:
params['data_min_count'] = 40
params['data_map_others'] = False

params['data_split'] = 0.20
params['data_split_offset'] = 0
params['seed'] = None

params['use_class_weights'] = True

### 4. Output Location

The output location specifies where the training results (trained CNN model, graphs, etc) will be saved on Google Colab.

You can save them on Google Colab:

`params['output_dir'] = r'/content/output/'`

Or in your Google Drive folder, e.g.:

`params['output_dir'] = r'/content/drive/My Drive/output/'`

Saving to Google Drive has the advantage that the results will be synced to your computer automatically if you have Google Drive installed.

In [None]:
# Save to Google Colab:
params['output_dir'] = r'/content/output/'

# Save to Google Drive
# params['output_dir'] = r'/content/drive/My Drive/output/'

**Note:** For Google Drive, if you have not already done so, the drive must be mounted on Google Colab by running the following cell and entering the code:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### 5. Output Options

#### Model Format

There are three options to save the model (trained network) using the `save_model` parameter:

*  `None`: Don't save the model
*  `saved_model`: Save the model in Tensorflow saved model format
*  `frozen`: Freeze and save the model **(recommended)**

The `frozen` method is recommended so that the trained network can be used in the *ParticleTrieur* program. Models can be quite large (20 - 100MB depending on type).

#### Estimate mislabeled

If enabled, mislabeled images in the training (and test) dataset are estimated by using k-NN to compare the CNN feature vector (output of the penultimate dense layer) of each image with those of the other images, and flagging those images whose most similar images belong to a different class.

E.g. if the image is labeled in class A but the most similar images are in class B, it will be flagged as mislabeled. A figure is generated for each potentially mislabeled image in the output directory. Set `save_mislabeled` to `True` to enable this.
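The idea can be sketched in a few lines of plain Python. This is a simplified illustration and not MISO's implementation: the function `flag_mislabeled`, the choice of Euclidean distance, the majority vote and the value of `k` are all assumptions, and the two-dimensional vectors stand in for real CNN feature vectors.

```python
import math
from collections import Counter


def flag_mislabeled(vectors, labels, k=3):
    """Return indices whose k nearest neighbours mostly carry another label."""
    flagged = []
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        # Distances to every other image (the image itself is excluded).
        others = [
            (math.dist(vec, other), labels[j])
            for j, other in enumerate(vectors)
            if j != i
        ]
        nearest = sorted(others, key=lambda pair: pair[0])[:k]
        majority_label, _ = Counter(lbl for _, lbl in nearest).most_common(1)[0]
        if majority_label != label:
            flagged.append(i)
    return flagged


# Two tight clusters, plus one point in cluster A that carries label "B".
vectors = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
           (5, 5), (5.1, 5), (5, 5.1), (0.05, 0.05)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(flag_mislabeled(vectors, labels))
# [7]
```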

In [None]:
params['save_model'] = 'frozen'
params['save_mislabeled'] = True

### 6. Convolutional Neural Network

The type of convolutional neural network determines the accuracy of classification, the time to finish training and the size of the network files.

We recommend starting with the `resnet50_cyclic_tl` network to get an idea of the rough accuracy for your dataset and to check that everything is running correctly.

The types available are:

#### Base / ResNet Cyclic (Custom)

Codes: `base_cyclic, resnet_cyclic`

These CNNs were developed at CEREGE for use with foraminifera particle images. The `base_cyclic` design uses blocks of two convolutional layers, while the `resnet_cyclic` design uses ResNet-style blocks.

They include *cyclic layers* that give some rotational invariance internal to the network. The recommended input is greyscale images (single channel) with size 128 x 128 pixels.

The networks also have some extra parameters:

*   `filters`: Number of filters in the first convolution layer (default: 4)
*   `use_batch_norm`: Use batch norm layers (default: True)
*   `global_pooling`: Use average (`avg`), maximum (`max`) or no global pooling layer (`None`) (default: None)
*   `activation`: Use ReLU (`relu`), ELU (`elu`) or SeLU (`selu`) for the activation function (default: relu)

#### Transfer Learning (Fast)

Codes: `resnet50_tl, resnet50_cyclic_tl, resnet50_cyclic_gain_tl`

This uses a ResNet50 network that has been pre-trained on ImageNet to generate feature vectors which are then trained in a two-layer flat network. It is almost as accurate as the previous network types but much faster to train.

The cost of the increased speed is that no image augmentation is performed. The cyclic and gain variants add some internal variation to the network to compensate, using cyclic layers and random gain layers before the ResNet50 model, respectively.

#### Pre-constructed (Keras Applications)

The MISO library includes the classification_models library from https://github.com/qubvel/classification_models

You can use any of the models described in the above link (which includes accuracy and running times for the ImageNet training set): `vgg16, vgg19, resnet18, resnet34, resnet50, resnet101, resnet152, resnet50v2, resnet101v2, resnet152v2, resnext50, resnext101, densenet121, densenet169, densenet201, inceptionv3, xception, inceptionresnetv2, seresnet18, seresnet34, seresnet50, seresnet101, seresnet152, seresnext50, seresnext101, senet154, nasnetlarge, nasnetmobile, mobilenet, mobilenetv2`.

The image size and type (colour or grayscale) are also described. Most of these networks are designed for very large training sets with lots of classes, and therefore take a long time to train. We recommend trying these networks as a first step:

*    `resnet18`: ResNet-style network with 18 layers. ResNet networks use skip connections to help propagate small features to the deeper layers.
*    `seresnet18`: Same as `resnet18` but with squeeze excitation layers
*    `densenet121`: DenseNet-style network with 121 layers. DenseNet networks use lots of layers with only a small number of filters, where every layer is connected to the layers following it, inside a block.
*    `mobilenetv2`: A small, accurate network that is fast to train.

These four designs use 224 x 224 images; however, we have found that larger images may give better results. Either colour or grayscale can be used.

#### Image Size

Finally, the image size must be selected.

Choose the appropriate width and height (these must be the same!) and number of channels: 1 for grayscale, 3 for colour.
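A small check like the one below can catch a mismatched size before a long training run starts. It is a hypothetical helper, not part of MISO; it only encodes the two rules stated above, using the parameter names `img_height`, `img_width` and `img_channels` from the configuration cell.

```python
def check_image_params(params):
    """Validate the image size settings and return (height, width, channels)."""
    if params["img_height"] != params["img_width"]:
        raise ValueError("img_height and img_width must be equal")
    if params["img_channels"] not in (1, 3):
        raise ValueError("img_channels must be 1 (grayscale) or 3 (colour)")
    return (params["img_height"], params["img_width"], params["img_channels"])


print(check_image_params({"img_height": 224, "img_width": 224, "img_channels": 3}))
# (224, 224, 3)
```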

In [None]:
params['type'] = 'resnet50_cyclic_tl'
params['img_height'] = 224
params['img_width'] = 224
params['img_channels'] = 3

# Custom parameters for base_cyclic and resnet_cyclic
params['filters'] = 4
params['use_batch_norm'] = True
params['global_pooling'] = None
params['activation'] = 'relu'

### 7. Training

#### Batch size

A default batch size of 64 is used for training. The batch size can be reduced if there are out-of-memory errors; however, this should not be a problem on Google Colab.

#### Adaptive learning rate (ALR)

When training the network, the learning rate (how rapidly the network weights are allowed to change) is dropped by half whenever the improvement in loss (how well the network fits the training data) reaches a plateau. The plateau is detected by looking at the loss over the most recent number of epochs. Training is stopped after this plateau is reached a certain number of times.

This adaptive learning rate system is controlled by two parameters:

*   `alr_epochs`: The number of epochs (complete runs through the training data) to consider when detecting the plateau.
*   `alr_drops`: The number of times the learning rate is dropped (plateau detected) after which training is stopped.

A larger `alr_epochs` will result in better accuracy but longer training time, with diminishing returns. From experience we have found that a value of 40 works well for datasets with about 200 images per class, and 5-10 for large datasets with 1000+ images per class.
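To show how `alr_epochs` and `alr_drops` interact, the sketch below replays a list of per-epoch losses. It is an assumption-laden illustration, not MISO's algorithm: here a plateau means "no new best loss for `alr_epochs` consecutive epochs", and the function `simulate_alr` and the `initial_lr` value are hypothetical.

```python
def simulate_alr(losses, alr_epochs, alr_drops, initial_lr=0.001):
    """Replay a loss history and return (stop_epoch, final_learning_rate)."""
    lr = initial_lr
    drops = 0
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= alr_epochs:
            # Plateau detected: halve the learning rate and start counting again.
            lr *= 0.5
            drops += 1
            epochs_since_best = 0
            if drops >= alr_drops:
                return epoch, lr  # training stops here
    return len(losses), lr


# The loss improves for three epochs and then stays flat.
history = [1.0, 0.9, 0.8] + [0.8] * 6
print(simulate_alr(history, alr_epochs=2, alr_drops=3))
# (9, 0.000125)
```

In this toy history each plateau takes two flat epochs to detect, so three drops are reached at epoch 9. With the notebook defaults of `alr_epochs = 40` and `alr_drops = 4`, training therefore continues for at least 160 epochs after the loss stops improving under this assumed criterion.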

#### Maximum limit

There is another parameter, `max_epochs`, that sets the maximum number of epochs after which training will be stopped regardless. Typically we set this to a high number, as the ALR system will usually stop training before this is reached. However, you can also set it to a very small number, e.g. 2, to quickly run the training just to check everything is working, before setting it back to a high number, e.g. 10000, for proper training.



In [None]:
params['batch_size'] = 64
params['alr_epochs'] = 40
params['alr_drops'] = 4
params['max_epochs'] = 10000

## Execution

Now that the parameters have been configured, run the cell below to start training!

The output will show, in order:

*   Loading of the images and if any classes have been skipped due to too few images.
*   The topology of the network (layers and dimensions).
*   A text-based graph showing the progress of training in real-time.
*   Loss and accuracy graphs.
*   Confusion matrix with precision and recall bar graphs.

Training can take a long time depending on the type of network, size of the dataset and number of ALR epochs.

In [None]:
model, vector_model, datasource, result = train_image_classification_model(params)

***

***


# Results

The results of training are saved in the output folder, under the network name and the date and time of training.

## Saved Files

### Model

The model folder contains two versions of the saved model, plus a description file:

*   `saved_model.pb` and the `variables` directory contain the model saved in Tensorflow saved model format.
*   `frozen_model.pb` is the model that has been frozen and is ready for use in classification programs.
*   `network_info.xml` is an XML description of the frozen model that describes the structure of the model (e.g. which tensors are used for input and output) and the class labels.

### Mislabeled

The *Mislabeled* directory contains some plots showing images that may have been mislabeled when creating the dataset.

It does this by generating a *feature vector* for each image from the output of the second-last dense layer of the network. The vectors are compared using k-NN for each image. If the class of an image predicted using k-NN is different from the label given to the image, it is flagged as possibly mislabeled.

### Downloading

Each file can be downloaded individually from the output folder.

If you want to download all files at once, create a zip of the folder:

*   Locate the folder containing the output files in the *Files* tab.
*   Right-click on it and select *Copy Path*.
*   Paste the path in the cell below and run it.
*   Right-click *output.zip* and select *Download*.

In [None]:
!zip -r /content/output.zip PASTE_PATH_HERE


## Python output

The training function also returns Python variables that can be used for further inference:

*   `model`: The trained Keras model.
*   `vector_model`: Sub-model of the trained model that outputs the feature vector.
*   `datasource`: The training and test images and class labels.
*   `result`: The results of training: accuracy and per-class precision and recall.

Run the cell below to print the attributes of the `result` and `datasource` objects:

In [None]:
def print_attributes(obj, name):
  print("---------------------------------------------------------------------")
  print(name + ":")
  print("---------------------------------------------------------------------")
  print("\n".join([attr for attr in dir(obj) if not attr.startswith('__')]))
  print()
  
print_attributes(result, "result")
print_attributes(datasource, "datasource")