# Continual Learning

## Introduction

### What is Continual Learning?

Continual learning is a sub-topic of AI focused on techniques to enable a machine to learn adaptively when new inputs are presented over time. Traditional machine learning tasks have focused on making a machine learn by training it on a specific set of data inputs focusing on narrow task domains. If a new class or instance presents itself in the future, then the entire model needs to be completely re-trained. This is not practical in most real-world scenarios where an autonomous agent is acting in real time.

Enabling an agent to reuse and retain previously learned knowledge without completely re-training the model from scratch is difficult, largely because of catastrophic forgetting. Catastrophic forgetting occurs when a model loses what it learned earlier as it is incrementally updated on new data, because the data distribution differs from one batch to the next.

### Approach Used

*Batch level Experience Replay with Review:*

Replay part: at a very high level, Experience Replay stores previously encountered examples and revisits them when learning something new.

Experience Replay stores a subset of the samples from past batches in a buffer. When training on the current batch, it concatenates the incoming batch with another batch of samples retrieved from the buffer. A Stochastic Gradient Descent (SGD) update step is then performed with this combined batch.

In our case, for every epoch a random batch of a given replay size is drawn from memory and concatenated with the current batch, and an SGD update is then performed. The replay size is the key parameter here: it is the quantity we vary in our experiments.

Review part: after training on all the batches, a review batch is randomly drawn from memory and SGD is conducted again to perform a final update to the weights.
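
The replay and review steps described above can be sketched in plain Python as follows. This is a minimal illustration and not the challenge framework's implementation: the `ReplayBuffer` class, the `train_with_replay_and_review` function, the `sgd_step` callback, and the policy of keeping a random subset of each batch are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence


class ReplayBuffer:
    """Hypothetical fixed-capacity memory holding samples from past batches."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.samples: List = []
        self.rng = random.Random(seed)

    def add(self, batch: Sequence, keep: int) -> None:
        # Assumption: keep a random subset of each batch; once the buffer
        # is full, overwrite randomly chosen old samples.
        for sample in self.rng.sample(list(batch), min(keep, len(batch))):
            if len(self.samples) < self.capacity:
                self.samples.append(sample)
            else:
                self.samples[self.rng.randrange(self.capacity)] = sample

    def draw(self, size: int) -> List:
        # Random draw without replacement, capped at the buffer contents.
        return self.rng.sample(self.samples, min(size, len(self.samples)))


def train_with_replay_and_review(
    batches: Sequence[Sequence],
    sgd_step: Callable[[List], None],
    buffer: ReplayBuffer,
    replay_size: int,
    review_size: int,
    epochs: int = 1,
    keep_per_batch: int = 2,
) -> None:
    for batch in batches:
        for _ in range(epochs):
            # Replay: concatenate the current batch with samples from memory.
            combined = list(batch) + buffer.draw(replay_size)
            sgd_step(combined)
        buffer.add(batch, keep=keep_per_batch)
    # Review: one final update on samples drawn only from memory.
    sgd_step(buffer.draw(review_size))


# Usage with a stand-in "SGD step" that only records what it was given.
seen: List[List] = []
batches = [[("a", i) for i in range(4)], [("b", i) for i in range(4)]]
train_with_replay_and_review(
    batches, seen.append, ReplayBuffer(capacity=4),
    replay_size=2, review_size=3,
)
```

In a real run, `sgd_step` would perform a forward pass, loss computation, and optimizer update on the combined batch; here it is left as a callback so that only the replay and review control flow is shown.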

### Acknowledgements

The starting code of this repository is based on the official starting [repository](https://github.com/vlomonaco/cvpr_clvision_challenge) with enhancements by the winning team of the competition - Zheda Mai (University of Toronto), Hyunwoo Kim (LG Sciencepark), Jihwan Jeong (University of Toronto), Scott Sanner (University of Toronto, Vector Institute) - which can be found [here](https://github.com/RaptorMai/CVPR20_CLVision_challenge).

Since the original competition is over, and because we were only investigating the impact on accuracy of varying the number of replay examples (as opposed to the multi-metric nature of the original competition), we eliminated the overhead of creating "submission files" to minimize run durations.

### Platform Architecture

#### Hardware

We ran our experiments on a Linux desktop with an NVIDIA GTX 1660 Ti GPU. The size of the GPU memory guided our decisions in structuring the experiments (e.g., we were only able to extend the number of replay examples to 1.25 times the number used by the winning team) as well as the way in which we created the notebook cells.

#### Software

We leveraged the frameworks provided and modified them to suit our goals of focusing on accuracy measurements. The framework was very well parameterized, allowing setting up runs based on yml files located in the /config directory. The files general_main.py and final_submission.py, while not used directly in our experiments, provide a good reference on the parameter settings. The framework also provided functionality such as image cropping to better isolate the actual object to be recognized, image resizing to allow enlarging the original dataset images to the minimum size (224x224 pixels) recommended for use with the classifier pretrained model, and finally image augmentation that would randomly alter (flip, rotate, change the contrast) the image for better model generalization. We utilized all of these techniques.

### Our Experiments

Our experiments focused on varying the number of replay examples that are randomly drawn from memory. We wanted to see whether increasing the replay size (the number of stored samples concatenated with the current batch) would improve the model's ability to retain what it has learned previously.
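
As a sketch of how the four replay sizes relate to the default, the snippet below scales a base value by each multiplier. The base of 10000 and the resulting sizes correspond to the sample counts reported in the NI results table in the Discussion. The dictionary standing in for a loaded yml file, the `replay_examples` key, and the `"replay"` method value are hypothetical and may not match the framework's actual parameter names.

```python
from types import SimpleNamespace

# Stand-in for a dictionary loaded from a yml config file.
# "replay_examples" and the "replay" method value are hypothetical names
# used only for illustration.
base_params = {"method": "replay", "use_cuda": True, "replay_examples": 10000}

# Multipliers used in our runs: .5x, .75x, default, and 1.25x.
multipliers = [0.5, 0.75, 1.0, 1.25]

configs = []
for m in multipliers:
    params = dict(base_params)  # copy so each run gets its own settings
    params["replay_examples"] = int(base_params["replay_examples"] * m)
    configs.append(SimpleNamespace(**params))

replay_sizes = [cfg.replay_examples for cfg in configs]
print(replay_sizes)  # [5000, 7500, 10000, 12500]
```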

We found that increasing the number of replay samples had only a minimal effect on validation accuracy.

***As a note, since the dataset was distributed as part of a competition, we were limited to testing on the validation set only. We investigated and observed that the validation set was approximately one twentieth the size of the full test set and, given that the competition concluded in early 2020, contacted the event organizers in an attempt to obtain labeled test results for analysis (we have not received a reply as of this writing). Also, while the original competition was based on a composite score using a weighted sum of five metrics - accuracy on the test set, average accuracy on the test set, total training and test runtime, memory usage, and disk usage - we chose to focus on the accuracy metric.***

### ResNet18 vs. DenseNet161_Freeze

We first evaluated the relative performance of two Convolutional Neural Network classifiers - ResNet18 (used in the starting baseline framework) and a customized version of DenseNet161 (which was used by the winners of the competition). Both classifiers employ skip connections between layers to address the vanishing gradient problem associated with deeper networks; DenseNet's dense connectivity additionally strengthens feature [propagation and reuse](https://arxiv.org/pdf/1608.06993.pdf). The results of this experiment led us to use the modified DenseNet161 model as our baseline model for future experiments as that model proved more accurate for all three of the test scenarios.

The modified version of DenseNet161 that we used had the first 3 layers frozen. Freezing these early layers allows the model to leverage the classifier's pretraining on the ImageNet dataset while still extracting features from the images. In addition, the training time is decreased.

### Scenario 1 - New Classes (NC)

The New Classes scenario did not use the replay memory methodology. The 50 different classes are split into 9 batches, and the task label of each batch is provided in this scenario. When a single model is shared across all batches, interference outweighs transfer, so a fresh, independent model is assigned to each batch instead. A text file containing the training labels is provided as part of the data set. Our objective here was to obtain a baseline average validation accuracy and check that our results were compatible with published values for this scenario, basically serving as a sanity check.
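
The one-model-per-batch idea can be sketched as a lookup keyed by task label. This is an illustrative outline only: `TinyModel`, `train_nc`, and `predict_nc` are hypothetical names, the stand-in model simply memorizes the most frequent label instead of training a neural network, and the sketch assumes the task label is also available at prediction time.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


class TinyModel:
    """Stand-in classifier: predicts the most frequent label it was fit on."""

    def __init__(self) -> None:
        self.majority = None

    def fit(self, labels: Sequence[int]) -> None:
        self.majority = Counter(labels).most_common(1)[0][0]

    def predict(self, _x) -> int:
        return self.majority


def train_nc(batches: List[Tuple[int, Sequence[int]]]) -> Dict[int, TinyModel]:
    """Train one fresh, independent model per (task_label, labels) batch."""
    models: Dict[int, TinyModel] = {}
    for task_label, labels in batches:
        model = TinyModel()  # nothing is shared with earlier batches
        model.fit(labels)
        models[task_label] = model
    return models


def predict_nc(models: Dict[int, TinyModel], task_label: int, x) -> int:
    # The task label selects which batch-specific model answers the query.
    return models[task_label].predict(x)


# Two toy batches with disjoint class labels, as in the NC scenario.
nc_batches = [(0, [3, 3, 7]), (1, [12, 12, 15])]
nc_models = train_nc(nc_batches)
```

Because each model only ever sees its own batch, there is no forgetting to mitigate, which is why replay is not needed in this scenario.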

### Scenario 2 - New Instances (NI)

In the NI scenario there are 8 training batches each containing the same 50 classes. No batch labels are provided.

### Scenario 3 - New Instances and Classes (NIC)

In the NIC scenario there are 391 training batches, each containing 300 images of a single class. No batch labels are provided.

### Import required modules for subsequent tests

In [1]:
# Required Imports
import argparse
import time
import torch
from utils.io import load_yaml
from types import SimpleNamespace
from utils.names_match_torch import methods
import os
from utils.common import create_code_snapshot
import numpy as np

### ResNet18 / DenseNet161 comparison
For the comparison we chose to run the NC scenario since it does not use replay, thereby minimizing the run duration.

In [3]:
# Comparison of baseline Resnet and DenseNet classifiers - use NC case to minimize runtime

criterion = torch.nn.CrossEntropyLoss()
print('Running ResNet18:')
params = load_yaml('config/resnet/nc_resnet.yml')
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
resnet_valid_acc, resnet_elapsed, resnet_ram_usage, resnet_ext_mem_sz, resnet_preds = method.train_model(tune=False)
print('ResNet validation accuracy: {:.3f}'.format(sum(resnet_valid_acc)/len(resnet_valid_acc)))
print('\nRunning DenseNet_Freeze')
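params = load_yaml('config/densenet/nc_densenet.yml')
# Rebuild the namespace so the DenseNet settings (not the ResNet ones) are used
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)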
densenet_valid_acc, densenet_elapsed, densenet_ram_usage, densenet_ext_mem_sz, densenet_preds = method.train_model(tune=False)
print('DenseNet validation accuracy: {:.3f}'.format(sum(densenet_valid_acc)/len(densenet_valid_acc)))

Running ResNet18:

Loading data...
Loading paths...
Loading LUP...
Loading labels...
preparing CL benchmark...
----------- batch 0 -------------
x shape: (23980, 128, 128, 3), y shape: (23980,)
Task Label:  0
----------- batch 1 -------------
x shape: (11993, 128, 128, 3), y shape: (11993,)
Task Label:  1
----------- batch 2 -------------
x shape: (11990, 128, 128, 3), y shape: (11990,)
Task Label:  2
----------- batch 3 -------------
x shape: (11993, 128, 128, 3), y shape: (11993,)
Task Label:  3
----------- batch 4 -------------
x shape: (11989, 128, 128, 3), y shape: (11989,)
Task Label:  4
----------- batch 5 -------------
x shape: (11979, 128, 128, 3), y shape: (11979,)
Task Label:  5
----------- batch 6 -------------
x shape: (11990, 128, 128, 3), y shape: (11990,)
Task Label:  6
----------- batch 7 -------------
x shape: (11987, 128, 128, 3), y shape: (11987,)
Task Label:  7
----------- batch 8 -------------
x shape: (11993, 128, 128, 3), y shape: (11993,)
Task Label:  8
Trainin

### Scenario 1 - New Classes

This scenario was run with the DenseNet classifier as part of the previous cell. The results show performance comparable to the ResNet classifier as well as consistency with the published competition values.

### Scenario 2 - New Instances

We needed to perform this trial in two steps because of the previously described GPU memory limitations.

In [2]:
# Run NI and print results - first two replay sample sizes
print('Running DenseNet_Freeze with .5x replay examples:')
params = load_yaml('config/densenet/ni_densenet_50.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
ni50_densenet_valid_acc, ni50_densenet_elapsed, ni50_densenet_ram_usage, ni50_densenet_ext_mem_sz, ni50_densenet_preds = method.train_model(tune=False)
print('NI scenario .5x replay samples validation accuracy: {:.3f}'.format(sum(ni50_densenet_valid_acc)/len(ni50_densenet_valid_acc)))

print('\nRunning DenseNet_Freeze with .75x replay examples:')
params = load_yaml('config/densenet/ni_densenet_75.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
ni75_densenet_valid_acc, ni75_densenet_elapsed, ni75_densenet_ram_usage, ni75_densenet_ext_mem_sz, ni75_densenet_preds = method.train_model(tune=False)
print('NI scenario .75x replay samples validation accuracy: {:.3f}'.format(sum(ni75_densenet_valid_acc)/len(ni75_densenet_valid_acc)))

Running DenseNet_Freeze with .5x replay examples:
Loading paths...
Loading LUP...
Loading labels...
preparing CL benchmark...

Loading data...
------------------------------------------
Batch validation accuracy: 0.779
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.846
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.916
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.933
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.940
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.930
------------------------------------------

Loading data...
-----------------------------------------

In [2]:
# Run NI and print results - second two replay sample sizes
print('Running DenseNet_Freeze with default replay examples:')
params = load_yaml('config/densenet/ni_densenet_default.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
ni_default_densenet_valid_acc, ni_default_densenet_elapsed, ni_default_densenet_ram_usage, ni_default_densenet_ext_mem_sz, ni_default_densenet_preds = method.train_model(tune=False)
print('NI scenario default replay samples validation accuracy: {:.3f}'.format(sum(ni_default_densenet_valid_acc)/len(ni_default_densenet_valid_acc)))

print('\nRunning DenseNet_Freeze with 1.25x replay examples:')
params = load_yaml('config/densenet/ni_densenet_125.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
ni125_densenet_valid_acc, ni125_densenet_elapsed, ni125_densenet_ram_usage, ni125_densenet_ext_mem_sz, ni125_densenet_preds = method.train_model(tune=False)
print('NI scenario 1.25x replay samples validation accuracy: {:.3f}'.format(sum(ni125_densenet_valid_acc)/len(ni125_densenet_valid_acc)))

Running DenseNet_Freeze with default replay examples:
Loading paths...
Loading LUP...
Loading labels...
preparing CL benchmark...

Loading data...
------------------------------------------
Batch validation accuracy: 0.774
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.792
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.927
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.916
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.926
------------------------------------------

Loading data...
------------------------------------------
Batch validation accuracy: 0.929
------------------------------------------

Loading data...
-------------------------------------

### Scenario 3 - New Instances and Classes

***Our intention was to run this experiment, but time limitations prevented us from doing so. We provide the following code to jump-start others wishing to run the experiment.***

Here again we were forced to run the trial in two separate cells due to GPU memory limitations.

In [None]:
# Run NIC and print results - first two replay sample sizes
print('Running DenseNet_Freeze with .5x replay examples:')
params = load_yaml('config/densenet/nic_densenet_50.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
nic50_densenet_valid_acc, nic50_densenet_elapsed, nic50_densenet_ram_usage, nic50_densenet_ext_mem_sz, nic50_densenet_preds = method.train_model(tune=False)
print('NIC scenario .5x replay samples validation accuracy: {:.3f}'.format(sum(nic50_densenet_valid_acc)/len(nic50_densenet_valid_acc)))

print('\nRunning DenseNet_Freeze with .75x replay examples:')
params = load_yaml('config/densenet/nic_densenet_75.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
nic75_densenet_valid_acc, nic75_densenet_elapsed, nic75_densenet_ram_usage, nic75_densenet_ext_mem_sz, nic75_densenet_preds = method.train_model(tune=False)
print('NIC scenario .75x replay samples validation accuracy: {:.3f}'.format(sum(nic75_densenet_valid_acc)/len(nic75_densenet_valid_acc)))

In [None]:
# Run NIC and print results - second two replay sample sizes
print('Running DenseNet_Freeze with default replay examples:')
params = load_yaml('config/densenet/nic_densenet_default.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
nic_default_densenet_valid_acc, nic_default_densenet_elapsed, nic_default_densenet_ram_usage, nic_default_densenet_ext_mem_sz, nic_default_densenet_preds = method.train_model(tune=False)
print('NIC scenario default replay samples validation accuracy: {:.3f}'.format(sum(nic_default_densenet_valid_acc)/len(nic_default_densenet_valid_acc)))

print('\nRunning DenseNet_Freeze with 1.25x replay examples:')
params = load_yaml('config/densenet/nic_densenet_125.yml')
criterion = torch.nn.CrossEntropyLoss()
final_params = SimpleNamespace(**params)
method = methods[final_params.method](final_params, criterion, final_params.use_cuda)
nic125_densenet_valid_acc, nic125_densenet_elapsed, nic125_densenet_ram_usage, nic125_densenet_ext_mem_sz, nic125_densenet_preds = method.train_model(tune=False)
print('NIC scenario 1.25x replay samples validation accuracy: {:.3f}'.format(sum(nic125_densenet_valid_acc)/len(nic125_densenet_valid_acc)))

## Discussion

The results of our experiments are tabulated below:

CLASSIFIER COMPARISON (NC scenario)

| Classifier | Average Validation Accuracy |
| ----------- | ----------- |
| ResNet18 | .535 |
| DenseNet_Freeze | .525 |

As previously stated, the results show comparable performance between the two classifiers as well as consistency with the published competition values.

  
NEW INSTANCES (NI scenario)  

| Samples | Batch0 | Batch1 | Batch2 | Batch3 | Batch4 | Batch5 | Batch6 | Batch7 | Avg Val Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5000 | .779 | .846 | .916 | .933 | .940 | .930 | .908 | .943 | .899 |
| 7500 | .805 | .874 | .924 | .923 | .938 | .945 | .900 | .919 | .903 |
| 10000 | .774 | .792 | .927 | .916 | .926 | .929 | .932 | .919 | .899 |
| 12500 | .765 | .850 | .920 | .919 | .927 | .891 | .920 | .928 | .890 |  

The results display consistency across sample sizes in terms of per-batch accuracy (i.e., Batch 0 shows the lowest accuracy), with the highest average accuracy appearing for the run with 7500 samples. Surprisingly, the lowest average accuracy is associated with the largest number of replay samples, which seems counter-intuitive and is therefore an area for further experimentation.

Note that we report average validation accuracy rather than a final validation accuracy because we are looking for the impact of varying the number of replay samples on catastrophic forgetting. Since forgetting occurs between successive batches, an average over all of them is a more suitable measure than the last value alone.
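
To make the distinction between the two measures concrete, the short example below computes both from the per-batch values reported in the table above for the 12500-sample NI run. It is only an arithmetic illustration; the variable names are our own and are not part of the framework.

```python
# Per-batch validation accuracies for the 12500-sample NI run (from the table).
batch_acc = [0.765, 0.850, 0.920, 0.919, 0.927, 0.891, 0.920, 0.928]

# Average over all batches: reflects dips between successive batches.
average_acc = sum(batch_acc) / len(batch_acc)

# Final value: only reflects the state after the last batch.
final_acc = batch_acc[-1]

print('Average validation accuracy: {:.3f}'.format(average_acc))
print('Final validation accuracy:   {:.3f}'.format(final_acc))
```

The average (.890) is noticeably lower than the final value (.928) because it also counts the weaker early batches and the dip at Batch 5, which the final value alone would hide.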

In summary, our tests yielded values in accord with published results for similar trials and suggest that average accuracy is only slightly sensitive to the number of replay samples used over the range of 5000-12500. Areas for further experimentation include using even larger replay sample sizes (given adequate GPU memory), varying the number of samples used in the review phase of training, as well as varying the number of epochs for both replay and review training.