# Continual Learning

The traditional machine learning approach assumes that all the data is available up front, so a model works well as long as it is trained on the data at hand. In practice, this is often not the case. Researchers have found that artificial neural networks have a tendency to forget: a network can lose previously learned information completely and abruptly upon learning new information, a phenomenon called **catastrophic forgetting**.

When a trained model is faced with an unfamiliar observation or task, it can get confused and give wrong answers. It has therefore been suggested that this static approach be replaced with a more dynamic one, so that machine learning models can adapt to new tasks or experiences.

Continual learning is the idea of training a model on a large number of tasks sequentially without forgetting the knowledge obtained from preceding tasks, where the data from old tasks is no longer available while training on new ones.

A continual learning setting should satisfy the following criteria:

* Data (and tasks) become available only over time.
* There is no access to previously encountered data.
* Computational and memory resources stay constant (efficiency).
* Ever more complex knowledge and skills are developed incrementally (scalability).

There are also other approaches that make machine learning systems more dynamic and adaptive, and they should not be confused with continual learning:

* Multi-Task Learning
* Meta-Learning / Learning to Learn
* Transfer Learning & Domain Adaptation
* Online / Streaming Learning

To sum up, continual learning aims to find a well-generalized model that remembers past concepts while learning new ones. This is not easy, since the model has to work out which biased information should be discarded and which should be preserved. Continual learning also assumes that there will be no access to previously encountered data and tasks. That assumption matters for two reasons. First, in a continuous environment, data that has passed is gone and cannot be observed again. Second, from a resource perspective, it is not possible to store everything the model has encountered. As an example, an average human takes in approximately 50 GB/s while observing the world, which adds up to about 30,240 TB of data after only one week (50 GB/s × 604,800 s).

## Catastrophic Forgetting

Forgetting is not specific to machines; it is part of human nature and of the learning process itself. Forgetting biased information, or some of the information acquired over time, is actually important for effectiveness and efficiency. A model cannot remember everything, and even if it could, remembering things that are not useful for reaching a particular goal would be a waste of resources.

However, the level of forgetting is what matters here. The model should forget what is unnecessary in order to generalize well, yet it should retain the key information in order to 'learn'. The term **catastrophic forgetting** means losing previously learned information completely and abruptly upon learning new information, mostly as a side effect of *gradient descent*. In the traditional approach, the model minimizes the loss **on a given task**, and the parameters that minimize it can be completely different from those that suit another task. Continual learning therefore shifts the standpoint a bit: the model should minimize the loss **over the entire stream of tasks**, denoted $L_s$.
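
One common way to write this objective is sketched below. The notation is an assumption introduced for illustration, not taken from the text above: $T$ tasks in the stream, a data distribution $D_t$ for task $t$, a model $f_\theta$ with parameters $\theta$, and a per-example loss $\ell$.

$$
L_s(\theta) = \sum_{t=1}^{T} \mathbb{E}_{(x, y) \sim D_t}\left[\ell\left(f_\theta(x), y\right)\right]
$$

Training on task $t$ alone only reduces the $t$-th term of this sum, so nothing stops the terms that belong to earlier tasks from growing.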

The interesting point here is this: the model could work perfectly if all the data at hand were fed to it at once, yet when we split the data into smaller pieces and feed them to the model as a stream of tasks, the model starts to forget. It is remarkable how much such a small change affects the performance of the model. In a real-world scenario, however, it is not possible to have all the data at hand at once, which explains both the motivation behind continual learning and the significance of catastrophic forgetting.
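
The contrast between joint and sequential training can be reproduced on a deliberately tiny problem. The following is a minimal sketch, not a benchmark: the two-parameter linear model, the hand-picked inputs `X_a` and `X_b`, the learning rate, and the helper names `mse` and `train` are all assumptions chosen so that the effect is easy to follow. Both tasks are generated by the same true weights, so a single model can fit them perfectly.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.05, steps=500):
    """Plain full-batch gradient descent on the MSE of a single dataset."""
    w = w.copy()
    for _ in range(steps):
        grad = (2.0 / len(y)) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

# Both tasks share the same underlying function (hypothetical toy data).
w_true = np.array([1.0, 1.0])
X_a = np.array([[1.0, 0.0], [2.0, 0.0]])  # task A only varies the first feature
X_b = np.array([[1.0, 1.0], [2.0, 2.0]])  # task B varies both features together
y_a, y_b = X_a @ w_true, X_b @ w_true

w0 = np.zeros(2)

# Joint training: all data is available at once.
w_joint = train(w0, np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))
joint_a, joint_b = mse(w_joint, X_a, y_a), mse(w_joint, X_b, y_b)

# Sequential training: task A first, then task B without access to A's data.
w_after_a = train(w0, X_a, y_a)
a_after_a = mse(w_after_a, X_a, y_a)
w_after_b = train(w_after_a, X_b, y_b)
a_after_b = mse(w_after_b, X_a, y_a)
b_after_b = mse(w_after_b, X_b, y_b)

print(f"joint:      loss A = {joint_a:.4f}, loss B = {joint_b:.4f}")
print(f"after A:    loss A = {a_after_a:.4f}")
print(f"after A, B: loss A = {a_after_b:.4f}, loss B = {b_after_b:.4f}")
```

In this setup the loss on task A is essentially zero right after training on A, and rises to about 0.625 once the model has been trained on B, while joint training keeps both losses near zero. The forgetting here is partial rather than catastrophic because the model is linear and tiny; the sketch only shows the mechanism, namely that gradient steps on the new task move parameters the old task relied on.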

## Dataset Shifts in ML

In traditional offline machine learning, the data and its distribution are fixed, while in a real-world scenario data is dynamic and can change over time. To describe these kinds of changes, researchers have come up with different terms:

* Covariate Shift (shift happens in the inputs): P(X) changes - the same input still returns the same output, but the inputs the model sees are distributed differently.
* Prior Probability Shift (shift happens in the outputs): P(y) changes - the same input still returns the same output, but the frequency (distribution) of the outputs changes.
* Concept Shift (shift happens in the input-output relationship): P(y|X) changes - the same input no longer returns the same output. The model has to adapt itself, which means retraining is necessary.
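
The three kinds of shift can be made concrete with synthetic data. The sketch below is an illustration under stated assumptions: one-dimensional Gaussian inputs, a hypothetical threshold rule `label` standing in for P(y|X), and arbitrary shift sizes. None of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def label(x, threshold=0.0):
    """Hypothetical fixed concept P(y|X): class 1 when x exceeds the threshold."""
    return (x > threshold).astype(int)

# Reference data: inputs centred at 0, labelled by the fixed rule.
x_ref = rng.normal(0.0, 1.0, n)
y_ref = label(x_ref)

# Covariate shift: P(X) moves, the labelling rule stays the same.
x_cov = rng.normal(1.5, 1.0, n)
y_cov = label(x_cov)

# Prior probability shift: P(y) changes, P(X|y) stays the same.
# Resample the reference data so that 90% of the examples are class 1.
idx0 = np.flatnonzero(y_ref == 0)
idx1 = np.flatnonzero(y_ref == 1)
keep = np.concatenate([rng.choice(idx0, n // 10), rng.choice(idx1, 9 * n // 10)])
x_prior, y_prior = x_ref[keep], y_ref[keep]

# Concept shift: P(X) stays the same, P(y|X) changes (the rule is inverted).
x_con = x_ref.copy()
y_con = 1 - label(x_con)

print("mean input, reference vs covariate shift:", x_ref.mean(), x_cov.mean())
print("class-1 rate, reference vs prior shift:  ", y_ref.mean(), y_prior.mean())
print("labels changed under concept shift:      ", (y_con != y_ref).mean())
```

Under the first two shifts every old example is still labelled correctly by the original rule, so old knowledge remains valid. Under concept shift every reference label is contradicted, so a model trained on the reference data has to be retrained.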

The first two are classified as virtual drift, which is the concern of continual learning, since the shift we encounter is just the result of the order (or bias) in which samples or tasks are selected. In the virtual drift case, it is therefore assumed that all previous examples and observations remain valid, and that knowledge needs to be accumulated over time. The third is classified as real drift, which is mostly investigated by online learning and AutoML. To sum up, one should be able to identify the type of shift so that the necessary action can be taken.

<img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/datashift.png?raw=true" width=700>



## Common Assumptions in Continual Learning

* *Shift is only virtual*, which means that forgetting is not needed; accumulating knowledge is enough.
* *No conflicting evidence*, which means that one x value can be valid for only one y, since we are modeling a mathematical function.
* *Unbounded time between two experiences*, which means that the model can train on one experience for as long as needed (for example, for multiple epochs) before the next one arrives.
* *Data processing valid in each experience*, so that data in a given experience can be shuffled and processed freely.
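
The last two assumptions shape what a training loop over a stream can look like. The following is a minimal sketch in plain Python. The names `train_on_stream` and `update` are hypothetical, and `update` stands for whatever learning step the model uses. Each experience is visited once, trained on for several shuffled epochs, and then discarded.

```python
import random

def train_on_stream(stream, update, epochs=3, seed=0):
    """Consume experiences one at a time; earlier data is never revisited."""
    rng = random.Random(seed)
    for experience in stream:
        data = list(experience)   # only the current experience is held in memory
        for _ in range(epochs):   # unbounded time: several passes are allowed
            rng.shuffle(data)     # free processing within the experience
            for x, y in data:
                update(x, y)
        del data                  # no access to this data from now on

# Usage with a stand-in update step that just records the inputs it sees.
seen = []
stream = iter([[(0, "a"), (1, "b")], [(2, "c"), (3, "d")]])
train_on_stream(stream, lambda x, y: seen.append(x), epochs=2)
print(seen)
```

Using an iterator for `stream` mirrors the setting: once an experience has been consumed, the loop has no way to go back to it.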