## [River - Python library for online machine learning](https://github.com/online-ml/river)

River is the result of a merger between creme and scikit-multiflow. Its ambition is to be the go-to library for machine learning on streaming data.

## Batch Learning vs. Online Learning

### Batch Learning

The basic approach is as follows:

1. Collect some data, i.e. features $X$ and labels $Y$
2. Train a model on $(X, Y)$, i.e. generate a function $f(X) \approx Y$
3. Save the model somewhere
4. Load the model to make predictions
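
The four steps can be sketched in plain Python. This is a minimal sketch, assuming a one-dimensional least-squares line as the model and `pickle` as the storage format; the helper names `fit_line` and `predict`, the file name, and the toy data are illustrative assumptions.

```python
import os
import pickle
import tempfile


def fit_line(xs, ys):
    """Closed-form least squares for y ≈ slope * x + intercept (needs the whole batch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    return model["slope"] * x + model["intercept"]


# 1. Collect some data (toy data generated from y = 2x + 1)
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.0, 3.0, 5.0, 7.0, 9.0]

# 2. Train a model on the full batch
model = fit_line(X, Y)

# 3. Save the model somewhere
path = os.path.join(tempfile.gettempdir(), "batch_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# 4. Load the model to make predictions
with open(path, "rb") as f:
    loaded = pickle.load(f)

prediction = predict(loaded, 5.0)
```

Note that `fit_line` has to see all of `X` and `Y` at once; incorporating a new observation means calling it again on the enlarged data set.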

Some drawbacks of batch learning are:

- Models have to be retrained from scratch when new data arrives
- Models always "lag" behind the most recent data
- As the amount of data grows, so do the computational requirements
- Batch models are **static**
- Some features developed locally are not available in production or in real time

Batch learning is popular mainly because it is taught at universities, it is the dominant format of Kaggle competitions, there are **libraries available**, and it may achieve higher accuracy than online learning in a direct comparison.

Available libraries: [sklearn](https://scikit-learn.org/stable/), [pytorch](https://pytorch.org/), [tensorflow](https://tensorflow.org), etc. 

---

### Online Learning

Video sources: [Max Halford](https://www.youtube.com/watch?v=P3M6dt7bY9U), [Andrew Ng](https://www.youtube.com/watch?v=dnCzy_XKGbA)

Literature sources: [Comprehensive Survey](https://arxiv.org/pdf/1802.02871.pdf)

Different names for the same thing: **Incremental Learning**, *Sequential Learning*, **Iterative Learning**, *Out-of-core Learning*

Basic features:

- Data comes from a stream, i.e. in sequential order
- Models learn one observation at a time
- Observations do not have to be stored 
- Features and labels are dynamic 
- Models can dynamically adapt to new patterns in the data
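
To illustrate learning one observation at a time without storing the data, here is a minimal sketch of a running mean and variance computed with Welford's algorithm. The class name `RunningStats` and the toy values are assumptions for illustration, not part of any library.

```python
class RunningStats:
    """Running mean and sample variance, updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until at least two observations were seen
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stands in for a data stream
    stats.update(value)  # each value can be discarded right after this call
```

The memory footprint stays constant (three numbers) no matter how many observations pass through.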

Available libraries: [river](https://github.com/online-ml/river), [vowpal wabbit](https://vowpalwabbit.org/)

Useful applications:

- Time series forecasting 
- Spam filters and recommender systems
- IoT
- Basically, **anything event based** 

Algorithmic scheme of online learning:
    <div style="background-color:rgba(0, 0, 0, 0.0670588); padding:5px 0;font-family:monospace;">
    <font color = "red">Forever do</font><br>
    &nbsp;&nbsp;&nbsp;&nbsp; Get $(x,y)$ corresponding to new data.<br>
    &nbsp;&nbsp;&nbsp;&nbsp; Update $\Theta$ using $(x,y)$ with an SGD step:<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\Theta_j := \Theta_j - \gamma \frac{\partial L}{\partial \Theta_j}$.<br>
    </div>
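
The scheme above can be sketched in plain Python. This is a minimal sketch, assuming a linear model and the squared loss $L = \frac{1}{2}(\hat{y} - y)^2$; the class `OnlineLinearRegression`, the generator `stream`, and the learning rate are hypothetical. The method names `learn_one` and `predict_one` are borrowed from River's naming convention, but the class is not River code.

```python
import random


class OnlineLinearRegression:
    """Hypothetical minimal linear model trained with one SGD step per observation."""

    def __init__(self, learning_rate=0.05):
        self.learning_rate = learning_rate
        self.weights = {}  # one weight per feature name, created on first sight
        self.bias = 0.0

    def predict_one(self, x):
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        error = self.predict_one(x) - y  # dL/dy_hat for L = 0.5 * (y_hat - y)^2
        for k, v in x.items():
            # Theta_j := Theta_j - gamma * dL/dTheta_j, with dL/dTheta_j = error * x_j
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.bias -= self.learning_rate * error


def stream(n, seed=0):
    """Toy stand-in for a data stream: y = 3a - 2b + 0.5, without noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = {"a": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}
        yield x, 3.0 * x["a"] - 2.0 * x["b"] + 0.5


model = OnlineLinearRegression()
for x, y in stream(5000):  # "forever" is truncated to 5000 observations here
    model.learn_one(x, y)
```

Storing the weights in a dictionary keyed by feature name means a feature that first appears late in the stream simply gets a new weight, which matches the idea of dynamic features.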


Major Drawbacks:

- [Catastrophic interference](https://www.wikiwand.com/en/Catastrophic_interference): a neural network abruptly forgets what it has previously learned; the problem was first brought to attention in 1989



## Online Deep Learning (ODL)

Based on this [paper](https://www.ijcai.org/Proceedings/2018/0369.pdf).

The main challenges for DNNs in the online learning setting are

- vanishing gradients: DNNs lose the ability to learn because gradients tend towards 0
- diminishing feature reuse
- saddle points
- immense number of parameters to optimize
- internal covariate shifts
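
The vanishing-gradient effect can be illustrated with a small calculation. This is a minimal sketch, assuming a chain of scalar sigmoid units with all weights fixed to 1; the function names and the chosen depths are illustrative. Because the derivative of the sigmoid never exceeds 0.25, the gradient with respect to the input shrinks at least geometrically with depth in this setting.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, x=0.0, weight=1.0):
    """d(output)/d(input) of `depth` stacked scalar sigmoid units, via the chain rule."""
    grad = 1.0
    activation = x
    for _ in range(depth):
        out = sigmoid(weight * activation)
        grad *= weight * out * (1.0 - out)  # local derivative of one sigmoid unit
        activation = out
    return grad


gradients = {depth: input_gradient(depth) for depth in (1, 2, 5, 10, 20)}
for depth, grad in gradients.items():
    print(f"depth {depth:2d}: gradient {grad:.3e}")
```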

In the paper above, the authors extend the classic DNN such that each hidden layer $h^i$ also acts as an output layer. The final prediction $F_t$ of the model for observation $t$ is then a linear combination of the predictions $f^i(x)$ of the individual layers. For parameter optimization, they introduce so-called **Hedge Backpropagation**. Both are visualized in the paper in the following figure, in which $h^i$ is the respective layer of the DNN (a nonlinear activation function applied to a linear combination of its inputs), $f^i$ is the respective prediction of that layer, and $\alpha_i$ is the respective weight.

![Hedge Backpropagation](HBP.PNG)
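
The weighted combination $F_t(x) = \sum_i \alpha_i f^i(x)$ and a Hedge-style update of the weights $\alpha_i$ can be sketched as follows. This is a minimal sketch, assuming the standard Hedge rule $\alpha_i \leftarrow \alpha_i \beta^{\ell_i}$ followed by normalization, where $\ell_i$ is the loss of layer $i$. The discount factor `beta`, the squared loss, the function names, and the toy numbers are assumptions for illustration; the training of the layer parameters themselves is not reproduced here.

```python
def combine(alphas, layer_predictions):
    """Final prediction: linear combination of the per-layer predictions."""
    return sum(a * f for a, f in zip(alphas, layer_predictions))


def hedge_update(alphas, losses, beta=0.9):
    """Multiplicative Hedge update: layers with a higher loss lose weight."""
    discounted = [a * beta ** loss for a, loss in zip(alphas, losses)]
    total = sum(discounted)
    return [d / total for d in discounted]  # renormalize so the weights sum to 1


alphas = [1 / 3, 1 / 3, 1 / 3]        # start with uniform weights over 3 layers
layer_predictions = [0.9, 0.6, 0.2]   # toy outputs f^i(x) of the three layers
y_true = 1.0

prediction = combine(alphas, layer_predictions)
losses = [(f - y_true) ** 2 for f in layer_predictions]
alphas = hedge_update(alphas, losses)
```

After the update, the layer whose prediction was closest to the label carries the largest weight in the next prediction.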


## [Catastrophic Interference](https://www.wikiwand.com/en/Catastrophic_interference)

Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information. Catastrophic interference is an important issue to consider when creating connectionist models of memory. It was originally brought to the attention of the scientific community by research from McCloskey and Cohen (1989), and Ratcliff (1990). 

It is a radical manifestation of the **'sensitivity-stability' dilemma** or the **'stability-plasticity' dilemma**. Specifically, these problems refer to the challenge of making an artificial neural network that is sensitive to new information but not disrupted by it. Lookup tables and connectionist networks lie on opposite sides of the stability-plasticity spectrum. The former (LuT) remains completely stable in the presence of new information but lacks the ability to generalize, i.e. to infer general principles from new inputs. On the other hand, neural networks like the standard backpropagation network can generalize to unseen inputs, but they are very sensitive to new information. Backpropagation models can be considered good models of human memory insofar as they mirror the human ability to generalize, but these networks often exhibit less stability than human memory. Notably, these backpropagation networks are susceptible to catastrophic interference.

The main cause of catastrophic interference seems to be overlap in the representations at the hidden layer of distributed neural networks. In a distributed representation, each input tends to create changes in the weights of many of the nodes. Catastrophic forgetting occurs because when many of the weights where "knowledge is stored" are changed, it is unlikely for prior knowledge to be kept intact. During sequential learning, the inputs become mixed, with the new inputs being superimposed on top of the old ones.

Another way to conceptualize this is by visualizing learning as a movement through a weight space. This weight space can be likened to a spatial representation of all of the possible combinations of weights that the network could possess. When a network first learns to represent a set of patterns, it finds a point in the weight space that allows it to recognize all of those patterns. However, when the network then learns a new set of patterns, it will move to a place in the weight space for which the only concern is the recognition of the new patterns. To recognize both sets of patterns, the network must find a place in the weight space suitable for recognizing both the new and the old patterns.
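
The weight-space picture can be made concrete with a toy experiment. This is a deliberately simplified sketch: it uses a two-weight linear model rather than a neural network, and the overlap between the two input vectors stands in for overlapping hidden representations. The patterns, learning rate, and function names are assumptions for illustration.

```python
def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]


def sgd_step(w, x, y, lr=0.1):
    """One SGD step on the squared error (prediction - y)^2."""
    error = predict(w, x) - y
    return [w[0] - lr * 2 * error * x[0], w[1] - lr * 2 * error * x[1]]


def squared_error(w, x, y):
    return (predict(w, x) - y) ** 2


# Two patterns whose inputs overlap (both use both weights)
pattern_a = ([1.0, 0.5], 1.0)
pattern_b = ([0.5, 1.0], -1.0)

# Sequential learning: first only pattern A, then only pattern B
w = [0.0, 0.0]
for _ in range(500):
    w = sgd_step(w, *pattern_a)
error_a_before = squared_error(w, *pattern_a)  # A is learned
for _ in range(500):
    w = sgd_step(w, *pattern_b)
error_a_after = squared_error(w, *pattern_a)   # A is largely forgotten
error_b_after = squared_error(w, *pattern_b)   # B is learned

# Interleaved learning: alternate between A and B
w_joint = [0.0, 0.0]
for _ in range(500):
    w_joint = sgd_step(w_joint, *pattern_a)
    w_joint = sgd_step(w_joint, *pattern_b)
joint_error_a = squared_error(w_joint, *pattern_a)
joint_error_b = squared_error(w_joint, *pattern_b)
```

In the sequential run the weights end up at a point that fits pattern B only, and the squared error on pattern A rises from essentially zero to roughly 2.07. The interleaved run ends close to $w = (2, -2)$, a point in weight space that fits both patterns.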
