## [River - Python library for online machine learning](https://github.com/online-ml/river)

River is the result of a merger between creme and scikit-multiflow. Its ambition is to be the go-to library for doing machine learning on streaming data.

## Batch Learning vs. Online Learning

### Batch Learning

The basic approach is as follows:

1. Collect some data, i.e. features $X$ and labels $Y$
2. Train a model on $(X, Y)$, i.e. generate a function $f(X) \approx Y$
3. Save the model somewhere
4. Load the model to make predictions
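
The four steps above can be sketched with scikit-learn as follows. This is a minimal illustration rather than a recommended setup: the breast cancer dataset, the scaling plus logistic regression pipeline, and the use of `pickle` for persistence are all assumptions chosen for brevity.

```python
import pickle

from sklearn import datasets, linear_model, pipeline, preprocessing

# 1. Collect some data: features X and labels y (assumed example dataset)
X, y = datasets.load_breast_cancer(return_X_y=True)

# 2. Train a model on (X, y)
model = pipeline.Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("log_reg", linear_model.LogisticRegression()),
])
model.fit(X, y)

# 3. Save the model (here to an in-memory byte string; a file works the same way)
saved = pickle.dumps(model)

# 4. Load the model and make predictions
loaded = pickle.loads(saved)
predictions = loaded.predict(X[:5])
print(predictions)
```

Note that the loaded model is frozen at training time: taking new data into account means going back to step 1.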

Some drawbacks of batch learning are:

- Models have to be retrained from scratch with new data
- Models always "lag" behind
- With increasing data, the computational requirements increase
- Batch models are **static** 
- Some locally developed features are not available in production/real-time

Batch learning remains popular mainly because it is what universities teach, it is the format of most Kaggle competitions, there are **libraries available**, and it can achieve higher accuracy than online learning in a direct comparison.

---

### Online Learning

Video sources: [Max Halford](https://www.youtube.com/watch?v=P3M6dt7bY9U), [Andrew Ng](https://www.youtube.com/watch?v=dnCzy_XKGbA)

Different names for the same thing: **Incremental Learning**, *Sequential Learning*, **Iterative Learning**, *Out-of-core Learning*

Basic features:

- Data comes from a stream, i.e. in sequential order
- Models learn one observation at a time
- Observations do not have to be stored 
- Features and labels are dynamic 
- Models can dynamically adapt to new patterns in the data
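
To illustrate how a quantity can be learned one observation at a time without storing the observations, here is a minimal sketch of a running mean and variance based on Welford's algorithm. The class name `RunningStats` and the sample values are hypothetical, and this is not meant to reproduce River's own implementation.

```python
class RunningStats:
    """Running mean and variance of a stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # current mean
        self._m2 = 0.0    # sum of squared deviations from the current mean

    def update(self, x):
        """Incorporate one observation; the observation itself is not kept."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.n if self.n > 0 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean, stats.variance)
```

Memory use stays constant however long the stream runs, because only three numbers are kept between updates.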

Useful applications include:

- Time series forecasting 
- Spam filters and recommender systems
- IoT
- Basically, **anything event based** 

Algorithmic scheme of online learning:
    <div style="background-color:rgba(0, 0, 0, 0.0670588); padding:5px 0;font-family:monospace;">
    <font color = "red">Forever do</font><br>
    &nbsp;&nbsp;&nbsp;&nbsp; Get $(x,y)$ corresponding to new data.<br>
    &nbsp;&nbsp;&nbsp;&nbsp; Update $\Theta$ using $(x,y)$ with an SGD step:<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\Theta_j := \Theta_j - \gamma \frac{\partial L}{\partial \Theta_j}$.<br>
    </div>
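
The scheme above can be sketched in plain Python. The sketch assumes a linear model with squared loss, a hypothetical synthetic `stream()` generator, and a learning rate $\gamma = 0.1$; none of these come from the original notes, and the "forever" loop is capped so the example terminates.

```python
import itertools
import random


def stream(seed=0):
    """Hypothetical endless data stream: y = 3*x1 - 2*x2 + small noise."""
    rng = random.Random(seed)
    while True:
        x = {"x1": rng.uniform(-1, 1), "x2": rng.uniform(-1, 1)}
        y = 3.0 * x["x1"] - 2.0 * x["x2"] + rng.gauss(0, 0.01)
        yield x, y


theta = {}    # parameters, created lazily as features appear
gamma = 0.1   # learning rate (assumed value)

# "Forever do" is capped at 5000 observations so the example terminates
for x, y in itertools.islice(stream(), 5000):
    # Predict with the current parameters
    y_pred = sum(theta.get(j, 0.0) * xj for j, xj in x.items())
    # For the squared loss L = (y_pred - y)^2 / 2, dL/dtheta_j = (y_pred - y) * x_j
    error = y_pred - y
    for j, xj in x.items():
        theta[j] = theta.get(j, 0.0) - gamma * error * xj

print(theta)
```

Each observation is used once and then discarded, and the parameters drift towards the coefficients used by the generator (3 and -2).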


Major Drawbacks:

- [Catastrophic interference](https://www.wikiwand.com/en/Catastrophic_interference): a neural network abruptly forgets what it has previously learned; the problem was first brought to attention in 1989



## [Catastrophic Interference](https://www.wikiwand.com/en/Catastrophic_interference)

Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information. Catastrophic interference is an important issue to consider when creating connectionist models of memory. It was originally brought to the attention of the scientific community by research from McCloskey and Cohen (1989), and Ratcliff (1990). 

It is a radical manifestation of the 'sensitivity-stability' dilemma or the 'stability-plasticity' dilemma. Specifically, these problems refer to the challenge of making an artificial neural network that is sensitive to, but not disrupted by, new information. Lookup tables and connectionist networks lie on opposite sides of the stability-plasticity spectrum. Lookup tables remain completely stable in the presence of new information but lack the ability to generalize, i.e. infer general principles, from new inputs. On the other hand, connectionist networks like the standard backpropagation network can generalize to unseen inputs, but they are very sensitive to new information. Backpropagation models can be considered good models of human memory insofar as they mirror the human ability to generalize, but these networks often exhibit less stability than human memory. Notably, these backpropagation networks are susceptible to catastrophic interference. This is an issue when modelling human memory because, unlike these networks, humans typically do not show catastrophic forgetting.

The main cause of catastrophic interference seems to be overlap in the representations at the hidden layer of distributed neural networks. In a distributed representation, each input tends to create changes in the weights of many of the nodes. Catastrophic forgetting occurs because when many of the weights where "knowledge is stored" are changed, it is unlikely for prior knowledge to be kept intact. During sequential learning, the inputs become mixed, with the new inputs being superimposed on top of the old ones. Another way to conceptualize this is by visualizing learning as a movement through a weight space. This weight space can be likened to a spatial representation of all of the possible combinations of weights that the network could possess. When a network first learns to represent a set of patterns, it finds a point in the weight space that allows it to recognize all of those patterns. However, when the network then learns a new set of patterns, it will move to a place in the weight space for which the only concern is the recognition of the new patterns. To recognize both sets of patterns, the network must find a place in the weight space suitable for recognizing both the new and the old patterns.
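
The weight-space picture can be illustrated with a deliberately tiny example. The sketch below is an assumption-laden simplification: it uses a two-weight linear model trained with SGD instead of a neural network with a hidden layer, and the two "patterns" and the helper names `sgd` and `loss` are hypothetical. The patterns overlap in their first input feature, so learning the second one drags the shared weight away from where the first pattern needs it.

```python
def sgd(w, samples, steps, lr=0.1):
    """Run `steps` SGD passes over `samples` for a linear model with squared loss."""
    w = list(w)
    for _ in range(steps):
        for x, y in samples:
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w


def loss(w, sample):
    """Squared error of the linear model `w` on a single (x, y) sample."""
    x, y = sample
    return (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2


# Two toy patterns whose inputs overlap in the first feature
pattern_a = ((1.0, 1.0), 1.0)
pattern_b = ((1.0, 0.0), -1.0)

# Sequential learning: first only A, then only B
w_a = sgd([0.0, 0.0], [pattern_a], steps=500)
w_seq = sgd(w_a, [pattern_b], steps=500)

# Interleaved learning: A and B are presented together
w_joint = sgd([0.0, 0.0], [pattern_a, pattern_b], steps=2000)

print("loss on A after learning A:       ", loss(w_a, pattern_a))
print("loss on A after then learning B:  ", loss(w_seq, pattern_a))
print("loss on A after interleaved A + B:", loss(w_joint, pattern_a))
```

In this toy setting a single point in weight space fits both patterns, and the interleaved run finds it. The sequential run instead ends at a point that fits only pattern B, so the error on pattern A rises from nearly zero to about 2.25.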


In [39]:
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn import pipeline
from sklearn import preprocessing


# Load the data
dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

# Define the steps of the model
model = pipeline.Pipeline([
    ('scale', preprocessing.StandardScaler()),
    ('log_reg', linear_model.LogisticRegression(solver='lbfgs'))
])

# Define a deterministic cross-validation procedure
cv = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

# Compute the ROC AUC score on each fold
# (make_scorer passes hard class predictions to roc_auc_score by default)
scorer = metrics.make_scorer(metrics.roc_auc_score)
scores = model_selection.cross_val_score(model, X, y, scoring=scorer, cv=cv)

# Display the average score and its standard deviation
print(f'ROC AUC: {scores.mean():.3f} (± {scores.std():.3f})')

ROC AUC: 0.975 (± 0.011)


In [56]:
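# Iterate over the dataset one observation at a time, as a stream would deliver it
for xi, yi in zip(X, y):
    # Represent the observation as a dict mapping feature names to values
    xi = dict(zip(dataset.feature_names, xi))
    print(xi["mean area"])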

1001.0
1326.0
1203.0
386.1
1297.0
...

In [54]:
xi

{'mean radius': 17.99,
 'mean texture': 10.38,
 'mean perimeter': 122.8,
 'mean area': 1001.0,
 'mean smoothness': 0.1184,
 'mean compactness': 0.2776,
 'mean concavity': 0.3001,
 'mean concave points': 0.1471,
 'mean symmetry': 0.2419,
 'mean fractal dimension': 0.07871,
 'radius error': 1.095,
 'texture error': 0.9053,
 'perimeter error': 8.589,
 'area error': 153.4,
 'smoothness error': 0.006399,
 'compactness error': 0.04904,
 'concavity error': 0.05373,
 'concave points error': 0.01587,
 'symmetry error': 0.03003,
 'fractal dimension error': 0.006193,
 'worst radius': 25.38,
 'worst texture': 17.33,
 'worst perimeter': 184.6,
 'worst area': 2019.0,
 'worst smoothness': 0.1622,
 'worst compactness': 0.6656,
 'worst concavity': 0.7119,
 'worst concave points': 0.2654,
 'worst symmetry': 0.4601,
 'worst fractal dimension': 0.1189}