# 10.01 Online Learning

Several models support online learning, but the technique used to change
the model parameters during learning is (almost) always some variant of
Gradient Descent (GD).
GD is a technique which attempts to find
a minimal model error by walking along the error seen as a function of the model parameters.

![Spruce-Fir Forest Cover](ol-forest-spruce-fir.svg)

<div style="text-align:right;"><sup>ol-forest-spruce-fir.svg</sup></div>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits import mplot3d
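# matplotlib 3.6 renamed the seaborn styles, fall back to the old name on older versions
talk_style = 'seaborn-v0_8-talk'
plt.style.use(talk_style if talk_style in plt.style.available else 'seaborn-talk')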

## Gradient Descent

Imagine a model with two parameters (e.g. a linear regression in 2 dimensions).
For each combination of parameters we have some model error (misclassification
or distance to the regression line).
The shape of this error function is determined by the data to which we are fitting the model.

In offline/single batch learning we fit a model by repeating techniques that find an almost
optimal value for the parameters based on a training set.
We use cross-validation and a test set to achieve some degree of generalization.
If we try to fit this same model to a different set of data we would repeat the entire
technique and end at a reasonable solution for the new dataset.

With online learning we try to optimize the model parameters in a slightly different way.
We initialize the parameters at random and calculate the gradient over this model error
function.
The gradient is:

$$
\nabla E = \frac{\partial E}{\partial w_1}\hat{\imath} + \frac{\partial E}{\partial w_2}\hat{\jmath} + \ldots
$$

i.e. it collects the partial derivatives with respect to each function parameter; in two dimensions
there are only two parameters ($\hat{\imath}$ and $\hat{\jmath}$ are *versors*, i.e. unit vectors along each parameter axis).

![Lodgepole Pine Forest Cover](ol-forest-lodgepole-pine.svg)

<div style="text-align:right;"><sup>ol-forest-lodgepole-pine.svg</sup></div>

The gradient tells us how a function varies around the current model parameters,
and therefore it tells us in which direction better (lower model error) parameters lie:
the gradient points towards increasing error, so we move against it.
Yet, it does not tell us *how far away these parameters lie*.
The *learning rate*, in online learning, scales the distance that we will move in the direction
of lower model error.
After we move to those parameters we look at the gradient
again and repeat the procedure, until convergence (or a maximum number of iterations) is reached.
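
The loop described above can be sketched in a few lines.
This is a minimal illustration, not how `sklearn` implements it:
the `error` and `gradient` functions, the starting point and the learning rate
are all made up for the example.

```python
import numpy as np


def error(w):
    # hypothetical bowl-shaped model error with its minimum at w = (3, -1)
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def gradient(w):
    # analytic partial derivatives of the error above
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])


def gradient_descent(w0, learning_rate=0.1, max_iter=1000, tol=1e-8):
    w = np.asarray(w0, dtype=float)
    for i in range(max_iter):
        step = learning_rate * gradient(w)
        w = w - step  # move against the gradient, towards lower error
        if np.linalg.norm(step) < tol:  # converged: we barely moved
            break
    return w, i + 1


w_final, n_iter = gradient_descent([0.0, 0.0])
print(w_final, n_iter, error(w_final))
```

A larger learning rate takes fewer steps on this surface,
but a value that is too large makes the updates overshoot the minimum and diverge.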

By default `SGDClassifier` in `sklearn` uses a learning rate that is reduced at each iteration
(`learning_rate='optimal'`; the `eta0` parameter is only used by the other schedules).
This allows for a technique called *Stochastic Gradient Descent* (SGD).
In SGD only some of the samples are used at each step to determine the gradient.
The decreasing learning rate allows for convergence despite the fact that not all samples
are used to calculate the gradient.
The default learning rate at step $t$ is $1/(\alpha (t + t_0))$,
where $\alpha$ is the constant multiplying the regularization term
and $t_0$ is chosen by a heuristic.
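
To make the idea concrete, here is a rough sketch of SGD fitting a straight line
one sample at a time with a decreasing learning rate.
The data is synthetic and the `learning_rate` schedule is a hypothetical one chosen
for readability; it is not the exact formula `sklearn` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data from the line y = 2x + 1 plus a little noise
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=200)


def learning_rate(t, eta0=0.1, decay=0.001):
    # hypothetical decreasing schedule, not the exact formula sklearn uses
    return eta0 / (1.0 + decay * t)


w = np.zeros(2)  # slope and intercept
t = 0
for epoch in range(20):
    for i in rng.permutation(len(x)):
        # gradient of half the squared error of a single sample
        residual = w[0] * x[i] + w[1] - y[i]
        grad = np.array([residual * x[i], residual])
        w -= learning_rate(t) * grad
        t += 1

print(w)
```

Each update sees a single noisy sample, so the parameters jitter around the solution;
shrinking the step size is what lets them settle.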

Let's try to visualize this on a surface:

In [None]:
x = np.linspace(0, 10, 100)
y = np.linspace(0, 10, 100)
xx, yy = np.meshgrid(x, y)
z = 3*np.sin(xx) - 1 + np.sin(xx + 6) + np.cos(yy) + np.cos(yy - 0.5) + 0.6*xx
z = 1 - z/20 - 0.5
fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(xx, yy, z, cmap='spring')
ax.set_zlim(0, 1)
ax.set_zlabel('Model Error', fontsize=20, labelpad=10)
ax.set_xlabel('Model Parameter', fontsize=20, labelpad=15)
ax.set_ylabel('Model Parameter', fontsize=20, labelpad=15)
ax.view_init(elev=30., azim=30)

In a real problem the surface will be a high dimensional hypersurface,
since most machine learning models work in very high dimensions.
Yet, the same technique works in any number of dimensions.

The function we try to optimize is called a *cost function*,
and in several cases a cost function may be different from the
actual function we are fitting the model with.
We do not really need to reason about the gradient vector as a whole.
Each parameter in the model is a dimension of the cost function,
and the component of the gradient vector in that same dimension is an indicator of how
the error changes if we change this specific parameter/weight.
In other words the rate of change (derivative) of the error with respect to a parameter
tells us in which direction that specific parameter should be updated to reach a smaller error.
It is often written that instead of the gradient, what is used are
*directional derivatives* in the direction of every parameter/weight.
The directional derivative along a vector $\vec{u}$ is the dot product
of the gradient with $\vec{u}$, where $\theta$ is the angle between the two:

$$
\nabla E \cdot \vec{u} = \| \nabla E \|_2 \| \vec{u} \|_2 \cos \theta
$$
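
A quick numerical check can make the directional derivative less abstract.
The sketch below uses a made-up error function of two parameters and compares
the dot product of the gradient with a unit vector against a finite difference
estimate taken along that same vector.

```python
import numpy as np


def error(w):
    # hypothetical smooth error surface over two parameters
    return np.sin(w[0]) + w[0] * w[1] ** 2


def gradient(w):
    # analytic partial derivatives of the error above
    return np.array([np.cos(w[0]) + w[1] ** 2, 2.0 * w[0] * w[1]])


w = np.array([0.5, 1.5])
u = np.array([1.0, 0.0])  # unit vector along the first parameter

# directional derivative from the gradient: dot product with u
from_gradient = gradient(w) @ u

# the same quantity estimated by a central finite difference along u
h = 1e-6
from_difference = (error(w + h * u) - error(w - h * u)) / (2 * h)

print(from_gradient, from_difference)
```

Because `u` points along the first parameter axis,
the result is simply the first component of the gradient.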

For example the `Ridge` regression added an extra term to the function
we tried to optimize, and that extra term allowed for a better fit.
Gradient descent optimizes the cost function but the model itself
predicts values based on the actual fitted function (from the model
parameters only).
In other words, we have the function (e.g. a classifier) we are trying to fit,
and to find it we build and optimize a cost function,
which is a different function.
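
The distinction can be sketched with a ridge-like cost.
Everything below is illustrative: the data is synthetic, the function names are made up,
and the plain gradient descent loop is not what `Ridge` does internally.
Note how `alpha` appears in the cost and its gradient but never in `predict`.

```python
import numpy as np


def predict(X, w):
    # the fitted function: uses the model parameters only
    return X @ w


def ridge_cost(X, y, w, alpha):
    # the cost function: squared error plus the L2 penalty on the weights
    residual = predict(X, w) - y
    return residual @ residual + alpha * (w @ w)


def ridge_cost_gradient(X, y, w, alpha):
    return 2.0 * X.T @ (predict(X, w) - y) + 2.0 * alpha * w


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=50)

w = np.zeros(2)
for _ in range(500):
    w -= 0.005 * ridge_cost_gradient(X, y, w, alpha=1.0)

print('weights:', w)
print('cost:', ridge_cost(X, y, w, alpha=1.0))
print('prediction for [1, 1]:', predict(np.array([[1.0, 1.0]]), w))
```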

In `sklearn` we have `SGDClassifier` which will perform the technique
above to achieve online learning on top of linear SVMs, logistic regression, or a perceptron.
The `SGDRegressor` performs a linear regression as online learning.
Note that this means that we can only find solutions to problems
that can be approximated linearly.
For non-linear online learning we need neural networks (which we will see soon).

![Gradient Direction](ol-gradient-direction.svg)

<div style="text-align:right;"><sup>ol-gradient-direction.svg</sup></div>

## Forest Cover Type Dataset

For a change let's take on a dataset that is not bundled inside `sklearn`.
The forest cover dataset contains cartographic data about types of forests
in the Roosevelt National Forest in Colorado.
First let's define a couple of details about the features of the dataset.
It has several continuous features and then several categorical ones.
The categorical features are already one-hot-encoded for us.

In [None]:
continuous = [
    'Elevation',
    'Aspect',
    'Slope',
    'HHydro',  # Horizontal Distance to Hydrology
    'VHydro',  # Vertical Distance to Hydrology
    'Road',    # Horizontal Distance to Roadways
    'Shade_9am',
    'Shade_Noon',
    'Shade_3pm',
    'Fire',    # Horizontal Distance to Fire Points
]
categorical = [
    'wild=1',  # Rawah Wilderness Area
    'wild=2',  # Neota Wilderness Area
    'wild=3',  # Comanche Peak Wilderness Area
    'wild=4',  # Cache la Poudre Wilderness Area
    'soil=1','soil=2','soil=3','soil=4','soil=5','soil=6','soil=7','soil=8','soil=9','soil=10',
    'soil=11','soil=12','soil=13','soil=14','soil=15','soil=16','soil=17','soil=18','soil=19','soil=20',
    'soil=21','soil=22','soil=23','soil=24','soil=25','soil=26','soil=27','soil=28','soil=29','soil=30',
    'soil=31','soil=32','soil=33','soil=34','soil=35','soil=36','soil=37','soil=38','soil=39','soil=40',
]
columns = continuous + categorical + ['label']
target_names = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
                'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']

![Ponderosa Pine Forest Cover](ol-forest-ponderosa-pine.svg)

<div style="text-align:right;"><sup>ol-forest-ponderosa-pine.svg</sup></div>

Based on the features we can then classify an area of forest cover into
one of the seven classes (targets/labels).
The set has more than half a million rows of data,
it is a reasonably sized dataset.

To keep with the spirit of what we have been doing until now we will
write a function to actually retrieve the dataset.
We will duplicate the `sklearn` convention for dataset loading and construct
our `load_cover_type` function.
The function not only allows for easy download of the dataset
but also caches the downloaded data on the filesystem,
so one does not need to download it the next time.
A couple of things are worth mentioning.
The dataset is taken from the
*University of California Irvine Machine Learning Repository*
and is kept within their archives gzip-compressed.
Also, the labels in the dataset start from $1$,
so we adjust the labels to start from $0$ during the dataset load.
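
Decompressing gzip data held in memory with `zlib` needs a specific `wbits` value.
The following sketch shows the round trip on a tiny made-up payload
(the bytes below are not rows from the real dataset)
and checks that it matches what the `gzip` module produces.

```python
import gzip
import zlib

# made-up in-memory payload standing in for the downloaded archive
raw = b'1,2,3\n4,5,6\n'
archive = gzip.compress(raw)

# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
via_zlib = zlib.decompress(archive, wbits=16 + zlib.MAX_WBITS)
via_gzip = gzip.decompress(archive)

print(via_zlib == raw, via_gzip == raw)
```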

In [None]:
import os
import sys
import zlib
import requests
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.utils import Bunch


def load_cover_type():
    cov_dir = 'uci_cover_type'
    data_dir = datasets.get_data_home()
    data_path = os.path.join(data_dir, cov_dir, 'covtype.data')
    descr_path = os.path.join(data_dir, cov_dir, 'covtype.info')
    cov_data_gz = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'
    cov_descr = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info'
    os.makedirs(os.path.join(data_dir, cov_dir), exist_ok=True)
    try:
        with open(descr_path, 'r') as f:
            descr = f.read()
    except IOError:
        print('Downloading file from', cov_descr, file=sys.stderr)
        r = requests.get(cov_descr)
        r.raise_for_status()  # do not cache an HTTP error page as the description
        with open(descr_path, 'w') as f:
            f.write(r.text)
        descr = r.text
        r.close()
    try:
        data = pd.read_csv(data_path, delimiter=',', names=columns)
    except IOError:
        print('Downloading file from', cov_data_gz, file=sys.stderr)
        r = requests.get(cov_data_gz)
        r.raise_for_status()  # fail early instead of decompressing an error page
        # wbits=16+MAX_WBITS makes zlib accept the gzip header and trailer
        cov_data = zlib.decompress(r.content, wbits=16+zlib.MAX_WBITS)
        cov_data = cov_data.decode('utf8')
        with open(data_path, 'w') as f:
            f.write(cov_data)
        r.close()
        data = pd.read_csv(data_path, delimiter=',', names=columns)
    X = data[continuous + categorical].values
    y = data['label'].values - 1
    return Bunch(DESCR=descr,
                 data=X,
                 feature_names=columns[:-1],
                 feature_continuous=continuous,
                 feature_categorical=categorical,
                 target=y,
                 target_names=target_names)


covtype = load_cover_type()
print(covtype.DESCR)

A quick look at the dataset is always a good idea.
One can see all three parts of the set:
the continuous features, the one-hot-encoded categorical features,
and the forest cover type labels.

In the loading function we have also added a distinction between
continuous and categorical features.
This concept is often useful as one may need to scale continuous
features but scaling one-hot-encoded features makes little sense.
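
One possible way to scale only the continuous columns is `ColumnTransformer`.
The sketch below is an alternative shown on a tiny made-up array;
the column indices and values are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# made-up rows: two continuous columns followed by a one-hot-encoded pair
X_demo = np.array([
    [2500.0, 10.0, 1.0, 0.0],
    [3000.0, 20.0, 0.0, 1.0],
    [3500.0, 30.0, 1.0, 0.0],
])
continuous_idx = [0, 1]

scaler = ColumnTransformer(
    [('scale', StandardScaler(), continuous_idx)],
    remainder='passthrough',  # leave the one-hot columns untouched
)
X_scaled = scaler.fit_transform(X_demo)
print(X_scaled)
```

The transformed columns come first in the output and the passed-through columns follow,
which here matches the original column order.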

In [None]:
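import pandas as pd

# the loader cached the file under the sklearn data home, not the working directory
data_path = os.path.join(datasets.get_data_home(), 'uci_cover_type', 'covtype.data')
df = pd.read_csv(data_path, delimiter=',', names=columns)
df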

![Cottonwood Forest Cover](ol-forest-cottonwood.svg)

<div style="text-align:right;"><sup>ol-forest-cottonwood.svg</sup></div>

Half a million rows is a decent-sized dataset.
It perhaps does not require online learning on most machines
but on some it might.

That said, for presentation purposes it may take too long
to run our code on the full dataset.
Instead we will take two forest types: Ponderosa Pine and Douglas-fir,
and use only the part of the dataset with these labels.
Note how we change the labels to be $0$ and $1$
for a classification between only two forest types.

In [None]:
X = covtype.data
y = covtype.target
X = X[(y == 2) | (y == 5)]
y = y[(y == 2) | (y == 5)]
y[y == 2] = 0  # Ponderosa Pine
y[y == 5] = 1  # Douglas-fir
df = pd.DataFrame(X, columns=covtype.feature_names)
df

![Willow Forest Cover](ol-forest-willow.svg)

<div style="text-align:right;"><sup>ol-forest-willow.svg</sup></div>

Looking at the data we can easily see that the continuous features
have very distinct value ranges,
and therefore will require scaling.

We have the columns that are continuous in an attribute
of the loaded dataset.
If we now scale only those columns and place them back together
with the categorical columns we have a dataset we can work with.

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_cont = sc.fit_transform(df[covtype.feature_continuous].values)
X_cat = df[covtype.feature_categorical].values
X = np.c_[X_cont, X_cat]
X.shape

We have a real dataset, we should treat it as a real problem.
We take out a test set which we will not touch.

In [None]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)

And train on the training set with cross-validation.

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

model = make_pipeline(
    PCA(n_components=20),
    # loss='log_loss' is logistic regression (it was named 'log' before scikit-learn 1.1)
    SGDClassifier(loss='log_loss', penalty='l1', max_iter=500, alpha=0.01, tol=0.01))
param_grid = {
    'sgdclassifier__alpha': [0.001, 0.01, 0.1],
    'sgdclassifier__tol': [0.01, 0.1],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(xtrain, ytrain)
grid.best_score_

![Aspen Forest Cover](ol-forest-aspen.svg)

<div style="text-align:right;"><sup>ol-forest-aspen.svg</sup></div>

Some of the models during grid search may not converge.
That's completely fine, we root them out through the cross-validation.
The best model is likely to be one that did converge.

In [None]:
grid.best_estimator_

Finally evaluate on the test set.

In [None]:
from sklearn.metrics import classification_report

names = ['Ponderosa Pine', 'Douglas-fir']
yfit = grid.best_estimator_.predict(xtest)
print(classification_report(ytest, yfit, target_names=names))

![Douglas-Fir Forest Cover](ol-forest-douglas-fir.svg)

<div style="text-align:right;"><sup>ol-forest-douglas-fir.svg</sup></div>

Hey!  That wasn't online learning.

Right, it was not.
The full dataset may need online learning
but our sample of only two forest types fits fine in memory.
Yet we can simulate a dataset that is too big
by splitting it into two sets of data and training SGD in an online fashion.
We will *misuse* the `train_test_split` function for this.

In [None]:
cov1, cov2, ycov1, ycov2 = train_test_split(X, y, test_size=0.5)

In `sklearn` there are two ways of using online learning.
One is to use a method called `partial_fit`, instead of `fit`,
which will update parameters instead of fitting completely new ones.
Another way to enable online learning is to pass `warm_start=True`,
which makes `fit` start from the previously learned parameters instead of discarding them.

Both methods only work on models that inherently support online learning.
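
For reference, a minimal `partial_fit` loop might look like the sketch below.
It runs on synthetic, well separated data fed in chunks of 100 rows;
the chunk size and the data are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# synthetic stand-in for a dataset that arrives in chunks
y_demo = rng.integers(0, 2, size=1000)
X_demo = rng.normal(size=(1000, 2)) + 3.0 * (2 * y_demo[:, None] - 1)

clf = SGDClassifier(loss='hinge', random_state=0)
classes = np.unique(y_demo)  # partial_fit needs every class label on the first call
for start in range(0, 800, 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X_demo[batch], y_demo[batch], classes=classes)

# evaluate on the last 200 rows, which the model never saw
accuracy = clf.score(X_demo[800:], y_demo[800:])
print(accuracy)
```

Each call to `partial_fit` makes a single pass over the chunk it is given,
so the loop above never needs the whole dataset in memory at once.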

In [None]:
xtrain1, xtest1, ytrain1, ytest1 = train_test_split(cov1, ycov1, test_size=0.2)
model = make_pipeline(
    PCA(n_components=10),
    SGDClassifier(loss='hinge', penalty='l1', max_iter=200, alpha=0.001, tol=0.01, warm_start=True))
model.fit(xtrain1, ytrain1)
yfit = model.predict(xtest1)
print(classification_report(ytest1, yfit, target_names=names))

We know about half of the data and we can, more-or-less, classify that.
But if we try to classify the data we do not know about we may run into trouble.

In [None]:
xtrain2, xtest2, ytrain2, ytest2 = train_test_split(cov2, ycov2, test_size=0.2)
yfit = model.predict(xtest2)
print(classification_report(ytest2, yfit, target_names=names))

We can train with some data from the second dataset and see if things improve.

In [None]:
model.fit(xtrain2, ytrain2)
yfit = model.predict(xtest2)
print(classification_report(ytest2, yfit, target_names=names))

![Krummholz Forest Cover](ol-forest-krummholz.svg)

<div style="text-align:right;"><sup>ol-forest-krummholz.svg</sup></div>

## Extra: Non-GD Optimisation

We saw SGD and said that it is the most often used optimization technique.
But what are the others?
One technique is **simulated annealing** which works by slow cooling.
In summary: simulated annealing tries random neighbors at each iteration and keeps
track of the point with the lowest value of the cost function.
The search space for a new neighbor (i.e. the maximum distance from the lowest point
found until now) reduces at each iteration.
This is similar to SGD with a decreasing learning rate.
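
A rough sketch of simulated annealing follows.
Besides the shrinking neighborhood described above it includes the usual acceptance rule,
where a worse point is sometimes accepted while the temperature is high,
and it draws neighbors around the current point while remembering the best point seen.
The cost function, the cooling factor and all other values are made up for illustration.

```python
import math
import random


def cost(w):
    # hypothetical bumpy cost: global minimum 0 at w = (1, -2), local minima nearby
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2 + 1.0 - math.cos(5.0 * (w[0] - 1.0))


def simulated_annealing(cost, start, n_iter=5000, radius=2.0,
                        temperature=1.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    current = list(start)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for _ in range(n_iter):
        # random neighbor within the current search radius
        candidate = [c + rng.uniform(-radius, radius) for c in current]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # always accept improvements, sometimes accept worse points while hot
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
        # slow cooling: lower the temperature and shrink the neighborhood
        temperature *= cooling
        radius *= cooling
    return best, best_cost


start = [-3.0, 3.0]
best, best_cost = simulated_annealing(cost, start)
print(best, best_cost)
```

No gradient is computed anywhere, which is why the technique also applies
to cost functions that are not differentiable.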

But there are more techniques.
Notably **swarm intelligence** provides us with several optimization algorithms:

- particle swarm
- bat swarm
- cuckoo search

And **genetic algorithms** also work reasonably in an online learning scenario.

## References

[UCI - Forest Cover Type Dataset][1]

[1]: https://archive.ics.uci.edu/ml/datasets/Covertype "UCI Forest Cover Type"