# How to use learners

In CNTK, learners are implementations of gradient-based optimization algorithms. CNTK automatically computes the gradient of your criterion/loss with respect to each learnable parameter, but how this gradient is combined with the current parameter value to produce a new parameter value is left to the learner.

CNTK provides three ways to define your learner, which we describe in detail in this notebook. You can
- use a builtin learner. Builtin learners are very fast.
- define your learner as a CNTK expression. This is not as fast as the builtin learners but is more flexible.
- define your learner as a Python class. This is even more flexible but even slower.

Here's a "hello world" example for learners.

In [1]:
import cntk as C
import numpy as np
import math
np.set_printoptions(precision=4)

features = C.input_variable(3)
label = C.input_variable(2)
z = C.layers.Sequential([C.layers.Dense(4, activation=C.relu), C.layers.Dense(2)])(features)

lr_schedule_m = C.learning_rate_schedule(0.5, C.UnitType.minibatch)
lr_schedule_s = C.learning_rate_schedule(0.5, C.UnitType.sample)

sgd_learner_m = C.sgd(z.parameters, lr_schedule_m)
sgd_learner_s = C.sgd(z.parameters, lr_schedule_s)

We have created two learners here. When creating a learner we have to specify a learning rate schedule, which can be as simple as specifying a single number (0.5 in this example) or it can be a list of learning rates that specify what the learning rate should be at different points in time. 

Currently, the best results with deep learning are obtained by having a small number of *phases*, where the learning rate is fixed inside each phase and decays by a constant factor when moving from phase `i` to phase `i+1`. We will come back to this point later.

The second parameter in the learning rate schedule can take one of two values:
- Per minibatch
- Per sample

To understand the difference and get familiar with the learner properties and methods, let's write a small function that inspects the effect of a learner on the parameters assuming the parameters are all 0 and the gradients are all 1.

In [2]:
def inspect_update(learner, mbsize, count=1):
    # Save current parameter values
    old_values = [p.value for p in learner.parameters]
    # Set current parameter values to all 0
    for p in learner.parameters:
        p.value = 0 * p.value
    # create all 1 gradients and associate them with the parameters
    updates = {p: p.value + 1 for p in learner.parameters}    
    # do 'count' many updates
    for i in range(count):
        learner.update(updates, mbsize)
    ret_values = [p.value for p in learner.parameters]
    # Restore values
    for p, o in zip(learner.parameters, old_values):
        p.value = o
    return ret_values

print('\nunit = sample\n', inspect_update(sgd_learner_s, mbsize=2))
print('\nunit = minibatch\n', inspect_update(sgd_learner_m, mbsize=2))


unit = sample
 [array([[-0.5, -0.5],
       [-0.5, -0.5],
       [-0.5, -0.5],
       [-0.5, -0.5]], dtype=float32), array([-0.5, -0.5], dtype=float32), array([[-0.5, -0.5, -0.5, -0.5],
       [-0.5, -0.5, -0.5, -0.5],
       [-0.5, -0.5, -0.5, -0.5]], dtype=float32), array([-0.5, -0.5, -0.5, -0.5], dtype=float32)]

unit = minibatch
 [array([[-0.25, -0.25],
       [-0.25, -0.25],
       [-0.25, -0.25],
       [-0.25, -0.25]], dtype=float32), array([-0.25, -0.25], dtype=float32), array([[-0.25, -0.25, -0.25, -0.25],
       [-0.25, -0.25, -0.25, -0.25],
       [-0.25, -0.25, -0.25, -0.25]], dtype=float32), array([-0.25, -0.25, -0.25, -0.25], dtype=float32)]


With the knowledge that SGD is the update `parameter = old_parameter - learning_rate * gradient`, we can conclude that when the learning rate schedule is per minibatch, the learning rate is divided by the minibatch size, whereas when the learning rate schedule is per sample, the learning rate is not divided by the minibatch size. CNTK offers both options because in some setups it is more convenient to work with per sample learning rates than per minibatch learning rates and vice versa. For example, per minibatch learning rate schedules typically don't require retuning when you want to change the minibatch size, but per sample schedules do. On the other hand, with distributed training it is more correct to specify the learning rate schedule as per sample rather than per minibatch.
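
The arithmetic behind the two outputs above can be reproduced in plain Python. The sketch below is not CNTK code: `sgd_step` and its arguments are hypothetical names chosen for illustration. It assumes, as in `inspect_update`, that the parameter starts at 0 and that the gradient handed to the learner is 1.

```python
def sgd_step(param, gradient, learning_rate, minibatch_size, per_sample):
    """One SGD update: parameter = old_parameter - effective_rate * gradient."""
    # A per-sample rate is used as is; a per-minibatch rate is divided
    # by the number of samples in the minibatch, as observed above.
    effective_rate = learning_rate if per_sample else learning_rate / minibatch_size
    return param - effective_rate * gradient


per_sample_result = sgd_step(0.0, 1.0, 0.5, minibatch_size=2, per_sample=True)
per_minibatch_result = sgd_step(0.0, 1.0, 0.5, minibatch_size=2, per_sample=False)

print(per_sample_result)     # -0.5
print(per_minibatch_result)  # -0.25
```

These are the same two values that `inspect_update` printed for `sgd_learner_s` and `sgd_learner_m` with `mbsize=2`.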

Calling update manually on the learner (as `inspect_update` does) is very tedious. Besides, you need to arrange for the gradients to be computed and provided to the learner. Fortunately, if you use a **`Trainer`**, you don't have to do any of that. 
Outside this document you will never see any examples manually calling update on the learner!  

## Trainers and Learners

A closely related class to the `Learner` is the `Trainer`. In CNTK a `Trainer` brings together all the ingredients necessary for training models:
- the model itself
- the loss function (a differentiable function) and the evaluation function which is not necessarily differentiable (such as error rate)
- the learners
- optionally progress writers that log the training progress

While in the most typical case a `Trainer` has a single learner that handles all the parameters, it is possible to have **multiple learners** each working on a different subset of the parameters. Parameters that are not covered by any learner **will not** be updated.  Here is an example that illustrates typical use.


In [3]:
lr_schedule = C.learning_rate_schedule([0.05]*3 + [0.025]*2 + [0.0125], C.UnitType.minibatch, epoch_size=100)
sgd_learner = C.sgd(z.parameters, lr_schedule)
loss = C.cross_entropy_with_softmax(z, label)
trainer = C.Trainer(z, loss, sgd_learner)
# use the trainer with a minibatch source as in the trainer howto

The trainer will compute the gradients of `loss` with respect to the parameters of `z` and call the `sgd_learner`'s `update` method, as we did manually in the `inspect_update` function earlier. Here we have specified a learning rate schedule that is 0.05 for the first 300 samples (3 times the epoch size, which is counted in samples), then drops to 0.025 for the next 200 samples, and is 0.0125 from then on until the end of training. This kind of functionality is quite common in tuning neural networks, and it is the reason why in some papers (such as the [ResNet paper](https://arxiv.org/abs/1512.03385)) we see learning curves like this:
![resnet](resnet.png)

What is happening here is that the learning rate gets reduced by a factor of 10 after 150000 and 300000 updates (cf. section 3.4 of the paper).
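
To make the idea of phases concrete, here is a minimal sketch in plain Python of how such a piecewise-constant schedule can be built and looked up. The helpers `build_phases` and `rate_at` are hypothetical and are not part of CNTK. The sketch assumes that progress is counted in the same unit as the epoch size and that the last rate stays in effect once the list is exhausted.

```python
def build_phases(initial_rate, decay_factor, epochs_per_phase):
    """Return one learning rate per epoch, decayed by a constant factor per phase."""
    rates = []
    for phase, n_epochs in enumerate(epochs_per_phase):
        rates.extend([initial_rate * decay_factor ** phase] * n_epochs)
    return rates


def rate_at(rates, epoch_size, progress):
    """Look up the rate in effect at `progress`, counted in the unit of `epoch_size`."""
    # Assumption: the last entry is reused after the list runs out.
    index = min(progress // epoch_size, len(rates) - 1)
    return rates[index]


# Three epochs at 0.05, two at 0.025, then 0.0125 for the rest of training.
rates = build_phases(0.05, 0.5, [3, 2, 1])
print(rates)                      # [0.05, 0.05, 0.05, 0.025, 0.025, 0.0125]
print(rate_at(rates, 100, 299))   # 0.05
print(rate_at(rates, 100, 300))   # 0.025
print(rate_at(rates, 100, 5000))  # 0.0125
```

The list produced by `build_phases` has the same form as the first argument passed to `learning_rate_schedule` in the cell above.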

Finally the `cntk.Function.train` method provides convenience functionality which allows you to specify the learner and the data and it internally creates a trainer that drives the training loop.

## Other builtin learners

Apart from SGD, other builtin learners include 
- SGD with momentum (`momentum_sgd()`)
- SGD with Nesterov momentum (`nesterov()`) first popularized in deep learning by [this paper](http://proceedings.mlr.press/v28/sutskever13.html)
- Adagrad (`adagrad()`) first popularized in deep learning by [this paper](https://research.google.com/archive/large_deep_networks_nips2012.html) 
- RMSProp (`rmsprop()`) a correction to Adagrad that prevents the learning rate from decaying too fast
- FSAdagrad (`fsadagrad()`) adds momentum and bias correction to RMSProp
- Adam / Adamax (`adam(..., adamax=False/True)`) see [this paper](https://arxiv.org/abs/1412.6980)
- Adadelta (`adadelta()`) see [this paper](https://arxiv.org/abs/1212.5701)

### Momentum

Among these learners, `momentum_sgd`, `nesterov`, `fsadagrad`, and `adam` take an additional momentum schedule. 

Momentum means that instead of updating the parameter using only the current gradient, we update it using an exponentially decayed combination of all previous gradients. If there is a consistent direction that the gradients are pointing to, the parameter updates will develop momentum in that direction. [This page](http://distill.pub/2017/momentum/) has a good explanation of momentum.

Like the learning rate schedule, the momentum schedule can be specified in two equivalent ways:
- `momentum_schedule(float or list of floats, epoch_size)`
- `momentum_as_time_constant_schedule(float or list of floats, epoch_size)`

As with `learning_rate_schedule`, the arguments are interpreted in the same way, i.e. there is flexibility in specifying different momentum values for the first few minibatches and for later minibatches.

The difference between the two calls is just a simple transformation which can be explained as follows. Since momentum is creating a sort of exponential moving average it is fair to ask "when does the contribution of an old gradient diminish by a certain constant factor?". If we choose the constant factor to be $0.5$ we call this the [half-life](https://en.wikipedia.org/wiki/Half-life) and if we choose the constant to be $e^{-1}\approx 0.368$ we call this the [time constant](https://en.wikipedia.org/wiki/Time_constant). So `momentum_as_time_constant_schedule` specifies the number of samples it would take for a gradient to decay to $0.368$ of its original contribution on the momentum term. Specifying a `momentum_as_time_constant_schedule(300)` is a little bit more meaningful than specifying `momentum_schedule(.967...)` even though both lead to the same updates. The way to convert between the two schedules is
- $\textrm{momentum} = \exp(-\frac{\textrm{minibatch_size}}{\textrm{time_constant}})$
- $\textrm{time_constant} = \frac{\textrm{minibatch_size}}{\log(1/\textrm{momentum})}$
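
The conversion above, and the analogous half-life, can be checked with a few lines of plain Python. This is a sketch using only the standard library. The two helper functions are hypothetical names, and the half-life formula is derived by the same reasoning as the time constant, with $0.5$ in place of $e^{-1}$.

```python
import math


def time_constant_from_momentum(momentum, minibatch_size):
    """Samples needed for a gradient's contribution to decay to exp(-1)."""
    return minibatch_size / math.log(1.0 / momentum)


def half_life_from_momentum(momentum, minibatch_size):
    """Samples needed for a gradient's contribution to decay to 0.5."""
    # Solve momentum ** (n / minibatch_size) = 0.5 for the sample count n.
    return minibatch_size * math.log(2.0) / math.log(1.0 / momentum)


minibatch_size = 10
momentum = math.exp(-minibatch_size / 300)

time_constant = time_constant_from_momentum(momentum, minibatch_size)
half_life = half_life_from_momentum(momentum, minibatch_size)

print(round(time_constant, 6))  # 300.0
print(round(half_life, 2))      # 207.94
```

With a minibatch size of 10, a momentum of about 0.967 therefore corresponds to a time constant of 300 samples and a half-life of roughly 208 samples.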

Apart from the momentum schedule, the momentum learners can also take a boolean `unit_gain` argument that determines the form of the momentum update:
- `unit_gain=True`: $\textrm{momentum_direction} = \textrm{momentum} \cdot \textrm{old_momentum_direction} + (1 - \textrm{momentum}) \cdot \textrm{gradient}$
- `unit_gain=False`: $\textrm{momentum_direction} = \textrm{momentum} \cdot \textrm{old_momentum_direction} + \textrm{gradient}$

The idea behind the non-conventional `unit_gain=True` is that when the momentum and/or the learning rate changes, this way of updating does not lead to divergence.

The following code illustrates that for the case of `unit_gain=False`, the two ways of specifying momentum (as time constant or not) are equivalent. It also shows that when `unit_gain=True` you need to scale your learning rate by $1/(1-\textrm{momentum})$ to match the `unit_gain=False` case.

In [4]:
mb_size = 10
time_constant = 300
momentum = math.exp(-mb_size/time_constant)

print('time constant for momentum of 0.967... = ', mb_size/math.log(1/momentum))
print('momentum for time constant of 300      = ', math.exp(-mb_size/time_constant))

lr_schedule = C.learning_rate_schedule(1, C.UnitType.minibatch)
ug_schedule = C.learning_rate_schedule(1/(1-momentum), C.UnitType.minibatch)

m_schedule = C.momentum_schedule(momentum)
t_schedule = C.momentum_as_time_constant_schedule(time_constant)

msgd = C.momentum_sgd(z.parameters, lr_schedule, m_schedule, unit_gain=False)
tsgd = C.momentum_sgd(z.parameters, lr_schedule, t_schedule, unit_gain=False)
usgd = C.momentum_sgd(z.parameters, ug_schedule, m_schedule, unit_gain=True)

print(inspect_update(msgd, mb_size, 5)[0][0])
print(inspect_update(tsgd, mb_size, 5)[0][0])
print(inspect_update(usgd, mb_size, 5)[0][0])

time constant for momentum of 0.967... =  300.00000000000006
momentum for time constant of 300      =  0.9672161004820059
[-1.436 -1.436]
[-1.436 -1.436]
[-1.436 -1.436]
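
The value -1.436 can also be reproduced without CNTK. The sketch below applies the two momentum update rules from the list above to a single scalar parameter. It assumes that the parameter is updated as `param -= rate * momentum_direction` and that the per-minibatch learning rate is divided by the minibatch size. Both assumptions are consistent with the printed values but are not taken from the CNTK source, and `momentum_updates` is a hypothetical helper.

```python
import math


def momentum_updates(learning_rate, momentum, unit_gain, minibatch_size, count):
    """Apply `count` momentum SGD updates to a parameter at 0 with gradient 1."""
    param, direction, gradient = 0.0, 0.0, 1.0
    # Per-minibatch learning rate: divide by the minibatch size.
    effective_rate = learning_rate / minibatch_size
    for _ in range(count):
        if unit_gain:
            direction = momentum * direction + (1.0 - momentum) * gradient
        else:
            direction = momentum * direction + gradient
        param -= effective_rate * direction
    return param


mb_size = 10
momentum = math.exp(-mb_size / 300)

plain = momentum_updates(1.0, momentum, False, mb_size, 5)
scaled = momentum_updates(1.0 / (1.0 - momentum), momentum, True, mb_size, 5)

print(round(plain, 3), round(scaled, 3))  # -1.436 -1.436
```

Scaling the learning rate by $1/(1-\textrm{momentum})$ cancels the $(1 - \textrm{momentum})$ factor in the `unit_gain=True` rule, which is why both variants end at the same value.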


### Learners with individual learning rates

Among the builtin learners, `adagrad`, `rmsprop`, `fsadagrad`, `adam`, and `adadelta` have rules for tuning the learning rate of each parameter individually. They still require tuning a global learning rate that gets multiplied with the individual learning rate of each parameter. The hope is that these techniques achieve better results by exploiting structure in the data. For example, if the inputs are word ids, some features appear more rarely than others, so the learning rate for the parameters associated with rare features should depend on how often those features have been seen rather than on how many minibatches have been processed.

These methods are typically easier to tune, but there is some new evidence that [they overfit more easily](https://arxiv.org/abs/1705.08292) than SGD with momentum.
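
The effect of an individual learning rate on rare features can be illustrated with a textbook version of Adagrad on a single scalar parameter. This is a sketch only: it does not reproduce CNTK's implementation or its numbers, and `adagrad_step_sizes` is a hypothetical helper. It assumes the common formulation in which each step is the global learning rate times the gradient divided by the square root of the accumulated squared gradients.

```python
import math


def adagrad_step_sizes(gradients, learning_rate=0.1, epsilon=1e-8):
    """Run textbook Adagrad on one parameter; return the size of each nonzero step."""
    accumulator = 0.0
    steps = []
    for g in gradients:
        accumulator += g * g  # sum of squared gradients seen so far
        if g != 0.0:
            steps.append(learning_rate * abs(g) / (math.sqrt(accumulator) + epsilon))
    return steps


# A feature present in every minibatch versus one present in every tenth minibatch.
frequent = [1.0] * 100
rare = [1.0 if i % 10 == 0 else 0.0 for i in range(100)]

frequent_steps = adagrad_step_sizes(frequent)
rare_steps = adagrad_step_sizes(rare)

print(round(frequent_steps[-1], 4))  # 0.01
print(round(rare_steps[-1], 4))      # 0.0316
```

After the same number of minibatches, the parameter tied to the rare feature still takes steps about three times larger, because its step size depends on how often the feature was seen and not on how many minibatches were processed.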

Below we show how these learners can be configured and the effect their updates have on the parameters. 
The main take-away is that **if you switch learners, you need to retune the learning rate**. In this example the sequence of initial points and gradients is the same, yet different learners arrive at different parameter values after 10 minibatches. However, if we retune the learning rate of the learner with the smallest parameter values in magnitude (adadelta), it can reach parameter values as large as those of the learner with the largest ones (adamax). Also, this is an artificial example where gradients are consistently equal to 1, so the methods that have momentum builtin (`adam`/`adamax`/`fsadagrad`) should get larger parameter values in magnitude than the methods that don't have momentum builtin (for the same value of the learning rate).

In [5]:
mb_size = 32
time_constant = 1000

lr_schedule = C.learning_rate_schedule(1, C.UnitType.minibatch)
t_schedule = C.momentum_as_time_constant_schedule(time_constant)

tsgd = C.momentum_sgd(z.parameters, lr_schedule, t_schedule, unit_gain=False)

adadelta  = C.adadelta(z.parameters, lr_schedule, 0.999, 1e-6)
adagrad   = C.adagrad(z.parameters, lr_schedule)
adam      = C.adam(z.parameters, lr_schedule, t_schedule, unit_gain=False)
adamax    = C.adam(z.parameters, lr_schedule, t_schedule, unit_gain=False, adamax=True)
fsadagrad = C.fsadagrad(z.parameters, lr_schedule, t_schedule, unit_gain=False)
rmsprop   = C.rmsprop(z.parameters, lr_schedule, gamma=0.999, inc=1.0+1e-9, dec=1.0-1e-9, max=np.inf, min=1e-30)

print('adadelta :', inspect_update(adadelta, mb_size, 10)[0][0])
print('adagrad  :', inspect_update(adagrad, mb_size, 10)[0][0])
print('adam     :', inspect_update(adam, mb_size, 10)[0][0])
print('adamax   :', inspect_update(adamax, mb_size, 10)[0][0])
print('fsadagrad:', inspect_update(fsadagrad, mb_size, 10)[0][0])
print('rmsprop  :', inspect_update(rmsprop, mb_size, 10)[0][0])

adadelta_schedule = C.learning_rate_schedule(1004, C.UnitType.minibatch)
adadelta_tuned  = C.adadelta(z.parameters, adadelta_schedule, 0.999, 1e-6)
print('adadelta2:', inspect_update(adadelta_tuned, mb_size, 10)[0][0])

adadelta : [-0.0099 -0.0099]
adagrad  : [-0.3125 -0.3125]
adam     : [-9.9203 -9.9203]
adamax   : [-9.9227 -9.9227]
fsadagrad: [-8.8573 -8.8573]
rmsprop  : [-0.3125 -0.3125]
adadelta2: [-9.9228 -9.9228]


## Writing a learner as a CNTK expression

If you want to experiment with your own learner, you should first try to write it as a CNTK expression. This is much faster than the next alternative, which is to write it in Python. CNTK has a universal learner that accepts a function as an argument. This function takes a list of parameters and a list of gradients and creates an expression (a network) that, when evaluated, assigns new values to the parameters according to the learning rule you coded. At the time of this writing, this learner does not support schedules for learning rate and momentum. If this is necessary, the user must create a new learner. Another deficiency of this learner is that it only supports densely stored gradients. If you get an error that a quantity is not dense, you have two options:
- find the parameters with sparse gradients (typically those used at the very first layer) and put them in a builtin learner
- replace input variables that are sparse with dense ones (`is_sparse=False`)

Below we show how to write RMSprop using the universal learner.

In [6]:
def my_rmsprop(parameters, gradients):
    rho = 0.999
    lr = 0.01
    # We use the following accumulator to store the moving average of every squared gradient
    accumulators = [C.constant(1e-6, shape=p.shape, dtype=p.dtype) for p in parameters]
    update_funcs = []
    for p, g, a in zip(parameters, gradients, accumulators):
        # We declare that `a` will be replaced by an exponential moving average of squared gradients
        # The return value is the expression rho * a + (1-rho) * g * g 
        accum_new = C.assign(a, rho * a + (1-rho) * g * g)
        # This is the rmsprop update. 
        # We need to use accum_new to create a dependency on the assign statement above. 
        # This way, when we run this network both assigns happen.
        update_funcs.append(C.assign(p, p - lr * g / C.sqrt(accum_new)))
    return C.combine(update_funcs)

my_learner = C.universal(my_rmsprop, z.parameters)
print(inspect_update(my_learner, 10, 2)[0][0])

[-0.5397 -0.5397]
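
The printed value can be verified by running the same update rule on a single scalar in plain Python. This sketch mirrors `my_rmsprop` with the same constants. It assumes, as in `inspect_update`, a parameter starting at 0 and a gradient of 1, and it assumes that the gradient reaches the update function without being divided by the minibatch size, which is consistent with the output above. `rmsprop_updates` is a hypothetical helper.

```python
import math


def rmsprop_updates(count, rho=0.999, lr=0.01, initial_accumulator=1e-6):
    """Apply `count` RMSprop updates to a parameter at 0 with gradient 1."""
    param, accumulator, gradient = 0.0, initial_accumulator, 1.0
    for _ in range(count):
        # Exponential moving average of squared gradients.
        accumulator = rho * accumulator + (1.0 - rho) * gradient * gradient
        # The step uses the freshly updated accumulator, as in my_rmsprop.
        param -= lr * gradient / math.sqrt(accumulator)
    return param


result = rmsprop_updates(2)
print(round(result, 4))  # -0.5397
```

The first step is large (about 0.316) because the accumulator starts close to zero, and the second step is smaller (about 0.224) once the accumulator has grown.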


## Writing a learner as a Python class

CNTK expressions are very powerful and all the well-known learners can be expressed in this way. Still, there can be rare cases where you want to perform an update that cannot be currently implemented as a CNTK expression. In those cases you can implement your learner as a Python class. CNTK will then call its update method during training. Since this means C++ (the training loop) is calling into Python (your learner) for every single minibatch, this approach is the slowest of all options.

In order for your class to be understood as a learner it has to inherit from `cntk.UserLearner`. The constructor can be used to set up the learner and the trainer will call the learner's update method by supplying it a dictionary whose keys are the parameters and whose values are the corresponding gradients, as well as the number of samples in the minibatch and whether we have reached the end of a sweep through the data. The implementation of update is totally up to you.

In the code below, we create a learner that just performs SGD. In the constructor we create a dictionary mapping tensor shapes to CNTK expressions with the gradients being input variables. In the update method, for each (parameter, gradient) pair we look up the expression corresponding to the shape of the parameter, bind the gradient to the input of the expression, and evaluate the expression. Finally we slice the result to get rid of the batch axis and update the parameter. We have also slightly modified the `inspect_update` function to make it work with a user defined learner.

In [7]:
class MySgd(C.UserLearner):

    def __init__(self, parameters, lr_schedule):
        super(MySgd, self).__init__(parameters, lr_schedule, as_numpy=False)

        self.new_p = {}
        self.grad_input = {}

        self.sample_count_input = C.input_variable((), name='count')

        lr = lr_schedule[0]  # assuming constant learning rate
        eta = lr / self.sample_count_input

        # we need one graph per parameter shape
        for param in parameters:
            p_shape = param.shape
            self.grad_input[p_shape] = C.input_variable(p_shape)
            self.new_p[p_shape] = param - eta * self.grad_input[p_shape]

    def update(self, gradient_values, training_sample_count, sweep_end):
        for p, g in gradient_values.items():
            new_p = self.new_p[p.shape]
            grad_input = self.grad_input[p.shape]

            data = {
                    self.sample_count_input: np.asarray(training_sample_count),
                    grad_input: g
                    }
            result = new_p.eval(data, as_numpy=False)
            shape = result.shape

            # result has the shape of a complete minibatch, but contains
            # only one tensor, which we want to write to p. This means, we
            # have to slice off the leading dynamic axis.
            static_tensor = result.data.slice_view([0]*len(shape), shape[1:])
            p.set_value(static_tensor)
        return True
    
mb_size = 64
lr_schedule = C.learning_rate_schedule(1, C.UnitType.minibatch)
my_sgd = MySgd(z.parameters, lr_schedule)

def inspect_user_learner_update(learner, mbsize, count):
    easy_parameters = [C.Variable.as_parameter(p) for p in learner.parameters()]
    old_values = [p.value for p in easy_parameters]
    for p in easy_parameters:
        p.value = 0 * p.value
    updates = {p: p.value + 1 for p in easy_parameters}
    for i in range(count):
        learner.update(updates, np.float32(mbsize), sweep_end=False)
    ret_values = [p.value for p in easy_parameters]
    for p, o in zip(easy_parameters, old_values):
        p.value = o
    return ret_values

print(inspect_user_learner_update(my_sgd, mb_size, 10)[0][0])

[-0.1562 -0.1562]


And that's all there is to learners! They are at the heart of neural network training, but by themselves they are not very useful, and they are typically driven by a trainer. So a good next step for you would be to take a look at our Trainer howto.