# Understanding `fit_one_cycle(max_lr)`

This notebook explores how fastai determines the learning rate when you pass one to `fit_one_cycle`.

**Summary** The `slice` you provide to `fit_one_cycle` gets converted into a numpy array of values. If you provide a `float` instead, it remains a `float`.

The `slice` is mainly used to define the lower and upper bounds of the range of learning rates. If the start value of the `slice` is zero or missing, e.g. `slice(0, 0.03)` or `slice(None, 0.03)`, then `slice.stop` (here `0.03`) is used in conjunction with the number of layer groups to build the list of learning rates.

---

I'm importing the vision section of the fastai library in order to test functions and classes with an existing model and dataset.

In [1]:
from fastai import *
from fastai.vision import *

In [2]:
from typing import Dict, Any, AnyStr, List, Sequence, TypeVar, Tuple, Optional, Union

I'm going to use a simple MNIST dataset to help set up a model.

In [3]:
path = untar_data(URLs.MNIST_TINY)

In [4]:
bs = 8
np.random.seed(42)
data = ImageDataBunch.from_folder(path, size=26, bs=bs)

Recall that `create_cnn` is a fastai function that creates a CNN model we can train.

In [5]:
learn = create_cnn(data, models.resnet18, metrics=error_rate)

Let's train this model once using the `fit_one_cycle` method. We don't care about the results, we just want to notice certain things that happen with the `learner` object.

In [108]:
type(learn)

fastai.basic_train.Learner

In [106]:
learn.fit_one_cycle(1, max_lr=slice(0, 0.01))

Total time: 00:45
epoch  train_loss  valid_loss  error_rate
1      6.477190    1.054067    0.034335    (00:45)



# Passing a learning rate into fit_one_cycle

The model is stored in `learn`, an instance of the fastai `Learner` class. Most of the functions and methods attached to it come from the fastai library. When we're ready to train our model, we can use `fit_one_cycle()` or `fit()`.

Let's go deeper in `fit_one_cycle`. It's a function that's written in `train.py` and connected to the `Learner` class through the line:

```
Learner.fit_one_cycle = fit_one_cycle
```

Because the function is assigned to the class itself, every instance of `Learner` we create comes with it attached as a method.

`Learner.fit_one_cycle` can take many arguments. I'm just going to focus on the maximum learning rate `max_lr`, which is expected to be either a float or a slice.

In [137]:
# Train .fit_one_cycle()
def fit_one_cycle(learn:Learner, cyc_len:int, 
                  max_lr:Union[Floats,slice]=default_lr, 
                  moms:Tuple[float,float]=(0.95,0.85),
                  div_factor:float=25., pct_start:float=0.3, 
                  wd:float=None,
                  callbacks:Optional[CallbackList]=None, **kwargs)-> None:
    "Fit a model following the 1cycle policy."
    max_lr = learn.lr_range(max_lr)
    callbacks = ifnone(callbacks, [])
    callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
                                        pct_start=pct_start, **kwargs))
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
    
# Learner().lr_range() that's used in fit_one_cycle
def lr_range(self, lr:Union[float,slice])-> np.ndarray:
        "Build differential learning rates."
        if not isinstance(lr,slice): return lr
        if lr.start: res = even_mults(lr.start, lr.stop, len(self.layer_groups))
        else: res = [lr.stop/3]*(len(self.layer_groups)-1) + [lr.stop]
        return np.array(res)

By default, the maximum learning rate `max_lr` is assigned the value of `default_lr`, which is:

In [111]:
print(default_lr)

slice(None, 0.003, None)
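Since `slice` is a plain Python built-in, its parts can be inspected directly. The snippet below is only an illustration (the variable names are mine, not fastai's) of why `slice(None, 0.003)` and `slice(0, 0.003)` end up being treated the same way by a truthiness check such as `if lr.start:`.

```python
# A slice stores start, stop and step as plain attributes.
lr = slice(None, 0.003)
print(lr.start, lr.stop, lr.step)   # None 0.003 None

# Both None and 0 are falsy, so a check like `if lr.start:` skips them alike.
candidates = [slice(None, 0.003), slice(0, 0.003), slice(0.001, 0.003)]
has_start = [bool(c.start) for c in candidates]
for c, flag in zip(candidates, has_start):
    print(c, "-> usable start value:", flag)
```

Only the last slice has a start value that counts as "provided".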


## What happens to `max_lr` in `fit_one_cycle` depends on which of three scenarios applies


### If `max_lr` is a `float`

If `max_lr` is a float, `learn.lr_range(max_lr)` inside `fit_one_cycle` returns it unchanged, still as a float.

That's because in the line `max_lr = learn.lr_range(max_lr)`, `lr_range` first checks whether its input is a `slice`. If it's not a `slice` object, `lr_range` immediately returns `max_lr` without changing the value or its datatype.

Once the learning rate is returned, it is passed on to the `OneCycleScheduler` callback and to `learn.fit`.
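To convince myself of the pass-through behaviour without needing a `Learner`, here is a tiny stand-in for just that first check. The function name is hypothetical and the slice branch is deliberately left as a placeholder string.

```python
def passthrough_if_not_slice(lr):
    """Hypothetical stand-in for the first check in lr_range."""
    if not isinstance(lr, slice):
        return lr   # anything that is not a slice comes back untouched
    return "would be expanded into per-group rates"

result = passthrough_if_not_slice(0.01)
print(result, type(result))   # 0.01 <class 'float'>
```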


---


### If `max_lr` is a `slice` with a non-zero starting index (e.g. slice(1, 2), slice(0.1, 1, 1))

If `max_lr` is a slice and contains a starting position that is neither zero nor `None`, e.g. `slice(1, 2)` or `slice(0.01, 1)`, then `lr_range` will pass the starting number, the stop number, and the length of the layer groups into another function called `even_mults`:


```
if lr.start: res = even_mults(lr.start, lr.stop, len(self.layer_groups))
```


`even_mults` is a function in the fastai core library that returns a numpy array of `n` learning rates running from the start value to the stop value. The values are spaced by a constant multiplier rather than a constant difference. When the function is called from `lr_range`, `n` is the number of `layer_groups` in the model. It's not clear to me at this point why that is.

In [161]:
# Fastai function contained within Core.py
def even_mults(start:float, stop:float, n:int)-> np.ndarray:
    "Build evenly stepped schedule from `start` to `stop` in `n` steps."
    mult = stop / start
    step = mult ** (1 / (n - 1))
    return np.array([start * (step ** i) for i in range(n)])

Let's test this out with a sample slice that has non-zero start index.

In [162]:
sample_slice = slice(0.01, 0.03)
even_mults(sample_slice.start, sample_slice.stop, len(learn.layer_groups))

array([0.01    , 0.017321, 0.03    ])

We received three different learning rates: our minimum, one in between (the geometric mean of the two ends rather than the arithmetic midpoint), and our maximum. The number of learning rates is tied to the number of layer groups.
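The "in between" value can be checked by hand. The sketch below restates `even_mults` in pure Python (returning a list instead of a numpy array, and assuming `n` is at least 2) and confirms that consecutive values differ by a constant ratio.

```python
import math

def even_mults_sketch(start, stop, n):
    """Pure-Python restatement of even_mults; assumes n >= 2."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

rates = even_mults_sketch(0.01, 0.03, 3)
# Ratio between each value and the one before it.
ratios = [b / a for a, b in zip(rates, rates[1:])]

print(rates)
print(ratios)                    # both ratios are (approximately) sqrt(3)
print(math.sqrt(0.01 * 0.03))    # geometric mean of the two end points
```

With three values the middle one lands on the geometric mean of 0.01 and 0.03, which is why it isn't the arithmetic midpoint 0.02.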

We'll find that if our network had more than three layer groups, we'd get more than three learning rates. Let's pretend our network has ten layer groups.

In [163]:
sample_slice = slice(0.01, 0.03)
even_mults(sample_slice.start, sample_slice.stop, 10)

array([0.01    , 0.011298, 0.012765, 0.014422, 0.016295, 0.018411, 0.020801, 0.023501, 0.026553, 0.03    ])

As expected, `even_mults` returned 10 learning rates, 8 of which were between 0.01 and 0.03. My assumption is that this will allow users to apply different learning rates to different layer groups.
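One way to picture that assumption is to pair each learning rate with one layer group, in the style of an optimizer's parameter groups. This is not fastai's code, and the group names below are hypothetical placeholders for the real `Sequential` blocks.

```python
# Hypothetical group names; a real Learner holds nn.Sequential blocks instead.
layer_groups = ["early_layers", "middle_layers", "head"]
learning_rates = [0.01, 0.017321, 0.03]   # the three values computed earlier

# One optimizer-style parameter group per layer group, each with its own rate.
param_groups = [{"params": group, "lr": lr}
                for group, lr in zip(layer_groups, learning_rates)]

for pg in param_groups:
    print(pg)
```

Under this reading the earliest layers train with the smallest rate and the head with the largest.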

This numpy array is returned to `fit_one_cycle`, which hands it to the `OneCycleScheduler` callback and to `learn.fit`.

**Sidenote:** Layer groups (e.g. `self.layer_groups`) are an instance attribute of a `Learner` object that contains the PyTorch `Sequential` blocks of the model, which are groupings of layers that run one after another. It's not important to know for this exercise, but we can say that the number of groups varies according to the way the model is structured in PyTorch.

Here's an example `Sequential` block in PyTorch:

```
from torch import nn

layer_group = nn.Sequential(
    nn.Conv2d(1, 20, 5),
    nn.ReLU(),
    nn.Conv2d(20, 64, 5),
    nn.ReLU(),
)
```

---

### If `max_lr` is a `slice` without a starting index (e.g. slice(None, 0.5, None) or slice(0, 0.5)) 

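If `max_lr` is a `slice` with a start value of 0 or `None`, neither of the first two conditionals in `lr_range` applies: the input is a `slice`, so there is no early return, and `lr.start` is falsy, so the `if lr.start` branch is skipped. Execution continues to the `else` branch: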

```
else:
    res = [lr.stop/3]*(len(self.layer_groups)-1) + [lr.stop]
```

Here we see that `lr.start` isn't used in the formula at all. The list of learning rates is calculated purely from `lr.stop` (the second number in the slice) and the number of layer groups.

`[lr.stop / 3]` divides the maximum learning rate we provided by 3, and that value is then repeated once for every layer group except the last. Finally, `lr.stop` itself, the maximum learning rate we provided, is appended to the list.

It's not clear why we're dividing by 3 specifically, since the number of layer groups changes depending on the model. Perhaps it's just an arbitrary way of getting a value lower than the maximum learning rate we chose.

The following is a step by step calculation of the above formula.

In [164]:
sample_stop = default_lr

# Step 1
learning_rates = [sample_stop.stop / 3]
print("Divide max learning rate by 3:", learning_rates)

# Step 2
learning_rates =  learning_rates * (len(learn.layer_groups) - 1)
print("Duplicate the number by the number of layer groups", learning_rates)

# Step 3
learning_rates =  learning_rates + [sample_stop.stop]
print("Append the maximum learning rate to the end of the list:", learning_rates)

Divide max learning rate by 3: [0.001]
Duplicate the number by the number of layer groups [0.001, 0.001]
Append the maximum learning rate to the end of the list: [0.001, 0.001, 0.003]
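The same three steps can be folded into one small helper to see how the result changes with the number of layer groups. The helper name is hypothetical; the expression inside it is copied from the `else` branch of `lr_range`.

```python
def stop_only_rates(stop, n_groups):
    """Hypothetical helper mirroring the `else` branch of lr_range."""
    return [stop / 3] * (n_groups - 1) + [stop]

for n in (1, 3, 5):
    print(n, stop_only_rates(0.003, n))
```

The divisor stays at 3 no matter how many groups there are. Only the number of repeated `stop / 3` entries changes, and with a single layer group the list contains just `stop`.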


# Still In Progress

---

# Once `max_lr` is calculated, how is it used?

Once `max_lr` is calculated in `fit_one_cycle`, it's used for the following:

1) The learning rate gets passed into the `OneCycleScheduler` callback.

2) The learning rate also gets passed into `Learner.fit`, which passes it to another method `create_opt` which creates the optimizer for the `Learner` object.

## Learner.fit

In `Learner.fit`, the learning rate input `lr` can be either a float or a slice. Like `fit_one_cycle`, it defaults to `default_lr`.

We also see that, just like `fit_one_cycle`, it passes `lr` to `lr_range`. If `lr` is a float or a numpy array, it doesn't change; if it's a `slice`, it is converted to an array of numbers by the procedure described previously.
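This means the learning rate goes through `lr_range` twice: once in `fit_one_cycle` and again in `Learner.fit`. The sketch below uses a simplified, list-based stand-in for `lr_range` (my own restatement that takes the number of groups as an argument, not the library function) to show that the second call is harmless.

```python
def lr_range_sketch(lr, n_groups):
    """Simplified, list-based stand-in for Learner.lr_range (no numpy)."""
    if not isinstance(lr, slice):
        return lr
    if lr.start:
        step = (lr.stop / lr.start) ** (1 / (n_groups - 1))
        return [lr.start * step ** i for i in range(n_groups)]
    return [lr.stop / 3] * (n_groups - 1) + [lr.stop]

first = lr_range_sketch(slice(None, 0.003), 3)   # as in fit_one_cycle
second = lr_range_sketch(first, 3)               # as in Learner.fit

print(first)
print(second is first)   # True: the second call returns its input unchanged
```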

**`self.create_opt(lr, wd)`** is key here because this is where we establish the optimization attribute associated with the model, `self.opt`

In [None]:
# Fastai Learner.fit()

def fit(self, epochs:int, lr:Union[Floats,slice]=default_lr,
        wd:Floats=None, callbacks:Collection[Callback]=None)->None:
    "Fit the model on this learner with `lr` learning rate, `wd` weight decay for `epochs` with `callbacks`."
    lr = self.lr_range(lr)
    if wd is None: wd = self.wd
    self.create_opt(lr, wd)
    callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
        callbacks=self.callbacks+callbacks)

In `create_opt`, the list of learning rates is passed to `OptimWrapper` to create an instance property `self.opt` for our `learn` object, which will be passed into the general basic_train.py `fit` function.

In [None]:
# Fastai Learner.create_opt()

def create_opt(self, lr:Floats, wd:Floats=0.)->None:
    "Create optimizer with `lr` learning rate and `wd` weight decay."
    self.opt = OptimWrapper.create(self.opt_func, lr, self.layer_groups, 
                                   wd=wd, true_wd=self.true_wd, bn_wd=self.bn_wd)

`OptimWrapper` is a "Basic wrapper around an optimizer to simplify hyper parameter changes."



# 

In [173]:
sample = lr_range(None, [0, 2, 4])
print(sample)

[0, 2, 4]


In [179]:
learn.opt.lr

0.003

In [170]:
learn.callbacks.

<function list.count>

In [166]:
learn.fit(1, lr=[0.001, 0.002, 0.003, 0.004])

AssertionError: List len mismatch (4 vs 3)

Here in `Learner.create_opt`, we can notice a number of things.

We create a new instance attribute called `self.opt`, which holds whatever `OptimWrapper.create` returns: an `OptimWrapper` object.

Another thing to notice is that we pass in `self.opt_func` and `self.layer_groups`. `self.opt_func` is the optimization function we want to use for our model. By default it's set to `AdamW` in the fastai code, which, judging by the `.func` lookup further down, wraps `torch.optim.Adam`.

`self.layer_groups` is a list of the model's layer groups.

In [25]:
# Fastai Learner.create_opt()
def create_opt(self, lr:Floats, wd:Floats=0.)->None:
    "Create optimizer with `lr` learning rate and `wd` weight decay."
    self.opt = OptimWrapper.create(self.opt_func, lr, self.layer_groups, 
                                   wd=wd, true_wd=self.true_wd, bn_wd=self.bn_wd)

In [34]:
# Which optimization function are we using?
learn.opt_func.func

torch.optim.adam.Adam

In [52]:
# What's inside a layer_group?
print(learn.layer_groups[2])

# How many layer groups are there?
print(len(learn.layer_groups))

Sequential(
  (0): AdaptiveAvgPool2d(output_size=1)
  (1): AdaptiveMaxPool2d(output_size=1)
  (2): Lambda()
  (3): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (4): Dropout(p=0.25)
  (5): Linear(in_features=1024, out_features=512, bias=True)
  (6): ReLU(inplace)
  (7): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (8): Dropout(p=0.5)
  (9): Linear(in_features=512, out_features=2, bias=True)
)
3


In [54]:
learn.callbacks

[]

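In `OptimWrapper.create()`, we see that every parameter group is first created with a learning rate of 0, and the real values are assigned afterwards through `opt.lr = listify(lr, layer_groups)`.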

In [None]:
# Fastai OptimWrapper.create()
def create(cls, opt_func:Union[type,Callable], lr:Union[float,Tuple,List],
           layer_groups:ModuleList, **kwargs:Any)->optim.Optimizer:
    "Create an optim.Optimizer from `opt_func` with `lr`. Set lr on `layer_groups`."
    split_groups = split_bn_bias(layer_groups)
    opt = opt_func([{'params': trainable_params(l), 'lr':0} for l in split_groups])
    opt = cls(opt, **kwargs)
    opt.lr = listify(lr, layer_groups)
    return opt
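The `List len mismatch (4 vs 3)` error seen earlier looks like it comes from a length check inside `listify`. I haven't reproduced the library source here; the function below is a simplified stand-in, assuming `listify` repeats a single value for every group and otherwise insists that the lengths match. The group names are hypothetical.

```python
from collections.abc import Iterable

def listify_sketch(p, q):
    """Simplified stand-in for listify: make `p` a list as long as `q`."""
    if not isinstance(p, Iterable):
        p = [p]
    p = list(p)
    if len(p) == 1:
        p = p * len(q)   # repeat a single value once per group
    assert len(p) == len(q), f"List len mismatch ({len(p)} vs {len(q)})"
    return p

groups = ["g0", "g1", "g2"]   # hypothetical stand-ins for the layer groups

print(listify_sketch(0.003, groups))
print(listify_sketch([0.001, 0.002, 0.003], groups))
try:
    listify_sketch([0.001, 0.002, 0.003, 0.004], groups)
except AssertionError as err:
    print("AssertionError:", err)
```

Under that assumption, a float is spread across all three groups, a list of three passes through, and a list of four fails in the same way as the `learn.fit` call above.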

In [None]:
OptimWrapper.