# Mixed Precision Training

#### Last Time
We've [just finished](http://nbviewer.jupyter.org/github/jamesdellinger/fastai_deep_learning_course_part2_v3/blob/master/10b_mixup_label_smoothing_my_reimplementation.ipynb?flush_cache=true) implementing [Mixup](https://arxiv.org/abs/1710.09412) and [Label Smoothing](https://arxiv.org/abs/1512.00567). Our Mixup implementation included a few optimizations to the original paper's approach that resulted in a sped up training time:
* Not loading in two batches at once, but shuffling one batch's images and mixing up half of them with the other half.
* Using a different ratio, $t$, to mixup each image pair in a batch.

Encouragingly, I found that training a model using mixup resulted in better performance than [when I trained](https://nbviewer.jupyter.org/github/jamesdellinger/fastai_deep_learning_course_part2_v3/blob/master/Diving_into_DALI.ipynb?flush_cache=true#Bonus:-horizontal-flip-transform-experiment) the same model using a DALI pipeline with four more traditional image augmentations. Unfortunately, adding label smoothing to my mixup model's loss function did not improve performance.

#### FP16 Training
Today we're gonna spend some time understanding what, exactly, mixed-precision training is, and why using it when training models can be helpful. This topic is personally relevant to me, because I will shortly be building my own deep learning workstation from scratch, and NVIDIA's latest GPUs, such as the [RTX 2080 Ti](https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/), support mixed-precision training!

### Setup
As always, we're going to train our models on the [Imagenette](https://github.com/fastai/imagenette) dataset.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
#export
from exports.nb_10b import *

In [3]:
path = datasets.untar_data(imagenette_url)
tfms = [MakeRGB(), ResizeFixed(128), to_byte_tensor, to_float_tensor]
bs=64

data = (ImageList.from_files(path, tfms=tfms)
        .to_split(partial(grandparent_splitter, valid_name='val'))
        .to_label(parent_labeler, y_processor=CategoryProcessor())
        .to_databunch(bs, channels_in=3, channels_out=10, num_workers=4))

## What *is* mixed-precision training?
Up until now, all calculations in neural networks have typically been done using 32-bit single precision floats. One problem that practitioners universally face is running out of GPU memory while training. This happens when networks have many layers and/or a large batch size is being used. Since 16-bit floats take up about half the memory that 32-bit floats require, some practitioners have lately begun to try to sidestep GPU memory bottlenecks by training networks that use 16-bit floats in their calculations.

Moreover, as mentioned above, NVIDIA's latest GPUs have been optimized to do more operations at the same time when 16-bit floats are used, which has the nice side-effect of decreasing training time. Unfortunately, just how much of a speed-up you get depends on which GPU you happen to be using. (Volta GPUs are said to offer the largest speed increase.)

### We can't *only* use half-precision training
There's one slight problem: if we only use half-precision training, we'll find that our networks don't converge to as good an accuracy as would have been the case had we used full-precision 32-bit float training. To understand why this could happen, let's dissect what a 16-bit float actually is:

![half float](images/half.png)

FP16 offers a much smaller range of possible values than FP32: its normal values have binary exponents from $-14$ to $15$ (magnitudes from $2^{-14}$ up to 65504), whereas FP32's exponents run from $-126$ to $127$. Another limitation of FP16 is that the minimum *offset* between neighboring values is larger than in FP32. Between the numbers 1 and 2, FP16 only represents $1$, $1+2^{-10}$, $1+2\cdot2^{-10}$, $1+3\cdot2^{-10}$, and so on. The implication of this larger offset is that in the world of FP16, 1 + 0.0001 is rounded to 1. This imprecision could lead to the following three problems during model training:
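
To make these limits concrete, here's a minimal sketch that rounds ordinary Python floats to FP16 using only the standard library's `struct` module (format `'e'` is IEEE half precision). The helper name `to_fp16` is my own and isn't part of PyTorch or Apex.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack('e', struct.pack('e', x))[0]

fp16_max = 65504.0               # largest finite FP16 value: (2 - 2**-10) * 2**15
fp16_min_normal = 2.0 ** -14     # smallest normal FP16 value
fp16_spacing_at_1 = 2.0 ** -10   # gap between neighboring FP16 values in [1, 2)

print(to_fp16(fp16_max))               # 65504.0
print(to_fp16(1 + fp16_spacing_at_1))  # 1.0009765625, the next value after 1
print(to_fp16(1 + 0.0001))             # 1.0, the small increment is lost
```

Anything added to 1 that is smaller than half the spacing simply disappears, which is the root of the first problem below.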

#### 3 Problems with Half-precision training
1. Weight gradients, once multiplied by the learning rate, are often smaller than the FP16 offset, which means that there's a significant chance that the model's layer weights won't actually get updated and change after any given training iteration.
2. Similarly, weight gradients themselves can be so small that FP16 thinks they are zero. This will also cause layer weights, particularly at the earliest layers, to not change during training.
3. On the other end of the spectrum, there's a greater chance of gradients exploding when FP16 is used. Since the largest number that FP16 can represent is many orders of magnitude smaller than the largest number FP32 can represent, there's a much greater chance that gradients will overflow to infinity, which then turns into NaN in subsequent calculations.

### The right approach: use a *mix* of FP16 and FP32
Thankfully, there's a workaround that will let us get the memory saving benefits of FP16 training, while making it possible to avoid the three bugaboos described just above: we perform some operations in FP16, and the rest we will do in FP32!

#### Avoiding Problem 1
We can do the forward pass in FP16, because this will give us some extra speed and having relatively less precision during the forward pass doesn't bring any other bad consequences. In order to avoid the first problem detailed above (weights *don't* actually get updated during the "weight update"), we perform the weight update operation, `w = w - lr * grad`, *using FP32*.
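
Here's a small sketch of why the FP32 weight update matters. It emulates FP16 and FP32 rounding with the standard library's `struct` module; the helpers `to_fp16` and `to_fp32`, and the values chosen for `lr` and `grad`, are illustrative assumptions rather than anything from PyTorch or Apex.

```python
import struct

def to_fp16(x):
    """Round to the nearest FP16 value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x):
    """Round to the nearest FP32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

lr, grad = 1e-2, 1e-2   # illustrative values, so lr * grad is about 1e-4
n_steps = 10

# FP16 only: the weight and every update are rounded to FP16.
w_half = to_fp16(1.0)
for _ in range(n_steps):
    w_half = to_fp16(w_half - to_fp16(lr * grad))

# Mixed precision: accumulate updates in an FP32 master copy,
# then cast the result down to FP16 for the next forward pass.
w_master = to_fp32(1.0)
for _ in range(n_steps):
    w_master = to_fp32(w_master - to_fp32(lr * grad))
w_model = to_fp16(w_master)

print(w_half)    # 1.0: every single update was rounded away
print(w_model)   # 0.9990234375: the accumulated change survives
```

Each individual update is too small for FP16 to register, but ten of them added up in FP32 are large enough to move the FP16 weight.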

#### Avoiding Problem 2
In order to avoid problem two, where FP16 takes gradients that actually are larger than 0 in magnitude but sees them as 0 nonetheless (known as *gradient underflow*), the trick is to perform something called *gradient scaling*. All this entails is multiplying the loss value we get from the loss function by a scale factor, such as 512. Because backpropagation is linear in the loss, every gradient ends up multiplied by that same factor. Doing this allows us to also perform the backward pass in FP16, without having to worry that gradients (especially those backpropagating through the network's earliest layers) will become 0. Of course, when we're back in FP32 and about to perform the weight update calculation, we need to divide the gradient by the scaling factor before we perform the weight update.
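
As a rough illustration, the sketch below takes a made-up gradient value that is too small for FP16 and shows that it survives once it's scaled by 512. The `struct`-based helpers and the value of `true_grad` are assumptions chosen for the demo, not part of any library.

```python
import struct

def to_fp16(x):
    """Round to the nearest FP16 value (tiny values underflow to 0.0)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x):
    """Round to the nearest FP32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

true_grad = 1e-8    # illustrative gradient, below FP16's smallest subnormal (~6e-8)
loss_scale = 512

unscaled_fp16 = to_fp16(true_grad)               # what FP16 backprop would store
scaled_fp16 = to_fp16(true_grad * loss_scale)    # same gradient with a scaled loss
recovered = to_fp32(scaled_fp16 / loss_scale)    # unscale after copying to FP32

print(unscaled_fp16)                                   # 0.0
print(scaled_fp16 > 0)                                 # True
print(abs(recovered - true_grad) / true_grad < 0.01)   # True
```

The recovered value isn't bit-for-bit identical to the original gradient, but it's within about 1% of it, which is a lot better than zero.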

#### Avoiding Problem 3
To lessen the chance of having exploding gradients due to FP16 prematurely seeing large gradients as infinite, or NaN, there are two things we can do:
* Convert the output of the model's final layer from FP16 to FP32 and compute the loss using FP32.
* Leave all batchnorm layer weights in FP32.

Making our loss value more precise means that we can scale by our scale factor, such as 512, and not have to worry that the product will be seen as NaN by FP16. 
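
A quick sketch of that overflow risk, using a deliberately large, made-up loss value: the largest finite FP16 number is 65504, so with a scale factor of 512 any loss above roughly 128 no longer fits. The `to_fp16` helper is my own `struct`-based emulation, and it maps out-of-range values to infinity the way FP16 hardware would.

```python
import struct

def to_fp16(x):
    """Round to the nearest FP16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except (OverflowError, struct.error):
        return float('inf') if x > 0 else float('-inf')

loss, loss_scale = 200.0, 512        # illustrative values

print(to_fp16(loss * loss_scale))    # inf: 102400 exceeds the FP16 maximum of 65504
print(loss * loss_scale)             # 102400.0 when kept in higher precision
```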

Batchnorm layers don't have many weights to begin with, so we can afford to store their weights in FP32 without worrying about using up significantly more GPU memory than we would if we stored those weights in FP16.

#### The Canonical Mixed-Precision Training Loop

![Mixed precision training](images/Mixed_precision.jpeg)

Here are the steps we have in a mixed-precision training loop:
1. With FP16: Run the forward pass.
2. With FP32: Convert the final layer outputs from FP16 to FP32, and compute the loss.
3. With FP32: Multiply the loss by the gradient scale factor, such as 512.
4. With FP16: Backpropagate the gradients.
5. With FP32: Copy the FP16 gradients to FP32.
6. With FP32: Divide the gradients by the scale factor applied in step 3, then calculate and apply the layer weight updates.
7. With FP16: Copy the newly updated weights from FP32 back to FP16, in preparation for the forward pass of the next iteration.
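
The seven steps above can be sketched on a toy one-weight model, `pred = w * x`, with a squared-error loss. This is only a pure-Python emulation under my own assumptions (the `struct`-based rounding helpers, the `train_step` function, and the values of `x`, `target` and `lr` are all hypothetical); the real loop runs on PyTorch tensors.

```python
import struct

def to_fp16(x):
    """Round to the nearest FP16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except (OverflowError, struct.error):
        return float('inf') if x > 0 else float('-inf')

def to_fp32(x):
    """Round to the nearest FP32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def train_step(w_master, x, target, lr=1.0, loss_scale=512):
    w_model = to_fp16(w_master)                  # FP16 working copy of the weight
    pred = to_fp16(w_model * to_fp16(x))         # 1. forward pass in FP16
    pred32 = to_fp32(pred)                       # 2. cast the output to FP32...
    loss = to_fp32((pred32 - target) ** 2)       #    ...and compute the loss in FP32
    # 3. scaling the loss by loss_scale scales its derivative w.r.t. pred too
    dpred32 = to_fp32(loss_scale * 2 * (pred32 - target))
    # 4. backward pass in FP16: chain rule with d(pred)/d(w) = x
    grad16 = to_fp16(to_fp16(dpred32) * to_fp16(x))
    grad32 = to_fp32(grad16)                     # 5. copy the gradient to FP32
    grad32 = to_fp32(grad32 / loss_scale)        # 6. unscale, then update in FP32
    w_master = to_fp32(w_master - lr * grad32)
    return w_master, loss                        # 7. the next call recasts to FP16

w = 0.0
for _ in range(30):
    w, loss = train_step(w, x=0.5, target=1.0)

print(abs(w - 2.0) < 1e-3)   # True: the master weight settles next to the optimum, w = 2
print(loss)                  # 0.0
```

Notice that the master weight stops just short of 2: once its FP16 copy rounds to exactly 2, the FP16 forward pass reports zero error and no further updates happen.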

## Implementing a Mixed-Precision Training Loop using Callbacks

### Helper Functions
Before we implement our `MixedPrecision` callback class, we need to write some helper functions that will help us do things like convert a model to FP16, or create a copy of model parameters.

#### Helper Prep: NVIDIA's Apex Library
[Apex](https://github.com/NVIDIA/apex) is a PyTorch extension created by NVIDIA that enables mixed-precision training. We'll use some of its useful methods inside our helper functions.

In [4]:
#export
import apex.fp16_utils as fp16

#### Helper 1: Converting the model to FP16
Our helper function converts all layers of our model from FP32 to FP16 - except for the batchnorm layers, which are always left in FP32. The most straightforward way to accomplish this is:
1. Convert all layers to FP16 in one fell swoop.
2. Loop back over all layers and convert any batchnorm layers back to FP32.

In [5]:
bn_layers = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

In [6]:
def bn_to_float(model):
    if isinstance(model, bn_layers): model.float()
    for child in model.children(): bn_to_float(child)
    return model

In [7]:
def model_to_half(model):
    model = model.half()
    return bn_to_float(model)

A quick test to ensure everything works as we hope:

First use our helper function to convert the model minus batchnorm layers to FP16:

In [8]:
test_model = nn.Sequential(nn.Linear(10,30), nn.BatchNorm1d(30), nn.Linear(30,2)).cuda()
test_model = model_to_half(test_model)

Then create a function to verify that the weights of each of our test model's layers are as we'd expect:

In [11]:
def check_weights(test_model):
    for i, t in enumerate([torch.float16, torch.float32, torch.float16]):
        assert test_model[i].weight.dtype == t
        assert test_model[i].bias.dtype   == t

In [12]:
check_weights(test_model)

The Apex library also has a function called `convert_network` that converts all model layers, save batchnorm, to FP16. Conveniently, we can use it to convert a model either *to* FP16, or *from* FP16 back to FP32.

In [13]:
test_model = nn.Sequential(nn.Linear(10,30), nn.BatchNorm1d(30), nn.Linear(30,2)).cuda()
test_model = fp16.convert_network(test_model, torch.float16)
check_weights(test_model)

#### Helper 2: Creating a FP32 master copy of parameter weights
While our layer weights are in FP16, we will need to perform each weight step in the optimizer using FP32. We'll thus need a way to convert these FP16 parameters to FP32 right before we perform each optimizer step.

We'll also include the option to concatenate all the model parameters into one single flat tensor, because this can make things run slightly faster.

In [14]:
from torch.nn.utils import parameters_to_vector

def get_master(model, flat_master=False):
    model_params = [param for param in model.parameters() if param.requires_grad]
    if flat_master:
        master_params = parameters_to_vector([param.data.float() for param in model_params])
        master_params = torch.nn.Parameter(master_params, requires_grad=True)
        if master_params.grad is None: master_params.grad = master_params.new(*master_params.size())
        return model_params, [master_params]
    else:
        master_params = [param.clone().float().detach() for param in model_params]
        for param in master_params: param.requires_grad_(True)
        return model_params, master_params

In [15]:
test_model_params, test_master_params = get_master(test_model)

A function to ensure that the values of a master list of FP32 params are sufficiently close to their FP16 counterparts, and that the same params in each list are trainable (requires_grad = True).

In [16]:
def same_lists(param_list1, param_list2):
    assert len(param_list1) == len(param_list2)
    for (param1, param2) in zip(param_list1, param_list2):
        assert param1.requires_grad == param2.requires_grad
        assert torch.allclose(param1.data.float(), param2.data.float())

In [17]:
same_lists(test_model_params, test_master_params)

Apex also has a function that we can use to get a master list of FP32 params. It's called `prep_param_lists`.

In [18]:
test_model_params_apex, test_master_params_apex = fp16.prep_param_lists(test_model)

In [19]:
same_lists(test_model_params_apex, test_master_params_apex)

Let's also verify that our home-grown `get_master` method and Apex's `prep_param_lists` method indeed produce the same results:

In [20]:
same_lists(test_model_params, test_model_params_apex)

In [21]:
same_lists(test_master_params, test_master_params_apex)

Note that we can't use a flat master list when we have batchnorm layers that will always stay in FP32:

In [22]:
test_model_params_flat, test_master_params_flat = get_master(test_model, flat_master=True)

In [23]:
same_lists(test_model_params_flat, test_master_params_flat)

AssertionError: 

In [24]:
test_model_params_apex_flat, test_master_params_apex_flat = fp16.prep_param_lists(test_model, flat_master=True)

Error in prep_param_lists:  model may contain a mixture of parameters of different types.  Use flat_master=False, or use F16_Optimizer.


RuntimeError: Expected object of scalar type Half but got scalar type Float for sequence element 2 in sequence argument at position #1 'tensors'

We're gonna need to tweak our implementation, because in practice we definitely won't want all of our model's parameters to be in the same parameter group, which is what our current implementation of `get_master` requires. We want to support multiple parameter groups in the event that we want to:
* Perform transfer learning, which requires us to freeze earlier layers of our network.
* Apply discriminative learning rates/schedules to different layer groups.
* Not apply weight decay to batchnorm layers or bias terms.

`get_master` should thus be able to split the parameters of an optimizer (*not* those of a model) according to their respective parameter groups.

In [25]:
#export
def get_master(optimizer, flat_master=False):
    model_params = [[param for param in param_group if param.requires_grad] for param_group in optimizer.param_groups]
    if flat_master:
        master_params = []
        for param_group in model_params:
            master_param_group = parameters_to_vector([param.data.float() for param in param_group])
            master_param_group = torch.nn.Parameter(master_param_group, requires_grad=True)
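            # Allocate a gradient buffer on the flat master parameter (not a new tensor in its place).
            if master_param_group.grad is None:
                master_param_group.grad = master_param_group.new(*master_param_group.size())
            # Wrap in a list so that each master group is a list, as in the non-flat case.
            master_params.append([master_param_group])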
    else:
        master_params = [[param.clone().float().detach() for param in param_group] for param_group in model_params]
        for param_group in master_params:
            for param in param_group: param.requires_grad_(True)
    return model_params, master_params

#### Helper 3: Creating a FP32 master copy of gradients
In order to perform the optimizer step, we need *both* the model parameter weights and gradients to be in FP32. The `get_master` function we implemented just above takes care of copying parameter weights to FP32. Our 3rd helper function will do the same, but for the gradients that would have been computed during backpropagation in FP16.

In [26]:
def to_master_grads(model_params, master_params, flat_master:bool=False) -> None:
    if flat_master:
        if master_params[0].grad is None:
            master_params[0].grad = master_params[0].data.new(*master_params[0].data.size())
        master_params[0].grad.data.copy_(parameters_to_vector([param.grad.data.float() for param in model_params]))
    else:
        for model_param, master_param in zip(model_params, master_params):
            if model_param.grad is not None:
                if master_param.grad is None: 
                    master_param.grad = master_param.data.new(*master_param.data.size())
                master_param.grad.data.copy_(model_param.grad.data)
            else: master_param.grad = None

Let's try it out. First we'll simulate both a forward and backward pass that are performed in FP16, for a simple model with *no* parameter groups:

In [27]:
x = torch.randn(20, 10).half().cuda()
y = test_model(x)
loss = F.cross_entropy(y, torch.randint(0, 2, (20,)).cuda())
loss.backward()

In [28]:
y

tensor([[ 0.1296, -0.5391],
        [ 0.2874, -0.3467],
        [-0.0869, -0.0266],
        [ 0.2837, -0.1782],
        [ 0.0437,  0.0309],
        [-0.5664,  0.0712],
        [-0.1545, -0.1049],
        [ 0.2072,  0.0307],
        [ 0.0306, -0.0629],
        [ 0.1960, -0.3071],
        [ 0.3550, -0.1348],
        [-0.6455,  0.0584],
        [-0.3154, -0.2942],
        [ 0.5537, -0.2245],
        [ 0.2050, -0.3508],
        [-0.4912,  0.1227],
        [ 0.1016, -0.2094],
        [-0.5332, -0.0906],
        [-0.3108, -0.0751],
        [ 0.4629,  0.2021]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward>)

In [29]:
loss

tensor(0.7646, device='cuda:0', dtype=torch.float16, grad_fn=<NllLossBackward>)

In [30]:
test_model_params = [param for param in test_model.parameters() if param.requires_grad]
test_master_params = [param.clone().float().detach() for param in test_model_params]
test_master_params = [torch.nn.Parameter(param, requires_grad=True) for param in test_master_params]
                                                    

In [31]:
test_model_params

[Parameter containing:
 tensor([[ 6.0028e-02,  1.1719e-01,  1.6040e-01,  8.0688e-02,  2.0630e-01,
           6.4087e-02, -1.9250e-01, -3.1055e-01, -2.7637e-01,  1.3049e-01],
         [ 4.4952e-02,  1.7444e-01,  2.8638e-01,  3.7323e-02,  2.6978e-01,
          -1.5881e-01,  3.0786e-01, -6.8787e-02, -1.6296e-01,  2.7832e-01],
         [-2.8198e-01,  2.8000e-02, -1.6211e-01,  2.4426e-01,  9.9915e-02,
           2.2021e-01, -2.6001e-01,  2.4036e-01, -6.3171e-02, -3.0981e-01],
         [-2.9816e-02,  1.9336e-01,  2.2766e-01,  2.9297e-01, -8.6426e-02,
           1.0162e-01,  5.1666e-02,  2.5610e-01, -5.9326e-02,  1.0303e-01],
         [ 2.4475e-01, -6.2439e-02,  1.3171e-01,  1.0767e-01,  3.3417e-02,
          -2.8345e-01,  1.7432e-01,  3.0127e-01,  1.6040e-01, -8.7219e-02],
         [-1.1328e-01,  2.5171e-01,  1.6284e-01, -7.9407e-02, -2.9663e-01,
          -7.6050e-02,  2.2705e-01, -2.9144e-03, -3.1396e-01, -2.0520e-01],
         [ 2.8882e-01, -8.4167e-02,  1.1261e-01, -2.9785e-01,  9.6680e-

In [32]:
test_master_params

[Parameter containing:
 tensor([[ 6.0028e-02,  1.1719e-01,  1.6040e-01,  8.0688e-02,  2.0630e-01,
           6.4087e-02, -1.9250e-01, -3.1055e-01, -2.7637e-01,  1.3049e-01],
         [ 4.4952e-02,  1.7444e-01,  2.8638e-01,  3.7323e-02,  2.6978e-01,
          -1.5881e-01,  3.0786e-01, -6.8787e-02, -1.6296e-01,  2.7832e-01],
         [-2.8198e-01,  2.8000e-02, -1.6211e-01,  2.4426e-01,  9.9915e-02,
           2.2021e-01, -2.6001e-01,  2.4036e-01, -6.3171e-02, -3.0981e-01],
         [-2.9816e-02,  1.9336e-01,  2.2766e-01,  2.9297e-01, -8.6426e-02,
           1.0162e-01,  5.1666e-02,  2.5610e-01, -5.9326e-02,  1.0303e-01],
         [ 2.4475e-01, -6.2439e-02,  1.3171e-01,  1.0767e-01,  3.3417e-02,
          -2.8345e-01,  1.7432e-01,  3.0127e-01,  1.6040e-01, -8.7219e-02],
         [-1.1328e-01,  2.5171e-01,  1.6284e-01, -7.9407e-02, -2.9663e-01,
          -7.6050e-02,  2.2705e-01, -2.9144e-03, -3.1396e-01, -2.0520e-01],
         [ 2.8882e-01, -8.4167e-02,  1.1261e-01, -2.9785e-01,  9.6680e-

In [33]:
to_master_grads(test_model_params, test_master_params)

In [34]:
[param.grad for param in test_model_params]

[tensor([[ 1.5087e-03, -3.4847e-03,  1.6525e-02,  1.0452e-02, -6.8207e-03,
          -6.7558e-03,  3.1815e-03,  6.4468e-04, -3.4027e-03, -1.1208e-02],
         [ 2.7418e-05, -5.0604e-05,  2.1803e-04,  1.3256e-04, -1.0949e-04,
          -9.2328e-05,  1.1981e-05,  6.6161e-06, -4.5061e-05, -1.9956e-04],
         [-1.4248e-03,  4.6425e-03, -2.3865e-02, -1.4778e-02,  8.6975e-03,
           9.9106e-03, -3.2921e-03,  8.3268e-05,  5.7373e-03,  1.4061e-02],
         [ 5.3406e-03,  1.8433e-02, -4.1351e-02, -1.0513e-02,  1.6479e-02,
           3.1219e-02,  3.1376e-04,  1.4580e-02,  7.6294e-03,  3.9276e-02],
         [-1.9119e-02, -2.2087e-03,  5.4565e-02,  1.3596e-02, -1.4503e-02,
          -3.0930e-02, -1.7456e-02, -2.6611e-02, -2.4216e-02, -2.9327e-02],
         [ 7.5455e-03, -4.3335e-02,  5.6641e-02,  4.9896e-02, -5.0697e-03,
          -4.0771e-02, -1.0414e-02, -1.3771e-02, -4.2496e-03, -1.4076e-02],
         [-9.1028e-04,  2.0599e-03, -9.3765e-03, -1.5345e-03,  2.1305e-03,
           1.1091e-

In [35]:
[param.grad for param in test_master_params]

[tensor([[ 1.5087e-03, -3.4847e-03,  1.6525e-02,  1.0452e-02, -6.8207e-03,
          -6.7558e-03,  3.1815e-03,  6.4468e-04, -3.4027e-03, -1.1208e-02],
         [ 2.7418e-05, -5.0604e-05,  2.1803e-04,  1.3256e-04, -1.0949e-04,
          -9.2328e-05,  1.1981e-05,  6.6161e-06, -4.5061e-05, -1.9956e-04],
         [-1.4248e-03,  4.6425e-03, -2.3865e-02, -1.4778e-02,  8.6975e-03,
           9.9106e-03, -3.2921e-03,  8.3268e-05,  5.7373e-03,  1.4061e-02],
         [ 5.3406e-03,  1.8433e-02, -4.1351e-02, -1.0513e-02,  1.6479e-02,
           3.1219e-02,  3.1376e-04,  1.4580e-02,  7.6294e-03,  3.9276e-02],
         [-1.9119e-02, -2.2087e-03,  5.4565e-02,  1.3596e-02, -1.4503e-02,
          -3.0930e-02, -1.7456e-02, -2.6611e-02, -2.4216e-02, -2.9327e-02],
         [ 7.5455e-03, -4.3335e-02,  5.6641e-02,  4.9896e-02, -5.0697e-03,
          -4.0771e-02, -1.0414e-02, -1.3771e-02, -4.2496e-03, -1.4076e-02],
         [-9.1028e-04,  2.0599e-03, -9.3765e-03, -1.5345e-03,  2.1305e-03,
           1.1091e-

The above comparison of model and master parameter gradients passes the eye test, but to be sure, let's double check that our FP32 gradients are all sufficiently close in value to their FP16 counterparts:

In [36]:
def check_grads(param_list1, param_list2):
    for param1, param2 in zip(param_list1, param_list2):
        if param1.grad is None: assert param2.grad is None
        else: assert torch.allclose(param1.grad.data.float(), param2.grad.data.float())

In [37]:
check_grads(test_model_params, test_master_params)

The function in Apex to copy FP16 gradients to a FP32 master copy is `model_grads_to_master_grads`. Also, of course, we need to support multiple parameter groups, so our final implementation will look like this:

In [38]:
#export
def to_master_grads(model_param_groups, master_param_groups, flat_master:bool=False)->None:
    for (model_param_group, master_param_group) in zip(model_param_groups, master_param_groups):
        fp16.model_grads_to_master_grads(model_param_group, master_param_group, flat_master=flat_master)

#### Helper 4: Copying the FP32 master copy params back to FP16 model params
Once the FP32 weight update calculation is complete, our "master" FP32 copy contains the updated model weights in FP32. In order to complete the next forward pass in FP16, we need to first copy these weights back to FP16 in the "model" copy.

In [39]:
from torch._utils import _unflatten_dense_tensors

def to_model_params(model_params, master_params, flat_master:bool=False)->None:
    if flat_master:
        for model_param, master_param in zip(model_params, _unflatten_dense_tensors(master_params[0].data, model_params)):
            model_param.data.copy_(master_param)
    else:
        for model_param, master_param in zip(model_params, master_params): model_param.data.copy_(master_param.data)

Let's simulate the weight update operation for all parameters in the FP32 "master" copy:

In [40]:
test_master_params = [param + 0.1 * param.grad for param in test_master_params]

In [41]:
test_master_params

[tensor([[ 0.0602,  0.1168,  0.1621,  0.0817,  0.2056,  0.0634, -0.1922, -0.3105,
          -0.2767,  0.1294],
         [ 0.0450,  0.1744,  0.2864,  0.0373,  0.2698, -0.1588,  0.3079, -0.0688,
          -0.1630,  0.2783],
         [-0.2821,  0.0285, -0.1645,  0.2428,  0.1008,  0.2212, -0.2603,  0.2404,
          -0.0626, -0.3084],
         [-0.0293,  0.1952,  0.2235,  0.2919, -0.0848,  0.1047,  0.0517,  0.2576,
          -0.0586,  0.1070],
         [ 0.2428, -0.0627,  0.1372,  0.1090,  0.0320, -0.2865,  0.1726,  0.2986,
           0.1580, -0.0902],
         [-0.1125,  0.2474,  0.1685, -0.0744, -0.2971, -0.0801,  0.2260, -0.0043,
          -0.3144, -0.2066],
         [ 0.2887, -0.0840,  0.1117, -0.2980,  0.0969,  0.2670, -0.1924, -0.2747,
           0.1255,  0.2404],
         [ 0.0035,  0.0849,  0.1262,  0.2562,  0.2100, -0.3053,  0.0988, -0.3016,
          -0.0748,  0.1003],
         [-0.0463,  0.1359,  0.1727, -0.0281,  0.1628, -0.0461, -0.0049, -0.1990,
          -0.1351,  0.0875],
 

And test out our implementation of `to_model_params`:

In [42]:
to_model_params(test_model_params, test_master_params)

In [43]:
test_model_params

[Parameter containing:
 tensor([[ 0.0602,  0.1168,  0.1621,  0.0817,  0.2056,  0.0634, -0.1921, -0.3105,
          -0.2766,  0.1294],
         [ 0.0450,  0.1744,  0.2864,  0.0373,  0.2698, -0.1588,  0.3079, -0.0688,
          -0.1630,  0.2783],
         [-0.2822,  0.0285, -0.1646,  0.2428,  0.1008,  0.2212, -0.2603,  0.2404,
          -0.0626, -0.3083],
         [-0.0293,  0.1952,  0.2235,  0.2920, -0.0848,  0.1047,  0.0517,  0.2576,
          -0.0586,  0.1069],
         [ 0.2428, -0.0627,  0.1372,  0.1090,  0.0320, -0.2866,  0.1726,  0.2986,
           0.1580, -0.0901],
         [-0.1125,  0.2473,  0.1685, -0.0744, -0.2971, -0.0801,  0.2260, -0.0043,
          -0.3145, -0.2067],
         [ 0.2888, -0.0840,  0.1117, -0.2981,  0.0969,  0.2668, -0.1924, -0.2747,
           0.1255,  0.2404],
         [ 0.0035,  0.0849,  0.1262,  0.2561,  0.2100, -0.3052,  0.0988, -0.3015,
          -0.0748,  0.1003],
         [-0.0463,  0.1359,  0.1727, -0.0282,  0.1628, -0.0461, -0.0049, -0.1990,
       

In [44]:
for (param1, param2) in zip(test_model_params, test_master_params):
    assert param1.requires_grad == param2.requires_grad
    assert torch.allclose(param1.data.float(), param2.data.float(), atol=1e-03)

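Recall that when copying FP16 params to FP32, the two lists matched with an absolute tolerance of 1e-8 (the PyTorch default for `torch.allclose`). That direction is exact, because every FP16 value can also be represented in FP32. However, when copying from FP32 back to FP16, each value gets rounded to the nearest FP16 number, so the best we can do is get the two lists to match with an absolute tolerance of 1e-3.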
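
To see where that looser tolerance comes from, here's a small sketch that rounds one made-up weight value to FP16 and compares the round-trip error against the default `torch.allclose` criterion, `|a - b| <= atol + rtol * |b|` with `atol=1e-8` and `rtol=1e-5`. The `to_fp16` helper and the sample value are my own assumptions.

```python
import struct

def to_fp16(x):
    """Round to the nearest FP16 value."""
    return struct.unpack('e', struct.pack('e', x))[0]

value = 0.2668                  # illustrative FP32 master weight
round_trip_error = abs(to_fp16(value) - value)

spacing = 2.0 ** -12            # gap between neighboring FP16 values in [0.25, 0.5)
default_tol = 1e-8 + 1e-5 * abs(value)   # torch.allclose defaults: atol + rtol * |b|

print(round_trip_error <= spacing / 2)   # True: rounding is off by at most half a gap
print(round_trip_error <= default_tol)   # False: too coarse for the default tolerance
print(round_trip_error < 1e-3)           # True: comfortably inside atol=1e-3
```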

Also, we still need this helper function to support parameter groups. We'll use Apex's equivalent of `to_model_params`, `master_params_to_model_params`, to accomplish this:

In [45]:
#export
def to_model_params(model_param_groups, master_param_groups, flat_master:bool=False)->None:
    for (model_param_group, master_param_group) in zip(model_param_groups, master_param_groups):
        fp16.master_params_to_model_params(model_param_group, master_param_group, flat_master=flat_master)

### The main mixed-precision callback
We've finished our helper functions that transfer a model's layer weights to FP16, transfer weights back to FP32, transfer gradients to FP32, and transfer updated weights back to FP16. Now it's time to put all the pieces together and create a Callback class that will execute a mixed-precision training loop!

That this class is so straightforward to implement is a testament to the flexibility of our training loop's (the `Learner` class) callback architecture!

In [47]:
class MixedPrecision(Callback):
    _order = 99
    def __init__(self, loss_scale=512, flat_master=False):
        assert torch.backends.cudnn.enabled, 'Mixed-precision training requires that the cuDNN be installed.'
        self.loss_scale, self.flat_master = loss_scale, flat_master
        
    def begin_fit(self):
        # Helper 1: Convert model (except for any batchnorm layers) to FP16:
        self.run.model = fp16.convert_network(self.model, dtype=torch.float16)
        
        # Helper 2: Creating a FP32 master copy of parameter weights
        self.model_param_groups, self.master_param_groups = get_master(self.opt, self.flat_master)
        # To place those FP32 master copy param groups inside the runner:
        self.run.opt.param_groups = self.master_param_groups
        
    def after_fit(self): self.model.float()
       
    # Convert inputs to FP16 before forward pass
    def begin_batch(self): self.run.xb = self.run.xb.half()
        
    # Convert preds to FP32 so that loss can be computed in FP32
    def after_pred(self): self.run.pred = self.run.pred.float()
        
    # Scale loss to avoid gradient underflow (FP16 seeing non-zero grads as zero)
    def after_loss(self): self.run.loss *= self.loss_scale
        
    def after_backward(self):
        # Helper 3: Copy FP16 gradients (resulting from backprop) to FP32 master.
        to_master_grads(self.model_param_groups, self.master_param_groups, self.flat_master)
        
        # Also unscale gradients so that weight update can be computed in FP32.
        for param_group in self.master_param_groups:
            for param in param_group:
                if param.grad is not None: param.grad.div_(self.loss_scale)
                    
    def after_step(self):
        # Be sure to zero out all gradients after the weight update step.
        self.model.zero_grad()
        
        # Helper 4: Copy updated weights from FP32 master back to FP16 model, 
        #           so that next iteration's forward pass can be done in FP16.
        to_model_params(self.model_param_groups, self.master_param_groups, self.flat_master)

Let's test this mixed-precision training callback out on Imagenette.

In [48]:
min_lr=3e-1
n_outs = [64, 64, 128, 256] 
phases = combine_scheds([0.3, 0.7], cos_1cycle_anneal(min_lr, 3*min_lr, min_lr))
scheduler = ParamScheduler('lr', phases)

First, we'll train without using mixed-precision:

In [49]:
callback_funcs = [partial(AvgStatsCallback, accuracy),
                  CudaCallback,
                  ProgressBarCallback,
                  partial(BatchTransformXCallback, norm_imagenette),
                  MixUp]
learn = get_learner(n_outs, data, lr=min_lr, layer=conv_layer, callback_funcs=callback_funcs)

In [50]:
learn.fit(8, scheduler)

epoch,train_loss,train_accuracy,valid_loss,valid_accuracy,time
0,1.936914,0.374205,1.84446,0.418,00:45
1,1.701108,0.502714,1.652163,0.416,00:41
2,1.525369,0.605165,1.089554,0.664,00:41
3,1.411922,0.661548,1.069016,0.636,00:41
4,1.30726,0.715371,0.906656,0.732,00:41
5,1.225333,0.760819,0.827631,0.752,00:41
6,1.136919,0.810377,0.789534,0.764,00:41
7,1.072786,0.852179,0.808262,0.762,00:41


Now we'll train with mixed-precision:

In [51]:
callback_funcs = [partial(AvgStatsCallback, accuracy),
                  CudaCallback,
                  ProgressBarCallback,
                  partial(BatchTransformXCallback, norm_imagenette),
                  MixUp,
                  MixedPrecision]
learn = get_learner(n_outs, data, lr=min_lr, layer=conv_layer, callback_funcs=callback_funcs)

In [52]:
learn.fit(8, scheduler)

epoch,train_loss,train_accuracy,valid_loss,valid_accuracy,time
0,1.92765,0.377695,1.451271,0.514,00:41
1,1.691863,0.513107,1.486721,0.48,00:41
2,1.529405,0.599659,1.126288,0.63,00:41
3,1.406764,0.663254,0.975285,0.708,00:41
4,1.304548,0.714518,0.919934,0.706,00:41
5,1.207942,0.763378,0.823949,0.76,00:41
6,1.132919,0.810222,0.873254,0.718,00:41
7,1.071925,0.847448,0.814378,0.744,00:41


Let's verify that once training has completed, our model's weights are back in FP32 (`'torch.cuda.FloatTensor'`). Note that FP16 weights would have a [GPU tensor type](https://pytorch.org/docs/stable/tensors.html) of `'torch.cuda.HalfTensor'`.

In [53]:
test_eq(next(learn.model.parameters()).type(), 'torch.cuda.FloatTensor')

Notice that when I ran the mixed-precision training loop on my AWS p2.xlarge instance, there was no decrease in per-epoch training time. This is because the GPU used in my p2.xlarge instance is too old and doesn't have any of the half-precision training optimizations built into it. If I had run my training loop on a newer NVIDIA Volta or 2080 Ti GPU, I would likely have seen a noticeable speed-up.

## Dynamic Loss Scaling
We introduced loss scaling to overcome problem 2, which is when gradients computed during backpropagation that *should* be non-zero are interpreted as FP16 floats of value 0.0. In the above implementation of our mixed-precision training, we chose a loss scaling value of 512. However, is this ideal? Would a different value work better? Should it be less or greater than 512?

The annoying thing here is that we now have yet another hyper-parameter to tune. Thankfully, there is a way we can avoid having to take the time to tune our loss scaling value! 

Conceptually, we want our scaling value to be as high as possible, so that there is as little chance as possible that gradients that *should* be non-zero underflow to zero in FP16. We want to go as high as we can, but stop just short of causing problem 3, which is where the scaled gradients overflow to infinity (or become NaN) in FP16.
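
To make this trade-off concrete, here is a minimal sketch that uses Python's standard `struct` module (format code `'e'`, IEEE 754 half precision) to round numbers to FP16. The helper name `to_fp16` and the sample gradient values are made up for illustration. Note also that `struct` raises `OverflowError` where a GPU would produce infinity, so the helper maps that case to `inf` by hand.

```python
import math
import struct

def to_fp16(value):
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    try:
        return struct.unpack('e', struct.pack('e', value))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; FP16 hardware would give +/-inf.
        return math.copysign(math.inf, value)

tiny_grad = 1e-8     # illustrative gradient, smaller than anything FP16 can represent
large_grad = 200.0   # illustrative gradient that is fine when left unscaled

unscaled_tiny = to_fp16(tiny_grad)         # underflows to zero
scaled_tiny = to_fp16(tiny_grad * 512)     # survives as a non-zero value once scaled
scaled_large = to_fp16(large_grad * 512)   # 102400 exceeds FP16's max (65504): overflow

print(unscaled_tiny, scaled_tiny, scaled_large)
```

The same scale factor that rescues the tiny gradient pushes the large one past FP16's range, which is why we want the largest scale that doesn't overflow.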

Our approach will be to create yet another helper function that is able to detect when gradients overflow. We'll start our training by choosing a really large loss scaling value, and if our gradient overflow checker determines that gradients are overflowing, we'll decrease the loss scaling value by one half. Throughout training, our gradient overflow checker will continue to monitor things, and if gradients ever begin to overflow, loss scaling will again be decreased by half. We therefore refer to this entire mechanism as "dynamic" loss scaling!

Note that we won't necessarily always want to halve our loss scaling. What about when training begins to converge and gradients naturally get smaller and smaller? In that case we'll want to make sure that our loss scaling is large enough to avoid gradient underflow. The Apex library's strategy for this is to double the loss scaling factor each time our model has gone a given number of iterations without the gradient overflow checker detecting any gradient overflow.
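
The halve-on-overflow, double-after-a-clean-streak rule can be sketched in a few lines of plain Python. This is a toy illustration, not Apex's code: the class name `DynamicLossScale`, its attribute names, and the tiny `scale_wait=3` used in the demo are all made up, and resetting the counter after an overflow is my own reading of "a given number of iterations without overflow".

```python
class DynamicLossScale:
    """Toy loss-scale schedule: halve on overflow, double after a clean streak."""

    def __init__(self, init_scale=2.0**24, div_factor=2.0, scale_wait=500):
        self.loss_scale = init_scale
        self.div_factor = div_factor
        self.scale_wait = scale_wait
        self.clean_iters = 0  # iterations since the last overflow

    def update(self, overflowed):
        """Adjust the scale after one iteration; return True if the weight update should be skipped."""
        if overflowed:
            self.loss_scale /= self.div_factor
            self.clean_iters = 0  # assumption: an overflow restarts the streak
            return True
        self.clean_iters += 1
        if self.clean_iters == self.scale_wait:
            self.loss_scale *= self.div_factor
            self.clean_iters = 0
        return False

# Simulate seven iterations with a made-up overflow pattern.
scaler = DynamicLossScale(init_scale=1024.0, scale_wait=3)
history = []
for overflowed in [True, True, False, False, False, False, True]:
    scaler.update(overflowed)
    history.append(scaler.loss_scale)

print(history)  # [512.0, 256.0, 256.0, 256.0, 512.0, 512.0, 256.0]
```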

So, to implement all this, we first need a way to test whether gradients have actually overflowed. This is pretty simple. An overflowed gradient shows up as infinity (or as NaN, once infinities get combined). Although torch has a `torch.isnan` function that we could use to determine if any gradients are NaN, there's a faster way on the GPU: if we sum all the gradients together, and at least one of them is infinite or NaN, then their sum will also be infinite or NaN. We convert the gradients to FP32 before computing this sum, so that the sum itself can't overflow FP16's small range.

You might be wondering what, exactly, are the gradients that will be summed. The answer is that we will sum up the gradients of each parameter tensor separately. In other words, at each iteration, we compute one sum per parameter tensor (for example, one layer's weights) when checking whether any gradients have overflowed.
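
Before moving to tensors, here is a small pure-Python sketch of why the sum trick works. Plain lists of floats stand in for gradient tensors, and the function name `sum_is_non_finite` is made up for illustration; it uses the same three comparisons as the notebook's own check.

```python
def sum_is_non_finite(values):
    """Return True if the sum of `values` is inf, -inf, or NaN."""
    total = sum(values)
    # NaN is the only float that is not equal to itself.
    return total == float('inf') or total == float('-inf') or total != total

finite = [0.5, -1.25, 3.0]

print(sum_is_non_finite(finite))                         # False
print(sum_is_non_finite(finite + [float('inf')]))        # True: the sum is inf
print(sum_is_non_finite(finite + [float('nan')]))        # True: the sum is NaN
print(sum_is_non_finite([float('inf'), float('-inf')]))  # True: inf + -inf is NaN
```

The last case is why the check needs all three comparisons: a positive and a negative infinity cancel into NaN rather than into an infinity.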

In [102]:
#export
def test_overflow(x):
    grad_sum = float(x.float().sum())
    return (grad_sum == float('inf') or grad_sum == float('-inf') or grad_sum != grad_sum)

Let's see how this logic performs:

In [103]:
x = torch.randn(512, 1024).cuda()

In [104]:
test_overflow(x)

False

Let's set one of the "gradients" to infinity:

In [105]:
x[123, 145] = float('inf')

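Now let's verify that using the sum trick indeed runs faster than using `torch.isnan` to see if *any* of the gradients are NaN. (Note that `torch.isnan` on its own would not flag the infinity we just inserted, so this is purely a speed comparison.)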

In [106]:
%timeit test_overflow(x)

79.1 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [107]:
%timeit torch.isnan(x).any().item()

96.8 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Yep! On this GPU, summing all the gradients and then checking whether the sum is infinite or NaN (about 79 µs) is faster than calling `torch.isnan(x).any()` (about 97 µs)!

#### Helper 5: Checking param groups for gradient overflow
We can now call `test_overflow()` inside a function, `grad_overflow()`, that, like the final versions of all our other helper functions, supports networks with parameter groups.

In [142]:
#export
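def grad_overflow(param_groups):
    # Return True as soon as any parameter's gradient contains inf or NaN.
    for param_group in param_groups:
        for param in param_group:
            if param.grad is not None and test_overflow(param.grad.data):
                return True
    return False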

### Main mixed-precision callback with dynamic loss scaling

In [143]:
#export
class MixedPrecision(Callback):
    _order = 99
    def __init__(self, loss_scale=512, flat_master=False, dynamic=True, max_loss_scale=2.**24,
                 div_factor=2., scale_wait=500):
        assert torch.backends.cudnn.enabled, 'Mixed-precision training requires that the cuDNN be installed.'
        self.flat_master = flat_master
        self.dynamic = dynamic
        self.max_loss_scale = max_loss_scale
        self.div_factor = div_factor
        self.scale_wait = scale_wait
        self.loss_scale = max_loss_scale if dynamic else loss_scale
        
    def begin_fit(self):
        # Helper 1: Convert model (except for any batchnorm layers) to FP16:
        self.run.model = fp16.convert_network(self.model, dtype=torch.float16)
        
        # Helper 2: Creating a FP32 master copy of parameter weights
        self.model_param_groups, self.master_param_groups = get_master(self.opt, self.flat_master)
        # To place those FP32 master copy param groups inside the runner:
        self.run.opt.param_groups = self.master_param_groups
        
        # To count number of iterations without gradient overflow occurring.
        if self.dynamic: self.count = 0
        
    def after_fit(self): self.model.float()
       
    # Convert inputs to FP16 before forward pass
    def begin_batch(self): self.run.xb = self.run.xb.half()
        
    # Convert preds to FP32 so that loss can be computed in FP32
    def after_pred(self): self.run.pred = self.run.pred.float()
        
    # Scale loss to avoid gradient underflow (FP16 seeing non-zero grads as zero)
    def after_loss(self): self.run.loss *= self.loss_scale
        
    def after_backward(self):
        # Helper 5: check whether gradient overflow is occurring.
        if self.dynamic and grad_overflow(self.model_param_groups):
            # Divide loss scale factor by the div_factor (usually 2.)
            # if there is gradient overflow.
            self.loss_scale /= self.div_factor
            # We just zero out the gradients and skip the weight 
            # update step if there are NaN gradients.
            self.model.zero_grad()
            return True
        
        # Helper 3: Copy FP16 gradients (resulting from backprop) to FP32 master.
        to_master_grads(self.model_param_groups, self.master_param_groups, self.flat_master)
        
        # Also unscale gradients so that weight update can be computed in FP32.
        for param_group in self.master_param_groups:
            for param in param_group:
                if param.grad is not None: param.grad.div_(self.loss_scale)
      
        # Check if we can double the loss scale factor (if it's been 
        # long enough since a gradient overflow occurred).
        if self.dynamic:
            self.count += 1
            if self.count == self.scale_wait:
                self.count = 0
                self.loss_scale *= self.div_factor
    
    def after_step(self):
        # Be sure to zero out all gradients after the weight update step.
        self.model.zero_grad()
        
        # Helper 4: Copy updated weights from FP32 master back to FP16 model, 
        #           so that next iteration's forward pass can be done in FP16.
        to_model_params(self.model_param_groups, self.master_param_groups, self.flat_master)

Let's try it out!

In [144]:
# Use a lower max_loss_scale than the MixedPrecision class's default
# value, because my one-cycle model uses a higher than typical
# minimum learning rate (0.3).
max_loss_scale = 2.**12
callback_funcs = [partial(AvgStatsCallback, accuracy),
                  CudaCallback,
                  ProgressBarCallback,
                  partial(BatchTransformXCallback, norm_imagenette),
                  MixUp,
                  partial(MixedPrecision, max_loss_scale=max_loss_scale)]
learn = get_learner(n_outs, data, lr=min_lr, layer=conv_layer, callback_funcs=callback_funcs)

In [145]:
# Prevent next cell from displaying pink error box
# re. assertion errors thrown by dataloader.py that are ignored.
from IPython.display import HTML
HTML('''<script>code_show_err=false;</script>''')

In [146]:
learn.fit(8, scheduler)

epoch,train_loss,train_accuracy,valid_loss,valid_accuracy,time
0,1.942185,0.372732,1.846166,0.416,00:42
1,1.716588,0.502714,1.511381,0.498,00:41
2,1.549313,0.594928,1.111166,0.65,00:41
3,1.406133,0.66147,1.1518,0.634,00:41
4,1.318605,0.710098,0.960298,0.704,00:42
5,1.234019,0.752753,0.917452,0.746,00:41
6,1.157992,0.797115,0.804901,0.79,00:42
7,1.092179,0.836668,0.846601,0.752,00:42


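Notice that the loss scale in effect after 8 epochs of training with dynamic loss scaling is *much* larger than the value, 512, that we used by default in our earlier implementation of mixed-precision training, which didn't include dynamic loss scaling. It is also larger than the `max_loss_scale` of `2.**12` that we passed in: that argument only sets the starting scale, and nothing in the callback caps later doublings, so the final value of 32768 corresponds to three net doublings from 4096.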

In [148]:
learn.callbacks[-1].loss_scale

32768.0

## Export

In [149]:
nb_auto_export()

<IPython.core.display.Javascript object>