Training a model to perform at its best can be a challenging task.

Having explored the intricacies of various training techniques, such as Precise BatchNorm, Weight Averaging, and Batch Accumulation, you've learned how these can significantly improve a model's performance and stability.

Now, it's time to take the next step and see these concepts in action!

In the SuperGradients framework, implementing these training tricks is straightforward and efficient.

Let's dive into the code and demonstrate just how easy it is to take advantage of these powerful techniques within your own models.


In [None]:
%%capture
!pip install super-gradients==3.2.0

In [None]:
%%capture
from super_gradients.training import training_hyperparams

In [None]:
training_params = training_hyperparams.get("training_hyperparams/default_train_params")

# Exponential Moving Average (EMA)

Getting trapped in a poor local minimum sucks.

When you’re training a neural network, chances are you’re using mini-batches.

There's nothing wrong with that, but it does introduce noise: each gradient is computed from a small sample of the data, so it only approximates the gradient over the full dataset.

On one hand, that's nice because noisy gradients can sometimes help optimization and lead to a better local optimum than if you trained on the entire dataset.

On the other hand, the noise might lead to converging to a poor local minimum.

Luckily, you have the EMA method at your disposal.

EMA keeps an exponential moving average of the model's weights as training progresses, which smooths out the noise from individual updates.

That makes your models more stable, improves convergence, and helps your network find a better solution.
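To make the idea concrete, here is a minimal pure-Python sketch of an EMA update on a single weight. The helper name `ema_update`, the `decay` value, and the weight values are illustrative assumptions, not SuperGradients' implementation.

```python
def ema_update(ema_weights, new_weights, decay=0.9):
    """Blend the running average with the latest weights, element by element."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, new_weights)]


# Noisy values of one weight after successive optimizer steps (made-up numbers).
noisy_steps = [[1.0], [3.0], [1.0], [3.0], [1.0], [3.0]]

ema = noisy_steps[0]
ema_history = [ema[0]]
for weights in noisy_steps[1:]:
    ema = ema_update(ema, weights)
    ema_history.append(ema[0])

# Compare how far the raw weight and its EMA move over the same steps.
raw_swing = max(w[0] for w in noisy_steps) - min(w[0] for w in noisy_steps)
ema_swing = max(ema_history) - min(ema_history)
print(f"raw swing: {raw_swing:.2f}, EMA swing: {ema_swing:.2f}")
```

The raw weight keeps jumping between two values, while the averaged copy moves much less from step to step. A higher `decay` smooths more but reacts more slowly to real changes.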

In [None]:
training_params['ema']

False

In [None]:
# Use a real boolean; the string 'True' only works because any non-empty string is truthy.
training_params['ema'] = True

In [None]:
training_params['ema_params']

# Weight Averaging

Everyone likes a free boost in model accuracy.

And that's what weight averaging gives you.

It’s a post-training method that takes the best model weights across training epochs and averages them into a single model.

By averaging weights for the N best checkpoints, we’re effectively making an ensemble of N models.

It is not exactly the same as having N models and averaging their predictions, which comes at the price of running inference on N models, but it could help with squeezing out some extra accuracy.

It does this by overcoming the optimization tendency to alternate between adjacent local minima in the later stages of the training.

It also has the added benefit of reducing bias.

This trick doesn’t affect training at all. It just keeps a few additional checkpoints on disk, and it can give you a boost in performance and stability.
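As a rough illustration of what averaging the N best checkpoints means, the sketch below averages three toy checkpoints parameter by parameter. The checkpoint contents and the helper `average_checkpoints` are hypothetical; how SuperGradients selects and stores its best checkpoints is not shown here.

```python
def average_checkpoints(checkpoints):
    """Average parameter values, key by key, across a list of checkpoints."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint together.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Three hypothetical "best" checkpoints, each a dict of flat parameter lists.
best_checkpoints = [
    {"conv.weight": [0.9, 2.1], "conv.bias": [0.3]},
    {"conv.weight": [1.0, 2.0], "conv.bias": [0.0]},
    {"conv.weight": [1.1, 1.9], "conv.bias": [-0.3]},
]

averaged_model = average_checkpoints(best_checkpoints)
print(averaged_model)
```

The result is a single set of weights, so inference costs the same as running one model.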

In [None]:
training_params = training_hyperparams.get("training_hyperparams/default_train_params")

In [None]:
training_params['average_best_models']

True

In [None]:
# Use a real boolean; the string 'False' is truthy and would leave averaging enabled.
training_params['average_best_models'] = False

# Batch Accumulation

Most "off-the-shelf" models come with a suggested training recipe, which usually assumes a powerful GPU for training.

If you just reduce your batch size so it fits your hardware, you’ll have to tune other parameters as well, which means you won’t always get the same training results.

There’s got to be a way to train a model that’s appropriate for your target hardware.

That’s where batch accumulation comes in.

Here’s how it works…

1) Perform forward and backward passes over several consecutive batches

2) Accumulate the gradients instead of applying them right away

3) Update the weights once every few batches, using the accumulated gradients.

Start with a small batch size; 4 or 8 is typically good for most smaller GPUs.

Next, determine the virtual batch size you want to simulate.

If you're working with a batch size of 4 but want to simulate a batch size of 64 then you'd accumulate for 64/4 = 16 batches.
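The arithmetic above can be checked with a small pure-Python sketch. It uses a toy one-parameter model with made-up data, and it assumes each micro-batch gradient is divided by the number of accumulation steps so that the sum is a mean. Whether a given framework rescales this way is an assumption here, not a statement about SuperGradients.

```python
import random

random.seed(0)

micro_batch_size = 4
virtual_batch_size = 64
accumulation_steps = virtual_batch_size // micro_batch_size  # 64 / 4 = 16

# Toy data for a one-parameter model y = w * x (made-up numbers).
xs = [random.uniform(-1.0, 1.0) for _ in range(virtual_batch_size)]
ys = [3.0 * x for x in xs]
w = 0.5


def mse_gradient(w, batch_x, batch_y):
    """Gradient of the mean squared error with respect to w over one batch."""
    n = len(batch_x)
    return sum(2.0 * (w * x - y) * x for x, y in zip(batch_x, batch_y)) / n


# Accumulate gradients over 16 micro-batches, scaling each so the sum is a mean.
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    batch_x = xs[start:start + micro_batch_size]
    batch_y = ys[start:start + micro_batch_size]
    accumulated_grad += mse_gradient(w, batch_x, batch_y) / accumulation_steps

# A single weight update would happen here, using accumulated_grad.
full_batch_grad = mse_gradient(w, xs, ys)
print(accumulated_grad, full_batch_grad)
```

The accumulated gradient matches the gradient of one batch of 64 up to floating-point error, while only 4 samples ever need to be held in memory at once.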

In [None]:
training_params = training_hyperparams.get("training_hyperparams/default_train_params")

In [None]:
training_params['batch_accumulate'] = 16

# Precise BatchNorm

BatchNorm is a wonderful invention.

Ever since it hit the scene in 2015, it's been making models less sensitive to learning rates and choice of initialization.

It’s also helped speed up model convergence.

Hell, it’s even helped wage war against overfitting.


BatchNorm does, however, have its problems…


> Batch normalization in the mind of many people, including me, is a necessary evil. In the sense that nobody likes it, but it kind of works, so everybody uses it, but everybody is trying to replace it with something else because everybody hates it.

> — Yann LeCun

Why does BatchNorm catch such flak?

Well, BatchNorm layers are meant to normalize the data based on the dataset’s distribution.

Ideally, you want to estimate the distribution according to the entire dataset.

But this isn’t possible.

So instead, BatchNorm layers estimate these statistics from each mini-batch throughout training.

But a 2021 paper by Facebook AI Research titled "Rethinking “Batch” in BatchNorm" showed that these mini-batch based statistics are sub-optimal.

The researchers propose estimating the data statistics parameters (the mean and standard deviation variables) across several mini-batches, while keeping the trainable parameters fixed.

This method, titled Precise BatchNorm, helps improve both the stability and performance of a model.

If you want to use PreciseBN, it's preferable to set the batch size to something small, mimicking a scenario where you can't fit a large batch in your GPU.

Then set `precise_bn_batch_size` to a value large enough that you start to see good results.
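Here is a hedged sketch of the aggregation idea, using only the standard library. It assumes equal-sized mini-batches and the aggregation rule mean = E[batch mean], variance = E[batch mean² + batch variance] − mean², which is how I understand the paper's approach. The activations are synthetic, and this is not SuperGradients' code.

```python
import random
import statistics

random.seed(0)

# Hypothetical activations of one channel: 256 samples seen in mini-batches of 8.
activations = [random.gauss(5.0, 2.0) for _ in range(256)]
mini_batches = [activations[i:i + 8] for i in range(0, len(activations), 8)]

# Per-batch statistics, as a BatchNorm layer would compute them during training.
batch_means = [statistics.fmean(b) for b in mini_batches]
batch_vars = [statistics.pvariance(b) for b in mini_batches]

# Aggregate across batches: E[mean] and E[mean^2 + var] - E[mean]^2.
precise_mean = statistics.fmean(batch_means)
second_moments = [m * m + v for m, v in zip(batch_means, batch_vars)]
precise_var = statistics.fmean(second_moments) - precise_mean ** 2

print(f"first batch only: mean={batch_means[0]:.3f}, var={batch_vars[0]:.3f}")
print(f"aggregated:       mean={precise_mean:.3f}, var={precise_var:.3f}")
```

Under these assumptions the aggregated values equal the statistics of all 256 samples taken together, whereas any single batch of 8 gives a much noisier estimate.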

In [None]:
training_params = training_hyperparams.get("training_hyperparams/default_train_params")

In [None]:
training_params['precise_bn'] = True  # a real boolean, not the string 'True'
# the effective batch size we want to calculate the batchnorm on.
training_params['precise_bn_batch_size'] = 256

# Zero-weight decay on BatchNorm and Bias

Most computer vision models have BatchNorm layers and biases along with linear or convolutional layers.

This tends to work well because you’ll have more parameters in your model.

More parameters mean more ways to capture interactions between parts of your network.

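More parameters, however, also mean more opportunities to overfit your model.

Weight decay is a common guard against that. As its name suggests, the `zero_weight_decay_on_bias_and_bn` flag sets the weight decay to zero for biases and BatchNorm parameters, so only the remaining weights are decayed.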
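One common way to exempt biases and BatchNorm parameters from weight decay is to split the parameters into two optimizer groups. The sketch below does this on plain parameter names and shapes. The names, the shapes, and the "one-dimensional means bias or BatchNorm" heuristic are assumptions for illustration and may differ from how SuperGradients decides.

```python
# Hypothetical parameter names and shapes for a tiny conv network.
param_shapes = {
    "conv.weight": (16, 3, 3, 3),
    "conv.bias": (16,),
    "bn.weight": (16,),
    "bn.bias": (16,),
    "fc.weight": (10, 16),
    "fc.bias": (10,),
}


def split_weight_decay_groups(param_shapes, weight_decay):
    """Give 1-D parameters (biases, BatchNorm scale and shift) zero weight decay."""
    decay, no_decay = [], []
    for name, shape in param_shapes.items():
        if len(shape) <= 1:
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


groups = split_weight_decay_groups(param_shapes, weight_decay=1e-4)
for group in groups:
    print(group["weight_decay"], group["params"])
```

Only the convolutional and linear weights end up in the group that is decayed; everything else is left unregularized.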

In [None]:
training_params = training_hyperparams.get("training_hyperparams/default_train_params")

In [None]:
training_params['zero_weight_decay_on_bias_and_bn'] = True  # a real boolean, not the string 'True'