[View the runnable example on GitHub](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial/notebook/training/pytorch/accelerate_pytorch_training_bf16.ipynb)

# Use BFloat16 Mixed Precision for PyTorch Training

Brain Floating Point Format (BFloat16) is a custom 16-bit floating-point format designed for machine learning. BFloat16 consists of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Because it has the same number of exponent bits as FP32, BFloat16 covers the same dynamic range while using only half the memory.
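
As a rough illustration of this layout, the sketch below uses only the Python standard library to turn an FP32 value into a BFloat16 bit pattern by keeping its upper 16 bits (with round-to-nearest-even). The helper names are made up for this example and are not part of BigDL-Nano or PyTorch, and NaN handling is omitted for brevity.

```python
import math
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BFloat16 pattern of x (round-to-nearest-even, no NaN handling)."""
    # Reinterpret x as its 32-bit IEEE 754 single-precision pattern.
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    # BFloat16 is the upper half of FP32; the bias makes the dropped half round correctly.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16: int) -> float:
    """Expand a BFloat16 pattern back to a Python float by zero-filling the low 16 bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def split_bf16_fields(bits16: int) -> tuple:
    """Split a BFloat16 pattern into (sign, exponent, mantissa) = (1, 8, 7) bits."""
    return bits16 >> 15, (bits16 >> 7) & 0xFF, bits16 & 0x7F


if __name__ == "__main__":
    # 1.0 has sign 0, biased exponent 127, and an all-zero mantissa.
    print(split_bf16_fields(float_to_bf16_bits(1.0)))
    # A value near the top of the FP32 range stays finite in BFloat16.
    print(math.isfinite(bf16_bits_to_float(float_to_bf16_bits(3.0e38))))
    # With only 7 mantissa bits, 1.001 is rounded to 1.0.
    print(bf16_bits_to_float(float_to_bf16_bits(1.001)))
```

The last two lines show the trade-off described above: the range matches FP32, but fine-grained differences between nearby values are lost.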

BFloat16 Mixed Precision combines BFloat16 and FP32 during training, which can lead to increased performance and reduced memory usage. Compared to FP16 mixed precision, BFloat16 mixed precision has better numerical stability, since its wider exponent range makes overflow and underflow less likely.
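
One reason to keep part of the computation in higher precision is that small updates can vanish when they are accumulated directly in BFloat16. The standard-library sketch below only illustrates that effect, assuming round-to-nearest-even conversion; it does not reproduce how BigDL-Nano or PyTorch implement mixed precision. `to_bf16` and `accumulate` are hypothetical helpers, and Python's 64-bit float stands in for the higher-precision accumulator.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest BFloat16 value (ties to even) and return it as a float."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    bits16 = ((bits32 + 0x7FFF + ((bits32 >> 16) & 1)) >> 16) & 0xFFFF
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def accumulate(start: float, update: float, steps: int, keep_bf16: bool) -> float:
    """Add `update` to `start` `steps` times, optionally rounding to BFloat16 each step."""
    total = start
    for _ in range(steps):
        total = total + update
        if keep_bf16:
            total = to_bf16(total)
    return total


if __name__ == "__main__":
    # Accumulating in BFloat16: every 1.0 + 0.001 rounds straight back to 1.0.
    print(accumulate(1.0, 0.001, 1000, keep_bf16=True))
    # Accumulating in higher precision: the updates add up to roughly 2.0.
    print(accumulate(1.0, 0.001, 1000, keep_bf16=False))
```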

By using `TorchNano` (`bigdl.nano.pytorch.TorchNano`), you need only a few code changes to use BFloat16 mixed precision for training. Here we provide __2__ ways to achieve this: A) subclass `TorchNano` or B) use the `@nano` decorator. You can choose the appropriate one depending on your (preferred) code structure.

## Prepare Environment for BigDL-Nano

First, install BigDL-Nano for PyTorch:

In [None]:
!pip install --pre --upgrade bigdl-nano[pytorch] # install the nightly-built version
!source bigdl-nano-init # set environment variables

> 📝 **Note**
>
> Before starting your PyTorch application, it is highly recommended to run `source bigdl-nano-init` to set several environment variables based on your current hardware. Empirically, these variables will greatly improve performance for most PyTorch applications on training workloads.

> ⚠️ **Warning**
> 
> For Jupyter Notebook users, we recommend running the commands above, especially `source bigdl-nano-init`, before the Jupyter kernel is started; otherwise some of the optimizations may not take effect.

> ⚠️ **Warning**
> 
> Using BFloat16 precision with `torch < 1.12` may result in extremely slow training.
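
To guard against this, a script could check the installed version before enabling BFloat16. The helper below is a hypothetical sketch (it is not part of BigDL-Nano): it parses a version string such as the one exposed by `torch.__version__` and compares the major and minor numbers against 1.12.

```python
import re


def meets_min_torch_version(version: str, minimum=(1, 12)) -> bool:
    """Return True if a version string such as '1.13.1+cpu' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare numerically; plain string comparison would rank '1.9' above '1.12'.
    return (major, minor) >= tuple(minimum)


if __name__ == "__main__":
    for candidate in ("1.9.1", "1.11.0", "1.12.0+cpu", "2.0.1"):
        print(candidate, meets_min_torch_version(candidate))
```

In a real environment you would pass `torch.__version__` instead of the sample strings.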

## Pre-define Model and Dataloader

In this guide, we take the fine-tuning of a [ResNet-18 model](https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html) on the [OxfordIIITPet dataset](https://pytorch.org/vision/main/generated/torchvision.datasets.OxfordIIITPet.html) as an example:

In [None]:
# Define model and dataloader

from torch import nn
from torchvision.models import resnet18

class MyPytorchModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = resnet18(pretrained=True)
        num_ftrs = self.model.fc.in_features
        # Set the output size to 37, the number of classes (pet breeds) in OxfordIIITPet.
        self.model.fc = nn.Linear(num_ftrs, 37)

    def forward(self, x):
        return self.model(x)


import torch
from torchvision import transforms
from torchvision.datasets import OxfordIIITPet
from torch.utils.data.dataloader import DataLoader

def create_train_dataloader():
    train_transform = transforms.Compose([transforms.Resize(256),
                                          transforms.RandomCrop(224),
                                          transforms.RandomHorizontalFlip(),
                                          transforms.ColorJitter(brightness=.5, hue=.3),
                                          transforms.ToTensor(),
                                          transforms.Normalize([0.485, 0.456, 0.406],
                                                               [0.229, 0.224, 0.225])])

    # apply data augmentation to the train_dataset
    train_dataset = OxfordIIITPet(root="/tmp/data", transform=train_transform, download=True)

    # prepare data loader
    train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    return train_dataloader

## A) Subclass `TorchNano`

In general, two steps are required if you choose to subclass `TorchNano`:

1) import and subclass `TorchNano`, and override its `train()` method
2) instantiate it with `precision='bf16'`, then call the `train()` method

For step 1, you can refer to [this page](https://bigdl.readthedocs.io/en/latest/doc/Nano/Howto/Training/PyTorch/convert_pytorch_training_torchnano.html) (for consistency, we use the same model and dataset as an example). Given a well-defined subclass `MyNano` such as the one below, a single line instantiates it with BFloat16 mixed precision enabled and trains your model.

In [None]:
from tqdm import tqdm
from bigdl.nano.pytorch import TorchNano # import TorchNano

# subclass TorchNano and override its train method
class MyNano(TorchNano):
    def train(self):
        # Move the code for your custom training loops inside the train method
        model = MyPytorchModule()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
        loss_func = torch.nn.CrossEntropyLoss()
        train_loader = create_train_dataloader()

        # call setup method to set up model, optimizer(s),
        # and dataloader(s) for accelerated training
        model, optimizer, train_loader = self.setup(model, optimizer, train_loader)
        num_epochs = 5

        for epoch in range(num_epochs):

            model.train()
            train_loss, num = 0, 0
            with tqdm(train_loader, unit="batch") as tepoch:
                for data, target in tepoch:
                    tepoch.set_description(f"Epoch {epoch}")
                    optimizer.zero_grad()
                    output = model(data)
                    loss = loss_func(output, target)
                    # Replace loss.backward() with self.backward(loss)
                    self.backward(loss)
                    optimizer.step()
                    loss_value = loss.item()  # plain Python float, detached from the autograd graph
                    train_loss += loss_value
                    num += 1
                    tepoch.set_postfix(loss=loss_value)
            print(f'Train Epoch: {epoch}, avg_loss: {train_loss / num}')

In [None]:
MyNano(precision='bf16').train()

_The detailed definition of_ `MyNano` _can be found in the_ [runnable example](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial/notebook/training/pytorch/accelerate_pytorch_training_bf16.ipynb).

Note that using BF16 precision on a CPU without native BF16 instruction support may slow training down. You can set `use_ipex=True` and `precision='bf16'` simultaneously to enable IPEX ([Intel® Extension for PyTorch*](https://github.com/intel/intel-extension-for-pytorch)), which adopts AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and other optimizations for BFloat16 mixed precision training to gain more acceleration:

In [None]:
MyNano(use_ipex=True, precision='bf16').train()
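
To see whether a machine advertises native BF16 instructions, one option on x86 Linux is to inspect the `flags` lines of `/proc/cpuinfo`. The sketch below assumes the kernel reports such support as `avx512_bf16` or `amx_bf16`. It is a hypothetical helper, not something BigDL-Nano or IPEX provides, and it does not cover other operating systems or architectures.

```python
from pathlib import Path

# Flag names assumed to indicate BF16 support in Linux's /proc/cpuinfo on x86 CPUs.
BF16_FLAGS = {"avx512_bf16", "amx_bf16"}


def bf16_flags_in_cpuinfo(cpuinfo_text: str) -> set:
    """Return the BF16-related CPU flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found |= BF16_FLAGS & set(value.split())
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = bf16_flags_in_cpuinfo(cpuinfo.read_text())
        print("Native BF16 flags:", sorted(flags) or "none found")
    else:
        print("/proc/cpuinfo is not available on this platform")
```

If no flag is found, BF16 training may still run, but it is more likely to hit the slowdown mentioned above.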

## B) Use `@nano` decorator

The `@nano` decorator is convenient: if you have already defined a PyTorch training function that takes a model, optimizers, and dataloaders as parameters, you only need to add 2 new lines (import the decorator and wrap the training function) to enjoy the features brought by BigDL-Nano. You can learn about its usage and caveats [here](https://bigdl.readthedocs.io/en/latest/doc/Nano/Howto/Training/PyTorch/use_nano_decorator_pytorch_training.html). The only difference when using BFloat16 mixed precision for training is that you should specify the decorator as `@nano(precision='bf16')`.

In [None]:
from tqdm import tqdm
from bigdl.nano.pytorch import nano # import nano decorator

@nano(precision='bf16') # apply the decorator to the training loop
def training_loop(model, optimizer, train_loader, num_epochs, loss_func):

    for epoch in range(num_epochs):

        model.train()
        train_loss, num = 0, 0
        with tqdm(train_loader, unit="batch") as tepoch:
            for data, target in tepoch:
                tepoch.set_description(f"Epoch {epoch}")
                optimizer.zero_grad()
                output = model(data)
                loss = loss_func(output, target)
                loss.backward()
                optimizer.step()
                loss_value = loss.item()  # plain Python float, detached from the autograd graph
                train_loss += loss_value
                num += 1
                tepoch.set_postfix(loss=loss_value)
        print(f'Train Epoch: {epoch}, avg_loss: {train_loss / num}')

In [None]:
model = MyPytorchModule()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
loss_func = torch.nn.CrossEntropyLoss()
train_loader = create_train_dataloader()

training_loop(model, optimizer, train_loader, num_epochs=5, loss_func=loss_func)

_A runnable example including this_ `training_loop` _can be found_ [here](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial/notebook/training/pytorch/accelerate_pytorch_training_bf16.ipynb).

As with `TorchNano`, using BF16 precision on a CPU without native BF16 instruction support may slow training down. You can set `use_ipex=True` and `precision='bf16'` simultaneously, i.e. `@nano(use_ipex=True, precision='bf16')`, to enable IPEX ([Intel® Extension for PyTorch*](https://github.com/intel/intel-extension-for-pytorch)), which adopts AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and other optimizations for BFloat16 mixed precision training to gain more acceleration.

> 📚 **Related Readings**
> 
> - [How to install BigDL-Nano](https://bigdl.readthedocs.io/en/latest/doc/Nano/Overview/install.html)
> - [How to convert your PyTorch training loop to use TorchNano for acceleration](https://bigdl.readthedocs.io/en/latest/doc/Nano/Howto/Training/PyTorch/convert_pytorch_training_torchnano.html)
> - [How to accelerate your PyTorch training loop with \@nano decorator](https://bigdl.readthedocs.io/en/latest/doc/Nano/Howto/Training/PyTorch/use_nano_decorator_pytorch_training.html)
> - [How to accelerate a PyTorch application on training workloads through Intel® Extension for PyTorch*](https://bigdl.readthedocs.io/en/latest/doc/Nano/Howto/Training/PyTorch/accelerate_pytorch_training_ipex.html)