# Introduction
In this notebook we will demonstrates how to use BigDL-Nano to accelerate PyTorch or PyTorch-Lightning applications on training workloads.

### Prepare Environment
Before you start with Apis delivered by bigdl-nano, you have to make sure BigDL-Nano is correctly installed for PyTorch. If not, please follow [this](../../../../../docs/readthedocs/source/doc/Nano/Overview/nano.md) to set up your environment.<br>

### Load Cifar10 DataModule
Import the existing data module from bolts and modify the train and test transforms.
You could access [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) for a view of the whole dataset.
Leveraging OpenCV and libjpeg-turbo, BigDL-Nano can accelerate computer vision data pipelines by providing a drop-in replacement of torch_vision's `datasets` and `transforms`.

In [1]:
import os
from torchvision.datasets import CIFAR10
from torch.utils.data.dataloader import DataLoader
from bigdl.nano.pytorch.vision import transforms
DATA_PATH = os.environ.get('DATA_PATH', '.')
BATCH_SIZE = 64
DEV_RUN = bool(os.environ.get('DEV_RUN', False))
train_transforms = transforms.Compose(
    [
        transforms.RandomCrop(32, 4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.4913725490196078, 0.4823529411764706, 0.4466666666666667],
                                     std=[0.24705882352941178, 0.24352941176470588, 0.2615686274509804])
    ]
)
test_transforms = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.4913725490196078, 0.4823529411764706, 0.4466666666666667],
                                     std=[0.24705882352941178, 0.24352941176470588, 0.2615686274509804])
    ]
)
train_dataset = CIFAR10(
        root=DATA_PATH,
        train=True,
        transform=train_transforms,
        download=True,
)
train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=128
)
test_dataset = CIFAR10(
    root=DATA_PATH,
    train=False,
    transform=test_transforms,
    download=True
)
test_loader = DataLoader(
        dataset=train_dataset,
        batch_size=128
)


  from .autonotebook import tqdm as notebook_tqdm


Files already downloaded and verified
Files already downloaded and verified


###  Custom Model
Modify the pre-existing Resnet architecture from TorchVision. The pre-existing architecture is based on ImageNet images (224x224) as input. So we need to modify it for CIFAR10 images (32x32).

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim.lr_scheduler import OneCycleLR
from torchvision.models import resnet18
from pytorch_lightning import LightningModule, seed_everything
from torchmetrics.functional import accuracy
seed_everything(7)
def create_model():
    model = resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

class LitResnet(LightningModule):
    def __init__(self, learning_rate=0.05, num_processes=1):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.learning_rate,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE // self.hparams.num_processes
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}

Global seed set to 7


### Train with Nano Apis
The PyTorch Trainer (`bigdl.nano.pytorch.Trainer`) is the place where we integrate most optimizations. It extends PyTorch Lightning's Trainer and has a few more parameters and methods specific to BigDL-Nano. The Trainer can be directly used to train a `LightningModule`.

`torch.channels_last` is recommended to be applied to the model object to raise CPU resource usage efficiency.

In [3]:
from bigdl.nano.pytorch import Trainer
model = LitResnet()
model = model.to(memory_format=torch.channels_last)
trainer = Trainer(max_epochs=30,
                  fast_dev_run=DEV_RUN) # run model once quickly in test
fit_time_basic = %timeit -n 1 -r 1 -o \
trainer.fit(model, train_dataloader=train_loader)
metric_basic = trainer.test(model, dataloaders=test_loader)


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  "`trainer.fit(train_dataloader)` is deprecated in v1.4 and will be removed in v1.6."
  rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")

  | Name  | Type   | Params
---------------------------------
0 | model | ResNet | 11.2 M
---------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.696    Total estimated model params size (MB)


                                           

Global seed set to 7
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Epoch 29: 100%|██████████| 391/391 [01:51<00:00,  3.52it/s, loss=0.227, v_num=0] 


  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


55min 48s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Testing: 100%|█████████▉| 390/391 [00:37<00:00, 10.42it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9060199856758118, 'test_loss': 0.2779909074306488}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 391/391 [00:37<00:00, 10.35it/s]


Intel Extension for Pytorch (a.k.a. IPEX) link extends PyTorch with optimizations for an extra performance boost on Intel hardware. BigDL-Nano integrates IPEX through the Trainer. Users can turn on IPEX by setting use_ipex=True.

In [4]:
model = LitResnet()
model = model.to(memory_format=torch.channels_last)
trainer = Trainer(max_epochs=30, 
                  use_ipex=True,
                  fast_dev_run=DEV_RUN)
fit_time_ipex = %timeit -n 1 -r 1 -o \
trainer.fit(model, train_dataloader=train_loader)
metric_ipex = trainer.test(model, dataloaders=test_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  "`trainer.fit(train_dataloader)` is deprecated in v1.4 and will be removed in v1.6."
  rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")

  | Name  | Type   | Params
---------------------------------
0 | model | ResNet | 11.2 M
---------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.712    Total estimated model params size (MB)


                                           

Global seed set to 7
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Epoch 0:   0%|          | 1/391 [00:00<01:13,  5.31it/s, loss=2.36, v_num=0]



Epoch 29: 100%|██████████| 391/391 [01:31<00:00,  4.29it/s, loss=0.269, v_num=0] 
46min 26s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Testing: 100%|█████████▉| 390/391 [00:34<00:00, 11.46it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.871940016746521, 'test_loss': 0.370888352394104}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 391/391 [00:34<00:00, 11.45it/s]


Setting use_ipex=True will Apply optimizations at Python frontend to the given model (nn.Module), as well as the given optimizer (optional). Optimizations include conv+bn folding (for inference only), weight prepacking and so on.

Increase the number of processes on distributed training to accelerate training.

In [5]:
model = LitResnet(learning_rate=0.1, num_processes=4)
model = model.to(memory_format=torch.channels_last)
trainer = Trainer(max_epochs=30, 
                  num_processes=4,
                  fast_dev_run=DEV_RUN)
fit_time_dit = %timeit -n 1 -r 1 -o \
trainer.fit(model, train_dataloader=train_loader)
metric_dit = trainer.test(model, dataloaders=test_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
2022-07-11 02:57:45,329 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-07-11 02:57:45,330 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-07-11 02:57:45,331 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-07-11 02:57:45,331 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
Global seed set to 7
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
----------------

Epoch 0:   0%|          | 0/98 [00:00<00:00, 3506.94it/s]  



Epoch 29: 100%|██████████| 98/98 [01:13<00:00,  1.35it/s, loss=0.172, v_num=2]  


  rank_zero_warn("cleaning up ddp environment...")
2022-07-11 03:34:26,434 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-07-11 03:34:26,434 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-07-11 03:34:26,435 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-07-11 03:34:26,436 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------


36min 41s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
evaluate


Global seed set to 7
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All DDP processes registered. Starting ddp with 4 processes
----------------------------------------------------------------------------------------------------

  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Testing: 100%|██████████| 98/98 [00:20<00:00,  5.18it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9073399901390076, 'test_loss': 0.25950178503990173}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 98/98 [00:21<00:00,  4.57it/s]


  rank_zero_warn("cleaning up ddp environment...")


Enable both distributed training and ipex

In [6]:
model = LitResnet(learning_rate=0.1, num_processes=4)
model = model.to(memory_format=torch.channels_last)
trainer = Trainer(max_epochs=30, 
                  num_processes=4,
                  use_ipex=True,
                  fast_dev_run=DEV_RUN)
fit_time_dit_ipex = %timeit -n 1 -r 1 -o \
trainer.fit(model, train_dataloader=train_loader)
metric_dit_ipex = trainer.test(model, dataloaders=test_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  "`trainer.fit(train_dataloader)` is deprecated in v1.4 and will be removed in v1.6."
  rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")
2022-07-11 03:35:07,885 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-07-11 03:35:07,886 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-07-11 03:35:07,887 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-07-11 03:35:07,888 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
Global seed set to 7
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 7
initializing 

Epoch 0:   0%|          | 0/98 [00:00<00:00, 6657.63it/s]  



Epoch 29: 100%|██████████| 98/98 [01:01<00:00,  1.60it/s, loss=0.163, v_num=3]  


  rank_zero_warn("cleaning up ddp environment...")
2022-07-11 04:06:11,310 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-07-11 04:06:11,310 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-07-11 04:06:11,311 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-07-11 04:06:11,312 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------


31min 3s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
evaluate


Global seed set to 7
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Global seed set to 7
Global seed set to 7
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All DDP processes registered. Starting ddp with 4 processes
----------------------------------------------------------------------------------------------------

  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Testing: 100%|██████████| 98/98 [00:20<00:00,  5.23it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9164199829101562, 'test_loss': 0.2389732003211975}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 98/98 [00:22<00:00,  4.45it/s]


  rank_zero_warn("cleaning up ddp environment...")


In [5]:
template = """
|      Precision    | Fit Time(s)         | Accuracy(%) |
|        Basic      |       {:5.2f}       |    {:5.5f}    |
|        IPEX       |       {:5.2f}       |    {:5.5f}    |
|     Distributed   |       {:5.2f}       |    {:5.5f}    |
|    DIST with IPEX |       {:5.2f}       |    {:5.5f}    |
"""
summary = template.format(
    fit_time_basic.average, metric_basic[0]['test_acc'],
    fit_time_ipex.average, metric_ipex[0]['test_acc'],
    fit_time_dit.average, metric_dit[0]['test_acc'],
    fit_time_dit_ipex.average, metric_dit_ipex[0]['test_acc']
)
print(summary)


|      Precision    | Fit Time(s)         | Accuracy(%) |
|        Basic      |       3348.04       |    0.90602    |
|        IPEX       |       2786.39       |    0.87194    |
|     Distributed   |       2201.14       |    0.90734    |
|    DIST with IPEX |       1863.43       |    0.91642    |

