# Introduction
This notebook demonstrates how to use BigDL-Nano to accelerate training workloads in PyTorch or PyTorch Lightning applications.

### Prepare Environment
Before you start using the APIs provided by BigDL-Nano, make sure BigDL-Nano is correctly installed for PyTorch. If it is not, please follow [this guide](../../../../../docs/readthedocs/source/doc/Nano/Overview/nano.md) to set up your environment.

This demo uses the pre-built CIFAR10 datamodule from lightning-bolts, which you can install as follows:
```bash
pip install lightning-bolts
```
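
Before running the rest of the notebook, it can help to confirm that the required packages are importable. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical, and the module names are taken from the import statements used later in this notebook.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate `module_name`."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names when the parent package is missing.
        return False


# Module names taken from the import statements in this notebook.
for name in ("torch", "torchvision", "pl_bolts", "bigdl.nano"):
    status = "ok" if is_installed(name) else "MISSING"
    print(f"{name:12s} {status}")
```

Any package reported as `MISSING` should be installed before continuing.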

### Load Cifar10 DataModule
Import the existing data module from bolts and modify the train and test transforms.
See the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) page for an overview of the whole dataset.
By leveraging OpenCV and libjpeg-turbo, BigDL-Nano can accelerate computer vision data pipelines through drop-in replacements for torchvision's `datasets` and `transforms`.

In [1]:
import os
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from bigdl.nano.pytorch.vision import transforms
DATA_PATH = os.environ.get('DATA_PATH', '.')
BATCH_SIZE = 64
train_transforms = transforms.Compose(
    [
        transforms.RandomCrop(32, 4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        cifar10_normalization()
    ]
)
test_transforms = transforms.Compose(
    [
        transforms.ToTensor(),
        cifar10_normalization()
    ]
)
cifar10_dm = CIFAR10DataModule(
    data_dir = DATA_PATH,
    batch_size = BATCH_SIZE,
    train_transforms = train_transforms,
    val_transforms = test_transforms,
    test_transforms = test_transforms
)



### Custom Model
Modify the pre-existing ResNet architecture from torchvision. The stock architecture is designed for ImageNet images (224x224) as input, so it needs to be adapted for CIFAR10 images (32x32).
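
To see why the stem needs changing, the standard convolution output-size formula can be applied to both input resolutions. The sketch below assumes torchvision's stock ResNet-18 stem (a 7x7 stride-2 convolution with padding 3, followed by a 3x3 stride-2 max pool with padding 1) and compares it with the modified stem used in this notebook; `conv_out`, `stock_stem`, and `cifar_stem` are hypothetical helpers.

```python
def conv_out(size, kernel, stride, padding):
    """Output width/height of a conv or pooling layer (floor mode, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def stock_stem(size):
    size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
    return conv_out(size, kernel=3, stride=2, padding=1)  # maxpool


def cifar_stem(size):
    # conv1 becomes a 3x3 stride-1 conv; maxpool becomes nn.Identity (no change).
    return conv_out(size, kernel=3, stride=1, padding=1)


print(stock_stem(224))  # 56
print(stock_stem(32))   # 8
print(cifar_stem(32))   # 32
```

With the stock stem, a 32x32 image would already be reduced to 8x8 before the first residual block; the modified stem keeps the full 32x32 resolution.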

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim.lr_scheduler import OneCycleLR
from torchvision.models import resnet18
from pytorch_lightning import LightningModule, seed_everything
from torchmetrics.functional import accuracy
seed_everything(7)
def create_model():
    model = resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

class LitResnet(LightningModule):
    def __init__(self, learning_rate=0.05, num_processes=1):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.learning_rate,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE // self.hparams.num_processes
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}

Global seed set to 7
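
The scheduler configured in `configure_optimizers` steps once per batch. As a rough illustration of the learning-rate curve it produces, here is a pure-Python sketch of a one-cycle schedule with cosine annealing. It assumes the defaults documented for `torch.optim.lr_scheduler.OneCycleLR` (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) and is meant for intuition only, not as a replacement for the PyTorch class; `one_cycle_lr` is a hypothetical helper.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Approximate one-cycle learning rate at a given step (cosine annealing)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # last step of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


# Values from the notebook: 30 epochs, 45000 // 64 steps per epoch, one process.
total = 30 * (45000 // 64)
for s in (0, int(0.3 * total) - 1, total - 1):
    print(s, f"{one_cycle_lr(s, total):.2e}")
```

The rate starts at `max_lr / div_factor`, climbs to `max_lr` over roughly the first 30% of the steps, then decays to a very small value by the final step.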


### Train with Nano APIs
The PyTorch Trainer (`bigdl.nano.pytorch.Trainer`) is the place where we integrate most optimizations. It extends PyTorch Lightning's Trainer and has a few more parameters and methods specific to BigDL-Nano. The Trainer can be directly used to train a `LightningModule`.

In [4]:
from bigdl.nano.pytorch import Trainer
model = LitResnet()
model.datamodule = cifar10_dm
trainer = Trainer(max_epochs=30)
fit_time_basic = %timeit -n 1 -r 1 -o \
trainer.fit(model, datamodule=cifar10_dm)
metric_basic = trainer.test(model, datamodule=cifar10_dm)


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


Files already downloaded and verified
Files already downloaded and verified



  | Name  | Type   | Params
---------------------------------
0 | model | ResNet | 11.2 M
---------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.696    Total estimated model params size (MB)


                                                                      

Global seed set to 7


Epoch 29: 100%|██████████| 782/782 [01:42<00:00,  7.67it/s, loss=0.145, v_num=11, val_loss=0.252, val_acc=0.917] 
51min 6s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)




Testing:  99%|█████████▊| 155/157 [00:07<00:00, 20.90it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9081000089645386, 'test_loss': 0.27566173672676086}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 157/157 [00:07<00:00, 20.88it/s]
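
`%timeit -n 1 -r 1 -o` is an IPython magic that runs the statement once and returns a result object whose `.best` attribute holds the elapsed time in seconds. When the same code runs as a plain Python script, where magics are unavailable, a small helper based on `time.perf_counter` can play the same role. The sketch below is one possible stand-in; `time_once` is a hypothetical name and the workload is a placeholder for `trainer.fit(...)`.

```python
import time


def time_once(fn, *args, **kwargs):
    """Run `fn` once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


# Placeholder workload; in this notebook the call would be something like
# time_once(trainer.fit, model, datamodule=cifar10_dm).
elapsed, total = time_once(sum, range(1_000_000))
print(f"{elapsed:.4f} s, result={total}")
```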


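Have a look at the layers in the model. Evaluating `model.summarize` without calling it displays the bound method, whose repr includes the full module structure.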

In [5]:
model.summarize

<bound method LightningModule.summarize of LitResnet(
  (model): ResNet(
    (conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): Identity()
    (layer1): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    

Intel Extension for PyTorch (a.k.a. IPEX) extends PyTorch with optimizations for an extra performance boost on Intel hardware. BigDL-Nano integrates IPEX through the Trainer; users can turn it on by setting `use_ipex=True`.

In [6]:
model = LitResnet()
model.datamodule = cifar10_dm
trainer = Trainer(max_epochs=30, 
                  use_ipex=True)
fit_time_ipex = %timeit -n 1 -r 1 -o \
trainer.fit(model, datamodule=cifar10_dm)
metric_ipex = trainer.test(model, datamodule=cifar10_dm)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name  | Type   | Params
---------------------------------
0 | model | ResNet | 11.2 M
---------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.712    Total estimated model params size (MB)


                                                                      

Global seed set to 7


Epoch 0:   0%|          | 1/782 [00:00<01:16, 10.18it/s, loss=2.36, v_num=12]



Epoch 29: 100%|██████████| 782/782 [01:44<00:00,  7.48it/s, loss=0.163, v_num=12, val_loss=0.235, val_acc=0.922] 
52min 49s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)




Testing:  99%|█████████▉| 156/157 [00:07<00:00, 21.61it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9156000018119812, 'test_loss': 0.2631952464580536}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 157/157 [00:07<00:00, 21.46it/s]


After IPEX optimization, some layers in the model are replaced; for example, `Conv2d` is replaced by `_IPEXConv2d`.

In [7]:
model.summarize

<bound method LightningModule.summarize of LitResnet(
  (model): ResNet(
    (conv1): _IPEXConv2d()
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): Identity()
    (layer1): Sequential(
      (0): BasicBlock(
        (conv1): _IPEXConv2d()
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): _IPEXConv2d()
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): _IPEXConv2d()
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): _IPEXConv2d()
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer2): Sequential(
      (0): BasicBlock(
        (conv1): _IPEXConv2d()
    

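Set `num_processes` to run distributed training across multiple processes, which can shorten the overall training time. In this example the model is trained with four processes and `learning_rate=0.1`.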

In [8]:
model = LitResnet(learning_rate=0.1, num_processes=4)
model.datamodule = cifar10_dm
trainer = Trainer(max_epochs=30, 
                  num_processes=4
                  )
fit_time_dit = %timeit -n 1 -r 1 -o \
trainer.fit(model, datamodule=cifar10_dm)
metric_dit = trainer.test(model, datamodule=cifar10_dm)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
2022-06-30 00:51:21,491 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-06-30 00:51:21,493 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-06-30 00:51:21,494 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-06-30 00:51:21,494 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
Global seed set to 7
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
----------------

Epoch 0:   0%|          | 0/197 [00:00<00:00, 7319.90it/s]            

Global seed set to 7
Global seed set to 7
Global seed set to 7
Global seed set to 7


Epoch 0:  80%|███████▉  | 157/197 [01:03<00:16,  2.49it/s, loss=1.56, v_num=13]

2022-06-30 01:25:04,274 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-06-30 01:25:04,275 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-06-30 01:25:04,276 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-06-30 01:25:04,277 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------


33min 42s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
evaluate


Global seed set to 7
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Global seed set to 7
Global seed set to 7
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All DDP processes registered. Starting ddp with 4 processes
----------------------------------------------------------------------------------------------------



Testing:  98%|█████████▊| 39/40 [00:03<00:00, 10.34it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9239000082015991, 'test_loss': 0.26745525002479553}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 40/40 [00:03<00:00, 10.41it/s]
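
The progress-bar totals in the single-process and four-process runs can be reproduced with a little arithmetic. The sketch below assumes a 40,000/10,000 train/validation split (inferred from the logged totals, not stated in the notebook), a batch size of 64, and that each process receives an equal shard of the data; `batches` is a hypothetical helper.

```python
import math


def batches(num_samples, batch_size, num_processes=1):
    """Batches each process sees per epoch when samples are sharded evenly."""
    per_process = math.ceil(num_samples / num_processes)
    return math.ceil(per_process / batch_size)


BATCH_SIZE = 64
TRAIN, VAL, TEST = 40_000, 10_000, 10_000  # train/val split sizes are an assumption

for procs in (1, 4):
    train_b = batches(TRAIN, BATCH_SIZE, procs)
    val_b = batches(VAL, BATCH_SIZE, procs)
    test_b = batches(TEST, BATCH_SIZE, procs)
    print(f"processes={procs}: train+val={train_b + val_b}, test={test_b}")
```

Under these assumptions the results match the 782 and 197 totals shown in the training progress bars and the 157 and 40 test batches. Each process in the four-process run performs roughly a quarter of the optimizer steps per epoch, which is why `steps_per_epoch` is divided by `num_processes` in `configure_optimizers`.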




Enable both distributed training and IPEX.

In [9]:
model = LitResnet(learning_rate=0.1, num_processes=4)
model.datamodule = cifar10_dm
trainer = Trainer(max_epochs=30, 
                  num_processes=4,
                  use_ipex=True)
fit_time_dit_ipex = %timeit -n 1 -r 1 -o \
trainer.fit(model, datamodule=cifar10_dm)
metric_dit_ipex = trainer.test(model, datamodule=cifar10_dm)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
2022-06-30 01:25:13,097 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-06-30 01:25:13,098 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-06-30 01:25:13,099 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-06-30 01:25:13,100 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
Global seed set to 7
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Gl

Epoch 0:   0%|          | 0/197 [00:00<00:00, 2504.06it/s]            

Global seed set to 7
Global seed set to 7
Global seed set to 7
Global seed set to 7


Epoch 0:  80%|███████▉  | 157/197 [01:06<00:16,  2.38it/s, loss=1.58, v_num=14]

2022-06-30 02:00:17,710 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------
2022-06-30 02:00:17,711 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - distributed_backend=ddp_subprocess
2022-06-30 02:00:17,712 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - All DDP processes registered. Starting ddp with 4 processes
2022-06-30 02:00:17,713 - bigdl.nano.pytorch.plugins.ddp_subprocess - INFO - ----------------------------------------------------------------------------------------------------


35min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
evaluate


Global seed set to 7
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Global seed set to 7
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Global seed set to 7
Global seed set to 7
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All DDP processes registered. Starting ddp with 4 processes
----------------------------------------------------------------------------------------------------



Testing:  98%|█████████▊| 39/40 [00:03<00:00, 10.29it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.9186999797821045, 'test_loss': 0.2728155851364136}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 40/40 [00:04<00:00,  9.94it/s]




In [18]:
template = """
|       Method      | Fit Time (s)      | Accuracy (%) |
|        Basic      |       {:5.2f}       |    {:5.2f}    |
|        IPEX       |       {:5.2f}       |    {:5.2f}    |
|     Distributed   |       {:5.2f}       |    {:5.2f}    |
|   Dist with IPEX  |       {:5.2f}       |    {:5.2f}    |
"""
summary = template.format(
    fit_time_basic.best, metric_basic[0]['test_acc']*100,
    fit_time_ipex.best, metric_ipex[0]['test_acc']*100,
    fit_time_dit.best, metric_dit[0]['test_acc']*100,
    fit_time_dit_ipex.best, metric_dit_ipex[0]['test_acc']*100
)
print(summary)


|       Method      | Fit Time (s)      | Accuracy (%) |
|        Basic      |       3066.26       |    90.81    |
|        IPEX       |       3169.98       |    91.56    |
|     Distributed   |       2022.82       |    92.39    |
|   Dist with IPEX  |       2104.62       |    91.87    |
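
To make the comparison easier to read, the fit times can be expressed as speedups relative to the basic run. The sketch below hard-codes the four timings printed in the table above; `fit_times` and `speedups` are hypothetical names.

```python
# Fit times in seconds, copied from the summary table.
fit_times = {
    "Basic": 3066.26,
    "IPEX": 3169.98,
    "Distributed": 2022.82,
    "Dist with IPEX": 2104.62,
}

baseline = fit_times["Basic"]
speedups = {name: baseline / seconds for name, seconds in fit_times.items()}

for name, ratio in speedups.items():
    print(f"{name:15s} {ratio:.2f}x")
```

In this particular run, four-process training is roughly 1.5x faster than the basic configuration, while enabling IPEX did not reduce the fit time. Each figure comes from a single run, so small differences should be read with caution.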

