# **Part 1: Run MobileNet on GPU**

In this tutorial, we will explore how to train a neural network with PyTorch.

### Setup (5%)

We will first install a few packages that will be used in this tutorial and also define the path of CUDA library:

In [1]:
!pip install torchprofile 1>/dev/null
!ldconfig /usr/lib64-nvidia 2>/dev/null
!pip install onnx 1>/dev/null
!pip install onnxruntime 1>/dev/null

We will then import a few libraries:

In [2]:
import random

import numpy as np
import torch
import torchvision
from torch import nn
from torch.optim import *
from torch.optim.lr_scheduler import *
from torch.utils.data import DataLoader
from torchprofile import profile_macs
from torchvision.datasets import *
from torchvision.transforms import *
from tqdm.auto import tqdm



In [3]:
print(torch.__version__)
print(torchvision.__version__)

2.1.0
0.16.0


To ensure reproducibility, we will seed the random number generators:

In [4]:
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x7febcc6cb330>
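
The three calls above seed Python, NumPy, and PyTorch. As a minimal sketch, the same seeding can be wrapped in one helper and checked by drawing the same random tensor twice. The helper name `seed_everything` is our own and not part of the lab, and fully deterministic GPU training may depend on further settings that this sketch does not cover.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the generators on all devices
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # explicit, for every visible GPU


seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)

# Re-seeding with the same value reproduces the same draw.
print(torch.equal(first, second))
```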

We must choose the hyperparameters before training the model:

In [5]:
NUM_CLASSES = 10

# TODO:
# Decide your own hyper-parameters
BATCH_SIZE = 125
LEARNING_RATE = 1e-4
NUM_EPOCH = 3

### Data  (5%)

In this lab, we will use CIFAR-10 as our target dataset. This dataset contains images from 10 classes, where each image is of
size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.

Before using the data as input, we pre-process it with a transform:

In [6]:
# TODO:
# Resize images to 224x224, i.e., the input image size of MobileNet,
# Convert images to PyTorch tensors, and
# Normalize the images with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
transform = Compose([
  Resize((224, 224)),
  ToTensor(),
  Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])


dataset = {}
for split in ["train", "test"]:
  dataset[split] = CIFAR10(
    root="data/cifar10",
    train=(split == "train"),
    download=True,
    transform=transform,
  )

Files already downloaded and verified
Files already downloaded and verified
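
To see what the normalization step does to the pixel values, here is a small worked example. It relies on two facts: `ToTensor` scales pixels to the range [0, 1], and `Normalize` computes `(x - mean) / std` per channel. The variable names are ours.

```python
import numpy as np

# ImageNet statistics used in the transform above.
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# ToTensor() scales pixel values to [0, 1]; Normalize then computes (x - mean) / std.
lowest = (0.0 - mean) / std   # value of a black pixel in each channel
highest = (1.0 - mean) / std  # value of a white pixel in each channel

for name, lo, hi in zip("RGB", lowest, highest):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

After normalization, the inputs are roughly centered around zero instead of lying in [0, 1], which matches the input statistics the pre-trained weights were trained with.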


To train a neural network, we will need to feed data in batches.

We create data loaders with the batch size chosen in the setup section:

In [7]:
dataflow = {}
for split in ['train', 'test']:
  dataflow[split] = DataLoader(
    dataset[split],
    batch_size=BATCH_SIZE,
    shuffle=(split == 'train'),
    num_workers=0,
    pin_memory=True,
    drop_last=True
  )

We can print the data type and shape from the training data loader:

In [8]:
for inputs, targets in dataflow["train"]:
  print(f"[inputs] dtype: {inputs.dtype}, shape: {inputs.shape}")
  print(f"[targets] dtype: {targets.dtype}, shape: {targets.shape}")
  break

[inputs] dtype: torch.float32, shape: torch.Size([125, 3, 224, 224])
[targets] dtype: torch.int64, shape: torch.Size([125])
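
The batch size also determines how many iterations one epoch takes and how much memory one input batch needs. The following sketch works this out with plain arithmetic, assuming the standard CIFAR-10 split of 50,000 training and 10,000 test images. The helper `batches_per_epoch` is a hypothetical name, not a PyTorch function.

```python
def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches a data loader yields per epoch."""
    full, remainder = divmod(num_samples, batch_size)
    return full if drop_last or remainder == 0 else full + 1


BATCH_SIZE = 125
train_batches = batches_per_epoch(50_000, BATCH_SIZE, drop_last=True)
test_batches = batches_per_epoch(10_000, BATCH_SIZE, drop_last=True)
print(train_batches, test_batches)

# Memory taken by one float32 input batch of shape [125, 3, 224, 224].
batch_bytes = BATCH_SIZE * 3 * 224 * 224 * 4
print(f"{batch_bytes / 2**20:.1f} MiB per input batch")
```

Because 125 divides both split sizes evenly, `drop_last=True` does not discard any image here. With a batch size that does not divide 10,000, part of the test set would be skipped during evaluation.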


### Model (10%)

In this tutorial, we will import the MobileNet provided by torchvision and use its pre-trained weights:

In [9]:
# TODO:
# Load pre-trained MobileNetV2
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
print(model)



MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

You should observe that the output dimension of the classifier does not match the number of classes in CIFAR-10.

Now change the output dimension of the classifier to the number of classes:

In [10]:
# TODO:
# Change the output dimension of the classifier to the number of classes
model.classifier[1] = nn.Sequential(
  nn.Dropout(0.2),
  nn.Linear(1280, NUM_CLASSES)
)
print(model)

# Send the model from cpu to gpu
model = model.cuda()

MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

Now the output dimension of the classifier matches.

As this course focuses on efficiency, we will then inspect its model size and (theoretical) computation cost.


* The model size can be estimated by the number of trainable parameters:

In [11]:
num_params = 0
for param in model.parameters():
  if param.requires_grad:
    num_params += param.numel()
print("#Params:", num_params)

#Params: 2236682
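
As a small worked example, the printed count can be split into the new classifier head and the rest of the network, and converted into an approximate storage size. The sketch assumes 32-bit floating-point parameters. The constant names are ours.

```python
NUM_CLASSES = 10
FEATURE_DIM = 1280          # input width of the classifier's Linear layer
TOTAL_PARAMS = 2_236_682    # value printed by the cell above

# A Linear layer has in_features * out_features weights plus out_features biases.
head_params = FEATURE_DIM * NUM_CLASSES + NUM_CLASSES
backbone_params = TOTAL_PARAMS - head_params

# Each float32 parameter occupies 4 bytes.
size_mib = TOTAL_PARAMS * 4 / 2**20

print(f"head: {head_params}, rest: {backbone_params}, fp32 size: {size_mib:.2f} MiB")
```

The head accounts for well under 1% of the parameters, so almost all of the model size sits in the convolutional feature extractor.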


* The computation cost can be estimated by the number of [multiply–accumulate operations (MACs)](https://en.wikipedia.org/wiki/Multiply–accumulate_operation) using [TorchProfile](https://github.com/zhijian-liu/torchprofile). We will use this profiling tool again in future labs.

In [12]:
num_macs = profile_macs(model, torch.zeros(1, 3, 224, 224).cuda())
print("#MACs:", num_macs)

#MACs: 306186464


This model has 2.2M parameters and requires 306M MACs for inference. We will work together in the next few labs to improve its efficiency.
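
To build intuition for where the MACs come from, the sketch below counts them by hand for the first three convolutions in the printed model. It uses the usual formula of one multiply-accumulate per weight per output pixel. The helper `conv2d_macs` is our own, and it counts convolutions only, so we do not claim it reproduces TorchProfile's exact accounting.

```python
def conv2d_macs(c_in, c_out, kernel, out_h, out_w, groups=1):
    """MACs of a 2D convolution: one multiply-accumulate per weight per output pixel."""
    kh, kw = kernel
    return c_out * out_h * out_w * (c_in // groups) * kh * kw


# A 224x224 input passed through the stride-2 stem gives a 112x112 feature map.
stem = conv2d_macs(3, 32, (3, 3), 112, 112)
depthwise = conv2d_macs(32, 32, (3, 3), 112, 112, groups=32)
pointwise = conv2d_macs(32, 16, (1, 1), 112, 112)

print(stem, depthwise, pointwise)
```

The depthwise convolution (`groups=32`) costs far less than a regular 3x3 convolution with the same channel counts would, which is the main reason MobileNet is cheap to run.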

### Optimization (10%)

As we are working on a classification problem, we will apply [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) as our loss function to optimize the model:

In [13]:
# TODO:
# Apply cross entropy as our loss function
criterion = nn.CrossEntropyLoss()
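
`nn.CrossEntropyLoss` expects raw logits and integer class indices, so the model does not need a softmax layer of its own. The following sketch, with arbitrary illustrative logits, compares the built-in loss against a manual computation of the mean negative log-softmax probability of the target class.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Two samples, three classes; the values are arbitrary illustrative logits.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
targets = torch.tensor([0, 1])

# Manual computation: negative log-softmax probability of the target class, averaged.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

builtin = nn.CrossEntropyLoss()(logits, targets)
print(manual.item(), builtin.item())
```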

We also need to choose an optimizer for the model:

In [14]:
# TODO:
# Choose an optimizer.
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)

(Optional) We can apply a learning rate scheduler during the training:

In [15]:
# TODO(optional):
scheduler = ExponentialLR(optimizer, gamma=0.99)
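
`ExponentialLR` multiplies the learning rate by `gamma` on every call to `scheduler.step()`, so how often it is called matters a lot. The sketch below compares stepping once per epoch with stepping once per batch using plain arithmetic. The figure of 400 batches per epoch assumes 50,000 training images and a batch size of 125, and `lr_after` is a hypothetical helper.

```python
def lr_after(base_lr: float, gamma: float, steps: int) -> float:
    """Learning rate after `steps` calls to scheduler.step() with exponential decay."""
    return base_lr * gamma ** steps


LEARNING_RATE = 1e-4
GAMMA = 0.99
NUM_EPOCH = 3
BATCHES_PER_EPOCH = 400  # 50,000 training images / batch size 125

per_epoch = lr_after(LEARNING_RATE, GAMMA, NUM_EPOCH)                      # one step per epoch
per_batch = lr_after(LEARNING_RATE, GAMMA, NUM_EPOCH * BATCHES_PER_EPOCH)  # one step per batch

print(f"stepping per epoch: {per_epoch:.3e}")
print(f"stepping per batch: {per_batch:.3e}")
```

Stepping per batch with `gamma=0.99` shrinks the learning rate by several orders of magnitude within three epochs, which would effectively stop training. In this notebook the `scheduler.step()` call is commented out, so the learning rate stays at its initial value.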

### Training (25%)

We first define the function that optimizes the model for one batch:

In [16]:
def train_one_batch(
  model: nn.Module,
  criterion: nn.Module,
  optimizer: Optimizer,
  # scheduler: LRScheduler,
  inputs: torch.Tensor,
  targets: torch.Tensor
) -> None:

  # TODO:
  # Step 1: Reset the gradients (from the last iteration)
  optimizer.zero_grad()

  # Step 2: Forward inference
  preds = model(inputs)

  # Step 3: Calculate the loss
  loss = criterion(preds, targets)

  # Step 4: Backward propagation
  loss.backward()

  # Step 5: Update the model parameters
  optimizer.step()

  # (Optional Step 6: scheduler)
  # scheduler.step()
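
Before running the five steps on MobileNet, it can help to sanity-check them on a toy problem that finishes instantly on the CPU. The sketch below uses stand-ins of our own choosing: a linear classifier, one fixed random batch, and plain SGD. The loss should go down when the same batch is fitted repeatedly.

```python
import torch
from torch import nn
from torch.optim import SGD

torch.manual_seed(0)

# Toy stand-ins for MobileNet and CIFAR-10: a linear classifier and one fixed random batch.
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.1)
inputs = torch.randn(16, 4)
targets = torch.randint(0, 3, (16,))

losses = []
for _ in range(20):
    optimizer.zero_grad()              # Step 1: reset the gradients
    preds = model(inputs)              # Step 2: forward pass
    loss = criterion(preds, targets)   # Step 3: compute the loss
    loss.backward()                    # Step 4: backpropagate
    optimizer.step()                   # Step 5: update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```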


We then define the training function:

In [17]:
def train(
    model: nn.Module,
    dataflow: DataLoader,
    criterion: nn.Module,
    optimizer: Optimizer
):

  model.train()

  for inputs, targets in tqdm(dataflow, desc='train', leave=False):
    # Move the data from CPU to GPU
    inputs = inputs.cuda()
    targets = targets.cuda()

    # Call train_one_batch function
    train_one_batch(model, criterion, optimizer, inputs, targets)

Last, we define the evaluation function:

In [18]:
def evaluate(
  model: nn.Module,
  dataflow: DataLoader
) -> float:

  model.eval()
  num_samples = 0
  num_correct = 0

  with torch.no_grad():
    for inputs, targets in tqdm(dataflow, desc="eval", leave=False):
      # TODO:
      # Step 1: Move the data from CPU to GPU
      inputs = inputs.cuda()
      targets = targets.cuda()

      # Step 2: Forward inference
      outs = model(inputs)

      # Step 3: Convert logits to class indices (predicted class)
      predicts = torch.argmax(outs, dim=-1)

      # Update metrics
      num_samples += targets.size(0)
      num_correct += (predicts == targets).sum()

  return (num_correct / num_samples * 100).item()
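
The metric computation inside `evaluate` can be traced on a tiny hand-made example. The logits and targets below are illustrative values only. Note that the result is a percentage between 0 and 100, not a fraction.

```python
import torch

# Illustrative logits for 4 samples and 3 classes.
logits = torch.tensor([[2.0, 1.0, 0.0],
                       [0.0, 3.0, 1.0],
                       [1.0, 0.0, 5.0],
                       [4.0, 0.0, 1.0]])
targets = torch.tensor([0, 1, 2, 1])

predicts = torch.argmax(logits, dim=-1)      # index of the largest logit per row
num_correct = (predicts == targets).sum()    # 0-dim tensor counting the matches
accuracy = (num_correct / targets.size(0) * 100).item()

print(predicts.tolist(), accuracy)
```

Three of the four predictions match their targets, so the accuracy is 75.0.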

With training and evaluation functions, we can finally start training the model!

If the training is done properly, the accuracy should easily exceed 92.5% (the evaluation function returns a percentage):

In [19]:
for epoch_num in (range(1, NUM_EPOCH + 1)):
  train(model, dataflow["train"], criterion, optimizer)
  acc = evaluate(model, dataflow["test"])
  print(f"epoch {epoch_num}:", acc)

print(f"final accuracy: {acc}")

                                                        

epoch 1: 90.8499984741211
epoch 2: 92.45999908447266
epoch 3: 93.20999908447266
final accuracy: 93.20999908447266




Save the weights of the model as "model.pt":

In [20]:
# TODO:
# Save the model weight
torch.save(model.state_dict(), "model.pt")

You will find "model.pt" in the current folder.
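
A state dict stores only tensors, keyed by each layer's position in the module tree, so the same structure has to be rebuilt before the weights can be loaded again. The sketch below illustrates this with a small stand-in model of our own that mimics the nesting of the modified classifier head. It saves to an in-memory buffer rather than a file.

```python
import io

import torch
from torch import nn


def build_model() -> nn.Module:
    """Small stand-in with a Sequential nested inside a Sequential."""
    return nn.Sequential(
        nn.Dropout(0.2),
        nn.Sequential(nn.Dropout(0.2), nn.Linear(8, 2)),
    )


model = build_model()

# The keys encode each layer's position in the module tree.
keys = list(model.state_dict().keys())
print(keys)

# Save to an in-memory buffer instead of a file, then load into a fresh copy.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

restored = build_model()
restored.load_state_dict(torch.load(buffer))

same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

If the rebuilt model used a plain `nn.Linear` in place of the nested `nn.Sequential`, the keys would no longer match and `load_state_dict` would raise an error.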

### Export Model (5%)

We can also export the model in [ONNX format](https://pytorch.org/docs/stable/onnx_torchscript.html):

In [22]:
import torch.onnx

# TODO:
# Specify the input shape
dummy_input = torch.randn(10, 3, 224, 224, device="cuda")

onnx_path = 'model.onnx'

# TODO:
# Export the model to ONNX format
torch.onnx.export(
  model,
  dummy_input,
  onnx_path,
  verbose=True
)

print(f"Model exported to {onnx_path}")

Model exported to model.onnx


Once the model is in ONNX format, we can inspect its structure using [Netron](https://netron.app/).

**Please download the model structure and hand in as YourID_onnx.png.**

### Inference (10%)

Load the saved model weight:



In [24]:
# TODO:
# Step 1: Get the model structure (mobilenet_v2 and the classifier)
loaded_model = mobilenet_v2()
loaded_model.classifier[1] = nn.Sequential(
  nn.Dropout(0.2),
  nn.Linear(1280, NUM_CLASSES)
)

# Step 2: Load the model weight from "model.pt".
state = torch.load("model.pt")
loaded_model.load_state_dict(state)

# Step 3: Send the model from cpu to gpu
loaded_model.cuda()


MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

Run inference with the loaded model weight and check the accuracy:

In [25]:
acc = evaluate(loaded_model, dataflow["test"])
print(f"accuracy: {acc}")

                                                     

accuracy: 93.20999908447266




If the accuracy is the same as the accuracy before saving, you have completed PART 1.

Congratulations!

# **Part 2: Training and Inference with torch.compile**

```torch.compile``` is a new feature in PyTorch 2.0.

The following tutorial will help you get to know the usage.

[Introduction to torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)

### Training with torch.compile (15%)

We will reuse the ```train_one_batch``` function in the previous part.

To use ```torch.compile``` in training, we need to apply it to the ```train_one_batch``` function:

In [26]:
import torch._dynamo
# Note: torch._dynamo.reset() clears all compilation caches and restores the system to its initial state.
# Call it before compiling again in the same session.
torch._dynamo.reset()

# TODO:
# Apply torch.compile on train_one_batch
train_one_batch_opt = torch.compile(train_one_batch)

We define a new training function with ```train_one_batch_opt```:

In [27]:
def train_opt(
    model: nn.Module,
    dataflow: DataLoader,
    criterion: nn.Module,
    optimizer: Optimizer
):

  model.train()

  for inputs, targets in tqdm(dataflow, desc='train'):
    # TODO:
    # Move the data from CPU to GPU
    inputs = inputs.cuda()
    targets = targets.cuda()

    # Call train_one_batch_opt
    train_one_batch_opt(model, criterion, optimizer, inputs, targets)


We can observe that the first epoch takes longer than the rest, because ```torch.compile``` compiles the function the first time it is called. The accuracy should again easily exceed 92.5%:

In [29]:
# TODO:
# Prepare model
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
model.classifier[1] = nn.Sequential(
  nn.Dropout(0.2),
  nn.Linear(1280, NUM_CLASSES)
)

# TODO:
# Send the model from cpu to gpu
model.cuda()

# TODO:
# Apply cross entropy as our loss function
criterion = nn.CrossEntropyLoss()

# TODO:
# Choose an optimizer.
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)

for epoch_num in (range(1, NUM_EPOCH + 1)):
  train_opt(model, dataflow["train"], criterion, optimizer)
  acc = evaluate(model, dataflow["test"])
  print(f"epoch {epoch_num}:", acc)

print(f"final accuracy: {acc}")

train:   0%|          | 0/400 [00:00<?, ?it/s]

train: 100%|██████████| 400/400 [01:28<00:00,  4.50it/s]
                                                     

epoch 1: 90.88999938964844


train: 100%|██████████| 400/400 [01:00<00:00,  6.64it/s]
                                                     

epoch 2: 92.08999633789062


train: 100%|██████████| 400/400 [01:00<00:00,  6.63it/s]
                                                     

epoch 3: 92.81999206542969
final accuracy: 92.81999206542969
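
In the log above, the first epoch takes about 88 seconds and the later ones about 60 seconds. One way to separate such a one-time warm-up cost from the steady-state cost is to time every call individually. The sketch below shows the idea with a hypothetical stand-in step that sleeps on its first call to simulate compilation. It does not call `torch.compile` itself.

```python
import time


def time_calls(fn, num_calls: int) -> list:
    """Return the wall-clock duration of each call to `fn`, in seconds."""
    durations = []
    for _ in range(num_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def make_step_with_warmup():
    """Build a hypothetical stand-in for a compiled step: slow on the first call only."""
    state = {"warmed_up": False}

    def step():
        if not state["warmed_up"]:
            time.sleep(0.05)   # simulated one-time compilation cost
            state["warmed_up"] = True
        time.sleep(0.001)      # simulated steady-state work

    return step


durations = time_calls(make_step_with_warmup(), 5)
print(f"first call: {durations[0]:.3f}s, slowest later call: {max(durations[1:]):.3f}s")
```

When timing real GPU code this way, CUDA kernels run asynchronously, so a measurement would also need `torch.cuda.synchronize()` before each clock reading.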


### Inference with torch.compile (15%)

Load the saved model trained in Part 1 and apply torch.compile to the loaded model:

In [30]:
# TODO:
# Step 1: Get the model structure (mobilenet_v2 and the classifier)
loaded_model = mobilenet_v2()
loaded_model.classifier[1] = nn.Sequential(
  nn.Dropout(0.2),
  nn.Linear(1280, NUM_CLASSES)
)

# Step 2: Load the model weight from "model.pt".
state = torch.load("model.pt")
loaded_model.load_state_dict(state)

# Step 3: Send the model from cpu to gpu
loaded_model.cuda()

torch._dynamo.reset()
# Step 4: Apply torch.compile on loaded_model
loaded_model_opt = torch.compile(loaded_model)

# Inference
acc = evaluate(loaded_model_opt, dataflow["test"])
print(f"accuracy: {acc}")

eval:   0%|          | 0/80 [00:00<?, ?it/s]

                                                     

accuracy: 93.2199935913086
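
The accuracy printed here is very close to, but not bit-for-bit equal to, the 93.20999908447266 obtained in Part 1. The sketch below converts the gap into a number of test images, assuming the full CIFAR-10 test set of 10,000 images, and compares the two values with a tolerance. The tolerance is an arbitrary choice of ours. A plausible cause is small floating-point differences between compiled and eager execution, but we have not verified this.

```python
import math

NUM_TEST_IMAGES = 10_000          # size of the CIFAR-10 test set
acc_eager = 93.20999908447266     # Part 1, loaded model without torch.compile
acc_compiled = 93.2199935913086   # Part 2, loaded model with torch.compile

# Convert the percentage gap into a number of test images.
diff_images = round(abs(acc_compiled - acc_eager) / 100 * NUM_TEST_IMAGES)

# Compare with a tolerance (here 0.05 percentage points, an arbitrary choice).
matches = math.isclose(acc_eager, acc_compiled, abs_tol=0.05)
print(diff_images, matches)
```

The gap corresponds to a single test image being classified differently.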


If the accuracy matches the accuracy from Part 1, up to a small numerical difference, you have completed PART 2.

Congratulations!