# Lab04 Task 2-2: Manual Post-Training Static Quantization

In this notebook, we will try to manually quantize the pretrained model.

<font color="red">**Only add or modify code between `YOUR CODE START` and `YOUR CODE END`. Don’t change anything outside of these markers.**</font>

In [1]:
##### YOUR CODE START #####

# Please fill in your student id here.
student_id = "314580042"

##### YOUR CODE END #####

### Library Import

The libraries you need for this practice are listed below. You can add more if you think they’re necessary. If you’re not sure whether a library is allowed, ask the TA in the FB group.

In [2]:
import os
import tqdm
import torch
from torch import nn
from torch.utils.data import DataLoader
import torchvision
from torchvision import datasets, transforms, models
import copy
from resnet20_int8 import (
    QuantizedTensor,
    QuantizedCifarResNet,
    QuantizeLayer,
    QuantizedConv2d,
    QuantizedConvReLU2d,
    QuantizedReLU,
    QuantizedLinear,
    QuantizedAdaptiveAvgPool2d,
    QuantizedAdd,
    QuantizedFlatten,
)
import matplotlib

##### YOUR CODE START #####

# Do you need any additional libraries? If not, you can leave this block empty.
# For this task, you must attempt to manually quantize the model. Therefore, using any libraries that perform automatic quantization or calculate scale/zero-point values is prohibited.

##### YOUR CODE END #####

### Device

If you have a GPU available, the following cell should print "cuda".

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device: %s" % device)

Using device: cuda


### Dataset

In this lab, we will use the CIFAR-10 dataset. CIFAR-10 is a widely used image classification dataset consisting of 60,000 color images at 32×32 resolution. It has 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with 50,000 training images and 10,000 test images. Due to its small size and balanced categories, CIFAR-10 is commonly used for benchmarking machine learning and computer vision models.

CIFAR-10 has both a training set and a test set. Post-training static quantization requires a small subset of the training set for calibration. On the other hand, manually quantizing the convolutional layer weights is a data-free process. The test set is only used at the end to evaluate the final result.
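
To illustrate why weight quantization is data-free, here is a minimal, dependency-free sketch of per-output-channel `int8` quantization. The toy weights and the helper names `affine_params` and `quantize_channel` are made up for illustration and are not part of `resnet20_int8.py`. The scale and zero-point come only from each channel's own min/max, so no images are involved.

```python
# Sketch only: per-output-channel int8 weight quantization on toy values.
# No input data is needed; the range is read from the weights themselves.

def affine_params(min_val, max_val, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping [min_val, max_val] onto [qmin, qmax]."""
    if max_val == min_val:
        return 1.0, 0
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = qmin - round(min_val / scale)
    return scale, max(qmin, min(qmax, zero_point))

def quantize_channel(values, scale, zero_point, qmin=-128, qmax=127):
    """Quantize one channel: q = clamp(round(w / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

# Hypothetical weights: two output channels with very different ranges.
toy_weight = [
    [-0.40, 0.00, 0.25, 0.60],   # output channel 0
    [-0.02, 0.01, 0.00, 0.03],   # output channel 1 (much smaller range)
]

for ch, row in enumerate(toy_weight):
    scale, zp = affine_params(min(row), max(row))
    q = quantize_channel(row, scale, zp)
    dequantized = [(qi - zp) * scale for qi in q]
    max_err = max(abs(a - b) for a, b in zip(row, dequantized))
    print(f"channel {ch}: scale={scale:.6f}, zp={zp}, q={q}, max_err={max_err:.6f}")
```

Because each channel gets its own scale, the small-range channel is not forced onto the coarse grid of the large-range one. Activation ranges, by contrast, depend on the inputs, which is why they need calibration data.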

In [4]:
# Load training & test set

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), 
                         (0.2023, 0.1994, 0.2010))
])

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=128,
                                         shuffle=False)
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                       download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                         shuffle=False)

### Load Model

In this lab, you do not need to train a model from scratch. We will use a pretrained ResNet20 model instead. ResNet20 is a popular deep learning model for image classification. Its key feature is the use of skip (residual) connections, which make training deep networks easier and more stable.
The code below loads the pretrained model and evaluates its accuracy on the test set, which should be about <font color="red">**92.60%**</font> (the run below reports 92.59%). Please use this model for the subsequent tasks. <font color="red">**Designing and training your own model is not allowed.**</font>

In [5]:
model = torch.hub.load('chenyaofo/pytorch-cifar-models', 
                       'cifar10_resnet20', pretrained=True).to(device)
model.eval()

Using cache found in /export/home/dl2025f/dl2025f_116/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master


CifarResNet(
  (conv1): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias

In [6]:
def test_acc(model_test):
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in testloader:
            images = images.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)
            outputs = model_test(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    acc = 100 * correct / total
    print(f'Accuracy on CIFAR-10 test set: {acc:.2f}%')
    return acc
    
_ = test_acc(model)

Accuracy on CIFAR-10 test set: 92.59%


### Manually Quantizing the Model

While PyTorch's FX graph mode is very convenient, it abstracts away the details of how quantized weights are actually calculated. Therefore, in this section, you will attempt to manually quantize the model.

To pass this lab, the test accuracy of your manually quantized model must be higher than <font color="red">**90.00%.**</font>

<font color="red">**Please be aware of the following rules. Violating them will result in a score of zero for this section:**</font>

1. Your modifications to the model are strictly limited to populating the parameters of the `QuantizedCifarResNet` model. Any other operations, including but not limited to retraining or changing the model architecture, are forbidden.

2. You must explicitly show your calculation process. Using any function that automatically computes scale / zero_point or gathers statistics is prohibited. (The pre-defined observer from the previous task is prohibited, but you may use `torch.max` and `torch.min`, or define your own observer.) You must also not assign numerical values directly without demonstrating how they were derived.

### Introduction to QuantizedCifarResNet

`QuantizedCifarResNet` is a modified version of the standard `CifarResNet` architecture, specifically adapted for **integer-only inference**. Unlike the original `CifarResNet` which performs computations using 32-bit floating-point (FP32) numbers, this quantized version primarily uses 8-bit integer arithmetic (`int8` for weights, `uint8` for activations) for most of its operations. This significantly reduces model size and can lead to faster inference speeds on hardware with specialized integer instruction support.

The key differences arise from replacing standard PyTorch layers (`nn.Conv2d`, `nn.ReLU`, `nn.Linear`, etc.) with custom-defined quantized layer equivalents. These custom layers require specific **quantization parameters** (scale and zero-point) to map the integer values back to the approximate floating-point range, ensuring the model maintains reasonable accuracy. Data between these layers is passed using a `QuantizedTensor` wrapper object, which bundles the `uint8` tensor data with its corresponding `scale` and `zero_point`.
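
The mapping between integers and approximate real values is the usual affine relation `x ≈ scale * (q - zero_point)`. The sketch below shows the round trip in plain Python. The `scale` and `zero_point` values are illustrative rather than calibrated, and the helper names are hypothetical.

```python
# Sketch only: affine uint8 quantization and dequantization of single values.

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to a uint8 code, saturating at the ends of the range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map a uint8 code back to its approximate real value."""
    return (q - zero_point) * scale

scale, zero_point = 0.02, 120   # illustrative values, not calibrated

for x in (-2.4, 0.0, 1.0, 5.0):
    q = quantize(x, scale, zero_point)
    print(f"x={x:+.2f} -> q={q:3d} -> x_hat={dequantize(q, scale, zero_point):+.2f}")
```

Values inside the representable range come back with an error of at most half a step (`scale / 2`), while values outside it (here `5.0`) saturate at the largest code. This is why the calibrated range matters for accuracy.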

Here's a breakdown of the custom quantized layers used in this implementation and the parameters they typically require **after initialization** (usually set via methods like `set_..._params` or `set_..._weight` after calibration):

1.  **`QuantizeLayer`**:
    * **Role**: The entry point, converts the initial `float32` input tensor into a `QuantizedTensor` (`uint8`).
    * **Parameters Needed**:
        * `output_scale (float)`: The scale factor for the output activation.
        * `output_zero_point (int)`: The zero-point for the output activation.

2.  **`QuantizedConv2d`** (Used for `conv2` in BasicBlock and `downsample`):
    * **Role**: Performs 2D convolution using `int8` weights and `uint8` activations, producing a `uint8` output. Uses `fp32` bias.
    * **Parameters Needed**:
        * `weight_int8 (torch.Tensor[int8])`: The quantized weights (typically obtained after fusing BatchNorm).
        * `weight_scale (torch.Tensor[float32])`: The **per-channel** scale factor for the weights.
        * `weight_zero_point (torch.Tensor[int32])`: The **per-channel** zero-point for the weights.
        * `bias_fp32 (torch.Tensor[float32])`: The fused `float32` bias term.
        * `output_scale (float)`: The scale factor for the output activation.
        * `output_zero_point (int)`: The zero-point for the output activation.

3.  **`QuantizedConvReLU2d`** (Used for `conv1` in BasicBlock and the first `conv1` of the network):
    * **Role**: Fuses `Conv2d` and `ReLU` operations. Similar to `QuantizedConv2d` but applies ReLU before the final requantization step. Uses `fp32` bias. **The output zero-point is implicitly 0 due to ReLU.**
    * **Parameters Needed**:
        * `weight_int8 (torch.Tensor[int8])`
        * `weight_scale (torch.Tensor[float32])` (**Per-channel**)
        * `weight_zero_point (torch.Tensor[int32])` (**Per-channel**)
        * `bias_fp32 (torch.Tensor[float32])`
        * `output_scale (float)` (Output zero-point is fixed to 0 internally).

4.  **`QuantizedReLU`**:
    * **Role**: Applies ReLU activation directly on the `uint8` tensor by clamping values below the input `zero_point`.
    * **Parameters Needed**: None (It's stateless and uses the parameters from the input `QuantizedTensor`).

5.  **`QuantizedAdd`**:
    * **Role**: Performs element-wise addition of two `QuantizedTensor` inputs (requiring dequantization, float addition, and requantization). Used for residual connections.
    * **Parameters Needed**:
        * `output_scale (float)`: The scale factor for the resulting summed activation.
        * `output_zero_point (int)`: The zero-point for the resulting summed activation.

6.  **`QuantizedAdaptiveAvgPool2d`**:
    * **Role**: Performs adaptive average pooling on the `uint8` tensor.
    * **Parameters Needed**: None (Stateless, passes through the input scale/zero-point after integer averaging).

7.  **`QuantizedFlatten`**:
    * **Role**: Flattens the `uint8` tensor while preserving scale/zero-point.
    * **Parameters Needed**: None (Stateless).

8.  **`QuantizedLinear`** (Used as the final `fc` layer):
    * **Role**: Performs a linear transformation using `int8` weights and `uint8` input, producing a `float32` output (common for the final classification layer). Uses `fp32` bias.
    * **Parameters Needed**:
        * `weight_int8 (torch.Tensor[int8])`
        * `weight_scale (torch.Tensor[float32])` (**Per-channel/output feature**)
        * `weight_zero_point (torch.Tensor[int32])` (**Per-channel/output feature**)
        * `bias_fp32 (torch.Tensor[float32])`

You can find more details of `QuantizedCifarResNet` in `resnet20_int8.py`
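
As a rough illustration of the two stateless-looking operations described above, the sketch below mimics `QuantizedAdd` (dequantize, add in float, requantize) and `QuantizedReLU` (clamp at the zero-point) on single values. It follows only the role descriptions in the list. The actual code in `resnet20_int8.py` may differ in details such as rounding, and all numbers and function names here are hypothetical.

```python
# Sketch only: scalar versions of the add and ReLU behaviour described above.

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def quantized_add(qa, a_params, qb, b_params, out_params):
    """Dequantize both operands, add in float, requantize with the output params."""
    total = dequantize(qa, *a_params) + dequantize(qb, *b_params)
    return quantize(total, *out_params)

def quantized_relu(q, zero_point):
    """Real zero is stored as zero_point, so ReLU clamps codes below it."""
    return max(q, zero_point)

# Hypothetical (scale, zero_point) pairs, chosen only for illustration.
a_params, b_params, out_params = (0.02, 100), (0.01, 50), (0.04, 64)
qa, qb = 150, 10   # these codes represent 1.0 and -0.4

q_sum = quantized_add(qa, a_params, qb, b_params, out_params)
print("sum code:", q_sum, "represents", dequantize(q_sum, *out_params))
print("relu on a negative code:", quantized_relu(40, out_params[1]))
```

The two inputs may carry different scales and zero-points, which is why the add needs its own calibrated output parameters instead of reusing those of either operand.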

In [7]:
model_manual = QuantizedCifarResNet().to(device)
model_manual.eval()

QuantizedCifarResNet(
  (quant): QuantizeLayer(output_scale=1.000000, output_zero_point=0)
  (conv1): QuantizedConvReLU2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1, bias=True)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): QuantizedConvReLU2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1, bias=True)
      (conv2): QuantizedConv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1, bias=True)
      (relu): QuantizedReLU(Quantized ReLU (uint8 clamp at zero_point))
      (add): QuantizedAdd()
    )
    (1): BasicBlock(
      (conv1): QuantizedConvReLU2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1, bias=True)
      (conv2): QuantizedConv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1, bias=True)
      (relu): QuantizedReLU(Quantized ReLU (uint8 clamp at zero_point))
      (add): QuantizedAdd()
    )
    (2): BasicBlock(
      (conv1): QuantizedConvReLU2d(16, 16, kerne

#### 1. Prepare the Model

Let's manually quantize the model step-by-step. We'll start with the preparation phase. Although PyTorch's FX graph mode offers `prepare_fx` to automatically insert observers and fuse layers (e.g., `Conv2d`, `BatchNorm2d`, `ReLU`), we need to do this manually here. So, we'll define our own observer now and insert it into the FP32 model. Layer fusion mainly concerns weight recalculation, so we'll handle that later in step three. The cell below defines a min/max observer and registers it as a forward hook on the layers to monitor, starting with the initial `bn1` layer.

In [8]:
##### YOUR CODE START #####

# Define the Observer class (can be used as a hook)
class Observer:
    def __init__(self):
        # Initialize min/max to capture the first batch's range.
        self.min_val = float('inf')
        self.max_val = float('-inf')

    def __call__(self, module: nn.Module, inputs: tuple, output: torch.Tensor):
        # Hook function executed after the module's forward pass.
        # Detach tensor for efficiency and get scalar min/max.
        # module (nn.Module): The module the hook is registered on.
        # inputs (tuple): A tuple containing the input(s) to the module.
        # output (torch.Tensor): The output tensor from the module.
        batch_min = output.detach().min().item()
        batch_max = output.detach().max().item()

        # Update overall min/max seen so far.
        self.min_val = min(self.min_val, batch_min)
        self.max_val = max(self.max_val, batch_max)

    def get_min_max(self) -> tuple[float, float]:
        # Returns the overall observed min and max values.
        return self.min_val, self.max_val

    def reset(self):
        # Resets the observed min/max values.
        self.min_val = float('inf')
        self.max_val = float('-inf')

# Create a copy to attach hooks without modifying the original model.
model_prepared = copy.deepcopy(model)

# Dictionary to store all observers for each layer we want to monitor
observers = {}

# We need to observe activations at every point that will be quantized.
# For ResNet20, that means:
# 1. The raw network input (its range is tracked by hand during calibration)
# 2. The output of each Conv+BN or Conv+BN+ReLU fusion point
# 3. The output of each residual Add (captured via the block's ReLU)

# Register observers on key layers
# Initial layers
observers['input'] = Observer()
# Observe bn1 (pre-ReLU) so we can derive post-ReLU range for the first fused conv
observers['bn1'] = Observer()
model_prepared.bn1.register_forward_hook(observers['bn1'])
# (optional) keep relu observer if you also use it elsewhere
observers['relu'] = Observer()
model_prepared.relu.register_forward_hook(observers['relu'])

# Layer 1 (3 blocks)
for i in range(3):
    observers[f'layer1.{i}.bn1'] = Observer()
    observers[f'layer1.{i}.bn2'] = Observer()
    observers[f'layer1.{i}.relu'] = Observer()
    model_prepared.layer1[i].bn1.register_forward_hook(observers[f'layer1.{i}.bn1'])
    model_prepared.layer1[i].bn2.register_forward_hook(observers[f'layer1.{i}.bn2'])
    model_prepared.layer1[i].relu.register_forward_hook(observers[f'layer1.{i}.relu'])

# Layer 2 (3 blocks)
for i in range(3):
    observers[f'layer2.{i}.bn1'] = Observer()
    observers[f'layer2.{i}.bn2'] = Observer()
    observers[f'layer2.{i}.relu'] = Observer()
    model_prepared.layer2[i].bn1.register_forward_hook(observers[f'layer2.{i}.bn1'])
    model_prepared.layer2[i].bn2.register_forward_hook(observers[f'layer2.{i}.bn2'])
    model_prepared.layer2[i].relu.register_forward_hook(observers[f'layer2.{i}.relu'])
    # Downsample in first block
    if i == 0 and model_prepared.layer2[i].downsample is not None:
        observers['layer2.0.downsample.1'] = Observer()
        model_prepared.layer2[i].downsample[1].register_forward_hook(observers['layer2.0.downsample.1'])

# Layer 3 (3 blocks)
for i in range(3):
    observers[f'layer3.{i}.bn1'] = Observer()
    observers[f'layer3.{i}.bn2'] = Observer()
    observers[f'layer3.{i}.relu'] = Observer()
    model_prepared.layer3[i].bn1.register_forward_hook(observers[f'layer3.{i}.bn1'])
    model_prepared.layer3[i].bn2.register_forward_hook(observers[f'layer3.{i}.bn2'])
    model_prepared.layer3[i].relu.register_forward_hook(observers[f'layer3.{i}.relu'])
    # Downsample in first block
    if i == 0 and model_prepared.layer3[i].downsample is not None:
        observers['layer3.0.downsample.1'] = Observer()
        model_prepared.layer3[i].downsample[1].register_forward_hook(observers['layer3.0.downsample.1'])

print(f"Registered {len(observers)} observers on the model")

##### YOUR CODE END #####

Registered 32 observers on the model


#### 2. Calibrate the Model

Next, we need to calibrate the model. This step is quite similar to the process when using PyTorch's FX graph mode. It simply involves feeding data from the training set (or a representative subset of it) through the model we prepared earlier (the one with observers attached). Please note: <font color="red">**it is crucial not to use the test set data for calibration.**</font>

In [9]:
##### YOUR CODE START #####

# Perform calibration here.

# Use a subset of training data for calibration
calibration_batches = 50  # Use 50 batches (6400 samples) for calibration

print("Starting calibration...")
model_prepared.eval()

with torch.no_grad():
    batch_count = 0
    # Track input statistics separately
    input_min = float('inf')
    input_max = float('-inf')
    
    for images, _ in trainloader:
        if batch_count >= calibration_batches:
            break
        
        images = images.to(device)
        
        # Track input statistics
        batch_input_min = images.min().item()
        batch_input_max = images.max().item()
        input_min = min(input_min, batch_input_min)
        input_max = max(input_max, batch_input_max)
        
        # Forward pass to collect statistics via observers
        _ = model_prepared(images)
        batch_count += 1
    
    observers['input'].min_val = input_min
    observers['input'].max_val = input_max

print(f"Calibration completed with {batch_count} batches")
print(f"Input range: [{input_min:.4f}, {input_max:.4f}]")

# Print some statistics to verify calibration
print("\nSample observer statistics:")
print(f"relu (first layer): [{observers['relu'].min_val:.4f}, {observers['relu'].max_val:.4f}]")
print(f"layer1.0.relu: [{observers['layer1.0.relu'].min_val:.4f}, {observers['layer1.0.relu'].max_val:.4f}]")

##### YOUR CODE END #####

Starting calibration...
Calibration completed with 50 batches
Input range: [-2.4291, 2.7537]

Sample observer statistics:
relu (first layer): [0.0000, 3.8333]
layer1.0.relu: [0.0000, 4.6231]
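
The printed ranges can be turned into quantization parameters by hand with the min-max formula. The sketch below plugs in the two ranges shown above (rounded to four decimals, so the results agree only to the displayed precision). The helper name `scale_zero_point` is hypothetical.

```python
# Sketch only: derive uint8 scale / zero-point from the ranges printed above.

def scale_zero_point(min_val, max_val, qmin=0, qmax=255):
    """Min-max affine parameters mapping [min_val, max_val] onto [qmin, qmax]."""
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = qmin - round(min_val / scale)
    return scale, max(qmin, min(qmax, zero_point))

# Ranges copied from the calibration printout (rounded to 4 decimals).
input_scale, input_zp = scale_zero_point(-2.4291, 2.7537)
relu_scale, relu_zp = scale_zero_point(0.0, 4.6231)

print(f"input:         scale={input_scale:.6f}, zp={input_zp}")
print(f"layer1.0.relu: scale={relu_scale:.6f}, zp={relu_zp}")
```

These values match the `Input: scale=0.020325, zp=120` and layer1 block 0 `add: output_scale=0.018130, output_zp=0` lines in the step 3 conversion log. A post-ReLU range starts at 0, so its zero-point is 0.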


#### 3. Convert the Model

Finally, we need to populate the `QuantizedCifarResNet` with its parameters. You will need to iterate through all layers in the quantized model and set the required parameters (such as quantized weights, scales, zero-points, and biases) based on their type. The necessary data should be obtained from the original model and the observers inserted previously. Additionally, note that consecutive `Conv2d`, `BatchNorm2d`, and `ReLU` layers in the original model have been fused into corresponding single layers in `QuantizedCifarResNet`. You must adjust the `Conv2d` weights and biases according to the parameters of the corresponding `BatchNorm2d` layer.
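
The BatchNorm adjustment can be checked on a single channel with scalar arithmetic. In eval mode, BatchNorm computes `gamma * (y - mean) / sqrt(var + eps) + beta`, so folding it into a preceding `y = w * x + b` gives `w_fused = w * gamma / sqrt(var + eps)` and `b_fused = (b - mean) * gamma / sqrt(var + eps) + beta`. The sketch below verifies this with made-up parameters. A real convolution applies the same per-output-channel factor to every weight in that channel.

```python
# Sketch only: fold BatchNorm into a one-weight "convolution" y = w * x + b
# and confirm that the fused layer reproduces conv followed by BN.
import math

def fuse_scalar_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return (w_fused, b_fused) for a single output channel."""
    factor = gamma / math.sqrt(var + eps)
    return w * factor, (b - mean) * factor + beta

# Made-up parameters for one output channel.
w, b = 0.8, 0.0            # the original convs use bias=False, so b = 0
gamma, beta, mean, var = 1.5, 0.1, 0.3, 4.0
eps = 1e-5

w_fused, b_fused = fuse_scalar_conv_bn(w, b, gamma, beta, mean, var, eps)

for x in (-1.0, 0.0, 2.5):
    conv_out = w * x + b
    bn_out = gamma * (conv_out - mean) / math.sqrt(var + eps) + beta
    fused_out = w_fused * x + b_fused
    assert math.isclose(bn_out, fused_out, rel_tol=1e-9, abs_tol=1e-12)

print(f"w_fused={w_fused:.6f}, b_fused={b_fused:.6f}")
```

Fusion must happen before weight quantization, because the per-channel factor changes each channel's weight range and therefore its scale.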

In [10]:
##### YOUR CODE START #####

# Iterate through all layers in the quantized model and set the required parameters (such as quantized weights, scales, zero-points, and biases) based on their type.

# Helper functions for quantization calculations
def calculate_scale_zero_point(min_val, max_val, dtype='uint8'):
    """
    Calculate scale and zero_point for quantization using the min-max method.
    
    Formula:
        scale = (max_val - min_val) / (qmax - qmin)
        zero_point = qmin - round(min_val / scale)
    
    where (qmin, qmax) = (0, 255) for uint8 and (-128, 127) for int8.
    """
    if dtype == 'uint8':
        qmin, qmax = 0, 255
    elif dtype == 'int8':
        qmin, qmax = -128, 127
    else:
        raise ValueError(f"Unsupported dtype: {dtype}")
    
    # Avoid division by zero
    if max_val == min_val:
        scale = 1.0
        zero_point = 0
    else:
        # Calculate scale
        scale = (max_val - min_val) / (qmax - qmin)
        # Calculate zero_point
        zero_point = qmin - round(min_val / scale)
        # Clamp zero_point to valid range
        zero_point = max(qmin, min(qmax, zero_point))
    
    return scale, int(zero_point)

def quantize_weight_per_channel(weight, dtype='int8'):
    """
    Quantize weights using per-channel (per output channel) quantization.
    Returns: weight_int8, weight_scale, weight_zero_point
    
    For each output channel:
        - Find min/max of weights in that channel
        - Calculate scale and zero_point
        - Quantize: q = round(w / scale) + zero_point
    """
    out_channels = weight.shape[0]
    weight_int = torch.zeros_like(weight, dtype=torch.int8)
    weight_scale = torch.zeros(out_channels, dtype=torch.float32)
    weight_zero_point = torch.zeros(out_channels, dtype=torch.int32)
    
    for i in range(out_channels):
        channel_weight = weight[i]
        # Calculate min/max for this channel
        w_min = torch.min(channel_weight).item()
        w_max = torch.max(channel_weight).item()
        
        # Calculate scale and zero_point
        scale, zp = calculate_scale_zero_point(w_min, w_max, dtype='int8')
        
        # Quantize the weights
        w_quant = torch.round(channel_weight / scale) + zp
        w_quant = torch.clamp(w_quant, -128, 127)
        
        weight_int[i] = w_quant.to(torch.int8)
        weight_scale[i] = scale
        weight_zero_point[i] = zp
    
    return weight_int, weight_scale, weight_zero_point

def fuse_conv_bn(conv, bn):
    """
    Fuse Conv2d and BatchNorm2d layers.
    
    BatchNorm formula: y = gamma * (x - mean) / sqrt(var + eps) + beta
    After fusion with conv:
        w_fused = w_conv * gamma / sqrt(var + eps)
        b_fused = (b_conv - mean) * gamma / sqrt(var + eps) + beta
    
    If conv has no bias: b_conv = 0
    """
    # Get conv parameters
    w_conv = conv.weight.data
    if conv.bias is not None:
        b_conv = conv.bias.data
    else:
        b_conv = torch.zeros(conv.out_channels, device=w_conv.device, dtype=w_conv.dtype)
    
    # Get BN parameters
    gamma = bn.weight.data
    beta = bn.bias.data
    mean = bn.running_mean.data
    var = bn.running_var.data
    eps = bn.eps
    
    # Calculate fusion parameters
    # std = sqrt(var + eps)
    std = torch.sqrt(var + eps)
    
    # w_fused = w_conv * gamma / std
    # Broadcasting: gamma/std has shape [out_channels], w_conv has shape [out_channels, in_channels, k, k]
    scale_factor = (gamma / std).view(-1, 1, 1, 1)
    w_fused = w_conv * scale_factor
    
    # b_fused = (b_conv - mean) * gamma / std + beta
    b_fused = (b_conv - mean) * gamma / std + beta
    
    return w_fused, b_fused

# Start populating the quantized model
print("Starting manual quantization...")

# Get the original model's modules for reference
orig_modules = dict(model.named_modules())

# 1. QuantizeLayer (input quantization)
print("\n1. Quantizing input layer...")
input_scale, input_zp = calculate_scale_zero_point(
    observers['input'].min_val,
    observers['input'].max_val,
    dtype='uint8'
)
model_manual.quant.set_output_quant_params(input_scale, input_zp)
print(f"   Input: scale={input_scale:.6f}, zp={input_zp}")

# 2. First conv1 + bn1 + relu (fused into QuantizedConvReLU2d)
# Use bn1 observer (pre-ReLU) to derive post-ReLU range, avoiding mixing with block-end ReLU
print("\n2. Quantizing conv1 + bn1 + relu (fused)...")
conv1 = orig_modules['conv1']
bn1 = orig_modules['bn1']
w_fused, b_fused = fuse_conv_bn(conv1, bn1)
w_int8, w_scale, w_zp = quantize_weight_per_channel(w_fused)
relu_min, relu_max = 0.0, max(0.0, observers['bn1'].max_val)
output_scale, _ = calculate_scale_zero_point(relu_min, relu_max, dtype='uint8')
model_manual.conv1.set_int8_weight(w_int8)
model_manual.conv1.set_weight_quant_params(w_scale, w_zp)
model_manual.conv1.set_fp32_bias(b_fused)
model_manual.conv1.set_output_quant_params(output_scale)
print(f"   Conv1: weight_scale shape={w_scale.shape}, output_scale={output_scale:.6f}")

# 3. Process each layer (layer1, layer2, layer3)
for layer_idx, layer_name in enumerate(['layer1', 'layer2', 'layer3']):
    print(f"\n{3+layer_idx}. Quantizing {layer_name}...")
    layer = getattr(model_manual, layer_name)
    orig_layer = getattr(model, layer_name)
    
    for block_idx in range(3):
        print(f"   Block {block_idx}:")
        block = layer[block_idx]
        orig_block = orig_layer[block_idx]
        
        # Conv1 + BN1 (fused into QuantizedConvReLU2d)
        w_fused, b_fused = fuse_conv_bn(orig_block.conv1, orig_block.bn1)
        w_int8, w_scale, w_zp = quantize_weight_per_channel(w_fused)
        # derive post-ReLU range from bn1 stats: [0, max(0, bn1_max)]
        relu_min, relu_max = 0.0, max(0.0, observers[f'{layer_name}.{block_idx}.bn1'].max_val)
        output_scale, _ = calculate_scale_zero_point(relu_min, relu_max, dtype='uint8')
        block.conv1.set_int8_weight(w_int8)
        block.conv1.set_weight_quant_params(w_scale, w_zp)
        block.conv1.set_fp32_bias(b_fused)
        block.conv1.set_output_quant_params(output_scale)
        print(f"      conv1: output_scale={output_scale:.6f}")
        
        # Conv2 + BN2 (fused into QuantizedConv2d)
        w_fused, b_fused = fuse_conv_bn(orig_block.conv2, orig_block.bn2)
        w_int8, w_scale, w_zp = quantize_weight_per_channel(w_fused)
        output_scale, output_zp = calculate_scale_zero_point(
            observers[f'{layer_name}.{block_idx}.bn2'].min_val,
            observers[f'{layer_name}.{block_idx}.bn2'].max_val,
            dtype='uint8'
        )
        block.conv2.set_int8_weight(w_int8)
        block.conv2.set_weight_quant_params(w_scale, w_zp)
        block.conv2.set_fp32_bias(b_fused)
        block.conv2.set_output_quant_params(output_scale, output_zp)
        print(f"      conv2: output_scale={output_scale:.6f}, output_zp={output_zp}")
        
        # Downsample (if exists)
        if orig_block.downsample is not None:
            print(f"      Processing downsample...")
            downsample_conv = orig_block.downsample[0]
            downsample_bn = orig_block.downsample[1]
            w_fused, b_fused = fuse_conv_bn(downsample_conv, downsample_bn)
            w_int8, w_scale, w_zp = quantize_weight_per_channel(w_fused)
            output_scale, output_zp = calculate_scale_zero_point(
                observers[f'{layer_name}.{block_idx}.downsample.1'].min_val,
                observers[f'{layer_name}.{block_idx}.downsample.1'].max_val,
                dtype='uint8'
            )
            block.downsample[0].set_int8_weight(w_int8)
            block.downsample[0].set_weight_quant_params(w_scale, w_zp)
            block.downsample[0].set_fp32_bias(b_fused)
            block.downsample[0].set_output_quant_params(output_scale, output_zp)
            print(f"         downsample: output_scale={output_scale:.6f}, output_zp={output_zp}")
        
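        # QuantizedAdd (after residual connection)
        # Note: in the FP32 BasicBlock the same `relu` module runs twice (after bn1
        # and after the residual add), so this observer holds the union of both
        # post-ReLU ranges. Its minimum is 0 here, which gives a zero-point of 0.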
        add_output_scale, add_output_zp = calculate_scale_zero_point(
            observers[f'{layer_name}.{block_idx}.relu'].min_val,
            observers[f'{layer_name}.{block_idx}.relu'].max_val,
            dtype='uint8'
        )
        block.add.set_output_quant_params(add_output_scale, add_output_zp)
        print(f"      add: output_scale={add_output_scale:.6f}, output_zp={add_output_zp}")

# 4. Final FC layer
print("\n6. Quantizing FC layer...")
fc = orig_modules['fc']
w = fc.weight.data
b = fc.bias.data if fc.bias is not None else torch.zeros(fc.out_features, device=w.device, dtype=w.dtype)
w_int8, w_scale, w_zp = quantize_weight_per_channel(w)
model_manual.fc.set_int8_weight(w_int8)
model_manual.fc.set_weight_quant_params(w_scale, w_zp)
model_manual.fc.set_fp32_bias(b)
print(f"   FC: weight_scale shape={w_scale.shape}")

print("\nManual quantization completed!")

##### YOUR CODE END #####

Starting manual quantization...

1. Quantizing input layer...
   Input: scale=0.020325, zp=120

2. Quantizing conv1 + bn1 + relu (fused)...
   Conv1: weight_scale shape=torch.Size([16]), output_scale=0.015032

3. Quantizing layer1...
   Block 0:
      conv1: output_scale=0.009787
      conv2: output_scale=0.025925, output_zp=117
      add: output_scale=0.018130, output_zp=0
   Block 1:
      conv1: output_scale=0.010260
      conv2: output_scale=0.018620, output_zp=94
      add: output_scale=0.022498, output_zp=0
   Block 2:
      conv1: output_scale=0.008037
      conv2: output_scale=0.026892, output_zp=126
      add: output_scale=0.019750, output_zp=0

4. Quantizing layer2...
   Block 0:
      conv1: output_scale=0.011693
      conv2: output_scale=0.024414, output_zp=130
      Processing downsample...
         downsample: output_scale=0.016035, output_zp=103
      add: output_scale=0.018474, output_zp=0
   Block 1:
      conv1: output_scale=0.006851
      conv2: output_scale=0.018758

In [11]:
# Let's see the result.
acc = test_acc(model_manual)

print("\n===========================================\n")

if acc < 90.0:
    print("Oh no! Your test accuracy is too low!")
else:
    print("Congratulations! You've achieved the goal of this task. Remember to save your model!")
    print("You can also try increasing accuracy further to earn a higher score!")

Accuracy on CIFAR-10 test set: 92.49%


Congratulations! You've achieved the goal of this task. Remember to save your model!
You can also try increasing accuracy further to earn a higher score!


### Save Model

You can use the code below to save your model as `[student_id]_quantization.pt`, where `[student_id]` is the student ID you entered in the first cell of this notebook.

In [12]:
file_name = student_id + "_quantization.pt"
# Save model.state_dict() instead of the entire model.
torch.save(model_manual.state_dict(), file_name)
print("Your model is saved to \"" + file_name + "\".")

Your model is saved to "314580042_quantization.pt".


### Final Check

The TA has provided `check_quantization.py` so you can check whether your model passes the tests. <font color="red">**Please make sure to run it before submission.**</font>

In [13]:
!python check_quantization.py --path {file_name}

Congratulations! You've achieved the goals of this task.
Your model's test accuracy is 92.49%.
