<sub>&copy; 2021-present Neuralmagic, Inc. // [Neural Magic Legal](https://neuralmagic.com/legal)</sub> 

# Sparse-Quantized Transfer Learning in PyTorch using SparseML

This notebook provides a step-by-step walkthrough for creating a performant sparse-quantized model
by transfer learning the pruned structure from an already sparse-quantized model.

Sparse-quantized models combine [pruning](https://neuralmagic.com/blog/pruning-overview/) and [quantization](https://arxiv.org/abs/1609.07061) to reduce both the number of parameters and the precision of the remaining parameters, which speeds up inference. With these optimizations, your model can run around 7x faster than the unoptimized model at inference time using the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse).

[SparseML](https://github.com/neuralmagic/sparseml) recipes make it easy to perform sparse-quantized transfer learning, which takes two steps:
- First, fine-tune a pre-trained sparse model for the transfer dataset while maintaining the pre-trained sparsity structure.
- Second, perform [quantization-aware training (QAT)](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#quantization-aware-training) to quantize the now sparse model while still holding the same sparsity structure.
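
The first step depends on keeping pruned weights at zero while the remaining weights are updated. The following is a minimal pure-Python sketch of that idea, assuming plain SGD and a fixed binary mask derived from the pre-trained weights. The helper names `sparsity_mask` and `masked_sgd_step` are hypothetical and are not part of SparseML; the recipe-driven flow later in this notebook handles this for you.

```python
def sparsity_mask(weights):
    """Return 1 for weights that are kept and 0 for weights that were pruned."""
    return [0 if w == 0.0 else 1 for w in weights]


def masked_sgd_step(weights, grads, mask, lr):
    """Apply one SGD update, forcing masked-out (pruned) weights to stay zero."""
    return [(w - lr * g) if m else 0.0 for w, g, m in zip(weights, grads, mask)]


# toy "pre-trained sparse" weights: two of the four entries were pruned
pretrained = [0.5, 0.0, -1.0, 0.0]
mask = sparsity_mask(pretrained)

# gradients from the transfer dataset are generally non-zero everywhere
grads = [1.0, 2.0, -1.0, 4.0]
updated = masked_sgd_step(pretrained, grads, mask, lr=0.25)
print(updated)  # [0.25, 0.0, -0.75, 0.0]
```

The surviving weights move toward the new dataset, while the zero positions (the sparsity structure) are unchanged.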

In this notebook, you will:
- Set up the model and dataset
- Define a generic PyTorch training flow
- Integrate the PyTorch flow with SparseML for transfer learning
- Perform sparse transfer learning and quantization-aware training using the PyTorch and SparseML flow
- Export to [ONNX](https://onnx.ai/) and convert the model from a QAT graph to a fully quantized graph
- [Optional] Compare DeepSparse Engine benchmarks of the final sparse-quantized model to an unoptimized model

Reading through this notebook is a reasonably quick way to gain an intuition for how to plug SparseML into your PyTorch training flow, both for transfer learning and in general. Rough time estimates for running the full flow on the default model are given below. Note that training with the PyTorch CPU implementation will be much slower than training on a GPU:
- 30 minutes on a GPU
- 90 minutes on a laptop CPU

## Step 1 - Requirements

To run this notebook, you will need the following packages already installed:
* SparseML, SparseZoo
* PyTorch (>= 1.7.0) and torchvision

You can install any package that is not already present via `pip`.

In [1]:
import sparseml
import sparsezoo
import torch
import torchvision

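# compare numerically; a plain string comparison would rank "1.10" below "1.7"
torch_major_minor = tuple(int(part) for part in torch.__version__.split(".")[:2])
assert torch_major_minor >= (1, 7), "PyTorch >= 1.7.0 is required"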



## Step 2 - Setting Up the Model and Dataset

By default, you will transfer learn from a sparse-quantized [ResNet-50](https://arxiv.org/abs/1512.03385) model trained on the [ImageNet dataset](http://www.image-net.org/) to the much smaller [Imagenette dataset](https://github.com/fastai/imagenette). The transfer learning weights are downloaded from the [SparseZoo](https://github.com/neuralmagic/sparsezoo) model repository.   The Imagenette dataset is downloaded from its repository via a helper class from SparseML.

When loading weights for transfer learning classification models, it is standard to override the final classifier layer to fit the output shape for the new dataset. In the example below, this is done by listing the classifier weights in `ignore_error_tensors`, so that they are initialized for the new model instead of being loaded from the pre-trained weights. In other flows, this could be accomplished by setting `model.classifier.fc = torch.nn.Linear(...)`.
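
To make the classifier override concrete, here is a small sketch of the decision being made when loading: tensors named in the ignore list are left at their fresh initialization, and everything else must match in shape. The function `split_checkpoint_tensors` is hypothetical and only illustrates the idea; it is not how `ModelRegistry.create` is implemented. The shapes assume the 2048-feature final layer of ResNet-50.

```python
def split_checkpoint_tensors(checkpoint_shapes, model_shapes, ignore):
    """Decide which checkpoint tensors to load and which to leave freshly initialized.

    Both shape arguments map a tensor name to its shape tuple.
    """
    to_load, to_init = [], []
    for name, shape in checkpoint_shapes.items():
        if name in ignore:
            # shape is expected to differ: keep the new model's initialization
            to_init.append(name)
        elif model_shapes.get(name) != shape:
            raise ValueError(
                f"{name}: checkpoint {shape} vs model {model_shapes.get(name)}"
            )
        else:
            to_load.append(name)
    return to_load, to_init


# ImageNet checkpoint (1000 classes) vs. new Imagenette model (10 classes)
checkpoint_shapes = {
    "input.conv.weight": (64, 3, 7, 7),
    "classifier.fc.weight": (1000, 2048),
    "classifier.fc.bias": (1000,),
}
model_shapes = {
    "input.conv.weight": (64, 3, 7, 7),
    "classifier.fc.weight": (10, 2048),
    "classifier.fc.bias": (10,),
}
to_load, to_init = split_checkpoint_tensors(
    checkpoint_shapes,
    model_shapes,
    ignore=["classifier.fc.weight", "classifier.fc.bias"],
)
print("loaded from checkpoint:", to_load)
print("left at fresh initialization:", to_init)
```

Without the ignore list, the mismatched classifier shapes would raise an error, which is the failure the `ignore_error_tensors` argument is meant to avoid.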

In [2]:
from sparseml.pytorch.models import ModelRegistry
from sparseml.pytorch.datasets import ImagenetteDataset, ImagenetteSize
from sparsezoo import Model

#######################################################
# Define your model below
#######################################################
print("loading model...")
# SparseZoo stub to pre-trained sparse-quantized ResNet-50 for imagenet dataset
zoo_stub_path = (
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate"
    "?recipe=transfer_learn"
)
model = ModelRegistry.create(
    key="resnet50",
    pretrained_path=zoo_stub_path,
    pretrained_dataset="imagenette",
    num_classes=10,
    ignore_error_tensors=["classifier.fc.weight", "classifier.fc.bias"],
)
input_shape = ModelRegistry.input_shape("resnet50")
input_size = input_shape[-1]
print(model)
#######################################################
# Define your train and validation datasets below
#######################################################

print("\nloading train dataset...")
train_dataset = ImagenetteDataset(
    train=True, dataset_size=ImagenetteSize.s320, image_size=input_size
)
print(train_dataset)

print("\nloading val dataset...")
val_dataset = ImagenetteDataset(
    train=False, dataset_size=ImagenetteSize.s320, image_size=input_size
)
print(val_dataset)

loading model...



ResNet(
  (input): _Input(
    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (act): ReLU(inplace=True)
    (pool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  )
  (sections): Sequential(
    (0): Sequential(
      (0): _BottleneckBlock(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act1): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act2): ReLU(inplace=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running

## Step 3 - Creating a PyTorch Training Loop
SparseML can plug directly into your existing PyTorch training flow by overriding the Optimizer object. To demonstrate this, the cell below defines a simple PyTorch training loop adapted from the [PyTorch CIFAR-10 tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html). To prune and quantize your existing models using SparseML, you can use your own training flow instead.
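
As an intuition for what "overriding the Optimizer object" means, the sketch below wraps a toy optimizer so that every `step()` call also triggers a scheduling hook. Both classes are hypothetical stand-ins written for this illustration; they are not SparseML's implementation, and the only assumption taken from this notebook is that the wrapper needs to know the number of steps per epoch.

```python
class ToyOptimizer:
    """Stand-in for a framework optimizer: plain SGD over a list of floats."""

    def __init__(self, params, lr):
        self.params = params
        self.lr = lr
        self.grads = [0.0] * len(params)

    def step(self):
        for i, grad in enumerate(self.grads):
            self.params[i] -= self.lr * grad


class ScheduledOptimizer:
    """Hypothetical wrapper: same step() interface, plus a per-step hook."""

    def __init__(self, optimizer, steps_per_epoch, on_step):
        self._optimizer = optimizer
        self._steps_per_epoch = steps_per_epoch
        self._on_step = on_step
        self._steps = 0

    def step(self):
        self._optimizer.step()
        self._steps += 1
        # a fractional epoch lets a schedule act partway through an epoch
        self._on_step(self._optimizer, self._steps / self._steps_per_epoch)


epochs_seen = []


def halve_lr_after_first_epoch(optimizer, epoch):
    epochs_seen.append(epoch)
    if epoch == 1.0:
        optimizer.lr *= 0.5


params = [1.0]
base = ToyOptimizer(params, lr=0.5)
wrapped = ScheduledOptimizer(
    base, steps_per_epoch=2, on_step=halve_lr_after_first_epoch
)

for _ in range(2):
    base.grads = [1.0]  # pretend a backward pass produced this gradient
    wrapped.step()  # the training loop only ever calls step()

print(params, base.lr, epochs_seen)
```

Because the training loop only calls `step()`, the loop itself does not need to change when a schedule is attached.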

In [3]:
from tqdm.auto import tqdm
import math
import torch


def run_model_one_epoch(model, data_loader, criterion, device, train=False, optimizer=None):
    if train:
        model.train()
    else:
        model.eval()

    running_loss = 0.0
    total_correct = 0
    total_predictions = 0

    for step, (inputs, labels) in tqdm(enumerate(data_loader), total=len(data_loader)):
        inputs = inputs.to(device)
        labels = labels.to(device)

        if train:
            optimizer.zero_grad()

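        # only track gradients when training; evaluation skips autograd bookkeeping
        with torch.set_grad_enabled(train):
            outputs, _ = model(inputs)  # model returns logits and softmax as a tuple
            loss = criterion(outputs, labels)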

        if train:
            loss.backward()
            optimizer.step()

        running_loss += loss.item()

        predictions = outputs.argmax(dim=1)
        total_correct += torch.sum(predictions == labels).item()
        total_predictions += inputs.size(0)

    loss = running_loss / (step + 1.0)
    accuracy = total_correct / total_predictions
    return loss, accuracy

## Step 4 - Building PyTorch Training Objects
In this step, you will select hyperparameters and a device to train your model with, and set up DataLoader objects, a loss function, and an optimizer. All of these variables and objects can be replaced to fit your training flow.

In [4]:
from torch.utils.data import DataLoader
from torch.nn import CrossEntropyLoss
from torch.optim import Adam

# hyperparameters
BATCH_SIZE = 32

# setup device
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(f"Using device: {device}")

# setup data loaders
train_loader = DataLoader(
    train_dataset, BATCH_SIZE, shuffle=True, pin_memory=True, num_workers=8
)
val_loader = DataLoader(
    val_dataset, BATCH_SIZE, shuffle=False, pin_memory=True, num_workers=8
)

# setup loss function and optimizer, LR will be overridden by SparseML
criterion = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=8e-3)

Using device: cuda


## Step 5 - Running Sparse-Quantized Transfer Learning with a SparseML Recipe

To run sparse-quantized transfer learning with SparseML, you will download a transfer learning recipe from SparseZoo and use it to create a `ScheduledModifierManager` object.  This manager will be used to wrap the optimizer object to maintain the pre-optimized model's sparsity structure while learning weights for the new dataset as well as performing quantization-aware training (QAT).

You can create SparseML recipes to perform various model pruning schedules, QAT, sparse transfer learning, and more.  If you are using a different model than the default, you will have to modify the recipe  file to match the new target's parameters.

Using the wrapped optimizer object, you will call the training function to transfer learn and quantize your model. Finalize the model after training by calling the manager's `finalize(...)` method.

If the kernel shuts down during training, this may be an out-of-memory error; to resolve this, try lowering `BATCH_SIZE` in the cell above.

#### Downloading a Recipe from SparseZoo
The [SparseZoo](https://github.com/neuralmagic/sparsezoo) API provides preconfigured recipes for its optimized models. In the cell below, you will download the transfer learning recipe for the sparse-quantized ResNet-50 model and record its saved path.

In [7]:
from sparsezoo import Model

zoo_model = Model(zoo_stub_path)
recipe_path = zoo_model.recipes.default.path
print(f"Recipe downloaded to: {recipe_path}")

{'training': Directory(name=training), 'deployment': None, 'onnx_folder': None, 'logs': None, 'sample_originals': None, 'sample_inputs': None, 'sample_outputs': None, 'sample_labels': None, 'model_card': None, 'recipes': None, 'onnx_model': None, 'analysis': None, 'benchmarks': None, 'eval_results': None}


AttributeError: 'NoneType' object has no attribute 'default'

In [None]:
from sparseml.pytorch.optim import ScheduledModifierManager

# create ScheduledModifierManager and Optimizer wrapper
manager = ScheduledModifierManager.from_yaml(recipe_path)
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))


# Run sparse transfer learning and QAT over the epoch range set by the recipe
for epoch in range(manager.min_epochs, manager.max_epochs):
    # run training loop
    epoch_name = f"{epoch + 1}/{manager.max_epochs}"
    print(f"Running Training Epoch {epoch_name}")
    train_loss, train_acc = run_model_one_epoch(
        model, train_loader, criterion, device, train=True, optimizer=optimizer
    )
    print(
        f"Training Epoch: {epoch_name}\nTraining Loss: {train_loss}\nTop 1 Acc: {train_acc}\n"
    )

    # run validation loop
    print(f"Running Validation Epoch {epoch_name}")
    val_loss, val_acc = run_model_one_epoch(model, val_loader, criterion, device)
    print(
        f"Validation Epoch: {epoch_name}\nVal Loss: {val_loss}\nTop 1 Acc: {val_acc}\n"
    )

manager.finalize(model)

## Step 6 - Viewing Model Sparsity
To see the effects of sparse-quantized transfer learning, in this step, you will print out the sparsities of each Conv and FC layer in your model.

In [None]:
from sparseml.pytorch.utils import get_prunable_layers, tensor_sparsity

# print sparsities of each layer
for (name, layer) in get_prunable_layers(model):
    print(f"{name}.weight: {tensor_sparsity(layer.weight).item():.4f}")
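
For reference, the sparsity reported for a layer is the fraction of its weights that are exactly zero. The sketch below computes that quantity for a toy weight matrix in pure Python; `flatten` and `sparsity` are hypothetical helpers written for this illustration and are not the SparseML functions used in the cell above.

```python
def flatten(nested):
    """Yield the scalar entries of an arbitrarily nested list."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def sparsity(weights):
    """Return the fraction of entries that are exactly zero."""
    values = list(flatten(weights))
    if not values:
        raise ValueError("cannot compute sparsity of an empty tensor")
    return sum(1 for v in values if v == 0) / len(values)


# toy 2x4 weight matrix with 6 of its 8 entries pruned
layer_weight = [
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, -0.7, 0.0],
]
print(f"toy_layer.weight: {sparsity(layer_weight):.4f}")  # toy_layer.weight: 0.7500
```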

## Step 7 - Exporting to ONNX

Now that sparse-quantized transfer learning is complete, the model should be prepped for inference. A common next step for inference is exporting the model to ONNX. This is also the format used by the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse) to achieve the sparse-quantized speedups.

For PyTorch, exporting to ONNX is natively supported. In the cell below, a convenience class, `ModuleExporter`, is used to handle exporting.

Normally, exporting a QAT model from PyTorch to ONNX creates a graph with "fake quantized" operations that represent the QAT graph. To run a fully quantized graph, you need to convert these QAT operations to fully quantized INT8 operations. SparseML provides the `quantize_torch_qat_export` helper function to perform this conversion. By setting `convert_qat=True` in the exporter, the conversion is called automatically, so the exported model is a fully quantized graph with the desired quantized structure.

Once the model is saved as an ONNX file, it is ready to be used for inference with the DeepSparse Engine.
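
To illustrate the difference between "fake quantized" and fully quantized values, the sketch below uses the common affine INT8 scheme: a float is mapped to an integer with a scale and zero point, and fake quantization is that mapping followed immediately by the inverse. This is a worked arithmetic example under assumed parameters (`scale=0.5`, `zero_point=0`, signed 8-bit range), not the exact configuration PyTorch or SparseML applies, and the function names are hypothetical.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an integer in [qmin, qmax] (affine quantization)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to the float it represents."""
    return (q - zero_point) * scale


def fake_quantize(x, scale, zero_point):
    """Quantize then dequantize: the value stays a float but loses precision."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


scale, zero_point = 0.5, 0

# during QAT the graph carries floats that have been rounded to the INT8 grid
print(fake_quantize(1.3, scale, zero_point))  # 1.5

# a fully quantized graph stores and computes on the integers themselves
print(quantize(1.3, scale, zero_point))  # 3

# values outside the representable range are clamped
print(quantize(100.0, scale, zero_point))  # 127

# zero maps to zero exactly, so pruned weights stay pruned
print(fake_quantize(0.0, scale, zero_point))  # 0.0
```

Converting a QAT export to a fully quantized graph amounts to replacing the quantize/dequantize pairs with operations that work directly on the integer values.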

In [None]:
import os
from sparseml.pytorch.utils import ModuleExporter

save_dir = "pytorch_sparse_quantized_transfer_learning"
quant_onnx_graph_name = "resnet50_imagenette_pruned_quant.onnx"
quantized_onnx_path = os.path.join(save_dir, quant_onnx_graph_name)

exporter = ModuleExporter(model, output_dir=save_dir)
exporter.export_pytorch(name="resnet50_imagenette_pruned_qat.pth")
exporter.export_onnx(
    torch.randn(1, 3, 224, 224), name=quant_onnx_graph_name, convert_qat=True
)

print(f"Sparse-Quantized ONNX model saved to {quantized_onnx_path}")

## [Optional] Step 8 - Benchmarking

Finally, to see the total effect of these optimizations, you will benchmark an unoptimized, dense ResNet-50 model from SparseZoo against your sparse-quantized model using the `deepsparse` API.

To run this step, `deepsparse` must be installed in your Python environment. You can install it with `pip install deepsparse`.

Note that to see a speedup from quantization, your CPU must support VNNI instructions. The benchmarking cell below contains a check for VNNI instructions and will log a warning if they are not detected. You can learn more about hardware compatibility in the [DeepSparse hardware documentation](https://docs.neuralmagic.com/deepsparse/hardware.html).
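
If `deepsparse` is not installed yet, a rough check for VNNI support on Linux is to look at the CPU flags the kernel reports. The sketch below parses `/proc/cpuinfo`-style text for the `avx512_vnni` and `avx_vnni` flags. It is an independent illustration that assumes a Linux x86 system; it is not how `deepsparse.cpu.cpu_architecture` works, and the helper names are hypothetical.

```python
VNNI_FLAGS = {"avx512_vnni", "avx_vnni"}


def has_vnni(cpuinfo_text):
    """Return True if a 'flags' line in cpuinfo-style text lists a VNNI flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            return bool(flags & VNNI_FLAGS)
    return False


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string where it is unavailable."""
    try:
        with open(path) as cpuinfo_file:
            return cpuinfo_file.read()
    except OSError:
        return ""


if has_vnni(read_cpuinfo()):
    print("VNNI flag found in /proc/cpuinfo")
else:
    print("No VNNI flag found (or /proc/cpuinfo is unavailable)")
```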

In [None]:
import numpy
from deepsparse import benchmark_model
from deepsparse.cpu import cpu_architecture


# check VNNI
if cpu_architecture()["vnni"]:
    print("VNNI extensions detected, model will run with quantized speedups\n")
else:
    print(
        "WARNING: No VNNI extensions detected. Your model will not run with "
        "quantized speedups which will affect benchmarking\n"
    )


BATCH_SIZE = 64
NUM_CORES = None  # maximum number of cores available
NUM_ITERATIONS = 100
NUM_WARMUP_ITERATIONS = 20


def benchmark_imagenette_model(model_name, model_path):
    print(
        f"Benchmarking {model_name} for {NUM_ITERATIONS} iterations at batch "
        f"size {BATCH_SIZE} with {NUM_CORES} CPU cores"
    )
    sample_input = [
        numpy.ascontiguousarray(
            numpy.random.randn(BATCH_SIZE, 3, 224, 224).astype(numpy.float32)
        )
    ]

    results = benchmark_model(
        model=model_path,
        inp=sample_input,
        batch_size=BATCH_SIZE,
        num_cores=NUM_CORES,
        num_iterations=NUM_ITERATIONS,
        num_warmup_iterations=NUM_WARMUP_ITERATIONS,
        show_progress=True,
    )
    print(f"results:\n{results}")
    return results


# base ResNet-50 Imagenette model downloaded from SparseZoo
base_results = benchmark_imagenette_model(
    "ResNet-50 Imagenette Base",
    "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenette/base-none"
)

optimized_results = benchmark_imagenette_model(
    "ResNet-50 Imagenette pruned-quantized", quantized_onnx_path
)

speed_up = base_results.ms_per_batch / optimized_results.ms_per_batch
print(f"Speed-up from sparse-quantized transfer learning: {speed_up}")
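
The benchmark settings above follow a common timing pattern: run some untimed warm-up iterations, time a fixed number of iterations, and compare the average time per batch. The sketch below shows that pattern with the standard library only. `time_ms_per_batch` and `compute_speed_up` are hypothetical helpers for illustration; they do not reproduce how `benchmark_model` measures time.

```python
import time


def time_ms_per_batch(run_batch, num_iterations, num_warmup_iterations):
    """Average wall-clock milliseconds per call, excluding warm-up calls."""
    for _ in range(num_warmup_iterations):
        run_batch()  # warm-up: not timed

    start = time.perf_counter()
    for _ in range(num_iterations):
        run_batch()
    elapsed_seconds = time.perf_counter() - start
    return elapsed_seconds * 1000.0 / num_iterations


def compute_speed_up(base_ms_per_batch, optimized_ms_per_batch):
    """Ratio above 1.0 means the optimized model is faster."""
    return base_ms_per_batch / optimized_ms_per_batch


calls = []


def toy_batch():
    # stand-in for running one batch through an inference engine
    calls.append(sum(range(1000)))


ms_per_batch = time_ms_per_batch(toy_batch, num_iterations=100, num_warmup_iterations=20)
print(f"toy workload: {ms_per_batch:.4f} ms per batch over {len(calls)} total calls")
print(f"example ratio: {compute_speed_up(70.0, 10.0)}x")
```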

## Next Steps

Congratulations, you have created a sparse-quantized model and exported it to ONNX for inference!  Next steps you can pursue include:
* Transfer learning, pruning, or quantizing different models using SparseML
* Trying different pruning and optimization recipes
* Benchmarking other models on the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse)