# **Deep Learning Embedded on a Raspberry Pi 3 Using Quantized PyTorch Models**

In this tutorial we cover three model-compression techniques using features that the PyTorch framework provides. We measure the performance gain on an embedded platform, the Raspberry Pi 3, and outline how to deploy the resulting model.

### **Disclaimer:**
This tutorial draws on several online tutorials. The links to them are listed below.

[Pytorch Quantize Recipe](https://pytorch.org/docs/stable/quantization.html#quantization-workflows)

[Pytorch Quantization Docs](https://pytorch.org/docs/stable/quantization.html)



## Recap

A neuron is modeled as an inner product:

$y(\textbf{x}) = \textbf{w}^\top \textbf{x} = \sum_i w_i x_i$

Note: the bias is implicit in the weight vector.
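
As a minimal sketch of the inner-product view, the snippet below folds the bias into the weight vector by appending a constant 1 to the input. The function name `neuron_output` and the numeric values are illustrative assumptions, not part of the original notebook.

```python
def neuron_output(weights, inputs):
    """Inner product w^T x, with the bias stored as the last weight."""
    augmented = list(inputs) + [1.0]  # constant input that multiplies the bias
    if len(weights) != len(augmented):
        raise ValueError("weights must have one more entry than inputs")
    return sum(w * x for w, x in zip(weights, augmented))


# Hypothetical values: two weights plus a bias of 0.5
weights = [2.0, -1.0, 0.5]
inputs = [3.0, 4.0]
y = neuron_output(weights, inputs)  # 2*3 + (-1)*4 + 0.5*1 = 2.5
```

Storing the bias as an extra weight keeps the model a single dot product, which is why the formula needs no separate bias term.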

How, then, can we train it?


# Training the model

The weight update can be computed as:<br>
<center>$\begin{equation}\Delta \textbf{w} = - \eta \, \nabla_{\textbf{w}} E\end{equation}$</center> 

where $\nabla_{\textbf{w}} E$ is the gradient of the error function $E$ with respect to the weights and $\eta$ is the learning rate. Treating each weight as an independent variable, the gradient can be written component by component.<br>
With $\textbf{w}$ the weight vector and $\nabla_{\textbf{w}} E$ the vector of partial derivatives of the error function with respect to the weights:

$\begin{equation}
  \textbf{w} = \begin{pmatrix} w_{1} \\ w_{2} \\ \vdots \\ w_{n} \end{pmatrix} ; \nabla_{\textbf{w}} E = \begin{pmatrix} \frac{\partial E}{\partial w_{1}} \\ \frac{\partial E}{\partial w_{2}}\\ \vdots \\ \frac{\partial E}{\partial w_{n}}\end{pmatrix} \end{equation}
  $

  Therefore $\textbf{w}_{t+1} = \textbf{w}_{t} - \eta \, \nabla_{\textbf{w}} E$
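
The update rule can be illustrated with a single linear neuron and a squared-error loss. This is only a sketch: the loss $E = \frac{1}{2}(y - t)^2$, the helper names, and the data point are assumptions chosen to keep the arithmetic easy to follow.

```python
def gradient_step(w, x, target, eta):
    """One update w <- w - eta * grad E for E = 0.5 * (w.x - target)^2."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    error = y - target
    grad = [error * xi for xi in x]  # dE/dw_i = (y - target) * x_i
    return [wi - eta * gi for wi, gi in zip(w, grad)]


def squared_error(w, x, target):
    """Loss E = 0.5 * (w.x - target)^2 for one sample."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (y - target) ** 2


# Hypothetical single training sample
w = [0.0, 0.0]
x = [1.0, 2.0]
target = 3.0
eta = 0.1

loss_before = squared_error(w, x, target)
for _ in range(20):
    w = gradient_step(w, x, target, eta)
loss_after = squared_error(w, x, target)
```

With these values each step halves the prediction error, so the loss shrinks quickly. A larger `eta` would overshoot and a much smaller one would converge slowly.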

# **Convolutional Neural Networks**

---

Image convolution is the process of applying a filter to a representation of an image. In neural networks these operations are stacked in layers, where further filters are applied to the previous representations of the image until the classification stage is reached. This part of the architecture is called the feature extractor.

The classification stage is usually an MLP, i.e. a fully connected network.

![conv 1](https://www.researchgate.net/publication/331540139/figure/fig4/AS:733273504354306@1551837435967/The-overall-architecture-of-the-Convolutional-Neural-Network-CNN-includes-an-input.png)
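
To make the filtering step concrete, here is a small pure-Python sketch of a 2D convolution as deep learning frameworks compute it (a sliding dot product, with no padding and stride 1). The function `conv2d_valid`, the toy image, and the filter are illustrative assumptions.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Horizontal-difference filter: responds where intensity rises left to right
kernel = [[-1, 1]]
feature_map = conv2d_valid(image, kernel)
```

The resulting feature map is non-zero only at the edge location, which is the sense in which a convolutional layer extracts features.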



In [1]:
#pip install torch==1.5.0
#pip install torchvision==0.6.0  # the torchvision release that pairs with torch 1.5.0


In [2]:
#@title **Importing the dependencies used in this tutorial**

import numpy as np
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets
import torchvision.transforms as transforms
import os
import time
import sys
import torch.quantization
from torch.utils.data import random_split
from torch.quantization import QuantStub, DeQuantStub
import copy
# # Setup warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.quantization'
)


#@markdown ---
#@markdown To reproduce the experiments deterministically,
#@markdown we need to set a seed -->```torch.manual_seed(191009)```  
# Specify random seed for repeatable results
torch.manual_seed(191009)

<torch._C.Generator at 0x7f35311c4cb0>

In [3]:
#@title Loading the dataset the same way as before, but now using Normalization
transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    
])

dataset_train = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

dataset_test = torchvision.datasets.CIFAR10(root='./data', train=False,
                                        download=True, transform=transform)

test, val = random_split(dataset_test, lengths = (5000,5000))

Files already downloaded and verified
Files already downloaded and verified


In [4]:
batch_size = 2500

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
train_loader = DataLoader(dataset=dataset_train, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(dataset=test, shuffle=False, batch_size=batch_size)
val_loader = DataLoader(dataset=val, shuffle=False, batch_size=batch_size)

dataloaders = {'train': train_loader, 'val':val_loader, 'test' : test_loader }

dataset_sizes = {'train' : len(train_loader.dataset), 'test' : len(test_loader.dataset), 'val': len(val_loader.dataset)}

device =  torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
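
The batch size determines how many iterations each loader yields. The short calculation below is a sketch that assumes the standard CIFAR-10 sizes (50,000 training images, and the 10,000 test images split 5,000/5,000 as above).

```python
import math

batch_size = 2500
split_sizes = {"train": 50000, "val": 5000, "test": 5000}  # CIFAR-10 sizes used above

# Number of batches each DataLoader produces per pass over its split
batches_per_split = {
    name: math.ceil(n / batch_size) for name, n in split_sizes.items()
}
```

With only two validation batches, a cap such as `neval_batches=1000` is never reached, and each progress dot printed by the evaluation helper corresponds to one batch.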




In [23]:
class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, padding, kernel_size=3, stride=1, groups=1):
        #padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes, momentum=0.1),
            # Replace with ReLU
            nn.ReLU(inplace=False)
        )

class CNN(nn.Module):
  def __init__(self):
    super(CNN,self).__init__()
    
    #input_channel, output_channel, feature_dimension(kernel_size), stride, padding
    self.feats = nn.Sequential(
        
        ConvBNReLU(3, 20, kernel_size = 3, stride = 1, padding=0),#30x30x20
        nn.MaxPool2d(kernel_size = 3, stride = 2, padding = 1),#15x15x20

        
        

        ConvBNReLU(20, 256, kernel_size = 3, stride = 1, padding = 1),#15x15x256
        nn.MaxPool2d(kernel_size = 3, stride = 2),#7x7x256
      
        nn.MaxPool2d(kernel_size = 3, stride = 2 )#3x3x256
    )
    self.fc = nn.Sequential(
        nn.Dropout(0.4),
        nn.Linear(3*3*256,10),
        # nn.ReLU(True),
        # nn.Dropout(0.5),
        # nn.Linear(768, 10),
        #nn.LogSoftmax(1)
    )
    
    self.quant = QuantStub()
    
    self.dequant = DeQuantStub()

  def forward(self,x):
    x = self.quant(x)
    x = self.feats(x) # CNN feature extractor

    x = x.reshape(-1,3*3*256) # Flatten

    x = self.fc(x) # Classify
    x = self.dequant(x)
    return x

  def fuse_model(self):
    for m in self.modules():
        if type(m) == ConvBNReLU:
            torch.quantization.fuse_modules(m, ['0', '1', '2'], inplace=True)
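
The shape comments in the model (30x30, 15x15, 7x7, 3x3) follow from the usual output-size formula for convolution and pooling layers. The sketch below recomputes them. The helper `out_size` is an illustrative name, and floor rounding is assumed, matching `ceil_mode=False`.

```python
def out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32  # CIFAR-10 images are 32x32
sizes = []
for kernel_size, stride, padding in [
    (3, 1, 0),  # ConvBNReLU(3, 20)
    (3, 2, 1),  # MaxPool2d
    (3, 1, 1),  # ConvBNReLU(20, 256)
    (3, 2, 0),  # MaxPool2d
    (3, 2, 0),  # MaxPool2d
]:
    size = out_size(size, kernel_size, stride, padding)
    sizes.append(size)

flattened = size * size * 256  # input features of the final Linear layer
```

The final value is what `nn.Linear(3*3*256, 10)` and the `reshape` in `forward` rely on. Changing any kernel size, stride, or padding requires updating both.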
        

In [6]:
def train_model(model, criterion, optimizer, dataloaders, device='cpu', num_epochs=25):
    since = time.time()
    
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            # if phase == 'train':
            #     scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

In [8]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


def evaluate(model, criterion, data_loader, neval_batches):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            print('.', end = '')
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
            if cnt >= neval_batches:
                 return top1, top5

    return top1, top5

def load_model(model_file):
    model = CNN()
    # map_location lets weights saved from a GPU run load on a CPU-only device
    state_dict = torch.load(model_file, map_location='cpu')
    model.load_state_dict(state_dict)
    model.to('cpu')
    return model

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

In [25]:
model = CNN()

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
train_model(model.to(device), criterion, optimizer, dataloaders, device=device, num_epochs=3)

Epoch 0/2
----------
train Loss: 2.6436 Acc: 0.1295
val Loss: 2.1235 Acc: 0.2598

Epoch 1/2
----------
train Loss: 2.2474 Acc: 0.2123
val Loss: 1.8728 Acc: 0.3530

Epoch 2/2
----------
train Loss: 2.0625 Acc: 0.2665
val Loss: 1.7545 Acc: 0.3946

Training complete in 1m 5s
Best val Acc: 0.394600


CNN(
  (feats): Sequential(
    (0): ConvBNReLU(
      (0): Conv2d(3, 20, kernel_size=(3, 3), stride=(1, 1), bias=False)
      (1): BatchNorm2d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (1): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (2): ConvBNReLU(
      (0): Conv2d(20, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (3): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Sequential(
    (0): Dropout(p=0.4, inplace=False)
    (1): Linear(in_features=2304, out_features=10, bias=True)
  )
  (quant): QuantStub()
  (dequant): DeQuantStub()
)

In [26]:
torch.save(model.state_dict(), 'model_cnn_basic.pth')

In [43]:
torch.jit.save(torch.jit.script(model.to('cpu')), 'model_cnn_basic_scripted.pth')

In [33]:
saved_model_dir = './'
float_model_file = 'model_cnn_basic.pth'
scripted_float_model_file = 'model_cnn_basic_quantization_scripted.pth'
scripted_quantized_model_file = 'model_cnn_basic_quantization_scripted_quantized.pth'

train_batch_size = 30
eval_batch_size = 50

criterion = nn.CrossEntropyLoss()
float_model = load_model(saved_model_dir + float_model_file).to('cpu')



float_model.eval()

# Fuses modules
float_model.fuse_model()


In [34]:
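# Upper bound on evaluated batches. The 5000-image val split yields only 2 batches
# of 2500, so the image count printed below (num_eval_batches * eval_batch_size)
# overstates what is actually evaluated.
num_eval_batches = 1000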

print("Size of baseline model")
print_size_of_model(float_model)

top1, top5 = evaluate(float_model, criterion, dataloaders['val'], neval_batches=num_eval_batches)
print('\nEvaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
torch.jit.save(torch.jit.script(float_model), saved_model_dir + scripted_float_model_file)

Size of baseline model
Size (MB): 0.281291
..
Evaluation accuracy on 50000 images, 39.46


In [35]:
num_calibration_batches = 32

myModel = load_model(saved_model_dir + float_model_file).to('cpu')
myModel.eval()

# Fuse Conv, bn and relu
myModel.fuse_model()

# Specify quantization configuration
# Start with simple min/max range estimation and per-tensor quantization of weights
myModel.qconfig = torch.quantization.default_qconfig
print(myModel.qconfig)
backend = "qnnpack"

torch.backends.quantized.engine = backend
torch.quantization.prepare(myModel, inplace=True)

# Calibrate first
print('Post Training Quantization Prepare: Inserting Observers')
#print('\n Inverted Residual Block:After observer insertion \n\n', myModel.features[1].conv)

# Calibrate on the validation split; the observers record activation ranges during this pass
evaluate(myModel, criterion, dataloaders['val'], neval_batches=num_calibration_batches)
print('Post Training Quantization: Calibration done')

# Convert to quantized model
torch.quantization.convert(myModel, inplace=True)
print('Post Training Quantization: Convert done')
#print('\n Inverted Residual Block: After fusion and quantization, note fused modules: \n\n',myModel.features[1].conv)

print("Size of model after quantization")
print_size_of_model(myModel)

top1, top5 = evaluate(myModel, criterion, dataloaders['val'], neval_batches=num_eval_batches)
print('\nEvaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))

QConfig(activation=functools.partial(<class 'torch.quantization.observer.MinMaxObserver'>, reduce_range=True), weight=functools.partial(<class 'torch.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))
Post Training Quantization Prepare: Inserting Observers
..Post Training Quantization: Calibration done
Post Training Quantization: Convert done
Size of model after quantization
Size (MB): 0.07332
..
Evaluation accuracy on 50000 images, 38.94
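
The drop from 0.281291 MB to 0.07332 MB can be approximated by counting parameters. The estimate below is a sketch under stated assumptions: after fusion each convolution carries a bias in place of its BatchNorm, quantized weights take one byte each, biases stay in float32, and serialization overhead and quantization metadata are ignored.

```python
# Parameter counts after fusion (BatchNorm folded into each conv, adding a bias)
conv1_w, conv1_b = 3 * 20 * 3 * 3, 20
conv2_w, conv2_b = 20 * 256 * 3 * 3, 256
fc_w, fc_b = 2304 * 10, 10

n_weights = conv1_w + conv2_w + fc_w
n_biases = conv1_b + conv2_b + fc_b

float_bytes = 4 * (n_weights + n_biases)  # everything stored as float32
# Assumption: int8 weights (1 byte each) while biases remain float32
quant_bytes = 1 * n_weights + 4 * n_biases

ratio = float_bytes / quant_bytes
```

Both estimates land slightly below the file sizes reported above, which is consistent with the ignored overhead, and the ratio stays a little under the ideal 4x because the biases are not compressed.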


In [36]:
per_channel_quantized_model = load_model(saved_model_dir + float_model_file)
per_channel_quantized_model.eval()
per_channel_quantized_model.fuse_model()
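# Note: despite the variable name, the default 'qnnpack' qconfig quantizes weights
# per tensor, as the printed QConfig shows.
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
print(per_channel_quantized_model.qconfig)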

torch.quantization.prepare(per_channel_quantized_model, inplace=True)
evaluate(per_channel_quantized_model,criterion, dataloaders['val'], num_calibration_batches)
torch.quantization.convert(per_channel_quantized_model, inplace=True)
top1, top5 = evaluate(per_channel_quantized_model, criterion, dataloaders['val'], neval_batches=num_eval_batches)
print('\nEvaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
torch.jit.save(torch.jit.script(per_channel_quantized_model), saved_model_dir + scripted_quantized_model_file)

QConfig(activation=functools.partial(<class 'torch.quantization.observer.HistogramObserver'>, reduce_range=False), weight=functools.partial(<class 'torch.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))
....
Evaluation accuracy on 50000 images, 39.24
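
During calibration the observers collect statistics that are then turned into a scale and a zero point. The sketch below shows the simplest min/max version of that computation. It is a simplified illustration, not PyTorch's implementation (which also handles reduced ranges, symmetric schemes, and histogram-based estimation); `minmax_qparams` and the sample values are hypothetical.

```python
def minmax_qparams(values, qmin=0, qmax=255):
    """Affine quantization parameters from the observed min and max."""
    lo = min(min(values), 0.0)  # keep 0.0 exactly representable
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))  # clamp into the integer range
    return scale, zero_point


# Hypothetical activations seen during calibration
observed = [-1.0, 0.25, 0.5, 3.0]
scale, zero_point = minmax_qparams(observed)
```

A single outlier widens the range and therefore coarsens the grid for every other value, which is one motivation for histogram-based observers.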


In [37]:
def train_one_epoch(model, criterion, optimizer, data_loader, device, ntrain_batches):
    model.train()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    avgloss = AverageMeter('Loss', ':1.5f')

    cnt = 0
    for image, target in data_loader:
        start_time = time.time()
        print('.', end = '')
        cnt += 1
        image, target = image.to(device), target.to(device)
        output = model(image)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        top1.update(acc1[0], image.size(0))
        top5.update(acc5[0], image.size(0))
        avgloss.update(loss.item(), image.size(0))  # store a float, not a tensor tied to the graph
        if cnt >= ntrain_batches:
            print('Loss', avgloss.avg)

            print('Training: * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
                  .format(top1=top1, top5=top5))
            return

    print('Full CIFAR10 train set:  * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
          .format(top1=top1, top5=top5))
    return

In [40]:
qat_model = load_model(saved_model_dir + float_model_file)
qat_model.fuse_model()

optimizer = torch.optim.SGD(qat_model.parameters(), lr = 0.0001)
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')

In [41]:
torch.quantization.prepare_qat(qat_model, inplace=True)
#print('Inverted Residual Block: After preparation for QAT, note fake-quantization modules \n',qat_model.features[1].conv)

CNN(
  (feats): Sequential(
    (0): ConvBNReLU(
      (0): ConvBnReLU2d(
        3, 20, kernel_size=(3, 3), stride=(1, 1), bias=False
        (activation_post_process): FakeQuantize(
          fake_quant_enabled=True, observer_enabled=True,            scale=None, zero_point=None
          (activation_post_process): MovingAverageMinMaxObserver(min_val=None, max_val=None)
        )
        (weight_fake_quant): FakeQuantize(
          fake_quant_enabled=True, observer_enabled=True,            scale=None, zero_point=None
          (activation_post_process): MovingAverageMinMaxObserver(min_val=None, max_val=None)
        )
      )
      (1): Identity()
      (2): Identity()
    )
    (1): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (2): ConvBNReLU(
      (0): ConvBnReLU2d(
        20, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
        (activation_post_process): FakeQuantize(
          fake_quant_enabled=True, observer_enabled=True,
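
The `FakeQuantize` modules shown above simulate quantization during training: values are rounded to the integer grid and immediately mapped back to float, so the network learns under quantization error. The function below is a scalar sketch of that round trip. Its name and parameters are illustrative assumptions, and gradient handling is not modeled.

```python
def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize to an integer grid, then map straight back to float."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))      # clamp to the integer range
    return (q - zero_point) * scale  # dequantize


scale, zero_point = 0.1, 128  # hypothetical parameters
inside = fake_quantize(0.234, scale, zero_point)   # snaps to the nearest grid point
clipped = fake_quantize(100.0, scale, zero_point)  # saturates at the top of the range
```

For in-range inputs the round-trip error is at most half of `scale`, while out-of-range inputs saturate, so the choice of scale trades resolution against clipping.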

In [46]:
def run_benchmark(model_file, img_loader):
    elapsed = 0
    model = torch.jit.load(model_file)
    model.eval()
    num_batches = 1
    # Run the scripted model on a few batches of images
    for i, (images, target) in enumerate(img_loader):
        if i < num_batches:
            start = time.time()
            output = model(images)
            end = time.time()
            elapsed = elapsed + (end-start)
        else:
            break
    num_images = images.size()[0] * num_batches

    print('Elapsed time: %3.3f ms' % (elapsed/num_images*1000))
    return elapsed

run_benchmark(saved_model_dir + scripted_float_model_file, dataloaders['val'])

run_benchmark(saved_model_dir + scripted_quantized_model_file, dataloaders['val'])

Elapsed time: 1.678 ms
Elapsed time: 1.508 ms


3.7700726985931396
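
Timing a single batch once can be noisy. One possible refinement, sketched below, is to run a few warm-up calls and report the median of several repetitions. The helper `time_per_call_ms` and the stand-in workload are assumptions, and the final lines only restate the speedup implied by the two timings printed above.

```python
import statistics
import time


def time_per_call_ms(fn, warmup=2, repeats=5):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; in the notebook this would be `lambda: model(images)`
elapsed_ms = time_per_call_ms(lambda: sum(range(10000)))

# Speedup implied by the per-image timings reported above
float_ms, quantized_ms = 1.678, 1.508
speedup = float_ms / quantized_ms
```

The reported timings correspond to a speedup of roughly 1.11x for the quantized model on that run. Repeating the measurement on the target device is what ultimately matters.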

In [47]:
num_train_batches = 20

# QAT takes time and one needs to train over a few epochs.
# Train and check accuracy after each epoch
for nepoch in range(8):
    train_one_epoch(qat_model, criterion, optimizer, dataloaders['test'], torch.device('cpu'), num_train_batches)
    if nepoch > 3:
        # Freeze quantizer parameters
        qat_model.apply(torch.quantization.disable_observer)
    if nepoch > 2:
        # Freeze batch norm mean and variance estimates
        qat_model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)

    # Check the accuracy after each epoch
    quantized_model = torch.quantization.convert(qat_model.eval(), inplace=False)
    quantized_model.eval()
    top1, top5 = evaluate(quantized_model,criterion, dataloaders['val'], neval_batches=num_eval_batches)
    print('\nEpoch %d :Evaluation accuracy on %d images, %2.2f'%(nepoch, num_eval_batches * eval_batch_size, top1.avg))

..Full CIFAR10 train set:  * Acc@1 28.620 Acc@5 81.420
..
Epoch 0 :Evaluation accuracy on 50000 images, 39.36
..Full CIFAR10 train set:  * Acc@1 29.680 Acc@5 81.000
..
Epoch 1 :Evaluation accuracy on 50000 images, 39.60
..Full CIFAR10 train set:  * Acc@1 28.940 Acc@5 80.800
..
Epoch 2 :Evaluation accuracy on 50000 images, 39.50
..Full CIFAR10 train set:  * Acc@1 28.960 Acc@5 80.080
..
Epoch 3 :Evaluation accuracy on 50000 images, 39.58
..Full CIFAR10 train set:  * Acc@1 29.020 Acc@5 80.840
..
Epoch 4 :Evaluation accuracy on 50000 images, 39.60
..Full CIFAR10 train set:  * Acc@1 29.820 Acc@5 81.460
..
Epoch 5 :Evaluation accuracy on 50000 images, 39.68
..Full CIFAR10 train set:  * Acc@1 30.040 Acc@5 81.980
..
Epoch 6 :Evaluation accuracy on 50000 images, 39.62
..Full CIFAR10 train set:  * Acc@1 28.540 Acc@5 81.860
..
Epoch 7 :Evaluation accuracy on 50000 images, 39.72


In [48]:
torch.quantization.convert(quantized_model, inplace=True)

CNN(
  (feats): Sequential(
    (0): ConvBNReLU(
      (0): QuantizedConvReLU2d(3, 20, kernel_size=(3, 3), stride=(1, 1), scale=0.027072325348854065, zero_point=0)
      (1): Identity()
      (2): Identity()
    )
    (1): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (2): ConvBNReLU(
      (0): QuantizedConvReLU2d(20, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.03293272480368614, zero_point=0, padding=(1, 1))
      (1): Identity()
      (2): Identity()
    )
    (3): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Sequential(
    (0): Dropout(p=0.4, inplace=False)
    (1): QuantizedLinear(
      in_features=2304, out_features=10, scale=0.0460919588804245, zero_point=146
      (_packed_params): LinearPackedParams()
    )
  )
  (quant): Quantize(scale=tensor([0.0187]), zero_point=tensor([114]), dtype=torch.quint8)
  (dequant): De
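
The printed `Quantize(scale=0.0187, zero_point=114)` module can be read as an affine mapping from float inputs to 8-bit integers. The sketch below applies that mapping by hand and compares the representable range with the range of a `[0, 1]` pixel after the `Normalize` transform used when loading the data. The helper names are illustrative, and the printed scale is rounded, so the comparison is approximate.

```python
scale, zero_point = 0.0187, 114  # values from the printed Quantize module
qmin, qmax = 0, 255              # quint8 range


def quantize(x):
    """Map a float to its quint8 code, saturating at the range limits."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q):
    """Map a quint8 code back to the float it represents."""
    return (q - zero_point) * scale


representable = (dequantize(qmin), dequantize(qmax))

# Extremes of a [0, 1] pixel after the Normalize transform
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
input_min = min((0.0 - m) / s for m, s in zip(mean, std))
input_max = max((1.0 - m) / s for m, s in zip(mean, std))
```

The representable interval (about -2.13 to 2.64) nearly coincides with the normalized input range (about -2.12 to 2.64), which is what one would expect if the input observer saw the full range of normalized pixels.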

In [49]:
torch.jit.save(torch.jit.script(quantized_model), 'qat_model.pth')