# 16-bit Quantization in CMSIS-NN (per-tensor)


## Mathematics

Range: $[-2^{15}, 2^{15} - 1]$

**$x$ denotes a weight, bias, or tensor**

$k = 15 - \lceil \log_2(\max|x|) \rceil$

$x_q = \mathrm{round}(x \times 2^{k})$



**Recover the float $x^{'}$ from the quantized $x_q$:**

$x^{'} = \frac{x_q}{2^k}$

**Integer-only calculation**

weight: $w_q = w^{'} \times 2^{k_w}$

bias: $b_q = b^{'} \times 2^{k_b}$

input tensor: $x_q = x^{'} \times 2^{k_x}$

output tensor: $o_q = o^{'} \times 2^{k_o}$

**Floating point:** $o^{'} = w^{'} \times x^{'} + b^{'}$

We need to compute $o_q$ (and from it $o^{'}$) from $w_q$, $x_q$ and $b_q$ using integer operations only.


$w_q \times x_q + (b_q << b_{lshift}) = w^{'} \times x^{'} \times 2^{k_w + k_x} + b^{'} \times 2^{k_b + b_{lshift}}$

Let $b_{lshift} = k_x + k_w - k_b$

$w_q \times x_q + (b_q << b_{lshift}) = (w^{'} \times x^{'} + b^{'}) \times 2^{k_w + k_x}$

Let $o_{rshift} = k_x + k_w - k_o$

$(w_q \times x_q + (b_q << b_{lshift})) >> o_{rshift} = (w^{'} \times x^{'} + b^{'}) \times 2^{k_w + k_x - o_{rshift}} = o^{'} \times 2^{k_o} = o_q$
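The derivation above can be checked numerically for a single multiply-accumulate. This is a minimal sketch: the float values and the maximum magnitudes used to pick each format are hypothetical, and `frac_bits` is a helper name introduced only for this illustration. Because of rounding and the truncating right shift, the integer result is expected to match the directly quantized output only to within about one least significant bit.

```python
import numpy as np

def frac_bits(max_abs):
    """Fractional bits k = 15 - ceil(log2(max|x|)) for a given max magnitude."""
    return int(15 - np.ceil(np.log2(max_abs)))

# Hypothetical float values for one multiply-accumulate
w, x, b = 0.6, -1.7, 0.3
o = w * x + b  # floating-point reference

# Hypothetical per-tensor maximum magnitudes (assumed, not measured)
k_w, k_x = frac_bits(0.9), frac_bits(2.5)
k_b, k_o = frac_bits(0.4), frac_bits(1.5)

# Quantize each operand with its own number of fractional bits
w_q = int(round(w * 2 ** k_w))
x_q = int(round(x * 2 ** k_x))
b_q = int(round(b * 2 ** k_b))

# Shifts from the derivation; both are assumed non-negative here
b_lshift = k_x + k_w - k_b
o_rshift = k_x + k_w - k_o
assert b_lshift >= 0 and o_rshift >= 0

# Integer-only computation of o_q
o_q_int = (w_q * x_q + (b_q << b_lshift)) >> o_rshift

# Directly quantized floating-point result, for comparison
o_q_ref = int(round(o * 2 ** k_o))
print(o_q_int, o_q_ref, o_q_int / 2 ** k_o, o)
```

Python integers do not overflow, so this sketch says nothing about the accumulator width a real kernel needs.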

**To quantize a network, we need to know:**
1. the bias left shift ($b_{lshift}$) for each layer
2. the output right shift ($o_{rshift}$) for each layer
3. the input and output quantization formats ($k_x$ and $k_o$)



```python
# Quantize a random array

import numpy as np

x = (np.random.rand(20) * 5 - 2.5)

frac_bits = 15 - np.ceil(np.log2(np.max(np.abs(x))))
x_q = np.round(x * 2 ** frac_bits)
x_f = x_q / 2 ** frac_bits
print("x:")
print(x)
print("Quantized x:")
print(x_q)
print("Quantization error:")
print(x - x_f)
```

```
x:
[-2.3053369  -1.91560495 -0.89864258 -1.81937768 -1.60085279  0.15882521
  1.6689202   2.14076174  1.40368275  2.07776417  0.19306908  1.98640321
 -1.98940858 -2.06382312 -2.35358251  0.55091488 -1.79072939  2.30323224
 -1.24396121 -1.496369  ]
Quantized x:
[-18885. -15693.  -7362. -14904. -13114.   1301.  13672.  17537.  11499.
  17021.   1582.  16273. -16297. -16907. -19281.   4513. -14670.  18868.
 -10191. -12258.]
Quantization error:
[-3.90466010e-05  4.44653465e-05  3.90619017e-05 -4.17377855e-05
 -2.27088437e-05  1.17307267e-05 -2.51088445e-05  1.46659531e-05
 -3.77139214e-06  5.38292173e-06 -4.61510862e-05 -4.69879500e-05
 -2.86957579e-05  1.96485445e-05  5.51880863e-05  1.15563292e-05
  4.20913436e-05  9.58656843e-06  5.73471951e-05 -3.11141335e-05]
```
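One edge case of the formula for $k$ is worth noting: when $\max|x|$ is an exact power of two, $round(x \times 2^{k})$ reaches $+2^{15}$, which is one past the top of the 16-bit range. The sketch below adds a clip to handle it. The function names `quantize_q15` and `dequantize` are hypothetical helpers for this illustration, and saturating is one possible way to handle the overflow, not necessarily what the original workflow does. It also assumes the input contains at least one non-zero value.

```python
import numpy as np

def quantize_q15(x):
    """Per-tensor quantization to int16 with k fractional bits (sketch)."""
    x = np.asarray(x, dtype=np.float64)
    # Assumes max|x| > 0; log2(0) would give an infinite k
    k = int(15 - np.ceil(np.log2(np.max(np.abs(x)))))
    # Saturate to the int16 range so that a power-of-two maximum
    # maps to 32767 instead of overflowing to 32768
    x_q = np.clip(np.round(x * 2.0 ** k), -32768, 32767).astype(np.int16)
    return x_q, k

def dequantize(x_q, k):
    """Recover approximate float values from quantized integers."""
    return x_q.astype(np.float64) / 2.0 ** k

x_q, k = quantize_q15([2.0, -1.0, 0.25])
print(k, x_q, dequantize(x_q, k))
```

The saturated element carries an error of one least significant bit, while the other elements are unaffected.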


## In practice
1. Find $\max|w|$ and $\max|b|$ to compute $k_w$ and $k_b$, then quantize the weights and biases and save them.

2. Run the model on a test set to find $\max|x|$ of each intermediate tensor, and then compute its $k$. This gives $k_x$ (input) and $k_o$ (output) for every layer.

3. Calculate $b_{lshift}$ and $o_{rshift}$ for each layer.

$b_{lshift} = k_x + k_w - k_b$

$o_{rshift} = k_x + k_w - k_o$
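The three steps can be sketched end to end for a single fully connected layer. Everything concrete here is an assumption made for illustration: the random weights, the random calibration set standing in for a test set, the helper names `frac_bits` and `quantize`, and the use of int64 arrays as accumulators (chosen for simplicity, not as a statement about CMSIS-NN kernels).

```python
import numpy as np

def frac_bits(x):
    """k = 15 - ceil(log2(max|x|)) for a whole tensor."""
    return int(15 - np.ceil(np.log2(np.max(np.abs(x)))))

def quantize(x, k):
    """Round to k fractional bits, saturate to int16 range, keep as int64."""
    return np.clip(np.round(x * 2.0 ** k), -32768, 32767).astype(np.int64)

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(8, 4))          # hypothetical weights (in, out)
b = rng.uniform(-0.5, 0.5, size=4)           # hypothetical biases
x_calib = rng.uniform(-2, 2, size=(100, 8))  # hypothetical calibration inputs

# Step 1: formats of weight and bias, then quantize them
k_w, k_b = frac_bits(w), frac_bits(b)
w_q, b_q = quantize(w, k_w), quantize(b, k_b)

# Step 2: run the float model to find the input/output formats
o_calib = x_calib @ w + b
k_x, k_o = frac_bits(x_calib), frac_bits(o_calib)

# Step 3: shifts for this layer (assumed non-negative)
b_lshift = k_x + k_w - k_b
o_rshift = k_x + k_w - k_o
assert b_lshift >= 0 and o_rshift >= 0

# Integer-only inference on one sample, compared with the float result
x = x_calib[0]
x_q = quantize(x, k_x)
o_q = (x_q @ w_q + (b_q << b_lshift)) >> o_rshift
err = np.max(np.abs(o_q / 2.0 ** k_o - (x @ w + b)))
print("shifts:", b_lshift, o_rshift, "max abs error:", err)
```

In a multi-layer network the $k_o$ of one layer becomes the $k_x$ of the next, so step 2 only needs one $k$ per intermediate tensor.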

## Data format (from a 2D/3D/4D tensor to a flat buffer)

In CMSIS-NN everything is stored as a 1D array, so quantized weights must be converted to flat buffers before they are saved.

```python
# For TensorFlow 2.x
import numpy as np

# Convolutional layer: weight_conv is the layer's kernel array
weight_conv_flat = np.moveaxis(weight_conv, 2, 0).flatten("F")

# Fully connected layer: weight_fc is the layer's kernel array
weight_fc_flat = np.moveaxis(weight_fc, 1, 0).flatten()
```
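To see which element order these two calls produce, the following sketch applies them to small index-filled arrays. The shapes are hypothetical and assume the usual TensorFlow kernel layouts, (height, width, in_channels, out_channels) for a convolution and (in_features, out_features) for a fully connected layer. It only documents what the NumPy calls do; whether that order matches a particular CMSIS-NN kernel should be checked against that kernel's documentation.

```python
import numpy as np

# Hypothetical conv kernel: (height, width, in_channels, out_channels)
weight_conv = np.arange(3 * 2 * 4 * 5).reshape(3, 2, 4, 5)
conv_flat = np.moveaxis(weight_conv, 2, 0).flatten("F")

# Same element order as a C-order flatten with axes arranged as
# (out_channels, width, height, in_channels)
conv_ref = np.transpose(weight_conv, (3, 1, 0, 2)).flatten()
same_conv = np.array_equal(conv_flat, conv_ref)

# Hypothetical FC kernel: (in_features, out_features)
weight_fc = np.arange(6 * 3).reshape(6, 3)
fc_flat = np.moveaxis(weight_fc, 1, 0).flatten()

# Same as flattening the transposed matrix: one output row after another
fc_ref = weight_fc.T.flatten()
same_fc = np.array_equal(fc_flat, fc_ref)

print(same_conv, same_fc)
```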

### For intermediate tensors

![shape](./assets/shape.jpg)