In [3]:
import torch
import numpy as np


# Range-Based Linear Quantization
A float tensor is quantized linearly by multiplying it by a scale factor and rounding.
#### 1. Asymmetric:
- The scale factor is computed from the min/max of the float input and the range of the quantized type. A zero point $zp$ (bias) is introduced.
    
$$ x_q = round\Big(x_f * \frac{2^k - 1}{max(x_f) - min(x_f)} - zp\Big) \\
 zp = round\Big(min(x_f) * \frac{2^k - 1}{max(x_f) - min(x_f)}\Big) $$

- The quantized outputs of FC or conv layers can be computed by expressing the full-precision weights, biases, and inputs as functions of their quantized values and substituting them into the layer equation
-  gemmlowp documentation: https://github.com/google/gemmlowp/blob/master/doc/quantization.md#implementation-of-quantized-matrix-multiplication
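
The asymmetric formulas above can be sketched in NumPy as follows. This is a minimal illustration, not Distiller's implementation: the function names `asymmetric_quantize` and `asymmetric_dequantize` are hypothetical, the target is assumed to be an unsigned $k$-bit integer, and the final clip to $[0, 2^k - 1]$ is an added safeguard against rounding pushing a value just outside the range.

```python
import numpy as np

def asymmetric_quantize(x_f, k=8):
    """Quantize a float array to unsigned k-bit integers (illustrative sketch)."""
    x_min, x_max = float(x_f.min()), float(x_f.max())
    scale = (2**k - 1) / (x_max - x_min)  # assumes x_max > x_min
    zp = np.round(x_min * scale)          # zero point, as in the formula above
    x_q = np.round(x_f * scale - zp)
    x_q = np.clip(x_q, 0, 2**k - 1)       # guard against rounding overshoot
    return x_q.astype(np.int64), scale, zp

def asymmetric_dequantize(x_q, scale, zp):
    """Map quantized integers back to approximate float values."""
    return (x_q + zp) / scale

x_f = np.array([-1.0, 0.0, 0.5, 3.0])
x_q, scale, zp = asymmetric_quantize(x_f, k=8)
x_hat = asymmetric_dequantize(x_q, scale, zp)
print(x_q)    # integers in [0, 255]
print(x_hat)  # close to x_f, within one quantization step
```

Because the float range $[-1, 3]$ is mapped onto the full integer range, no quantized values are wasted even though the input is not centered on zero.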

#### 2. Symmetric
- Instead of the min/max of the float range, we use the symmetric range $ [-max(abs(x_f)), max(abs(x_f))] $

- No zero point, so range is symmetric about 0 for both float and quantized range

$$ x_q = round\Big(x_f * \frac{2^{k-1} - 1}{max(abs(x_f))}\Big) $$
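
A minimal NumPy sketch of the symmetric scheme, under the assumption that the target is a signed $k$-bit integer, so the largest representable magnitude is $2^{k-1} - 1$. The function name `symmetric_quantize` is hypothetical.

```python
import numpy as np

def symmetric_quantize(x_f, k=8):
    """Quantize a float array to signed k-bit integers without a zero point."""
    q_max = 2**(k - 1) - 1                # assumption: signed k-bit target
    scale = q_max / np.max(np.abs(x_f))   # assumes x_f is not all zeros
    x_q = np.clip(np.round(x_f * scale), -q_max, q_max)
    return x_q.astype(np.int64), scale

x_f = np.array([-1.0, 0.0, 0.5, 3.0])
x_q, scale = symmetric_quantize(x_f, k=8)
print(x_q)
```

With this skewed input the most negative quantized value is far from `-127`, so a large part of the negative integer range goes unused.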

#### Tradeoffs:
- In the symmetric approach, if the float range is biased toward one side, a portion of the quantized range is dedicated to float values we may never see
- Implementing symmetric approach is simpler

#### Other features:
- removing outliers and scale factor approximation (post-training)

# DoReFa
paper: https://arxiv.org/abs/1606.06160

- Defines a function `quantize_k()` that takes a real value $a_f \in [0,1]$ and outputs a discrete value $a_k \in \lbrace \frac{0}{2^k - 1}, \frac{1}{2^k - 1}, \dots, \frac{2^k - 1}{2^k - 1} \rbrace $, where $k$ is the bit precision

$$ x_q = quantize(x_f) = \frac{1}{2^k - 1} round((2^k - 1) x_f) $$
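
The quantize function above translates almost directly into NumPy. The sketch below is forward-pass only; the name `quantize_k` follows the text, and the input is assumed to already lie in $[0, 1]$.

```python
import numpy as np

def quantize_k(x_f, k):
    """Snap values in [0, 1] to the nearest of the 2**k evenly spaced levels."""
    n = 2**k - 1
    return np.round(n * x_f) / n

a = np.array([0.0, 0.2, 0.6, 0.9, 1.0])
a_q = quantize_k(a, k=2)  # levels are 0, 1/3, 2/3, 1
print(a_q)
```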

#### Activations:
- Activations are clipped to $[0, 1]$, then passed through the quantize function


#### Weights:
- Weights are first transformed by a function $f(w)$, then quantized with the quantize function.
    
$$ f(w) = \frac{tanh(w)}{2max(|tanh(w)|)} + 0.5 \\
   w_q = 2*quantize(f(w)) - 1 $$
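
A forward-pass sketch of the weight path, assuming the two formulas above and nothing else. The helper `dorefa_quantize_weights` is a hypothetical name, and `quantize_k` is redefined here so the snippet is self-contained. Training through the rounding step would additionally need a straight-through estimator, which is not shown.

```python
import numpy as np

def quantize_k(x_f, k):
    """Snap values in [0, 1] to the nearest of the 2**k evenly spaced levels."""
    n = 2**k - 1
    return np.round(n * x_f) / n

def dorefa_quantize_weights(w, k):
    """Squash weights into [0, 1] with tanh, quantize, then rescale to [-1, 1]."""
    t = np.tanh(w)
    f_w = t / (2 * np.max(np.abs(t))) + 0.5  # f(w), lies in [0, 1]
    return 2 * quantize_k(f_w, k) - 1        # w_q, lies in [-1, 1]

w = np.array([-2.0, -0.3, 0.4, 1.5])
w_q = dorefa_quantize_weights(w, k=2)
print(w_q)
```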
    
#### Notes:
- requires quantization-aware training (https://nervanasystems.github.io/distiller/quantization.html#quantization-aware-training)
- Gradient quantization is discussed in the paper but not supported in Distiller
- binary quantization not supported

# PACT: Parameterized Clipping Activation for Quantized Neural Networks
paper: https://arxiv.org/abs/1709.01134
- Similar to DoReFa, but the upper clipping value for the activations, $ \alpha $, is learned rather than hard-coded to 1
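
A forward-pass sketch of PACT-style activation quantization, assuming the clip-then-quantize form $y = clip(x, 0, \alpha)$, $y_q = round\big(y \cdot \frac{2^k - 1}{\alpha}\big) \cdot \frac{\alpha}{2^k - 1}$. Here `alpha` is fixed by hand for illustration; in training it would be a learned parameter. The function name `pact_quantize` is hypothetical.

```python
import numpy as np

def pact_quantize(x, alpha, k):
    """Clip activations to [0, alpha], then quantize to 2**k levels."""
    n = 2**k - 1
    y = np.clip(x, 0.0, alpha)
    return np.round(y * n / alpha) * alpha / n

x = np.array([-1.0, 0.5, 2.0, 7.0])
y_q = pact_quantize(x, alpha=6.0, k=2)  # levels are 0, 2, 4, 6
print(y_q)
```

Setting `alpha=1.0` reduces this to the DoReFa activation path.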


# WRPN: Wide Reduced-Precision Networks
- Similar to DoReFa 
- Activations are clipped to [0, 1], while weights are clipped to [-1, 1]
- Weights are quantized with k-1 bits, leaving one bit for the sign
- Paper discusses using wider layers to increase accuracy.
- Wider layers and binary weights not supported in Distiller
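
The clipping and bit allocation described above can be sketched as follows. This assumes the same round-to-grid scheme as DoReFa, with $2^{k-1} - 1$ positive levels for weights and $2^k - 1$ levels for activations. Both function names are hypothetical, and $k \ge 2$ is assumed for weights.

```python
import numpy as np

def wrpn_quantize_weights(w, k):
    """Clip weights to [-1, 1]; quantize the magnitude with k-1 bits (one bit is the sign)."""
    n = 2**(k - 1) - 1  # assumes k >= 2
    return np.round(n * np.clip(w, -1.0, 1.0)) / n

def wrpn_quantize_activations(a, k):
    """Clip activations to [0, 1]; quantize with k bits."""
    n = 2**k - 1
    return np.round(n * np.clip(a, 0.0, 1.0)) / n

w_q = wrpn_quantize_weights(np.array([-1.7, -0.4, 0.1, 0.8]), k=3)
a_q = wrpn_quantize_activations(np.array([-0.5, 0.3, 1.2]), k=3)
print(w_q)
print(a_q)
```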

# PyTorch Quantized Tensors
- Operator implementations support per-channel quantization only for the weights of conv and linear operators
- The min and max of the input data are mapped linearly to the min and max of the quantized data type
- documentation: https://pytorch.org/docs/stable/quantization.html

##### Mapping:
$$ Q(x, s, b) = round( \frac{x}{s} +  b)$$
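
To make the mapping concrete, here is a NumPy emulation in which $s$ is the scale and $b$ is the zero point. It is a sketch of the arithmetic only, not PyTorch's implementation: the function names are hypothetical, and saturating to the `uint8` range $[0, 255]$ is an assumption made for this example.

```python
import numpy as np

def affine_quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Q(x, s, b) = round(x / s + b), saturated to [q_min, q_max]."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, q_min, q_max).astype(np.uint8)

def affine_dequantize(q, scale, zero_point):
    """Approximate inverse: (q - b) * s. Cast first to avoid uint8 wraparound."""
    return (q.astype(np.float64) - zero_point) * scale

x = np.array([-2.0, -1.0, 0.0, 0.5, 2.0])
q = affine_quantize(x, scale=0.01, zero_point=128)
x_hat = affine_dequantize(q, scale=0.01, zero_point=128)
print(q)      # out-of-range inputs saturate at 0 and 255
print(x_hat)
```

Inputs that fall outside the representable range $[(q_{min} - b) \cdot s, (q_{max} - b) \cdot s]$ are saturated, so the choice of scale and zero point determines which values survive the round trip.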





In [14]:
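# Quantize per tensor
k = 8
scale = 1 / (2**k - 1)  # step size of a k-bit grid on [0, 1]
zero_point = 0
# quantize_per_tensor requires a float32 input; passing a float64 tensor raises
# "RuntimeError: quantize only works on Float Tensor."
a = torch.tensor(np.random.randn(4).astype('f'), dtype=torch.float32)
# The output dtype must be a quantized dtype (torch.quint8, torch.qint8 or torch.qint32).
# With this scale and zero_point, inputs outside [0, 1] saturate.
b = torch.quantize_per_tensor(a, scale=scale, zero_point=zero_point, dtype=torch.quint8)
print(a)
print(b)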