# quantization

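#### 1. What is quantization
Quantization converts floating-point numbers into fixed-point integers (usually `int8`/`uint8`).

#### 2. Why quantize
- Shrink the network size
- Match hardware that supports integer arithmetic
- Speed up computation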
-----

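#### 3. How to quantize


&emsp;&emsp;Map floating-point numbers in the range $[x_{min}, x_{max}]$ onto the range $[q_{min}, q_{max}]$. Intuitively, $x_{min}$ corresponds to $q_{min}$ and $x_{max}$ to $q_{max}$; every other value is placed according to its distance from $x_{min}$ divided by the step size, and the fractional part is then removed to obtain an integer.  
Written as a formula (taking $q_{min} = 0$, as with `uint8`):
$$ x_q= (x_f - x_{min})\frac{q_{max} - q_{min}}{x_{max}-x_{min}}\tag{2.1}$$
Then:
$$ x_f= x_q\frac{x_{max}-x_{min}}{q_{max} - q_{min}}+x_{min}\tag{2.2}$$
Rearranging `2.2`:
$$ x_f= \frac{x_{max}-x_{min}}{q_{max} - q_{min}}(x_q + x_{min}\frac{q_{max} - q_{min}}{x_{max}-x_{min}})\tag{2.3}$$
Define
$$s := \frac{x_{max}-x_{min}}{q_{max} - q_{min}}\tag{2.4}$$
and
$$zp := round(x_{min}\frac{q_{max} - q_{min}}{x_{max}-x_{min}})\tag{2.5}$$
With this sign convention $zp$ is negative whenever $x_{min} < 0$. Equation `2.3` becomes:
$$ x_f = s (x_q + zp)\tag{2.6}$$
and from `2.6`:
$$ x_q = \frac{x_f}{s} - zp \tag{2.7}$$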



```python
import numpy as np

def asymmetric_quant(arr):
    """Quantize arr in place onto [0, 255]; return (scale, zero point)."""
    arr_min, arr_max = arr.min(), arr.max()
    s = (arr_max - arr_min) / 255  # 2.4
    print("scale: ", s)
    zp = np.round(arr_min / s)  # 2.5 (negative when arr_min < 0)
    print("zero point: ", zp)
    # The array keeps its float dtype; it now holds integer-valued floats.
    arr[:] = np.clip(np.round(arr / s) - zp, 0, 255)  # 2.7
    return s, zp

def asymmetric_dequant(arr, s, zp):
    """Dequantize arr in place."""
    arr[:] = s * (arr + zp)  # 2.6

if __name__ == '__main__':
    a = np.array([-2.2, -1.0, 0.0, 0.1])
    print(a)
    s, zp = asymmetric_quant(a)
    print(a)
    asymmetric_dequant(a, s, zp)
    print(a)
```

```
[-2.2 -1.   0.   0.1]
scale:  0.009019607843137255
zero point:  -244.0
[  0. 133. 244. 255.]
[-2.20078431 -1.00117647  0.          0.09921569]
```


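The key to quantization and dequantization is that the floating-point value 0 maps to a fixed integer, the zero point. In the code above, 0.0 is quantized to 244 (the variable `zp` holds -244 because of the sign convention in `2.6`). This matters for padding: zero padding inserts real zeros, so 0 must be exactly representable after quantization. Padding, in turn, saves us from special-casing the array boundaries.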
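A minimal sketch of why an exactly representable zero matters. It assumes the widely used convention $x_f \approx s(q - zp)$ with $zp \in [0, 255]$, which has the opposite sign from `zp` in the code above; the helper names `quantize` and `dequantize` are illustrative only.

```python
import numpy as np

def quantize(x, s, zp):
    # Map floats to uint8 under x ≈ s * (q - zp).
    return np.clip(np.round(x / s) + zp, 0, 255).astype(np.uint8)

def dequantize(q, s, zp):
    # Widen to int32 first so the subtraction cannot wrap around.
    return s * (q.astype(np.int32) - zp)

x = np.array([-2.2, -1.0, 0.0, 0.1])
s = (x.max() - x.min()) / 255
zp = int(np.round(-x.min() / s))   # the integer that represents 0.0

q = quantize(x, s, zp)
# Zero padding in the quantized domain means padding with zp, not with 0.
q_padded = np.pad(q, 1, constant_values=zp)
x_padded = dequantize(q_padded, s, zp)
print(zp, q_padded, x_padded[[0, -1]])
```

The padded entries dequantize to exactly 0.0, so padding adds no quantization error.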

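There are two ways to apply quantization to a network. One is to quantize an already trained network to speed up inference. The other is to apply fake quantization during training; it is called fake quantization because training a neural network still requires floating-point numbers.  
In both cases the bulk of the work is matrix computation, which can be written as:
$$ f_1^{i,k} = \sum_{j=1}^{N}f_2^{i,j}*f_3^{j,k} + b\tag{2.8}$$
Using the usual sign convention for the zero point (the negative of the $zp$ in `2.6`, so that $zp$ is the integer representing 0.0):
$$ f = s(q- zp)$$
`2.8` then becomes:
$$ s_1(q_1^{i,k} - zp_1) = \sum_{j=1}^{N}s_2(q_2^{i,j} - zp_2)*s_3(q_3^{j,k} - zp_3) + s_b(q_b - zp_b)\tag{2.9}$$
Simplifying:
$$ q_1^{i,k} = \frac{s_2*s_3}{s_1}\sum_{j=1}^{N}(q_2^{i,j} - zp_2)*(q_3^{j,k} - zp_3) + zp_1 + \frac{s_b}{s_1}(q_b - zp_b) \tag{2.10}$$
Here $\sum_{j=1}^{N}(q_2^{i,j} - zp_2)*(q_3^{j,k} - zp_3)$ fits in `int32`, so the bias b is usually stored directly as `int32` as well, and $\frac{s_2*s_3}{s_1}$ can be written in the form $M_0*2^{-n}$, $n \ge 0$.  
A common choice is $s_b = s_2 s_3$ and $zp_b = 0$, which lets the bias be added straight into the accumulator, so `2.10` can be written as:
$$ q_1 =  M_0*2^{-n}*(\mathrm{int32}_{\Sigma} + \mathrm{int32}_{b}) + zp_1\tag{2.11}$$  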
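The simplified form above can be sketched with NumPy. This is an illustration under stated assumptions, not a reference kernel: the bias is assumed to be stored as `int32` with scale $s_2 s_3$ and zero point 0, the multiplier `M` is kept as a float for brevity (integer-only hardware would use $M_0 * 2^{-n}$), and the helper names are hypothetical.

```python
import numpy as np

def calc_s_zp(x, qmin=0, qmax=255):
    # Asymmetric parameters under x ≈ s * (q - zp).
    s = (x.max() - x.min()) / (qmax - qmin)
    zp = int(np.round(qmin - x.min() / s))
    return s, zp

def quant(x, s, zp):
    return np.clip(np.round(x / s) + zp, 0, 255).astype(np.int32)

rng = np.random.default_rng(0)
f2 = rng.uniform(-1, 1, (4, 8))      # left operand
f3 = rng.uniform(-1, 1, (8, 3))      # right operand
b = rng.uniform(-1, 1, 3)            # bias
f1 = f2 @ f3 + b                     # float reference

s1, zp1 = calc_s_zp(f1)
s2, zp2 = calc_s_zp(f2)
s3, zp3 = calc_s_zp(f3)
q2, q3 = quant(f2, s2, zp2), quant(f3, s3, zp3)

# Assumption: bias stored as int32 with scale s2 * s3 and zero point 0.
qb = np.round(b / (s2 * s3)).astype(np.int32)

acc = (q2 - zp2) @ (q3 - zp3) + qb   # int32 accumulator
M = s2 * s3 / s1                     # float here; M0 * 2**-n on integer hardware
q1 = np.clip(np.round(M * acc) + zp1, 0, 255).astype(np.int32)

f1_hat = s1 * (q1 - zp1)             # dequantized result
print(np.abs(f1_hat - f1).max(), s1)
```

The maximum error should be on the order of the output scale `s1`.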
-----


### symmetric
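When the data is offset far from zero, symmetric quantization piles the offset values up at $q_{max}$, because values outside the representable range saturate there.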

```python
import numpy as np

def sym_quant(arr):
    # Scale only, no zero point; clipping to [0, 255] collapses negatives to 0.
    s = 255 / (arr.max() - arr.min())
    arr[:] = np.clip(np.round(arr * s), 0, 255)

def asym_quant(arr):
    # Shift by the minimum first, then divide by the step size (2.1).
    s = (arr.max() - arr.min()) / 255
    print(s)
    arr[:] = np.clip(np.round((arr - arr.min()) / s), 0, 255)

def asym_quant2(arr):
    # Scale first, then subtract the zero point (2.7); s here is the inverse of 2.4.
    s = 255 / (arr.max() - arr.min())
    zp = np.round(arr.min() * s)
    print(zp, 255 - arr.max() * s)
    arr[:] = np.clip(np.round(arr * s) - zp, 0, 255)
    return s, zp

def de_quant(arr, s, zp):
    # Inverse of asym_quant2 (2.6 with 1/s in place of s).
    arr[:] = (arr + zp) / s

if __name__ == '__main__':
    a0 = np.array([-1.8, -1.0, 0.0, 0.5])
    a1 = np.array([-1.8, -1.0, 0.0, 0.5])
    a2 = np.array([-1.8, -1.0, 0.0, 0.5])

    # a = np.linspace(100, 110, 11)
    # b = np.linspace(100, 110, 11)
    # b1 = np.linspace(100, 110, 11)
    print(a0)
    sym_quant(a0)
    asym_quant(a1)
    s, zp = asym_quant2(a2)
    print(a0)
    print(a1)
    print(a2)
    de_quant(a2, s, zp)
    print(a2)
```

```
[-1.8 -1.   0.   0.5]
0.009019607843137253
-200.0 199.56521739130434
[ 0.  0.  0. 55.]
[  0.  89. 200. 255.]
[  0.  89. 200. 255.]
[-1.80392157 -1.00117647  0.          0.49607843]
```


#### asymmetric


In general, for both per-channel and per-layer schemes, weights are quantized symmetrically and activations are quantized asymmetrically.
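A small sketch of symmetric weight quantization, comparing per-layer and per-channel scales. It assumes `int8` storage, a zero point of 0, and that axis 0 of the weight matrix is the output-channel axis; `sym_quant_weights` is an illustrative name.

```python
import numpy as np

def sym_quant_weights(w, per_channel=True):
    # Symmetric int8: zero point is 0, scale comes from max |w|.
    # Axis 0 is assumed to be the output-channel axis.
    if per_channel:
        max_abs = np.abs(w).max(axis=1, keepdims=True)
    else:
        max_abs = np.abs(w).max()
    s = max_abs / 127
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s

rng = np.random.default_rng(0)
# Two output channels with very different magnitudes.
w = np.stack([rng.normal(0, 1.0, 64), rng.normal(0, 0.01, 64)])

for per_channel in (False, True):
    q, s = sym_quant_weights(w, per_channel)
    err = np.abs(q * s - w).max(axis=1)   # max error per channel
    print(per_channel, err)
```

With a single per-layer scale, the small-magnitude channel is resolved far more coarsely than with its own per-channel scale.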

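The scheme quantizes both weights and activations to 8-bit integers and keeps only a few parameters, such as the bias, as 32-bit integers:
- There are few bias parameters to begin with.
- Bias values need higher precision: each bias-vector entry is added to many activations, so an error in one bias value becomes a global error.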
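To illustrate the precision argument, the sketch below quantizes the same bias vector to `int32` and to `int8`. The scales `s_x` and `s_w` are made-up values, and the bias scale $s_x s_w$ with zero point 0 is an assumption (a common choice), not something the notes specify.

```python
import numpy as np

s_x, s_w = 0.02, 0.005           # hypothetical activation and weight scales
s_b = s_x * s_w                  # assumed bias scale, zero point 0
b = np.array([0.731, -1.204, 0.0093])

# int32 bias: the tiny scale s_b is affordable because int32 has range to spare.
q_b32 = np.round(b / s_b).astype(np.int32)
err32 = np.abs(q_b32 * s_b - b).max()

# int8 bias for comparison: the scale must grow to cover the value range.
s_b8 = np.abs(b).max() / 127
q_b8 = np.clip(np.round(b / s_b8), -127, 127).astype(np.int8)
err8 = np.abs(q_b8 * s_b8 - b).max()

print(q_b32, err32, err8)
```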

> Range-Based Linear Quantization  
Let's break down the terminology we use here:  
Linear: Means a float value is quantized by multiplying with a numeric constant (the scale factor).  
Range-Based: Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. ___Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers.___ This is in contrast to the other methods described here, which we could call ___clipping-based___, as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).

By strategy, quantization algorithms fall roughly into two categories: post-training quantization and quantization-aware training.
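As a sketch of the range-based idea quoted above, the following compares the naive min/max range with a narrower percentile range on data that contains a few outliers. The 0.1 and 99.9 percentiles are an illustrative choice, and `quant_dequant` is a hypothetical helper.

```python
import numpy as np

def quant_dequant(x, lo, hi, bits=8):
    # Asymmetric quantize onto [lo, hi], then dequantize.
    qmax = 2 ** bits - 1
    s = (hi - lo) / qmax
    q = np.clip(np.round((x - lo) / s), 0, qmax)
    return q * s + lo

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 10_000)
x[:5] = 100.0                     # a few large outliers

# Naive range: actual min/max of the tensor.
naive = quant_dequant(x, x.min(), x.max())
# Narrower range: 0.1 and 99.9 percentiles (an illustrative choice).
lo, hi = np.percentile(x, [0.1, 99.9])
clipped = quant_dequant(x, lo, hi)

inliers = np.abs(x) < 10
mse_naive = np.mean((naive - x)[inliers] ** 2)
mse_clipped = np.mean((clipped - x)[inliers] ** 2)
print(mse_naive, mse_clipped)
```

The narrower range sacrifices the outliers but represents the bulk of the values with a much smaller step size.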

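Quantized matrix multiplication (mm); in the symmetric case the zero points are 0:
$$ q_y = \frac{s_x*s_w}{s_y}\sum_{j = 1} ^ {N}(q_x^{i, j}- zp_x)\ (q_w^{j, k} - zp_w) + zp_y\tag{1.1}$$
To express $\frac{s_x*s_w}{s_y}$ in fixed point, write the floating-point number as the product of a number in $[0, 1]$ and $2^{-n}$, $n \ge 0$.  

Computing the mean squared error between the quantized-then-dequantized float values and the original float values shows directly that a smaller bit width gives a larger error, and a larger value range gives a larger error.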
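The mean-squared-error comparison can be sketched as follows, assuming plain min/max asymmetric quantization; `quant_dequant_mse` is an illustrative helper.

```python
import numpy as np

def quant_dequant_mse(x, bits):
    # Asymmetric min/max quantization to `bits` bits, then dequantize.
    qmax = 2 ** bits - 1
    s = (x.max() - x.min()) / qmax
    q = np.clip(np.round((x - x.min()) / s), 0, qmax)
    x_hat = q * s + x.min()
    return np.mean((x_hat - x) ** 2)

rng = np.random.default_rng(0)
base = rng.uniform(-1, 1, 10_000)

for scale in (1, 10):                 # value range about [-1, 1] vs [-10, 10]
    for bits in (8, 6, 4):
        mse = quant_dequant_mse(base * scale, bits)
        print(f"range=±{scale:<2d} bits={bits} mse={mse:.2e}")
```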

The choice of a quantization paradigm affects the calculations that gemmlowp itself needs to perform, specifically, it affects how one goes from internal 32bit accumulator to final 8bit outputs.

$$ \mathrm{real\_value} = \mathrm{scale} \times (\mathrm{quantized\_value} - \mathrm{zero\_point}) \tag{3}$$
In $\sum_{j = 1} ^ {N}(q_x^{i, j}- zp_x)\ (q_w^{j, k} - zp_w) $, all of these values are typically uint8. The differences of uint8 values would typically be represented as signed int16, and their products as signed int32.  
When the zero points can be ignored (all equal to 0), (1.1) becomes:
$$ q_y =  M_0*2^{-n}*\mathrm{int32}\tag{1.2}$$  
where $M = \frac{s_x*s_w}{s_y} = M_0*2^{-n}$, so $M_0 = M * 2 ^ n$.

```python
M = 0.0072474273418460
# P = 7091

def multiply(n, M):
    # result = M * P
    # Rounding is not strictly required here; it is done this way because
    # fixed-point numbers are awkward to represent in Python.
    Mo = int(round(2 ** n * M))

    # Mo * 2**-n is the fixed-point approximation of M.
    approx_result = Mo / 2 ** n
    print("n=%d, Mo=%d,  error=%f"%\
          (n, Mo,  M-approx_result))

for n in range(1, 16):
    multiply(n, M)
```

```
n=1, Mo=0,  error=0.007247
n=2, Mo=0,  error=0.007247
n=3, Mo=0,  error=0.007247
n=4, Mo=0,  error=0.007247
n=5, Mo=0,  error=0.007247
n=6, Mo=0,  error=0.007247
n=7, Mo=1,  error=-0.000565
n=8, Mo=2,  error=-0.000565
n=9, Mo=4,  error=-0.000565
n=10, Mo=7,  error=0.000411
n=11, Mo=15,  error=-0.000077
n=12, Mo=30,  error=-0.000077
n=13, Mo=59,  error=0.000045
n=14, Mo=119,  error=-0.000016
n=15, Mo=237,  error=0.000015
```
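A sketch of how such a multiplier might be applied to an accumulator using integer arithmetic only. It assumes $0 < M < 1$, normalizes $M_0$ into $[0.5, 1)$, and stores it with 31 fractional bits; the function names and the sample accumulator value 7091 are illustrative.

```python
import math

def quantize_multiplier(M):
    # Split 0 < M < 1 into M = M0 * 2**-n with M0 in [0.5, 1),
    # then store M0 as an integer with 31 fractional bits.
    m0, exp = math.frexp(M)          # M = m0 * 2**exp, 0.5 <= m0 < 1
    n = -exp
    m0_q31 = round(m0 * (1 << 31))
    if m0_q31 == (1 << 31):          # rounding pushed M0 up to 1.0
        m0_q31 //= 2
        n -= 1
    return m0_q31, n

def apply_multiplier(acc, m0_q31, n):
    # Integer-only acc * M, rounding half up on the final shift.
    # Python ints are unbounded; hardware would use a 64-bit intermediate.
    total_shift = 31 + n
    rounding = 1 << (total_shift - 1)
    return (acc * m0_q31 + rounding) >> total_shift

M = 0.0072474273418460
m0_q31, n = quantize_multiplier(M)
acc = 7091                           # sample accumulator value
print(n, apply_multiplier(acc, m0_q31, n), round(acc * M))
```

Storing $M_0$ with many fractional bits keeps the relative error of the multiplier small even though $M$ itself is tiny.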


-----
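#### Operator quantization workflow

1. Implement the 2x2 transposed convolution and check it against torch (input and output data are both channel-last)
2. Implement the 3x3 convolution
3. Understand the network structure, how the data size changes, and how data is passed along
4. Design the corresponding quantization structure, the data type conversions, and how the quantization parameters (scale, zero point) are passed along
5. Implement the quantized flow for each operator
6. Implement the low-level quant, dequant, calc_s_zp
7. The steps above are the asymmetric implementation. Extend them with a symmetric version.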
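One possible shape for the low-level helpers in step 6, covering both the asymmetric and the symmetric variant from step 7. The signatures, the `symmetric` flag, and the convention $x \approx s(q - zp)$ are assumptions made for this sketch.

```python
import numpy as np

def calc_s_zp(x, qmin=0, qmax=255, symmetric=False):
    # Scale and zero point under x ≈ s * (q - zp).
    if symmetric:
        half = (qmax - qmin) // 2
        s = np.abs(x).max() / half
        zp = qmin + half + 1         # midpoint of the integer range
    else:
        s = (x.max() - x.min()) / (qmax - qmin)
        zp = int(np.round(qmin - x.min() / s))
    return float(s), zp

def quant(x, s, zp, qmin=0, qmax=255):
    return np.clip(np.round(x / s) + zp, qmin, qmax).astype(np.int32)

def dequant(q, s, zp):
    return s * (q - zp)

x = np.array([-1.8, -1.0, 0.0, 0.5])
for symmetric in (False, True):
    s, zp = calc_s_zp(x, symmetric=symmetric)
    q = quant(x, s, zp)
    print(symmetric, zp, q, dequant(q, s, zp))
```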

###### 1. 2x2 transposed convolution

> For an in-depth treatment of the subject, see
Chapter 9 of the Deep Learning textbook 

[___`intro for conv`___](https://arxiv.org/pdf/1603.07285.pdf)  
1. Why convolve  
> Convolution captures information related to the dimensions of the array.


Number of convolution steps (output size):
$$o = \lfloor \frac{i - k + 2p}{s} \rfloor + 1$$

To keep the size unchanged after convolution, k must be odd, with s = 1 and p = ⌊k/2⌋.
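A quick check of the output-size formula, assuming p = ⌊k/2⌋ for the size-preserving case; `conv_out_size` is an illustrative helper name.

```python
def conv_out_size(i, k, s=1, p=0):
    # o = floor((i - k + 2p) / s) + 1
    return (i - k + 2 * p) // s + 1

# Odd k, s = 1, p = k // 2 keeps the size unchanged.
for k in (1, 3, 5, 7):
    print(k, conv_out_size(32, k, s=1, p=k // 2))

# Padding by k - 1 on each side grows the output to i + k - 1.
print(conv_out_size(32, 3, s=1, p=2))
```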

Full padding:  
> every
possible partial or complete superimposition of the kernel on the input feature
map is taken into account.

- Pooling:
$$o = \lfloor \frac{i - k}{s} \rfloor + 1$$
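The pooling formula can be checked against a tiny 1D max pool. This is a sketch only; `pool_out_size` and `max_pool_1d` are illustrative names.

```python
import numpy as np

def pool_out_size(i, k, s):
    # o = floor((i - k) / s) + 1
    return (i - k) // s + 1

def max_pool_1d(x, k, s):
    # Take the maximum over each window of length k, moving by stride s.
    o = pool_out_size(len(x), k, s)
    return np.array([x[j * s : j * s + k].max() for j in range(o)])

x = np.arange(10)
y = max_pool_1d(x, k=3, s=2)
print(pool_out_size(10, 3, 2), y)
```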

###### trans conv
> To maintain the same connectivity pattern in the equivalent convolution it is
necessary to zero pad the input in such a way that the first (top-left) application
of the kernel only touches the top-left pixel, i.e., the padding has to be equal to
the size of the kernel minus one
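The equivalence described in the quote can be checked in 1D for unit stride and no padding. The sketch assumes the equivalent convolution uses the flipped kernel; the function names are illustrative, and the 2D case works the same way along each axis.

```python
import numpy as np

def conv1d(x, w):
    # Cross-correlation, unit stride, no padding.
    k = len(w)
    return np.array([x[j : j + k] @ w for j in range(len(x) - k + 1)])

def trans_conv1d(x, w):
    # Scatter form: every input element adds a scaled copy of the kernel.
    y = np.zeros(len(x) + len(w) - 1)
    for j, v in enumerate(x):
        y[j : j + len(w)] += v * w
    return y

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 10.0, 100.0])

# Equivalent convolution: zero pad by k - 1 on each side and flip the kernel.
x_padded = np.pad(x, len(w) - 1)
equiv = conv1d(x_padded, w[::-1])
print(trans_conv1d(x, w), equiv)
```

The first window of the padded input overlaps only the first input element, which matches the connectivity pattern in the quote.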