# Fixed-Point Quantisation of CNNs

This tutorial introduces fixed-point quantisation of convolutional neural networks (CNNs) using our Plumber tool-chain.

### Fixed point or Q representation

To represent a non-integer number, a developer usually has two options. The first is floating-point representation, which trades numerical range against precision. However, this data type and its arithmetic are challenging to implement efficiently in hardware unless the processing device has a dedicated *F*loating *P*oint *U*nit (FPU).

That is why most low-power, low-performance embedded devices use fixed-point representation, or *Q*-representation. A non-integer number is represented by a fixed number of bits split into two parts: the _integer_ part (IP) and the _fractional_ part. For example, a Q16 number has 16 fractional bits; a Q2.14 number has 2 integer bits and 14 fractional bits. Note that signed numbers need one additional bit for the sign; depending on the convention, this bit is either counted among the integer bits or added on top of them.
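To make the notation concrete, the following minimal sketch computes the representable range and the resolution of a Q format. The helper name `q_format_info` is hypothetical, and the sketch assumes the convention in which the sign bit is added on top of the integer bits.

```python
def q_format_info(int_bits, frac_bits, signed=True):
    """Return (min_value, max_value, resolution) of a Q<int_bits>.<frac_bits> number.

    Assumption: for signed numbers the sign bit is added on top of int_bits.
    """
    # Smallest step between two neighbouring representable values
    resolution = 2.0 ** -frac_bits
    # Bits that carry magnitude (everything except the sign bit)
    magnitude_bits = int_bits + frac_bits
    max_raw = (1 << magnitude_bits) - 1
    min_raw = -(1 << magnitude_bits) if signed else 0
    return min_raw * resolution, max_raw * resolution, resolution


lo, hi, step = q_format_info(2, 14)
print(lo, hi, step)
```

Under this convention, Q2.14 covers the range from -4 up to 4 - 2^-14 in steps of 2^-14.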

This representation has its pros and cons. On the one hand, it is very easy to [implement](https://en.wikipedia.org/wiki/Q_(number_format)#Math_operations) in low-level designs, which improves performance and lowers power consumption. On the other hand, its precision is limited. Let's see that in an example.

In [33]:
import numpy as np

# Number of fractional bits
f = 2

# Scale factor 2**f used to convert between real values and fixed-point integers
scale = 1 << f

a = np.linspace(1, 2, 10)
# Quantise: scale up, round to the nearest integer, then scale back down
a_fix = np.round(a * scale) / scale

print(a)
print(a_fix)

[1.         1.11111111 1.22222222 1.33333333 1.44444444 1.55555556
 1.66666667 1.77777778 1.88888889 2.        ]
[1.   1.   1.25 1.25 1.5  1.5  1.75 1.75 2.   2.  ]
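When values are rounded to the nearest representable number, the error is at most half a quantisation step, that is 2^-(f+1) for f fractional bits. The sketch below checks this numerically for a few choices of f; the helper name `quantise` is hypothetical, and no word-length limit is applied.

```python
import numpy as np


def quantise(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no word-length limit)."""
    scale = 1 << frac_bits
    return np.round(x * scale) / scale


x = np.linspace(1, 2, 10)
for frac_bits in (1, 2, 4, 8):
    # Largest absolute rounding error observed on the sample points
    max_err = np.max(np.abs(x - quantise(x, frac_bits)))
    # Theoretical bound: half of one quantisation step
    bound = 2.0 ** -(frac_bits + 1)
    print(frac_bits, max_err, bound)
```

Each additional fractional bit halves the bound, which is why precision improves as f grows.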


On the other hand, for a fixed word length every bit given to the fractional part is taken from the integer part, so with too many fractional bits we lose range in the integer part:


In [37]:
# Fractional bits; this must be a Python int, because << is not defined for floats
f = 3
one = np.float32(1.0)
scale = 1 << f

a = np.linspace(0, 128, 10, dtype=np.float32)
# Quantise in float32: scale up, round to the nearest integer, scale back down
a_fix = np.round(a * scale) * (one / scale)

print(a)
print(a_fix)
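The loss of integer range only becomes visible once the total word length is fixed. The following sketch assumes a signed 8-bit two's-complement word that saturates on overflow; the helper name `to_fixed` and the `word_bits` parameter are hypothetical and are not part of the Plumber tool-chain.

```python
import numpy as np


def to_fixed(x, frac_bits, word_bits=8):
    """Quantise x to a signed fixed-point word and return the real values it represents.

    Assumption: a signed two's-complement word of word_bits bits that
    saturates (clips) on overflow instead of wrapping around.
    """
    scale = 1 << frac_bits
    raw_min = -(1 << (word_bits - 1))
    raw_max = (1 << (word_bits - 1)) - 1
    # Scale, round to an integer, then clip to what the word can hold
    raw = np.clip(np.round(x * scale), raw_min, raw_max)
    return raw / scale


x = np.linspace(0, 128, 10)
for frac_bits in (0, 3, 6):
    print(frac_bits, to_fixed(x, frac_bits))
```

With 8 bits in total, 3 fractional bits cap the largest representable value at 15.875, and 6 fractional bits cap it at 1.984375, so most of the inputs above saturate.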