- fbgemm
  - on CPU (x86)
  - up to 7 bits
  - supports customization
- tensorrt
  - early prototype
  - only supports the graph-based approach
  - only static quantization
- qnnpack
  - for ARM processors
- x86/native
  - automatically chooses qnnpack or fbgemm
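In PyTorch, the backend is selected through `torch.backends.quantized.engine`; a minimal sketch (which engines are actually available depends on the local build):

```python
import torch

# Engines compiled into this build of PyTorch (e.g. fbgemm on x86,
# qnnpack on ARM); "x86"/"native" pick automatically where available.
print(torch.backends.quantized.supported_engines)

# Select fbgemm explicitly if this build supports it.
if "fbgemm" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "fbgemm"
```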
Inference Time
- Estimated MACs (multiply-accumulate operations)
  - Same as the FP model, but with integer arithmetic
- Actual hardware integer operations / FLOPS
  - Not measurable, highly backend dependent
- Actual time
  - Feasible when using the same backend
  - Not directly comparable to the FP model, due to backend differences
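Estimating MACs can be sketched by counting multiply-accumulates per layer from shapes alone; the layer shapes below are made-up examples:

```python
def linear_macs(in_features: int, out_features: int) -> int:
    # One MAC per (input feature, output feature) pair.
    return in_features * out_features

def conv2d_macs(c_in, c_out, k_h, k_w, h_out, w_out) -> int:
    # Each output position accumulates over the kernel window
    # and all input channels, for every output channel.
    return c_in * c_out * k_h * k_w * h_out * w_out

# Example: a 3x3 conv, 3 -> 8 channels, 30x30 output feature map.
print(conv2d_macs(3, 8, 3, 3, 30, 30))  # 194400
print(linear_macs(512, 10))             # 5120
```

The count is the same for the quantized and the FP model; only the arithmetic per MAC changes.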
Memory
- Estimated size (bits * params)
- Stored state_dict size
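The bits × params estimate can be compared against the serialized state_dict; a sketch with a made-up layer:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(256, 128)  # made-up example layer
n_params = sum(p.numel() for p in model.parameters())

# Estimated size: bits per parameter * parameter count.
fp32_bytes = n_params * 32 // 8
int8_bytes = n_params * 8 // 8

# Actual stored size of the state_dict (includes serialization overhead).
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
print(fp32_bytes, int8_bytes, buf.tell())
```

The stored size is slightly larger than the raw estimate because of the serialization format's overhead.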
- Weight Observers
  - up to 7 bits
  - Range:
    - full range [-64, 63]
    - symmetric range [-63, 63]
    - non-negative range [0, 63]
  - Statistics:
    - Min/Max
  - Granularity:
    - Per Channel
    - Per Tensor
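The three 7-bit ranges above can be reproduced with a small helper that derives scale and zero-point from observed min/max; this is a sketch of min/max-observer logic, not PyTorch's exact implementation:

```python
def qparams(x_min, x_max, qmin, qmax, symmetric=False):
    """Scale and zero-point from observed min/max, min/max-observer style."""
    if symmetric:
        bound = max(abs(x_min), abs(x_max))
        x_min, x_max = -bound, bound
    # The representable range must always cover zero.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point

# The 7-bit ranges from the notes:
full = (-64, 63)     # full range
sym = (-63, 63)      # symmetric range
nonneg = (0, 63)     # non-negative range

s, zp = qparams(-1.0, 1.0, *sym, symmetric=True)
print(s, zp)  # scale 2/126, zero_point 0
```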
- Activation Observers
  - up to 7 bits
  - Range:
    - non-negative range [0, 127]
  - Statistics (static):
    - Min/Max
    - Min/Max (moving avg.)
    - Histogram
  - Granularity:
    - Per Tensor
  - Dynamic (only for linear layers)
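The moving-average statistic can be sketched as an exponential update of the running min/max; the averaging constant 0.01 mirrors PyTorch's default, the rest is a simplified sketch:

```python
class MovingAvgMinMax:
    """Track running min/max with an exponential moving average."""
    def __init__(self, averaging_constant: float = 0.01):
        self.c = averaging_constant
        self.min = None
        self.max = None

    def observe(self, batch):
        b_min, b_max = min(batch), max(batch)
        if self.min is None:          # first batch initializes the stats
            self.min, self.max = b_min, b_max
        else:                         # later batches only nudge them
            self.min += self.c * (b_min - self.min)
            self.max += self.c * (b_max - self.max)

obs = MovingAvgMinMax()
obs.observe([0.0, 1.0])
obs.observe([0.0, 2.0])
print(obs.min, obs.max)  # max ≈ 1.01: outliers barely move the range
```

Compared to a plain min/max observer, a single outlier batch shifts the range by only 1% of the gap, which makes the activation range robust against rare spikes in the calibration data.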
- Mixed Quantization
  - Conv
    - Static
    - No Quantization
  - Linear
    - Static
    - Dynamic
    - No Quantization
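Mixing per layer type is what `torch.ao.quantization.quantize_dynamic` does: listing only `nn.Linear` quantizes linear layers dynamically and leaves convolutions in floating point (the toy model is a made-up example):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.fc = nn.Linear(8, 4)
    def forward(self, x):
        x = self.conv(x).mean(dim=(2, 3))
        return self.fc(x)

model = Toy().eval()
# Only nn.Linear is listed, so conv layers stay in FP32.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(qmodel)
```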
- Conv
  - PTQ
    - limit calibration set
  - QAT
    - optimizer + settings
    - stop criterion
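A PTQ run with a limited calibration set can be sketched in eager mode; the model, qconfig choice, and 8 calibration batches are assumptions (QAT would swap `prepare` for `prepare_qat` plus a training loop with an optimizer and a stop criterion):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qconfig, prepare, convert
)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()
    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

# Pick whichever engine this build supports.
engine = ("fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines
          else "qnnpack")
torch.backends.quantized.engine = engine

model = Net().eval()
model.qconfig = get_default_qconfig(engine)
prepared = prepare(model)

# Limited calibration set: a few batches feed the observers.
for _ in range(8):
    prepared(torch.randn(1, 3, 32, 32))

quantized = convert(prepared)
```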
- Static Quantization
  - Different Observers
  - Different Quantization Ranges
- Mixed Dynamic Quantization
  - Dynamic quantization of linear layers, no quantization of conv layers
  - Dynamic quantization of linear layers, static quantization of conv layers
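Different observers and quantization ranges plug into a custom `QConfig`; a sketch pairing a moving-average activation observer (`reduce_range=True` for the 7-bit [0, 127] range) with a symmetric per-channel weight observer:

```python
import torch
from torch.ao.quantization import (
    QConfig, MovingAverageMinMaxObserver, PerChannelMinMaxObserver
)

qconfig = QConfig(
    # Static activations: quint8 with reduce_range=True -> effective [0, 127].
    activation=MovingAverageMinMaxObserver.with_args(
        dtype=torch.quint8, reduce_range=True
    ),
    # Weights: symmetric qint8, one scale per output channel.
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)

# Observers can be instantiated and fed tensors directly:
act_obs = qconfig.activation()
act_obs(torch.rand(4, 16))
scale, zero_point = act_obs.calculate_qparams()
```

Assigning such a `QConfig` to a module (or submodule) before `prepare` is how per-layer observer and range choices are expressed.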