experiments.md


Backends (PyTorch)

  • fbgemm
    • on cpu
    • up to 7 bits
    • supports customization
  • tensorrt
    • early prototype
    • only supports graph based approach
    • only static quantization
  • qnnpack
    • for ARM processors
  • x86/native
    • automatically chooses between qnnpack and fbgemm
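
As a minimal sketch of how a backend from the list above might be selected in eager-mode PyTorch: the helper `pick_engine` and its preference order are assumptions made for illustration, not part of the experiments. It only relies on `torch.backends.quantized.supported_engines`, `torch.backends.quantized.engine`, and `get_default_qconfig`.

```python
import torch
from torch.ao.quantization import get_default_qconfig


def pick_engine(preferred=("x86", "fbgemm", "qnnpack")):
    """Return the first preferred quantized engine this PyTorch build supports.

    The preference order is an assumption chosen for this sketch.
    """
    supported = torch.backends.quantized.supported_engines
    for name in preferred:
        if name in supported:
            return name
    raise RuntimeError(f"no preferred engine among {supported}")


engine = pick_engine()
torch.backends.quantized.engine = engine  # kernels used by quantized ops
qconfig = get_default_qconfig(engine)     # default observers for that backend
print(engine, qconfig)
```

Which engines are available depends on the build and the CPU, so the printed engine may differ between machines.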

Quantization efficiency metrics

  • Inference Time

    • Estimated MACs (Multiply-Accumulate Operations)
      • Same as FP Model, but with integer arithmetic
    • Actual Hardware Integer Operations/FLOPS
      • Not measurable; highly backend-dependent
    • Actual Time
      • Feasible when using the same backend
      • Not directly comparable to the FP Model, due to backend differences
  • Memory

    • Estimated Size (bits * params)
    • Stored State-Dict size
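
The two estimated metrics above (MACs and bits * params) can be computed without running the model. The following is a rough sketch; the function names and the example layer (3 input channels, 16 output channels, 3x3 kernel, 32x32 output) are hypothetical and only serve to show the arithmetic.

```python
def estimated_size_bytes(num_params, bits):
    """Estimated storage: bits * params, converted from bits to bytes."""
    return num_params * bits / 8


def linear_macs(in_features, out_features):
    """One multiply-accumulate per weight for a single input vector."""
    return in_features * out_features


def conv2d_macs(in_ch, out_ch, kernel, out_h, out_w, groups=1):
    """MACs of a 2D convolution for a single input sample (bias ignored)."""
    k_h, k_w = kernel
    return out_h * out_w * out_ch * (in_ch // groups) * k_h * k_w


# Hypothetical conv layer: 3 -> 16 channels, 3x3 kernel, 32x32 output map.
weights = 16 * 3 * 3 * 3
macs = conv2d_macs(3, 16, (3, 3), 32, 32)
size_fp32 = estimated_size_bytes(weights, 32)
size_7bit = estimated_size_bytes(weights, 7)
print(macs, size_fp32, size_7bit)
```

The MAC count is the same for the FP and the quantized model; only the estimated size changes with the bit width. The stored state-dict size will usually be larger than the estimate, since it also contains scales, zero points, and serialization overhead.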

Quantization Customization

  • Weight Observers
    • up to 7 Bits
    • Range:
      • full range [-64, 63]
      • symmetric range [-63, 63]
      • non-negative range [0, 63]
    • Statistics:
      • Min/Max
    • Granularity
      • Per Channel
      • Per Tensor
  • Activation Observers
    • up to 7 Bits
    • Range:
      • non-negative range [0, 127]
    • Statistics (static):
      • Min/Max
      • Min/Max (moving avg.)
      • Histogram
    • Granularity:
      • Per Tensor
    • Dynamic (only for linear layers)
  • Mixed Quantization
    • Conv
      • Static
      • No Quantization
    • Linear
      • Static
      • Dynamic
      • No Quantization
  • PTQ
    • limit on calibration set size
  • QAT
    • optimizer + settings
    • stop criterion

$\to$ 2700 quantization configurations, each of which can be run as PTQ or QAT
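
To make the observer ranges above concrete, here is a small sketch of how the integer bounds follow from the bit width and how a Min/Max observer could turn observed statistics into a scale and zero point. The function names are hypothetical, and the affine formula is the standard one rather than something confirmed by these notes. Under this reading, the weight range [0, 63] is the non-negative half of the signed 7-bit grid, which is an interpretation.

```python
def quant_range(bits, mode):
    """Return (quant_min, quant_max) for a `bits`-bit integer grid."""
    if mode == "full":          # signed two's-complement range
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if mode == "symmetric":     # signed, most negative value dropped
        return -(2 ** (bits - 1) - 1), 2 ** (bits - 1) - 1
    if mode == "non_negative":  # unsigned range
        return 0, 2 ** bits - 1
    raise ValueError(f"unknown mode: {mode}")


def affine_qparams(x_min, x_max, qmin, qmax):
    """Scale and zero point from observed min/max (range must include 0)."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = qmin - round(x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin, qmax):
    """Map a float to the integer grid, clamping to [qmin, qmax]."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


qmin, qmax = quant_range(7, "non_negative")            # activation range
scale, zp = affine_qparams(-0.5, 1.5, qmin, qmax)      # hypothetical statistics
print(quant_range(7, "full"), quant_range(7, "symmetric"), (qmin, qmax), zp)
```

With these definitions the 7-bit ranges come out as [-64, 63], [-63, 63], and [0, 127], matching the values listed for weights and activations.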

PTQ

  • Static Quantization
    • Different Observers
    • Different Quantization Ranges
  • Mixed Dynamic Quantization
    • Dynamic Quantization of linear layers, No quantization of conv layers
    • Dynamic Quantization of linear layers, static quantization of conv layers
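
The first mixed variant (dynamic quantization of linear layers, conv layers left unquantized) could look roughly like the sketch below. The toy model is hypothetical and stands in for whatever network is actually used; the call itself is PyTorch's eager-mode `quantize_dynamic` with its default dynamic settings, not the customized observers from the previous section.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Hypothetical toy model: one conv layer followed by one linear layer.
model = nn.Sequential(
    nn.Conv2d(1, 2, kernel_size=3),  # 1x8x8 input -> 2x6x6 feature map
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2 * 6 * 6, 4),
).eval()

# Only nn.Linear is listed, so the conv layer stays in floating point.
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = qmodel(torch.randn(1, 1, 8, 8))

print(type(qmodel[0]).__name__, type(qmodel[3]).__module__, tuple(out.shape))
```

No calibration data is needed here, because activation ranges for the linear layer are computed at run time. The second variant (static quantization of the conv layers) additionally requires observers and a calibration pass and is not shown.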

QAT

  • Static Quantization
    • Different Observers
    • Different Quantization Ranges
  • Mixed Dynamic Quantization
    • Dynamic Quantization of linear layers, No quantization of conv layers
    • Dynamic Quantization of linear layers, static quantization of conv layers
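
The QAT settings listed under Quantization Customization include a stop criterion, but these notes do not say which one is used. As an illustration only, a patience-based early-stopping rule on the validation loss might be sketched as follows; the class name, `patience`, and `min_delta` are assumptions.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses from successive QAT epochs.
losses = [1.0, 0.8, 0.79, 0.81, 0.80, 0.82]
stopper = EarlyStopping(patience=3)
stop_epoch = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stop_epoch = epoch
        break
print(stop_epoch, stopper.best)
```

In a real QAT run, `step` would be called once per epoch after evaluating the fake-quantized model on a validation set.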