# Dynamic Quantization Notes

## Core Concept

FP32 → INT8 conversion where:
- **Weights**: quantized offline (pre-computed)
- **Activations**: quantized at runtime (computed during inference)

No calibration data needed - quick to deploy but has runtime overhead.

## How It Works

**Offline (model conversion time):**
1. Compute scale & zero-point for each weight tensor
2. Quantize weights FP32 → INT8
3. Store quantized weights in model

**Runtime (inference):**
1. Compute activation scale/zero-point dynamically per batch
2. Quantize activations FP32 → INT8
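3. Run INT8 ops (accumulating in INT32)
4. Rescale the integer result back to FP32

Steps 1-2 add a slight per-inference overhead compared with static quantization.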
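
The two phases above can be sketched in NumPy for a single linear layer. This is an illustration only, not ONNX Runtime's kernel: the helper names (`quantize_weights`, `quantize_activations`, `dynamic_linear`), the symmetric int8 weights, and the asymmetric uint8 activations are assumptions chosen for clarity.

```python
import numpy as np

def quantize_weights(w):
    """Offline step: symmetric per-tensor int8 quantization of weights."""
    scale = float(np.abs(w).max()) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_activations(x):
    """Runtime step: asymmetric uint8 quantization from the observed range."""
    # Widen the range to include 0 so FP32 zero is exactly representable.
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dynamic_linear(x, w_q, w_scale):
    """Compute x @ W with integer arithmetic, then rescale to FP32."""
    x_q, x_scale, x_zp = quantize_activations(x)
    # Accumulate in int32 after removing the activation zero-point.
    acc = (x_q.astype(np.int32) - x_zp) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * np.float32(x_scale * w_scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
x = rng.standard_normal((4, 64)).astype(np.float32)

w_q, w_scale = quantize_weights(w)        # done once, offline
y_int8 = dynamic_linear(x, w_q, w_scale)  # done on every inference call
y_fp32 = x @ w
rel_err = float(np.abs(y_int8 - y_fp32).max() / np.abs(y_fp32).max())
print(f"max relative error: {rel_err:.4f}")
```

Note that `quantize_activations` runs a min/max pass over the input on every call; that pass is the runtime overhead dynamic quantization pays.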

## Math

```
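val_fp32 = scale × (val_int8 - zero_point)

# symmetric quantization (zero_point = 0 for signed types)
scale = max(|range_max|, |range_min|) × 2 / (quant_range_max - quant_range_min)

# asymmetric quantization
scale = (range_max - range_min) / (quant_range_max - quant_range_min)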
```

The zero-point is the quantized value that FP32 zero maps to. FP32 zero must be exactly representable in the quantized space (critical for zero-padding in CNNs).
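
To make the formulas concrete, the sketch below computes scale and zero-point for a made-up range of [-1.0, 3.0] and checks that FP32 zero round-trips exactly. It uses the asymmetric form, `scale = (range_max - range_min) / (quant_range_max - quant_range_min)`, alongside the symmetric one. The function names and the example range are illustrative assumptions.

```python
def symmetric_scale(range_min, range_max, qmin=-128, qmax=127):
    """Symmetric scale; the zero-point is 0 for signed types."""
    return max(abs(range_max), abs(range_min)) * 2 / (qmax - qmin)

def asymmetric_params(range_min, range_max, qmin=0, qmax=255):
    """Scale and zero-point mapping [range_min, range_max] onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    range_min, range_max = min(range_min, 0.0), max(range_max, 0.0)
    scale = (range_max - range_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - range_min / scale))
    return scale, zero_point

def quantize(val, scale, zero_point, qmin, qmax):
    """Round to the nearest quantized step and clamp to the integer range."""
    return max(qmin, min(qmax, int(round(val / scale)) + zero_point))

def dequantize(q, scale, zero_point):
    """val_fp32 = scale * (val_int8 - zero_point)"""
    return scale * (q - zero_point)

scale, zp = asymmetric_params(-1.0, 3.0)
q_zero = quantize(0.0, scale, zp, 0, 255)
print(zp)                             # 64
print(dequantize(q_zero, scale, zp))  # 0.0
```

Because FP32 zero lands exactly on the integer `zp`, padded zeros contribute no quantization error.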

## Calibration Data

**Don't need it.** This is the key advantage of dynamic quantization.

With static quantization, you need to run the model on representative data to figure out the activation ranges. Dynamic skips this - it computes activation ranges on-the-fly during inference instead.

Trade-off: no calibration = faster setup, but runtime overhead from computing ranges dynamically.
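
A small NumPy experiment illustrates the trade-off: a range fixed ahead of time (as static calibration would produce) clips inputs that fall outside it, while a range computed per input does not. The helper `fake_quantize`, the synthetic data, and the fixed range of [-1, 1] are assumptions made for this sketch.

```python
import numpy as np

def fake_quantize(x, lo, hi, levels=256):
    """Quantize x to `levels` steps over [lo, hi], then dequantize back."""
    scale = (hi - lo) / (levels - 1)
    q = np.clip(np.round((x - lo) / scale), 0, levels - 1)
    return q * scale + lo

rng = np.random.default_rng(0)
x = rng.uniform(-4.0, 4.0, size=10_000)  # wider than the fixed range below

# Static-style: range fixed from (hypothetical) calibration data in [-1, 1].
err_fixed = float(np.abs(fake_quantize(x, -1.0, 1.0) - x).mean())
# Dynamic-style: range taken from the tensor actually being processed.
err_dynamic = float(np.abs(fake_quantize(x, x.min(), x.max()) - x).mean())

print(f"fixed range error:   {err_fixed:.4f}")
print(f"dynamic range error: {err_dynamic:.4f}")
```

If the calibration data does cover the real input distribution, the fixed range loses little and avoids the per-inference min/max pass.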

## When to Use

✅ Good for:
- CPU inference
- No calibration data available
- Quick deployment/prototyping
- Activations with varying distributions

❌ Not ideal for:
- GPU/TPU (use static instead)
- Production where max performance needed
- When there is good calibration data

## Example Implementation

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quant_dynamic.onnx",
    weight_type=QuantType.QInt8  # or QUInt8
)
```


## Requirements

- Model must be **opset 10+** (recommend opset 13+)
- Two quantized formats exist:
  - **QOperator**: QLinearConv, MatMulInteger, etc.
  - **QDQ**: Quantize/DeQuantize ops
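
One way to check the opset requirement before quantizing is to read the model's `opset_import` field with the `onnx` package. The helper names `default_opset` and `check_opset` are made up for this sketch, and `onnx` is imported lazily only so the pure helper can be used without it.

```python
MIN_OPSET = 10  # minimum opset for quantization, as stated above

def default_opset(opset_pairs):
    """Return the version of the default ONNX domain from (domain, version) pairs."""
    for domain, version in opset_pairs:
        if domain in ("", "ai.onnx"):  # both spellings denote the default domain
            return version
    raise ValueError("model does not import the default ONNX domain")

def check_opset(path):
    """Load an ONNX file and raise if its opset is too old to quantize."""
    import onnx  # requires `pip install onnx`

    model = onnx.load(path)
    pairs = [(entry.domain, entry.version) for entry in model.opset_import]
    version = default_opset(pairs)
    if version < MIN_OPSET:
        raise RuntimeError(f"opset {version} < {MIN_OPSET}: re-export the model")
    return version

# Usage (assumes the file exists): check_opset("model.onnx")
```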

## Performance Notes

**Data formats (CPU only supports U8X8):**
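- U8U8 (activation: uint8, weight: uint8) - kernel processes 6 rows at a time
- U8S8 (activation: uint8, weight: int8) - uses VPMADDUBSW on x86-64, kernel processes 4 rows at a time

Try U8U8 first (`weight_type=QuantType.QUInt8`), then U8S8 (`weight_type=QuantType.QInt8`).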

**Hardware:**
- Best: AVX2, AVX-512, ARM64 CPUs
- GPU: not ideal, use static quantization instead
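
One reason the U8U8/U8S8 choice can affect accuracy is a property of `VPMADDUBSW`: it multiplies unsigned bytes by signed bytes and adds adjacent pairs into a signed 16-bit result with saturation, so large products can be clamped. The sketch below emulates one 16-bit lane in plain Python. The function name `maddubs_pair` is made up for illustration, and whether saturation matters for a given model has to be checked empirically.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def maddubs_pair(a0, b0, a1, b1):
    """Emulate one 16-bit lane: a* are uint8 values, b* are int8 values."""
    exact = a0 * b0 + a1 * b1
    saturated = max(INT16_MIN, min(INT16_MAX, exact))
    return exact, saturated

# Full 8-bit weights: the pair sum overflows int16 and is clamped.
print(maddubs_pair(255, 127, 255, 127))  # (64770, 32767)
# 7-bit weights (max 63): the same activations now fit.
print(maddubs_pair(255, 63, 255, 63))    # (32130, 32130)
```

The second case corresponds to the `reduce_range=True` option of `quantize_dynamic`, which quantizes weights to 7 bits and is worth trying if U8S8 shows an accuracy drop.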

## Benchmarking

```python
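import time

import numpy as np
import onnxruntime as ort

def benchmark(model_path, input_shape, num_runs=100):
    """Return mean latency in seconds per inference on random FP32 input."""
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    x = np.random.rand(*input_shape).astype(np.float32)

    session.run(None, {input_name: x})  # warm-up, excluded from timing

    start = time.perf_counter()  # monotonic, higher resolution than time.time()
    for _ in range(num_runs):
        session.run(None, {input_name: x})

    return (time.perf_counter() - start) / num_runs

# Compare (the input shape must match the model; 1x3x224x224 is an example)
shape = (1, 3, 224, 224)
print(f"Original:  {benchmark('model.onnx', shape):.4f}s")
print(f"Quantized: {benchmark('model_quant.onnx', shape):.4f}s")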
```

## vs Static Quantization

| | Dynamic | Static |
|---|---|---|
| Calibration | No | Yes |
| Activation scale/zero-point | Computed at runtime | Precomputed offline |
| Speed | Good | Better |
| Accuracy | Good (ranges fit each input) | Good (depends on calibration data) |
| Setup | Easy | Complex |
| Best for | CPU, prototyping | GPU/TPU, production |

## Considerations

- CPU-optimized; don't expect GPU gains
- Always validate accuracy on a test set
- Check that the model file size dropped (~4x for weight-dominated models)
- Opset must be 10+
- If you need maximum performance, use static quantization instead
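
The accuracy and size checks above can be scripted. This sketch assumes you already have output arrays from the original and quantized models on the same test inputs (for example, from two `InferenceSession.run` calls). The helper names are illustrative, and any pass/fail thresholds are left to you.

```python
import os

import numpy as np

def size_ratio(original_path, quantized_path):
    """How many times smaller the quantized file is than the original."""
    return os.path.getsize(original_path) / os.path.getsize(quantized_path)

def top1_agreement(ref_logits, quant_logits):
    """Fraction of samples where both models pick the same class."""
    ref = np.argmax(ref_logits, axis=-1)
    quant = np.argmax(quant_logits, axis=-1)
    return float(np.mean(ref == quant))

def max_abs_diff(ref, quant):
    """Largest element-wise deviation between the two outputs."""
    return float(np.max(np.abs(np.asarray(ref) - np.asarray(quant))))

# Usage (assumes both files exist):
# print(size_ratio("model.onnx", "model_quant.onnx"))
```

Top-1 agreement is only a proxy; where labels are available, compare task accuracy on the test set directly.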

## Quick Reference

```bash
pip install onnx onnxruntime  # onnxruntime.quantization also needs the onnx package
```

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic("model.onnx", "model_quant.onnx", weight_type=QuantType.QInt8)
```

Done. Good for prototyping, move to static for production.