Summary
Support loading models optimized with QAT, QAD, and pruning techniques.
Reference: NVIDIA AI Model Optimization
Background
PyGPUkit is inference-focused. QAT, QAD, and pruning require training infrastructure, which is out of scope. However, we should support loading checkpoints produced with these techniques.
Techniques
Quantization-Aware Training (QAT)
- Models fine-tuned to handle quantization noise
- Typically FP8/INT8 with learned scales
- Better accuracy than PTQ alone
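On the inference side, loading such a checkpoint mostly means reapplying the learned scales that were stored next to the quantized weights. A minimal sketch of the idea, assuming a per-output-channel INT8 layout; the tensor names and shapes are illustrative, not PyGPUkit's actual loader API:

```python
import numpy as np

def dequantize_int8_weight(w_int8: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an FP32 weight from an INT8 QAT checkpoint tensor.

    `scale` is the learned per-output-channel scale stored alongside
    the quantized weight (assumed shape: [out_features, 1]).
    """
    return w_int8.astype(np.float32) * scale

# Hypothetical checkpoint layout: each quantized linear layer stores the
# INT8 weight plus its learned scale under a "<name>.weight_scale" key.
w_q = np.random.randint(-128, 128, size=(4096, 4096), dtype=np.int8)
scale = np.full((4096, 1), 0.01, dtype=np.float32)
w_fp32 = dequantize_int8_weight(w_q, scale)
```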
Quantization-Aware Distillation (QAD)
- QAT + knowledge distillation from teacher
- Highest accuracy recovery for aggressive quantization
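The training itself stays out of scope for PyGPUkit, but for context: the distillation part typically adds a temperature-softened KL term between teacher and student logits on top of the QAT objective. A rough, framework-agnostic sketch (plain NumPy, all names illustrative):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is conventional for knowledge distillation."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = np.sum(t * (np.log(t + 1e-9) - np.log(s + 1e-9)), axis=-1)
    return float((temperature ** 2) * kl.mean())
```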
Pruning + Distillation
- Structurally smaller models (fewer layers/heads/neurons)
- Permanent parameter reduction
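Because the reduction is structural, the shipped tensors are simply smaller dense arrays, so the loader mainly needs to respect the reduced shapes declared in the checkpoint's config. A rough illustration of why a pruned linear layer ends up physically smaller (importance metric and ratio are illustrative):

```python
import numpy as np

def prune_output_channels(weight: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Drop the lowest-importance output channels (rows) by L2 norm.

    The result is a permanently smaller dense tensor, which is what a
    pruned + distilled checkpoint ships with."""
    importance = np.linalg.norm(weight, axis=1)
    n_keep = max(1, int(weight.shape[0] * keep_ratio))
    keep_idx = np.sort(np.argsort(importance)[-n_keep:])
    return weight[keep_idx]

w = np.random.randn(4096, 4096).astype(np.float32)
w_pruned = prune_output_channels(w, keep_ratio=0.75)  # shape (3072, 4096)
```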
TODO
Model Loaders
- Load QAT/QAD checkpoints (quantized weights plus learned scales) and structurally pruned checkpoints with reduced shapes; see the loader sketch below
Sparse Support (for structured pruning)
- e.g. 2:4 structured sparsity on Ampere+
Documentation
- Document which optimized checkpoint formats can be loaded
External Tools (for training)
- Defer QAT/QAD/pruning training itself to external tooling (see the NVIDIA AI Model Optimization reference above)
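A very rough sketch of the loader side: inspect the checkpoint for optimization metadata and pick a load path accordingly. The field names checked here ("quantization_config", "pruned") are assumptions for illustration; the real layout depends on the exporting tool:

```python
import json
from pathlib import Path

def detect_optimization(checkpoint_dir: str) -> str:
    """Guess how a checkpoint was optimized from its config.json.

    The metadata keys used here are hypothetical; each exporter
    records quantization/pruning information in its own way.
    """
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    if "quantization_config" in config:
        return "quantized"   # QAT/QAD checkpoint: quantized weights + learned scales
    if config.get("pruned", False):
        return "pruned"      # structurally smaller dense model
    return "dense"           # ordinary FP16/BF16 checkpoint
```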
Benefits
- Access to highest-quality quantized models
- Support for production-optimized checkpoints
- Future 2:4 sparsity support for Ampere+
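For reference, 2:4 structured sparsity means at most two non-zero values in every group of four consecutive weights, which is the pattern Ampere+ sparse tensor cores accelerate. A quick validity check one might run on a pruned checkpoint (illustrative only, not a committed API):

```python
import numpy as np

def is_2_to_4_sparse(weight: np.ndarray) -> bool:
    """Check whether every group of 4 consecutive values along the last
    axis has at most 2 non-zeros (the 2:4 pattern Ampere+ accelerates)."""
    assert weight.shape[-1] % 4 == 0, "last dim must be a multiple of 4"
    groups = weight.reshape(*weight.shape[:-1], -1, 4)
    return bool((np.count_nonzero(groups, axis=-1) <= 2).all())
```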