feat(model): QAT/QAD/Pruned Model Loading Support #115

@m96-chan

Summary

Support loading models optimized with QAT, QAD, and pruning techniques.

Reference: NVIDIA AI Model Optimization

Background

PyGPUkit is inference-focused; QAT, QAD, and pruning require training infrastructure, which is out of scope. However, we should be able to load models that were trained with these techniques.

Techniques

Quantization-Aware Training (QAT)

  • Models fine-tuned to handle quantization noise
  • Typically FP8/INT8 with learned scales
  • Better accuracy than PTQ alone
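Loading a QAT checkpoint mostly means dequantizing stored low-precision weights with their learned scales. A minimal sketch (assuming INT8 weights with per-output-channel scales; the function name and layout are illustrative, not an existing PyGPUkit API):

```python
import numpy as np

def dequantize_int8(q_weight: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover FP32 weights from an INT8 tensor and learned per-channel scales.

    q_weight: int8 array of shape (out_features, in_features)
    scale:    float32 array of shape (out_features, 1), learned during QAT
    """
    return q_weight.astype(np.float32) * scale

# Toy example: a 2x3 weight quantized with per-row scales
q = np.array([[127, -64, 0], [10, 20, -30]], dtype=np.int8)
s = np.array([[0.01], [0.5]], dtype=np.float32)
w = dequantize_int8(q, s)  # w[0, 0] is ~1.27
```

In practice the kernel would keep the weights in INT8 and fuse the scale into the GEMM epilogue rather than materializing FP32, but the arithmetic is the same.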

Quantization-Aware Distillation (QAD)

  • QAT + knowledge distillation from teacher
  • Highest accuracy recovery for aggressive quantization

Pruning + Distillation

  • Structurally smaller models (fewer layers/heads/neurons)
  • Permanent parameter reduction

TODO

Model Loaders

  • TensorRT-LLM QAT checkpoint format
  • NVIDIA TensorRT Model Optimizer output format
  • Pruned model config detection (reduced layers/heads)
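Pruned-model detection can likely be done by comparing the checkpoint's config against the reference architecture's expected dimensions. A sketch under assumed names (the reference table and config keys here are illustrative; real keys would follow the model's `config.json` schema):

```python
# Assumed baseline dimensions per architecture; values are illustrative.
REFERENCE_CONFIGS = {
    "llama-7b": {"num_hidden_layers": 32, "num_attention_heads": 32},
}

def detect_pruning(model_config: dict, base_arch: str) -> dict:
    """Return the dimensions that are smaller than the reference
    architecture, indicating structural pruning."""
    ref = REFERENCE_CONFIGS[base_arch]
    return {
        key: {"expected": ref[key], "actual": model_config[key]}
        for key in ref
        if model_config.get(key, ref[key]) < ref[key]
    }

# A depth-pruned checkpoint: 24 layers instead of 32
pruned = {"num_hidden_layers": 24, "num_attention_heads": 32}
diff = detect_pruning(pruned, "llama-7b")
# diff == {'num_hidden_layers': {'expected': 32, 'actual': 24}}
```

Since pruned models carry their reduced dimensions in the config, the loader may not need special handling beyond trusting the config rather than hard-coding architecture sizes; the detection above is mainly useful for logging and validation.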

Sparse Support (for structured pruning)

  • Sparse tensor representation
  • Sparse GEMM kernels (2:4 sparsity)
  • Sparse attention patterns
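The 2:4 pattern means every contiguous group of four values along the reduction axis has at most two nonzeros, which is what Ampere+ sparse tensor cores accelerate. A small NumPy sketch of validating and producing that pattern (function names are illustrative; a real implementation would also emit the 2-bit metadata indices the hardware consumes):

```python
import numpy as np

def is_2_4_sparse(weight: np.ndarray) -> bool:
    """Check that every contiguous group of 4 values along the last
    axis has at most 2 nonzeros (the 2:4 structured-sparsity pattern)."""
    groups = weight.reshape(-1, 4)
    return bool(((groups != 0).sum(axis=1) <= 2).all())

def prune_to_2_4(weight: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in each group of 4
    (magnitude pruning; QAT/QAD checkpoints arrive already pruned)."""
    groups = weight.reshape(-1, 4).copy()
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
    np.put_along_axis(groups, idx, 0.0, axis=1)
    return groups.reshape(weight.shape)
```

For checkpoints produced by TensorRT Model Optimizer the weights should already satisfy the pattern, so the loader only needs the validation path plus a compaction step into the hardware's compressed layout.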

Documentation

  • Recommended QAT workflow (external tools)
  • Recommended pruning workflow
  • Model format conversion guide

External Tools (for training)

Benefits

  • Access to highest-quality quantized models
  • Support for production-optimized checkpoints
  • Future 2:4 sparsity support for Ampere+
