feat(model): QAT/QAD/Pruned Model Loading Support #115

@m96-chan

Summary

Support loading models optimized with QAT, QAD, and pruning techniques.

Reference: NVIDIA AI Model Optimization

Background

PyGPUkit is inference-focused; QAT, QAD, and pruning require training infrastructure, which is out of scope. However, we should be able to load models that were trained with these techniques.

Techniques

Quantization-Aware Training (QAT)

  • Models fine-tuned to handle quantization noise
  • Typically FP8/INT8 with learned scales
  • Better accuracy than PTQ alone
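Loading a QAT checkpoint mostly means dequantizing stored low-precision weights with their learned scales. A minimal sketch (assuming INT8 weights with per-output-channel scales; the function name and layout are illustrative, not an existing PyGPUkit API):

```python
import numpy as np

def dequantize_int8(q_weight: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover FP32 weights from an INT8 tensor and learned per-channel scales.

    q_weight: int8 array of shape (out_features, in_features)
    scale:    float32 array of shape (out_features, 1), learned during QAT
    """
    return q_weight.astype(np.float32) * scale

# Toy example: a 2x3 weight quantized with per-row scales
q = np.array([[127, -64, 0], [10, 20, -30]], dtype=np.int8)
s = np.array([[0.01], [0.5]], dtype=np.float32)
w = dequantize_int8(q, s)  # w[0, 0] is ~1.27
```

In practice the kernel would keep the weights in INT8 and fuse the scale into the GEMM epilogue rather than materializing FP32, but the arithmetic is the same.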

Quantization-Aware Distillation (QAD)

  • QAT + knowledge distillation from teacher
  • Highest accuracy recovery for aggressive quantization

Pruning + Distillation

  • Structurally smaller models (fewer layers/heads/neurons)
  • Permanent parameter reduction

TODO

Model Loaders

  • TensorRT-LLM QAT checkpoint format
  • NVIDIA TensorRT Model Optimizer output format
  • Pruned model config detection (reduced layers/heads)
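Pruned-model detection can likely be done by comparing the checkpoint's config against the reference architecture's expected dimensions. A sketch under assumed names (the reference table and config keys here are illustrative; real keys would follow the model's `config.json` schema):

```python
# Assumed baseline dimensions per architecture; values are illustrative.
REFERENCE_CONFIGS = {
    "llama-7b": {"num_hidden_layers": 32, "num_attention_heads": 32},
}

def detect_pruning(model_config: dict, base_arch: str) -> dict:
    """Return the dimensions that are smaller than the reference
    architecture, indicating structural pruning."""
    ref = REFERENCE_CONFIGS[base_arch]
    return {
        key: {"expected": ref[key], "actual": model_config[key]}
        for key in ref
        if model_config.get(key, ref[key]) < ref[key]
    }

# A depth-pruned checkpoint: 24 layers instead of 32
pruned = {"num_hidden_layers": 24, "num_attention_heads": 32}
diff = detect_pruning(pruned, "llama-7b")
# diff == {'num_hidden_layers': {'expected': 32, 'actual': 24}}
```

Since pruned models carry their reduced dimensions in the config, the loader may not need special handling beyond trusting the config rather than hard-coding architecture sizes; the detection above is mainly useful for logging and validation.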

Sparse Support (for structured pruning)

  • Sparse tensor representation
  • Sparse GEMM kernels (2:4 sparsity)
  • Sparse attention patterns
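The 2:4 pattern means every contiguous group of four values along the reduction axis has at most two nonzeros, which is what Ampere+ sparse tensor cores accelerate. A small NumPy sketch of validating and producing that pattern (function names are illustrative; a real implementation would also emit the 2-bit metadata indices the hardware consumes):

```python
import numpy as np

def is_2_4_sparse(weight: np.ndarray) -> bool:
    """Check that every contiguous group of 4 values along the last
    axis has at most 2 nonzeros (the 2:4 structured-sparsity pattern)."""
    groups = weight.reshape(-1, 4)
    return bool(((groups != 0).sum(axis=1) <= 2).all())

def prune_to_2_4(weight: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in each group of 4
    (magnitude pruning; QAT/QAD checkpoints arrive already pruned)."""
    groups = weight.reshape(-1, 4).copy()
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
    np.put_along_axis(groups, idx, 0.0, axis=1)
    return groups.reshape(weight.shape)
```

For checkpoints produced by TensorRT Model Optimizer the weights should already satisfy the pattern, so the loader only needs the validation path plus a compaction step into the hardware's compressed layout.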

Documentation

  • Recommended QAT workflow (external tools)
  • Recommended pruning workflow
  • Model format conversion guide

External Tools (for training)

Benefits

  • Access to highest-quality quantized models
  • Support for production-optimized checkpoints
  • Future 2:4 sparsity support for Ampere+
