Pruning and knowledge distillation for large language models. Creates smaller, faster models from larger ones while preserving quality.
Based on NVIDIA's Minitron papers:
- Compact Language Models via Pruning and Knowledge Distillation
- LLM Pruning and Distillation in Practice: The Minitron Approach
PARE takes a teacher model (Olmo 3 7B Instruct) and produces compressed student models through:
- Importance analysis - Compute per-neuron, per-head, and per-layer importance scores
- Pruning - Remove low-importance neurons, attention heads, and layers
- Distillation - Train the pruned model to match teacher behavior using KL divergence
Requirements:

- Python 3.12
- CUDA-capable GPU with Flash Attention support
Install dependencies:
```shell
uv sync
```

Run the importance analysis:

```shell
uv run importance_analysis.py
```

Analyzes the teacher model using 1024 calibration samples. Outputs to `importance_scores_tensors/`:
| Key | Shape | Description |
|---|---|---|
| `mlp` | `[n_layers, intermediate_size]` | Per-neuron importance (L2 norm of activations) |
| `attention` | `[n_layers, n_heads]` | Per-head importance |
| `attn_ln` | `[hidden_size]` | Aggregated attention layer norm importance |
| `ffn_ln` | `[hidden_size]` | Aggregated FFN layer norm importance |
| `layer` | `[n_layers]` | Cosine-similarity depth scores for layer pruning |
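The `mlp` scores in the table are described as the L2 norm of activations per neuron. A minimal sketch of that computation is below; `mlp_neuron_importance` is a hypothetical helper (the real `importance_analysis.py` may aggregate activations differently, e.g. per batch or with normalization):

```python
import torch

def mlp_neuron_importance(activations: torch.Tensor) -> torch.Tensor:
    """Score each FFN neuron by the L2 norm of its activations.

    activations: [batch, seq_len, intermediate_size] hidden states captured
    at the MLP intermediate layer over the calibration set.
    Returns an [intermediate_size] importance vector.
    """
    # Flatten batch and sequence dims, then take the L2 norm per neuron column.
    flat = activations.reshape(-1, activations.shape[-1]).float()
    return flat.norm(p=2, dim=0)

# Toy example: neuron 1 never activates, so it scores lowest.
acts = torch.tensor([[[1.0, 0.0, 3.0],
                      [2.0, 0.0, 1.0]]])  # shape [1, 2, 3]
scores = mlp_neuron_importance(acts)
```

Neurons with the smallest scores are the candidates removed during width pruning.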
Apply pruning:

```shell
uv run prune.py
```

Applies width and depth pruning based on the importance scores. Configure target dimensions in the script:
- `mlp_width`: FFN intermediate size
- `num_heads`: Number of attention heads
- `num_layers`: Number of transformer layers
Saves the pruned model to `pruned_models/`.
Build the distillation dataset, extract teacher logprobs, and finalize:

```shell
uv run build_distill_dataset.py
uv run generate_logprobs_hf.py --batch-size 1 --start-idx 0
uv run finalize_distill_dataset.py
```

Run distillation:

```shell
uv run distill_off_policy.py
```

Runs off-policy distillation with KL divergence loss. Checkpoints automatically to the HuggingFace Hub and logs to Weights & Biases. Resumes from the latest checkpoint if interrupted.
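The core of the distillation step is a KL divergence between the student's output distribution and the teacher's precomputed log-probabilities. A minimal sketch, assuming forward KL(teacher ‖ student); the README does not specify the KL direction or temperature, and `distill_kl_loss` is a hypothetical helper:

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits: torch.Tensor,
                    teacher_logprobs: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged per batch.

    student_logits:   [batch, seq, vocab] raw student outputs
    teacher_logprobs: [batch, seq, vocab] precomputed teacher log-probabilities
    """
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div expects the input as log-probs; log_target=True means the
    # target is also given in log space, matching stored teacher logprobs.
    return F.kl_div(student_logprobs, teacher_logprobs,
                    log_target=True, reduction="batchmean")

# Toy usage: random student vs. random teacher, plus the self-match case.
logits = torch.randn(2, 4, 8)
teacher_lp = F.log_softmax(torch.randn(2, 4, 8), dim=-1)
loss = distill_kl_loss(logits, teacher_lp)
zero_loss = distill_kl_loss(logits, F.log_softmax(logits, dim=-1))
```

When the student exactly matches the teacher distribution the loss is zero, which is why precomputing teacher logprobs (as `generate_logprobs_hf.py` does) is enough to drive off-policy training without keeping the teacher in memory.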
```
pare/
├── importance_analysis.py       # Importance score computation
├── prune.py                     # Model pruning
├── build_distill_dataset.py     # Dataset construction
├── finalize_distill_dataset.py  # Dataset finalization
├── generate_logprobs_hf.py      # Teacher logprob extraction
├── distill_off_policy.py        # Distillation training
├── pruned_models/               # Pruned model outputs
├── importance_scores_tensors/   # Cached importance scores
└── cache/                       # Pre-packed training data
```
Key choices from the Minitron papers:
- Width pruning is preferred over depth pruning for models under 15B parameters
- Single-shot importance estimation (iterative re-estimation provides no benefit)
- KL divergence loss against the teacher instead of conventional ground-truth training
- Full attention layers are protected when pruning sliding-window attention models
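The cosine-similarity depth scores used for layer pruning (the `layer` entry in the importance table) can be sketched as follows. `layer_depth_score` is a hypothetical helper, and the exact formulation (which hidden states are compared, how scores are aggregated across the calibration set) is an assumption:

```python
import torch

def layer_depth_score(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Score a layer by how much it changes its hidden states.

    A layer whose output is nearly identical to its input (cosine
    similarity ~1) contributes little and is a depth-pruning candidate,
    so the score is 1 minus the mean per-token cosine similarity.
    Both tensors: [batch, seq_len, hidden_size].
    """
    cos = torch.nn.functional.cosine_similarity(
        hidden_in.flatten(0, -2), hidden_out.flatten(0, -2), dim=-1)
    return (1.0 - cos.mean()).item()

# An identity layer scores ~0 (safe to remove); a sign-flipping layer scores ~2.
x = torch.randn(2, 4, 8)
identity_score = layer_depth_score(x, x)
flip_score = layer_depth_score(x, -x)
```

Under this convention, the layers with the lowest scores are removed first during depth pruning.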