QTinker - Distill & Quantize with TorchAO

A modern, production-ready application for distilling and quantizing language models using TorchAO, with intelligent GPU/CPU management and Pinokio launcher support. Built by Manat, this project is still in its early stages, so contributions are welcome.

Features

  • 🎯 Flexible Model Loading: Support for HuggingFace models, Stable Diffusion models, and PyTorch checkpoints
  • 🖼️ Stable Diffusion & Diffusers Support: Full support for all SD models (1.5, 2.x, SDXL) and other diffusers models
    • Complete pipeline loading (model_index.json)
    • Individual component loading (UNet, VAE, Text Encoder)
    • Automatic model architecture detection
    • Raw state_dict support with intelligent wrapping
  • 🤗 Comprehensive BERT Support: 15+ BERT model variants (Large, Base, Small, Multilingual) - NO HuggingFace token required
    • Automatic model downloading from Google Cloud Storage
    • BERT-Large variants for teacher models (uncased, cased, whole-word masking)
    • BERT-Small variants for student models (small, mini, tiny, medium)
    • Multilingual support (104 languages, Chinese)
    • DistilBERT variants documentation
    • Automatic MODEL_REGISTRY.md generation
  • 🧪 Advanced Distillation Strategies: Multiple knowledge distillation methods including:
    • Logit-based Knowledge Distillation (KD)
    • Patient Knowledge Distillation (matching specific layers)
    • Feature-based Distillation (intermediate layer matching)
    • AttentionDistillationKD: Matches attention maps between teacher and student models
    • Custom projection layers for dimension matching
    • Configurable temperature parameters
  • TorchAO Quantization: Professional-grade quantization with multiple options:
    • INT4 Weight-Only (group_size configurable)
    • INT8 Dynamic Quantization
    • SmoothQuant (INT4): Optimized for models with activation outliers
    • NormalFloat-4 (NF4): Support for QLoRA-style NormalFloat data types
    • GPTQ (4-bit): Post-training quantization with group-wise weight adjustments
    • FP8 and other advanced formats
    • Dynamic parameter UI (Alpha for SmoothQuant, Group Size for GPTQ)
    • Model-specific quantization configurations
  • 📂 Enhanced File Browser: Native file dialog integration for easy model selection with smart default paths
  • 🎨 Gradio Web UI: Beautiful, responsive web interface with real-time log streaming and custom dark theme
  • 📦 Modular Architecture: Clean separation of concerns with pluggable components
  • 🖥️ Smart GPU/CPU Management: Automatic device selection and switching with CUDA support for inference engines
  • 💾 Memory Efficient Processing: Intelligent VRAM monitoring and fallback strategies
  • 🚀 Pinokio Launcher Integration: One-click installation, start, update, and reset with robust cross-platform support
  • 🔧 Model Selection UI: Interactive file pickers for teacher, student, and target models
  • 📊 Registry System: Comprehensive model registry for tracking supported architectures
  • 🔗 Symbolic Linking: Automatic model linking for seamless integration

Project Structure

QTinker/
├── app/
│   ├── app.py              # Main entry point (Pinokio compatible)
│   ├── gradio_ui.py        # Full Gradio web interface
│   ├── distillation.py     # Advanced distillation strategies (KD, Patient-KD, etc.)
│   ├── distill_quant_app.py # Legacy desktop UI (Tkinter)
│   ├── model_loader.py     # Unified model loading utilities
│   ├── download_models.py  # Model download and management
│   ├── registry.py         # Model architecture registry
│   ├── run_distillation.py # Distillation pipeline executor
│   ├── bert_models/        # BERT model implementations
│   ├── distilled/          # Output: Distilled models
│   └── quantized/          # Output: Quantized models
├── config/
│   ├── paths.yaml          # Path configuration
│   ├── quant_presets.yaml  # Quantization presets
│   └── settings.yaml       # Application settings
├── configs/
│   └── torchao_configs.py  # TorchAO quantization configurations
├── core/
│   ├── device_manager.py   # GPU/CPU management
│   ├── distillation.py     # Core distillation logic
│   ├── local_llm.py        # Local LLM utilities
│   └── logic.py            # Main pipeline logic
├── settings/
│   └── app_settings.py     # Global application settings
├── data/
│   └── train_prompts.txt   # Training prompts for distillation
├── outputs/                # Output directories
│   ├── distilled/          # Distilled model artifacts
│   └── quantized/          # Quantized model artifacts
├── install.js             # Pinokio installation script
├── start.js               # Pinokio launcher script
├── update.js              # Pinokio update script
├── reset.js               # Pinokio reset script
├── pinokio.js             # Pinokio UI definition
├── select_teacher_model.js # Teacher model selector
├── select_student_model.js # Student model selector
├── select_quantize_model.js # Quantization target selector
├── distill_quantize.js    # Combined distill & quantize trigger
├── link.js                # Model symbolic linking
├── requirements.txt       # Python dependencies
├── pyproject.toml         # Project configuration
├── pinokio_meta.json      # Metadata and state persistence
└── README.md              # This file

Installation

Automatic Installation (Recommended via Pinokio)

Simply open the project in Pinokio and click the "Install" button. The launcher will automatically:

  1. Create a Python virtual environment
  2. Install all dependencies using uv pip
  3. Set up PyTorch with CUDA support (if available)
  4. Configure the application

Manual Installation with pip

pip install -r requirements.txt

Manual Installation with uv (recommended for Pinokio)

uv pip install -r requirements.txt

CUDA Support

If you need specific CUDA-enabled PyTorch wheels:

# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
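
To verify that the CUDA-enabled build was actually installed, a quick check from Python (standard torch calls only) is:

import torch

print(torch.__version__)                  # e.g. 2.x.y+cu121
print(torch.cuda.is_available())          # True if a CUDA GPU is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU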

Supported Model Types

QTinker supports a wide variety of model formats and automatically detects the model type:

🤗 HuggingFace Models

  • Transformers: BERT, GPT-2, LLaMA, Mistral, Phi, any AutoModel
  • Sentence Transformers: Embedding models
  • Vision: ViT, CLIP, DINOv2
  • Audio: Whisper, Wav2Vec2
  • Any HuggingFace model with config.json

How to use: Provide path to HuggingFace model folder containing config.json

🖼️ Stable Diffusion & Diffusers

  • Stable Diffusion 1.5: All community checkpoints
  • Stable Diffusion 2.x: v2.0, v2.1, variations
  • SDXL: Stable Diffusion XL and variants
  • Other Diffusers: ControlNet, LoRA-compatible models, custom pipelines
  • Components: Individual UNet, VAE, or Text Encoder models

How to use:

  • Full pipeline (recommended): Provide folder with model_index.json
  • Components: Provide folder with unet/, vae/, or text_encoder/ subfolders
  • Raw state_dict: Provide .bin file - the app will auto-detect and wrap it

📦 PyTorch Checkpoints

  • Raw .pt or .bin files containing model weights (state_dict)
  • Custom model architectures
  • Fine-tuned model checkpoints

How to use: Provide path to .pt or .bin file - the app will intelligently wrap it

🔄 Auto-Detection

The app automatically detects the model type by examining:

  1. Directory structure and config files
  2. JSON metadata files
  3. File extensions and content
  4. State dict keys (for raw weights)

No need to manually select the model type - just provide the path!
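
For illustration, a simplified version of that detection logic could look like the sketch below (detect_model_type is a hypothetical helper written for this README, not necessarily what model_loader.py implements):

import os

def detect_model_type(path: str) -> str:
    """Guess the model format from a path, mirroring the checks listed above."""
    if os.path.isdir(path):
        if os.path.exists(os.path.join(path, "model_index.json")):
            return "Diffusers"                 # full Stable Diffusion pipeline
        if any(os.path.isdir(os.path.join(path, d)) for d in ("unet", "vae", "text_encoder")):
            return "Diffusers"                 # individual pipeline components
        if os.path.exists(os.path.join(path, "config.json")):
            return "HuggingFace folder"        # transformers-style model
    if path.endswith((".pt", ".bin")):
        return "PyTorch .pt/.bin file"         # raw checkpoint / state_dict
    return "HuggingFace folder"                # assume a Hub model id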

Usage

Using Pinokio Launcher (Easiest)

  1. Open QTinker in Pinokio
  2. Click "Install" (first time only) to set up dependencies
  3. Click "Start" to launch the Gradio web interface
  4. The interface will open automatically in your browser
  5. Use the web UI to select models and run distillation/quantization
  6. Use the model selector tools in the sidebar to configure your models:
    • Select Teacher Model: Choose the teacher model for knowledge distillation
    • Select Student Model: Choose the student model to be distilled
    • Select Quantize Model: Choose the model to quantize
    • Distill & Quantize: Run the complete pipeline

Running the Application Directly

python app/app.py

Or directly access the Gradio UI:

cd app
python gradio_ui.py

The Gradio interface will be available at http://localhost:7860
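
If you need a different host or port, a minimal Gradio launch looks roughly like this (a generic sketch, not necessarily how gradio_ui.py is written):

import gradio as gr

def echo(text):
    return text

demo = gr.Interface(fn=echo, inputs="text", outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7860)  # 7860 is Gradio's default port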

Using the Web Interface

  1. Model Selection: Use the sidebar buttons to select teacher, student, and quantization target models
  2. Model Path: Enter the path to your model (HuggingFace folder, Stable Diffusion folder, or PyTorch checkpoint)
  3. Model Type: The type is auto-detected, but you can override it from the dropdown:
    • HuggingFace folder
    • Diffusers (Stable Diffusion and other diffusers models)
    • PyTorch .pt/.bin file
  4. Quantization Type: Choose your quantization method:
    • INT4 (weight-only) - More aggressive compression
    • INT8 (dynamic) - Better accuracy with moderate compression
  5. Distillation Strategy (if applicable):
    • Logit KD - Match output logits
    • Patient KD - Match intermediate layers
  6. Run: Click "Run Distill + Quantize" to start the pipeline
  7. Monitor: Watch real-time log output for progress and debugging

Programmatic Usage

from core.logic import run_pipeline

# Run the complete pipeline
distilled_path, quantized_path = run_pipeline(
    model_path="microsoft/phi-2",
    model_type="HuggingFace folder",
    quant_type="INT8 (dynamic)",
    log_fn=print
)

For advanced usage with custom distillation strategies:

from app.distillation import LogitKD, PatientKD
from app.model_loader import load_model
from core.device_manager import DeviceManager

# Create device manager
device_manager = DeviceManager()
device = device_manager.get_device()

# Load teacher and student models
teacher_model = load_model("teacher_path")
student_model = load_model("student_path")

# Apply distillation strategy
strategy = LogitKD(teacher_model, student_model, temperature=3.0)
loss = strategy.compute_loss(student_outputs, teacher_outputs)

Configuration

TorchAO Quantization Configs

Edit configs/torchao_configs.py to customize quantization settings:

from torchao.quantization.configs import Int4WeightOnlyConfig, Int8DynamicConfig

# INT4 Configuration
Int4WeightOnlyConfig(
    group_size=128,          # Default: 128 (lower = more granular, slower)
    inner_k_tiles=8,         # Tiling for optimization
    padding_allowed=True     # Allow padding for performance
)

# INT8 Configuration  
Int8DynamicConfig(
    act_range_method="minmax"  # Range calculation method
)

Application Settings

Edit settings/app_settings.py to customize:

  • Output directories
  • Default model/quantization types
  • GPU/CPU management thresholds
  • Device switching behavior
  • Memory limits

Example:

# Device Management
MIN_VRAM_GB = 2.0              # Minimum VRAM to use GPU
VRAM_THRESHOLD = 0.9           # Use CPU if model > 90% of VRAM
AUTO_DEVICE_SWITCHING = True   # Enable automatic switching

# Output Directories
DISTILLED_OUTPUT_DIR = "outputs/distilled/"
QUANTIZED_OUTPUT_DIR = "outputs/quantized/"

# Model Defaults
DEFAULT_MODEL_TYPE = "HuggingFace folder"
DEFAULT_QUANT_TYPE = "INT8 (dynamic)"

Model Registry

The registry.py file maintains a comprehensive registry of supported model architectures with their optimal configurations:

SUPPORTED_MODELS = {
    "phi-2": {
        "type": "causal-lm",
        "default_quant": "INT4",
        "supports_distillation": True
    },
    "bert-base-uncased": {
        "type": "masked-lm",
        "default_quant": "INT8",
        "supports_distillation": True
    },
    # ... more models
}

Advanced Features

Knowledge Distillation Strategies

QTinker supports multiple knowledge distillation methods:

1. Logit-based Knowledge Distillation (LogitKD)

Matches the output logits between teacher and student models using KL divergence with temperature scaling.

from app.distillation import LogitKD

strategy = LogitKD(teacher_model, student_model, temperature=3.0)
loss = strategy.compute_loss(student_outputs, teacher_outputs)

Best for: General-purpose distillation, good baseline for most architectures
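
Conceptually, the loss is a temperature-scaled KL divergence between the softened teacher and student distributions. A minimal sketch of the idea (the actual LogitKD class may differ in details):

import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, temperature=3.0):
    """KL divergence between softened teacher and student output distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2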

2. Patient Knowledge Distillation (PatientKD)

Matches hidden states at specific layers between teacher and student models. Useful when student architecture differs significantly from teacher.

import torch.nn.functional as F

from app.distillation import PatientKD

strategy = PatientKD(
    teacher_model, 
    student_model,
    student_layers=[2, 4, 6],      # Layers to extract from student
    teacher_layers=[4, 8, 12],     # Corresponding teacher layers
    loss_fn=F.mse_loss
)
loss = strategy.compute_loss(student_outputs, teacher_outputs)

Best for: Custom architectures, fine-grained control, layer-specific matching
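
Under the hood, this style of distillation compares hidden states at the selected layer pairs, e.g. obtained with output_hidden_states=True. A rough sketch of the idea (the actual PatientKD class may differ):

import torch.nn.functional as F

def patient_kd_loss(student_hidden, teacher_hidden, student_layers, teacher_layers):
    """Average MSE between matched student/teacher hidden-state layers."""
    losses = [
        F.mse_loss(student_hidden[s], teacher_hidden[t])
        for s, t in zip(student_layers, teacher_layers)
    ]
    return sum(losses) / len(losses)

If the hidden sizes differ, a projection layer (see the next section) can map the student states into the teacher's dimension before the loss is computed.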

3. Projection Layer Matching

Automatically handles dimension mismatches between teacher and student hidden states:

from app.distillation import ProjectionLayer

projection = ProjectionLayer(student_dim=768, teacher_dim=1024)
projected_student = projection(student_hidden_states)

Best for: Distilling to significantly smaller models
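
In its simplest form, such a projection is just a learned linear map from the student's hidden size to the teacher's. A conceptual sketch (not necessarily the exact ProjectionLayer implementation):

import torch.nn as nn

class LinearProjection(nn.Module):
    """Map student hidden states (student_dim) into the teacher's space (teacher_dim)."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden_states):
        return self.proj(student_hidden_states)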

Device Management System

The intelligent device manager ensures optimal GPU/CPU utilization:

  • Automatic GPU Detection: Detects CUDA (NVIDIA), MPS (Apple Silicon), or CPU
  • VRAM Monitoring: Real-time GPU memory tracking
  • Automatic Fallback: Seamlessly switches to CPU when:
    • Less than 2GB VRAM available
    • Model size exceeds 90% of available VRAM
    • GPU runs out of memory during processing
  • Memory Efficiency: Models loaded on CPU first, then moved to GPU if appropriate
  • Cache Management: Automatic GPU cache clearing between operations

from core.device_manager import DeviceManager

device_manager = DeviceManager()
device = device_manager.get_device()
print(f"Using device: {device}")  # Outputs: cuda, mps, or cpu

Model Loading Utilities

Unified model loading with automatic format detection:

from app.model_loader import load_model

# Supports multiple formats
model = load_model("facebook/opt-350m")  # HuggingFace
model = load_model("./local_model.pt")   # Local PyTorch
model = load_model("./model/")           # Local folder

Model Registry System

Track and manage supported model architectures:

from app.registry import ModelRegistry

registry = ModelRegistry()
supported_models = registry.get_supported_models()
config = registry.get_model_config("phi-2")

Dependencies

  • torch>=2.0.0 - PyTorch deep learning framework
  • torchao>=0.1.0 - TorchAO quantization library
  • transformers>=4.30.0 - HuggingFace transformers for model loading
  • gradio>=4.0.0 - Web UI framework for interactive interface
  • accelerate>=0.20.0 - Model acceleration utilities
  • pyyaml - Configuration file handling
  • numpy - Numerical computing

Full dependency list available in requirements.txt

GPU/CPU Management

The application automatically manages device selection:

  • GPU Detection: Automatically detects and uses CUDA (NVIDIA) or MPS (Apple Silicon) when available
  • VRAM Monitoring: Monitors GPU memory usage and switches to CPU when VRAM is limited
  • Automatic Fallback: Falls back to CPU if:
    • Less than 2GB VRAM is available
    • Model size exceeds 90% of available VRAM
    • GPU runs out of memory during processing
  • Memory Efficient: Loads models on CPU first, then moves to GPU if appropriate
  • Cache Management: Automatically clears GPU cache between operations

Device Settings

You can adjust device management behavior in settings/app_settings.py:

MIN_VRAM_GB = 2.0  # Minimum VRAM required to use GPU
VRAM_THRESHOLD = 0.9  # Use CPU if model size > VRAM * threshold
AUTO_DEVICE_SWITCHING = True  # Enable automatic device switching
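
As an illustration, these thresholds could be applied with a check along the following lines (a rough sketch using torch.cuda.mem_get_info; the actual DeviceManager logic may differ):

import torch

def pick_device(model_size_gb: float, min_vram_gb: float = 2.0, vram_threshold: float = 0.9) -> str:
    """Return "cuda" only if enough free VRAM is available, otherwise "cpu"."""
    if not torch.cuda.is_available():
        return "cpu"
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024 ** 3
    if free_gb < min_vram_gb or model_size_gb > free_gb * vram_threshold:
        return "cpu"
    return "cuda"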

Output Directories

All output models are saved in standard HuggingFace format for easy reuse:

  • Distilled Models: outputs/distilled/ - Models after knowledge distillation
  • Quantized Models: outputs/quantized/ - Models after quantization

Each output includes:

  • Model weights and architecture
  • Tokenizers (when available)
  • Configuration files
  • Quantization metadata

GPU/CPU Management

Automatic Device Selection

The application automatically manages device selection based on available hardware:

GPU Detection:

  • NVIDIA CUDA GPUs
  • Apple Silicon (MPS)
  • CPU fallback

Memory Management:

  • Monitors GPU VRAM in real-time
  • Prevents out-of-memory errors
  • Switches to CPU when necessary

Threshold Settings (configurable in settings/app_settings.py):

  • MIN_VRAM_GB: Minimum VRAM required (default: 2.0)
  • VRAM_THRESHOLD: Use CPU if model > X% of VRAM (default: 0.9 = 90%)
  • AUTO_DEVICE_SWITCHING: Enable/disable automatic switching (default: True)

Device Switching Behavior

The system falls back to CPU if:

  • Less than 2GB VRAM available
  • Estimated model size exceeds 90% of available VRAM
  • GPU runs out of memory during processing

Manual Device Configuration

from core.device_manager import DeviceManager

device_manager = DeviceManager(
    min_vram_gb=2.0,
    vram_threshold=0.9,
    auto_switching=True
)

device = device_manager.get_device()

Troubleshooting

Stable Diffusion Model Loading Errors

Problem: "Loaded object is not a torch.nn.Module" when loading Stable Diffusion models Solution: The app now automatically detects and handles Stable Diffusion models in multiple formats:

  1. Full Pipeline (with model_index.json): Automatically loaded as StableDiffusionPipeline
  2. Component-based (separate UNet/VAE/TextEncoder folders): Loaded as individual components
  3. Raw state_dict files (.bin, .pt): Intelligently wrapped based on component type
  4. Different SD versions: Supports SD 1.5, 2.x, SDXL, and other diffusers models

The loader will:

  • Detect model architecture from directory structure
  • Load UNet, VAE, Text Encoder as appropriate components
  • Handle raw state_dict files by analyzing keys and wrapping them
  • Fall back to alternative loading methods if primary method fails

Example: If loading a UNet component at path/to/unet/pytorch_model.bin (a loading sketch follows the list below):

  • The app detects it's a UNet component
  • Wraps it appropriately for distillation/quantization
  • Moves it to the correct device (GPU/CPU)
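
For reference, loading that same component directly with diffusers looks roughly like this (assuming diffusers is installed; path/to/model is the pipeline root that contains the unet/ subfolder):

import torch
from diffusers import UNet2DConditionModel

# Load only the UNet sub-module of a Stable Diffusion pipeline
unet = UNet2DConditionModel.from_pretrained("path/to/model", subfolder="unet")
unet.to("cuda" if torch.cuda.is_available() else "cpu")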

Out of Memory (OOM) Errors

Problem: Model loading fails with CUDA out of memory

Solution:

  • The app will automatically switch to CPU
  • Or reduce model size by using smaller teacher/student models
  • Or lower the VRAM threshold in settings so the app falls back to CPU earlier

Model Loading Issues

Problem: Model fails to load from HuggingFace or from file

Solution:

  1. Ensure you have internet connection (for HuggingFace models)
  2. Verify model name/path is correct
  3. Check HuggingFace authentication if using private models
  4. For Stable Diffusion models, ensure the folder structure is intact:
    • Either model_index.json exists (full pipeline)
    • Or subfolders like unet/, vae/, text_encoder/ exist (components)
  5. Try loading from a local model path instead of HuggingFace Hub

Distillation Not Starting

Problem: Distillation script fails to execute

Solution:

  1. Ensure both teacher and student models are loaded
  2. Check that training data is available in data/train_prompts.txt
  3. Verify CUDA/device availability in logs
  4. For Stable Diffusion models, note that distillation adapts to component type
  5. Check logs folder for detailed error messages

Performance Issues

Problem: Quantization/distillation is slow

Solution:

  • Reduce batch size in configuration
  • Use INT4 quantization for faster processing
  • Ensure GPU is available and not occupied by other processes
  • Use smaller models for testing
  • For Stable Diffusion UNet models, quantization may take longer due to component size

Output Models

Models are saved in the following structure:

outputs/
├── distilled/
│   └── model_name/
│       ├── config.json
│       ├── pytorch_model.bin
│       └── tokenizer.*
└── quantized/
    └── model_name_quantized/
        ├── config.json
        ├── pytorch_model.bin
        └── tokenizer.*

All saved models are compatible with HuggingFace transformers and can be loaded with:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("outputs/distilled/model_name")
tokenizer = AutoTokenizer.from_pretrained("outputs/distilled/model_name")

Pinokio Launcher Scripts

QTinker is fully integrated with Pinokio for easy one-click operations:

Available Commands

  • Install: Automatically sets up Python environment and installs dependencies
  • Start: Launches the Gradio web interface
  • Update: Updates QTinker and dependencies to the latest version
  • Reset: Clears virtual environment and cached files for a fresh start

Model Selection Tools

Sidebar buttons for easy model management:

  • Select Teacher Model: Pick teacher model for knowledge distillation
  • Select Student Model: Pick student model to be distilled
  • Select Quantize Model: Pick model to quantize
  • Link Models: Create symbolic links for model references

Metadata Persistence

Model selections and settings are saved in pinokio_meta.json:

{
  "teacher_model": "/path/to/teacher",
  "student_model": "/path/to/student",
  "quantize_model": "/path/to/model"
}

API Documentation

Python API

Run Complete Pipeline

from core.logic import run_pipeline

result = run_pipeline(
    model_path="microsoft/phi-2",
    model_type="HuggingFace folder",
    quant_type="INT8 (dynamic)",
    distill_type="LogitKD",
    temperature=3.0,
    log_fn=print
)

Load Model

from app.model_loader import load_model

model, tokenizer = load_model(
    model_path="facebook/opt-350m",
    model_type="HuggingFace folder",
    device="cuda"
)

Quantize Model

from torchao.quantization import quantize_
from torchao.quantization.configs import Int8DynamicConfig

quantize_(model, Int8DynamicConfig())
model.save_pretrained("outputs/quantized/model_name")

Apply Distillation

import torch

from app.distillation import LogitKD

strategy = LogitKD(teacher_model, student_model, temperature=3.0)
optimizer = torch.optim.AdamW(student_model.parameters(), lr=5e-5)

for batch in dataloader:
    optimizer.zero_grad()
    with torch.no_grad():                       # the teacher stays frozen
        teacher_outputs = teacher_model(**batch)
    loss = strategy.compute_loss(student_model(**batch), teacher_outputs)
    loss.backward()
    optimizer.step()
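
The dataloader above must yield tokenized batches. A minimal way to build one from data/train_prompts.txt (a rough sketch; run_distillation.py may prepare data differently, and "teacher_path" is a placeholder):

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teacher_path")  # reuse the teacher's tokenizer

with open("data/train_prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

def collate(batch):
    return tokenizer(batch, padding=True, truncation=True, return_tensors="pt")

dataloader = DataLoader(prompts, batch_size=8, shuffle=True, collate_fn=collate)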

REST API (via Gradio)

When running the app, a Gradio interface provides HTTP endpoints:

# Gradio automatically generates REST endpoints
# Example: POST to Gradio endpoint with model parameters
curl -X POST "http://localhost:7860/api/predict" \
  -H "Content-Type: application/json" \
  -d '{"model_path": "phi-2", "quant_type": "INT8"}'

CLI Usage

# Start the application
python app/app.py

# Direct Gradio launch
python app/gradio_ui.py

# Run distillation only
python app/run_distillation.py --teacher-path /path/to/teacher \
                                --student-path /path/to/student

# Download models
python app/download_models.py --model-name "phi-2" --output-dir "./models"
