Add Qwen3.5-2B text only olive-recipe with INT4 weights and shared INT8 embedding #422

Open
apsonawane wants to merge 9 commits into main from asonawane/int

@apsonawane apsonawane commented May 14, 2026

Add Qwen3.5-2B olive-recipe with INT4 weights and shared INT8 embedding

Summary

Adds Qwen3.5-2B (text-only) olive-recipes for CUDA, CPU, and WebGPU backends with INT4 weight quantization, INT8 embedding quantization, and shared embedding/lm_head weights.

Model

  • Qwen3.5-2B: Hybrid architecture with GatedDeltaNet linear attention (18 layers) + standard GQA attention (6 layers), 248K vocab, 2048 hidden size, tie_word_embeddings=True
  • Quantization pipeline: ModelBuilder (INT4 via Neural Compressor) → QuantizeEmbeddingInt8 → ShareEmbeddingLmHead
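
The embedding part of this pipeline can be illustrated with a small pure-Python sketch. It is not the code of the Olive graph surgeries: it assumes symmetric per-row INT8 quantization with one float32 scale per row, uses toy dimensions, and all function names are hypothetical. It shows a single quantized table serving both the token lookup and the lm_head projection, which is what tie_word_embeddings=True makes possible. The storage estimate at the end uses 248,000 x 2048 as an approximation of the "248K vocab, 2048 hidden size" shape quoted above.

```python
import random

def quantize_row_int8(row):
    """Symmetric INT8 quantization of one row: row[i] is approximated by q[i] * scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale

def embed(q_table, scales, token_id):
    """Embedding lookup: dequantize one row of the shared table."""
    return [v * scales[token_id] for v in q_table[token_id]]

def lm_head_logits(q_table, scales, hidden):
    """Output projection reusing the same table: logit[v] = scale[v] * dot(hidden, q[v])."""
    return [s * sum(h * v for h, v in zip(hidden, row))
            for row, s in zip(q_table, scales)]

random.seed(0)
VOCAB, HIDDEN = 50, 16  # toy sizes, for illustration only
weights = [[random.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(VOCAB)]
quantized = [quantize_row_int8(row) for row in weights]
q_table = [q for q, _ in quantized]
scales = [s for _, s in quantized]

token_id = 7
row = embed(q_table, scales, token_id)
max_err = max(abs(a - b) for a, b in zip(row, weights[token_id]))
logits = lm_head_logits(q_table, scales, [1.0] * HIDDEN)

# Rough storage estimate with the approximate shape from the PR description.
APPROX_VOCAB, APPROX_HIDDEN = 248_000, 2048
fp16_bytes = APPROX_VOCAB * APPROX_HIDDEN * 2
# One INT8 byte per weight plus one float32 scale per row (assumed layout).
int8_bytes = APPROX_VOCAB * APPROX_HIDDEN + APPROX_VOCAB * 4

print(f"max abs error for token {token_id}: {max_err:.5f} "
      f"(bound {scales[token_id] / 2:.5f})")
print(f"one FP16 table: {fp16_bytes / 1e9:.2f} GB, "
      f"one shared INT8 table: {int8_bytes / 1e9:.2f} GB")
```

With round-to-nearest, the per-element error stays within half a quantization step (scale / 2), which the first printed line lets you check against the measured error.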

Configs

| Backend | Configs |
|---|---|
| CUDA | Qwen-Qwen3.5-2B_cuda_int4.json, Qwen-Qwen3.5-2B_cuda_int4_with_eval.json |
| CPU | Qwen-Qwen3.5-2B_cpu_int4.json, Qwen-Qwen3.5-2B_cpu_int4_with_eval.json |
| WebGPU | Qwen-Qwen3.5-2B_webgpu_int4.json, Qwen-Qwen3.5-2B_webgpu_int4_with_eval.json |

Results

| Metric | Value |
|---|---|
| Model size | 1.4 GB (down from 4.3 GB FP16) |
| MMLU accuracy | 57.11% (vs 59.27% FP16 baseline, -2.16 points) |
| Decode throughput | 212 tok/s (CUDA, GenAI) |
| Prefill throughput | 3,007 tok/s (CUDA, GenAI) |
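
As a quick check on these figures, the sketch below derives the compression ratio and separates the absolute accuracy drop (in percentage points) from the relative drop. It uses only the numbers reported in the table; the variable names are illustrative.

```python
# Figures taken from the Results table.
fp16_size_gb, int4_size_gb = 4.3, 1.4
fp16_mmlu, int4_mmlu = 59.27, 57.11

compression = fp16_size_gb / int4_size_gb        # how many times smaller
size_saving = 1 - int4_size_gb / fp16_size_gb    # fraction of bytes removed
abs_drop = fp16_mmlu - int4_mmlu                 # percentage points
rel_drop = abs_drop / fp16_mmlu                  # fraction of the baseline score

print(f"compression: {compression:.2f}x ({size_saving:.1%} smaller)")
print(f"MMLU drop: {abs_drop:.2f} points absolute, {rel_drop:.2%} relative")
```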

Performance vs llama.cpp (Q4_K_M GGUF)

Benchmarked on an NVIDIA GPU with CUDA, using a 64-token prompt and 50 decode tokens:

| Metric | ORT GenAI (1.32 GB) | llama.cpp (1.19 GB) | Ratio |
|---|---|---|---|
| Decode (tok/s) | 212 | 166 | 1.27x faster |
| Prefill (tok/s) | 3,007 | 6,228 | 0.48x |
| Model size | 1.32 GB | 1.19 GB | 1.11x |

========================================================================================================
TRANSLATION TABLE (single request)
========================================================================================================
Metric                                 ORT 2B CUDA        ORT 2B CPU     llama.cpp GPU     llama.cpp CPU
--------------------------------------------------------------------------------------------------------
EP runtime size                             638 MB            638 MB             37 MB             37 MB
Model size                                 1.32 GB           1.32 GB           1.19 GB           1.19 GB
Memory increase after inference             961 MB           2.13 GB           1.52 GB           1.33 GB
GPU memory increase after inference        3.28 GB              0 MB           1.72 GB            888 MB
Translation time                            469 ms          39938 ms            472 ms           1726 ms
CPU usage during inference                   93.8%            833.7%             96.3%           1092.7%
GPU usage during inference                   18.4%              0.0%             79.0%              0.0%
--------------------------------------------------------------------------------------------------------

ORT GenAI is 27% faster on decode (212 vs 166 tok/s), the primary bottleneck for generation workloads. llama.cpp prefill is about 2x faster (6,228 vs 3,007 tok/s) due to Flash Attention optimizations. Model sizes are comparable (the ORT GenAI model is ~11% larger).
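
To see how the two throughput figures combine for the benchmarked shape (64 prompt tokens, 50 decode tokens), the sketch below estimates per-request latency from the CUDA numbers in the comparison table. It is a simplified model, not a measurement: it assumes prefill and decode times simply add and ignores tokenization, sampling, and launch overhead. The function name is hypothetical.

```python
def request_latency(prefill_tps, decode_tps, prompt_tokens=64, decode_tokens=50):
    """Estimate one request's latency from phase throughputs (tokens per second)."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = decode_tokens / decode_tps
    total_s = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": total_s,
        "decode_share": decode_s / total_s,
    }

# Throughputs from the ORT GenAI vs llama.cpp comparison table.
ort = request_latency(prefill_tps=3007, decode_tps=212)
llama = request_latency(prefill_tps=6228, decode_tps=166)

for name, r in (("ORT GenAI", ort), ("llama.cpp", llama)):
    print(f"{name}: total {r['total_s'] * 1000:.0f} ms, "
          f"decode share {r['decode_share']:.1%}")
```

Under these assumptions decode accounts for more than 90% of the estimated time in both runtimes at this prompt length; longer prompts would shift the balance toward prefill.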

Dependencies

  • Requires Olive PR #2464 for QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries
  • Requires onnxruntime-genai with Qwen3.5 text-only builder support

Files

  • Qwen-Qwen3.5-2B/cuda/ — CUDA configs
  • Qwen-Qwen3.5-2B/cpu/ — CPU configs
  • Qwen-Qwen3.5-2B/webgpu/ — WebGPU configs
  • Qwen-Qwen3.5-2B/baseline/ — FP16 baseline eval results
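
The directory layout and the config names listed under Configs follow a regular pattern, sketched below. The helper is hypothetical and assumes each config file sits directly inside its backend directory, which this description implies but does not state.

```python
from pathlib import PurePosixPath

MODEL_DIR = "Qwen-Qwen3.5-2B"
BACKENDS = ("cuda", "cpu", "webgpu")

def recipe_path(backend, with_eval=False):
    """Build the expected repo-relative path of a recipe config (assumed layout)."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    suffix = "_with_eval" if with_eval else ""
    return PurePosixPath(MODEL_DIR) / backend / f"{MODEL_DIR}_{backend}_int4{suffix}.json"

for backend in BACKENDS:
    print(recipe_path(backend))
    print(recipe_path(backend, with_eval=True))
```

A path built this way would then be handed to Olive; the per-backend README is the place to check for the exact command.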

Copilot AI review requested due to automatic review settings May 14, 2026 00:24

Copilot AI left a comment

Pull request overview

Adds Qwen-Qwen3.5-2B text-only Olive recipes for CUDA, CPU, and WebGPU backends, plus baseline MMLU evaluation configuration and per-backend setup docs.

Changes:

  • Adds INT4 ModelBuilder recipes with embedding quantization and shared embedding/lm_head graph surgeries.
  • Adds matching eval variants using LMEvaluator for MMLU.
  • Adds backend README files, metadata, and requirements files.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| Qwen-Qwen3.5-2B/baseline/Qwen-Qwen3.5-2B_baseline_mmlu.json | Adds FP16 baseline MMLU evaluation config. |
| Qwen-Qwen3.5-2B/baseline/requirements.txt | Adds baseline evaluation dependencies. |
| Qwen-Qwen3.5-2B/cpu/* | Adds CPU INT4 recipe, eval recipe, metadata, docs, and requirements. |
| Qwen-Qwen3.5-2B/cuda/* | Adds CUDA INT4 recipe, eval recipe, metadata, docs, and requirements. |
| Qwen-Qwen3.5-2B/webgpu/* | Adds WebGPU INT4 recipe, eval recipe, metadata, docs, and requirements. |

Comment thread Qwen-Qwen3.5-2B/cuda/README.md Outdated
Comment thread Qwen-Qwen3.5-2B/cpu/README.md Outdated
Comment thread Qwen-Qwen3.5-2B/webgpu/README.md Outdated
@apsonawane apsonawane requested a review from xiaoyu-work May 20, 2026 22:41