Add Qwen3.5-2B text only olive-recipe with INT4 weights and shared INT8 embedding by apsonawane · Pull Request #422 · microsoft/olive-recipes

apsonawane · 2026-05-14T00:24:41Z

Add Qwen3.5-2B olive-recipe with INT4 weights and shared INT8 embedding

Summary

Adds Qwen3.5-2B (text-only) olive-recipes for CUDA, CPU, and WebGPU backends with INT4 weight quantization, INT8 embedding quantization, and shared embedding/lm_head weights.

Model

Qwen3.5-2B: Hybrid architecture with GatedDeltaNet linear attention (18 layers) + standard GQA attention (6 layers), 248K vocab, 2048 hidden size, `tie_word_embeddings=True`
Quantization pipeline: `ModelBuilder` (INT4 via Neural Compressor) → `QuantizeEmbeddingInt8` → `ShareEmbeddingLmHead`
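
To illustrate what the last two pipeline stages aim for, here is a minimal pure-Python sketch of symmetric per-row INT8 quantization of an embedding table, with the same quantized rows reused for the tied lm_head. It is a conceptual sketch only: the function names, the per-row symmetric scheme, and the toy sizes are assumptions for illustration, not the actual `QuantizeEmbeddingInt8` or `ShareEmbeddingLmHead` implementations.

```python
import random

def quantize_rows_int8(weight):
    """Symmetric per-row INT8: each row becomes ints in [-127, 127] plus one float scale."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def embed(q_rows, scales, token_id):
    """Embedding lookup: dequantize only the row that is needed."""
    return [q * scales[token_id] for q in q_rows[token_id]]

def lm_head(q_rows, scales, hidden):
    """Tied lm_head: logits[v] = scale[v] * dot(hidden, q_rows[v]), reusing the same table."""
    return [scales[v] * sum(h * q for h, q in zip(hidden, row))
            for v, row in enumerate(q_rows)]

random.seed(0)
vocab, hidden_size = 50, 8  # toy sizes; the real model is roughly 248K x 2048
W = [[random.gauss(0.0, 1.0) for _ in range(hidden_size)] for _ in range(vocab)]

q_rows, scales = quantize_rows_int8(W)  # one quantized table ...
token_id = 7
x = embed(q_rows, scales, token_id)     # ... used for the input embedding
logits = lm_head(q_rows, scales, x)     # ... and again for the output projection

# Rounding error per element is at most half a quantization step for that row.
max_err = max(abs(a - b) for a, b in zip(x, W[token_id]))
```

Because `tie_word_embeddings=True`, a single quantized table can serve both ends of the model, which is why sharing it avoids storing the vocabulary-sized matrix twice.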

Configs

Backend	Configs
CUDA	`Qwen-Qwen3.5-2B_cuda_int4.json`, `Qwen-Qwen3.5-2B_cuda_int4_with_eval.json`
CPU	`Qwen-Qwen3.5-2B_cpu_int4.json`, `Qwen-Qwen3.5-2B_cpu_int4_with_eval.json`
WebGPU	`Qwen-Qwen3.5-2B_webgpu_int4.json`, `Qwen-Qwen3.5-2B_webgpu_int4_with_eval.json`
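
The six configs follow one naming pattern, so the command for each can be generated. The sketch below assumes the config files sit directly in the per-backend directories listed under Files and that they are run with Olive's `olive run --config` CLI; the helper names are hypothetical, and nothing is executed.

```python
from pathlib import PurePosixPath

RECIPE_ROOT = PurePosixPath("Qwen-Qwen3.5-2B")
BACKENDS = ("cuda", "cpu", "webgpu")

def config_path(backend: str, with_eval: bool = False) -> PurePosixPath:
    """Build the config path from the naming pattern in the table above."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    suffix = "_with_eval" if with_eval else ""
    return RECIPE_ROOT / backend / f"Qwen-Qwen3.5-2B_{backend}_int4{suffix}.json"

def olive_command(backend: str, with_eval: bool = False) -> list[str]:
    """Return the argv list; pass it to subprocess.run(...) to actually launch Olive."""
    return ["olive", "run", "--config", str(config_path(backend, with_eval))]

# One command per (backend, eval variant) pair: six in total.
commands = [olive_command(b, e) for b in BACKENDS for e in (False, True)]
for cmd in commands:
    print(" ".join(cmd))
```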

Results

Metric	Value
Model size	1.4 GB (down from 4.3 GB FP16)
MMLU accuracy	57.11% (vs 59.27% FP16 baseline, -2.16 percentage points)
Decode throughput	212 tok/s (CUDA, GenAI)
Prefill throughput	3,007 tok/s (CUDA, GenAI)
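
As a quick check on the size and accuracy rows, the following sketch derives the compression ratio and the accuracy drop in both absolute and relative terms. It uses only the figures reported in the table; the variable names are illustrative.

```python
fp16_size_gb, int4_size_gb = 4.3, 1.4
fp16_mmlu, int4_mmlu = 59.27, 57.11

compression = fp16_size_gb / int4_size_gb      # how many times smaller
size_saving = 1 - int4_size_gb / fp16_size_gb  # fraction of bytes removed
abs_drop_pp = fp16_mmlu - int4_mmlu            # percentage points
rel_drop = abs_drop_pp / fp16_mmlu             # relative to the FP16 score

print(f"compression: {compression:.2f}x")      # compression: 3.07x
print(f"size saving: {size_saving:.1%}")       # size saving: 67.4%
print(f"MMLU drop:   {abs_drop_pp:.2f} pp")    # MMLU drop:   2.16 pp
print(f"relative:    {rel_drop:.1%}")          # relative:    3.6%
```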

Performance vs llama.cpp (Q4_K_M GGUF)

Benchmarked on NVIDIA GPU with CUDA, prompt length 64, 50 decode tokens:

Metric	ORT GenAI (1.32 GB)	llama.cpp (1.19 GB)	Ratio
Decode (tok/s)	212	166	1.27x faster
Prefill (tok/s)	3,007	6,228	0.48x
Model size	1.32 GB	1.19 GB	1.11x

========================================================================================================
TRANSLATION TABLE (single request)
========================================================================================================
Metric                                 ORT 2B CUDA        ORT 2B CPU     llama.cpp GPU     llama.cpp CPU
--------------------------------------------------------------------------------------------------------
EP runtime size                             638 MB            638 MB             37 MB             37 MB
Model size                                 1.32 GB           1.32 GB           1.19 GB           1.19 GB
Memory increase after inference             961 MB           2.13 GB           1.52 GB           1.33 GB
GPU memory increase after inference        3.28 GB              0 MB           1.72 GB            888 MB
Translation time                            469 ms          39938 ms            472 ms           1726 ms
CPU usage during inference                   93.8%            833.7%             96.3%           1092.7%
GPU usage during inference                   18.4%              0.0%             79.0%              0.0%
--------------------------------------------------------------------------------------------------------

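ORT GenAI decodes about 27% faster (212 vs 166 tok/s); decode is the primary bottleneck for generation workloads. llama.cpp prefills roughly 2x faster (6,228 vs 3,007 tok/s), attributed here to its Flash Attention optimizations. Model sizes are comparable (the ORT GenAI model is about 11% larger).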
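
To see why decode dominates for the benchmarked shape (64 prompt tokens, 50 generated tokens), a rough estimate can be built from the throughput numbers alone. The sketch assumes request time is simply prompt tokens divided by prefill throughput plus generated tokens divided by decode throughput, ignoring tokenization, sampling, and launch overheads, so it is an approximation rather than a measurement.

```python
PROMPT_TOKENS, DECODE_TOKENS = 64, 50

# (prefill tok/s, decode tok/s) from the comparison table
engines = {
    "ORT GenAI": (3007, 212),
    "llama.cpp": (6228, 166),
}

def estimate(prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (estimated total seconds, fraction of that time spent decoding)."""
    prefill_s = PROMPT_TOKENS / prefill_tps
    decode_s = DECODE_TOKENS / decode_tps
    total_s = prefill_s + decode_s
    return total_s, decode_s / total_s

results = {name: estimate(*tps) for name, tps in engines.items()}
for name, (total_s, decode_share) in results.items():
    print(f"{name}: ~{total_s * 1000:.0f} ms total, {decode_share:.0%} in decode")
```

Under this model both engines spend over 90% of the request in decode, so the decode advantage outweighs the prefill deficit at this prompt length. Longer prompts shift the balance toward prefill.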

Dependencies

Requires Olive PR #2464 for QuantizeEmbeddingInt8 and ShareEmbeddingLmHead graph surgeries
Requires onnxruntime-genai with Qwen3.5 text-only builder support

Files

Qwen-Qwen3.5-2B/cuda/ — CUDA configs
Qwen-Qwen3.5-2B/cpu/ — CPU configs
Qwen-Qwen3.5-2B/webgpu/ — WebGPU configs
Qwen-Qwen3.5-2B/baseline/ — FP16 baseline eval results

Copilot

Pull request overview

Adds Qwen-Qwen3.5-2B text-only Olive recipes for CUDA, CPU, and WebGPU backends, plus baseline MMLU evaluation configuration and per-backend setup docs.

Changes:

Adds INT4 ModelBuilder recipes with embedding quantization and shared embedding/lm_head graph surgeries.
Adds matching eval variants using LMEvaluator for MMLU.
Adds backend README files, metadata, and requirements files.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`Qwen-Qwen3.5-2B/baseline/Qwen-Qwen3.5-2B_baseline_mmlu.json`	Adds FP16 baseline MMLU evaluation config.
`Qwen-Qwen3.5-2B/baseline/requirements.txt`	Adds baseline evaluation dependencies.
`Qwen-Qwen3.5-2B/cpu/*`	Adds CPU INT4 recipe, eval recipe, metadata, docs, and requirements.
`Qwen-Qwen3.5-2B/cuda/*`	Adds CUDA INT4 recipe, eval recipe, metadata, docs, and requirements.
`Qwen-Qwen3.5-2B/webgpu/*`	Adds WebGPU INT4 recipe, eval recipe, metadata, docs, and requirements.

apsonawane added 2 commits May 14, 2026 00:22

Add Qwen3.5-2B text only olive-recipe with INT4 weights and shared INT8 embedding

c0b60d3

Merge branch 'main' into asonawane/int

5503317

Copilot AI review requested due to automatic review settings May 14, 2026 00:24

Copilot started reviewing on behalf of apsonawane May 14, 2026 00:25

Copilot AI reviewed May 14, 2026

Comment thread Qwen-Qwen3.5-2B/cuda/README.md Outdated

Comment thread Qwen-Qwen3.5-2B/cpu/README.md Outdated

Comment thread Qwen-Qwen3.5-2B/webgpu/README.md Outdated

apsonawane added 7 commits May 14, 2026 04:00

Fix Readme

4696d8d

Fix eval

6a9d975

Use external data

eb5fbc1

Update recipes

6d6a815

Update recipes for webgpu

48bd358

Merge branch 'main' into asonawane/int

854c18d

Update WebGPU recipes

95a6f92

apsonawane requested a review from xiaoyu-work May 20, 2026 22:41

apsonawane wants to merge 9 commits into main from asonawane/int


2 participants
