Add Qwen3.5-2B text only olive-recipe with INT4 weights and shared INT8 embedding #422
Open
apsonawane wants to merge 9 commits into
Pull request overview
Adds Qwen-Qwen3.5-2B text-only Olive recipes for CUDA, CPU, and WebGPU backends, plus baseline MMLU evaluation configuration and per-backend setup docs.
Changes:
- Adds INT4 ModelBuilder recipes with embedding quantization and shared embedding/lm_head graph surgeries.
- Adds matching eval variants using `LMEvaluator` for MMLU.
- Adds backend README files, metadata, and requirements files.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `Qwen-Qwen3.5-2B/baseline/Qwen-Qwen3.5-2B_baseline_mmlu.json` | Adds FP16 baseline MMLU evaluation config. |
| `Qwen-Qwen3.5-2B/baseline/requirements.txt` | Adds baseline evaluation dependencies. |
| `Qwen-Qwen3.5-2B/cpu/*` | Adds CPU INT4 recipe, eval recipe, metadata, docs, and requirements. |
| `Qwen-Qwen3.5-2B/cuda/*` | Adds CUDA INT4 recipe, eval recipe, metadata, docs, and requirements. |
| `Qwen-Qwen3.5-2B/webgpu/*` | Adds WebGPU INT4 recipe, eval recipe, metadata, docs, and requirements. |
Add Qwen3.5-2B olive-recipe with INT4 weights and shared INT8 embedding
Summary
Adds Qwen3.5-2B (text-only) olive-recipes for CUDA, CPU, and WebGPU backends with INT4 weight quantization, INT8 embedding quantization, and shared embedding/lm_head weights.
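To make the INT8 embedding and shared embedding/lm_head idea concrete, here is a minimal pure-Python sketch. It is not the code used by the Olive graph surgeries in this PR; it assumes symmetric per-row quantization, and the helper names (`quantize_row_int8`, `quantize_table`, `embed`, `lm_head_logits`) and the toy table are hypothetical. The point it illustrates is that with tied word embeddings, one quantized table can serve both the token lookup and the output projection.

```python
def quantize_row_int8(row):
    """Symmetric INT8 quantization of one embedding row: row ~= q * scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def quantize_table(table):
    """Quantize every row; returns (INT8 rows, one float scale per row)."""
    pairs = [quantize_row_int8(row) for row in table]
    return [q for q, _ in pairs], [s for _, s in pairs]


def embed(token_id, q_table, scales):
    """Embedding lookup: dequantize only the selected row."""
    return [q * scales[token_id] for q in q_table[token_id]]


def lm_head_logits(hidden, q_table, scales):
    """Output projection that reuses the same quantized table (tied weights)."""
    return [
        scales[v] * sum(h * q for h, q in zip(hidden, q_row))
        for v, q_row in enumerate(q_table)
    ]


# Toy 4-token vocabulary with hidden size 3 (illustrative values only).
table = [
    [0.5, -1.0, 0.25],
    [2.0, 0.0, -0.5],
    [-0.125, 0.75, 1.5],
    [0.0, 0.0, 0.0],
]
q_table, scales = quantize_table(table)

vec = embed(1, q_table, scales)                # input side of the model
logits = lm_head_logits(vec, q_table, scales)  # output side, same storage
```

Because both functions read `q_table` and `scales`, the table is stored once instead of twice, which is the size saving that sharing is meant to capture.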
Model
- `tie_word_embeddings=True`
- `ModelBuilder` (INT4 via Neural Compressor) → `QuantizeEmbeddingInt8` → `ShareEmbeddingLmHead`

Configs
- `Qwen-Qwen3.5-2B_cuda_int4.json`, `Qwen-Qwen3.5-2B_cuda_int4_with_eval.json`
- `Qwen-Qwen3.5-2B_cpu_int4.json`, `Qwen-Qwen3.5-2B_cpu_int4_with_eval.json`
- `Qwen-Qwen3.5-2B_webgpu_int4.json`, `Qwen-Qwen3.5-2B_webgpu_int4_with_eval.json`

Results
Performance vs llama.cpp (Q4_K_M GGUF)
Benchmarked on NVIDIA GPU with CUDA, prompt length 64, 50 decode tokens:
ORT GenAI is 27% faster on decode (the primary bottleneck for generation workloads). llama.cpp has faster prefill due to Flash Attention optimizations. Model sizes are comparable (~11% difference).
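For readers who want to reproduce this kind of comparison, the sketch below shows one way a "percent faster" figure can be derived from decode throughput. The helper names are hypothetical, and the timings are placeholders chosen for round arithmetic, not measurements from this PR. It assumes "faster" means a higher tokens-per-second ratio rather than a latency reduction.

```python
def tokens_per_second(num_tokens, elapsed_seconds):
    """Decode throughput for a run that generated num_tokens tokens."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def percent_faster(candidate_tps, reference_tps):
    """Relative throughput gain of candidate over reference, in percent."""
    return (candidate_tps / reference_tps - 1.0) * 100.0


# Placeholder timings for 50 decode tokens (NOT results from this PR).
reference_tps = tokens_per_second(50, 0.5)   # 100 tokens/s
candidate_tps = tokens_per_second(50, 0.4)   # 125 tokens/s
gain = percent_faster(candidate_tps, reference_tps)
```

Note that the same pair of runs gives a different number when expressed as latency reduction (0.4 s vs 0.5 s is 20% less time), so it helps to state which definition a benchmark uses.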
Dependencies
- `QuantizeEmbeddingInt8` and `ShareEmbeddingLmHead` graph surgeries

Files
- `Qwen-Qwen3.5-2B/cuda/`: CUDA configs
- `Qwen-Qwen3.5-2B/cpu/`: CPU configs
- `Qwen-Qwen3.5-2B/webgpu/`: WebGPU configs
- `Qwen-Qwen3.5-2B/baseline/`: FP16 baseline eval results