Release v0.14.0 · microsoft/onnxruntime-genai

What's Changed

Fix WhisperProcessor divide-by-zero when single prompt is provided by @Copilot in #2068
Fix lm_head tensor loading order dependency in quantized model builder by @thpereir in #2061
Fix failure to build Whisper model by @xiaofeihan1 in #2075
Rename NemotronCacheConfig to NemotronConfig and add blank penalty to the decoder by @nenad1002 in #2042
Fix YaRN RoPE bugs in model builder and add parity tests by @titaiwangms in #2076
Add Transformers v5 Support by @sayanshaw24 in #2089
macOS ARM64 ADO pipeline by @Copilot in #2091
Reduce CPU-side per-token overhead in GenerateNextToken and SampleTopP by @hanbitmyths in #2085
Add onStageComplete by @apsonawane in #2074
[WebGPU] Support continuous decoding (RewindTo) with graph capture by @qjia7 in #2083
[Mistral3] Add VLM support with multi-image inference by @titaiwangms in #2077
Add k_quant_linear mixed-precision quantization for hybrid attention … by @apsonawane in #2100
Removes QNN packaging from onnxruntime-genai pipelines by @baijumeswani in #2109
Add Gemma4 multimodal support (vision + audio) by @apsonawane in #2103
Update GUIDs during az login by @kunal-vaishnavi in #2122
Add CODEOWNERS file for repository ownership by @kunal-vaishnavi in #2119
Qwen3.5: drop fp32 cast around RMSNorm in builder by @xiaofeihan1 in #2101
Add support for LFM2 in ORT GenAI by @xenova in #1979
Enable CUDA graph capture for CUDA EP to improve decode throughput by @apsonawane in #2070
[Qwen3.5] dedup position ids by @daijh in #2102
Address win-cuda pipeline errors by @baijumeswani in #2154
Update Extensions Commit to Fix Id2Token Bugs by @sayanshaw24 in #2159
Limit the CUDA cmake architectures to 86 for CI builds by @baijumeswani in #2161
Gate leaked-object error reporting in Shutdown() to debug builds or when logging is enabled by @baijumeswani in #2162
Update Copilot instructions for reviewing model builder by @kunal-vaishnavi in #2164
Fix DecoderState input_ids check regression introduced in #2103 by @titaiwangms in #2148
Fix memory leaks by @skottmckay in #2153
[Qwen3.5] Use LpNormalization for L2-norm in linear-attention Q/K by @xiaofeihan1 in #2127
Fix: Win32 build failure when paths contain spaces by @nsubaru in #2053
Fix CUDA build with MSVC by enabling /Zc:preprocessor for nvcc host compilation on VS 16.5 or greater by @nsubaru in #2054
Apply linear rope_scaling in model builder for Neutts/nano by @VishalX in #2142
Fix Quark/AWQ weight loading for Qwen3-VL-4B text model by @anilmartha in #2143
Fix WebGPU inference crash in embedding and multi-modal feature allocation by @feich-ms in #2163
Support Visual Studio 18 2026 build by @Copilot in #2017
Add QNN EP documentation to OGA including Genie note by @qti-kromero in #2158
Use windowsml package and make winml usage simpler by @baijumeswani in #2155
Cleanup TensorObject created by OrtxTensorResultGetAt by @skottmckay in #2168
Fix nemotron leaks by @skottmckay in #2169
[RyzenAI] make speech sub-model optional in PhiMultiModalProcessor by @manasablrm in #2167
Enable graph capture for WebGPU models and DML continuous decoding tests by @qjia7 in #2099
[Qwen3] Allow packed QKV MatMul under QK-Norm via post-MatMul Split by @xiaofeihan1 in #2137
Enable Linux ARM64 builds and packaging by @baijumeswani in #2107
Add gemma4 unit tests by @apsonawane in #2151
Auto-detect fixed kv-cache shape in DefaultKeyValueCache by @akholodnamdcom in #2166
Add text-only mode support for Qwen 3.5 model builder by @apsonawane in #2157
Fix heap overflow issue by @apsonawane in #2110
[Benchmark] Add --use_random_tokens flag to C benchmark by @VishalX in #2170
Add HunYuan Dense V1 (hunyuan_v1_dense) model support by @anilmartha in #2144
Nvidia Parakeet Tdt ASR support by @nenad1002 in #2150
Multilingual Streaming Nemotron ASR + CUDA support by @nenad1002 in #2171
Add Csharp binding for Multi-lingual ASR by @rui-ren in #2176
Add VideoChat-Flash (OpenGVLab) language model support by @anilmartha in #2147
Update Nemotron ASR docs by @rui-ren in #2178
Validate sliding window size before creating KV cache by @baijumeswani in #2181
Fix external weights loading for in-memory models without changing cwd by @baijumeswani in #2180
Enable Qwen3.5 TRT-RTX EP path with CUDA graph by @yen-shi in #2139
Add Qwen3.5-MoE (35B-A3B) model support by @tanzeel-amd in #2146
Update ort-extensions commit by @baijumeswani in #2182

New Contributors

@titaiwangms made their first contribution in #2076
@nsubaru made their first contribution in #2053
@VishalX made their first contribution in #2142
@anilmartha made their first contribution in #2143
@feich-ms made their first contribution in #2163
@qti-kromero made their first contribution in #2158
@manasablrm made their first contribution in #2167
@yen-shi made their first contribution in #2139
@tanzeel-amd made their first contribution in #2146

Full Changelog: v0.13.1...v0.14.0
