What's Changed
- Fix WhisperProcessor divide-by-zero when a single prompt is provided by @Copilot in #2068
- Fix lm_head tensor loading order dependency in quantized model builder by @thpereir in #2061
- Fix failure to build Whisper model by @xiaofeihan1 in #2075
- Rename NemotronCacheConfig to NemotronConfig and add blank penalty to the decoder by @nenad1002 in #2042
- Fix YaRN RoPE bugs in model builder and add parity tests by @titaiwangms in #2076
- Add Transformers v5 Support by @sayanshaw24 in #2089
- macOS ARM64 ADO pipeline by @Copilot in #2091
- Reduce CPU-side per-token overhead in GenerateNextToken and SampleTopP by @hanbitmyths in #2085
- Add onStageComplete by @apsonawane in #2074
- [WebGPU] Support continuous decoding (RewindTo) with graph capture by @qjia7 in #2083
- [Mistral3] Add VLM support with multi-image inference by @titaiwangms in #2077
- Add k_quant_linear mixed-precision quantization for hybrid attention … by @apsonawane in #2100
- Remove QNN packaging from onnxruntime-genai pipelines by @baijumeswani in #2109
- Add Gemma4 multimodal support (vision + audio) by @apsonawane in #2103
- Update GUIDs during az login by @kunal-vaishnavi in #2122
- Add CODEOWNERS file for repository ownership by @kunal-vaishnavi in #2119
- Qwen3.5: drop fp32 cast around RMSNorm in builder by @xiaofeihan1 in #2101
- Add support for LFM2 in ORT GenAI by @xenova in #1979
- Enable CUDA graph capture for CUDA EP to improve decode throughput by @apsonawane in #2070
- [Qwen3.5] Deduplicate position IDs by @daijh in #2102
- Address win-cuda pipeline errors by @baijumeswani in #2154
- Update extensions commit to fix Id2Token bugs by @sayanshaw24 in #2159
- Limit CUDA CMake architectures to 86 for CI builds by @baijumeswani in #2161
- Gate leaked-object error reporting in Shutdown() to debug builds or when logging is enabled by @baijumeswani in #2162
- Update Copilot instructions for reviewing model builder by @kunal-vaishnavi in #2164
- Fix DecoderState input_ids check regression introduced in #2103 by @titaiwangms in #2148
- Fix memory leaks by @skottmckay in #2153
- [Qwen3.5] Use LpNormalization for L2-norm in linear-attention Q/K by @xiaofeihan1 in #2127
- Fix Win32 build failure when paths contain spaces by @nsubaru in #2053
- Fix CUDA build with MSVC by enabling /Zc:preprocessor for nvcc host compilation on VS 16.5 or later by @nsubaru in #2054
- Apply linear rope_scaling in model builder for Neutts/nano by @VishalX in #2142
- Fix Quark/AWQ weight loading for Qwen3-VL-4B text model by @anilmartha in #2143
- Fix WebGPU inference crash in embedding and multi-modal feature allocation by @feich-ms in #2163
- Support Visual Studio 18 2026 build by @Copilot in #2017
- Add QNN EP documentation to OGA including Genie note by @qti-kromero in #2158
- Use windowsml package and make winml usage simpler by @baijumeswani in #2155
- Cleanup TensorObject created by OrtxTensorResultGetAt by @skottmckay in #2168
- Fix Nemotron memory leaks by @skottmckay in #2169
- [RyzenAI] make speech sub-model optional in PhiMultiModalProcessor by @manasablrm in #2167
- Enable graph capture for WebGPU models and DML continuous decoding tests by @qjia7 in #2099
- [Qwen3] Allow packed QKV MatMul under QK-Norm via post-MatMul Split by @xiaofeihan1 in #2137
- Enable Linux ARM64 builds and packaging by @baijumeswani in #2107
- Add Gemma4 unit tests by @apsonawane in #2151
- Auto-detect fixed KV-cache shape in DefaultKeyValueCache by @akholodnamdcom in #2166
- Add text-only mode support for Qwen 3.5 model builder by @apsonawane in #2157
- Fix heap overflow issue by @apsonawane in #2110
- [Benchmark] Add --use_random_tokens flag to C benchmark by @VishalX in #2170
- Add HunYuan Dense V1 (hunyuan_v1_dense) model support by @anilmartha in #2144
- Add NVIDIA Parakeet TDT ASR support by @nenad1002 in #2150
- Add multilingual streaming Nemotron ASR with CUDA support by @nenad1002 in #2171
- Add C# binding for multilingual ASR by @rui-ren in #2176
- Add VideoChat-Flash (OpenGVLab) language model support by @anilmartha in #2147
- Update Nemotron ASR docs by @rui-ren in #2178
- Validate sliding window size before creating KV cache by @baijumeswani in #2181
- Fix external weights loading for in-memory models without changing cwd by @baijumeswani in #2180
- Enable Qwen3.5 TRT-RTX EP path with CUDA graph by @yen-shi in #2139
- Add Qwen3.5-MoE (35B-A3B) model support by @tanzeel-amd in #2146
- Update ort-extensions commit by @baijumeswani in #2182
New Contributors
- @titaiwangms made their first contribution in #2076
- @nsubaru made their first contribution in #2053
- @VishalX made their first contribution in #2142
- @anilmartha made their first contribution in #2143
- @feich-ms made their first contribution in #2163
- @qti-kromero made their first contribution in #2158
- @manasablrm made their first contribution in #2167
- @yen-shi made their first contribution in #2139
- @tanzeel-amd made their first contribution in #2146
Full Changelog: v0.13.1...v0.14.0