Sync master with upstream release b8795 #486

Merged
jan-service-account merged 12 commits into dev from
update-dev-from-master-2026-04-15-00-58
Apr 15, 2026
@jan-service-account
Updates the dev branch with the latest release (b8795) from ggml-org/llama.cpp.

ngxson and others added 12 commits April 14, 2026 11:09
* server: support OAI /v1/audio/transcriptions API

* address autoreview comments

* correct default response_format value
This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For
mul_mat, it does not add support for the dp4/q8_1 path; everything goes
through fp16/fp32.
…ml-org#21870)

* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in ggml-org#21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in ggml-org#21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on ggml-org#21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from ggml-org#20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
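
The combined condition described in the two commits above can be sketched as follows. This is a minimal illustration, not the actual llama.cpp code: the struct `rbudget_params_sketch`, its field names, and the helpers `should_create_rbudget` and `effective_budget` are hypothetical, and the sketch assumes the sampler is only ever considered when the chat template sets thinking start/end tags.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the relevant parameters; names are illustrative
// and do not mirror the real llama.cpp structs.
struct rbudget_params_sketch {
    int  reasoning_budget_tokens = -1;    // -1: no budget requested (the default)
    bool has_thinking_tags       = false; // template sets thinking start/end tags
    bool grammar_lazy            = false; // lazy grammar, e.g. when tools are in use
};

// Budget the sampler would enforce once it exists: the unlimited default (-1)
// maps to INT_MAX, so no tokens are ever forced.
inline int effective_budget(int reasoning_budget_tokens) {
    assert(reasoning_budget_tokens >= -1);
    return reasoning_budget_tokens < 0 ? INT_MAX : reasoning_budget_tokens;
}

// Create the sampler only if a budget is explicitly set (>= 0) or the grammar
// is lazy; otherwise skip it so backend sampling stays enabled.
inline bool should_create_rbudget(const rbudget_params_sketch & p) {
    if (!p.has_thinking_tags) {
        return false; // assumption: no tags means nothing to track
    }
    return p.reasoning_budget_tokens >= 0 || p.grammar_lazy;
}
```

Under these assumptions, a model with thinking tags, the default budget of -1, and a non-lazy grammar gets no sampler, while an explicit budget of 0 or a lazy grammar still creates it.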
…gml-org#21644)

* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for Chrome, I'm blaming Dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function
* cmake: fix CMP0194 warning on Windows with MSVC

Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.

The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.

This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).

Closes ggml-org#20311

* cmake: apply cisc's formatting suggestion

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ci : re-enable mac workflows

* vulkan : fix compile warning
…device supports it (ggml-org#21572)

* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it

* use FetchContent to get SPIRV-Headers

* Fetch spirv-headers unconditionally

* remove fetchcontent, rely on installed headers

* fix ubuntu job

* Update docs/build.md
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
@jan-service-account jan-service-account merged commit d9959e8 into dev Apr 15, 2026
3 checks passed
@jan-service-account jan-service-account deleted the update-dev-from-master-2026-04-15-00-58 branch April 15, 2026 00:59
10 participants