Sync master with upstream release b8795 by jan-service-account · Pull Request #486 · janhq/llama.cpp

jan-service-account · 2026-04-15T00:58:21Z

Updates dev branch with latest release (b8795) from ggml-org/llama.cpp


…ml-org#21870)

* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in ggml-org#21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags.

The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in ggml-org#21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on ggml-org#21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from ggml-org#20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
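The creation rule described in this commit message can be sketched as a small predicate. This is only an illustration of the stated logic, not the actual llama.cpp code: the struct `reasoning_params_sketch`, its field `has_thinking_tags`, and the function `should_create_rbudget` are hypothetical names, and the real sampler parameters are laid out differently.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant sampling parameters.
// The real llama.cpp structures use different names and fields.
struct reasoning_params_sketch {
    int  reasoning_budget_tokens = -1;    // -1 means no budget requested (the default)
    bool has_thinking_tags       = false; // assumption: template sets thinking start/end tags
    bool grammar_lazy            = false; // lazy grammar, e.g. when tools are in use
};

// Returns true when the reasoning budget sampler should be created.
// Per the commit message: skip it only when no budget is set AND the
// grammar is not lazy, so backend sampling can stay enabled.
inline bool should_create_rbudget(const reasoning_params_sketch & p) {
    if (!p.has_thinking_tags) {
        return false; // assumption: without tags there is nothing to track
    }
    return p.reasoning_budget_tokens >= 0 || p.grammar_lazy;
}
```

Under these assumptions, a model with thinking tags and the default budget of -1 gets no sampler, while an explicit budget of 0 or a lazy grammar still creates one.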

…gml-org#21644)

* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function

* cmake: fix CMP0194 warning on Windows with MSVC

Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.

The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler. This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).

Closes ggml-org#20311

* cmake: apply cisc's formatting suggestion

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>


…device supports it (ggml-org#21572)

* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md


ngxson and others added 12 commits April 14, 2026 11:09

server: support OAI /v1/audio/transcriptions API (ggml-org#21863)

e489a5c

* server: support OAI /v1/audio/transcriptions API
* address autoreview comments
* correct default response_format value

vulkan: Support GGML_TYPE_NVFP4 (ggml-org#21455)

6a6780a

This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For mul_mat, it does not add support for the dp4/q8_1 path, it's all via fp16/fp32.

ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (ggml-org#21559)

2e05f06

vendor : update BoringSSL to 0.20260413.0 (ggml-org#21881)

be76dd0

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

metal : add XIELU unary op (ggml-org#20802)

aa0f189

ci : re-enable mac workflows (ggml-org#21894)

f4b5bf2

* ci : re-enable mac workflows
* vulkan : fix compile warning

mtmd: add mtmd_image_tokens_get_decoder_pos() API (ggml-org#21851)

707c0b7

* mtmd: add mtmd_image_tokens_get_decoder_pos() API
* consistent naming
* fix build

metal : fix FA support logic (ggml-org#21898)

c0de6ed

jan-service-account merged commit d9959e8 into dev Apr 15, 2026
3 checks passed

jan-service-account deleted the update-dev-from-master-2026-04-15-00-58 branch April 15, 2026 00:59

jan-service-account merged 12 commits into dev from update-dev-from-master-2026-04-15-00-58

10 participants