Sync master with upstream release b7896 #409
Merged
jan-service-account merged 20 commits into dev on Jan 31, 2026
Conversation
ggml-org#19151)

* webgpu : pipeline flash_attn Q/K loads in WGSL
* ggml-webgpu: unroll Q*K accumulation inner loop
* ggml-webgpu: vectorization
* ggml-webgpu: unrolling
* ggml-webgpu: remove redundant unrolling
* ggml-webgpu: restore the config
* ggml-webgpu: remove redundant comments
* ggml-webgpu: formatting
* ggml-webgpu: formatting and remove vectorization
* ggml-webgpu: remove unnecessary constants
* ggml-webgpu: change QKV buffer to read_write to pass validation
* ggml-webgpu: add explanation for the additional bracket around Q*K accumulate
* Indentation and for -> if for tail
* Kick off CI on wgsl-only commits

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
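The unrolling step above is a standard latency-hiding technique. Below is a plain C++ sketch of what unrolling a Q*K dot-product accumulation looks like; this is illustrative only — the actual change lives in the WGSL flash-attention shader, and the function and names here are hypothetical:

```cpp
#include <cstdio>

// Illustrative C++ analogue of unrolling a Q*K dot-product accumulation.
// Four independent partial sums break the serial dependency chain on a
// single accumulator, helping the compiler (or GPU) pipeline the loads.
static float dot_unrolled(const float * q, const float * k, int d) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= d; i += 4) {
        s0 += q[i + 0] * k[i + 0];
        s1 += q[i + 1] * k[i + 1];
        s2 += q[i + 2] * k[i + 2];
        s3 += q[i + 3] * k[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < d; i++) { // scalar tail, cf. "for -> if for tail" in the log
        s += q[i] * k[i];
    }
    return s;
}

int main() {
    float q[6] = {1, 2, 3, 4, 5, 6}, k[6] = {6, 5, 4, 3, 2, 1};
    printf("%g\n", dot_unrolled(q, k, 6)); // 1*6+2*5+3*4+4*3+5*2+6*1 = 56
}
```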
* Fix typos in SYCL documentation
* Update SYCL.md
* Update SYCL.md
* Update SYCL.md
* Update docs/backend/SYCL.md

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* Update SYCL.md

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* sycl: implement GGML_OP_TRI
* docs: update ops.md for SYCL TRI
* docs: regenerate ops.md
* docs: update SYCL support for GGML_OP_TRI
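For readers unfamiliar with the op: a triangular op keeps one side of a matrix and zeroes the other. A minimal plain-C++ illustration of the lower-triangular case follows; the names and signature are hypothetical and are not the actual ggml API:

```cpp
#include <cstdio>

// Illustrative only: a lower-triangular fill over a row-major n x n matrix,
// the kind of operation GGML_OP_TRI provides (the real ggml signature and its
// upper/lower/diagonal variants live in ggml's source).
static void tri_lower(float * dst, const float * src, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            dst[i * n + j] = (j <= i) ? src[i * n + j] : 0.0f;
        }
    }
}

int main() {
    const int n = 3;
    float src[n * n] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    float dst[n * n];
    tri_lower(dst, src, n);
    for (int i = 0; i < n; i++) {
        printf("%g %g %g\n", dst[i * n + 0], dst[i * n + 1], dst[i * n + 2]);
    }
}
```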
* sycl: add softplus unary op implementation
* sycl: add softplus unary op implementation
* docs(ops): mark SYCL SOFTPLUS as supported
* docs: update SYCL status for SOFTPLUS
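For reference, softplus(x) = log(1 + exp(x)). A minimal C++ sketch of the math (not the SYCL kernel) using the usual numerically stable form max(x, 0) + log1p(exp(-|x|)):

```cpp
#include <cmath>
#include <cstdio>

// softplus(x) = log(1 + exp(x)); for large x, exp(x) overflows, so the
// stable form rewrites it as max(x, 0) + log1p(exp(-|x|)).
static float softplus(float x) {
    return std::fmax(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}

int main() {
    for (float x : {-10.0f, 0.0f, 10.0f, 100.0f}) {
        printf("softplus(%g) = %g\n", x, softplus(x));
    }
}
```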
This commit removes the unused tmp_buf variable from llama-kv-cache.cpp and llama-memory-recurrent.cpp. The tmp_buf variable was declared but never used; because it has a non-trivial constructor/destructor, the compiler does not emit an unused-variable warning for it.
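For context, a minimal sketch of why no warning was emitted (illustrative names, not the actual llama.cpp code): -Wunused-variable only fires for variables whose construction and destruction are free of side effects.

```cpp
#include <vector>

void example() {
    int unused_int = 42;            // gcc/clang: warning: unused variable 'unused_int'
    std::vector<char> tmp_buf(256); // no warning: the constructor and destructor
                                    // are non-trivial, so the compiler assumes the
                                    // declaration may have side effects
}
```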
…19202) This commit adds a missing return statement to the GraniteMoeModel class to fix an issue in the model conversion process. Resolves: ggml-org#19201
…9203) This commit updates the comments in state_write_data to clarify that it is handling the R and S tensors and not Key and Value tensors.
On macOS Sequoia 15.7.3, x86_64, the build has recently started failing with:
```
In file included from .../code/cpp/llama.cpp/common/jinja/string.cpp:2:
.../code/cpp/llama.cpp/common/./jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
478 | std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
| ~~~~~^
In file included from .../code/cpp/llama.cpp/common/jinja/caps.cpp:1:
.../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
478 | std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
| ~~~~~^
In file included from .../code/cpp/llama.cpp/common/jinja/value.cpp:1:
In file included from .../code/cpp/llama.cpp/common/jinja/runtime.h:4:
.../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
478 | std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
[...]
```
After a bit of digging to make sure all the appropriate flags were used, I noticed that the necessary header was not included. This fixes the build for me and should not negatively affect other builds that, for whatever reason, were already succeeding.
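For illustration, here is a self-contained reduction of the failure mode (a hypothetical reduction; the real fix is a one-line #include <unordered_map> in common/jinja/value.h):

```cpp
// Hypothetical reduction of the issue: value.h used std::unordered_map
// without including <unordered_map>. Some standard-library builds pull the
// header in transitively via other includes, so the code happened to compile
// there; Apple's x86_64 toolchain on Sequoia did not.
#include <string>         // may or may not provide unordered_map transitively
#include <unordered_map>  // the one-line fix: include what you use

struct value {
    std::string s;
    bool operator==(const value & o) const { return s == o.s; }
};

struct value_hasher {
    size_t operator()(const value & v) const { return std::hash<std::string>{}(v.s); }
};

// Without the explicit include above, this declaration is exactly where the
// "no template named 'unordered_map' in namespace 'std'" error appears.
std::unordered_map<value, value, value_hasher> unordered;

int main() {
    unordered[value{"key"}] = value{"val"};
    return 0;
}
```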
* spec : add ngram-mod
* cont : simplify + keep track of occupancy
* cont : cleanup
* cont : move initialization to common/speculative
* cont : cleanup
* cont : cleanup
* cont : fix
* server : wrap around the "id_slot" parameter
* cont : minor
* Add Q8_0 OpenCL kernel

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>

* opencl: fix build for non-adreno
* opencl: refactor q8_0
* opencl: enforce subgroup size of 64 for adreno for q8_0
* For A750 and older generations, subgroup size can be 64 or 128. This kernel assumes subgroup size 64.
* opencl: suppress warning when adreno kernels are disabled

---------

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
* lookup, lookahead: fix crash when n_ctx not specified

Since PR ggml-org#16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic GPU memory fitting. This causes llama-lookup and llama-lookahead to crash when run without an explicit -c flag:

GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")

Root cause: both examples use params.n_ctx directly for batch initialization, but params.n_ctx remains 0 even after the context is properly initialized to n_ctx_train internally.

Bug history:
- Nov 2023: lookahead.cpp created (PR ggml-org#4207) with the params.n_ctx pattern
- Dec 2023: lookup.cpp created (PR ggml-org#4484) with the same pattern
- Nov 2024: default n_ctx changed to 4096 (PR ggml-org#10136) - bug dormant
- Dec 2025: default n_ctx changed to 0 (PR ggml-org#16653) - bug activated

The bug was dormant for 2+ years because params.n_ctx defaulted to 512, then 4096. PR ggml-org#16653 changed it to 0 for GPU auto-fitting, triggering the crash.

Fix: use llama_n_ctx(ctx) to get the actual runtime context size, matching the pattern already used elsewhere in lookup.cpp (line 72) and in speculative.cpp/speculative-simple.cpp; see the sketch after this list.

Tested: llama-lookup now works without the -c flag (12.5% acceptance on Gemma-3-1B).

Note: llama-lookahead has a separate pre-existing issue with sequence initialization (n_seq_max=1 vs the W+G+1 needed) that is unrelated to this fix.

* lookahead: fix n_seq_max and kv_unified configuration

Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- a unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR ggml-org#14482 changed the validation logic. Consolidates the fix from PR ggml-org#18730 per maintainer request.

Commit message drafted with Claude.
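A sketch of the pattern the fix adopts (variable names follow lookahead.cpp, where W and G are the lookahead window and n-gram parameters; treat this as an outline of the change, not the exact diff):

```cpp
#include "llama.h"

// Before, the batch was sized with params.n_ctx, which defaults to 0
// since ggml-org#16653:
//
//   llama_batch batch = llama_batch_init(params.n_ctx, 0, W + G + 1); // crash
//
// After: query the initialized context for its actual size instead.
llama_batch make_batch(llama_context * ctx, int W, int G) {
    const int n_ctx = llama_n_ctx(ctx); // n_ctx_train when the user passed 0
    return llama_batch_init(n_ctx, 0, W + G + 1);
}
```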
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Updates dev branch with latest release (b7896) from ggml-org/llama.cpp