Sync master with upstream release b8739 by jan-service-account · Pull Request #481 · janhq/llama.cpp

jan-service-account · 2026-04-10T00:53:28Z

Updates dev branch with latest release (b8739) from ggml-org/llama.cpp

…ggml-org#21592)

* fix: free ctx_copy in ggml_opt_free to plug per-training-session leak

  ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every time the allocated graph shape changes. The last ctx_copy from the final ggml_opt_alloc call survives until ggml_opt_free is invoked, but ggml_opt_free was only freeing ctx_static and ctx_cpu, never ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch context: ~900 KB for a typical GNN training session in sindarin-pkg-tensor, surfaced via AddressSanitizer.

  ctx_copy is nullptr-initialized and ggml_free() handles NULL safely, so the new release is guard-free.

* Update ggml/src/ggml-opt.cpp

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: realorko <realorko@nowhere.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

actions/labeler@v6 removed the `all:` / `any:` composition keys. The `server/webui` and `server` entries used `all:` to combine `any-glob-to-any-file` with negated `all-globs-to-all-files`, which now errors on every PR with:

    Unknown config options were under "changed-files": all

Flatten both entries to a single `any-glob-to-any-file`. PRs touching both webui and other server files will now receive both labels instead of only `server/webui`.

Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com>

* sycl : add flash-attn support for head size 512

  This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512.

  Changes:
  - Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels.
  - Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256).
  - Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`.
  - Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization.
  - Added necessary template instances for the new 512 head size across various quantization types.

* remove defunct mxfp4 reorder from setting buffer type

…lf-filtering (ggml-org#21623)

* feat: jinja engine improvements for reka-edge

  Port three Jinja engine improvements needed for the reka-edge model:
  1. Python-style string repetition ("ab" * 3 → "ababab")
  2. ensure_ascii=true support for tojson filter (escapes non-ASCII to \uXXXX)
  3. int() builtin on value_int_t (identity, needed for Reka Edge template)

* fix: escape invalid utf8 bytes when ensure_ascii=true

  The json_ensure_ascii_preserving_format function does not correctly handle an edge case where if UTF-8 parsing fails, it adds the non-ascii character back to the output as a raw byte. This commit fixes that by adding the unicode standard replacement character \\ufffd to the output instead. This is the standard behavior for various programming languages like Python, Rust, Go, etc.

* chore: address PR comments

  1. Add todo comment for supporting string repetition for array/tuples
  2. Add support for float identity operation
  3. Move invalid ascii test case to test_fuzzing

* chore: accept suggestion for common/jinja/value.cpp

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
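For reference, the Python behaviour that these Jinja changes are described as mirroring can be sketched as follows. This is plain Python standard-library code, not the llama.cpp Jinja engine, and the sample strings are made up for illustration.

```python
import json

# 1. Python-style string repetition, the semantics the commit ports to Jinja.
repeated = "ab" * 3

# 2. ensure_ascii=True: non-ASCII characters are emitted as \uXXXX escapes.
escaped = json.dumps("héllo", ensure_ascii=True)

# 3. int() applied to a value that is already an integer is the identity.
identity = int(5)

# 4. Invalid UTF-8 bytes become U+FFFD (the replacement character)
#    instead of being passed through as raw bytes.
decoded = b"ok\xff".decode("utf-8", errors="replace")
escaped_invalid = json.dumps(decoded, ensure_ascii=True)

print(repeated)
print(escaped)
print(identity)
print(escaped_invalid)
```

The last case is the one the follow-up fix targets: a byte sequence that fails UTF-8 parsing ends up as an escaped replacement character in the ASCII-only output.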

* common : simplify autoparser tagged parser rules
* cont : remove upper limit on optional args
* cont : revert changes to parsing at the end
* cont : undo arbitrary ordering of optional args
* cont : fix uninitialized required parameters
* revert to simplify merge
* re-apply patches
* restore flexible optional arg ordering tests

* requirements : update transformers to 5.5.0

  This commit updates the transformers dependency to version 5.5.0.

  The motivation for this is that transformers 5.5.0 includes support for Gemma4 and is required to be able to convert Gemma4 models. This is also causing issues for users of gguf-my-repo.

  Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202

* fix huggingface_hub version
* set version of transformers to 5.5.0
* convert : add ty ignore directives to convert_hf_to_gguf.py

  This commit adds `ty: ignore` directives to transformers tokenizers field/methods to avoid type check errors. There might be better ways to handle this and perhaps this can be done in a follow up commit.

  The motivation for this is that it looks like in transformers 5.5.0 AutoTokenizer.from_pretrained can return generic tokenizer types or None and the type checker now produces an error when the conversion script accesses fields like tokenizer.vocab.

* convert : add ty ignore to suppress type check errors
* convert : remove incorrect type ignores
* convert : fix remaining python checks

  I was running a newer version of ty locally but I've switched to version 0.0.26 which is what CI uses and I was then able to reproduce the errors. Sorry about the noise.

* update transformers version to 5.5.1

)

* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0

  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)

  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)

  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
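The Qwen-30B-A3B Q4_0 crash described above comes down to simple arithmetic, which the following Python sketch works through. The helper `split_rows` is hypothetical and is not the meta backend's actual splitting code. The value 32 is the Q4_0 block size; whether the patch picks exactly that value as the granularity is an assumption here.

```python
def split_rows(n_rows, n_devs, granularity):
    """Distribute n_rows across n_devs in whole blocks of `granularity` rows.

    Hypothetical helper for illustration only.
    """
    if n_rows % granularity != 0:
        raise ValueError("n_rows must be a multiple of granularity")
    n_blocks = n_rows // granularity
    base, extra = divmod(n_blocks, n_devs)
    # The first `extra` devices receive one additional block each.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devs)]


# Intermediate dimension of 768 split across 2 GPUs.
coarse = split_rows(768, 2, 256)  # 3 blocks cannot be divided evenly
fine = split_rows(768, 2, 32)     # 24 blocks divide evenly

print(coarse)
print(fine)
```

With a granularity of 256 the two devices end up with different shares, which is the uneven split the commit message says the implementation does not support. A granularity tied to the quantization block size avoids it for this shape.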

…org#21570)

Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:
- vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__
- common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
- mma.cuh: Route CDNA4 to compatible MFMA instructions:
  * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
  * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3)
- mmq.cuh: Include CDNA4 in stream-k kernel dispatch

CDNA4 is largely compatible with CDNA3 except:
- No xf32 MFMA (mfma_f32_16x16x8_xf32); routes to f32 path
- Different FP8 format (e4m3fn vs e4m3_fnuz); not changed here

Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:
- Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
- llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
  * f16+FA: 40,013 tok/s prefill, 254 tok/s decode
  * q8_0+FA: functional
- Flash attention: works correctly
- MMQ: works correctly with stream-k dispatch

Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>

reeselevine and others added 27 commits April 8, 2026 16:08

webgpu : Query for adapter support when registering WebGPU backend (g…

5473949

…gml-org#21579)

kv-cache : extend cache quantization checks (ggml-org#21586)

3ba12fe

to also check for enabled flash attention, instead of just auto.

Propose fix a couple of typos (ggml-org#21581)

e9fd962

Signed-off-by: John E <jeis4wpi@outlook.com>

webui : send both backend_sampling == false/true (ggml-org#18781)

4a05e0c

* webui : send both backend_sampling == false/true
* feat: Parameter sync

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

vocab : remove </s> eog token if gemma4 (ggml-org#21492)

d9a12c8

server: respect the ignore eos flag (ggml-org#21203)

6606000

CUDA: also store node->src->data ptrs for equality check (ggml-org#…

d12cc3d

…21635)

* CUDA: also store node->src->data ptrs for equality check
* address review comments

common : skip non-primary GGUF split files when selecting model (ggml…

4293919

…-org#21633)

We should not assume files are listed in order.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
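A minimal sketch of the idea, in Python rather than the project's C++: when picking a model file from an unordered listing, skip any split that is not the primary one. The `-NNNNN-of-NNNNN.gguf` suffix pattern, the helper name `pick_model_file` and the sample file names are assumptions for illustration and are not taken from the patch.

```python
import re

# Assumed naming convention for split GGUF files: <name>-00001-of-00003.gguf
SPLIT_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def pick_model_file(files):
    """Return the first unsplit GGUF file or primary split, or None."""
    for name in files:
        if not name.endswith(".gguf"):
            continue
        match = SPLIT_RE.search(name)
        # Unsplit files have no suffix; the primary split has index 1.
        if match is None or int(match.group(1)) == 1:
            return name
    return None


# The listing is deliberately out of order.
files = [
    "model-00002-of-00003.gguf",
    "model-00003-of-00003.gguf",
    "model-00001-of-00003.gguf",
]
selected = pick_model_file(files)
print(selected)
```

Taking the first `.gguf` entry in this listing would have selected the second split, which is the failure mode the commit message warns about.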

vulkan: unify type macros to use Vx instead of _VECx (ggml-org#21605)

8a132fa

webui: Add option to pre-encode conversation for faster next turns (g…

75511a8

…gml-org#21034)

server : fix grammar commandline args (ggml-org#21543)

3ee9da0

Co-authored-by: AUTOMATIC <->

fix: Model Selector choice sync (ggml-org#21628)

9949ad0

metal : add missing mm-id specializations for q1_0 (ggml-org#21662)

5e9c635

vocab: add gemma4 tokenizer tests, fix edge case (ggml-org#21534)

0ec191e

* YATF (Yet Another Tokenizer Fix) for Gemma 4. With tests!
* Remove unnecessary hash from update script.
* minor: move constant

mtmd: support dots.ocr (ggml-org#17575)

501aeed

* convert gguf
* clip impl
* fix conversion
* wip
* corrections
* update docs
* add gguf to test script

model: fix multimodal padding token for gemma3n/gemma4 (ggml-org#21625)

057dba3

* model: fix multimodal padding token for gemma3n/gemma4
* nits

common : fix ambiguous grammar rule in gemma4 (ggml-org#21661)

ddf03c6

* common : fix ambiguous grammar rule in gemma4
* cont : fix missing comma...

webui: add "Send message on Enter" setting (ggml-org#21577)

4ef9301

* webui: make Enter to send chat a setting
* Shorten description
* Use isMobile hook from $lib/hooks
* Rebuild static output

ggml : check return value of CUB calls used in argsort and top-k (the…

009a113

…y all return cudaError_t) (ggml-org#21676)

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

jan-service-account merged commit ff3628e into dev Apr 10, 2026
5 checks passed

jan-service-account deleted the update-dev-from-master-2026-04-10-00-53 branch April 10, 2026 01:21

jan-service-account merged 27 commits into dev from update-dev-from-master-2026-04-10-00-53
