Sync master with upstream release b8323 #452
Merged
jan-service-account merged 25 commits into dev on Mar 14, 2026
Conversation
* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
* cont : fix the fix
* cont : fix
* metal : add GDN kernel (ggml-org#20361)
  * metal : add Metal backend for GGML_OP_GATED_DELTA_NET
    Add a fused Metal kernel for the gated delta net recurrence op (ggml-org#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU.
    Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%)
  * metal : validate contiguity of all input tensors in supports_op
  * metal : add algorithm equivalence comment for GDA decay path
  * cont : unslop + optimize
  * cont : clean-up
* CUDA: AR gated delta net improvements (ggml-org#20391)
  * Add FastDiv to gated_delta_net_cuda
  * Shard columns across warps: this reduces register pressure (avoids spills for S_v = 128) and gives the warp scheduler more CTAs to schedule, hiding data-access latencies.
  * Remove unneeded include in gated_delta_net.cu
  * Improve comments
  * Apply code formatting
  * Make sharding HIP-compatible: use ggml_cuda_get_physical_warp_size() to determine the warp size flexibly, and add a test with a partial warp to exercise sum reduction on CUDA
  * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
  * Rename variables
  * Enable GDN also for prefill, move TODO for chunked GDN
  * Actually remove the TODO from 2068908
  * Get warp size at runtime (warp_size is not known at compile time in HIP host code)
  * Don't expose ggml_cuda_get_physical_warp_size on host
* llama : refactor llm_build_delta_net_base API

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
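For reviewers unfamiliar with the op these kernels fuse: a minimal scalar sketch of one gated delta net recurrence step for a single head, in the scalar-gate (GDA) mode. All names and the state layout here are illustrative assumptions for this note, not the actual ggml/CUDA implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One gated delta net step (single head, scalar gate g; sketch only).
// S is a d_k x d_v state matrix, stored row-major.
std::vector<float> gdn_step(std::vector<float> &S, int d_k, int d_v,
                            const std::vector<float> &q,
                            const std::vector<float> &k,
                            const std::vector<float> &v,
                            float g, float beta) {
    const float decay = std::exp(g);            // scalar gate (GDA mode)
    for (float &s : S) s *= decay;              // 1) decay the state

    std::vector<float> err(d_v);                // 2) delta rule error: v - S^T k
    for (int j = 0; j < d_v; ++j) {
        float pred = 0.0f;
        for (int i = 0; i < d_k; ++i) pred += S[i*d_v + j] * k[i];
        err[j] = v[j] - pred;
    }
    for (int i = 0; i < d_k; ++i)               // 3) rank-1 state update
        for (int j = 0; j < d_v; ++j)
            S[i*d_v + j] += beta * k[i] * err[j];

    std::vector<float> out(d_v, 0.0f);          // 4) read out with q
    for (int j = 0; j < d_v; ++j)
        for (int i = 0; i < d_k; ++i)
            out[j] += S[i*d_v + j] * q[i];
    return out;
}
```

The fused kernels above compute this whole chain (decay, error, update, readout) in one launch per token batch instead of one op per step, which is where the tg128 gains come from.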
* Add GGML_OP_REPEAT to webgpu backend.
* Add i16 support for GGML_OP_REPEAT.
* Add support for Phi4ForCausalLMV.
* Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and match HF NaFlex resize behavior in mtmd.
* Rename constants + fix tokenizer label
* Clean-ups.
* Fix GGUF export.
* Set tokenizer.ggml.pre explicitly.
* Default the vocab name rather than forcing it.
* Clean-ups.
* Fix indent.
* Fix subscriptable error.
* Remove overcomplicated code path
* Clean-ups.

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Mishusha <pmv26021975@gmail.com>
* OpenCL: add CUMSUM op support
* Remove unused argument
* opencl: refactor cumsum
* opencl: refactor
* opencl: refactor tmp buffer
* opencl: adjust max number of subgroups
* opencl: fix whitespace
* opencl: fix global size when running cumsum on the tmp buffer

Co-authored-by: Li He <lih@qti.qualcomm.com>
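The tmp-buffer commits above follow the standard two-pass GPU scan structure. A CPU sketch of that structure (block size and names are illustrative, not the OpenCL kernel's):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Two-pass inclusive prefix sum, mirroring a GPU cumsum:
// pass 1 scans fixed-size blocks and writes block totals to a tmp buffer,
// pass 2 adds the running prefix of those totals back to each block.
void cumsum_blocked(std::vector<float> &x, size_t block = 4) {
    const size_t n = x.size();
    std::vector<float> tmp((n + block - 1) / block, 0.0f);
    for (size_t b = 0; b * block < n; ++b) {       // pass 1: scan each block
        float run = 0.0f;
        for (size_t i = b * block; i < std::min(n, (b + 1) * block); ++i) {
            run += x[i];
            x[i] = run;
        }
        tmp[b] = run;                              // block total
    }
    float offset = 0.0f;
    for (size_t b = 0; b * block < n; ++b) {       // pass 2: add block offsets
        for (size_t i = b * block; i < std::min(n, (b + 1) * block); ++i)
            x[i] += offset;
        offset += tmp[b];
    }
}
```

On the GPU, pass 1 maps to one subgroup/workgroup per block, and the "fix global size" commit is about launching the second pass with the tmp buffer's element count rather than the input's.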
…ls with --no-mmap (ggml-org#20059)

* Changed to reuse command buffers to fix crashing on Intel GPU
* Removed unused parameter
* Fixed compile error and minor mistake
* Fix logging
* Changed to use a usage flag per command buffer
* Fixed style
* Added buffer reset
* Removed cmd_buffer_idx for reuse consistency
* Fixed style
* metal : avoid modulus in bin kernel when not broadcasting
* metal : fix capture_started flag
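Why the modulus can be dropped: in a broadcasting binary op, the smaller operand is indexed as `i % ne_a`; when the shapes match, `i % ne_a == i`, so a specialized kernel variant can index directly and skip the (comparatively expensive) integer modulus. A hypothetical scalar illustration, not the Metal kernel itself:

```cpp
#include <cassert>

// dst[i] = a[i % ne_a] + b[i] when broadcasting; when shapes match the
// fast path skips the modulus entirely. Sketch only.
float bin_add_at(const float *a, const float *b, int i, int ne_a, bool bcast) {
    const int ia = bcast ? (i % ne_a) : i;  // fast path: ia == i
    return a[ia] + b[i];
}
```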
* vulkan: optimize SSM_CONV workgroup dispatch for large ubatch
  Tile tokens into 2D workgroups (32x16) to reduce workgroup launch overhead at large ubatch sizes. Add a vec4 fast path for nc=4 (the common d_conv size). Fixes PP performance degradation with ubatch > 512. Ref: ggml-org#18725
* vulkan: remove unused shared memory declaration in SSM_CONV

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
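The dispatch math behind the 2D tiling above, as a sketch: instead of one workgroup per token, tokens and channels are tiled so the launch count shrinks by the tile area. Tile sizes and names here are illustrative.

```cpp
#include <cassert>

// Number of workgroups for a 2D-tiled dispatch (ceil-division per axis).
// With 32x16 tiles, a 1024-token ubatch needs 16x fewer launches on the
// token axis than a 1-token-per-workgroup dispatch.
struct Dispatch { int groups_x, groups_y; };

Dispatch ssm_conv_dispatch(int n_channels, int n_tokens,
                           int tile_x = 32, int tile_y = 16) {
    Dispatch d;
    d.groups_x = (n_channels + tile_x - 1) / tile_x; // ceil(channels / 32)
    d.groups_y = (n_tokens   + tile_y - 1) / tile_y; // ceil(tokens / 16)
    return d;
}
```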
* vulkan: add GATED_DELTA_NET op support
  Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time. Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).
* vulkan: optimize GATED_DELTA_NET shader (Phase 1)
  - vec4 dot products on all inner loops (dp4 hardware intrinsic)
  - Cache exp(g) in shared memory for the KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token
  - vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
  - Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops
  KDA TG: +5.4% throughput. Non-KDA: no regressions. 13/13 test-backend-ops cases passing on AMD Radeon 890M (RADV GFX1150).
* vulkan: address review feedback for GATED_DELTA_NET
  Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring.
* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting
* vulkan: add explicit FLOAT_TYPE casts for buffer loads
  Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts to ensure correct behavior across all Vulkan configurations.
* vulkan: fix Q/K broadcast for interleaved head layout
  Adapt to the interleaved broadcast convention from ggml-org#20340: head_id / rq1 → head_id % neq1

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
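The broadcast fix in the last commit is just a change of head-to-head mapping. A sketch of the difference, assuming `neq1` q/k heads shared by `n_head_v` value heads (names taken from the commit message; the surrounding code is hypothetical):

```cpp
#include <cassert>

// Interleaved convention (after ggml-org#20340): consecutive value heads
// cycle through the q/k heads.
int qk_head_interleaved(int head_id, int neq1) {
    return head_id % neq1;
}

// Blocked convention (the old mapping): consecutive value heads share the
// same q/k head in groups of n_head_v / neq1.
int qk_head_blocked(int head_id, int n_head_v, int neq1) {
    return head_id / (n_head_v / neq1);
}
```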
* grammar: fix bad check for root symbol, correct error logging
* Add tests to demonstrate the root symbol check failure
This commit updates the bash completion executables list, adding missing executables and removing some that no longer exist.
…rators into file (ggml-org#19896)

* tests: allow loading test-backend-ops tests from JSON
* Add an error threshold based on op
* Add an error when the file cannot be read
* Add graph operator JSON extraction tool
* Add nb parameter for non-contiguous input tensors
* Fix view check
* Only use view if non-contiguous/permuted; use C++ random instead of rand()
* Replace internal API calls with public llama_graph_reserve call
* Reduce test description length
* Fix nb[0] not getting set for view
* Add name to tests
* Fix inplace error
* Use a text file instead of JSON
* Move llama_graph_reserve function to a new llama-ext header, move export-graph-ops to tests/
* Fix missing declaration
* Use pragma once
* Fix indent
* Fix Windows build
Updates dev branch with latest release (b8323) from ggml-org/llama.cpp