Sync master with upstream release b9415 by jan-service-account · Pull Request #540 · janhq/llama.cpp

jan-service-account · 2026-05-30T01:10:20Z

Updates dev branch with latest release (b9415) from ggml-org/llama.cpp

…ggml-org#23729)

* mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING
* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

…se Attention (DSA) implementation (ggml-org#23346)

* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)
* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation
* model : merge two scale operations into one in DSA lightning indexer implementation
* chore : remove unused code
* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

…23530)

* CUDA: Check PTX version on host side to guard PDL dispatch

  Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for, say, sm_90, sm_90a or sm_90f (that is, forward-jittable PTX vs. arch/family-specific PTX). Thus, one can hit a bug when compiling with `-DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where the current code would wrongly dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

  This PR fixes the issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. A check on ptxVersion alone is sufficient, as device-codes will always be >= ptxVersion (and any violation of this would be a severe bug in CUDA/nvcc), see: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

  Magic constants were taken from boost: https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero
* Apply code-formatting
* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
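For readers unfamiliar with this kind of mixer, the following is a minimal sketch of a MurmurHash3-style 64-bit finalizer. It is not the code from the commit: the constants here are the ones from the public-domain MurmurHash3 reference (`fmix64`), whereas the commit says its constants come from boost's `hash_mix.hpp`, which may differ. The names `fmix64` and `hash_combine_sketch`, and the golden-ratio offset, are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// MurmurHash3 64-bit finalizer (reference constants, NOT necessarily the
// constants used in the commit). Every step is invertible, so the function
// is a bijection on 64-bit values.
static inline std::uint64_t fmix64(std::uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Hypothetical helper: fold one value into a running hash. The non-zero
// additive offset keeps an all-zero input from mixing to zero.
static inline std::uint64_t hash_combine_sketch(std::uint64_t seed, std::uint64_t value) {
    return fmix64(seed + 0x9e3779b97f4a7c15ULL + value);
}
```

Note that `fmix64(0)` is `0`, because xor-shifts and multiplications of zero stay zero. That property is one plausible reason a mixer of this family is paired with a non-zero seed or offset, which is what the sketch's `hash_combine_sketch` does.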

* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution
* introduced clip_image_f32::add_viewsep
* address PR review
  - drop redundant ggml_cpy ops in both deepseekocr versions build
  - drop no-op ggml_cont in build_sam
  - assert num_image_tokens deepseekocr2
  - view_seperator as (1, n_embd) at conversion (for both versions)
  - drop redundant ggml_reshape_2d
* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

yaohengxu and others added 28 commits May 28, 2026 14:51

ggml: auto apply iGPU flag CUDA/HIP if integrated device (ggml-org#23007)

30af6e2

test-llama-archs: fix table format [no release] (ggml-org#23810)

d374e71

arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (ggml-org#23167)

7fb1e70

ci : change Vulkan builds to Release to reduce ccache (ggml-org#23820)

dd15579

* ci : disable all CPU variant builds for Vulkan workflow
* cont : change cache key
* cont : change build type

mtmd: fix gemma 4 audio rms norm eps (ggml-org#23815)

d6be315

* mtmd: fix gemma 4 audio rms norm eps
* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

mtmd: n_head_kv defaults to n_head (ggml-org#23782)

0b56d28

removed AI-generated comment

app : improve help output (ggml-org#23805)

479a9a1

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ci : releases use Github-hosted builds for the UI (ggml-org#23823)

445b7ce

* ci : releases use Github-hosted builds for the UI
* cont : fix name

ui: fix audio and video modality detection (ggml-org#23756)

2f6c815

When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it.

ci : run ui publish on ubuntu-slim (ggml-org#23818)

3ef2369

* run ui publish on self-hosted fast
* run on ubuntu-slim

opencl: move backend info printing into its own function (ggml-org#23702)

408ae2b

* opencl: move backend info print into its own function
* opencl: move new log line
* opencl: fix for non adreno path

mtmd: fix gemma 4 projector pre_norm (ggml-org#23822)

c8914ad

mtmd-debug: add color and rainbow mode (ggml-org#23829)

751ebd1

* mtmd-debug: add color and rainbow mode * fix M_PI * max_dist

hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (ggml-org#23835)

19e92c3

Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.

meta : Add missing buffer set in allreduce fallback !COMPUTE clear (ggml-org#23480)

33c718d

Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.

cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)

241cbd4

app : move licences to llama-app (ggml-org#23824)

98e480a

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

llama: add llm_graph_input_mtp (ggml-org#23643)

eef59a7

* llama: add llm_graph_input_mtp
* rename input_mtp -> input_token_embd
* add TODO about mtmd embedding
* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ngram-mod : Add missing include (ggml-org#23857)

b000431

[no release] Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>

ggml : bump version to 0.13.1 (ggml/1523)

ea02bc3

sync : ggml

fe12e42

llama: use f16 mask for FA to save VRAM (ggml-org#23764)

031ddb2

* llama: use f16 mask for FA
* review: add llama_cast + formatting
* simplify
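As a rough illustration of why a half-precision mask saves memory, the sketch below computes the size of a dense attention mask under a simplifying assumption: one element per (KV position, batch token) pair, with no padding. The helper name `mask_bytes` and the example dimensions are hypothetical, and the real layout in llama.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for a dense n_kv x n_tokens mask.
// Assumes no padding or alignment, which the real implementation may add.
static inline std::size_t mask_bytes(std::size_t n_kv, std::size_t n_tokens, std::size_t elem_size) {
    return n_kv * n_tokens * elem_size;
}

// Example (illustrative sizes only):
//   mask_bytes(8192, 512, 4) is 16777216 bytes (16 MiB) for 4-byte f32 elements
//   mask_bytes(8192, 512, 2) is  8388608 bytes ( 8 MiB) for 2-byte f16 elements
```

Under these assumptions, moving from 4-byte to 2-byte elements halves the mask footprint for any fixed shape.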

server: bump timeout to 3600s (ggml-org#23842)

cb47092

* server: bump timeout to 3600s
* nits: change wording

download: add option to skip_download (ggml-org#23059)

06d26df

* download: add option to skip_download
* fix
* fix 2
* if file doesn't exist, respect skip_download flag

jan-service-account merged commit 86f3076 into dev May 30, 2026

jan-service-account deleted the update-dev-from-master-2026-05-30-01-10 branch May 30, 2026 01:10

jan-service-account merged 28 commits into dev from update-dev-from-master-2026-05-30-01-10

19 participants
