Sync master with upstream release b8179 #438
Merged
jan-service-account merged 15 commits into dev on Feb 28, 2026
Conversation
llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF:Q4_0 -f wikitext-2-raw/wiki.test.raw -c 2048 -b 2048 --chunks 2

before this commit:

```
perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.31 seconds per pass - ETA 0.07 minutes
[1]17.3868,[2]22.2199,
Final estimate: PPL = 22.2199 +/- 1.59692

llama_perf_context_print: load time = 878.56 ms
llama_perf_context_print: prompt eval time = 2037.82 ms / 4096 tokens ( 0.50 ms per token, 2009.99 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 6403.17 ms / 4097 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host   | 845 = 318 + 224 + 302 |
llama_memory_breakdown_print: | - CPU_REPACK | 288 = 288 + 0 + 0 |
llama_memory_breakdown_print: | - AMX    | 31 = 31 + 0 + 0 |
```

after this commit:

```
perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 1.98 seconds per pass - ETA 0.05 minutes
[1]17.2005,[2]21.8220,
Final estimate: PPL = 21.8220 +/- 1.56485

llama_perf_context_print: load time = 719.23 ms
llama_perf_context_print: prompt eval time = 1676.23 ms / 4096 tokens ( 0.41 ms per token, 2443.58 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 4258.74 ms / 4097 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host   | 845 = 318 + 224 + 302 |
llama_memory_breakdown_print: | - AMX    | 319 = 319 + 0 + 0 |
```

(no more CPU_REPACK)

after this commit, disabling amx:

```
perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.34 seconds per pass - ETA 0.07 minutes
[1]17.2005,[2]21.8220,
Final estimate: PPL = 21.8220 +/- 1.56485

llama_perf_context_print: load time = 841.91 ms
llama_perf_context_print: prompt eval time = 2057.28 ms / 4096 tokens ( 0.50 ms per token, 1990.98 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 6454.51 ms / 4097 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host   | 845 = 318 + 224 + 302 |
llama_memory_breakdown_print: | - CPU_REPACK | 319 = 319 + 0 + 0 |
```

=> same perplexity.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
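For reference, the PPL figures above are the exponential of the average negative log-likelihood over the evaluated tokens. A minimal sketch of that formula in C++ (illustrative only; this is not llama.cpp's implementation):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Perplexity over a sequence, given the model's per-token
// probabilities p(token_i | preceding tokens):
//   PPL = exp( -(1/N) * sum_i log p_i )
double perplexity(const std::vector<double> &token_probs) {
    double nll = 0.0; // accumulated negative log-likelihood
    for (double p : token_probs) {
        nll -= std::log(p);
    }
    return std::exp(nll / (double) token_probs.size());
}

int main() {
    // Hypothetical probabilities for 4 tokens.
    std::vector<double> probs = {0.10, 0.25, 0.05, 0.40};
    std::printf("PPL = %.4f\n", perplexity(probs));
}
```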
- adapt ggml-zendnn.cpp to the new lowoha::matmul interface
- update the ZenDNN git tag in CMake to the latest release (ZenDNN-2026-WW08)
- add static lib support in CMake
…gml-org#19920)

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
The binary relies on model files that it tries to locate heuristically. However, when the build directory is configured to be parallel to the source tree, those heuristics fail. This sets the working directory of the test executable to the source tree, which resolves the issue.
…gml-org#19926)

* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags - address review feedback from ngxson
  - --alias accepts comma-separated values (std::set, no duplicates)
  - --tags for informational metadata (not used for routing)
  - aliases resolve transparently in router via get_meta/has_model
  - /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat - address review feedback from ngxson
* server : add single-model test for aliases and tags
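The comma-separated --alias handling described above amounts to splitting the flag value and deduplicating via std::set, as the commit notes. A minimal sketch of that parsing step, with illustrative names (not the server's actual code):

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Split a comma-separated --alias value into a set of unique aliases.
std::set<std::string> parse_aliases(const std::string &value) {
    std::set<std::string> aliases; // std::set drops duplicates
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            aliases.insert(item);
        }
    }
    return aliases;
}

int main() {
    for (const auto &a : parse_aliases("gpt-4,gpt-4-turbo,gpt-4")) {
        std::cout << a << '\n'; // prints gpt-4, gpt-4-turbo (deduplicated)
    }
}
```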
This commit updates the gguf-py package version to 0.18.0 in preparation for a new release to PyPI. Refs: ggml-org#19948
This commit changes the runner for the gguf-publish workflow from ubuntu-slim back to ubuntu-latest; the runner was changed in commit 142cbe2 ("ci : use new 1vCPU runner for lightweight jobs (ggml-org#19107)"). The motivation is that the action used in the workflow depends on the Docker daemon, which does not seem to be available in the ubuntu-slim runner. This is currently causing an error and preventing the gguf-publish workflow from running successfully. Today was the first time (I think) since the original change that the publish task has run, which may be why the issue was not noticed before. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/22481900566
…#19806)

* CUDA: add CDNA3 MFMA support for flash attention MMA kernel

Add MI300X (gfx942) MFMA tensor core flash attention using v_mfma_f32_16x16x16_f16 (FP16 in, FP32 accumulate).

- Add FATTN_WARP_SIZE=64 for CDNA wavefront64
- Add CDNA config for head sizes 64, 80, 96, 112, 128
- Add FP16 MFMA intrinsic path in mma.cuh
- Add manual V transpose load for MFMA register layout
- Route CDNA to MMA for prompt processing, VEC for token generation
- Fix Q loading and combine stride granularity for non-power-of-2 heads

Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X): pp512 +7%, pp1024 +13%, pp2048 +23%, pp4096 +39%; tg128 -10% (FA overhead, VEC used for both)

All 2480 flash attention tests pass.

Ref: ggml-org#17917

* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch

- Replace #define FATTN_WARP_SIZE with constexpr int warp_size = ggml_cuda_get_physical_warp_size() in each device function
- Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked crossover on MI300X @ d32768 with power-of-2 GQA models:
  - hsk=64 (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
  - hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
  - Unified threshold: eff_nq >= 128 for all head sizes.
- Remove VEC fallback; small batches fall through to tile kernel

* Update ggml/src/ggml-cuda/fattn.cu

* use ggml_cuda_info().devices warp_size instead of hardcoded check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
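The review feedback above settles on an effective-query-count threshold for choosing between the MMA and tile kernels. A minimal sketch of that dispatch heuristic, with illustrative names (the real routing lives in ggml/src/ggml-cuda/fattn.cu):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative kernel choices for the flash attention dispatch.
enum class fattn_kernel { MMA, TILE };

// Pick MMA when the "effective" number of query rows (batch rows
// times the GQA ratio) reaches the benchmarked crossover of 128;
// smaller batches fall through to the tile kernel, matching the
// commit's unified threshold.
fattn_kernel choose_fattn_kernel(int64_t n_queries, int64_t gqa_ratio) {
    const int64_t eff_nq = n_queries * gqa_ratio;
    return eff_nq >= 128 ? fattn_kernel::MMA : fattn_kernel::TILE;
}

int main() {
    // e.g. prompt processing, 512 query rows, gqa=4 -> MMA
    std::printf("%s\n", choose_fattn_kernel(512, 4) == fattn_kernel::MMA ? "MMA" : "TILE");
    // e.g. token generation, 1 query row, gqa=4 -> TILE
    std::printf("%s\n", choose_fattn_kernel(1, 4) == fattn_kernel::MMA ? "MMA" : "TILE");
}
```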
Updates the dev branch with the latest release (b8179) from ggml-org/llama.cpp.