merge from upstream #78

l3utterfly · 2025-07-30T06:07:25Z

No description provided.


* Remove unnecessary templates from the class definition and packing functions
* Reduce deeply nested conditionals and if-else switching in the mnpack function
* Replace repetitive code with inline functions in the packing functions
* 2% to 7% improvement on the Q8 model
* 15% to 50% improvement on the Q4 model

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>


@CISC

* Add PLaMo-2 model using hybrid memory module * Fix z shape * Add cmath to include from llama-vocab.h * Explicitly dequantize normalization weights before RoPE apply * Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing * Use ATTN_K/Q_NORM for k,q weights to prevent quantization * Remove SSM_BCDT that is not used from anywhere * Do not duplicate embedding weights for output.weight * Fix tokenizer encoding problem for multibyte strings * Apply suggestion from @CISC Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up * Remove unnecessary part for Grouped Query Attention * Fix how to load special token id to gguf * Remove unused tensor mapping * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations * Update src/llama-vocab.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix plamo2 tokenizer session to prevent multiple calls of build() --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


* Kimi-K2 conversion * add Kimi_K2 pre type * Kimi-K2 * Kimi-K2 unicode * Kimi-K2 * LLAMA_MAX_EXPERTS 384 * fix vocab iteration * regex space fix * add kimi-k2 to pre_computed_hashes * Updated with kimi-k2 get_vocab_base_pre hash * fix whitespaces * fix flake errors * remove more unicode.cpp whitespaces * change set_vocab() flow * add moonshotai-Kimi-K2.jinja to /models/templates/ * update moonshotai-Kimi-K2.jinja * add kimi-k2 chat template * add kimi-k2 * update NotImplementedError Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * except Exception Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){} --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Add LLAMA_API to fix the run-time error with llama-cpp-python in a Windows environment:

attributeError: function 'llama_kv_self_seq_div' not found. Did you mean: 'llama_kv_self_seq_add'?

Although llama_kv_self_seq_div() has been marked deprecated, it still has to be exported to keep llama-cpp-python working.

Observed software versions:
* OS: Windows
* compiler: MSVC
* llama-cpp-python: tag v0.3.12-cu124
* llama.cpp: tag b5833

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>


* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now

* kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (ggml-org#14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>


* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults * Initialize webgpu device * Making progress on setting up the backend * Finish more boilerplate/utility functions * Organize file and work on alloc buffer * Add webgpu_context to prepare for actually running some shaders * Work on memset and add shader loading * Work on memset polyfill * Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it * Implement get_tensor and buffer_clear * Finish rest of setup * Start work on compute graph * Basic mat mul working * Work on emscripten build * Basic WebGPU backend instructions * Use EMSCRIPTEN flag * Work on passing ci, implement 4d tensor multiplication * Pass thread safety test * Implement permuting for mul_mat and cpy * minor cleanups * Address feedback * Remove division by type size in cpy op * Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends * Fix name * Fix macos dawn prefix path


* llama : clarify comment about pp and tg graphs [no ci]

  This commit clarifies the comment in `llama-context.cpp` regarding the prefill prompt (pp) and token generation (tg) graphs. The motivation is that I've struggled to remember these and had to look them up more than once, so I thought it would be helpful to add a comment that makes it clear what they stand for.

* squash! llama : clarify comment about pp and tg graphs [no ci]

  Change "pp" to "prompt processing".


* support smallthinker * support 20b softmax, 4b no sliding window * new build_moe_ffn_from_probs, and can run 4b * fix 4b rope bug * fix python type check * remove is_moe judge * remove set_dense_start_swa_pattern function and modify set_swa_pattern function * trim trailing whitespace * remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * better whitespace Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use GGML_ASSERT for expert count validation Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Improve null pointer check for probs Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use template parameter for SWA attention logic * better whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * move the creation of inp_out_ids before the layer loop * remove redundant judge for probs --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* mtmd : add support for Voxtral * clean up * fix python requirements * add [BEGIN_AUDIO] token * also support Devstral conversion * add docs and tests * fix regression for ultravox * minor coding style improvement * correct project activation fn * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* SYCL: Add set_rows support for quantized types

  This commit adds support for the GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and the BF16 type in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available to set_rows.cpp. This addresses part of the TODOs mentioned in the code.

* Use get_global_linear_id() instead (ggml-ci)
* Fix formatting (ggml-ci)
* Use const for ne11 and size_t variables in set_rows_sycl_q (ggml-ci)
* Increase block size for q kernel to 256 (ggml-ci)
* Cleanup imports
* Add float.h to cpy.hpp

* remove redundant code in riscv * remove redundant code in arm * remove redundant code in loongarch * remove redundant code in ppc * remove redundant code in s390 * remove redundant code in wasm * remove redundant code in x86 * remove fallback headers * fix x86 ggml_vec_dot_q8_0_q8_0

Currently, if RPC servers are specified with '--rpc' and a local GPU is available (e.g. CUDA), the benchmark is performed only on the RPC device(s), but the backend result column says "CUDA,RPC", which is incorrect. This patch adds all local GPU devices, making llama-bench consistent with llama-cli.

* Extend test case filtering

  1. Allow passing multiple (comma-separated?) ops to test-backend-ops. This is convenient when working on a set of ops that you want to test together, without having to run every single op. For example:

     `test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"`

  2. Support the full test-case variation string in addition to basic op names. This makes it easy to select a single variation, either for testing or for benchmarking. It can be particularly useful for profiling a particular variation (e.g. a CUDA kernel), for example:

     `test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"`

  These two can be combined. Like the current `-o`, this change doesn't try to detect or report an error if a filter doesn't name an existing op (e.g. because it is misspelled).

* Updating the usage help text
* Update tests/test-backend-ops.cpp
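
The variation strings above contain commas of their own (inside `(...)` and `[...]`), so a filter list cannot be split on every comma. The following is only a minimal sketch of one way such a filter could be parsed and matched. It is not the code in tests/test-backend-ops.cpp; the function names `split_op_filter` and `matches` are hypothetical, and the exact-match rule is an assumption made for illustration.

```python
def split_op_filter(spec: str) -> list[str]:
    """Split a comma-separated filter, ignoring commas nested in () or []."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0  # current nesting level of parentheses/brackets
    for ch in spec:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            # Top-level comma: finish the current filter entry.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def matches(filters: list[str], op_name: str, variation: str) -> bool:
    """Assumed rule: select a case if a filter equals its op name or its full variation string."""
    return any(f == op_name or f == variation for f in filters)


if __name__ == "__main__":
    filters = split_op_filter("ADD,MUL_MAT(type_a=f16,bs=[1,1]),ROPE")
    print(filters)
    print(matches(filters, "ROPE", "ROPE(type=f32)"))
```

Tracking nesting depth keeps a full variation string intact as one filter entry while still allowing plain op names to be listed alongside it.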


* server-bench: make seed choice configurable * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix error formatting * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>


…14931) LLVM with the amdgcn target does not support unrolling loops with conditional break statements when those statements cannot be resolved at compile time. As in other places in GGML, let's simply ignore this warning.

…gml-org#14930) This is useful for testing for regressions on GCN with CDNA hardware. With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed, with MFMA added and limited-use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.


ShanoToni and others added 30 commits July 14, 2025 10:37

sycl: Batched mulmat rework for oneDNN dispatch (ggml-org#14617)

65a3ebb

SYCL: use 1D kernel for set_rows (ggml-org#14618)

0f4c6ec

* SYCL: Use 1D kernel for set_rows * Remove dangling comment * Refactor and use ceil_div
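
As a rough illustration of why a `ceil_div` helper goes with a 1D launch: the number of work-groups has to be rounded up so that a partially filled last group still covers the tail elements, and each work-item then bounds-checks its index. The sketch below models this in plain Python. It is not the SYCL kernel from the commit; `launch_1d`, `covered_indices`, and the default block size of 256 are assumptions chosen for the example.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, for non-negative a and positive b."""
    return (a + b - 1) // b


def launch_1d(total_elements: int, block_size: int = 256) -> tuple[int, int]:
    """Return (num_blocks, block_size) for a 1D launch covering total_elements."""
    return ceil_div(total_elements, block_size), block_size


def covered_indices(total_elements: int, block_size: int):
    """Yield the element index handled by each in-range work-item."""
    num_blocks, bs = launch_1d(total_elements, block_size)
    for block in range(num_blocks):
        for lane in range(bs):
            i = block * bs + lane
            if i < total_elements:  # bounds check for the last, partial block
                yield i


if __name__ == "__main__":
    print(launch_1d(1000))
    print(list(covered_indices(10, 4)))
```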

scripts: benchmark for HTTP server throughput (ggml-org#14668)

494c589

* scripts: benchmark for HTTP server throughput * fix server connection reset

llama-context: add ability to get logits (ggml-org#14672)

9c9e4fc

sycl: Hotfix for non dnnl codepath (ggml-org#14677)

bdca383

cuda: fix build warnings in set-rows.cu (unused variable) (ggml-org#1…

cbc68be

…4687) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

vulkan: add RTE variants for glu/add/sub/mul/div (ggml-org#14653)

10a0351

vulkan: fix noncontig check for mat_mul_id splitting (ggml-org#14683)

ba1ceb3

* vulkan: fix noncontig check for mat_mul_id splitting Remove supports_op check for > 4096 (splitting fixes this) * vulkan: fix batched matmul dequant for Q*_K

gguf-py : dump bpw per layer and model in markdown mode (ggml-org#14703)

c81f419

convert : add pre-computed hashes first to prevent order mishaps (ggm…

cf91f21

…l-org#14701)

convert : only check for tokenizer folder if we need it (ggml-org#14704)

4b91d6f

scripts: synthetic prompt mode for server-bench.py (ggml-org#14695)

5cae766

server : fix handling of the ignore_eos flag (ggml-org#14710)

538cc77

ggml-ci

llama : fix parallel processing for plamo2 (ggml-org#14716)

e4841d2

server : pre-calculate EOG logit biases (ggml-org#14721)

6ffd4e9

ggml-ci

ggml : add asserts (ggml-org#14720)

6497834

* ggml : add asserts ggml-ci * cont : fix constant type Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>

model : support output bias for qwen2 (ggml-org#14711)

b0f0ecc

Co-authored-by: qwaqrm <qwaqrm@126.com>

llama : fix parameter order for hybrid memory initialization (ggml-or…

496957e

…g#14725)

convert : make hf token optional (ggml-org#14717)

19e5943

* make hf token optional * fail if we can't get necessary tokenizer config

ci : disable failing vulkan crossbuilds (ggml-org#14723)

1ba45d4

batch : fix uninitialized has_cpl flag (ggml-org#14733)

ad57d3e

ggml-ci

kv-cache : opt mask set input (ggml-org#14600)

d9b6910

ggml-ci

llama : fix parallel processing for lfm2 (ggml-org#14705)

086cf81

danbev and others added 28 commits July 27, 2025 12:10

SYCL: add ops doc (ggml-org#14901)

bbfc849

vulkan: add ops docs (ggml-org#14900)

bf78f54

quantize : update README.md (ggml-org#14905)

7f97599

* Update README.md * Fix trailing whitespace * Update README.md Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

cmake : Indent ggml-config.cmake (ggml/1310)

613c509

sync : ggml

1f45f28

ops : update Metal (ggml-org#14912)

c35f9ea

ops : update BLAS (ggml-org#14914)

a5771c9

sycl: refactor quantization to q8_1 (ggml-org#14815)

afc0e89

* sycl: quantization to q8_1 refactor * Refactored src1 copy logic in op_mul_mat

CUDA: fix pointer incrementation in FA (ggml-org#14916)

946b1f6

opencl : add ops docs (ggml-org#14910)

8ad7b3e

CUDA: add roll (ggml-org#14919)

0a5036b

* CUDA: add roll * Make everything const, use __restrict__

cuda : add softcap fusion (ggml-org#14907)

138b288

CANN: Add ggml_set_rows (ggml-org#14943)

204f2cf

common : avoid logging partial messages (which can contain broken UTF…

1a67fcc

…-8 sequences) (ggml-org#14937) * bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences
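
To show why a partially parsed message can contain a broken UTF-8 sequence, here is a small sketch that detects an unfinished multi-byte sequence at the end of a byte buffer. It only illustrates the failure mode; the commit itself avoids the problem by not logging partial messages, and the helper names `incomplete_utf8_suffix_len` and `loggable_prefix` are hypothetical. The sketch assumes the bytes before the tail are valid UTF-8.

```python
def incomplete_utf8_suffix_len(data: bytes) -> int:
    """Return how many trailing bytes belong to an unfinished UTF-8 sequence (0 if none)."""
    # A UTF-8 sequence is at most 4 bytes, so look back at most 4 bytes for a lead byte.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep searching for its lead byte
        if byte & 0x80 == 0:
            return 0  # ASCII byte: nothing is pending
        if byte & 0xE0 == 0xC0:
            need = 2
        elif byte & 0xF0 == 0xE0:
            need = 3
        elif byte & 0xF8 == 0xF0:
            need = 4
        else:
            return 0  # invalid lead byte: not handled by this sketch
        return back if back < need else 0
    return 0


def loggable_prefix(data: bytes) -> str:
    """Decode only the part of the buffer that ends on a complete character."""
    cut = len(data) - incomplete_utf8_suffix_len(data)
    return data[:cut].decode("utf-8")


if __name__ == "__main__":
    partial = "café".encode("utf-8")[:-1]  # stream cut in the middle of "é"
    print(incomplete_utf8_suffix_len(partial))
    print(loggable_prefix(partial))
```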

HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only …

aa79524

…AMD targets (ggml-org#14945)

CANN: update ops docs (ggml-org#14935)

61550f8

* CANN:add ops docs * CANN: update ops docs

embeddings: fix extraction of CLS pooling results (ggml-org#14927)

a118d80

* embeddings: fix extraction of CLS pooling results * merge RANK pooling into CLS case for inputs

Merge branch 'layla-build' into merge

30de771

l3utterfly merged commit 803a681 into layla-build Jul 30, 2025
17 of 53 checks passed

l3utterfly deleted the merge branch July 30, 2025 06:08

64 participants
