merge from upstream #85

l3utterfly · 2025-12-03T10:34:44Z

No description provided.

* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM
* Update ggml/include/ggml.h
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update tests/test-backend-ops.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Code review
* Whitespace
* Update tests/test-backend-ops.cpp
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* This is actually sigmoid, duh.
* Add CONST, remove TRI_KEEP, other changes from review
* Update tests/test-backend-ops.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml.c
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml.c
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cuda/unary.cu
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Remove extra script
* Update ggml/src/ggml.c
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update tests/test-backend-ops.cpp
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* moving changes from laptop [no ci]
* pre-rebase
* Update tests/test-backend-ops.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-backend-ops.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Refactor tests
* ggml : cleanup
* cont : fix ggml_fill srcs
* tests : add note
* ggml : add ggml_fill_inplace
* ggml : add asserts
* ggml : fix ggml_fill constant cast
* cont : ggml_tri minor
* Use TENSOR_LOCALS
* Fix regression from ggml-org#14596, regenerate
* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
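
For readers unfamiliar with the five ops this commit names, here is a rough reference sketch of their usual mathematical meaning in plain Python. It is not ggml code: the function names, the choice of the lower triangle for TRI and SOLVE_TRI, and the `keep_diag` parameter are assumptions made for illustration, and the commit message does not say which variants ggml actually exposes.

```python
import math

def softplus(x: float) -> float:
    # log(1 + exp(x)), rearranged so large |x| does not overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def expm1(x: float) -> float:
    # exp(x) - 1, accurate for x close to zero
    return math.expm1(x)

def cumsum(row):
    # running sum along one row
    out, total = [], 0.0
    for v in row:
        total += v
        out.append(total)
    return out

def tri_lower(m, keep_diag=True):
    # keep entries below (and optionally on) the diagonal, zero the rest
    # (assumed variant; the op may support other triangles)
    return [
        [v if (j < i or (keep_diag and j == i)) else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(m)
    ]

def solve_tri_lower(a, b):
    # forward substitution for a lower-triangular system a @ x = b
    x = [0.0] * len(b)
    for i in range(len(b)):
        s = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / a[i][i]
    return x
```

For example, `solve_tri_lower([[2.0, 0.0], [1.0, 1.0]], [4.0, 5.0])` solves `2*x0 = 4` and `x0 + x1 = 5` by working down the rows one at a time.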

* ggml-cpu: handle 3d tensors in repack mul_mat
* Removed unnecessary branch, removed need for <algorithm>
* Fixed dst_ptr pointer in chunk + clang_format
* GGML_ASSERT to check wdata within bounds
* Accidental ggml.h inclusion
* Improved GGML_ASSERT on wdata boundaries
* Address performance regression in Qwen and llama.cpp due to chunking

…nstruction (ggml-org#17048)

* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (ggml-org#17047)
* Replace 'static' workaround, with keeping variable in scope for longer
* Create std::array directly and pass into llama_grammar_init_impl
* Add back the trigger pattern
* Missed array include

* Add AFMOE model support
* Update to vocab
* Add model sizing
* Undo Rope change for ARCEE model
* Address review comments
* Update modeling code is_sliding -> use_rope, replace hard-coded logic
* Fix AFMOE tokenizer
* Update convert_hf_to_gguf.py
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update AFMoE tokenizer class identification to be more unique

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…gml-org#17158)

* vulkan: change graph_compute to be async and enable get_tensor_async

  This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor.

  Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them.

  The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue.

  The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize.

* fix thread safety errors
* teardown context cleanly
* Handle async read to non-pinned dst

…ersistence in chat UI (ggml-org#16618)

* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI
  - Purely visual and diagnostic change, no effect on model context, prompt construction, or inference behavior
  - Captured assistant tool call payloads during streaming and non-streaming completions, and persisted them in chat state and storage for downstream use
  - Exposed parsed tool call labels beneath the assistant's model info line with graceful fallback when parsing fails
  - Added tool call badges beneath assistant responses that expose JSON tooltips and copy their payloads when clicked, matching the existing model badge styling
  - Added a user-facing setting to toggle tool call visibility to the Developer settings section directly under the model selector option
* webui: remove scroll listener causing unnecessary layout updates (model selector)
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: npm run format & update webui build output
* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

…ide operator support (ggml-org#17213)

* SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access
* SYCL: update documentation and sycl.csv to reflect new unary op support
* update ops.md after syncing SYCL.csv changes
* Fix SYCL.csv merge conflict
* Update ops.md after fixing SYCL.csv conflicts
* Fix SYCL.csv tail after merge conflict and regenerate ops.md
* Fix line endings and final newline in SYCL.csv
* Remove TOPK_MOE entries from SYCL.csv as requested
* Update ops.md after removing TOPK_MOE from SYCL.csv
* Regenerated SYCL.csv and synced ops.md with upstream
* Update ops.md using create_ops_docs.py

… speed (ggml-org#17181)

* Add mul_mm_f16_f32_kq_kqv kernel
* Add ggml_cl_mul_mat_kq_kqv_adreno func
* fix whitespace
* remove unused variable
* remove redundant
* refactor and clean up
* remove trailing whitespace

- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.

* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Removed apt prompt
* Added RISC-V specific tests with corrections

  Corrections included:
  1. Changed the test names from debian to ubuntu as it is more stable than Debian Trixie
  2. Added explicit compiler in cmake command as GCC compiler below version 14 have been recorded to throw errors with rvv1.0 and some other extensions
  3. Added dependencies which are not installed by default in the RISC-V Ubuntu 24.04
  4. Separate ccache directory for all jobs as all the ccache results are not the same and may cause ccache to not work

* Resolved the merge conflict and cleaned up run.sh
* Update ci/run.sh
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Removed previously added build ci for RISC-V
* Removed trailing whitespaces
* corrected build name
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cleanup
* Enabled build tests (1)
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Enabled build tests (2)
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* enable openssl

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Faster tensors (#8)
  Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Fix .gitignore
* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

…17698)

* llama-server: fix duplicate HTTP headers in multiple models mode (ggml-org#17693)
* llama-server: address review feedback from ngxson
  - restrict scope of header after std::move
  - simplify header check (remove unordered_set)

pwilkin and others added 30 commits November 13, 2025 20:54

server: fixing naming conflict res_error (ggml-org#17243)

c4abcb2

Better UX for handling multiple attachments in WebUI (ggml-org#17246)

f1bad23

readme : add RVV,ZVFH,ZFH,ZICBOP support for RISC-V (ggml-org#17259)

307772f

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

metal : make the FA extra sizes consistent (ggml-org#17143)

2606b0a

metal : support argsort for ne00 > 1024 (ggml-org#17247)

45c6ef7

* metal : refactor argsort
* cont : sort chunks
* cont : merge sorted buckets
* cont : cleanup
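
The "sort chunks" and "merge sorted buckets" steps read like a chunked argsort followed by a merge. A minimal CPU sketch of that idea in Python follows, assuming a chunk size of 1024 taken from the limit in the commit title. The function name is hypothetical, and the in-chunk sort and merge strategy used by the actual Metal kernels are not described in this log.

```python
import heapq

def chunked_argsort(values, chunk=1024):
    """Return indices that sort `values` ascending, built from per-chunk sorts."""
    # Step 1: sort the indices inside each fixed-size chunk independently.
    runs = []
    for start in range(0, len(values), chunk):
        idx = list(range(start, min(start + chunk, len(values))))
        idx.sort(key=values.__getitem__)
        runs.append(idx)
    # Step 2: merge the sorted runs into one globally sorted index list.
    return list(heapq.merge(*runs, key=values.__getitem__))
```

Each chunk can be sorted without knowledge of the others, which is what makes the first step easy to parallelize. Only the merge needs to look across chunk boundaries.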

server : fix "can batch with" bug (ggml-org#17263)

d396b43

mtmd: add mtmd_log_set (ggml-org#17268)

9b17d74

vulkan: skip all-negative-inf blocks in FA (ggml-org#17186)

234ae7d

vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (ggml-org#17244)

439342e

* vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths
* set allow_misalign

vulkan: implement ABS and NEG (ggml-org#17245)

1568d13

* docs: update Vulkan ops
* vulkan: add NEG op
* vulkan: add ABS op

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

mtmd-cli: Avoid logging to stdout for model loading messages in mtmd-cli (ggml-org#17277)

c7b7db0

convert : set expert gating func in base class (ggml-org#17279)

9d3ef48

convert : use all parts in safetensors index (ggml-org#17286)

9a8860c

vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (ggml-org#17285)

4dca015

vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (ggml-org#17287)

24dc769

These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.

convert : remove unnecessary chat template patching (ggml-org#17289)

662192e

webui: Fix clickability around chat processing statistics UI (ggml-org#17278)

22e1ce2

* fix: Better pointer events handling in chat processing info elements
* chore: update webui build output

opencl: fix rms_norm_mul (ggml-org#17250)

52e5d42

* opencl: use subgroup reduce for reduction in rms_norm_mul
* opencl: add comment about workgroup size

server : handle context overflow during decode (ggml-org#17267)

5b2093b

* server : handle context overflow during decode
* server : minor refactor

metal : remove obsolete asserts (ggml-org#17295)

416e7c7

ci : revert ggml-org#16249 (ggml-org#17303)

8b1c339

* Delete .github/workflows/build-amd.yml
* Update build.yml

vulkan: fix MMQ quantize_y condition (ggml-org#17301)

80deff3

chadvoegele and others added 12 commits December 2, 2025 17:33

Server: Change Invalid Schema from Server Error (500) to User Error (400) (ggml-org#17572)

c4357dc

* Make invalid schema a user error (400)
* Move invalid_argument exception handler to ex_wrapper
* Fix test
* Simplify test back to original pattern
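
The pattern this commit describes, a central wrapper that turns "bad input" exceptions into a 400 instead of letting them surface as a 500, can be sketched as below. This is an illustrative Python model, not the server's C++ code. Only the names `ex_wrapper` and `invalid_argument` come from the commit message. The `InvalidArgument` class, `parse_schema`, the `json_schema` key, and the response shape are all hypothetical.

```python
class InvalidArgument(Exception):
    """Hypothetical stand-in for C++ std::invalid_argument."""

def ex_wrapper(handler):
    # Wrap a request handler so exceptions become (status, body) pairs.
    def wrapped(request):
        try:
            return 200, handler(request)
        except InvalidArgument as e:
            # The caller sent something malformed: user error.
            return 400, {"error": str(e)}
        except Exception as e:
            # Anything else is treated as a server-side failure.
            return 500, {"error": str(e)}
    return wrapped

def parse_schema(request):
    # Hypothetical handler: reject a schema that is not a JSON object.
    schema = request.get("json_schema")
    if not isinstance(schema, dict):
        raise InvalidArgument("json_schema must be an object")
    return {"ok": True}
```

Handling the mapping in one wrapper means individual handlers only need to raise the right exception type. They do not each need their own status-code logic.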

cmake : add utf8 compilation options for msvc (ggml-org#17682)

e251e5e

mtmd: fix --no-warmup (ggml-org#17695)

a96283a

server: add --media-path for local media files (ggml-org#17697)

13628d8

* server: add --media-path for local media files
* remove unused fn

build: document how to compile with Vulkan using Debian/Ubuntu packages (ggml-org#17688)

16cc3c6

ggml, llama : use defaulted constructors/destructors (ggml-org#17649)

37adc9c

ci : move release details to the top visible by default (ggml-org#17719)

b3e3060

Merge branch 'layla-build' into merge

189f585

l3utterfly merged commit 4c94c65 into layla-build Dec 3, 2025
16 of 17 checks passed

l3utterfly deleted the merge branch December 3, 2025 10:38

github-actions bot added the documentation, SYCL, Nvidia GPU, Vulkan, testing, build, examples, devops, python, server, ggml, Apple Metal, script, Ascend NPU, OpenCL, and model labels Dec 3, 2025

20 participants
