forked from ggml-org/llama.cpp
merge from upstream #60
Merged
Conversation
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* ggml : fix MUL_MAT_ID repack with Q8_K ggml-ci
* ggml : improve repack templates ggml-ci
* convert : fix squeeze for ssm_conv tensors
* convert : match ssm_conv tensors by type
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
… backend (ggml-org#12566)
* [Fix] Compiling clip-quantize-cli and running it in a CUDA environment causes ggml_fp16_to_fp32 to report an error when it tries to access video memory; quantize has to run on the CPU backend. After the fix, it automatically runs on the CPU backend and is no longer bound to CUDA.
* [Fix] Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.
* metal : refactor mat-vec code ggml-ci
* metal : rename all_sum -> sum_all ggml-ci
* metal : fix comments [no ci]
* metal : fix nr constant [no ci]
* metal : mv q6_K support nr0 > 1 ggml-ci
* metal : reduce register pressure ggml-ci
* metal : fix typo [no ci]
* metal : reduce register pressure ggml-ci
* SYCL: implement memset ggml backend buffer interface
* use GGML_ABORT macro
* Do not wait for all queues to finish for memset operation
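For context, a minimal sketch of what a SYCL memset hook in the ggml backend buffer interface can look like; the struct and function names are illustrative, not the exact upstream symbols. The last bullet's point shows up as waiting only on the one memset operation rather than draining every queue:

```cpp
// Illustrative sketch, not the upstream code: service the buffer memset
// hook with sycl::queue::memset and wait only for that one operation.
static void ggml_backend_sycl_buffer_memset_tensor(
        ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
        uint8_t value, size_t offset, size_t size) {
    auto * ctx = (ggml_backend_sycl_buffer_context *) buffer->context;
    // memset the requested byte range of the tensor's device allocation;
    // no device-wide sync, only this operation must complete
    ctx->queue->memset((char *) tensor->data + offset, value, size).wait();
}
```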
* llama : make loras compatible with repacking ggml-ci
* cont : simplify ggml-ci
* cont : add TODO [no ci]
* ggml : add 128-bit RVV support
* ggml : revert to old RVV 256+ q2_K, q3_K, q4_K, q6_K impl
* remove trailing whitespaces
* restructure vector length selection code
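As a sketch of what vector-length selection can look like (assumed dispatch logic, not the upstream restructuring), the code can branch on the runtime VLEN reported by the RVV intrinsics:

```c
#include <riscv_vector.h>

// Assumed helper: vsetvlmax at e8/m1 yields the element count of one
// vector register at SEW=8, i.e. VLEN in bytes (16 bytes => 128 bits).
static inline int rvv_vlen_bits(void) {
    return (int) __riscv_vsetvlmax_e8m1() * 8;
}

// Dispatch sketch: keep the old >= 256-bit q2_K/q3_K/q4_K/q6_K paths,
// fall back to the new 128-bit implementation on narrower hardware.
// if (rvv_vlen_bits() >= 256) { /* 256+ path */ } else { /* 128-bit path */ }
```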
This change upstreams llamafile's CPU matrix multiplication kernels for the ppc64le ISA using MMA builtins. The patch handles matrix multiplication between the quantised data types block_q4_0 and block_q8_0. It results in a 5% - 50% improvement in total speed (i.e. all tokens/total time) across various batch sizes. The patch is tested with Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
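For readers unfamiliar with these formats, a scalar reference of the block_q4_0 x block_q8_0 dot product that such kernels accelerate is sketched below; the block layouts follow ggml's 32-element blocks, and fp16_to_fp32 stands in for ggml's half-to-float helper:

```c
#include <stdint.h>

#define QK 32  // elements per block in q4_0 and q8_0

typedef struct { uint16_t d; uint8_t qs[QK / 2]; } block_q4_0; // fp16 scale + 32 packed nibbles
typedef struct { uint16_t d; int8_t  qs[QK];     } block_q8_0; // fp16 scale + 32 int8 values

float fp16_to_fp32(uint16_t h); // assumed helper, e.g. ggml_fp16_to_fp32

// Scalar reference: the MMA kernels compute this same sum with POWER10
// matrix-multiply-assist builtins instead of a per-element loop.
float dot_q4_0_q8_0(const block_q4_0 * x, const block_q8_0 * y, int nb) {
    float acc = 0.0f;
    for (int b = 0; b < nb; ++b) {
        int isum = 0;
        for (int i = 0; i < QK / 2; ++i) {
            const int lo = (x[b].qs[i] & 0x0F) - 8; // low nibble  -> [-8, 7]
            const int hi = (x[b].qs[i] >> 4)   - 8; // high nibble -> [-8, 7]
            isum += lo * y[b].qs[i] + hi * y[b].qs[i + QK / 2];
        }
        acc += fp16_to_fp32(x[b].d) * fp16_to_fp32(y[b].d) * (float) isum;
    }
    return acc;
}
```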
ggml-ci
ggml-ci
* add edgellm model arch [conversation feature doesn't work]
* remove output.weight layer for edgellm arch
* [Model] update the name of the model
* update the name of the model arch in convert gguf
* [Model] Refactor the model arch into llama-model
* [Bug] Fix the bug in create attn kv
* [Code] Fix editorconfig errors
* [Code] Remove trailing whitespace
* [Code] Remove trailing whitespace
* [Code] Change the order of model arch in list
* [Code] Fix flake8 lint errors
* Remove trailing white space
* [Code] Remove call in model arch
…g#12600)
* opencl: add `im2col`
* opencl: add `gelu_quick`
* opencl: add mrope
* opencl: add vision rope
* server : Bump cpp-httplib to include AF_UNIX windows support
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
* server : Allow running the server example on a unix socket
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
---------
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
…org#12496)
* rpc : send hash when tensor data is above some fixed threshold
ref ggml-org#10095
* rpc : put cache under $HOME/.cache/llama.cpp
* try to fix win32 build
* another try to fix win32 build
* remove llama as dependency
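The caching idea, sketched in C++ terms; the threshold value, hash choice, and transport names below are assumptions for illustration, not the upstream RPC protocol:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical transport interface, for illustration only.
struct Connection {
    bool send_data(const void * data, size_t size); // push raw tensor bytes
    bool send_hash(uint64_t h);                     // true if the server cache hits
};

// FNV-1a, shown as one possible content hash.
static uint64_t fnv1a(const void * data, size_t size) {
    const auto * p = (const uint8_t *) data;
    uint64_t h = 0xcbf29ce484222325ull;
    for (size_t i = 0; i < size; ++i) { h ^= p[i]; h *= 0x100000001b3ull; }
    return h;
}

// Above a fixed threshold, send the hash first and transfer the bytes only
// on a cache miss; the server stores them under $HOME/.cache/llama.cpp.
bool rpc_set_tensor(Connection & conn, const void * data, size_t size) {
    const size_t threshold = 10u * 1024 * 1024; // assumed cutoff
    if (size < threshold) {
        return conn.send_data(data, size);
    }
    if (conn.send_hash(fnv1a(data, size))) {
        return true; // server already has this tensor cached
    }
    return conn.send_data(data, size);
}
```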
This patch enables usage of MMA when one of the dimensions of the matrix (i.e. either M or N) is 1. This is useful for token generation, where N < 2. The concept of 'GEMV forwarding' is used: when one of the matrices has a single row/column, its elements are broadcast instead of being prepacked with the packing routine. This change results in a 5% - 15% improvement in total speed (i.e. all tokens/total time) across various batch sizes, in comparison with the corresponding dot-product implementation. The patch is tested with FP32 models of Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
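A scalar picture of what the forwarded path computes (illustrative, not the kernel itself): with N == 1 the product collapses to y = A*x, so the lone column can be broadcast into the accumulators instead of being run through the packing routine:

```c
#include <stddef.h>

// Reference for the N == 1 case of an (M x K) by (K x 1) multiplication.
// The MMA path broadcasts x across rows rather than prepacking it.
void gemv_ref(int M, int K, const float * A, const float * x, float * y) {
    for (int i = 0; i < M; ++i) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[(size_t) i * K + k] * x[k]; // x[k] is reused (broadcast) by every row
        }
        y[i] = acc;
    }
}
```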
… enabled (ggml-org#12603)
* Include speculative decoding stats when timings_per_token is true
New fields added to the `timings` object:
- draft_n : number of draft tokens generated
- draft_accepted_n : number of draft tokens accepted
- draft_accept_ratio : ratio of accepted/generated
* Remove redundant draft_accept_ratio var
* add draft acceptance rate to server console output
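With the option enabled, a response could carry the new fields like this (field names from the change above; values are illustrative):

```json
{
  "timings": {
    "draft_n": 80,
    "draft_accepted_n": 56,
    "draft_accept_ratio": 0.7
  }
}
```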
…12272)
* vulkan: fix coopmat shader generation when cross-compiling
Previously the status of coopmat{,2} support wasn't passed to the vulkan-shaders-gen project built on the host, which led to a build failure because the cross-compiling code expected coopmat{,2} shaders that never got generated. Fix this by passing the coopmat{,2} support status to the vulkan-shaders subproject.
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
* Only call coop-mat shaders once
* Fix whitespace
---------
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>
* ggml : FA with different K, V head sizes (CPU) ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128 ggml-ci
* ggml : restrict op on other backends to equal head sizes ggml-ci
* metal : optimize FA-vec kernel ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition ggml-ci
* metal : fix comments + remove unnecessary addition ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id ggml-ci
…g#12747)
Fixes an error for compiler paths with spaces.
…io project/solution (ggml-org#12625)
* gguf-split now respects dry-run option
* removing trailing space
* Upgrade daisyui, tailwindcss.
* Switch to all themes.
* Revert a change.
* Update formatting.
* Install packages before npm build.
* Revert "Install packages before npm build."
This reverts commit 336c514.
* Add index.html.gz
* run build
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* sync: minja google/minja#57
* fix json include
* common: custom hf endpoint support
Add support for custom Hugging Face endpoints via the HF_ENDPOINT environment variable. You can now specify a custom endpoint with the HF_ENDPOINT environment variable when using the --hf-repo flag; it works similarly to huggingface-cli's endpoint configuration.
Example usage:
HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"
The trailing slash in the URL is optional:
HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"
* Update common/arg.cpp readability improvement
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Apply suggestions from code review
---------
Co-authored-by: ベアトリーチェ <148695646+MakiSonomura@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* refactor clip_init
* fix loading file
* fix style
* test ok
* better test with report
* add missing headers
* clarify
* add KEY_MM_PATCH_MERGE_TYPE
* remove bool has_* pattern
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/llava/clip.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* use ggml_soft_max_ext
* refactor logging system
* add minicpm-v-o 2.6 for testing
* use nullptr everywhere
* fix Yi-VL model
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* arg.cpp: add a missing include
* gemma3-cli.cpp: fix cinttypes include
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
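A minimal sketch of why the finite seed matters: if an entire row is masked to -inf and the running maximum is also seeded with -inf, the softmax shift evaluates (-inf) - (-inf) = NaN, while a large finite seed keeps every intermediate well defined:

```c
#include <float.h>
#include <math.h>

// Seeding with -FLT_MAX/2 instead of -INFINITY: when every x[i] is a -inf
// mask value, row_max stays finite, so expf(x[i] - row_max) underflows to 0
// instead of computing expf(-inf - (-inf)), which is NaN.
float softmax_denominator(const float * x, int n) {
    float row_max = -FLT_MAX / 2;
    for (int i = 0; i < n; ++i) {
        row_max = fmaxf(row_max, x[i]);
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += expf(x[i] - row_max);
    }
    return sum;
}
```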
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* CANN: Refactor to reduce duplicate code
* CANN: fix review comment
Labels
Apple Metal
devops
documentation
examples
ggml
Nvidia GPU
python
script
server
SYCL
testing
Vulkan