Sync master with upstream release b5093 #51

jan-service-account · 2025-04-10T00:08:13Z

Updates dev branch with latest release (b5093) from ggml-org/llama.cpp

* CANN: Refactor to reduce duplicate code
* CANN: fix review comment

* cmake : enable curl by default
* no curl if no examples
* fix build
* fix build-linux-cross
* add windows-setup-curl
* fix
* shell
* fix path
* fix windows-latest-cmake*
* run: include_directories
* LLAMA_RUN_EXTRA_LIBS
* sycl: no llama_curl
* no test-arg-parser on windows
* clarification
* try riscv64 / arm64
* windows: include libcurl inside release binary
* add msg
* fix mac / ios / android build
* will this fix xcode?
* try clearing the cache
* add bunch of licenses
* revert clear cache
* fix xcode
* fix xcode (2)
* fix typo

…et_tensor (ggml-org#12734)

… (ggml/1167)

* cpu: refactor SIMD mappings and vectorized op functions into separate files
* Fix warning for ggml_float to float
* Fix warnings
* cpu: move all the operations (except mul_mat) to a separate c++ file
* fix whitespace
* Update ggml/src/ggml-cpu/vec.h
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp
* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* add bf16 support
* use convert_from_bf16_cuda instead of convert_unary_cuda for f32
* revert 7ec5085
* move functionality into convert_unary with constexpr

* ggml : simplify Arm fp16 CPU logic ggml-ci
* cont : bring back CUDA/MUSA checks ggml-ci

ggml-ci

* llama4 conversion
* initial support, no chat template
* clean up a bit
* fix tokenizer conversion
* correct hparams
* try this
* fix shexp
* ffn_inp_normed
* chat template
* clean up model conversion
* add_bos
* add scale_before_ffn
* fix order
* weight_before_ffn
* llm_graph_input_attn_temp
* add chunk attn mask
* build_inp_attn_scale()
* add comment about ggml_repeat
* clarify comments
* fix build

* gguf-py : support lazy tensor splitting
  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.
* gguf-py : fix flake8 lint

…uffer_set_tensor" (ggml-org#12812)

* Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…"
  This reverts commit 518a014.
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* rm tail space

…rg#12785)

* Update ChatScreen.tsx
* useAutosizeTextarea.ts
  useAutosizeTextarea to encapsulate the logic.
* Implement responsive auto-sizing chat textarea
  Replaces the manual textarea resizing with an automatic height adjustment based on content.
  - `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization
  - Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up).
  - Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability.
  - Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize.
* - update compressed index.html.gz after npm run build
  - refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook
* chore: normalize line endings to LF
  refactor: AutosizeTextareaApi -> chatTextareaApi
* refactor: Rename interface to PascalCase

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

…text (ggml-org#12824) Signed-off-by: dm4 <sunrisedm4@gmail.com>

…12825)

* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used ggml-ci
* server : add test that exercises embeddings with FA enabled ggml-ci

…ml-org#12834)

This allows BF16 KV-cache on CUDA.

…#12783) This is consistent with the ggml-cuda behavior and the mul_mat fallback.

…gml-org#12833)

q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap. This improves q4_k/q5_k model prompt processing performance by around 5-7%.

I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0. The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.
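The loop restructuring described above can be illustrated with a small CPU-side sketch. This is not the Vulkan shader: the block layout (two 4-bit scales per byte, `SUB` values per sub-block) and the function names are hypothetical, chosen only to show scale decoding being hoisted out of the inner loop so that it runs once per block instead of once per value.

```python
from typing import List, Tuple

SUB = 4  # values per sub-block (toy size, NOT the real q4_k layout)

Block = Tuple[bytes, List[int]]  # (packed 4-bit scales, quantized values)


def decode_scale(packed: bytes, sub: int) -> int:
    """Unpack the 4-bit scale for sub-block `sub` (two scales per byte)."""
    byte = packed[sub // 2]
    return (byte >> 4) if sub % 2 else (byte & 0x0F)


def dequant_naive(blocks: List[Block]) -> Tuple[List[int], int]:
    """Decode the scale again for every value (the redundant pattern)."""
    out, decodes = [], 0
    for packed, quants in blocks:
        for i, q in enumerate(quants):
            out.append(q * decode_scale(packed, i // SUB))
            decodes += 1
    return out, decodes


def dequant_hoisted(blocks: List[Block]) -> Tuple[List[int], int]:
    """Decode all scales once per block, then reuse them in the inner loop."""
    out, decodes = [], 0
    for packed, quants in blocks:
        n_sub = len(quants) // SUB
        # Stand-in for the "copy/decode into shared memory" step.
        scales = [decode_scale(packed, s) for s in range(n_sub)]
        decodes += n_sub
        for i, q in enumerate(quants):
            out.append(q * scales[i // SUB])
    return out, decodes


blocks = [(bytes([0x21, 0x43]), list(range(16)))]
naive_out, naive_decodes = dequant_naive(blocks)
hoisted_out, hoisted_decodes = dequant_hoisted(blocks)
```

Both functions produce the same values; only the number of scale decodes differs. Whether hoisting pays off on a GPU depends on the format, which matches the note above that the same idea did not help q6_k or q4_0.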

* [CANN] Support ELU and CONV_TRANSPOSE_1D
* [CANN]Modification review comments
* [CANN]Modification review comments
* [CANN]name adjustment
* [CANN]remove lambda used in template
* [CANN]Use std::func instead of template
* [CANN]Modify the code according to the review comments

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* fix: detach common from the library
* fix: building chat test template

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* add qwen3 & qwen3moe support.
* fix

---------

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>

error: ISO C++17 does not allow 'register' storage class specifier

hipudding and others added 30 commits April 7, 2025 17:10

CANN: Refactor to reduce duplicate code (ggml-org#12731)

d0d5b22

* CANN: Refactor to reduce duplicate code
* CANN: fix review comment

CANN: fix typo in ggml-cann (ggml-org#12733)

52b3d71

ci : no curl on ggml-ci (ggml-org#12796)

e391d3e

sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)

518a014

CUDA: don't convert BF16 weights to FP32 (ggml/1174)

36ca8b3

* add bf16 support
* use convert_from_bf16_cuda instead of convert_unary_cuda for f32
* revert 7ec5085
* move functionality into convert_unary with constexpr

ggml : simplify Arm fp16 CPU logic (ggml/1177)

ff067db

* ggml : simplify Arm fp16 CPU logic ggml-ci
* cont : bring back CUDA/MUSA checks ggml-ci

sync : ggml

a4e46e2

ggml-ci

cuda : fix HIP and MUSA BF16 (#0)

1a1ab7e

ggml-ci

hellaswag: display estimated score confidence interval (ggml-org#12797)

4ccea21

opencl: better identify Adreno GPU (ggml-org#12760)

8297401

gguf-py : support lazy tensor splitting (ggml-org#12809)

a226bc7

* gguf-py : support lazy tensor splitting
  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.
* gguf-py : fix flake8 lint
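The point about tuples and eager evaluation can be sketched with a toy deferred-value wrapper. This is not the gguf-py implementation: the `Lazy` class and `lazy_split` helper are hypothetical names used only to show a split that returns a tuple of deferred parts without forcing the source.

```python
from typing import Any, Callable, List, Tuple


class Lazy:
    """Deferred value: holds a thunk and evaluates it at most once."""

    def __init__(self, thunk: Callable[[], Any]) -> None:
        self._thunk = thunk
        self._value: Any = None
        self._done = False

    def get(self) -> Any:
        if not self._done:
            self._value = self._thunk()
            self._done = True
        return self._value


def lazy_split(src: Lazy, n_parts: int) -> Tuple[Lazy, ...]:
    """Return a tuple of Lazy parts; `src` is not evaluated here."""

    def part(i: int) -> Lazy:
        def thunk() -> Any:
            data = src.get()  # source is forced only when a part is needed
            size = len(data) // n_parts
            return data[i * size:(i + 1) * size]

        return Lazy(thunk)

    return tuple(part(i) for i in range(n_parts))


loads: List[int] = []  # records each time the "tensor" is actually loaded


def load_tensor() -> List[int]:
    loads.append(1)
    return list(range(6))


parts = lazy_split(Lazy(load_tensor), 3)
loads_after_split = len(loads)   # nothing has been loaded yet
middle = parts[1].get()          # forces the source exactly once
first = parts[0].get()           # reuses the cached source
```

A naive split would call `src.get()` up front just to build the tuple; wrapping each element separately keeps the whole chain deferred until a part is actually read.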

arg : Including limits file on AIX (ggml-org#12822)

1d343b4

llava: add more helper functions to check projector types in clip context (ggml-org#12824)

2dabf75

Signed-off-by: dm4 <sunrisedm4@gmail.com>

server : fix thread.join() on exit (ggml-org#12831)

78a1ba0

llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)

a19b5ce

* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used ggml-ci
* server : add test that exercises embeddings with FA enabled ggml-ci

llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)

b32efad

cuda : add f32 to bf16 copy op (ggml-org#12806)

7538246

This allows BF16 KV-cache on CUDA.
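For readers unfamiliar with the format, the numeric side of an f32 to bf16 copy can be sketched as follows. This is not the CUDA kernel from the commit: it is a host-side illustration, with hypothetical function names, of the common approach of keeping the upper 16 bits of the float32 bit pattern with round-to-nearest-even.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Convert a float to bf16 bits (upper 16 bits of f32, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: truncate and force a mantissa bit so it stays a NaN.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add a rounding bias; the extra +1 on odd results implements ties-to-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(h: int) -> float:
    """Widen bf16 bits back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


one = f32_to_bf16_bits(1.0)
tie_down = f32_to_bf16_bits(1.0 + 2.0 ** -8)      # halfway case, rounds to even
tie_up = f32_to_bf16_bits(1.0 + 3.0 * 2.0 ** -8)  # halfway case, rounds up to even
round_trip = bf16_bits_to_f32(f32_to_bf16_bits(-2.5))
```

bf16 keeps the full float32 exponent range and only 7 explicit mantissa bits, so values such as `-2.5` survive the round trip exactly while finer fractions are rounded.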

vulkan: Use fp16 for the flash attention P*V multiplication (ggml-org#12783)

7ecd780

This is consistent with the ggml-cuda behavior and the mul_mat fallback.

readme : add rpc backend (ggml-org#12842)

47277d6

clip : do not print ftype (ggml-org#12832)

65a69e6

ci: detach common from the library (ggml-org#12827)

381603a

* fix: detach common from the library
* fix: building chat test template

sycl: update documentation to use -no-cnv (ggml-org#12845)

8ed7124

musa: enable freediskspace for docker image build (ggml-org#12839)

d9a63b2

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

bozheng-hit and others added 2 commits April 9, 2025 11:47

llama : Support Qwen3 and Qwen3MoE (ggml-org#12828)

d3bd719

* add qwen3 & qwen3moe support.
* fix

---------

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>

ggml-impl.h: fix build on POWER9 (ggml-org#12855)

2391506

error: ISO C++17 does not allow 'register' storage class specifier

jan-service-account merged commit 4969c29 into dev Apr 10, 2025
15 checks passed

jan-service-account deleted the update-dev-from-master-2025-04-10-00-08 branch April 10, 2025 00:17


23 participants
