forked from ggml-org/llama.cpp
merge from upstream #61
Merged
Conversation
* cmake : enable curl by default
* no curl if no examples
* fix build
* fix build-linux-cross
* add windows-setup-curl
* fix
* shell
* fix path
* fix windows-latest-cmake*
* run: include_directories
* LLAMA_RUN_EXTRA_LIBS
* sycl: no llama_curl
* no test-arg-parser on windows
* clarification
* try riscv64 / arm64
* windows: include libcurl inside release binary
* add msg
* fix mac / ios / android build
* will this fix xcode?
* try clearing the cache
* add bunch of licenses
* revert clear cache
* fix xcode
* fix xcode (2)
* fix typo
… (ggml/1167)
* cpu: refactor SIMD mappings and vectorized op functions into separate files
* Fix warning for ggml_float to float
* Fix warnings
* cpu: move all the operations (except mul_mat) to a separate c++ file
* fix whitespace
* Update ggml/src/ggml-cpu/vec.h

  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp
* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
* add bf16 support
* use convert_from_bf16_cuda instead of convert_unary_cuda for f32
* revert 7ec5085
* move functionality into convert_unary with constexpr
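As a rough illustration of the last item, here is a minimal C++ sketch of the `if constexpr` dispatch idea behind folding the bf16 path into a single conversion routine. The type name, helper, and loop below are assumptions for illustration, not the actual CUDA kernel from this change.

```cpp
// Sketch only: dispatch on the source type at compile time inside one
// conversion routine. ggml_bf16_sketch and bf16_to_f32 are hypothetical
// stand-ins, not the real ggml/CUDA types.
#include <cstdint>
#include <cstring>
#include <type_traits>

struct ggml_bf16_sketch { uint16_t bits; };

static float bf16_to_f32(ggml_bf16_sketch x) {
    uint32_t u = (uint32_t) x.bits << 16;   // bf16 is the high 16 bits of an f32
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

template <typename src_t>
static void convert_unary_sketch(const src_t * src, float * dst, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        if constexpr (std::is_same_v<src_t, ggml_bf16_sketch>) {
            dst[i] = bf16_to_f32(src[i]);   // bf16 path added by this change
        } else {
            dst[i] = (float) src[i];        // pre-existing conversion path
        }
    }
}
```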
* ggml : simplify Arm fp16 CPU logic

  ggml-ci
* cont : bring back CUDA/MUSA checks

  ggml-ci
ggml-ci
* llama4 conversion
* initial support, no chat template
* clean up a bit
* fix tokenizer conversion
* correct hparams
* try this
* fix shexp
* ffn_inp_normed
* chat template
* clean up model conversion
* add_bos
* add scale_before_ffn
* fix order
* weight_before_ffn
* llm_graph_input_attn_temp
* add chunk attn mask
* build_inp_attn_scale()
* add comment about ggml_repeat
* clarify comments
* fix build
* gguf-py : support lazy tensor splitting

  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.
* gguf-py : fix flake8 lint
…uffer_set_tensor" (ggml-org#12812)
* Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…"

  This reverts commit 518a014.
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* rm tail space
…rg#12785)
* Update ChatScreen.tsx
* useAutosizeTextarea.ts

  useAutosizeTextarea to encapsulate the logic.
* Implement responsive auto-sizing chat textarea

  Replaces the manual textarea resizing with an automatic height adjustment based on content.
  - `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization
  - Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up).
  - Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability.
  - Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize.
* -update compressed index.html.gz after npm run build
  -refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook
* chore: normalize line endings to LF

  refactor: AutosizeTextareaApi -> chatTextareaApi
* refactor: Rename interface to PascalCase

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
…text (ggml-org#12824)

Signed-off-by: dm4 <sunrisedm4@gmail.com>
…12825)
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used

  ggml-ci
* server : add test that exercises embeddings with FA enabled

  ggml-ci
This allows BF16 KV-cache on CUDA.
…#12783)

This is consistent with the ggml-cuda behavior and the mul_mat fallback.
…gml-org#12833)

q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with an unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.

I briefly tried applying this to q6_k and q4_0; it didn't help for q6_k and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.
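To make the restructuring concrete, here is a hedged plain-C++ sketch rather than the actual mul_mm_cm2.comp shader: the scale bytes are decoded once per outer block iteration instead of on every inner iteration. The block sizes and the byte-to-float "decode" are toy stand-ins, not the real q4_k/q5_k layout.

```cpp
// Sketch only: outer loop over whole blocks, scales decoded once per block.
#include <cstdint>
#include <vector>

constexpr int QK      = 32;  // assumed quants per block
constexpr int NSCALES = 8;   // assumed scales per block

float dot_sketch(const std::vector<uint8_t> & q,
                 const std::vector<uint8_t> & scale_bytes,
                 const std::vector<float>   & x) {
    const int nblocks = (int) q.size() / QK;
    float acc = 0.0f;

    for (int b = 0; b < nblocks; ++b) {
        // decode the per-block scale data once, up front (previously this
        // load+decode happened inside the inner loop on every iteration)
        float scales[NSCALES];
        for (int s = 0; s < NSCALES; ++s) {
            scales[s] = (float) scale_bytes[b * NSCALES + s];
        }

        // inner loop over the whole block (unrolled in the real shader)
        for (int i = 0; i < QK; ++i) {
            const int idx = b * QK + i;
            acc += scales[i * NSCALES / QK] * (float) q[idx] * x[idx];
        }
    }
    return acc;
}
```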
* [CANN] Support ELU and CONV_TRANSPOSE_1D
* [CANN] Modification review comments
* [CANN] Modification review comments
* [CANN] name adjustment
* [CANN] remove lambda used in template
* [CANN] Use std::func instead of template
* [CANN] Modify the code according to the review comments

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
* fix: detach common from the library
* fix: building chat test template
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* add qwen3 & qwen3moe support.
* fix

---------

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
Granite's FIM tokens are very similar to Qwen's; it's just that they use an underscore instead of a dash, so for example <fim_middle> instead of <fim-middle>.

Opening tokenizer_config.json in ibm-granite/granite-3.3-8b-base shows:
```
"<fim_prefix>",
"<fim_middle>",
"<fim_suffix>",
"<fim_pad>",
...
"<reponame>",
```
Submit operators using asynchronous threads to improve performance. Use the environment variable GGML_CANN_ASYNC_MODE to control whether asynchronous submission is enabled. It is disabled by default. Testing shows a 10%–20% performance improvement in scenarios with small parameter sizes, especially in quantized models.
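Roughly, the toggle works like the sketch below: when GGML_CANN_ASYNC_MODE is set, operators are queued to a worker thread; otherwise they run inline. This is an assumption-laden C++ illustration of the idea, not the CANN backend code, and the class and function names are hypothetical.

```cpp
// Sketch only: env-var-gated asynchronous submission via a single worker
// thread draining a task queue.
#include <condition_variable>
#include <cstdlib>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

static bool async_mode_enabled() {
    const char * v = std::getenv("GGML_CANN_ASYNC_MODE");
    return v != nullptr && v[0] != '\0' && v[0] != '0';  // disabled by default
}

class op_submitter {
public:
    op_submitter() : worker([this] { run(); }) {}
    ~op_submitter() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        worker.join();
    }
    void submit(std::function<void()> op) {
        if (!async_mode_enabled()) {
            op();  // default: synchronous execution on the calling thread
            return;
        }
        { std::lock_guard<std::mutex> lk(m); tasks.push(std::move(op)); }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> op;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;
                op = std::move(tasks.front());
                tasks.pop();
            }
            op();  // execute the queued operator off the submitting thread
        }
    }
    std::queue<std::function<void()>> tasks;
    std::mutex                        m;
    std::condition_variable           cv;
    bool                              done = false;
    std::thread                       worker;  // declared last: starts after the rest
};
```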
…-org#12953)
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models

  ggml-ci
* llama : minor naming updates

  ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes

  ggml-ci
Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org. Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like ggml-org#12465.
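As a rough illustration (not the actual RPC code), the hello exchange can be thought of as a reply carrying semver fields plus a client-side compatibility check; the struct and field names below are hypothetical.

```cpp
// Sketch only: a HELLO reply carrying the protocol version, checked against
// what the client was built for.
#include <cstdint>
#include <cstdio>

struct rpc_hello_sketch {
    uint8_t major;  // incremented on breaking protocol changes
    uint8_t minor;  // incremented on backwards-compatible additions
    uint8_t patch;  // bug fixes only
};

static bool protocol_compatible(const rpc_hello_sketch & server,
                                uint8_t client_major, uint8_t client_minor) {
    if (server.major != client_major) {
        // per semver, a different major version means a breaking change
        std::fprintf(stderr, "RPC protocol mismatch: server %u.%u.%u\n",
                     (unsigned) server.major, (unsigned) server.minor, (unsigned) server.patch);
        return false;
    }
    // same major: the server only needs to offer at least the minor features
    // the client relies on
    return server.minor >= client_minor;
}
```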
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-defined ID (fixed)
* fix prompt_modified
* rm redundant data member
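A hedged sketch of the ownership shape described above; the types and fields are illustrative stand-ins, not the real mtmd API.

```cpp
// Sketch only: image tokens held through shared_ptr so both the caller and
// the input chunks can keep them alive, with a user-defined ID in place of
// an internally computed image hash. Names are hypothetical.
#include <memory>
#include <string>

struct mtmd_image_tokens_sketch {
    int nx = 0;        // token grid width
    int ny = 0;        // token grid height
    std::string id;    // user-defined ID instead of a computed hash
};

using image_tokens_ptr = std::shared_ptr<mtmd_image_tokens_sketch>;

struct input_chunk_sketch {
    image_tokens_ptr image;  // shared ownership with the caller
};
```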
* SYCL: refactor move to a separate file
* Fix binbcast
* Remove duplicates
* fix include formatting
* fix typo
* server : use std::move whenever possible
* use r-value ref
* Apply suggestions from code review

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make task creation scoped
* restore std::move
* fix task_id not set correctly
* apply changes from suggestion

  Co-authored-by: ggerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
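The gist of the std::move pattern, as a hedged standalone sketch (the types below are stand-ins, not the server's real task structures): heavy payloads are moved into the task rather than copied, and fields such as the task id are filled in before the task itself is moved into the queue.

```cpp
// Sketch only: move heavy payloads instead of copying them when creating
// and queueing tasks. task_sketch is illustrative, not the server's type.
#include <string>
#include <utility>
#include <vector>

struct task_sketch {
    int         id = -1;
    std::string prompt;
};

static std::vector<task_sketch> queue_sketch;

static void post_task(task_sketch && t) {
    queue_sketch.push_back(std::move(t));  // move into the queue, no copy
}

static void create_tasks(std::vector<std::string> prompts) {
    int next_id = 0;
    for (auto & p : prompts) {
        task_sketch t;
        t.id     = next_id++;          // set the id before the task is moved
        t.prompt = std::move(p);       // r-value: the string buffer is moved
        post_task(std::move(t));
    }
}
```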
This restores the behavior from ggml-org#491. This does not affect Ctrl+D's ability to terminate --multiline-input lines (ggml-org#1040). This also actually implements ggml-org#587: "If the user wants the text to end in a newline, this should be accomplished by explicitly adding a newline by using \ followed by return, then returning control by pressing return again." Fixes ggml-org#12949
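For illustration only, the "\ followed by return" convention quoted above can be sketched in C++ as follows; this is not the interactive console code in the repository, just the input rule it describes.

```cpp
// Sketch only: a trailing backslash inserts an explicit newline and keeps
// reading; a plain return (including an empty line) ends the input.
#include <iostream>
#include <string>

static std::string read_multiline_sketch() {
    std::string out, line;
    while (std::getline(std::cin, line)) {
        if (!line.empty() && line.back() == '\\') {
            line.pop_back();   // drop the continuation backslash
            out += line;
            out += '\n';       // the user explicitly asked for a newline
            continue;
        }
        out += line;
        break;                 // plain return ends the input
    }
    return out;
}
```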
…ml-org#13011)
* clip : refactor, add `image_manipulation` and `llava_uhd`
* refactor llava-1.6 preprocessing
* simplify logic for llava-1.5
* missing include
Labels: android, Apple Metal, build, devops, documentation, examples, ggml, Nvidia GPU, python, script, server, SYCL, testing, Vulkan