Sync master with upstream release b6314 #221

jan-service-account · 2025-08-29T00:12:10Z

Updates dev branch with latest release (b6314) from ggml-org/llama.cpp

* add fused group_norm/norm, mul, add
* fix spacing
* revert rms_norm logic
* fix trailing whitespace

This commit updates the bash completion script to include the -m short option for the --model argument. The motivation is that tab completion currently works only for the full --model option, and it is nice to have it work for the short option as well.

* ggml-cpu : add basic RVV support for vector f32 ops
* ggml-cpu : add RVV support for f32 softmax

…15561)

* CANN(flash-attn): refactor mask handling and improve performance

  1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
  2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
  3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

  Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

  Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimization FA BNSD to BSND

  Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>

…rg#15610) ggml-ci

…org#15615)

…rg#15622) Prior to this change, we faced undefined cublasLt references when attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux. We add linking with CUDA::cublasLt_static when CUDA version is greater than 10.1.

This commit adds a new target to the Makefile for converting models that are multimodal. This target will convert the original model and in addition also create the mmproj GGUF model.

The motivation for this change is that for models that are multimodal, for example those that contain a vision encoder, we will often want to upload both the quantized model and the vision encoder model to HuggingFace.

Example usage:

```console
$ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
...
The environment variable CONVERTED_MODEL can be set to this path using:
export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
```

The converted original model can then be quantized, and after that both the quantized model and the mmproj file can then be uploaded to HuggingFace.

Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main

…15604)

* Change to warn instead of debug, to explain reason for stopping.
* Update tools/main/main.cpp

  Fix printing --2

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* gguf-py: implement byteswapping for Q4_0

  This is needed to byteswap the Mistral model.

  Also restore original shapes after byteswapping tensors. This is not needed at the moment, but is done in case the shapes are used in the future.

* Rework byteswapping code in gguf-py

  Move out details from byteswapping tensor blocks code
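To illustrate what byteswapping a Q4_0 tensor involves, here is a minimal sketch. It is not the gguf-py implementation. It assumes the ggml Q4_0 block layout of 32 weights per block, stored as a 2-byte f16 scale followed by 16 bytes of packed 4-bit quants (18 bytes per block). The function name `byteswap_q4_0` and the constant names are hypothetical. Under that assumption only the 2-byte scale is multi-byte, so it is the only field whose byte order changes.

```python
# Assumed Q4_0 block layout: [f16 scale (2 bytes)] [16 bytes of packed nibbles]
Q4_0_SCALE_BYTES = 2
Q4_0_QUANT_BYTES = 16
Q4_0_BLOCK_BYTES = Q4_0_SCALE_BYTES + Q4_0_QUANT_BYTES  # 18


def byteswap_q4_0(data: bytes) -> bytes:
    """Return a copy of raw Q4_0 tensor data with each block's f16 scale byteswapped.

    The packed 4-bit quants are single bytes, so they are left untouched.
    """
    if len(data) % Q4_0_BLOCK_BYTES != 0:
        raise ValueError("data length is not a multiple of the Q4_0 block size")

    out = bytearray(data)
    for off in range(0, len(out), Q4_0_BLOCK_BYTES):
        # Swap the two bytes of the f16 scale at the start of the block.
        out[off], out[off + 1] = out[off + 1], out[off]
    return bytes(out)


if __name__ == "__main__":
    # One block: scale bytes 0x00 0x3C followed by 16 quant bytes.
    block = bytes([0x00, 0x3C]) + bytes(range(16))
    swapped = byteswap_q4_0(block)
    print(swapped[:2].hex())  # scale bytes in reversed order
```

Applying the function twice returns the original bytes, which makes for an easy round-trip check when converting between little-endian and big-endian files.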

ggml-ci

* initial jina-embeddings-v3 support
* initial jina-embeddings-v3 support
* initial jina-embeddings-v3 support
* fix vocab parsing with only tokenizer.json
* set mask token lstrip attribute
* additional unk_token_id fallback just in case [no ci]
* revert vocab_size() change [no ci]
* merge tensor loading into general bert
* rope
* add lora embedding and loading (non-functional)
* export separate lora ggufs instead
* add adapter metadata api
* use std::string
* convert_hf_to_lora compatibility
* fix assert
* apply suggestions from review
* apply suggestion from review

…15638) ggml-ci

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* CUDA: add conv2d
* CUDA: conv2d - correct formatting and added const

… printed values (ggml-org#15637)

This makes it much easier to compare between llama.cpp and transformers!

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

slaren and others added 21 commits August 29, 2025 09:17

tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)

d248573

OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)

dccd759

* add fused group_norm/norm, mul, add
* fix spacing
* revert rms_norm logic
* fix trailing whitespace

ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)

dbd8ebc

* ggml-cpu : add basic RVV support for vector f32 ops
* ggml-cpu : add RVV support for f32 softmax

kv-cache : better estimate of n_kv for multi-sequence batches (ggml-o…

4456c0c

…rg#15610) ggml-ci

HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-…

d9c27bf

…org#15615)

presets : add qwen3-30B-a3b FIM (ggml-org#15616)

ab02cc2

server: higher timeout for tests (ggml-org#15621)

a11f1e5

kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)

246385a

ggml-ci

scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)

5b262c0

kv-cache : fix find_slot to not search for continuous slot (ggml-org#…

4f4a813

…15638) ggml-ci

ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625)

221a8c5

ggml-cpu: fix invalid hsum build in debug s390x (ggml-org#15634)

491b300

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

CUDA: add conv2d (ggml-org#15635)

b139bd4

* CUDA: add conv2d
* CUDA: conv2d - correct formatting and added const

Minh141120 force-pushed the update-dev-from-master-2025-08-29-00-12 branch from a8bca68 to b942ade Compare August 29, 2025 02:17

Minh141120 requested a review from qnixsynapse August 29, 2025 02:18

qnixsynapse approved these changes Aug 29, 2025

Minh141120 added this pull request to the merge queue Aug 29, 2025

Merged via the queue into dev with commit 8062559 Aug 29, 2025
11 checks passed

Minh141120 deleted the update-dev-from-master-2025-08-29-00-12 branch August 29, 2025 04:01

21 participants
