

@jan-service-account

Updates dev branch with latest release (b5857) from ggml-org/llama.cpp

@qnixsynapse force-pushed the update-dev-from-master-2025-07-10-00-09 branch from cb9178f to 310794a on July 10, 2025 at 02:37
huydt84 and others added 29 commits on July 10, 2025 at 08:08
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
* Add Reorder to Q6_K mmvq implementation

* Address PR comments: clean up comments

* Remove unused parameter after refactoring q4_k

* Adding inline to function and removing unnecessary reference to int

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
* webui: fix sidebar being covered by main content

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* webui: update index.html.gz

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Simplify the environment variable setting to specify the memory pool type.

* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options (a minimal parsing sketch follows this commit message).

* update

* fix CI

* update

* delete whitespace

* fix according to review

* update CANN.md

* update CANN.md
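
For illustration, here is a minimal sketch of the case-insensitive option parsing described in the GGML_CANN_ASYNC_MODE bullet above. It is not the actual CANN backend code; the helper name is hypothetical.

```cpp
// Minimal sketch (hypothetical helper, not the real CANN backend code):
// treat "yes", "enable", "1" and "on" -- case-insensitively -- as enabling
// the option; anything else, or an unset variable, disables it.
#include <cctype>
#include <cstdlib>
#include <string>

static bool env_flag_enabled(const char * name) {
    const char * val = std::getenv(name);
    if (val == nullptr) {
        return false;
    }
    std::string v(val);
    for (char & c : v) {
        c = (char) std::tolower((unsigned char) c);
    }
    return v == "yes" || v == "enable" || v == "1" || v == "on";
}

// e.g. const bool async_mode = env_flag_enabled("GGML_CANN_ASYNC_MODE");
```
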
* move ggml-cpu-aarch64 to repack

* split quantize_row_q8_0/1

* split helper functions

* split ggml_vec_dot_q4_0_q8_0

* split ggml_vec_dot_q4_1_q8_1

* split ggml_vec_dot_q5_0_q8_0

* split ggml_vec_dot_q5_1_q8_1

* split ggml_vec_dot_q8_0_q8_0

* split ggml_vec_dot_tq1_0_q8_K

* split ggml_vec_dot_tq2_0_q8_K

* split ggml_vec_dot_q2_K_q8_K

* split ggml_vec_dot_q3_K_q8_K

* split ggml_vec_dot_q4_K_q8_K

* split ggml_vec_dot_q5_K_q8_K

* split ggml_vec_dot_q6_K_q8_K

* split ggml_vec_dot_iq2_xxs_q8_K

* split ggml_vec_dot_iq2_xs_q8_K

* split ggml_vec_dot_iq2_s_q8_K

* split ggml_vec_dot_iq3_xxs_q8_K

* split ggml_vec_dot_iq3_s_q8_K

* split ggml_vec_dot_iq1_s_q8_K

* split ggml_vec_dot_iq1_m_q8_K

* split ggml_vec_dot_iq4_nl_q8_0

* split ggml_vec_dot_iq4_xs_q8_K

* fix typos

* fix missing prototypes

* rename ggml-cpu-quants.c

* rename ggml-cpu-traits

* rename arm folder

* move cpu-feats-x86.cpp

* rename ggml-cpu-hbm

* update arm detection macro in quants.c

* move iq quant tables

* split ggml_quantize_mat_q8_0/K

* split ggml_gemv_*

* split ggml_gemm_*

* rename namespace aarch64 to repack

* use weak aliases to replace test macros

* rename GGML_CPU_AARCH64 to GGML_CPU_REPACK

* rename more aarch64 to repack

* clean up rebase leftover

* fix compilation errors

* remove trailing spaces

* try to fix clang compilation errors

* try to fix clang compilation errors again

* try to fix clang compilation errors, 3rd attempt

* try to fix clang compilation errors, 4th attempt

* try to fix clang compilation errors, 5th attempt

* try to fix clang compilation errors, 6th attempt

* try to fix clang compilation errors, 7th attempt

* try to fix clang compilation errors, 8th attempt

* try to fix clang compilation errors, 9th attempt

* more cleanup

* fix compilation errors

* fix apple targets

* fix a typo in arm version of ggml_vec_dot_q4_K_q8_K

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ggml-org#13980)

* llama : allow building all tests on windows when not using shared libraries

* add static windows build to ci

* tests : enable debug logs for test-chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
… device is available, to allow fallback to CPU backend (ggml-org#14099)
…org#13834)

* kv-cache : avoid modifying recurrent cells when setting inputs

* kv-cache : remove inp_s_mask

It was replaced with equivalent, simpler functionality based on
rs_z (the first zeroed state) and the already-existing inp_s_copy.

* kv-cache : fix non-consecutive token pos warning for recurrent models

The problem was apparently caused by how the tail cells were swapped.

* graph : simplify logic for recurrent state copies

* kv-cache : use cell without src refs for rs_z in recurrent cache

* llama-graph : fix recurrent state copy

The `state_copy` shuffle assumes everything is moved at once,
which is not true when `states_extra` is copied back to the cache
before copying the range of states between `head` and `head + n_seqs`.
This is only a problem if any of the cells in [`head`, `head + n_seqs`)
have a `src` in [`head + n_seqs`, `head + n_kv`),
which does happen when `n_ubatch > 1` in the `llama-parallel` example.

Changing the order of the operations avoids the potential overwrite
before use, although when copies are avoided (like with Mamba2),
this will require further changes. (A toy check for this overlap condition
is sketched after this commit message.)

* llama-graph : rename n_state to state_size in build_recurrent_state

This naming should reduce confusion between the state size
and the number of states.
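
To make the overlap condition above concrete, here is an illustrative check. It is not code from this PR; it only mirrors the ranges named in the commit message.

```cpp
// Illustrative only -- checks the overlap condition from the commit message:
// does any cell in [head, head + n_seqs) take its source from
// [head + n_seqs, head + n_kv)? If so, copying the "extra" states back into
// the cache first would overwrite a source before it is read.
#include <cstdint>
#include <vector>

static bool has_overwrite_hazard(const std::vector<int32_t> & src,
                                 uint32_t head, uint32_t n_seqs, uint32_t n_kv) {
    for (uint32_t i = head; i < head + n_seqs; ++i) {
        const int32_t s = src[i];
        if (s >= (int32_t) (head + n_seqs) && s < (int32_t) (head + n_kv)) {
            return true; // this cell's source lies in the range copied back early
        }
    }
    return false;
}
```
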
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8)
and move it to the vk_device. Move all the descriptor pool and set tracking to
the context - none of it is specific to pipelines anymore. It has a single vector
of pools and vector of sets, and a single counter to track requests and a single
counter to track use.
…org#14062)

* webui: Wrap long numbers instead of infinite horizontal scroll

* Use tailwind class

* update index.html.gz
This change moves the command pool/buffer tracking into a vk_command_pool
structure. There are two instances per context (for compute+transfer) and
two instances per device for operations that don't go through a context.
This should prevent separate contexts from stomping on each other.
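
As a rough picture of that grouping, here is a sketch with assumed field names; the actual ggml-vulkan definitions may differ.

```cpp
// Assumed-name sketch of the grouping described above, not the real code:
// command pool/buffer state lives in one structure, instantiated twice per
// context (compute + transfer) and twice per device for context-free work.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

struct vk_command_pool {
    VkCommandPool                pool = VK_NULL_HANDLE;
    std::vector<VkCommandBuffer> buffers;        // recycled between submissions
    uint32_t                     buffer_idx = 0; // next buffer to hand out
};

struct vk_context_cmd {
    vk_command_pool compute;   // per-context compute submissions
    vk_command_pool transfer;  // per-context transfer submissions
};
```
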
* ggml-cpu: Factor out feature detection build from x86

* ggml-cpu: Add ARM feature detection and scoring

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time
activation of features, we rely on GGML_USE_<FEAT>, which needs to be set
in cmake, instead of the GGML_<FEAT> flags that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH,
rather than with individual flags.

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM

Like x86, however to pass around arch flags within cmake, we use
GGML_INTERNAL_<FEAT> as we don't have GGML_<FEAT>.

Some features are optional, so we may need to build multiple backends
per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring
function sort out which one can be used (a simplified scoring sketch
follows this commit message).

* ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now

The other platforms will need their own specific variants.

This also fixes a bug where the variant-building branch was always
executed as the else-branch of GGML_NATIVE=OFF. The branch is now an
elseif-branch, which restores the previous behavior.
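
A simplified sketch of the scoring idea mentioned above; the feature names and the scoring rule here are assumptions, not the actual cpu-feats implementation.

```cpp
// Simplified sketch of variant scoring: a variant is usable only if the CPU
// has every feature it was built with; among usable variants, prefer the one
// built with the most features. Feature names are illustrative.
#include <cstdint>

enum arm_feature : uint32_t {
    ARM_FEAT_DOTPROD = 1u << 0,
    ARM_FEAT_SVE     = 1u << 1,
    ARM_FEAT_SME     = 1u << 2,
};

static int score_variant(uint32_t built_with, uint32_t cpu_has) {
    if (built_with & ~cpu_has) {
        return -1;                      // needs a feature the running CPU lacks
    }
    int score = 0;
    for (uint32_t f = built_with; f != 0; f &= f - 1) {
        ++score;                        // one point per compiled-in feature
    }
    return score;
}
```
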
* batch : remove logits_all flag

ggml-ci

* context : simplify output counting logic during decode

ggml-ci

* cont : fix comments
* cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp is still triggered when .git/index changes.

* cmake: generate build-info.cpp in build dir
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
CISC and others added 26 commits on July 10, 2025 at 08:09
* vulkan: Handle updated FA dim2/3 definition

Pack the mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit (a pack/unpack sketch follows this
commit message).

* handle null mask for gqa

* allow gqa with dim3>1
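
The packing from the first bullet can be sketched as follows; the exact bit layout is an assumption, and only the idea of fitting both values into one dword comes from the commit.

```cpp
// Hypothetical bit layout: low 16 bits hold n_head_log2, bit 16 holds the
// "mask present" flag, so both fit in one dword of the push constant block.
#include <cstdint>

static uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    return (n_head_log2 & 0xFFFFu) | ((uint32_t) has_mask << 16);
}

static void unpack_mask_n_head_log2(uint32_t packed, bool & has_mask, uint32_t & n_head_log2) {
    n_head_log2 = packed & 0xFFFFu;
    has_mask    = ((packed >> 16) & 1u) != 0;
}
```
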
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
Commit taken from remyoudompheng's PR ggml-org#12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
* model : add hunyuan moe

* tokenizer ok

* fix tensor name

* cgraph init

* chat template

* wip

* almost working

* skip embed, fix bos

* cleanup

* yarn scaling

* cleanup

* correct rope type

* failed token fix

* ntk alpha freq_base

* tokenization working

* cleanup and pr changes

* vocab_size sanity check

* ntk alpha generic

* Update convert_hf_to_gguf.py

* Apply suggestions from code review

* fix regression

* fix style

---------

Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
* Add server_prefix

* Correct server path env

* Rename cli flag to --api-prefix

* Change all to api_prefix
Splits producing more than one ubatch per batch for recurrent models
were broken with ggml-org#14512.

This fixes it by moving the completeness check after the ubatch split loop.
* Init - first pass.

* Model -> ModelBase.

* fix errors in conversion.

* Update the graph.

* up.

* up.

* wip

* cgraph ok

* rm redundant code

---------

Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
* v1

* push more fixes

* another fix

* fix

* more fixes

* minor fix

* more cleaning on python code

* python fixes

* changed precision for multipliers float 32->64

* fixes

* another fix

* fix

* pre-norm -> norm

* fix

* Revert "fix"

This reverts commit 243e4d1.

* fix

* small fix ffn_norm

* try

* mix instead of max

* fix vocab size

* conflict solve

* fixed multipliers

* falcon-h1 specific vocab resolved

* read arch from gguf.MODEL_ARCH

* mamba_d_ssm added to d_inner find_hparam

* remove unused functions from gguf_writer.py

* override modify_tensors instead of get_tensors

* fix conversion and d_inner

* added some cb functions for debugging purposes

* inp_out_ids moved outside of layers loop

* mup_vec create as float64

* fix rope_theta

* injected mup

* clean ups

* rm extra space

* rm unused MAMBA_CHUNK_SIZE

* rm unused key

* add bos False

* changed ROPE_TYPE

* cleaning debugging stuff

* cleaning debug quant

* fix comment

* some cleanups

* some cleanups

* Update src/llama-model-loader.cpp

* more cleanups

* moe cleanups

* d_ssm -> d_inner;

* cleaning unused hparams

* cleanup

* more cleanups

* more cleanups on python conversion;

* minor cleanups

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* remove todo

* added falcon-h1

* tensor not required

* clean

* remove unneeded attributes

* more cleanups and fixed conversion

* remove final_norm

* flake8 fixes

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* flake8 fixes

* Update src/llama-hparams.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* added hashes

* Update src/llama-arch.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update the update file

* Revert "update the update file"

This reverts commit 082ab4a.

* fix: address suggestions

* fix: update convert_hf_to_gguf.py

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* d_inner fixed

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* reshaping ssm_norm for 34B

* removing generate_mup

* remove duplicate metadata keys

* rm comment

* final comment

* fix unused args

* fix constants

* fix bad merge

* Update src/llama-model.cpp

Co-authored-by: compilade <git@compilade.net>

* falcon-h1: remove unused ssm_in_b and bad merge

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* falcon-h1: fix last comment

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* falcon-h1: revert add_add_bos(False)

* falcon-h1: fix tied weights

* falcon-h1: remove whitespace

* falcon-h1: fix wrong size param

* falcon-h1: fix whitespace issues

---------

Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: compilade <git@compilade.net>
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
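
For reference, a scalar sketch of what the fused scale + bias above boils down to (y[i] = s*x[i] + b); the real ggml_vec_mad1_f32 adds SIMD paths (and vDSP_vsmsa on Apple), and the signature shown here is an assumption.

```cpp
// Scalar sketch of the fused scale + bias discussed above: y[i] = s * x[i] + b.
// Signature is illustrative; the real implementation adds SIMD code paths.
#include <cstddef>

static void vec_mad1_f32(size_t n, float * y, const float * x, float s, float b) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = s * x[i] + b;   // scalar fallback; SIMD variants compute the same thing
    }
}
```
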
* wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

* llama : use std::find for seq_nodes in llama_rs_cache

* llama : state checkpoints for recurrent models

* llama : correctly handle more edge cases for the rs cache

* llama : rename many llama_kv_cache_* functions

* llama : remove useless return value for some llama_cache_* functions

* llama : rethink recurrent state cell counts

* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot

* llama : support Jamba

* llama : fix BERT inference without KV cache

* convert-hf : check for unprocessed Jamba experts

* convert-hf : support Mini-Jamba conversion

* llama : fix Jamba quantization sanity checks

* llama : sequence-length-aware batch splitting

* llama : use equal-sequence-length sub-batches for recurrent models

* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch

* llama : fix batch split output count for embeddings

* llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag
on thousands of sequences with very small 100k-parameter Mamba models.

* llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.

* llama : avoid copies for simple batch splits

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.

* llama : fix .base() compilation error on Windows

* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.

* mamba : fix non-contiguous usage of ggml_silu

* llama : session saving and reloading for hybrid models

* convert_hf : fix Jamba conversion

* llama : fix mixed signedness comparison

* llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

* llama : begin renaming llama_past back to llama_kv_cache

* llama : remove implicit recurrent state rollbacks

* llama : partially apply clang-format style

* convert : fix jamba conv1d shape squeezing

* graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).

* model : add Jamba to Mamba-specific hparams printing

* jamba : remove redundant nullptr initializations

* model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : make falcon-h1 use shared mamba2 layer builder

* memory : avoid referring to KV in recurrent cache logs

* gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@qnixsynapse force-pushed the update-dev-from-master-2025-07-10-00-09 branch from 310794a to 3762fb3 on July 10, 2025 at 02:39
@Minh141120 merged commit f3471ce into dev on July 10, 2025
9 checks passed
@Minh141120 deleted the update-dev-from-master-2025-07-10-00-09 branch on July 10, 2025 at 03:01