Merge from upstream by pi6am · Pull Request #8 · pi6am/koboldcpp

pi6am · 2025-03-25T03:29:17Z

No description provided.

# Conflicts:
# .devops/llama-cpp-cuda.srpm.spec
# .devops/llama-cpp.srpm.spec
# .devops/nix/package.nix
# .devops/rocm.Dockerfile
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# .github/ISSUE_TEMPLATE/config.yml
# .github/pull_request_template.md
# .github/workflows/bench.yml.disabled
# .github/workflows/build.yml
# .github/workflows/labeler.yml
# CONTRIBUTING.md
# Makefile
# README.md
# SECURITY.md
# ci/README.md
# common/CMakeLists.txt
# docs/android.md
# docs/backend/SYCL.md
# docs/build.md
# docs/cuda-fedora.md
# docs/development/HOWTO-add-model.md
# docs/docker.md
# docs/install.md
# docs/llguidance.md
# examples/cvector-generator/README.md
# examples/imatrix/README.md
# examples/imatrix/imatrix.cpp
# examples/llama.android/llama/src/main/cpp/CMakeLists.txt
# examples/llama.swiftui/README.md
# examples/llama.vim
# examples/lookahead/README.md
# examples/lookup/README.md
# examples/main/README.md
# examples/passkey/README.md
# examples/pydantic_models_to_grammar_examples.py
# examples/retrieval/README.md
# examples/server/CMakeLists.txt
# examples/server/README.md
# examples/simple-cmake-pkg/README.md
# examples/speculative/README.md
# flake.nix
# grammars/README.md
# pyproject.toml
# scripts/check-requirements.sh

* vulkan: support memset_tensor
* vulkan: support GGML_OP_SUM
* vulkan: implement GGML_OP_ARGMAX
* vulkan: implement GGML_OP_SUB
* vulkan: implement GGML_OP_COUNT_EQUAL
* vulkan: implement GGML_OP_OPT_STEP_ADAMW
* vulkan: fix check_results RWKV_WKV6 crash and memory leaks
* vulkan: implement GGML_OP_REPEAT_BACK
* tests: remove invalid test-backend-ops REPEAT_BACK tests
* vulkan: fix COUNT_EQUAL memset using a fillBuffer command

This commit fixes an issue in the llama.cpp project where the command for testing the llama-server object contained a duplicated file extension.

The original command was:

    ./tests.sh unit/test_chat_completion.py.py -v -x

It has been corrected to:

    ./tests.sh unit/test_chat_completion.py -v -x

This change ensures that the test script correctly locates and executes the intended test file, preventing test failures due to an incorrect file name.

* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

)
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (LostRuins#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
  ---------
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (LostRuins#17)
* meta : formatting, naming, indentation (LostRuins#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
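
The Qwen-30B-A3B note in the tensor parallelism commit (a dimension of 768 splitting unevenly at a granularity of 256) can be checked with a little arithmetic. The sketch below is only an illustration, not the meta backend's code: `split_rows` is a hypothetical helper, the two-device setup is an assumption, and 32 is used as the finer granularity because that is the Q4_0 block size in ggml.

```python
def split_rows(dim: int, n_devs: int, granularity: int) -> list[int]:
    """Split `dim` across `n_devs` devices in whole blocks of `granularity`.

    Hypothetical helper for illustration; not taken from ggml.
    """
    if dim % granularity != 0:
        raise ValueError("dim must be a multiple of granularity")
    n_blocks = dim // granularity
    base, extra = divmod(n_blocks, n_devs)
    # The first `extra` devices each take one additional block.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devs)]


# 768 / 256 = 3 blocks, which cannot be shared evenly by 2 devices.
print(split_rows(768, 2, 256))  # [512, 256]

# 768 / 32 = 24 blocks, which split evenly.
print(split_rows(768, 2, 32))   # [384, 384]
```

With three blocks and two devices, one device always ends up with twice the rows of the other, which matches the uneven split the commit message describes. Choosing the block size from the quantization type, as the follow-up bullet says, makes a finer granularity available.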

LostRuins and others added 30 commits February 15, 2025 22:11

Revert "try use docker to prepare for upcoming deprecation"

5b9fc4b

This reverts commit d4791c5. (+2 squashed commit)

Squashed commit:

[d4791c5] try use docker to prepare for upcoming deprecation
[1e12097] updated lite

repo : update links to new url (ggml-org#11886)

68ff663

* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci

trying new ubuntu for ci

2ca1369

correction

673e33c

fixed lite

fd211db

readme : add notice about new package registry (ggml-org#11890)

c2cd24f

* readme : add notice about new package registry
* cont : fix whitespace

metal : optimize dequant q6_K kernel (ggml-org#11892)

2288510

examples: fix typo in imatrix/README.md (ggml-org#11884)

fc10c38

* simple typo fixed
* Update examples/imatrix/README.md

---------

Co-authored-by: Tobias Bergmann <tobias.bergmann@gmx.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

scripts: fix compare-llama-bench commit hash logic (ggml-org#11891)

6dde178

CUDA: use async data loading for FlashAttention

eb4f795

try CI fix

727db80

horde advertised max ctx

299d6ce

add short delay before launching browser

5a79dd5

Merge branch 'upstream' into concedo_experimental

e0bdb2f

# Conflicts:
# README.md
# examples/imatrix/README.md
# scripts/compare-llama-bench.py

metal : fix the crash caused by the lack of residency set support on Intel Macs. (ggml-org#11904)

c2ea16f

vulkan: support multi/vision rope, and noncontiguous rope (ggml-org#11902)

bf42a23

ci : fix (again) arm64 build fails (ggml-org#11895)

818a340

* docker : attempt fixing arm64 build on ci
* qemu v7.0.0-28

common : Fix a typo in help (ggml-org#11899)

fe163d5

This patch fixes a typo in command help.
prefx -> prefix

Signed-off-by: Masanari Iida <standby24x7@gmail.com>

safer autoguess fix

5838015

verbose outputs (+3 squashed commit)

Squashed commit:

[7bbbfc10] fixed a retry history bug
[824b9bf7] another autoguess fix

better error handling for downloads

15ae98c

allow kcppt for config switching

6fa50f7

server : bump httplib to 0.19.0 (ggml-org#11908)

0f2bbe6

Merge remote-tracking branch 'jg/cuda-fa-mma-17' into debug4

a670442

server : fix divide-by-zero in metrics reporting (ggml-org#11915)

c4d29ba

update release requirements (ggml-org#11897)

f7b1116

CUDA: use async data loading for FlashAttention (ggml-org#11894)

73e2ed3

* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>

scripts: corrected encoding when getting chat template (ggml-org#11866) (ggml-org#11907)

5137da7

Signed-off-by: MoonRide303 <moonride303@gmail.com>

LostRuins and others added 28 commits March 14, 2025 17:47

gemma3 template, updated lite, fixed tool calling, reenable ctx shift for gemma3

6a1dd57

replaced winclinfo.exe with a simplified simpleclinfo.exe that only provides device names and nothing else (+1 squashed commits)

782e1e1

Squashed commits:

[4a73c8d3] replaced winclinfo.exe with a simplified simpleclinfo.exe that only provides device names and nothing else

server: fix "--grammar-file" parameter (ggml-org#12285)

add2a3a

Merge branch 'upstream' into concedo_experimental

be3bba6

# Conflicts:
# src/llama-model.cpp

rename replace_instruct_placeholders field

30cb77a

added model switching to gguf in admin mode (auto guess layers)

d7498e7

edit readme

4a29e21

main : add -sysf / --system-prompt-file (ggml-org#12249) (ggml-org#12250)

774973b

* add system_prompt_file
* add -sysf / --system-prompt-file
* remove system_prompt_file

Add CLI arg to llama-run to adjust the number of threads used (ggml-org#12370)

9f2250b

We default to 4, sometimes we want to manually adjust this

Signed-off-by: Eric Curtin <ecurtin@redhat.com>

[CANN]MUL_MAT optimization (ggml-org#12382)

92a3913

wip on multiple fixes

4212f0b

verbosity

7272165

fixed a clip processing bug

bfc3006

add config for default gen tokens and bos toggle

e84596e

Merge branch 'upstream' into concedo_experimental

67851e5

# Conflicts:
# examples/run/run.cpp
# ggml/src/ggml-cann/aclnn_ops.cpp

SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (ggml-org#12399)

b19bd06

* sycl : support non-contiguous tensors in binary ops
* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

SYCL: Delete redundant plus sign and space (ggml-org#12391)

3d35d87

more rocm include dir

98eade3

llama-tts : add '-o' option (ggml-org#12398)

f4c3dd5

* added -o option to specify an output file name
* llama-tts returns ENOENT in case of file write error

note : PR ggml-org#12042 is closed as superseded with this one.
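
The behaviour described in the llama-tts commit (write to a user-chosen output file, report ENOENT if the write fails) can be sketched as follows. This is a Python analogue written for illustration only, not the llama-tts C++ code: `save_output` and the file name `out.wav` are hypothetical, and mapping every write failure to ENOENT is an assumption drawn from the commit message.

```python
import errno
import os
import tempfile


def save_output(path: str, data: bytes) -> int:
    """Write `data` to `path`; return 0 on success, errno.ENOENT on failure.

    Hypothetical helper for illustration; not part of llama-tts.
    """
    try:
        with open(path, "wb") as f:
            f.write(data)
    except OSError:
        # Assumption: any failure to write is reported as ENOENT,
        # following the wording of the commit message.
        return errno.ENOENT
    return 0


with tempfile.TemporaryDirectory() as tmp:
    # Writing into an existing directory succeeds.
    ok = save_output(os.path.join(tmp, "out.wav"), b"\x00\x01")
    # Writing into a directory that does not exist fails.
    bad = save_output(os.path.join(tmp, "missing", "out.wav"), b"\x00\x01")
    print(ok, bad == errno.ENOENT)
```

Returning an errno-style code instead of raising lets a command-line wrapper pass the value straight to its exit status.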

revert unwanted change to tool calling

9f7fd63

improvement to tool calling, allowing specific tools to be used

2401502

Merge branch 'upstream' into concedo_experimental

5d7c5e9

# Conflicts:
# examples/tts/tts.cpp

improve model estimation

0954e9e

fix for sd

5ef1722

revert clean

8708403

updated sd metadata

e466ce6

allow quantkv with contextshift

6888f54

pi6am merged commit 223ab64 into pi6am:concedo Mar 25, 2025

Merge from upstream #8

pi6am merged 1761 commits into pi6am:concedo from LostRuins:concedo

pi6am commented Mar 25, 2025

20 participants
