Conversation

@jan-service-account

Updates dev branch with latest release (b5191) from ggml-org/llama.cpp

rgerganov and others added 5 commits April 25, 2025 10:08
…org#12943)

RPC_CMD_SET_TENSOR always returns an empty response, and we send this command 4
times per token. We can improve token generation (TG) speed if we don't wait
for this empty response.

The performance impact of this change depends on the network latency.
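The idea behind the change can be sketched in a few lines. This is a minimal, hypothetical illustration (not the actual llama.cpp RPC client, whose protocol is binary over a socket): since the response to SET_TENSOR is always empty, the client can enqueue the command and return immediately instead of blocking for one network round trip per call.

```python
# Hypothetical sketch of the fire-and-forget change for a one-way RPC
# command whose response is always empty. The class names and queue-based
# "transport" are illustrative assumptions, not llama.cpp code.
import queue


class RpcClient:
    def __init__(self):
        self.outbox = queue.Queue()  # stands in for the network socket
        self.round_trips = 0         # blocking waits on the server

    def send_with_ack(self, cmd, payload):
        # old behaviour: block until the server returns the (empty) response
        self.outbox.put((cmd, payload))
        self.round_trips += 1        # one full round trip of latency per call
        return b""

    def send_no_ack(self, cmd, payload):
        # new behaviour: fire and forget, never wait on the empty response
        self.outbox.put((cmd, payload))


client = RpcClient()
for _ in range(4):  # SET_TENSOR is sent 4 times per token
    client.send_no_ack("SET_TENSOR", b"\x00" * 16)
```

With the old path, each token would pay four network round trips of latency just to receive empty acknowledgements; with the new path it pays none, which is why the speedup scales with network latency.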
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue
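The commits above trade a broad fix (forcing FP32 compute in every cuBLAS GEMM) for a targeted one (F32 only in the GLM4 FFN down projection). The underlying numeric issue can be reproduced with a minimal Python sketch, using the `struct` module's `'e'` format to round values through IEEE 754 half precision; the specific partial sums are made-up illustrative values, not GLM4 activations:

```python
# Sketch: why accumulating a down-projection in half precision can blow up.
# f16's largest finite value is ~65504, so partial sums past that overflow.
import struct


def f16(x):
    # round-trip a Python float through IEEE 754 half precision ('e' format);
    # finite values beyond the f16 range overflow, which we map to infinity
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return float("inf")


# illustrative partial products whose sum exceeds the f16 range
partials = [40000.0, 40000.0]

acc_f16 = 0.0
for p in partials:
    acc_f16 = f16(acc_f16 + p)  # accumulate in half precision -> inf

acc_f32 = sum(partials)         # accumulate in wider precision -> 80000.0
```

Each individual partial fits comfortably in f16, but the running sum does not, so the half-precision accumulator saturates to infinity while the wider accumulator stays finite. Forcing F32 compute for just the layer where this occurs avoids the NaN/inf outputs without paying the FP32 cost in every GEMM.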

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@jan-service-account jan-service-account merged commit 5fa0519 into dev Apr 26, 2025
9 checks passed
@jan-service-account jan-service-account deleted the update-dev-from-master-2025-04-26-00-08 branch April 26, 2025 00:17

7 participants