Hello all,

llama.cpp bench:
build: d5dfc3302 (8069)

ik_llama.cpp bench:
Replies: 1 comment 1 reply
@xantrk

In your test case, most of the time is spent on the CPU, computing matrix-vector multiplications for the MoE tensors left in RAM. These computations are 100% memory-bandwidth bound, so there will not be much of a difference between `llama.cpp` and `ik_llama.cpp` (or between these two and the inference framework my grandmother put together over the weekend with the help of Claude Opus).

Prompt processing, which you didn't test, may be slightly faster with `ik_llama.cpp`, but the difference will not be very big because in that case processing time is dominated by the time it takes to copy the MoE tensors left in RAM to the GPU. Unless you use very large batches, but you cannot really do that with yo…

Sorry, but …

If you want to observe larger performance differences, pick a model that fits in your 32 GB RAM and run CPU-only with a long context.
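
The memory-bandwidth argument can be made concrete with a back-of-the-envelope estimate. When token generation is bandwidth bound, every token has to stream the active weights out of RAM once, so the rate cannot exceed bandwidth divided by the bytes read per token, whichever framework does the reading. The sketch below is only an illustration: the function name and both numbers are hypothetical and are not taken from the benchmarks in this thread.

```python
def max_tokens_per_second(active_bytes_per_token: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on the token generation rate when each token must
    stream `active_bytes_per_token` bytes of weights from RAM."""
    if active_bytes_per_token <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return bandwidth_bytes_per_s / active_bytes_per_token


GB = 1e9

# Hypothetical values chosen for illustration only.
active_weights = 5 * GB   # bytes of MoE expert weights read from RAM per token
ram_bandwidth = 50 * GB   # sustained RAM bandwidth in bytes per second

ceiling = max_tokens_per_second(active_weights, ram_bandwidth)
print(f"bandwidth ceiling: {ceiling:.1f} tokens/s")
```

Under these assumed numbers the ceiling is set entirely by the hardware, so two frameworks that both come close to saturating RAM bandwidth will report nearly the same tokens per second.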