@xantrk

In your test case, most of the time is spent on the CPU, computing matrix-vector multiplications with the MoE tensors left in RAM. These computations are 100% memory-bandwidth bound, so there will not be much of a difference between llama.cpp and ik_llama.cpp (or between these two and the inference framework my grandmother put together over the weekend with the help of Claude Opus).
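
To make the memory-bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes every active expert weight kept in RAM is read once per generated token, and both numbers below are hypothetical placeholders, not measurements from this thread. The function name `max_tokens_per_second` is made up for illustration.

```python
def max_tokens_per_second(active_bytes_per_token: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed when each token requires
    streaming `active_bytes_per_token` of weights from RAM once."""
    return bandwidth_bytes_per_s / active_bytes_per_token


# Hypothetical values, chosen only to show the arithmetic.
active_gb_per_token = 10.0   # GB of MoE weights read from RAM per token (assumed)
ram_bandwidth_gb_s = 50.0    # sustained RAM bandwidth in GB/s (assumed)

bound = max_tokens_per_second(active_gb_per_token * 1e9,
                              ram_bandwidth_gb_s * 1e9)
print(f"{bound:.1f} tokens/s upper bound")  # prints "5.0 tokens/s upper bound"
```

Under this simplified model the bound depends only on RAM bandwidth and the bytes touched per token, which is why two frameworks running the same quantized weights on the same machine should land close to each other.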

Prompt processing, which you didn't test, may be slightly faster with ik_llama.cpp, but the difference will not be very big because in that case processing time is dominated by the time it takes to copy MoE tensors left in RAM to the GPU. Unless you use very large batches, but you cannot really do that with yo…
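
The batch-size point can be sketched with a deliberately simplified model: the RAM-to-GPU copy of the offloaded MoE tensors is paid once per batch, while GPU compute scales with the number of tokens. Every parameter name and value here is an assumption for illustration only, and the function `prompt_tokens_per_second` is hypothetical.

```python
def prompt_tokens_per_second(batch_tokens: int,
                             weight_bytes: float,
                             pcie_bytes_per_s: float,
                             gpu_seconds_per_token: float) -> float:
    """Prompt throughput under a toy model: one weight copy per batch
    plus GPU compute time proportional to the batch size."""
    copy_s = weight_bytes / pcie_bytes_per_s          # paid once per batch
    compute_s = batch_tokens * gpu_seconds_per_token  # grows with batch size
    return batch_tokens / (copy_s + compute_s)


# Hypothetical values: 20 GB of offloaded weights, 20 GB/s host-to-GPU link,
# 1 ms of GPU compute per prompt token.
for batch in (512, 4096):
    tps = prompt_tokens_per_second(batch, 20e9, 20e9, 1e-3)
    print(f"batch={batch}: {tps:.0f} tokens/s")
```

In this toy model the copy time dominates at small batch sizes, so throughput is set mostly by the transfer rather than by the framework, and only larger batches amortize it.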

Answer selected by xantrk
Category
Q&A