fix(llama-cpp): populate tensor_buft_override buffer so llama-cpp properly performs fit calculations #8560
Conversation
As an extra note, I noticed that this auto-fit logic from llama-cpp is bypassed by default because we pass a large value for gpu_layers (not sure if extra logic is involved, but that's what I saw for my model config). I was able to enable auto-fit by changing that setting. Curious if I should update the markdowns to advertise enabling auto-fit that way.
I'm still weighing the best approach here due to a subtle trade-off: LocalAI already unloads models based on a VRAM usage threshold, and enabling auto-fitting by default in llama.cpp would conflict with that mechanism. Fixing this properly requires either custom logic or a runtime toggle. In the short term, documenting the limitation is probably the best path forward.
Opened #8562 for discussion.
Description
While testing a large model (Qwen3-Coder-Next) on my PC, I noticed that llama-cpp's CLI would load it and offload what it needed to system RAM, while LocalAI errored out saying I didn't have enough VRAM.
Looking at the logs, I saw that llama-cpp was trying to perform fit calculations to load some tensors into VRAM and the rest into system RAM, but was erroring out:

Digging through llama-cpp's codebase, it appears that the tensor_buft_overrides parameter is expected to be populated with a buffer even when it isn't used.
This PR adds logic that mimics what llama-cpp does during arg parsing, filling the parameter with an arbitrary buffer so that this bailout is avoided later; a rough sketch of the idea is below.
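For context, a minimal sketch of the kind of change described above, assuming the backend builds llama_model_params directly. The helper name make_model_params and the choice of a single null-terminator entry are illustrative only, not the actual diff; the real change reportedly fills the array with an arbitrary buffer, but the general shape is the same: hand llama.cpp a non-null, terminated overrides array instead of leaving the pointer unset.

```cpp
// Sketch (not the actual PR diff): keep the overrides storage alive for the
// lifetime of the model and hand llama.cpp a terminated tensor_buft_overrides
// array, mirroring what its CLI arg parsing does, so the fit calculation has
// a valid buffer to work with.
#include "llama.h"

#include <vector>

// Storage must outlive the model, since llama_model_params only holds a pointer.
static std::vector<llama_model_tensor_buft_override> g_tensor_overrides;

static llama_model_params make_model_params() {  // hypothetical helper name
    llama_model_params mparams = llama_model_default_params();

    // Even with no user-specified overrides, provide a terminated array
    // instead of leaving the pointer unset; the terminator's null pattern
    // means "no override" for any tensor.
    g_tensor_overrides.clear();
    g_tensor_overrides.push_back({ /*pattern*/ nullptr, /*buft*/ nullptr });

    mparams.tensor_buft_overrides = g_tensor_overrides.data();
    return mparams;
}
```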
Notes for Reviewers
Tested by building locally for cuda13 and validating that the 54GB model could load across a combination of VRAM and system RAM, despite only 24GB of VRAM being available:

Signed commits