fix(llama-cpp): populate tensor_buft_override buffer so llama-cpp properly performs fit calculations #8560

Merged
mudler merged 1 commit into mudler:master from cvpcs:llama-cpp-fit
Feb 14, 2026

Conversation

@cvpcs
Contributor

@cvpcs cvpcs commented Feb 14, 2026

Description

While testing some large models on my PC (Qwen3-Coder-Next), I noticed that llama-cpp's CLI would load the model and offload what it needed to system RAM, while LocalAI errored out saying I didn't have enough VRAM.

Looking at the logs, I saw that llama-cpp was trying to perform fit calculations to load some tensors into VRAM and the rest into system RAM, but was erroring out:
[screenshot: llama-cpp log output showing the failed fit calculation]

Looking through llama-cpp's codebase, it seems they expect the tensor_buft_overrides parameter to point at a populated buffer even when no overrides are used.

This PR adds logic that mimics what llama-cpp does during arg parsing, filling the parameter with an arbitrary buffer so that the later bailout is avoided.
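
For reviewers, the shape of the change is roughly the following sketch (illustrative only, not the literal diff; the helper name and static vector are made up for this example, while the llama_model_tensor_buft_override struct and the tensor_buft_overrides field come from llama.cpp's llama.h):

```cpp
// Illustrative sketch: make sure tensor_buft_overrides always points at a valid,
// null-terminated buffer, mirroring what llama.cpp's own arg parsing does.
#include <vector>
#include "llama.h"

// Kept alive for as long as the model params are in use, since llama.cpp only
// stores the raw pointer.
static std::vector<llama_model_tensor_buft_override> tensor_buft_overrides;

static void ensure_tensor_buft_override_buffer(llama_model_params & mparams) {
    if (tensor_buft_overrides.empty()) {
        // A single { nullptr, nullptr } entry acts as the list terminator, so the
        // fit calculation sees an empty-but-valid override list instead of a null
        // pointer.
        tensor_buft_overrides.push_back({ nullptr, nullptr });
    }
    mparams.tensor_buft_overrides = tensor_buft_overrides.data();
}
```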

Notes for Reviewers

Tested by building locally for cuda13 and validating that the 54GB model could load across a combination of VRAM and system RAM, despite only 24GB of VRAM being available:
[screenshot: logs showing the 54GB model loaded across VRAM and system RAM]

Signed commits

  • Yes, I signed my commits.

@netlify

netlify bot commented Feb 14, 2026

Deploy Preview for localai ready!

🔨 Latest commit: 6590a54
🔍 Latest deploy log: https://app.netlify.com/projects/localai/deploys/699020d8f725580008387ab0
😎 Deploy Preview: https://deploy-preview-8560--localai.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

@cvpcs
Contributor Author

cvpcs commented Feb 14, 2026

As an extra note, I noticed that this auto-fit logic from llama-cpp is bypassed by default because we pass a large value for gpu_layers (not sure if extra logic is involved, but that's what I saw for my model config).

I was able to enable auto-fit by setting gpu_layers = -1 in my model config, which corresponds to llama-cpp's auto-fit flag value; at that point I saw the tensor_buft_override buffer error which this PR addresses.
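
For reference, the kind of model config I mean looks something like this (illustrative snippet with placeholder name and model path; only the gpu_layers line is relevant here):

```yaml
# illustrative LocalAI model config; name and model path are placeholders
name: qwen3-coder-next
backend: llama-cpp
parameters:
  model: qwen3-coder-next.gguf
gpu_layers: -1   # -1 maps to llama-cpp's auto-fit behavior
```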

Curious if I should update the markdowns to advertise enabling auto-fit by setting gpu_layers=-1 for anyone interested in leveraging it, or if we should make changes to the default logic. @mudler thoughts?

Owner

@mudler mudler left a comment


ouch, good catch - thank you @cvpcs !

@mudler
Owner

mudler commented Feb 14, 2026

> As an extra note, I noticed that this auto-fit logic from llama-cpp is bypassed by default because we pass a large value for gpu_layers (not sure if extra logic is involved, but that's what I saw for my model config).
>
> I was able to enable auto-fit by setting gpu_layers = -1 in my model config, which corresponds to llama-cpp's auto-fit flag value; at that point I saw the tensor_buft_override buffer error which this PR addresses.
>
> Curious if I should update the markdowns to advertise enabling auto-fit by setting gpu_layers=-1 for anyone interested in leveraging it, or if we should make changes to the default logic. @mudler thoughts?

I'm still weighing the best approach here due to a subtle trade-off. LocalAI already unloads models based on a VRAM usage threshold, and enabling auto-fitting by default in llama.cpp would conflict with this mechanism. This requires either custom logic or a runtime toggle to fix correctly. In the short term, probably documenting the limitation is the best path forward.

@mudler mudler merged commit 42cb7bd into mudler:master Feb 14, 2026
45 of 46 checks passed
@mudler
Owner

mudler commented Feb 14, 2026

Opened #8562 for discussion
