Switch back to subprocessing for llama.cpp #3218
Conversation
Force-pushed from b8a6cc0 to d0fb79b
llm/server.go (outdated)
    return fmt.Errorf("timed out waiting for llama runner to start")
}
if s.cmd.ProcessState != nil {
    return fmt.Errorf("llama runner process no longer running: %d", s.cmd.ProcessState.ExitCode())
This error will likely be returned for many different "root" errors, which will cause many people to end up in issues not related to their actual problems.
You can see in a previous implementation that I redirected the stderr to a function that intercepts the log messages to track the actual error. If possible I'd like to do something similar again.
Line 430 in 5e7fd69
statusWriter := NewStatusWriter()
Thanks! I'll port this code over.
With the current structure, this will have somewhat limited value. The common scenario is a failure in the GPU runner; the retry logic means we'll fall back to the CPU runner(s), and the captured last error will only be reported in the log. If there's a problem with the model itself, we'd iterate through all the runners, each would fail, and we would bubble that error back up through the API.
In the future it may be interesting to explore warnings bubbling up in the API so that we could capture this fallback scenario and report why we couldn't load on the GPU but are running on CPU.
Force-pushed from 0017bd1 to 87818f4
Force-pushed from 1569569 to b77d6ba
Not fixed for me. Before updating, ollama didn't use any (significant, at least) memory on startup. Now, the instance mapped to my 1080 Ti (11 GiB) is using 136 MiB and the instances mapped to my 1070 Ti's (8 GiB) are using 100 MiB each. This is before loading any models. Not too cool
@oldmanjk this PR switches the GPU-specific code over to a subprocess, which will unload after the keep-alive timeout expires as long as no requests are being sent. If you start the server and don't make any requests, no GPU resources should be used. Once you send a request, we'll start using GPU resources. If that's not the behavior you're seeing, can you explain your test setup a bit more, including where you see it holding VRAM indefinitely while idle?
That's not the behavior I'm seeing on two separate machines. What specifically would you like to know about them?
Ah, this is an unintended side effect of rebasing on top of #2279, which was recently merged. Thanks for catching this @oldmanjk! It looks like our new cudart-based GPU discovery isn't unloading when idle. The brunt of the VRAM is cleared with the subprocess, but the main process still holds ~30 MiB. At startup:
After loading a small model:
After it unloads the subprocess once idle:
I'll figure out how to get it unloaded between queries and update the PR.
GPU memory leak fixed and confirmed on Linux and Windows.
The amount of memory seems to vary based on GPU VRAM capacity. On my 4090 (24 GiB), ollama uses 384 MiB on launch. My 1080 Ti (11 GiB) uses 136 MiB and my 1070 Ti's (8 GiB) use 100 MiB each. I can't check the 4090 ATM, but after loading and unloading one model, the other three cards are at 224, 188, and 188 MiB. I started writing that before you posted. I'll leave it for posterity
Force-pushed from 65f6cd7 to 8d47b8b
This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process and shutdown when idle, and gracefully restart if it has problems. This also serves as a first step to be able to run multiple copies to support multiple models concurrently.
Cleaner shutdown logic, a bit of response hardening
"cudart init failure: 35" isn't particularly helpful in the logs.
We may have users that run into problems with our current payload model, so this gives us an escape valve.
Leaving the cudart library loaded kept ~30 MiB of memory pinned on the GPU in the main process. This change ensures we don't hold GPU resources when idle.
Tested on Windows, Linux, and Mac, on Nvidia and AMD; also simulated a number of different failure modes to ensure the server detected the runner not responding and restarted it on the next request.
Fixes #1691
Fixes #1848
Fixes #1871
Fixes #2767