
Switch back to subprocessing for llama.cpp #3218

Merged — dhiltgen merged 7 commits into ollama:main from the subprocess branch on Apr 2, 2024

Conversation

@dhiltgen (Collaborator) commented Mar 18, 2024

This should resolve a number of memory leak and stability defects by isolating llama.cpp in a separate process, shutting it down when idle, and gracefully restarting it if it has problems. This also serves as a first step toward being able to run multiple copies to support multiple models concurrently.

Tested on Windows, Linux, and Mac with NVIDIA and AMD GPUs, and simulated a number of different failure modes to ensure the runner not responding is detected and it is restarted on the next request.

Fixes #1691
Fixes #1848
Fixes #1871
Fixes #2767
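
For readers unfamiliar with the approach, here is a minimal sketch of the general pattern described above: launch the runner as a subprocess, detect when it has exited, and restart it on the next request. This is not the code from this PR; the package, type, and method names are illustrative only.

```go
// Minimal sketch of the subprocess pattern; names are illustrative.
package llmrunner

import (
	"log"
	"os/exec"
	"sync"
)

// runner wraps a llama.cpp server launched as a separate process.
type runner struct {
	mu  sync.Mutex
	cmd *exec.Cmd
}

// ensureRunning starts the subprocess if it is not already running, so a
// crashed or idle-stopped runner is transparently restarted on the next request.
func (r *runner) ensureRunning(binary string, args ...string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cmd != nil && r.cmd.ProcessState == nil {
		return nil // still running
	}
	cmd := exec.Command(binary, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	r.cmd = cmd
	go func() {
		// Reap the process; once it exits, ProcessState becomes non-nil and
		// the next ensureRunning call will spawn a fresh subprocess.
		if err := cmd.Wait(); err != nil {
			log.Printf("runner exited: %v", err)
		}
	}()
	return nil
}

// stop kills the subprocess, e.g. when the server has been idle long enough.
func (r *runner) stop() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cmd != nil && r.cmd.Process != nil {
		_ = r.cmd.Process.Kill()
	}
}
```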

llm/server.go (outdated review thread):
return fmt.Errorf("timed out waiting for llama runner to start")
}
if s.cmd.ProcessState != nil {
return fmt.Errorf("llama runner process no longer running: %d", s.cmd.ProcessState.ExitCode())
A Contributor commented:

This error will likely be returned for many different root causes, which will lead many people into issues unrelated to their actual problem.

You can see in a previous implementation that I redirected the stderr to a function that intercepts the log messages to track the actual error. If possible I'd like to do something similar again.

statusWriter := NewStatusWriter()
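
For illustration, a sketch of that idea (not the actual ollama implementation): an io.Writer attached to the runner's stderr that remembers the last line that looks like an error, so a generic "process no longer running" failure can be reported together with its underlying cause.

```go
// Sketch only; the real StatusWriter may differ.
package llmrunner

import (
	"bytes"
	"sync"
)

// StatusWriter captures the most recent error-looking line from the runner's
// stderr stream.
type StatusWriter struct {
	mu      sync.Mutex
	lastErr string
}

func NewStatusWriter() *StatusWriter {
	return &StatusWriter{}
}

// Write scans each chunk of stderr output for an "error:" marker and keeps
// the most recent match, while reporting the chunk as fully written so it can
// also be teed into the normal server log.
func (w *StatusWriter) Write(p []byte) (int, error) {
	if i := bytes.Index(p, []byte("error:")); i >= 0 {
		w.mu.Lock()
		w.lastErr = string(bytes.TrimSpace(p[i:]))
		w.mu.Unlock()
	}
	return len(p), nil
}

// LastError returns the captured error line, or "" if none was seen.
func (w *StatusWriter) LastError() string {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.lastErr
}
```

Wiring it up would look roughly like `cmd.Stderr = io.MultiWriter(os.Stderr, statusWriter)`, keeping the normal log output while still capturing the last error.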

@dhiltgen (Collaborator, Author) replied:

Thanks! I'll port this code over.

@dhiltgen (Collaborator, Author) added:

With the current structure, this is going to have somewhat limited value. The common scenario is some failure in the GPU runner; the retry logic means we'll switch to the CPU runner(s), and the captured last error will only be reported in the log. If there's a problem with the model itself, then we'd iterate through all the runners, they would all fail, and we would bubble that error back up through the API.

In the future it may be interesting to explore warnings bubbling up in the API so that we could capture this fallback scenario and report why we couldn't load on the GPU but are running on CPU.
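
A rough sketch of the fallback flow described in this comment (illustrative only, not the PR's code): each available runner is tried in order, failures are only logged, and an error is surfaced through the API only if every runner fails.

```go
package llmrunner

import (
	"fmt"
	"log"
)

// tryRunners attempts to load the model on each runner in priority order
// (for example CUDA, then ROCm, then CPU). loadFn stands in for the real load
// logic and is assumed to return the captured last error on failure.
func tryRunners(runners []string, loadFn func(runner string) error) error {
	var lastErr error
	for _, r := range runners {
		if err := loadFn(r); err != nil {
			// In the GPU-fallback scenario this only reaches the log,
			// because a later (CPU) runner may still succeed.
			log.Printf("runner %s failed: %v", r, err)
			lastErr = err
			continue
		}
		return nil
	}
	// Every runner failed (e.g. a problem with the model itself), so the
	// error bubbles back up through the API.
	return fmt.Errorf("unable to load model on any runner: %w", lastErr)
}
```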

dhiltgen force-pushed the subprocess branch 4 times, most recently from 0017bd1 to 87818f4 on March 27, 2024 at 22:57
dhiltgen force-pushed the subprocess branch 2 times, most recently from 1569569 to b77d6ba on March 28, 2024 at 21:26
@oldmanjk commented Mar 30, 2024

Not fixed for me. Before updating, ollama didn't use any (significant, at least) GPU memory on startup. Now the instance mapped to my 1080 Ti (11 GiB) is using 136 MiB and the instances mapped to my 1070 Ti's (8 GiB) are using 100 MiB each. This is before loading any models. Not too cool.

@dhiltgen (Collaborator, Author) replied:

@oldmanjk this PR switches the GPU specific code over to a subprocess, which will unload after expiration of the keep alive timeout as long as requests are not being sent. If you start the server and don't make any requests, no GPU resources should be used. Once you send a request, then we'll start using GPU resources.

If that's not the behavior you're seeing, can you explain your test setup a bit more where you see it holding VRAM indefinitely in an idle setup?
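
As an illustration of the keep-alive behavior described above (names are assumptions, not ollama's actual scheduler): each request re-arms an idle timer, and if the timer fires with no further requests the subprocess is stopped so its GPU resources are released.

```go
package llmrunner

import (
	"sync"
	"time"
)

// keepAlive stops the runner subprocess after a period with no requests.
type keepAlive struct {
	mu      sync.Mutex
	timeout time.Duration
	stop    func() // e.g. the runner's stop method
	timer   *time.Timer
}

// touch is called after each request; it (re)arms the idle timer.
func (k *keepAlive) touch() {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.timer != nil {
		k.timer.Stop()
	}
	k.timer = time.AfterFunc(k.timeout, k.stop)
}
```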

@oldmanjk replied:

> @oldmanjk this PR switches the GPU specific code over to a subprocess, which will unload after expiration of the keep alive timeout as long as requests are not being sent. If you start the server and don't make any requests, no GPU resources should be used. Once you send a request, then we'll start using GPU resources.
>
> If that's not the behavior you're seeing, can you explain your test setup a bit more where you see it holding VRAM indefinitely in an idle setup?

That's not the behavior I'm seeing on two separate machines. What specifically would you like to know about them?

@dhiltgen (Collaborator, Author) replied:

Ah, this is an unintended side effect of my rebase on top of #2279, which was recently merged. Thanks for catching this @oldmanjk!

It looks like our new cudart-based GPU discovery isn't unloading when idle. The brunt of the VRAM is cleared with the subprocess, but the main process still holds ~30 MiB.

At Startup:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    198339      C   ./ollama-linux-amd64               29MiB |
+-----------------------------------------------------------------------------+

After loading a small model:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    198339      C   ./ollama-linux-amd64               29MiB |
|    0   N/A  N/A    198507      C   ...a_v11/ollama_llama_server     2783MiB |
+-----------------------------------------------------------------------------+

After it unloads the subprocess once idle:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    198339      C   ./ollama-linux-amd64               29MiB |
+-----------------------------------------------------------------------------+

I'll figure out how to get it unloaded between queries and update the PR.
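
For context, a rough cgo sketch of the kind of fix this implies (not ollama's actual gpu package; names and structure are assumptions): GPU discovery loads the CUDA runtime via dlopen, and keeping that handle open pins a small amount of VRAM in the main process, so the handle is released again once the server is idle.

```go
// Sketch only; function names and structure are assumptions.
package gpudiscovery

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// loadCudart opens the CUDA runtime shared library and returns its handle.
func loadCudart(path string) (unsafe.Pointer, error) {
	cPath := C.CString(path)
	defer C.free(unsafe.Pointer(cPath))
	h := C.dlopen(cPath, C.RTLD_LAZY)
	if h == nil {
		return nil, fmt.Errorf("unable to load %s", path)
	}
	return h, nil
}

// releaseCudart closes the handle so the driver can free the VRAM the runtime
// pinned in this process while it was loaded.
func releaseCudart(h unsafe.Pointer) {
	if h != nil {
		C.dlclose(h)
	}
}
```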

@dhiltgen (Collaborator, Author) commented:

GPU memory leak fixed and confirmed on Linux and Windows.

@oldmanjk commented Mar 31, 2024

The amount of memory seems to vary based on GPU VRAM capacity. On my 4090 (24 GiB), ollama uses 384 MiB on launch. My 1080 Ti (11 GiB) uses 136 MiB and my 1070 Ti's (8 GiB) use 100 MiB each. I can't check the 4090 ATM, but after loading and unloading one model, the other three cards are at 224, 188, and 188 MiB.

I started writing that before you posted. I'll leave it for posterity.

dhiltgen force-pushed the subprocess branch 3 times, most recently from 65f6cd7 to 8d47b8b on April 1, 2024 at 22:30
Commit messages included in this PR:

- This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process and shut down when idle, and gracefully restart if it has problems. This also serves as a first step to be able to run multiple copies to support multiple models concurrently.
- Cleaner shutdown logic, a bit of response hardening.
- "cudart init failure: 35" isn't particularly helpful in the logs.
- We may have users that run into problems with our current payload model, so this gives us an escape valve.
- Leaving the cudart library loaded kept ~30 MiB of memory pinned in the GPU in the main process. This change ensures we don't hold GPU resources when idle.
dhiltgen merged commit c863c6a into ollama:main on Apr 2, 2024
15 checks passed
dhiltgen deleted the subprocess branch on April 2, 2024 at 17:49
jmorganca mentioned this pull request on Apr 7, 2024
volfyd mentioned this pull request on May 13, 2024