Switch back to subprocessing for llama.cpp #3218
Conversation
Force-pushed from b8a6cc0 to d0fb79b
llm/server.go (outdated)
    return fmt.Errorf("timed out waiting for llama runner to start")
}
if s.cmd.ProcessState != nil {
    return fmt.Errorf("llama runner process no longer running: %d", s.cmd.ProcessState.ExitCode())
This error will likely be returned for many different "root" errors, which will cause many people to end up in issues not related to their actual problems.
You can see in a previous implementation that I redirected the stderr to a function that intercepts the log messages to track the actual error. If possible I'd like to do something similar again.
Line 430 in 5e7fd69
statusWriter := NewStatusWriter()
Thanks! I'll port this code over.
With the current structure, this will have somewhat limited value. The common scenario is a failure in the GPU runner; the retry logic means we'll fall back to the CPU runner(s), and the captured last error will only be reported in the log. If there's a problem with the model itself, we'd iterate through all the runners, each would fail, and we would bubble that error back up through the API.
In the future it may be interesting to explore warnings bubbling up in the API so that we could capture this fallback scenario and report why we couldn't load on the GPU but are running on CPU.
Force-pushed from 0017bd1 to 87818f4
Force-pushed from 1569569 to b77d6ba
Not fixed for me. Before updating, ollama didn't use any (significant, at least) memory on startup. Now, the instance mapped to my 1080 Ti (11 GiB) is using 136 MiB and the instances mapped to my 1070 Ti's (8 GiB) are using 100 MiB each. This is before loading any models. Not too cool
@oldmanjk this PR switches the GPU-specific code over to a subprocess, which will unload after the keep-alive timeout expires as long as no requests are being sent. If you start the server and don't make any requests, no GPU resources should be used. Once you send a request, we'll start using GPU resources. If that's not the behavior you're seeing, can you explain your test setup a bit more, including where you see it holding VRAM indefinitely while idle?
That's not the behavior I'm seeing on two separate machines. What specifically would you like to know about them?
Ah, this is an unintended side effect of rebasing on top of #2279, which was recently merged. Thanks for catching this @oldmanjk! It looks like our new cudart-based GPU discovery isn't unloading when idle. The brunt of the VRAM is cleared with the subprocess, but the main process still holds ~30 MiB. At startup:
After loading a small model:
After it unloads the subprocess once idle:
I'll figure out how to get it unloaded between queries and update the PR.
GPU memory leak fixed and confirmed on Linux and Windows.
The amount of memory seems to vary based on GPU VRAM capacity. On my 4090 (24 GiB), ollama uses 384 MiB on launch. My 1080 Ti (11 GiB) uses 136 MiB and my 1070 Ti's (8 GiB) use 100 MiB each. I can't check the 4090 ATM, but after loading and unloading one model, the other three cards are at 224, 188, and 188 MiB. I started writing that before you posted. I'll leave it for posterity
Force-pushed from 65f6cd7 to 8d47b8b
This should resolve a number of memory leak and stability defects by allowing us to isolate llama.cpp in a separate process and shutdown when idle, and gracefully restart if it has problems. This also serves as a first step to be able to run multiple copies to support multiple models concurrently.
Cleaner shutdown logic, a bit of response hardening
"cudart init failure: 35" isn't particularly helpful in the logs.
We may have users that run into problems with our current payload model, so this gives us an escape valve.
Leaving the cudart library loaded kept ~30 MiB of memory pinned on the GPU in the main process. This change ensures we don't hold GPU resources when idle.
Tested on Windows, Linux, and Mac, on Nvidia and AMD; also simulated a number of different failure modes to ensure the server detected the runner not responding and restarted it on the next request.
Fixes #1691
Fixes #1848
Fixes #1871
Fixes #2767