Enhanced GPU discovery and multi-gpu support with concurrency #4517
Conversation
gpu/gpu.go (Outdated)

    switch runtime.GOOS {
    case "windows":
        oneapiMgmtName = "ze_intel_gpu64.dll"
This DLL gets installed on Windows with Intel iGPUs as part of the OS base install and doesn't always open reliably; it seems to be causing some crashes on both Win10 and Win11, so we may want to put this behind a flag until we resolve those issues.
What I'm thinking is I'll add a temporary check to see if we have a oneapi runner available, and if not, disable GPU discovery for the oneapi library. That way it can still be built from source and theoretically work, but be a true no-op in the official builds until we can test it more fully.
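To illustrate the gating idea (which, per the follow-up below, was ultimately dropped because of a circular dependency), here is a minimal sketch with invented helper names; hasOneapiRunner, the runnersDir parameter, and the "oneapi*" payload naming are assumptions, and GpuInfo stands in for the gpu package's existing type:

    package gpu

    import (
        "log/slog"
        "path/filepath"
    )

    // hasOneapiRunner reports whether a oneapi runner payload was built in.
    // The runnersDir parameter and the "oneapi*" glob are illustrative only.
    func hasOneapiRunner(runnersDir string) bool {
        matches, err := filepath.Glob(filepath.Join(runnersDir, "oneapi*"))
        return err == nil && len(matches) > 0
    }

    // maybeDiscoverOneapi skips oneapi discovery entirely (so the DLL is never
    // loaded) when no oneapi runner is present, e.g. in the official builds.
    func maybeDiscoverOneapi(runnersDir string, discover func() []GpuInfo) []GpuInfo {
        if !hasOneapiRunner(runnersDir) {
            slog.Debug("no oneapi runner available, skipping oneapi GPU discovery")
            return nil
        }
        return discover()
    }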
Sounds good!
Never mind - this would lead to circular dependencies, since the llm package with the payloads depends on gpu.
I'm pretty sure I fixed the bug that led to the crash on oneapi initialization, so I think we'll be OK leaving this in place.
    @@ -232,6 +228,10 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr

        params = append(params, "--parallel", fmt.Sprintf("%d", numParallel))

        if estimate.TensorSplit != "" {
            params = append(params, "--tensor-split", estimate.TensorSplit)
        }
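For context, --tensor-split is the llama.cpp server flag that distributes layers across GPUs in proportion to a comma-separated list of values. A toy sketch of how such a string could be built from per-GPU free VRAM (illustrative only; the PR's memory estimate is more involved than this):

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // tensorSplit builds one comma-separated value per GPU, proportional to
    // each GPU's free VRAM in MiB. Illustrative only, not the PR's estimator.
    func tensorSplit(freeMiB []uint64) string {
        parts := make([]string, len(freeMiB))
        for i, free := range freeMiB {
            parts[i] = strconv.FormatUint(free, 10)
        }
        return strings.Join(parts, ",")
    }

    func main() {
        // e.g. three GPUs with 24 GiB, 24 GiB, and 48 GiB free
        fmt.Println(tensorSplit([]uint64{24576, 24576, 49152})) // "24576,24576,49152"
    }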
This is super cool! Can't wait to try it more on 2x, 4x and 8x gpu systems
Overall looks great! Small comment re: some oneapi DLL open panics we are seeing on Windows boxes with iGPUs - we'd want to avoid making that part of the critical path until we resolve this.
"github.com/ollama/ollama/api" | ||
"github.com/ollama/ollama/envconfig" | ||
"github.com/ollama/ollama/gpu" | ||
"github.com/stretchr/testify/assert" |
Not critical for this PR, but if these are only simple checks it would be awesome to use t.Fatal, as the rest of the codebase sticks as close to stdlib as possible.
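For illustration, the stdlib-flavored pattern being suggested in place of testify (a made-up test, not one from this PR):

    package server_test

    import "testing"

    // Instead of assert.Equal(t, want, got), fail the test directly with t.Fatalf.
    func TestExample(t *testing.T) {
        got, want := 2+2, 4
        if got != want {
            t.Fatalf("got %d, want %d", got, want)
        }
    }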
I'll look at this in a follow-up.
This reverts commit 476fb8e.
The amdgpu driver's free VRAM reporting omits some other apps, so leverage the upstream DRM driver, which keeps better tabs on things.
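A minimal sketch of reading VRAM usage from the DRM sysfs interface on Linux, assuming the amdgpu mem_info_vram_total / mem_info_vram_used files; the PR's actual implementation may differ:

    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    // readVRAM reads total and used VRAM (in bytes) for one amdgpu card from
    // the DRM sysfs files mem_info_vram_total and mem_info_vram_used.
    func readVRAM(card string) (total, used uint64, err error) {
        base := "/sys/class/drm/" + card + "/device/"
        read := func(name string) (uint64, error) {
            b, err := os.ReadFile(base + name)
            if err != nil {
                return 0, err
            }
            return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
        }
        if total, err = read("mem_info_vram_total"); err != nil {
            return 0, 0, err
        }
        if used, err = read("mem_info_vram_used"); err != nil {
            return 0, 0, err
        }
        return total, used, nil
    }

    func main() {
        total, used, err := readVRAM("card0")
        if err != nil {
            fmt.Println("read failed:", err)
            return
        }
        fmt.Printf("free VRAM: %d bytes\n", total-used)
    }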
Now that we call the GPU discovery routines many times to update memory, this splits initial discovery from free memory updating.
This worked remotely, but wound up trying to spawn multiple servers locally, which doesn't work.
Still not complete; our prediction needs some refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
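To make that constraint concrete, here is a toy illustration of per-GPU layer fitting (made-up numbers and a greedy strategy for illustration, not the PR's estimator):

    package main

    import "fmt"

    // fitLayers greedily assigns whole layers to GPUs, never splitting a
    // single layer across two GPUs. Returns the number of layers placed per GPU.
    func fitLayers(freeBytes []uint64, layerBytes uint64, totalLayers int) []int {
        placed := make([]int, len(freeBytes))
        remaining := totalLayers
        for i, free := range freeBytes {
            fits := int(free / layerBytes)
            if fits > remaining {
                fits = remaining
            }
            placed[i] = fits
            remaining -= fits
        }
        return placed
    }

    func main() {
        // Two 24 GiB GPUs and one 12 GiB GPU, 500 MiB per layer, 81 layers.
        free := []uint64{24 << 30, 24 << 30, 12 << 30}
        fmt.Println(fitLayers(free, 500<<20, 81)) // [49 32 0]
    }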
Our default behavior today is to try to fit into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even if the model can fit into one. This exposes that tunable behavior.
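For illustration, a sketch of how such a toggle could be read from the environment; the OLLAMA_SCHED_SPREAD name is an assumption here, not something confirmed by this thread, so check envconfig for the knob the change actually exposes:

    package main

    import (
        "fmt"
        "os"
        "strconv"
    )

    // schedSpread reports whether the user asked to always spread a model
    // across all GPUs. The variable name is an assumption for illustration.
    func schedSpread() bool {
        v, err := strconv.ParseBool(os.Getenv("OLLAMA_SCHED_SPREAD"))
        return err == nil && v
    }

    func main() {
        if schedSpread() {
            fmt.Println("spreading layers across all available GPUs")
        } else {
            fmt.Println("preferring a single GPU when the model fits")
        }
    }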
Adjust timing on some tests so they don't time out on small/slow GPUs.
This library will give us the most reliable free VRAM reporting on Windows to enable concurrent model scheduling.
While models are loading, the VRAM metrics are dynamic, so try to load on a GPU that doesn't have a model actively loading, or wait to avoid races that lead to OOMs
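A rough sketch of that scheduling preference, using hypothetical types (the PR's scheduler tracks this state differently):

    package main

    import "fmt"

    // gpuState is a toy stand-in for the scheduler's view of one GPU.
    type gpuState struct {
        ID      string
        Loading bool // a model is currently being loaded on this GPU
    }

    // pickGPU prefers a GPU with no load in flight, since free-VRAM numbers
    // are unreliable while another model is still loading. Returns "" to
    // signal the caller should wait and retry rather than risk an OOM.
    func pickGPU(gpus []gpuState) string {
        for _, g := range gpus {
            if !g.Loading {
                return g.ID
            }
        }
        return ""
    }

    func main() {
        gpus := []gpuState{{ID: "GPU-0", Loading: true}, {ID: "GPU-1", Loading: false}}
        fmt.Println(pickGPU(gpus)) // GPU-1
    }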
This works great when dealing with a standard context size - when I load llama 3.1:70b it detects all 4 GPUs, all 81 layers are offloaded, and everything works blazing fast.
But with a larger context size, only 10 layers of 81 are offloaded.
Btw, I just tested ollama on 4x A100 (160GB total), and while it did offload all 81 layers to GPU, it still gives me 'CUDA error: out of memory' when trying to run a 128k-context request, so something is really messed up here.
Carries (and obsoletes if we move this one forward first) #4266 and #4441
This refines our GPU discovery to split it into bootstrapping, where we discover information about the GPUs once at startup, and then incremental refreshes of just the free space information, instead of fully rediscovering the GPUs over and over.
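A minimal sketch of that split, using invented names (Bootstrap / RefreshFreeMemory and the pared-down GpuInfo fields) to illustrate the shape of the API rather than the PR's actual functions:

    package gpu

    // GpuInfo here is a pared-down illustration of the gpu package's type.
    type GpuInfo struct {
        ID          string
        Library     string // e.g. "cuda", "rocm", "oneapi"
        TotalMemory uint64
        FreeMemory  uint64
    }

    // Bootstrap runs once at startup: enumerate devices, load management
    // libraries, and record the static facts that never change.
    func Bootstrap(enumerate func() []GpuInfo) []GpuInfo {
        return enumerate()
    }

    // RefreshFreeMemory is the cheap path the scheduler calls repeatedly:
    // it only re-reads free VRAM for GPUs that were already discovered.
    func RefreshFreeMemory(gpus []GpuInfo, queryFree func(id string) uint64) {
        for i := range gpus {
            gpus[i].FreeMemory = queryFree(gpus[i].ID)
        }
    }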
Fixes #3158
Fixes #4198
Fixes #3765