
Enhanced GPU discovery and multi-gpu support with concurrency #4517


Merged: 15 commits merged into ollama:main on Jun 14, 2024

Conversation

@dhiltgen (Collaborator) commented May 18, 2024

Carries #4266 and #4441 (and obsoletes them if we move this one forward first).

This refines our GPU discovery by splitting it into a bootstrap phase, where we discover information about the GPUs once at startup, and an incremental refresh that only updates free-space information, instead of fully rediscovering the GPUs over and over.

Fixes #3158
Fixes #4198
Fixes #3765
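
A minimal sketch of the bootstrap-then-refresh split described above (the package layout, type names, and stub functions are illustrative, not the actual ollama gpu package API):

package gpu

// GpuInfo is an illustrative stand-in for the properties discovered at startup.
type GpuInfo struct {
	ID          string
	Library     string // e.g. "cuda", "rocm", "oneapi"
	TotalMemory uint64
	FreeMemory  uint64
}

var discovered []GpuInfo

// Bootstrap performs the expensive discovery exactly once at startup:
// loading driver libraries, enumerating devices, reading static properties.
// Stubbed here with a single fake device.
func Bootstrap() {
	discovered = []GpuInfo{
		{ID: "GPU-0", Library: "cuda", TotalMemory: 24 << 30, FreeMemory: 24 << 30},
	}
}

// RefreshFreeMemory re-reads only the free VRAM of already-known GPUs, so the
// scheduler can poll it frequently without re-running full discovery.
func RefreshFreeMemory() []GpuInfo {
	for i := range discovered {
		discovered[i].FreeMemory = queryFreeMemory(discovered[i].ID)
	}
	return discovered
}

// queryFreeMemory is a placeholder for the per-device free-VRAM query that the
// real code performs through the vendor management libraries.
func queryFreeMemory(id string) uint64 {
	_ = id
	return 20 << 30
}

The point of the split is that the expensive driver probing happens once, while each scheduling decision only pays for a cheap free-memory query.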

@dhiltgen force-pushed the gpu_incremental branch 4 times, most recently from 05ba1ca to 91be1fa on May 20, 2024 20:50
@dhiltgen marked this pull request as ready for review on May 20, 2024 23:44
@dhiltgen force-pushed the gpu_incremental branch 4 times, most recently from ecde7d9 to d788717 on May 28, 2024 21:29
@dhiltgen force-pushed the gpu_incremental branch 2 times, most recently from f02b076 to 076450a on May 30, 2024 20:13
@dhiltgen marked this pull request as draft on May 30, 2024 20:45
@dhiltgen marked this pull request as ready for review on May 30, 2024 22:01
@dhiltgen force-pushed the gpu_incremental branch 4 times, most recently from bfbb50e to 137b4d9 on June 1, 2024 19:32
gpu/gpu.go Outdated

switch runtime.GOOS {
case "windows":
	oneapiMgmtName = "ze_intel_gpu64.dll"
@jmorganca (Member) commented Jun 2, 2024

This DLL gets installed as part of the Windows base install on systems with Intel iGPUs, and it doesn't always open reliably. It seems to be causing some crashes on both Win10 and Win11, so we may want to put this behind a flag until we resolve those issues.

@dhiltgen (Collaborator, Author) replied:

What I'm thinking is I'll add a temporary check to see if we have a oneapi runner available, and if not, disable GPU discovery for the oneapi library. That way it can still be built from source and theoretically work, but it will be a true no-op for the official builds until we can test it more fully.
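
A hypothetical sketch of that gating check (illustrative name and payload layout; as noted further down in the thread, this approach was ultimately dropped because of the circular dependency between the gpu and llm packages):

package gpu

import (
	"os"
	"path/filepath"
)

// oneapiRunnerAvailable sketches the gating idea: only attempt oneapi GPU
// discovery (and loading ze_intel_gpu64.dll) if a oneapi runner payload was
// actually built and shipped alongside the other runners.
func oneapiRunnerAvailable(payloadDir string) bool {
	info, err := os.Stat(filepath.Join(payloadDir, "oneapi"))
	return err == nil && info.IsDir()
}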

Member replied:

SG!

@dhiltgen (Collaborator, Author) replied:

Never mind - this would lead to circular dependencies since the llm package with the payloads depends on gpu.

I'm pretty sure I fixed the bug that led to the crash on oneapi initialization, so I think we'll be OK leaving this in place.

@@ -232,6 +228,10 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr

params = append(params, "--parallel", fmt.Sprintf("%d", numParallel))

if estimate.TensorSplit != "" {
	params = append(params, "--tensor-split", estimate.TensorSplit)
@jmorganca (Member) commented Jun 2, 2024

This is super cool! Can't wait to try it more on 2x, 4x and 8x GPU systems
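
For context, llama.cpp's --tensor-split flag takes a comma-separated list of per-GPU weights. A hedged sketch of how such a value could be derived from per-GPU free VRAM (illustrative only; the PR's estimate logic also accounts for which layers actually fit on each device):

package llm

import (
	"fmt"
	"strings"
)

// tensorSplitFromFreeVRAM builds a comma-separated per-GPU weighting in the
// shape llama.cpp's --tensor-split expects, using each GPU's free VRAM
// (in whole GiB) as its relative weight.
func tensorSplitFromFreeVRAM(freeBytes []uint64) string {
	parts := make([]string, len(freeBytes))
	for i, free := range freeBytes {
		parts[i] = fmt.Sprintf("%d", free>>30)
	}
	return strings.Join(parts, ",")
}

For example, two GPUs with 20 GiB and 10 GiB free would produce "20,10", so the first GPU receives roughly twice as many layers.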

@jmorganca (Member) left a review comment:

Overall looks great! Small comment re: some oneapi DLL open panics we are seeing on Windows boxes with iGPUs; we'd want to avoid making that part of the critical path until we resolve this.

"github.com/ollama/ollama/api"
"github.com/ollama/ollama/envconfig"
"github.com/ollama/ollama/gpu"
"github.com/stretchr/testify/assert"
Member commented:

Not critical for this PR, but if these are only simple checks it would be awesome to use t.Fatal, as the rest of the codebase sticks as close to stdlib as possible.

@dhiltgen (Collaborator, Author) replied:

I'll look at this in a follow-up.
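
For reference, the stdlib-style check being suggested looks roughly like this (hypothetical test name and values):

package server

import "testing"

// Hypothetical example of the stdlib-style check suggested above, in place of
// testify's assert.Equal(t, want, got).
func TestGpuCountExample(t *testing.T) {
	want := 2
	got := 2 // stand-in for the value produced by the code under test
	if got != want {
		t.Fatalf("expected %d GPUs, got %d", want, got)
	}
}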

dhiltgen added 13 commits on June 14, 2024 14:51:

The amdgpu driver's free-VRAM reporting omits some other apps, so leverage the upstream DRM driver, which keeps better tabs on things.

Now that we call the GPU discovery routines many times to update memory, this splits initial discovery from free-memory updating.

This worked remotely but wound up trying to spawn multiple servers locally, which doesn't work.

Still not complete; needs some refinement to our prediction to understand the discrete GPUs' available space so we can see how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.

Our default behavior today is to try to fit into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even if the model can fit into one. This exposes that tunable behavior.

Adjust timing on some tests so they don't time out on small/slow GPUs.

This library will give us the most reliable free-VRAM reporting on Windows to enable concurrent model scheduling.

While models are loading, the VRAM metrics are dynamic, so try to load on a GPU that doesn't have a model actively loading, or wait, to avoid races that lead to OOMs.
@dhiltgen force-pushed the gpu_incremental branch 2 times, most recently from 468530a to 4a79bad on June 14, 2024 21:52
@dhiltgen merged commit 45cacba into ollama:main on Jun 14, 2024
15 checks passed
@dhiltgen deleted the gpu_incremental branch on June 14, 2024 22:35
@dmatora commented Sep 9, 2024

This works great when dealing with the standard context size: when I load llama3.1:70b it detects all 4 GPUs, all 81 layers are offloaded, and everything works blazing fast.
But when I try to set the context size to 128K:

ollama show --modelfile llama3.1:70b > Modelfile3.1-70b
echo 'PARAMETER num_ctx 131072' >> Modelfile3.1-70b
ollama create llama3.1:70b-128k -f Modelfile3.1-70b

only 10 of 81 layers are offloaded.
96 GB is sufficient for all layers and 128K context when running on CPU in LM Studio, so I expect the same 96 GB of VRAM to be sufficient via ollama.

@dmatora commented Sep 9, 2024

Btw, I just tested ollama on 4x A100 (160 GB total), and while it did offload all 81 layers to GPU, it still gives me 'CUDA error: out of memory' when trying to run a 128K request, so something is really messed up here.
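
A rough back-of-envelope (a sketch, assuming an fp16 KV cache and Llama 3.1 70B's usual attention shape of 80 layers, 8 KV heads, and head dim 128; these are assumptions, not values measured from ollama) suggests the KV cache alone at 128K context is on the order of 40 GiB on top of the model weights, which goes a long way toward explaining both the reduced offload and the OOM:

package main

import "fmt"

func main() {
	// Assumed Llama 3.1 70B attention shape; not read from the model file here.
	const (
		layers   = 80
		kvHeads  = 8 // grouped-query attention
		headDim  = 128
		byteElem = 2 // fp16 KV cache
		numCtx   = 131072
	)
	perToken := int64(2 * layers * kvHeads * headDim * byteElem) // K and V, all layers
	total := perToken * numCtx
	fmt.Printf("KV cache: %d bytes/token, ~%d GiB at num_ctx=%d\n",
		perToken, total>>30, numCtx)
}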
