Enable concurrency by default #4218
Conversation
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
Until ROCm v6.2 ships, we won't be able to get accurate free-memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in, but they will need to pay attention to model sizes, otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs have accurate VRAM reporting wired up now, so we can turn on concurrency by default.
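To make the "find a reasonable default that fits" idea concrete, here is a minimal sketch of that fitting loop. It is not the actual scheduler code; `pickNumParallel`, `estimateVRAM`, and all constants are hypothetical stand-ins, assuming only that the KV cache scales with num_ctx times the parallel slot count:

```go
package main

import "fmt"

const preferredParallel = 4 // hypothetical preferred default when VRAM allows

// estimateVRAM is a stand-in for the real memory estimate: model weights
// plus a KV cache whose size scales with numCtx * numParallel.
func estimateVRAM(weights, perCtxByte uint64, numCtx, numParallel int) uint64 {
	return weights + perCtxByte*uint64(numCtx)*uint64(numParallel)
}

// pickNumParallel steps down from the preferred parallelism until the
// estimate fits in free VRAM, falling back to 1.
func pickNumParallel(freeVRAM, weights, perCtxByte uint64, numCtx int) int {
	for p := preferredParallel; p > 1; p-- {
		if estimateVRAM(weights, perCtxByte, numCtx, p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	// Example: 8 GiB free, 4 GiB of weights, ~160 KiB of cache per context token.
	fmt.Println(pickNumParallel(8<<30, 4<<30, 160<<10, 2048)) // prints 4
}
```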
Provide consistent ordering for the ps command - longest duration listed first
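As a hedged illustration of that ordering rule (reading "longest duration" as the keep-alive expiry furthest in the future), a sort along these lines would do it; the `loadedModel` type here is illustrative, not ollama's actual struct:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type loadedModel struct {
	Name      string
	ExpiresAt time.Time
}

func main() {
	now := time.Now()
	models := []loadedModel{
		{"llama3", now.Add(2 * time.Minute)},
		{"phi3", now.Add(5 * time.Minute)},
	}
	// Longest remaining duration (latest expiry) first, so ps output is stable.
	sort.Slice(models, func(i, j int) bool {
		return models[i].ExpiresAt.After(models[j].ExpiresAt)
	})
	for _, m := range models {
		fmt.Println(m.Name, "expires in", m.ExpiresAt.Sub(now))
	}
}
```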
@@ -85,13 +85,13 @@ func AsMap() map[string]EnvVar {
 	"OLLAMA_HOST": {"OLLAMA_HOST", Host, "IP Address for the ollama server (default 127.0.0.1:11434)"},
 	"OLLAMA_KEEP_ALIVE": {"OLLAMA_KEEP_ALIVE", KeepAlive, "The duration that models stay loaded in memory (default \"5m\")"},
 	"OLLAMA_LLM_LIBRARY": {"OLLAMA_LLM_LIBRARY", LLMLibrary, "Set LLM library to bypass autodetection"},
-	"OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners, "Maximum number of loaded models (default 1)"},
+	"OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners, "Maximum number of loaded models per GPU (default auto)"},
nit: Is the default 0? Might be OK to put that, so folks don't do OLLAMA_MAX_LOADED_MODELS=auto
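The "0 means auto" convention the nit is pointing at could be parsed like this; a sketch only (the real parsing lives in ollama's envconfig package), with `maxRunners` as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxRunners returns the configured model limit, where 0 (or unset) means
// the server should pick a per-GPU default automatically.
func maxRunners() int {
	v := os.Getenv("OLLAMA_MAX_LOADED_MODELS")
	if v == "" {
		return 0 // auto
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		return 0 // invalid values fall back to auto rather than erroring
	}
	return n
}

func main() {
	fmt.Println("max loaded models:", maxRunners()) // 0 = auto per-GPU default
}
```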
@@ -23,6 +23,7 @@ type LlmRequest struct {
 	ctx context.Context //nolint:containedctx
 	model *Model
 	opts api.Options
+	origNumCTX int // Track the initial ctx request
I think we case it as NumCtx in other parts of the codebase.
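Why tracking the original context matters, in a short sketch: the effective context is the requested one multiplied by the parallel slot count, so if the scheduler retries with a smaller parallel value it must rescale from the original request, not the already-multiplied one. Names here are illustrative, not the PR's actual code:

```go
package main

import "fmt"

// effectiveCtx is the context the runner actually allocates: the originally
// requested size times the number of parallel slots sharing it.
func effectiveCtx(origNumCtx, numParallel int) int {
	return origNumCtx * numParallel
}

func main() {
	orig := 2048
	fmt.Println(effectiveCtx(orig, 4)) // 8192 with 4 parallel slots
	fmt.Println(effectiveCtx(orig, 2)) // 4096 after stepping parallel down
}
```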
LGTM
Corresponding doc update to merge after this: #5364