Enable concurrency by default #4218

Merged 3 commits from the auto_parallel branch into ollama:main on Jul 1, 2024
Conversation

@dhiltgen (Collaborator) commented on May 7, 2024

This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. The parallel setting has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.

Corresponding docs update to merge after this: #5364
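To make the described behavior concrete, here is a minimal Go sketch of the idea under assumptions of my own; the helper names (pickDefaultParallel, estimateVRAM) and the cost model are hypothetical, not the PR's actual code. The point is simply to step down from a preferred parallelism until the estimate, which grows with num_ctx times parallel, fits the reported free VRAM.

```go
// Hypothetical sketch, not the PR's actual code: pick a default parallelism
// so that the model plus its KV cache (which scales with num_ctx * parallel)
// still fits in free VRAM, falling back to 1 if nothing larger fits.
package main

import "fmt"

// estimateVRAM is a stand-in for a real memory estimator: model weights plus
// a per-token KV-cache cost across all parallel context slots.
func estimateVRAM(modelBytes, perTokenBytes uint64, numCtx, parallel int) uint64 {
	return modelBytes + perTokenBytes*uint64(numCtx*parallel)
}

// pickDefaultParallel steps down from a preferred parallelism until the
// estimate fits the reported free VRAM.
func pickDefaultParallel(freeVRAM, modelBytes, perTokenBytes uint64, numCtx, preferred int) int {
	for p := preferred; p > 1; p-- {
		if estimateVRAM(modelBytes, perTokenBytes, numCtx, p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	const GiB = 1 << 30
	// Example: 4 GiB of weights, 128 KiB per cached token, 2048-token context,
	// preferred parallelism of 4, on a GPU reporting 8 GiB free.
	p := pickDefaultParallel(8*GiB, 4*GiB, 128<<10, 2048, 4)
	fmt.Println("chosen parallel:", p)
}
```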

@dhiltgen dhiltgen force-pushed the auto_parallel branch 2 times, most recently from f846c36 to 3913acf on June 6, 2024 20:07
@dhiltgen dhiltgen force-pushed the auto_parallel branch 3 times, most recently from b84f3cc to f4742e1 on June 14, 2024 22:37
@dhiltgen dhiltgen marked this pull request as ready for review on June 14, 2024 22:40
@dhiltgen dhiltgen force-pushed the auto_parallel branch 2 times, most recently from 085ae40 to b1e7a74 on June 20, 2024 23:50
This adjusts our default settings to enable multiple models and parallel
requests to a single model.  Users can still override these by the same
env var settings as before.  Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small VRAM GPUs
so this change also refines the algorithm so that when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s).  As before, multiple models will only load
concurrently if they fully fit in VRAM.
Until ROCm v6.2 ships, we won't be able to get accurate free memory
reporting on Windows, which makes automatic concurrency too risky.
Users can still opt in, but they will need to pay attention to model
sizes; otherwise they may thrash/page VRAM or cause OOM crashes.
All other platforms and GPUs have accurate VRAM reporting wired
up now, so we can turn on concurrency by default.
Provide consistent ordering for the ps command - longest duration listed first
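For illustration only, a hedged sketch of what "longest duration listed first" ordering could look like; the loadedModel type and helper below are hypothetical, not the PR's implementation.

```go
// Hypothetical sketch of a ps-style listing sorted so the model with the
// longest remaining duration comes first.
package main

import (
	"fmt"
	"sort"
	"time"
)

type loadedModel struct {
	Name      string
	ExpiresAt time.Time
}

func sortByLongestDuration(models []loadedModel) {
	sort.Slice(models, func(i, j int) bool {
		// A later expiration means a longer remaining duration, so it sorts first.
		return models[i].ExpiresAt.After(models[j].ExpiresAt)
	})
}

func main() {
	now := time.Now()
	models := []loadedModel{
		{Name: "llama3", ExpiresAt: now.Add(2 * time.Minute)},
		{Name: "phi3", ExpiresAt: now.Add(10 * time.Minute)},
	}
	sortByLongestDuration(models)
	for _, m := range models {
		fmt.Println(m.Name, time.Until(m.ExpiresAt).Round(time.Minute))
	}
}
```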
@@ -85,13 +85,13 @@ func AsMap() map[string]EnvVar {
 	"OLLAMA_HOST": {"OLLAMA_HOST", Host, "IP Address for the ollama server (default 127.0.0.1:11434)"},
 	"OLLAMA_KEEP_ALIVE": {"OLLAMA_KEEP_ALIVE", KeepAlive, "The duration that models stay loaded in memory (default \"5m\")"},
 	"OLLAMA_LLM_LIBRARY": {"OLLAMA_LLM_LIBRARY", LLMLibrary, "Set LLM library to bypass autodetection"},
-	"OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners, "Maximum number of loaded models (default 1)"},
+	"OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners, "Maximum number of loaded models per GPU (default auto)"},
Member commented:
nit: Is the default 0? might be ok to put that, so folks don't do OLLAMA_MAX_LOADED_MODELS=auto
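To illustrate the point being discussed, a hypothetical sketch in which an unset or non-positive OLLAMA_MAX_LOADED_MODELS resolves to 0, meaning "auto"; this is illustrative only, not the envconfig package's actual code.

```go
// Hypothetical sketch: an unset (or 0) OLLAMA_MAX_LOADED_MODELS falls back to
// an automatic per-GPU default chosen elsewhere (e.g. by the scheduler).
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxLoadedModels returns the configured limit, or 0 to signal "auto".
func maxLoadedModels() int {
	if v := os.Getenv("OLLAMA_MAX_LOADED_MODELS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 0 // 0 == auto
}

func main() {
	fmt.Println("max loaded models:", maxLoadedModels())
}
```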

@@ -23,6 +23,7 @@ type LlmRequest struct {
 	ctx   context.Context //nolint:containedctx
 	model *Model
 	opts  api.Options
+	origNumCTX int // Track the initial ctx request
Member commented:
I think we case it as NumCtx in other parts of the codebase
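As background for why the original context size is tracked at all: the PR description notes that the parallel setting has a direct impact on num_ctx, so the effective context handed to a runner scales with parallelism, and a retry at lower parallelism must recompute from the caller's original request. A hypothetical sketch (names are illustrative, not the PR's code):

```go
// Hypothetical sketch: keep the caller's original num_ctx so that rescaling
// for a different parallelism never compounds previous scaling.
package main

import "fmt"

type llmRequest struct {
	numCtx     int // effective context passed to the runner
	origNumCtx int // the caller's original num_ctx, before scaling
}

// applyParallel scales the effective context by the chosen parallelism,
// always starting from the original request.
func (r *llmRequest) applyParallel(parallel int) {
	r.numCtx = r.origNumCtx * parallel
}

func main() {
	req := &llmRequest{origNumCtx: 2048}
	req.applyParallel(4)
	fmt.Println("effective num_ctx:", req.numCtx) // 8192
	// If the model doesn't fit, retry with less parallelism.
	req.applyParallel(1)
	fmt.Println("effective num_ctx:", req.numCtx) // 2048
}
```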

@jmorganca (Member) left a comment:

LGTM

@dhiltgen dhiltgen merged commit 3518aae into ollama:main Jul 1, 2024
12 checks passed
@dhiltgen dhiltgen deleted the auto_parallel branch July 1, 2024 15:32