Make maximum pending request configurable #4144

Merged: 2 commits merged on May 5, 2024

@dhiltgen dhiltgen commented May 3, 2024

  • Bump the maximum number of queued requests to 512 (from 10).
  • Make it configurable with a new env var, OLLAMA_MAX_QUEUE.
  • Return a 503 when the server is too busy, instead of the more generic 500.

Fixes #4124

With the added integration test, here are some quick memory stats on Linux:

  • Just starting ollama: RSS 429.0m
  • Load orca-mini: RSS 456.8m (just the Go process, not the child runner)
  • During my stress test where I push >512 connections: RSS 489.0m
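
The stress scenario can be approximated in a self-contained way. The following is a rough sketch, not the integration test added in this PR, and it does not measure RSS. It runs against a stand-in `httptest` server rather than ollama, and `stress`, `busyServer`, `runStress`, and `holdFor` are all hypothetical names.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// holdFor is how long admitted requests are held before being answered.
const holdFor = 200 * time.Millisecond

// stress fires n concurrent GET requests at url and tallies the status
// codes. Transport errors are counted under the key -1.
func stress(url string, n int) map[int]int {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		counts = make(map[int]int)
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			code := -1
			if resp, err := http.Get(url); err == nil {
				resp.Body.Close()
				code = resp.StatusCode
			}
			mu.Lock()
			counts[code]++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return counts
}

// busyServer admits at most limit requests at a time, holds them until
// release is closed, and answers 503 to anything beyond the limit.
func busyServer(limit int, release <-chan struct{}) *httptest.Server {
	slots := make(chan struct{}, limit)
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}:
			<-release // simulate a slow model load or generation
			<-slots
			w.WriteHeader(http.StatusOK)
		default:
			http.Error(w, "server busy", http.StatusServiceUnavailable)
		}
	}))
}

// runStress wires the pieces together and returns the status tally.
func runStress(limit, n int) map[int]int {
	release := make(chan struct{})
	srv := busyServer(limit, release)
	defer srv.Close()
	timer := time.AfterFunc(holdFor, func() { close(release) })
	defer timer.Stop()
	return stress(srv.URL, n)
}
```

Calling `runStress(512, 600)` mimics pushing more connections than the queue allows. Every request should come back as either 200 or 503, with at least `limit` of them succeeding; the exact split depends on timing.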

This also bumps up the default to be 50 queued requests instead of 10.
@dhiltgen dhiltgen merged commit 0268699 into ollama:main May 5, 2024
12 checks passed
@dhiltgen dhiltgen deleted the max_queue branch May 5, 2024 17:53
@kraileth kraileth mentioned this pull request May 28, 2024
Successfully merging this pull request may close these issues.

/api/embeddings responds with 500 before Ollama is initialized - handle max queued requests failure better