Make maximum pending request configurable #4144

dhiltgen · 2024-05-03T23:37:51Z

Bump the maximum queued requests to 512 (from 10).
Make it configurable with a new env var, OLLAMA_MAX_QUEUE.
Return a 503 when the server is too busy instead of the more generic 500.

Fixes #4124

With the added integration test, here are some quick memory stats on Linux:

Just starting ollama: RSS 429.0m
Load orca-mini: RSS 456.8m (just the Go process, not the child runner)
During my stress test, where I push >512 connections: RSS 489.0m
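
The shape of such a stress test can be sketched as follows. This is not the integration test added in this PR: it runs against a stand-in `httptest` server rather than ollama, and the helpers `newFullQueueServer` and `stress` are hypothetical. As a simplification, the stand-in never frees a slot, so it models a queue that does not drain during the burst.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// newFullQueueServer starts a stand-in server that admits at most limit
// requests and answers 503 to every request after that.
// Simplification: admitted slots are never released.
func newFullQueueServer(limit int) *httptest.Server {
	slots := make(chan struct{}, limit)
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}:
			w.WriteHeader(http.StatusOK)
		default:
			http.Error(w, "server busy", http.StatusServiceUnavailable)
		}
	}))
}

// stress fires n concurrent GET requests at url and returns how many
// responses were seen per status code. Transport errors are counted under -1.
func stress(url string, n int) map[int]int {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	counts := make(map[int]int)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			code := -1
			if resp, err := http.Get(url); err == nil {
				code = resp.StatusCode
				resp.Body.Close()
			}
			mu.Lock()
			counts[code]++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return counts
}

func main() {
	srv := newFullQueueServer(8)
	defer srv.Close()

	counts := stress(srv.URL, 20)
	fmt.Printf("200 OK: %d, 503 busy: %d\n",
		counts[http.StatusOK], counts[http.StatusServiceUnavailable])
}
```

Pointing `stress` at a real server with more connections than its queue limit would show whether the overflow is rejected with 503 rather than 500.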

server/sched.go

This also bumps up the default to be 50 queued requests instead of 10.

dhiltgen force-pushed the max_queue branch from bf38271 to 81b30ea Compare May 3, 2024 23:53

jmorganca reviewed May 5, 2024

Review comment on server/sched.go (outdated, resolved)

Make maximum pending request configurable

20f6c06

This also bumps up the default to be 50 queued requests instead of 10.

dhiltgen force-pushed the max_queue branch from 81b30ea to 20f6c06 Compare May 5, 2024 04:00

jmorganca approved these changes May 5, 2024

Add integration test to push max queue limits

45d61aa

dhiltgen merged commit 0268699 into ollama:main May 5, 2024
12 checks passed

dhiltgen deleted the max_queue branch May 5, 2024 17:53

kraileth mentioned this pull request May 28, 2024

Ollama on FreeBSD #1102

Open
