add keep_alive to generate/chat/embedding api endpoints #2146
Conversation
This is amazing, very excited for this. My HDD is the main bottleneck when using ollama (my ssd broke, rip).
Very excited about this work. Looking forward to reducing time to first token in my applications.
I should also mention that you can send either a duration like "5s" or a float value in seconds. Keep in mind that subsequent requests that do not have the
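As a sketch of the two accepted forms described above (the model name and prompt are placeholder values; only the `keep_alive` values come from the comment):

```python
import json

# Two equivalent /api/generate request bodies: keep_alive may be a
# duration string ("5s") or a float number of seconds (5.0).
as_duration = {"model": "llama2", "prompt": "hi", "keep_alive": "5s"}
as_seconds = {"model": "llama2", "prompt": "hi", "keep_alive": 5.0}

print(json.dumps(as_duration))
print(json.dumps(as_seconds))
```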
@pdevine It may not be convenient that any generate request can control / update the session duration, since that is an administrative (write) action performed as part of a consumption (read) action, i.e. the generate request. This may be a problem from a future-proofing security standpoint. It would be nice to have the ability to update the default session duration in the model or server settings. This would also help with warming up the model on server hosting, where settings need to be updated once for multiple clients / sessions.
@dmitrykozlov absolutely, however there are a number of considerations here:
This change is more of a short-term solution. You could imagine a much richer solution w/ role-based access control and also control over how/when things are loaded into memory.
@pdevine It looks like there is some misunderstanding. Let me describe the real use case better:
By server settings, I mean the ollama service settings. With this solution the application has to send `keep_alive` on each `/api/generate` request.
@dmitrykozlov yep, but if the application just sets
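One way an application could avoid repeating the field on every request, along the lines of this exchange, is a small client-side helper that injects an app-level default. This is a hypothetical sketch, not part of Ollama; the default value and helper name are assumptions:

```python
# Hypothetical app-side helper: inject a default keep_alive into every
# /api/generate payload so callers don't have to repeat it.
DEFAULT_KEEP_ALIVE = "30m"  # assumed application default, not an Ollama setting

def with_keep_alive(payload, keep_alive=DEFAULT_KEEP_ALIVE):
    """Return a copy of the payload with keep_alive set, unless the caller already chose one."""
    out = dict(payload)
    out.setdefault("keep_alive", keep_alive)
    return out
```

A caller that explicitly passes its own `keep_alive` still wins, since `setdefault` only fills in a missing key.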
I gave this some stress tests and it seems good
Just the comment on style that's outstanding but otherwise this is great!
docs/api.md (outdated)

@@ -43,6 +43,7 @@ Generate a response for a given prompt with a provided model. This is a streaming…

Advanced parameters (optional):

- `keep_alive`: how long to keep the model in memory. If negative (e.g. `-1s`) the model will stay loaded. If omitted the model will stay loaded for 5 minutes.
For the docs, we may want to only merge this commit once it's in an active release. Or we can flag it here as "upcoming"
I can just pull the docs from this PR and then add them back once the next version is out.
I have 0.1.22. It has been 10 minutes and this single request is still using 1430MiB, long after it instantly produced the text:
`0   N/A  N/A   2260   C   /usr/local/bin/ollama   1430MiB`
This doesn't work. I'm using Dify, which calls the Ollama endpoint and does not provide the keep_alive field.
Same situation.
Hey @Pauldb8, @lcolok, I have created a new field called keep_alive in the model parameters, so you can now set this parameter in Dify as well.
This change adds a new `keep_alive` parameter to `/api/generate` which controls how long a model is loaded and left in memory. There are three cases:

- if `keep_alive` is not set, the model will stay loaded for the default value (5 minutes);
- if `keep_alive` is set to a positive duration (e.g. "20m"), it will stay loaded for that duration;
- if `keep_alive` is set to a negative duration (e.g. "-1m"), it will stay loaded indefinitely.

If you wish the model to be unloaded immediately after generation, you can set it to "0m", or even just `0`. Also, maybe most importantly, subsequent calls to `/api/generate` will change the load duration, so even if you called it once with a negative value and the next caller omits it, the model will still only stay in memory for 5 minutes after the second call.

Note that this change only applies to `/api/generate`. We can either layer the changes for `/api/chat` on top of this change, or push them as a separate PR.

resolves #1339
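The three cases in the PR description can be sketched as a deadline computation. This is illustrative only, not the actual Ollama implementation; the function name and the use of plain seconds are assumptions:

```python
DEFAULT_SECONDS = 300.0  # the 5-minute default from the PR description

def unload_deadline(keep_alive_seconds, now):
    """Return the time at which the model should unload, or None for 'never'.

    keep_alive_seconds=None mimics an omitted keep_alive field.
    """
    if keep_alive_seconds is None:
        return now + DEFAULT_SECONDS    # case 1: default of 5 minutes
    if keep_alive_seconds < 0:
        return None                     # case 3: stay loaded indefinitely
    return now + keep_alive_seconds     # case 2 (and 0 = unload immediately)
```

Because each request recomputes the deadline from its own `keep_alive` value, a later request without the field falls back to the 5-minute default, matching the "subsequent calls change the load duration" behavior described above.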