
add keep_alive to generate/chat/embedding api endpoints #2146

Merged 3 commits into main from keepalive on Jan 26, 2024

Conversation

pdevine
Contributor

@pdevine pdevine commented Jan 22, 2024

This change adds a new keep_alive parameter to /api/generate which controls how long a model stays loaded in memory after a request. There are three cases:

  1. if keep_alive is not set, the model will stay loaded for the default value (5 minutes);
  2. if keep_alive is set to a positive duration (e.g. "20m"), it will stay loaded for the duration;
  3. if keep_alive is set to a negative duration (e.g. "-1m"), it will stay loaded indefinitely

If you want the model to be unloaded immediately after generation, you can set it to "0m", or even just 0. Also, and maybe most importantly, subsequent calls to /api/generate will change the load duration: even if you call it once with a negative value and the next caller omits the parameter, the model will only stay in memory for 5 minutes after the second call.
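For example (a minimal sketch, reusing the tinyllama model mentioned later in this thread), each of these requests picks a different keep-alive behavior:

  curl http://localhost:11434/api/generate -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "keep_alive": "20m"}'  # stay loaded for 20 minutes
  curl http://localhost:11434/api/generate -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "keep_alive": "-1m"}'  # stay loaded indefinitely
  curl http://localhost:11434/api/generate -d '{"model": "tinyllama", "prompt": "Why is the sky blue?", "keep_alive": 0}'  # unload immediately after generating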

Note that this change only applies to /api/generate. We can either layer the changes for /api/chat on top of this change, or push them as a separate PR.

resolves #1339

@DuckyBlender

This is amazing, very excited for this. My HDD is the main bottleneck when using ollama (my SSD broke, RIP).

@f0rodo

f0rodo commented Jan 25, 2024

Very excited about this work. Looking forward to reducing time to first token in my applications.

@pdevine
Contributor Author

pdevine commented Jan 26, 2024

I should also mention that you can send either a duration string like "5s" or a float value in seconds. Keep in mind that subsequent requests that omit the keep_alive parameter will revert to the 5 minute default, so you should always pass the parameter if you want to keep the model loaded (or unload it immediately).
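For instance (a sketch, again using tinyllama as a stand-in model name), these two requests should behave identically, since 300 seconds is 5 minutes:

  curl http://localhost:11434/api/generate -d '{"model": "tinyllama", "prompt": "hi", "keep_alive": "5m"}'
  curl http://localhost:11434/api/generate -d '{"model": "tinyllama", "prompt": "hi", "keep_alive": 300}'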

@pdevine pdevine marked this pull request as ready for review January 26, 2024 01:42
@dmitrykozlov

dmitrykozlov commented Jan 26, 2024

@pdevine It may not be convenient that any generate request controls / updates the session duration, since that is an administrative (write) action bundled into a consumption (read) action, i.e. the generate request. This may be a problem from a security standpoint going forward.

It would be nice to have the ability to update the default session duration in the model or server settings. This would also help with warming up a model on a hosted server, where the setting needs to be updated once for multiple clients / sessions.

@pdevine
Contributor Author

pdevine commented Jan 26, 2024

@dmitrykozlov absolutely, however there are a number of considerations here:

  • who has access to tell the server to keep the model in memory?
  • how long should they be able to leave it in memory for?
  • if it was set by the server instead, what models should be loaded for different durations? what if a user pulls a new model?
  • what if there are conflicting settings for keeping models loaded?

This change is more of a short-term solution. You could imagine a much richer solution with role-based access control and finer control over how/when things are loaded into memory.

@dmitrykozlov

@pdevine It looks like there is some misunderstanding. Let me describe the real use case better:

  1. The Ollama server is used to host a single model on a machine in a production environment.
  2. Access to the Ollama server is limited by a firewall, and only another (server) application running in the same isolated environment can access it.
  3. Users don't have access to the Ollama server API directly; only the application has access.

By server settings, I mean Ollama service settings. With this solution, the application has to send "keep_alive" on each request.

@pdevine pdevine changed the title from "add keep_alive to /api/generate" to "add keep_alive to generate/chat/embedding api endpoints" on Jan 26, 2024
@pdevine
Contributor Author

pdevine commented Jan 26, 2024

@dmitrykozlov yep, but if the application just sets keep_alive to -1, the model will always stay in memory. It's not ideal, but it should work.
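To sketch what that looks like for the /api/chat endpoint this PR also covers (the model name is just a placeholder), the application would send keep_alive with every request:

  curl http://localhost:11434/api/chat -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "hello"}], "keep_alive": -1}'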

@BruceMacD (Contributor) left a comment:

I gave this some stress testing and it seems good.

@mxyng (Contributor) left a comment:

Just the one outstanding comment on style, but otherwise this is great!

docs/api.md Outdated
@@ -43,6 +43,7 @@ Generate a response for a given prompt with a provided model. This is a streamin

Advanced parameters (optional):

- `keep_alive`: how long to keep the model in memory. If negative (e.g. `-1s`) the model will stay loaded. If omitted the model will stay loaded for 5 minutes
Member:

For the docs, we may want to only merge this commit once it's in an active release. Or we can flag it here as "upcoming"

Contributor Author:

I can just pull the docs from this PR and then add them back once the next version is out.

@pdevine pdevine force-pushed the keepalive branch 2 times, most recently from e5e1b28 to 0cf5815, on January 26, 2024 22:05
@pdevine pdevine merged commit b5cf31b into main Jan 26, 2024
13 checks passed
@pdevine pdevine deleted the keepalive branch January 26, 2024 22:28
@tjthejuggler

I have 0.1.22.
I want to use the new keep_alive feature, so I run this in the terminal:
curl http://localhost:11434/api/generate -d '{ "model": "tinyllama", "prompt": "Why is the sky blue??", "keep_alive": 0 }'
and I expect it to drop the model out of memory as soon as the generation completes. However, it doesn't matter how long I wait, it just stays in memory. Does anyone know why this might be?

It has been 10 minutes and this single request is still using 1430 MiB, long after it instantly produced the text:

0 N/A N/A 2260 C /usr/local/bin/ollama 1430MiB

@briankwest

This PR doesn't seem to work as outlined on a Mac M2 Ultra; I see the model get dumped the moment it's done, in all cases.
[Screenshot 2024-02-05 at 21:07:44]

@Pauldb8

Pauldb8 commented Apr 2, 2024

This doesn't work. I'm using Dify, which calls the Ollama endpoint and does not provide the keep_alive field.
I want to set it as a server-side parameter. Even if I add keep_alive: -1, the model will be unloaded when Dify starts making calls, which is completely broken behaviour. How do I keep it loaded without that? It's a server; I should be able to put a parameter in a config file and have the model be kept.
Running a curl command every minute doesn't work either, as it doesn't recognize it as the same model being used (Dify probably adds some parameter, which I don't know about, nor should have to, and hence it thinks they are different).
Any solutions?
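For reference, the "curl every minute" approach mentioned above would look something like the sketch below (tinyllama is a placeholder model name, and it assumes a prompt-less generate request simply loads the model and resets its keep-alive timer, as the API docs describe; note the caveat above that requests with different options may be treated as different loaded instances):

  while true; do
    # prompt-less generate request: just (re)loads the model and resets its keep-alive window
    curl -s http://localhost:11434/api/generate -d '{"model": "tinyllama", "keep_alive": "10m"}' > /dev/null
    sleep 60
  done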

@lcolok

lcolok commented Apr 22, 2024

> This doesn't work. I'm using Dify, which calls the Ollama endpoint and does not provide the keep_alive field. [...]

Same situation.

@Yash-1511

> This doesn't work. I'm using Dify, which calls the Ollama endpoint and does not provide the keep_alive field. [...]

Hey @Pauldb8, @lcolok
add: ollama keep alive parameter in dify

I have created a new field called keep_alive in the model parameters, so you can now set this parameter in Dify as well.

Development

Successfully merging this pull request may close these issues.

MacOS opens kernel tasks doesn't unload model