add keep_alive to generate/chat/embedding api endpoints #2146
Conversation
This is amazing, very excited for this. My HDD is the main bottleneck when using ollama (my ssd broke, rip).
Very excited about this work. Looking forward to reducing time to first token in my applications.
I should also mention that you can send either a duration like "5s" or a float value in seconds. Keep in mind that subsequent requests that do not have the
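As a sketch of the two accepted forms described above (the model name and prompt are placeholder values; only the `keep_alive` values come from the comment):

```python
import json

# Two equivalent /api/generate request bodies: keep_alive may be a
# duration string ("5s") or a float number of seconds (5.0).
as_duration = {"model": "llama2", "prompt": "hi", "keep_alive": "5s"}
as_seconds = {"model": "llama2", "prompt": "hi", "keep_alive": 5.0}

print(json.dumps(as_duration))
print(json.dumps(as_seconds))
```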
@pdevine It may not be convenient that any generate request can control / update the session duration, since that is an administrative (write) action performed as part of a consumption (read) action, i.e. the generate request. This may be a problem from a future-proofing security standpoint. It would be nice to have the ability to update the default session duration in the model or server settings. This would also help with warming up the model on server hosting, where settings need to be updated once for multiple clients / sessions.
@dmitrykozlov absolutely, however there are a number of considerations here:
This change is more of a short-term solution. You could imagine a much richer solution w/ role-based access control and also control over how/when things are loaded into memory.
@pdevine It looks like there is some misunderstanding. Let me describe the real use case better:
By server settings, I mean the ollama service settings. With this solution the application has to send `keep_alive` on each `/api/generate` request.
@dmitrykozlov yep, but if the application just sets
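One way an application could avoid repeating the field on every request, along the lines of this exchange, is a small client-side helper that injects an app-level default. This is a hypothetical sketch, not part of Ollama; the default value and helper name are assumptions:

```python
# Hypothetical app-side helper: inject a default keep_alive into every
# /api/generate payload so callers don't have to repeat it.
DEFAULT_KEEP_ALIVE = "30m"  # assumed application default, not an Ollama setting

def with_keep_alive(payload, keep_alive=DEFAULT_KEEP_ALIVE):
    """Return a copy of the payload with keep_alive set, unless the caller already chose one."""
    out = dict(payload)
    out.setdefault("keep_alive", keep_alive)
    return out
```

A caller that explicitly passes its own `keep_alive` still wins, since `setdefault` only fills in a missing key.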
I gave this some stress tests and it seems good
Just the comment on style that's outstanding but otherwise this is great!
docs/api.md (outdated)

@@ -43,6 +43,7 @@ Generate a response for a given prompt with a provided model. This is a streaming…

Advanced parameters (optional):

- `keep_alive`: how long to keep the model in memory. If negative (e.g. `-1s`) the model will stay loaded. If omitted the model will stay loaded for 5 minutes.
For the docs, we may want to only merge this commit once it's in an active release. Or we can flag it here as "upcoming"
I can just pull the docs from this PR and then add them back once the next version is out.
I have 0.1.22. It has been 10 minutes and this single request is still using 1430MiB, long after it instantly produced the text:
`0   N/A  N/A   2260   C   /usr/local/bin/ollama   1430MiB`
This doesn't work. I'm using Dify, which calls the Ollama endpoint and does not provide the keep_alive field.
Same situation.
Hey @Pauldb8, @lcolok, I have created a new field called keep_alive in the model parameters, so you can now set this parameter in Dify as well.
This change adds a new `keep_alive` parameter to `/api/generate` which controls how long a model is loaded and left in memory. There are three cases:

- if `keep_alive` is not set, the model will stay loaded for the default value (5 minutes);
- if `keep_alive` is set to a positive duration (e.g. "20m"), it will stay loaded for that duration;
- if `keep_alive` is set to a negative duration (e.g. "-1m"), it will stay loaded indefinitely.

If you wish the model to be unloaded immediately after generation, you can set it to "0m", or even just `0`. Also, maybe most importantly, subsequent calls to `/api/generate` will change the load duration, so even if you called it once with a negative value and the next caller omits it, the model will still only stay in memory for 5 minutes after the second call.

Note that this change only applies to `/api/generate`. We can either layer the changes for `/api/chat` on top of this change, or push them as a separate PR.

resolves #1339
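The three cases in the PR description can be sketched as a deadline computation. This is illustrative only, not the actual Ollama implementation; the function name and the use of plain seconds are assumptions:

```python
DEFAULT_SECONDS = 300.0  # the 5-minute default from the PR description

def unload_deadline(keep_alive_seconds, now):
    """Return the time at which the model should unload, or None for 'never'.

    keep_alive_seconds=None mimics an omitted keep_alive field.
    """
    if keep_alive_seconds is None:
        return now + DEFAULT_SECONDS    # case 1: default of 5 minutes
    if keep_alive_seconds < 0:
        return None                     # case 3: stay loaded indefinitely
    return now + keep_alive_seconds     # case 2 (and 0 = unload immediately)
```

Because each request recomputes the deadline from its own `keep_alive` value, a later request without the field falls back to the 5-minute default, matching the "subsequent calls change the load duration" behavior described above.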