feat: initial watchdog implementation #1341

Merged
merged 9 commits into from Nov 26, 2023

mudler (Owner) commented Nov 25, 2023

Description

This PR fixes #1339 and fixes #1202. It should also alleviate issues like #1017 and ggerganov/llama.cpp#3969 once and for all.

The WatchDog implementation (disabled by default) is designed to monitor and manage multiple backends. It keeps track of the last active time and last idle time of each backend, and can stop any backend that has been busy or idle for too long.

Key components of the WatchDog struct include:

  • timetable: A map that stores the last active time of each backend.
  • idleTime: A map that stores the last idle time of each backend.
  • timeout and idletimeout: Duration values that represent the maximum allowed busy and idle times for a backend, respectively.

To turn on the watchdog, configure the following environment variables:

### Watchdog settings
###
# Enables watchdog to kill backends that are inactive for too much time
# WATCHDOG_IDLE=true
#
# Enables watchdog to kill backends that are busy for too much time
# WATCHDOG_BUSY=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered idle
# WATCHDOG_IDLE_TIMEOUT=5m
#
# Time in duration format (e.g. 1h30m) after which a backend is considered busy
# WATCHDOG_BUSY_TIMEOUT=5m

With the CLI: --enable-watchdog-idle, --enable-watchdog-busy, --watchdog-busy-timeout, --watchdog-idle-timeout.

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

netlify bot commented Nov 25, 2023

Deploy Preview for localai canceled.

Name Link
🔨 Latest commit de1abfc
🔍 Latest deploy log https://app.netlify.com/sites/localai/deploys/65635d914fee5d0008fbbd2b

dave-gray101 (Collaborator) commented

Can you give a quick example of how this is supposed to interact with multiple backends? Currently, it looks like the timeout setting isn't backend specific, which is probably fine for now - but since I'd like to leverage this to improve the monitoring endpoints, I want to make sure I understand when the timer is being reset and what that interval is measuring. Thanks!

mudler (Owner, Author) commented Nov 26, 2023

It currently monitors all active connections. Connections are recorded by the gRPC client: when a backend becomes busy (starts processing a request), the current time is recorded in timetable for that backend. If the backend remains busy for longer than timeout, an action (such as logging a warning or shutting down the backend) can be triggered. Right now it stops the backend directly.

Similarly, when a backend becomes idle (finishes processing a request), the current time is recorded in idleTime for that backend. If the backend remains idle for longer than idletimeout, it gets killed. This was requested in #1202, and I took the occasion to implement it here, as most of the logic applies to it as well.

At the moment it is possible to define the timeout durations and to enable or disable each check (both are disabled by default), keeping it very simple as a starting point.

mudler (Owner, Author) commented Nov 26, 2023

One enhancement for later: in the current implementation, if a backend is stale, the request is cut. It should be possible instead to keep the request alive and retry it after the backend has been shut down.

dave-gray101 (Collaborator) commented

Thanks for confirming that, Mudler! That's pretty close to what I thought, but it's good to check. The one feature request I have (even if it's not in the very first PR) is to expose that timetable from the WatchDog, so that the monitoring endpoints can dig up data like when a backend was last used. Thanks!!

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler (Owner, Author) commented Nov 26, 2023

Makes total sense. I'm not exposing it now, as it would not be used in the code and would be confusing, but it should be easy to iterate on.

mudler (Owner, Author) commented Nov 26, 2023

I'm not super-satisfied, but it's OK for a first stab at it. It works locally and it's disabled by default, so it should be good to go.

Labels: enhancement

Issues closed by this pull request:

  • feat: backend watchdog
  • Allow grpc backend services to exit after idling for a while

2 participants