Implement health checks for C++ and Python components. #1731

Open
jrobble opened this issue Dec 2, 2023 · 0 comments

jrobble commented Dec 2, 2023

Update the component executors to run an optional health check.

The executor works by creating an instance of the component class and using that instance throughout the lifetime of the Docker container. The executor reads messages off the ActiveMQ queue for that component type and passes a job object to the component instance via a GetDetections API call. Our goal is to perform a health check using this same component instance.

The advantage of using the same component instance is that we can avoid situations where running a separate health check process in the container loads a separate set of models into CPU / GPU memory, which is undesirable. Also, if the component instance gets into a bad state over time, performing a health check on a fresh instance will not reveal the issue.

Update the component executor to decide whether to perform a health check on the component instance each time it finds another job on the job queue. Specifically, if there is a job in the queue and the executor has not yet performed a health check for this component, it will perform the first check before processing the first job. After the job is complete, the executor will wait until there is another job in the queue. When there is one, the executor will determine if the health check cooldown period has elapsed. If so, it will run a check before processing the next job. If not, it will just process the next job.
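The decision described above can be sketched as a small helper. This is only an illustration, not the executor's actual code: the class name `HealthCheckScheduler`, its methods, and the use of seconds as the unit are all assumptions.

```python
import time
from typing import Callable, Optional


class HealthCheckScheduler:
    """Decides whether a health check is due before the next job.

    Hypothetical helper; the name and interface are not part of OpenMPF.
    """

    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown_seconds = cooldown_seconds
        self._clock = clock
        # None until the first health check has been performed.
        self._last_check: Optional[float] = None

    def is_check_due(self) -> bool:
        # The very first job is always preceded by a health check.
        if self._last_check is None:
            return True
        # Afterwards, a check is due only once the cooldown has elapsed.
        return self._clock() - self._last_check >= self._cooldown_seconds

    def record_check(self) -> None:
        # Call this right after a health check has run.
        self._last_check = self._clock()
```

The executor would call `is_check_due()` only when a job is available, so no checks run while the queue is empty. A monotonic clock is used here so that wall-clock adjustments do not shorten or lengthen the cooldown.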

Note that we have the option to either pull the next job off the queue, or simply check if a job is on the queue, before performing the health check. If we pull the job off the queue and the health check fails, the message is returned to the queue and the attempt counts against the maximum number of attempts allowed (6) before the job ends up in the dead letter queue (DLQ). This behavior may be desirable, so this is how we should implement the feature.
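To make the pull-then-check option concrete, here is a toy in-memory model of it. It does not use ActiveMQ at all: the `Message` class, the `handle_next` function, and the `deque`/`list` stand-ins for the queue and DLQ are hypothetical, and in the real system the broker, not the executor, tracks delivery attempts.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List

MAX_DELIVERY_ATTEMPTS = 6  # maximum attempts mentioned in the issue


@dataclass
class Message:
    body: str
    attempts: int = 0  # delivery attempts so far


def handle_next(queue: Deque[Message], dlq: List[Message],
                health_check: Callable[[], bool],
                process: Callable[[Message], None]) -> None:
    """Pull one message, run the health check, then process or return it."""
    message = queue.popleft()
    message.attempts += 1  # pulling the job counts as a delivery attempt
    if health_check():
        process(message)
    elif message.attempts >= MAX_DELIVERY_ATTEMPTS:
        dlq.append(message)  # attempts exhausted: dead-letter the job
    else:
        queue.append(message)  # failed check: hand the job back for redelivery
```

In this model, a job that is pulled while the component stays unhealthy is dead-lettered after six failed pulls, which mirrors the trade-off described above.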

Also, it's important to note that we cannot perform the health check on the component instance while it's processing a job. This makes it impossible to perform the health check at an exact time interval. Additionally, performing the health check right before processing a job is desirable since there is less time for environmental factors to cause the instance to get into a bad state before processing the job. Furthermore, performing health checks when there are no jobs in the queue is wasted effort.

There are multiple options for the behavior when the health check fails:

  1. The executor process logs the issue and self-terminates with a non-zero exit code. If the Docker service is set to auto-restart then it will do so.

  2. The component executor logs the issue but does not self-terminate. This could be useful for monitoring the service over time for issues and checking to see if the component self-recovers.

  3. The executor process logs the issue, does not self-terminate, but also does not process jobs until after the next successful health check. This assumes that it's possible for the component to self-recover. Optionally, the executor can self-terminate after N failed health checks.

Note that behavior #1 can be implemented using behavior #3 where N (the number of failed health checks before self-terminating) is set to 1.
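A minimal sketch of behavior #3, which also covers behavior #1 when N is 1. The names `Action` and `FailurePolicy` are hypothetical. The sketch assumes that failures are counted consecutively and that a passing check resets the count; the issue does not specify either detail.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCESS_JOB = "process_job"  # check passed: go ahead with the job
    SKIP_JOB = "skip_job"        # check failed: log it and do not process jobs
    TERMINATE = "terminate"      # check failed N times: exit non-zero


class FailurePolicy:
    """Tracks failed health checks (hypothetical helper, not OpenMPF code)."""

    def __init__(self, max_failures: Optional[int] = None) -> None:
        if max_failures is not None and max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        # None means the executor never self-terminates.
        self._max_failures = max_failures
        self._failures = 0

    def on_check_result(self, passed: bool) -> Action:
        if passed:
            # Assumption: a successful check resets the failure count.
            self._failures = 0
            return Action.PROCESS_JOB
        self._failures += 1
        if (self._max_failures is not None
                and self._failures >= self._max_failures):
            return Action.TERMINATE
        return Action.SKIP_JOB
```

With `FailurePolicy(max_failures=1)` the first failed check yields `Action.TERMINATE`, which is behavior #1. With `max_failures=None` the executor keeps waiting for the component to recover.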

To enable health checks, each component can support the following env. vars:

  • HEALTH_CHECK : If set to 'true' the health check is enabled.
  • HEALTH_CHECK_TIMEOUT : The cooldown period between health checks. At least this amount of time will pass before the next health check is performed.
  • HEALTH_CHECK_RETRY_MAX_ATTEMPTS : How many failed health checks are allowed before the component executor self-terminates.
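One possible way to read these variables in a Python executor is sketched below. The variable names come from the list above; everything else is an assumption, including the `HealthCheckSettings` and `load_settings` names, the case-insensitive comparison against 'true', and treating HEALTH_CHECK_TIMEOUT as a number of seconds (the issue does not state a unit).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HealthCheckSettings:
    enabled: bool
    cooldown_seconds: Optional[float]  # None if HEALTH_CHECK_TIMEOUT is unset
    max_attempts: Optional[int]        # None if the retry limit is unset


def load_settings(env: Mapping[str, str] = os.environ) -> HealthCheckSettings:
    """Read the health check env. vars (hypothetical helper)."""
    # Assumption: 'true' is matched case-insensitively; anything else disables.
    enabled = env.get("HEALTH_CHECK", "").strip().lower() == "true"
    timeout = env.get("HEALTH_CHECK_TIMEOUT")
    attempts = env.get("HEALTH_CHECK_RETRY_MAX_ATTEMPTS")
    return HealthCheckSettings(
        enabled=enabled,
        # Assumption: the cooldown is expressed in seconds.
        cooldown_seconds=float(timeout) if timeout else None,
        max_attempts=int(attempts) if attempts else None,
    )
```

Passing the environment as a mapping keeps the function easy to test without modifying the real process environment.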

A user can configure the health check job by providing an .ini file at a known location within the component container (for example, /opt/mpf/plugins/health-check.ini):

media=<somewhere in the container or shared dir>
min_num_tracks=<N>
media_type=<IMAGE, VIDEO, AUDIO, or GENERIC>

[job_properties]
<alg_prop_a>=<some_val>
<alg_prop_b>=<some_val>

[media_properties]
<alg_prop_a>=<some_val>
<alg_prop_b>=<some_val>

The component executor will use the information in the .ini file to send a job to the component instance and check that at least the specified number of tracks are generated.
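As a rough sketch, a Python executor could parse this file with the standard `configparser` module. The `HealthCheckJob`, `parse_health_check_ini`, and `has_enough_tracks` names are hypothetical. Because `media`, `min_num_tracks`, and `media_type` appear before any section header, the sketch prepends a made-up `[general]` header so that `configparser` accepts the text.

```python
import configparser
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class HealthCheckJob:
    media: str
    media_type: str
    min_num_tracks: int
    job_properties: Dict[str, str]
    media_properties: Dict[str, str]


def _section(parser: configparser.ConfigParser, name: str) -> Dict[str, str]:
    # Both property sections are treated as optional in this sketch.
    return dict(parser[name]) if parser.has_section(name) else {}


def parse_health_check_ini(text: str) -> HealthCheckJob:
    parser = configparser.ConfigParser(interpolation=None)
    # Keep property names exactly as written instead of lowercasing them.
    parser.optionxform = str
    # Hypothetical "[general]" header for the keys that precede any section.
    parser.read_string("[general]\n" + text)
    general = parser["general"]
    return HealthCheckJob(
        media=general["media"],
        media_type=general["media_type"].upper(),
        min_num_tracks=int(general["min_num_tracks"]),
        job_properties=_section(parser, "job_properties"),
        media_properties=_section(parser, "media_properties"),
    )


def has_enough_tracks(job: HealthCheckJob, tracks: Sequence[object]) -> bool:
    # The check passes when at least min_num_tracks tracks were generated.
    return len(tracks) >= job.min_num_tracks
```

The executor would read the file contents, build the job from the parsed values, run it on the existing component instance, and pass the resulting tracks to `has_enough_tracks`.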

@jrobble jrobble added the feature label Dec 2, 2023
@jrobble jrobble added this to the Milestone 3 milestone Dec 2, 2023
@jrobble jrobble self-assigned this Dec 2, 2023
@jrobble jrobble added this to To do in OpenMPF: Development via automation Dec 2, 2023
@jrobble jrobble moved this from To do to Planned in OpenMPF: Development Dec 2, 2023
@brosenberg42 brosenberg42 changed the title Implement component health checks Implement health checks for C++ and Python components. Dec 8, 2023
@jrobble jrobble moved this from Planned to To do in OpenMPF: Development Dec 14, 2023
@jrobble jrobble assigned brosenberg42 and unassigned jrobble Dec 14, 2023
@jrobble jrobble moved this from To do to In Progress in OpenMPF: Development Dec 14, 2023
@jrobble jrobble added the hotfix label Jan 19, 2024