Implement health checks for C++ and Python components. #1731

jrobble · 2023-12-02T03:09:49Z

Update the component executors to run an optional health check.

The executor creates an instance of the component class and uses that instance throughout the lifetime of the Docker container. The executor reads messages off the ActiveMQ queue for that component type and passes a job object to the component instance via a GetDetections API call. Our goal is to perform a health check using this same component instance.

The advantage of using the same component instance is that we can avoid situations where running a separate health check process in the container loads a separate set of models into CPU / GPU memory, which is undesirable. Also, if the component instance gets into a bad state over time, performing a health check on a fresh instance will not reveal the issue.

Update the component executor to decide whether to perform a health check on the component instance each time it determines that there is another job on the job queue. Specifically, if there is a job in the queue and the executor has not yet performed a health check for this component, it will perform the first check before processing the first job. After the job is complete, the executor will wait until there is another job in the queue. When there is one, the executor will determine if the health check cooldown period has elapsed. If so, it will run a check before processing the next job. If not, it will just process the next job.
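To make the timing concrete, here is a minimal Python sketch of the "check before the next job if the cooldown has elapsed" logic. The names `HealthCheckGate` and `handle_next_job` are hypothetical and are not part of the existing executor code. Queue handling and failure handling are left out on purpose.

```python
import time
from typing import Any, Callable, Optional


class HealthCheckGate:
    """Hypothetical helper: decides whether a health check is due before a job."""

    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_check: Optional[float] = None  # None until the first check runs

    def is_due(self) -> bool:
        # The very first job always triggers a check; later jobs only
        # trigger one once the cooldown period has elapsed.
        if self._last_check is None:
            return True
        return self._clock() - self._last_check >= self._cooldown

    def record_check(self) -> None:
        self._last_check = self._clock()


def handle_next_job(gate: HealthCheckGate,
                    run_health_check: Callable[[], None],
                    process_job: Callable[[Any], Any],
                    job: Any) -> Any:
    """Run the health check first if it is due, then process the job."""
    if gate.is_due():
        run_health_check()
        gate.record_check()
    return process_job(job)
```

Because the gate is only consulted when a job is available, no checks run while the queue is empty, and the interval between checks is "at least the cooldown" rather than exact.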

Note that we have the option to either pull the next job off the queue, or simply check if a job is on the queue, before performing the health check. If we pull the job off the queue, and the health check fails, then the message is returned to the queue and the attempt counts against the max attempts allowed (6) before the job ends up in the DLQ. This may be desirable and is how we should implement this feature.

Also, it's important to note that we cannot perform the health check on the component instance while it's processing a job. This makes it impossible to perform the health check at an exact time interval. Additionally, performing the health check right before processing a job is desirable since there is less time for environmental factors to cause the instance to get into a bad state before processing the job. Furthermore, performing health checks when there are no jobs in the queue is wasted effort.

There are multiple options for the behavior when the health check fails:

1. The executor process logs the issue and self-terminates with a non-zero exit code. If the Docker service is set to auto-restart then it will do so.
2. The component executor logs the issue but does not self-terminate. This could be useful for monitoring the service over time for issues and checking to see if the component self-recovers.
3. The executor process logs the issue, does not self-terminate, but also does not process jobs until after the next successful health check. This assumes that it's possible for the component to self-recover. Optionally, the component can self-terminate after N failed health checks.

Note that behavior #1 can be implemented using behavior #3 where N (the number of failed health checks before self-terminating) is set to 1.
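A rough Python sketch of the third behavior, with the first behavior falling out when N is 1. `FailureTracker` is a hypothetical name. The sketch assumes that N counts consecutive failures and that a successful check resets the count, which the description above does not specify.

```python
class FailureTracker:
    """Hypothetical helper: maps health check results to an executor action."""

    def __init__(self, max_failures: int) -> None:
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self._max_failures = max_failures
        self._failures = 0

    def record(self, healthy: bool) -> str:
        """Return 'process', 'skip', or 'terminate' for the latest check result."""
        if healthy:
            # Assumption: a successful check clears earlier failures.
            self._failures = 0
            return "process"
        self._failures += 1
        if self._failures >= self._max_failures:
            # The executor would log the issue and exit with a non-zero code here.
            return "terminate"
        # Log the issue and hold off on jobs until a later check succeeds.
        return "skip"
```

With `FailureTracker(1)` the first failed check returns `'terminate'`, which matches the first behavior.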

To enable health checks, each component can support the following env. vars:

HEALTH_CHECK : If set to 'true' the health check is enabled.
HEALTH_CHECK_TIMEOUT : The cooldown period between health checks. At least this amount of time will pass before the next health check is performed.
HEALTH_CHECK_RETRY_MAX_ATTEMPTS : How many failed health check attempts are allowed before the component self-terminates.
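For illustration, the executor could read these variables roughly as follows. This is a sketch, not the actual implementation: `HealthCheckSettings` and `load_settings` are hypothetical names, the timeout is assumed to be in seconds, and the defaults used when a variable is unset are assumptions as well.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HealthCheckSettings:
    enabled: bool
    cooldown_seconds: float  # assumption: HEALTH_CHECK_TIMEOUT is given in seconds
    max_attempts: int


def load_settings(env: Mapping[str, str] = os.environ) -> HealthCheckSettings:
    """Read the health check settings from environment variables."""
    enabled = env.get("HEALTH_CHECK", "").strip().lower() == "true"
    # Assumed defaults: no cooldown, terminate on the first failure.
    cooldown = float(env.get("HEALTH_CHECK_TIMEOUT", "0"))
    max_attempts = int(env.get("HEALTH_CHECK_RETRY_MAX_ATTEMPTS", "1"))
    return HealthCheckSettings(enabled, cooldown, max_attempts)
```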

A user can configure the health check job by providing an .ini file at a known location within the component container (for example, /opt/mpf/plugins/health-check.ini):

media=<somewhere in the container or shared dir>
min_num_tracks=<N>
media_type=<IMAGE, VIDEO, AUDIO, or GENERIC>

[job_properties]
<alg_prop_a>=<some_val>
<alg_prop_b>=<some_val>

[media_properties]
<alg_prop_a>=<some_val>
<alg_prop_b>=<some_val>

The component executor will use the information in the .ini file to send a job to the component instance and check that at least the specified number of tracks are generated.
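One detail for the Python side: the first three keys in the example file sit above any section header, and `configparser` rejects that by default. A possible workaround, sketched below, is to prepend a placeholder section before parsing. The function names and the `__top__` section name are hypothetical.

```python
import configparser
from typing import Any, Dict, Sequence


def parse_health_check_ini(text: str) -> Dict[str, Any]:
    """Parse the health check .ini text into a plain dictionary."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep property names case-sensitive
    # The top-level keys have no section header, so add a placeholder one.
    parser.read_string("[__top__]\n" + text)
    top = parser["__top__"]

    def section(name: str) -> Dict[str, str]:
        # Both property sections are treated as optional here (an assumption).
        return dict(parser[name]) if parser.has_section(name) else {}

    return {
        "media": top["media"],
        "min_num_tracks": top.getint("min_num_tracks"),
        "media_type": top["media_type"],
        "job_properties": section("job_properties"),
        "media_properties": section("media_properties"),
    }


def health_check_passed(tracks: Sequence[Any], min_num_tracks: int) -> bool:
    """The check passes when at least min_num_tracks tracks were generated."""
    return len(tracks) >= min_num_tracks
```

The executor would build the job from the parsed values, pass it to the existing component instance, and feed the returned tracks to `health_check_passed`.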


jrobble added the feature label Dec 2, 2023

jrobble added this to the Milestone 3 milestone Dec 2, 2023

jrobble self-assigned this Dec 2, 2023

jrobble added this to To do in OpenMPF: Development via automation Dec 2, 2023

jrobble moved this from To do to Planned in OpenMPF: Development Dec 2, 2023

brosenberg42 changed the title ~~Implement component health checks~~ Implement health checks for C++ and Python components. Dec 8, 2023

brosenberg42 mentioned this issue Dec 8, 2023

Implement health checks for Java components. #1734

Open

jrobble moved this from Planned to To do in OpenMPF: Development Dec 14, 2023

jrobble assigned brosenberg42 and unassigned jrobble Dec 14, 2023

jrobble moved this from To do to In Progress in OpenMPF: Development Dec 14, 2023

jrobble added the hotfix label Jan 19, 2024
