Switch from curl-based default Docker healthcheck to a CLI-based one #5342

Merged
merged 2 commits into next on Feb 28, 2024

Conversation

@glennmatthews (Contributor) commented Feb 23, 2024

Closes N/A (but relates to #5340)

What's Changed

  • Switch the default Docker healthcheck from a curl-based one (which can fail if all request-processing workers/processes/threads are busy with other requests) to a CLI-based one that calls nautobot-server health_check.
    • Because of #4292 (nautobot-server commands always call import_jobs_as_celery_tasks) and related startup code, nautobot-server health_check takes about 6 seconds to execute, so I increased the healthcheck's default interval and timeout from 5s to 10s. We should be able to bring these back down once we improve nautobot-server startup time.
    • I changed the default start-period from 5s to 5m, since we know that initial migrations may take several minutes to complete.
    • With these changes to the default healthcheck, we no longer need to override it with a custom healthcheck in docker-compose.yml and docker-compose.final.yml. (A sketch of the resulting HEALTHCHECK directive follows this list.)
  • Add a new healthcheck backend based on the health_check.contrib.migrations implementation. It fails if any un-applied migrations are detected and passes once all migrations are in effect. This is needed because the /health/ URL endpoint doesn't start responding until the nautobot-server process is serving requests, and so implicitly fails while migrations are in progress, whereas the nautobot-server health_check CLI spins up its own process and reports back as soon as it can. We don't want the container to report as healthy before migrations are complete, since dependent containers/processes (celery worker, celery beat) are likely to encounter errors if they start up while migrations are in flight. (A sketch of such a backend follows the screenshots below.)
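For illustration, with these defaults the Dockerfile healthcheck would look roughly like the following. This is a sketch assembled from the interval/timeout/start-period values described above, not the exact merged diff:

HEALTHCHECK --interval=10s --timeout=10s --start-period=5m \
    CMD nautobot-server health_check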

QUESTION: should this go into develop as a bug fix or next as a feature/behavior-change?

Screenshots

# nautobot-server health_check
DatabaseBackend               ... working
DefaultFileStorageHealthCheck ... working
MigrationsBackend             ... unavailable: There are migrations not yet applied
RedisBackend                  ... working
# nautobot-server health_check
DatabaseBackend               ... working
DefaultFileStorageHealthCheck ... working
MigrationsBackend             ... working
RedisBackend                  ... working
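For reference, a minimal sketch of such a migrations backend, modeled on django-health-check's health_check.contrib.migrations implementation. The class name and error messages here are illustrative and not necessarily what this PR merged:

from django.db import DatabaseError, connection
from django.db.migrations.executor import MigrationExecutor

from health_check.backends import BaseHealthCheckBackend
from health_check.exceptions import ServiceUnavailable


class MigrationsBackend(BaseHealthCheckBackend):
    """Report failure if any database migrations have not yet been applied."""

    def check_status(self):
        try:
            executor = MigrationExecutor(connection)
            # The "plan" is the list of migrations still to be applied;
            # an empty plan means the database schema is fully up to date.
            plan = executor.migration_plan(executor.loader.graph.leaf_nodes())
        except DatabaseError as err:
            self.add_error(ServiceUnavailable("Database is not ready"), err)
            return
        if plan:
            self.add_error(ServiceUnavailable("There are migrations not yet applied"))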

TODO

  • Explanation of Change(s)
  • Added change log fragment(s) (for more information see the documentation)
  • Attached Screenshots, Payload Example
  • n/a Unit, Integration Tests
  • n/a Documentation Updates (when adding/changing features)
  • n/a Example Plugin Updates (when adding/changing features)
  • Outline Remaining Work, Constraints from Design

@glennmatthews glennmatthews added the "emergent" label (Unplanned work that is brought into a sprint after it's started.) Feb 26, 2024
@gsnider2195 (Contributor)

> QUESTION: should this go into develop as a bug fix or next as a feature/behavior-change?

I would say if this were a development-environment change, then develop would be OK, but since this is changing the Dockerfile, it should go to next.

@glennmatthews (Contributor, Author)

> > QUESTION: should this go into develop as a bug fix or next as a feature/behavior-change?
>
> I would say if this were a development-environment change, then develop would be OK, but since this is changing the Dockerfile, it should go to next.

Agreed. I'll retarget it.

@bryanculver (Member)

Could any of the Celery commands be used to get the status/availability of workers?

@glennmatthews (Contributor, Author)

> Could any of the Celery commands be used to get the status/availability of workers?

Sure, but I don't think we want the server container to report as unhealthy when no workers are running. That would cause a chicken-and-egg problem between the server container and the worker container(s).

@glennmatthews glennmatthews changed the base branch from develop to next February 27, 2024 13:31
@gsnider2195 (Contributor)

> > Could any of the Celery commands be used to get the status/availability of workers?
>
> Sure, but I don't think we want the server container to report as unhealthy when no workers are running. That would cause a chicken-and-egg problem between the server container and the worker container(s).

Also, the celery inspect commands are slow: they send out a ping and wait for a specified timeout for workers to report back.
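For illustration, this is the style of blocking check being described (the app name is an assumption made for the example; the exact invocation for Nautobot may differ):

# Ping all workers and block for up to 10 seconds waiting for replies;
# a container healthcheck built on this would inherit that wait on every probe.
celery --app nautobot inspect ping --timeout 10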

@glennmatthews glennmatthews self-assigned this Feb 27, 2024
@glennmatthews glennmatthews merged commit ccd7843 into next Feb 28, 2024
17 checks passed
@glennmatthews glennmatthews deleted the u/glennmatthews-cli-healthcheck branch February 28, 2024 19:41