Switch from curl-based default Docker healthcheck to a CLI-based one #5342

Merged
merged 2 commits into next on Feb 28, 2024

Conversation

@glennmatthews (Contributor) commented Feb 23, 2024

Closes N/A (but relates to #5340)

What's Changed

  • Switch the default Docker healthcheck from a curl-based one (which can fail if all request-processing workers/processes/threads are busy with other requests) to a CLI-based one that calls nautobot-server health_check.
    • Because of #4292 (nautobot-server commands always call import_jobs_as_celery_tasks) and related startup code, nautobot-server health_check takes about 6 seconds to execute, so I increased the healthcheck's default interval and timeout from 5s to 10s. We should be able to bring these back down once we improve nautobot-server startup time.
    • I changed the default start-period from 5s to 5m, since we know that initial migrations may take several minutes to complete.
    • With these changes to the default healthcheck, we no longer need to override it with a custom healthcheck in docker-compose.yml and docker-compose.final.yml. (A sketch of the resulting HEALTHCHECK directive follows this list.)
  • Add a new healthcheck backend based on the health_check.contrib.migrations implementation. It fails if any un-applied migrations are detected and passes once all migrations are in effect. This is needed because the /health/ URL endpoint doesn't start responding until the nautobot-server process is serving requests, and so implicitly fails while migrations are in progress, whereas the nautobot-server health_check CLI spins up its own process and reports back as soon as it can. We don't want the container to report as healthy before migrations are complete, since dependent containers/processes (celery worker, celery beat) are likely to encounter errors if they start up while migrations are in flight. (A sketch of such a backend follows the screenshots below.)
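For illustration, with these defaults the Dockerfile healthcheck would look roughly like the following. This is a sketch assembled from the interval/timeout/start-period values described above, not the exact merged diff:

HEALTHCHECK --interval=10s --timeout=10s --start-period=5m \
    CMD nautobot-server health_check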

QUESTION: should this go into develop as a bug fix or next as a feature/behavior-change?

Screenshots

# nautobot-server health_check
DatabaseBackend               ... working
DefaultFileStorageHealthCheck ... working
MigrationsBackend             ... unavailable: There are migrations not yet applied
RedisBackend                  ... working
# nautobot-server health_check
DatabaseBackend               ... working
DefaultFileStorageHealthCheck ... working
MigrationsBackend             ... working
RedisBackend                  ... working
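For reference, a minimal sketch of such a migrations backend, modeled on django-health-check's health_check.contrib.migrations implementation. The class name and error messages here are illustrative and not necessarily what this PR merged:

from django.db import DatabaseError, connection
from django.db.migrations.executor import MigrationExecutor

from health_check.backends import BaseHealthCheckBackend
from health_check.exceptions import ServiceUnavailable


class MigrationsBackend(BaseHealthCheckBackend):
    """Report failure if any database migrations have not yet been applied."""

    def check_status(self):
        try:
            executor = MigrationExecutor(connection)
            # The "plan" is the list of migrations still to be applied;
            # an empty plan means the database schema is fully up to date.
            plan = executor.migration_plan(executor.loader.graph.leaf_nodes())
        except DatabaseError as err:
            self.add_error(ServiceUnavailable("Database is not ready"), err)
            return
        if plan:
            self.add_error(ServiceUnavailable("There are migrations not yet applied"))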

TODO

  • Explanation of Change(s)
  • Added change log fragment(s) (for more information see the documentation)
  • Attached Screenshots, Payload Example
  • n/a Unit, Integration Tests
  • n/a Documentation Updates (when adding/changing features)
  • n/a Example Plugin Updates (when adding/changing features)
  • Outline Remaining Work, Constraints from Design

@glennmatthews glennmatthews added the "emergent" label (Unplanned work that is brought into a sprint after it's started.) Feb 26, 2024
@gsnider2195 (Contributor)

> QUESTION: should this go into develop as a bug fix or next as a feature/behavior-change?

I would say if this were a development-environment change, then develop would be OK, but since this is changing the Dockerfile, it should go to next.

@glennmatthews (Contributor, Author)

> > QUESTION: should this go into develop as a bug fix or next as a feature/behavior-change?
>
> I would say if this were a development-environment change, then develop would be OK, but since this is changing the Dockerfile, it should go to next.

Agreed. I'll retarget it.

@bryanculver (Member)

Could any of the Celery commands be used to get the status/availability of workers?

@glennmatthews (Contributor, Author)

> Could any of the Celery commands be used to get the status/availability of workers?

Sure, but I don't think we want the server container to report as unhealthy when no workers are running. That would cause a chicken-and-egg problem between the server container and the worker container(s).

@glennmatthews glennmatthews changed the base branch from develop to next February 27, 2024 13:31
@gsnider2195 (Contributor)

> > Could any of the Celery commands be used to get the status/availability of workers?
>
> Sure, but I don't think we want the server container to report as unhealthy when no workers are running. That would cause a chicken-and-egg problem between the server container and the worker container(s).

Also, the celery inspect commands are slow: they send out a ping and wait for a specified timeout for workers to report back.
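For illustration, this is the style of blocking check being described (the app name is an assumption made for the example; the exact invocation for Nautobot may differ):

# Ping all workers and block for up to 10 seconds waiting for replies;
# a container healthcheck built on this would inherit that wait on every probe.
celery --app nautobot inspect ping --timeout 10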

@glennmatthews glennmatthews self-assigned this Feb 27, 2024
@glennmatthews glennmatthews merged commit ccd7843 into next Feb 28, 2024
17 checks passed
@glennmatthews glennmatthews deleted the u/glennmatthews-cli-healthcheck branch February 28, 2024 19:41