
[WIP] Docker Compose Update #57

Closed · seoester wants to merge 5 commits

Conversation

@seoester (Contributor) commented Feb 3, 2018

This PR contains two new docker-compose.yml files, one for the backend (mariadb and redis) and one for Promgen, Prometheus, etc.

The docker docs are also updated to describe how to use Docker Compose to quickly get started with promgen.

I split the docker-compose.yml file into two so that the backend can be started separately, and Promgen is only started once the backend is ready. A better solution would be to integrate wait-for-db functionality into the promgen image; however, I have not looked into this yet.
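
As a rough sketch of what such a wait-for-db step could look like (an illustration only; the host name, port, and the assumption that the mysql client tools and credentials are available in the image are not part of this PR):

    # Sketch: block until MySQL accepts connections, then exec the requested command.
    # "mysql" and "3306" are placeholders matching the compose service defaults;
    # credentials are omitted for brevity.
    until mysqladmin ping --host=mysql --port=3306 --silent; do
        echo "waiting for database..."
        sleep 2
    done
    exec "$@"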

The old compose file has been split into two: one manages only
the backend (databases: mysql (mariadb) and redis); the other
contains promgen (web and worker) as well as prometheus, the blackbox
exporter, and alertmanager.

This split makes it possible to start the backends first and wait
until their setup finishes; only then is promgen started.
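
For illustration, the startup sequence could then look like this (the compose file names here are assumptions, not necessarily the ones introduced by this PR):

    # 1. Bring up the backend stack (mariadb and redis) and let it initialize.
    docker-compose -f docker-compose-backend.yml up -d

    # 2. Once the databases are ready, bring up promgen, prometheus, etc.
    docker-compose -f docker-compose.yml up -d
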
@codecov-io commented Feb 3, 2018

Codecov Report

Merging #57 into master will increase coverage by 0.66%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   47.14%   47.81%   +0.66%     
==========================================
  Files          76       72       -4     
  Lines        2507     2405     -102     
==========================================
- Hits         1182     1150      -32     
+ Misses       1325     1255      -70
Impacted Files Coverage Δ
views.py
notification/user.py
notification/linenotify.py
migrations/0013_auto_20161027_0307.py
celery.py
discovery/default.py
util.py
shortcuts.py
migrations/0007_auto_20161020_0253.py
migrations/0029_auto_20170406_0236.py
... and 138 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fef9b9c...2a9e3da.

@seoester seoester mentioned this pull request Feb 4, 2018
@kfdm (Member) commented Feb 5, 2018

I suppose my biggest question (again due to my lack of knowledge regarding docker best practices) is why to split things up into two docker-compose files. My initial impression was that the docker-compose file gave a base starting point, as a demo or to get started, but for most production deployments it would always require some adjustments. It's unfortunate that docker does not have a built-in way to wait for a container, and Django also does not have a better way to reconnect, so perhaps keeping it as a single docker-compose file and including a wait-for-it type check may be the way to go.

Previously I used the sentry onpremise script as a base, since Sentry is also a django application
https://github.com/getsentry/onpremise/blob/master/docker-compose.yml

Looks like there may be a default way built into newer docker versions but I have not tested it myself
https://stackoverflow.com/questions/31746182/docker-compose-wait-for-container-x-before-starting-y/41854997#41854997

@seoester (Contributor Author) commented Feb 5, 2018

I agree that applications (and docker containers especially, as examples of microservices) should be resilient against failures of external dependencies / services.

I've read both that django supports database restarts and that django stalls when the database is shut down and then started again.
I would like to test this out, but might not get around to it this week. Until then, I'd like to keep this PR open.

By splitting into two files, the backend can be treated as an external service which is assumed to always be available.

HEALTHCHECK is not an adequate method for waiting on other containers; it is meant for monitoring and automated life-cycle management.

@seoester (Contributor Author) commented

Hey, long time without any progress... But I finally got around to testing out some scenarios of db unavailability :) and mostly, it seemed to work out of the box. Combined with some additional wait-for-db functionality and the new docker-compose-bootstrap command from #92, we get a decent docker-compose configuration!

Together with some further changes, I'd like to finally remove the WIP label from this PR next week.

I'll paste my raw notes on different scenarios:

Experiments with promgen and db unavailability

Tested using the docker compose file with a recent version of the Dockerfile and

Context: Set-up and web app and worker running

  • mysql
    • When taken down
      • Web app returns error on request.
      • Worker is not impacted directly.
    • When coming back up
      • Web app works again (serves requests).
      • Worker is not impacted directly.
    • Unclear: Effect of CONN_MAX_AGE (django docs) but does not seem to affect the behaviour described above.
  • redis
    • When taken down
      • Requests requiring queue interaction hang forever
      • Celery enters retry loop (with backoff), worker exits (code 1) after a while
    • When coming back up
      • Hanging requests recover
      • If worker process has not exited yet, celery recovers

Context: Set-up and web app and worker not running

  • mysql
    • When down
      • Web app does not start properly, OperationalError is raised during system checks
      • HTTP listener is not started
      • Exits after a while? (not within 10 min)
      • Note: What check tries to connect to the database? Seems to be a generic field check: source code django.db.models.fields method _check_backend_specific_checks (Error message indicates self.features.supports_microsecond_precision)
      • Worker starts up, no errors logged. Presumably it would have problems performing configuration write tasks, but those are unlikely to be issued as the web app is unavailable. (Task retry mechanism?)
    • When coming back up
      • Web app does not re-check connection. Doesn't recover.
      • No direct impact on worker.
  • redis
    • When down
      • Web app starts up correctly, see above section (Context: Set-up and web app and worker running)
      • Worker starts but does not start up celery worker
        • No retrying visible in the logs
    • When coming back up
      • For web app, see above section (Context: Set-up and web app and worker running)
      • Worker starts celery as soon as redis is available

@seoester (Contributor Author) commented

This would be much improved by pending PR #95.

@@ -16,6 +16,17 @@ web)
shift
set -- gunicorn "promgen.wsgi:application" "$@"
;;
wait-runserver)
Member

Instead of using a separate wait-runserver block, I wonder if this would be better added to the existing 'web' block above here. Runserver is sufficient for development but perhaps the docker container should use gunicorn.

Member

I realized that my Dockerfile entrypoint uses web though my original docker-compose uses runserver. It's probably best to make both use web, since runserver is mostly for development use. What do you think?

I also thought a bit about having the promgen check ; sleep check apply to all of the commands by putting it at the top, but there are also likely valid cases for not forcing that check while debugging the system. Do you have any thoughts about this as well?

Contributor Author

gunicorn is running a WSGI server, so this would require an additional container running nginx (or similar) to be able to serve HTTP requests.

I think it is sensible to have two different entrypoints, both with a wait mechanism:

  • An entrypoint starting up an HTTP server using the runserver command.
  • An entrypoint starting up a WSGI server using gunicorn.

I think commands should be considered individually when deciding whether to wait for the db. With one-off commands run manually, it is in the user's interest to get an immediate response. For example, worker does not really need to wait for the db, as its basic functionality does not depend on the availability of mysql, and resilience to redis outages is already built into celery.

Perhaps for some one-off commands it would be useful to have optional wait-for-db functionality, for example when running them from a docker-compose file. In this case either a new wait-… command could be introduced using case fallthrough, or an env var could be checked to decide whether to wait for the db or try immediately.
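
A minimal sketch of the env var variant in the entrypoint (PROMGEN_WAIT_FOR_DB is a hypothetical variable name and the retry interval is arbitrary; only promgen check comes from the existing discussion):

    # Sketch: optionally wait for the database before handling the requested command.
    # PROMGEN_WAIT_FOR_DB is a hypothetical opt-in variable, not part of this PR.
    if [ -n "$PROMGEN_WAIT_FOR_DB" ]; then
        until promgen check; do
            echo "waiting for database..."
            sleep 5
        done
    fi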

Member

gunicorn can speak HTTP, so nginx is not required, though it is often used in a production environment. I do not have a strong opinion, since I tend to consider docker-compose a bit more of a demo than a fully production-ready environment. When deploying for real, it's likely that most would want to tweak their setup a bit.

Maybe best to keep things simple in this pass. Can always add additional wait- commands or make it global if required in the future.

Contributor Author

Hm, what do you think about these entrypoints, both protected by promgen check:

  • web - gunicorn running the application. Used as the default for docker-compose.
  • web-dev - django runserver. Meant to be used for testing and development.

web could then be used for a more production-like setup together with nginx. Additional documentation could be compiled on how to achieve this (e.g. using two containers sharing a volume for hosting the static files).
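
As a sketch, the two entrypoints could slot into the existing case statement roughly like this (only an illustration of the proposal; the exact runserver invocation and retry interval are assumptions):

    web)
        # Wait for the database, then run gunicorn (production-style default).
        until promgen check; do sleep 5; done
        shift
        set -- gunicorn "promgen.wsgi:application" "$@"
        ;;
    web-dev)
        # Wait for the database, then use Django's runserver for development.
        # "promgen runserver" is assumed here to mirror the existing CLI wrapper.
        until promgen check; do sleep 5; done
        shift
        set -- promgen runserver "$@"
        ;;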

Member

Sounds reasonable. I think those names are more descriptive than wait-runserver for those who may not be as familiar with django directly.

PROMGEN_REGISTER_HOST: prometheus
PROMGEN_REGISTER_PORT: 9090
DATABASE_URL: mysql://root:secretrootpassword@mysql:3306/promgen
SECRET_KEY: some-secret-key-AhvSh+xGkm9kA,ghfi4?w8K
Contributor Author

I'm not familiar with django commands and best practices: should this stanza also include DEBUG: 1 and the CELERY_BROKER_URL? Is SECRET_KEY even required?

Member

DEBUG would only be necessary for times when you want extra debugging information, typically just for development. Could drop it from the docker-compose file and be fine.

SECRET_KEY in theory is only required for http but I believe the django startup code errors out if it is not set somewhere. Could set a default SECRET_KEY again in settings.py so that we could remove it from the docker-compose file.

The easiest way to get started is by using
`docker-compose <https://docs.docker.com/compose/>`__.

Promgen ships with two ``docker-compose.yml`` files:
Member

Since the two files were combined, I think these docs need to also be updated :)

Contributor Author

Yeah, true 😄 I'll update the documentation when this PR is about to settle.

@github-actions bot commented Nov 8, 2019

This pr is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the Stale label Nov 8, 2019
@seoester (Contributor Author) commented

I'm interested in revisiting this, as we are finally deploying Promgen in a Docker Swarm.
I'm closing this now as the numerous changes to Promgen and new Docker versions require a re-write of the compose file.

@seoester seoester closed this Nov 11, 2019