bug-1847235: emit tecken.gunicorn_worker_abort incr on timeout #2843
This uses gunicorn server hooks to emit a tecken.gunicorn_worker_abort incr when the gunicorn manager terminates a gunicorn worker because it's exceeded the timeout. This happens when it's taking too long to process an upload API request.
@smarnach Can you review this? It'd be great to land this and then push it to production so we have some data before we change the instance size.
@@ -7,6 +7,9 @@
# Tecken
# ------

# Gunicorn things
GUNICORN_TIMEOUT=60
This drops the GUNICORN_TIMEOUT in the local dev environment to 60 seconds. It reduces the number of steps it takes to simulate the gunicorn worker timeout scenario, and 60 seconds is probably a better value for a local dev environment anyhow.
Overall this looks reasonable and simple. I don't understand how the initialization works, but I'm approving the PR so we can deploy it if you decide to go ahead.
tecken/gunicornhooks.py
Outdated
metrics = markus.get_metrics("tecken")


def configure_markus():
What is calling this function? It doesn't seem to be one of the predefined Gunicorn hooks, so I don't understand how it gets called. Maybe this should be called post_worker_init(), or whatever the appropriate hook is?
Whoops--it doesn't get called. But it works when I test things manually, and the logging output is what we configure in the Django webapp rather than what Gunicorn emits, so I wonder if worker_abort gets called in the gunicorn worker during SIGABRT handling and not in the Gunicorn master process.
I checked the code and that's what's going on. So we don't need configure_markus and some of that other stuff at all. I'll remove it and add a note.
That makes sense. In that case, I'd move the import for markus and the initialization of metrics into the worker_abort hook. It probably doesn't matter much, since markus looks pretty lightweight and doesn't pull in any modules outside of the standard library.
We talked about this on Slack and agreed that in gunicornhooks.py, we should delay any import/usage of Tecken webapp things until the point where they're used.
I'll fix that in a follow-up PR.
worker_abort is run in the SIGABRT handling in the gunicorn worker. That process already has markus and logging configured, so we don't need to configure them again in this file. Further, this adds comments to clarify the context things are run in and the rules for adding things to the gunicornhooks.py file.
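For readers following along, here is a rough sketch of what a hook like this could look like. It is not the PR's actual gunicornhooks.py. It assumes Gunicorn's worker_abort server hook and the markus get_metrics / incr calls, and it assumes the webapp has already configured markus in the worker process, as described above. The import is done inside the hook, in line with the review discussion.

```python
# Hypothetical sketch of a gunicornhooks.py; not the file from this PR.


def worker_abort(worker):
    """Gunicorn server hook, called in the worker when it receives SIGABRT.

    Gunicorn generally sends SIGABRT when a worker exceeds the timeout.
    """
    # Import lazily so nothing is loaded until the hook actually runs
    # inside the worker process.
    import markus

    # Assumption: the webapp already configured markus backends in this
    # process, so no extra configuration happens here.
    metrics = markus.get_metrics("tecken")

    # With the "tecken" prefix this is emitted as tecken.gunicorn_worker_abort.
    metrics.incr("gunicorn_worker_abort")
```

Gunicorn only calls the hook if the module is loaded as its configuration, for example through the --config option. How Tecken wires that up is not shown here.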
Thank you!
This uses gunicorn server hooks to emit a tecken.gunicorn_worker_abort incr when the gunicorn manager terminates a gunicorn worker because it's exceeded the timeout. This happens when it's taking too long to process an upload API request.
The test for this is contrived at best and only verifies that the metric we expect is emitted.
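A contrived test along those lines might look roughly like the sketch below. The PR's real test is not shown here, so this is only an illustration: it stubs markus with unittest.mock rather than whatever the PR uses, and the worker_abort defined in the block is a hypothetical stand-in for the hook under test.

```python
import sys
import types
from unittest import mock


def worker_abort(worker):
    # Hypothetical stand-in for the hook under test.
    import markus

    markus.get_metrics("tecken").incr("gunicorn_worker_abort")


def test_worker_abort_emits_metric():
    # Replace markus with a stub module that records what would be emitted.
    fake_markus = types.ModuleType("markus")
    fake_metrics = mock.Mock()
    fake_markus.get_metrics = mock.Mock(return_value=fake_metrics)

    with mock.patch.dict(sys.modules, {"markus": fake_markus}):
        worker_abort(worker=mock.Mock())

    # Only checks that the expected metric is emitted, nothing more.
    fake_markus.get_metrics.assert_called_once_with("tecken")
    fake_metrics.incr.assert_called_once_with("gunicorn_worker_abort")
```

This never starts Gunicorn or triggers a timeout, so it says nothing about whether the hook is actually invoked in production.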
It's difficult to do a real test because that requires Tecken to be running with gunicorn and for a gunicorn worker to exceed the timeout such that the gunicorn manager kills the process. We can manually test it using the steps in this blog post:
https://bluesock.org/~willkg/blog/mozilla/tecken_worker_exit.html
When the Gunicorn manager sends a SIGABRT to the Gunicorn worker, we'll see the metric emitted: