Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug-1847235: emit tecken.gunicorn_worker_abort incr on timeout #2843

Merged
merged 2 commits into from Dec 1, 2023

Conversation

willkg
Copy link
Collaborator

@willkg willkg commented Nov 30, 2023

This uses gunicorn server hooks to emit a tecken.gunicorn_worker_abort incr when the gunicorn manager terminates a gunicorn worker because it's exceeded the timeout. This happens when it's taking too long to process an upload API request.

The test for this is contrived at best and only verifies that the metric we expect is emitted.

It's difficult to do a real test because that requires Tecken to be running with gunicorn and for a gunicorn worker to exceed the timeout such that the gunicorn manager kills the process. We can manually test it using the steps in this blog post:

https://bluesock.org/~willkg/blog/mozilla/tecken_worker_exit.html

When the Gunicorn manager sends a SIGABRT to the Gunicorn worker, we'll see the metric emitted:

tecken-web-1           | INFO 2023-11-30 20:26:08,248 - webapp - markus METRICS|2023-11-30 20:26:08|timing|tecken.upload_file_exists|6.280567002249882|
tecken-web-1           | DEBUG 2023-11-30 20:26:08,249 - webapp - tecken key_existing cache miss on publicbucket:v1/libglib-2.0.so.0/441450BE1030C69946EC77871CA83C5D0/libglib-2.0.so.0
tecken-web-1           | [2023-11-30 20:27:08 +0000] [16] [CRITICAL] WORKER TIMEOUT (pid:22)
tecken-web-1           | INFO 2023-11-30 20:27:08,968 - webapp - markus METRICS|2023-11-30 20:27:08|incr|tecken.gunicorn_worker_abort|1|
tecken-web-1           | [2023-11-30 20:27:09 +0000] [16] [ERROR] Worker (pid:22) was sent SIGKILL! Perhaps out of memory?
tecken-web-1           | [2023-11-30 20:27:09 +0000] [34] [INFO] Booting worker with pid: 34
tecken-web-1           | WARNING 2023-11-30 20:27:10,365 - webapp - tecken Auth0 blocked check disabled

This uses gunicorn server hooks to emit a tecken.gunicorn_worker_abort
incr when the gunicorn manager terminates a gunicorn worker because it's
exceeded the timeout. This happens when it's taking too long to process
an upload API request.
@willkg
Copy link
Collaborator Author

willkg commented Nov 30, 2023

@smarnach Can you review this? It'd be great to land this and then push it to production so we have some data before we change the instance size.

@@ -7,6 +7,9 @@
# Tecken
# ------

# Gunicorn things
GUNICORN_TIMEOUT=60
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This drops the GUNICORN_TIMEOUT in the local dev environment to 60 seconds. It reduces the number of steps it takes to simulate the gunicorn worker timeout scenario and 60 seconds is probably a better value for a local dev environment anyhow.

Copy link
Contributor

@smarnach smarnach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks reasonable and simple. I don't understand how the initialization works, but I'm approving the PR so we can deploy it if you decide to go ahead.

metrics = markus.get_metrics("tecken")


def configure_markus():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is calling this function? It doesn't seem to be one of the predefined Gunicorn hooks, so I don't understand how it gets called. Maybe this should be called post_worker_init(), or whatever the appropriate hook is?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops--it doesn't get called. But it works when I manually test things plus the logging output is what we configure in the Django webapp and not what Gunicorn is emitting, so I wonder if worker_abort gets called in the gunicorn worker in the SIGABRT handling and not the Gunicorn master process.

I checked the code and that's what's going on. So we don't need the configure_markus and some of that other stuff at all. I'll remove it and add a note.

https://github.com/benoitc/gunicorn/blob/ca9162d9cd0f2403a4b65a54ddb996eb855e8b58/gunicorn/workers/base.py#L200-L203

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. In that case, I'd move the import for markus and the initialization of metrics into the worker_abort hook. It probably doesn't matter much, since markus looks pretty lightweight and doesn't pull in any modules outside of the standard library.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about slack and agreed that in gunicornhooks.py, we should delay any import/usage of Tecken webapp things until the point where they're used.

I'll fix that in a follow-up PR.

tecken/gunicornhooks.py Outdated Show resolved Hide resolved
worker_abort is run in the sigabrt handling in the gunicorn worker. That
process has markus and logging configured already, so we don't need to
additionally configure it in this file.

Further, this adds comments clarify the context things are run in and
the rules for adding things to the gunicornhooks.py file.
@willkg
Copy link
Collaborator Author

willkg commented Dec 1, 2023

Thank you!

@willkg willkg merged commit 220f626 into main Dec 1, 2023
1 check passed
@willkg willkg deleted the bug-1847235-upload-metrics branch December 1, 2023 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants