Add a /metrics endpoint for Prometheus Metrics #3490
Conversation
/cc @minrk @choldgraf @betatim @willingc who have come to like the Grafana / Prometheus integration we have in JupyterHub / BinderHub. /cc @rgbkrk @ivanov who had conversations about exposing 'resource use metrics' as part of the kernel (IIRC). This PR is orthogonal to that, since it only deals with operational and performance metrics, rather than things like 'here is what is happening to your spark cluster!'
[Prometheus](https://prometheus.io/) provides a standard metrics format that can be collected and used in many contexts:

- From the browser, to drive 'current resource usage' displays, such as https://github.com/yuvipanda/nbresuse
- From a Prometheus server, to collect historical data for operational analysis and performance monitoring

Example: https://grafana.mybinder.org/dashboard/db/1-overview?refresh=1m&orgId=1 shows mybinder.org metrics from JupyterHub and BinderHub, via the Prometheus server at https://prometheus.mybinder.org

The JupyterHub and BinderHub projects already expose Prometheus metrics natively. Adding this to the Jupyter notebook server allows us to instrument the code easily and in a standard format that has lots of 3rd party tooling for it.

This commit does the following:

- Introduce the `prometheus_client` library as a dependency. This library has no dependencies of its own and is pure Python.
- Add an authenticated `/metrics` endpoint to the server, which returns metrics in the Prometheus text format.
- Expose the default process metrics from `prometheus_client`, which include memory usage and CPU usage info (for just the notebook process).
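For readers skimming the description, a minimal sketch of what such an endpoint can look like with `prometheus_client` follows; the handler name, base class, and route registration here are illustrative assumptions rather than necessarily the code in this PR.

```python
# Minimal sketch only: handler name, base class, and registration are assumptions,
# not necessarily what this PR does.
from prometheus_client import CONTENT_TYPE_LATEST, REGISTRY, generate_latest
from tornado import web

from notebook.base.handlers import IPythonHandler  # assumed base class


class PrometheusMetricsHandler(IPythonHandler):
    """Serve the process's metrics in the Prometheus text exposition format."""

    @web.authenticated
    def get(self):
        self.set_header('Content-Type', CONTENT_TYPE_LATEST)
        # generate_latest() renders everything registered with the default
        # registry, including the built-in process_* collectors (memory, CPU).
        self.write(generate_latest(REGISTRY))


# Assumed wiring; the notebook server assembles its handler list elsewhere.
default_handlers = [(r"/metrics", PrometheusMetricsHandler)]
```

The `@web.authenticated` decorator is what keeps the endpoint behind the notebook's normal login, matching the 'authenticated `/metrics` endpoint' bullet above.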
Force-pushed from 3f6cf36 to a764f90.
The appveyor failure seems unrelated?
Code adapted from JupyterHub
I stole the code for implementing RED (Rate, Errors, Duration) HTTP metrics from JupyterHub and added it here. With this, I can answer questions like 'how many times was the Tree handler called, and what is the 90th percentile of response time for it?'
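Concretely, the adapted pattern boils down to a labelled Histogram plus a hook called once per finished request. The sketch below uses the function and label names that appear later in this thread and in the diff under review; the metric name and help string are illustrative (the naming is debated further down), and wiring it into tornado's `log_function` setting is an assumption rather than a quote of this PR.

```python
# Sketch of the RED-style instrumentation adapted from JupyterHub.
# Metric name and help text are illustrative; see the naming discussion below.
from prometheus_client import Histogram

REQUEST_DURATION_SECONDS = Histogram(
    'request_duration_seconds',
    'Duration of HTTP requests in seconds',
    ['method', 'handler', 'code'],
)


def prometheus_log_method(handler):
    """Record one observation per finished request.

    Intended to be hooked into tornado's ``log_function`` application setting,
    so it runs alongside the normal request logging.
    """
    REQUEST_DURATION_SECONDS.labels(
        method=handler.request.method,
        handler='{}.{}'.format(handler.__class__.__module__,
                               type(handler).__name__),
        code=handler.get_status(),
    ).observe(handler.request.request_time())
```

Because the histogram is keyed by (method, handler, code), a query over it can answer the 'how often was the Tree handler called, and what is its 90th percentile latency' question directly.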
Works in JupyterHub because it's Python 3 only; fails the Python 2 test here.
Is it standard practice to put it at /metrics instead of some /api endpoint?
Yep, /metrics is the standard endpoint. I eventually want to have an option to make /metrics unauthenticated too (as an explicit opt-in). Since /api/ is all JSON, I feel ok keeping /metrics out of it. We could also expose a JSON endpoint under /api/ later if people wish.
notebook/metrics.py
Outdated
        method=handler.request.method,
        handler='{}.{}'.format(handler.__class__.__module__, type(handler).__name__),
        code=handler.get_status()
    ).observe(handler.request.request_time())
I assume this is low overhead, since it's being called on every request?
Yeah, quite. It's just incrementing a local counter based on a few strings and a number:
In [12]: %timeit prometheus_log_method(handler)
5.88 µs ± 87.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The only network activity occurs when a prometheus server retrieves the metrics via the /metrics
handler.
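For anyone who wants to reproduce that measurement: the `handler` object used in the %timeit above isn't shown in the thread, so the snippet below stands in a throwaway fake. The import path assumes the function lands in notebook/metrics.py (the file under review); the fake handler is purely illustrative.

```python
# Illustrative micro-benchmark; the fake handler and import path are assumptions.
import timeit
from unittest import mock

from notebook.metrics import prometheus_log_method  # file under review in this PR

# Minimal stand-in exposing only the attributes prometheus_log_method touches.
fake_request = mock.Mock(method='GET', request_time=lambda: 0.012)
fake_handler = mock.Mock(request=fake_request, get_status=lambda: 200)

# Roughly equivalent to the %timeit call above: prints seconds per observation.
n = 100000
print(timeit.timeit(lambda: prometheus_log_method(fake_handler), number=n) / n)
```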
This seems reasonable at a quick look. How much information does it store? Is there any risk that if there's nowhere to hand the data off to, the memory use could continually grow as long as the server is left running?
notebook/metrics.py
Outdated
    conventions for metrics & labels. We generally prefer naming them
    `<noun>_<verb>_<type_suffix>`. So a histogram that's tracking
    the duration (in seconds) of servers spawning would be called
    SERVER_SPAWN_DURATION_SECONDS.
Copy/paste from JupyterHub; the example here should be REQUEST_DURATION_SECONDS.
As an FYI, this particular example breaks the naming rule in the docstring.
Perhaps better to remove the preference sentence with noun/verb/type. Consider renaming to NOTEBOOK_REQUEST_DURATION_SECONDS based on Prometheus docs.
Actually, it's not clear to me what a 'request duration' is - is that the time from the request being sent to it being received? The time from receiving the first byte to receiving the last? The time from receiving the request to sending the response?
If this is a standard term in web metrics, it doesn't matter that it's not familiar to me. But if it's a term we're creating, maybe we can create something less ambiguous.
Ping @yuvipanda
Heya!
I removed the naming convention recommendation and just linked directly to the page instead. This should hopefully reduce confusion.
I've also renamed this metric to http_request_duration_seconds. I think that's pretty standard for what we are doing here, which is indiscriminately recording metric info for all HTTP requests. Operators usually use the job and instance labels automatically added by Prometheus to differentiate applications and instances of applications, so I think it's ok not to use a prefix in this case.
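(For reference, the renamed metric would be defined roughly as below; the help-string wording is illustrative, not a quote of the final code.)

```python
# End state after the rename discussed above; help text is illustrative.
from prometheus_client import Histogram

HTTP_REQUEST_DURATION_SECONDS = Histogram(
    'http_request_duration_seconds',
    'Duration of HTTP requests in seconds',
    ['method', 'handler', 'code'],
)
```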
I think the only bits people wanted changed here are in the docstring, and as an example of the naming I think it makes sense already (though we don't actually use the example name here). So shall we merge this?
@takluyver I've responded! Thank y'all for your patience :)
Thanks @yuvipanda