Add a /metrics endpoint for Prometheus Metrics #3490

Merged
5 commits merged into jupyter:master on Jun 13, 2018

Conversation

@yuvipanda (Contributor) commented Apr 2, 2018

Prometheus provides a standard
metrics format that can be collected and used in many contexts.

The JupyterHub and BinderHub projects already expose Prometheus
metrics natively. Adding this to the Jupyter notebook server
allows us to instrument the code easily, in
a standard format with plenty of third-party tooling.

This commit does the following:

- Introduce the `prometheus_client` library as a dependency.
  This library has no dependencies of its own and is pure Python.
- Add an authenticated `/metrics` endpoint to the server,
  which returns metrics in the Prometheus text format
  (see the sketch after this list).
- Expose the default process metrics from `prometheus_client`,
  which include memory usage and CPU usage info (for just the
  notebook process).
- Expose per-handler HTTP metrics using the RED method.
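
As an illustration (not the exact code in this PR), a minimal sketch of what an authenticated `/metrics` handler can look like on top of Tornado and `prometheus_client`; the class name and how it is wired up are assumptions for the sketch:

```python
from prometheus_client import CONTENT_TYPE_LATEST, REGISTRY, generate_latest
from tornado import web


class MetricsHandler(web.RequestHandler):
    """Hypothetical sketch: serve all registered Prometheus metrics as text."""

    @web.authenticated
    def get(self):
        # generate_latest() renders every metric in the default registry
        # using the Prometheus text exposition format.
        self.set_header('Content-Type', CONTENT_TYPE_LATEST)
        self.write(generate_latest(REGISTRY))
```

In the real notebook server the handler would subclass the server's base handler so the usual login/token authentication applies; the sketch uses Tornado's `@web.authenticated` decorator as a stand-in for that.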

@yuvipanda (Contributor Author)

/cc @minrk @choldgraf @betatim @willingc who have come to like the Grafana / Prometheus integration we have in JupyterHub / BinderHub.

/cc @rgbkrk @ivanov who had conversations about exposing 'resource use metrics' as part of the kernel (IIRC). This PR is orthogonal to that, since it only deals with operational and performance metrics, rather than things like 'here is what is happening to your spark cluster!'

These metrics can be collected and used in many contexts:

- From the browser,
  to drive 'current resource usage' displays such
  as https://github.com/yuvipanda/nbresuse
- From a Prometheus server,
  to collect historical data for operational analysis and
  performance monitoring.
  Example: https://grafana.mybinder.org/dashboard/db/1-overview?refresh=1m&orgId=1
  shows mybinder.org metrics from JupyterHub and BinderHub,
  collected via the Prometheus server at https://prometheus.mybinder.org

@yuvipanda (Contributor Author)

The AppVeyor failure seems unrelated?

@yuvipanda (Contributor Author)

I stole the code for implementing RED HTTP metrics from JupyterHub and added it here. With this, I can answer questions like 'how many times was the Tree handler called, and what is the 90th percentile of its response time?'

This works in JupyterHub because it is Python 3 only; here it fails the Python 2 tests.
@rgbkrk (Member) commented Apr 2, 2018

Is it standard practice to put it at /metrics instead of some /api endpoint?

@yuvipanda (Contributor Author) commented Apr 2, 2018 via email

```python
method=handler.request.method,
handler='{}.{}'.format(handler.__class__.__module__, type(handler).__name__),
code=handler.get_status()
).observe(handler.request.request_time())
```
Member

I assume this is low overhead, since it's being called on every request?

Member

Yeah, quite. It's just incrementing a local counter based on a few strings and a number:

```
In [12]: %timeit prometheus_log_method(handler)
5.88 µs ± 87.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

The only network activity occurs when a Prometheus server retrieves the metrics via the `/metrics` handler.
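
For context, a minimal sketch of the per-request instrumentation being timed above, using `prometheus_client`. The metric name matches the `http_request_duration_seconds` name settled on later in this thread; the exact buckets and wiring in the merged code may differ:

```python
from prometheus_client import Histogram

# One histogram for all HTTP requests, labelled per the RED method.
HTTP_REQUEST_DURATION_SECONDS = Histogram(
    'http_request_duration_seconds',
    'Duration of HTTP requests handled by the notebook server, in seconds',
    ['method', 'handler', 'code'],
)


def prometheus_log_method(handler):
    """Record one observation per completed request.

    Intended to be called from Tornado's per-request logging hook,
    which is why the overhead measured above matters.
    """
    HTTP_REQUEST_DURATION_SECONDS.labels(
        method=handler.request.method,
        handler='{}.{}'.format(handler.__class__.__module__, type(handler).__name__),
        code=handler.get_status(),
    ).observe(handler.request.request_time())
```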

@takluyver (Member)

This seems reasonable at a quick look.

How much information does it store? Is there any risk that if there's nowhere to hand the data off to, the memory use could continually grow as long as the server is left running?

```
conventions for metrics & labels. We generally prefer naming them
`<noun>_<verb>_<type_suffix>`. So a histogram that's tracking
the duration (in seconds) of servers spawning would be called
SERVER_SPAWN_DURATION_SECONDS.
```
Member

Copy/paste leftover; this should be REQUEST_DURATION_SECONDS.

Member

As an FYI, this particular example breaks the naming rule in the docstring.

Member

Perhaps better to remove the preference sentence with noun/verb/type.

Consider renaming to NOTEBOOK_REQUEST_DURATION_SECONDS based on Prometheus docs.

Member

Actually, it's not clear to me what a 'request duration' is - is that the time from the request being sent to it being received? The time from receiving the first byte to receiving the last? The time from receiving the request to sending the response?

If this is a standard term in web metrics, it doesn't matter that it's not familiar to me. But if it's a term we're creating, maybe we can create something less ambiguous.
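
For reference, the duration recorded in this PR is Tornado's `handler.request.request_time()`, i.e. the time from the server receiving the request to the response being finished. A hypothetical illustration of that behavior (paraphrasing Tornado's implementation, not its exact source):

```python
import time


class FakeRequest:
    """Illustration only: mimics what HTTPServerRequest.request_time() measures."""

    def __init__(self):
        self._start_time = time.time()  # set when the request is received
        self._finish_time = None        # set when the response is finished

    def request_time(self):
        # Elapsed time since the request was received, or, once the
        # response has finished, the total receive-to-finish duration.
        if self._finish_time is None:
            return time.time() - self._start_time
        return self._finish_time - self._start_time
```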

Member

Ping @yuvipanda

Contributor Author

Heya!

I removed the naming convention recommendation and now just link directly to the page instead. This should hopefully reduce confusion.

I've also renamed this metric to `http_request_duration_seconds`. I think that is pretty standard for what we are doing here, which is indiscriminately recording metric info for all HTTP requests. Operators usually rely on the `job` and `instance` labels automatically added by Prometheus to differentiate applications and instances of applications, so I think it's OK to not use a prefix in this case.

@takluyver (Member)

I think the only bits people wanted changed here are in the docstring, and as an example of the naming I think it makes sense already (albeit that we don't actually use the example name here). So shall we merge this?

@yuvipanda (Contributor Author)

@takluyver I've responded! Thank y'all for your patience :)

@willingc (Member) left a comment

Thanks @yuvipanda

@minrk merged commit 1918856 into jupyter:master on Jun 13, 2018
@minrk added this to the 5.6 milestone on Jun 13, 2018
@yuvipanda deleted the prometheus-intro branch on June 13, 2018 18:10
@github-actions bot locked as resolved and limited conversation to collaborators on Apr 1, 2021