Add a /metrics endpoint for Prometheus Metrics #3490

Merged
minrk merged 5 commits into jupyter:master from yuvipanda:prometheus-intro on Jun 13, 2018

5 participants
@yuvipanda
Contributor

yuvipanda commented Apr 2, 2018

Prometheus provides a standard
metrics format that can be collected and used in many contexts.

The JupyterHub and BinderHub projects already expose Prometheus
metrics natively. Adding this to the Jupyter notebook server
allows us to instrument the code easily and in
a standard format that has lots of 3rd party tooling for it.

This commit does the following:

  • Introduce the prometheus_client library as a dependency.
    This library has no dependencies of its own and is pure Python.
  • Add an authenticated /metrics endpoint to the server,
    which returns metrics in Prometheus Text Format
  • Expose the default process metrics from prometheus_client,
    which include memory usage and CPU usage info (for just the
    notebook process)
  • Expose per-handler HTTP metrics using the RED method
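For readers unfamiliar with the Prometheus text format mentioned in the list above, here is a minimal sketch of what a scrape response body looks like. It is a hand-rolled illustration in plain Python, not the code in this PR and not how prometheus_client renders its output. The metric name `demo_requests_total` and the helper `render_counter` are hypothetical, and escaping of label values is omitted for brevity.

```python
def render_counter(name, doc, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to a number.
    Label-value escaping is deliberately left out of this sketch.
    """
    lines = [f"# HELP {name} {doc}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and values, for illustration only.
body = render_counter(
    "demo_requests_total",
    "Requests handled (illustrative).",
    {(("method", "GET"), ("code", "200")): 3.0},
)
print(body)
```

Each metric family starts with a `# HELP` and a `# TYPE` line, followed by one line per label combination.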
@yuvipanda

Contributor Author

yuvipanda commented Apr 2, 2018

/cc @minrk @choldgraf @betatim @willingc who have come to like the Grafana / Prometheus integration we have in JupyterHub / BinderHub.

/cc @rgbkrk @ivanov who had conversations about exposing 'resource use metrics' as part of the kernel (IIRC). This PR is orthogonal to that, since it only deals with operational and performance metrics, rather than things like 'here is what is happening to your spark cluster!'

Add a /metrics endpoint for Prometheus Metrics
[Prometheus](https://prometheus.io/) provides a standard
metrics format that can be collected and used in many contexts.

- From the browser
  to drive 'current resource usage' displays, such
  as https://github.com/yuvipanda/nbresuse
- From a prometheus server
  to collect historical data for operational analysis and
  performance monitoring
  Example: https://grafana.mybinder.org/dashboard/db/1-overview?refresh=1m&orgId=1
  for mybinder.org metrics from JupyterHub and BinderHub,
  via prometheus server at https://prometheus.mybinder.org

The JupyterHub and BinderHub projects already expose Prometheus
metrics natively. Adding this to the Jupyter notebook server
allows us to instrument the code easily and in
a standard format that has lots of 3rd party tooling for it.

This commit does the following:

- Introduce the `prometheus_client` library as a dependency.
  This library has no dependencies of its own and is pure Python.
- Add an authenticated `/metrics` endpoint to the server,
  which returns metrics in Prometheus Text Format
- Expose the default process metrics from `prometheus_client`,
  which include memory usage and CPU usage info (for just the
  notebook process)

@yuvipanda yuvipanda force-pushed the yuvipanda:prometheus-intro branch from 3f6cf36 to a764f90 Apr 2, 2018

@yuvipanda

Contributor Author

yuvipanda commented Apr 2, 2018

The appveyor failure seems unrelated?

Log HTTP request codes & timings to Prometheus
Code adapted from JupyterHub
@yuvipanda

Contributor Author

yuvipanda commented Apr 2, 2018

I stole the code for implementing RED HTTP metrics from JupyterHub and added it here. With this, I can answer questions like 'how many times was the Tree handler called and what is the 90th percentile of response time for it?'
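As a rough illustration of how a "90th percentile" can be answered from histogram data, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket. This is broadly the approach a Prometheus server takes when evaluating a histogram quantile, but the function `estimate_quantile` and the bucket values are assumptions made up for this example, not code or data from the PR.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Open-ended or empty bucket: fall back to the last known bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts for request durations in seconds.
example_buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (math.inf, 100)]
p90 = estimate_quantile(0.9, example_buckets)
print(round(p90, 2))  # about 0.83
```

The estimate is only as precise as the bucket boundaries, which is why bucket choice matters for latency metrics.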

Use regular dashes, not em (or en?) dashes
Works in JupyterHub because python3, fails python2 test here.
@rgbkrk

Member

rgbkrk commented Apr 2, 2018

Is it standard practice to put it at /metrics instead of some /api endpoint?

@yuvipanda

Contributor Author

yuvipanda commented Apr 2, 2018

@rgbkrk

rgbkrk approved these changes Apr 2, 2018

method=handler.request.method,
handler='{}.{}'.format(handler.__class__.__module__, type(handler).__name__),
code=handler.get_status()
).observe(handler.request.request_time())

@takluyver

takluyver Apr 3, 2018

Member

I assume this is low overhead, since it's being called on every request?

@minrk

minrk Apr 3, 2018

Member

Yeah, quite. It's just incrementing a local counter based on a few strings and a number:

In [12]: %timeit prometheus_log_method(handler)
5.88 µs ± 87.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The only network activity occurs when a prometheus server retrieves the metrics via the /metrics handler.
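To make "incrementing a local counter based on a few strings and a number" concrete, here is a toy model of a labelled histogram in plain Python. It is a simplified sketch, not prometheus_client's implementation: the class `LabeledHistogram`, the bucket bounds, and the handler name used as a label are all assumptions chosen for illustration. The point it shows is that each observation touches one small fixed-size record, and that stored state grows with the number of distinct label combinations rather than with the number of requests.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; not the library's defaults.
BUCKETS = (0.01, 0.1, 1.0, 10.0, float("inf"))


class LabeledHistogram:
    """Toy labelled histogram: one fixed-size record per label combination."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = buckets
        # label tuple -> [count for each bucket..., running sum of values]
        self.series = defaultdict(lambda: [0] * len(buckets) + [0.0])

    def observe(self, labels, value):
        record = self.series[labels]
        # Find the first bucket whose upper bound is >= value and bump it.
        record[bisect_left(self.buckets, value)] += 1
        record[-1] += value


hist = LabeledHistogram()
ok_labels = ("GET", "notebook.tree.handlers.TreeHandler", 200)
for _ in range(1000):
    hist.observe(ok_labels, 0.05)
hist.observe(("GET", "notebook.tree.handlers.TreeHandler", 404), 0.002)

# 1001 observations, but only two stored records.
print(len(hist.series))
```

In this model the per-request work is a dictionary lookup, a binary search over a handful of bounds, and two additions.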

@takluyver

Member

takluyver commented Apr 3, 2018

This seems reasonable at a quick look.

How much information does it store? Is there any risk that if there's nowhere to hand the data off to, the memory use could continually grow as long as the server is left running?

conventions for metrics & labels. We generally prefer naming them
`<noun>_<verb>_<type_suffix>`. So a histogram that's tracking
the duration (in seconds) of servers spawning would be called
SERVER_SPAWN_DURATION_SECONDS.

@minrk

minrk Apr 3, 2018

Member

copy/paste. REQUEST_DURATION_SECONDS

@willingc

willingc Apr 3, 2018

Member

As an FYI, this particular example breaks the naming rule in the docstring.

@willingc

willingc Apr 3, 2018

Member

Perhaps better to remove the preference sentence with noun/verb/type.

Consider renaming to NOTEBOOK_REQUEST_DURATION_SECONDS based on Prometheus docs.

@takluyver

takluyver Apr 3, 2018

Member

Actually, it's not clear to me what a 'request duration' is - is that the time from the request being sent to it being received? The time from receiving the first byte to receiving the last? The time from receiving the request to sending the response?

If this is a standard term in web metrics, it doesn't matter that it's not familiar to me. But if it's a term we're creating, maybe we can create something less ambiguous.

@yuvipanda

yuvipanda Jun 11, 2018

Author Contributor

Heya!

I removed the naming convention recommendation and linked directly to the page instead. This should hopefully reduce confusion.

I've also renamed this metric to http_request_duration_seconds. I think that is pretty standard for what we are doing here, which is indiscriminately recording metric info for all HTTP requests. Operators usually use the job and instance labels that Prometheus adds automatically to tell apart different applications and different instances of an application. So I think in this case, it's OK not to use a prefix.

@takluyver

Member

takluyver commented Jun 10, 2018

I think the only bits people wanted changed here are in the docstring, and as an example of the naming I think it makes sense already (albeit that we don't actually use the example name here). So shall we merge this?

@yuvipanda

Contributor Author

yuvipanda commented Jun 11, 2018

@takluyver I've responded! Thank y'all for your patience :)

@rgbkrk

rgbkrk approved these changes Jun 11, 2018

@willingc
Member

willingc left a comment

Thanks @yuvipanda

@minrk minrk merged commit 1918856 into jupyter:master Jun 13, 2018

4 checks passed

codecov/patch 81.81% of diff hit (target 0%)
codecov/project Absolute coverage decreased by -1.66% but relative coverage increased by +5.74% compared to faa0cab
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@minrk minrk added this to the 5.6 milestone Jun 13, 2018

@yuvipanda yuvipanda deleted the yuvipanda:prometheus-intro branch Jun 13, 2018
