Monitoring Grafana #3302

Closed
jaimegago opened this Issue Nov 21, 2015 · 34 comments

@jaimegago (Contributor) commented Nov 21, 2015

It's time to monitor the monitoring! It'd be great to have a /status or /health endpoint that returns Grafana health data as JSON.

Things I'd like to get from a status endpoint are:

  • configured data sources are reachable (when I configure a new Graphite source I can test the connection; I'd love to have that via the /status API)
  • DB is available
  • configured authorization sources are reachable
  • version

e.g.:

/status

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

@anryko (Contributor) commented Nov 21, 2015

++

@kjedamzik (Contributor) commented Nov 27, 2015

👍

@torkelo (Member) commented Dec 8, 2015

Make sure the health URL does not generate sessions.

@mattttt (Contributor) commented Jan 8, 2016

👍

@williamjoy (Contributor) commented Jan 11, 2016

+1, this would be very useful for running Grafana behind a load balancer; the load balancer would call the /health endpoint to verify that Grafana returns HTTP 200 OK.

@theangryangel (Contributor) commented Jun 4, 2016

I've put together something dead simple, but I'm not particularly happy with it at the moment.

If anyone would like to take a look at current state vs master: master...theangryangel:feature/health_check

It returns something like:

{"current_timestamp":"2016-06-04T18:43:49+01:00","database_ok":true,"session_ok":true,"version":{"built":1464981754,"commit":"v3.0.4+158-g7cbaf06-dirty","version":"3.1.0"}}

The database check originally returned some stats, but I've cut that out. I could switch the query to something much simpler like "select 1" and just check that it doesn't error. Not sure if it's worth it.

I'm not particularly happy with the session check either. There doesn't seem to be an easy way to test it without standing up a test macaron server and recover()ing from the panic that it would throw when starting a session provider, or modifying macaron/session to add a test feature to each of the providers. As it is right now it irritatingly returns a Set-Cookie header, which I don't particularly want. I'd appreciate some input on where to take this from someone more experienced with macaron 😞

Checking data sources doesn't seem particularly sane to attempt through this endpoint, given how Grafana is written. It's probably more sane to add that to your regular monitoring system.
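For reference, outside of Grafana's macaron stack the general shape of such a check is quite small. Here is a minimal, hedged sketch using only net/http and database/sql; the path, field names, driver, and DSN are illustrative and not taken from the branch above:

```go
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"time"

	// The driver choice is an assumption; use whatever backs your install.
	_ "github.com/go-sql-driver/mysql"
)

// healthHandler pings the database with a trivial query and reports the
// result as JSON. It sets no cookies, so probing it cannot create sessions.
func healthHandler(db *sql.DB, version string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		status := map[string]interface{}{
			"current_timestamp": time.Now().Format(time.RFC3339),
			"version":           version,
			"database_ok":       true,
		}

		// "select 1" proves connectivity without exposing any stats.
		var one int
		if err := db.QueryRow("select 1").Scan(&one); err != nil {
			status["database_ok"] = false
		}

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(status)
	}
}

func main() {
	// Placeholder DSN for illustration only.
	db, err := sql.Open("mysql", "grafana:secret@tcp(127.0.0.1:3306)/grafana")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/health", healthHandler(db, "3.1.0"))
	log.Fatal(http.ListenAndServe(":3001", nil))
}
```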

@wpt1313 commented Jun 10, 2016

I was facing the same issue and, as a workaround, I use an API call from the load balancer with a dedicated authentication API key. I'm using HAProxy, which has a useful "hidden" feature of setting custom HTTP headers in option httpchk:

option httpchk GET /api/org HTTP/1.0\r\nAccept:\ application/json\r\nContent-Type:\ application/json\r\nAuthorization:\ Bearer\ your_api_key\r\n

(I need to use HTTP/1.0 rather than 1.1, since the latter requires setting the Host header and I can't get it dynamically in the HAProxy config.)

/api/org seems to be the simplest request with little overhead and returns HTTP 200, which is exactly what the load balancer needs -- and does not create any new sessions.

@iceycake commented Jul 7, 2016

Any progress or PR on this issue?

@tuxtek commented Sep 29, 2016

+1

@JorritSalverda commented Sep 29, 2016

I would split this into separate /liveness and /readiness endpoints, as is best practice in Kubernetes. /liveness only indicates whether Grafana itself is up and running; /readiness indicates whether it's ready to receive traffic and will check whether its dependencies are reachable.

In Kubernetes the liveness endpoint will be probed, and when it fails to respond with 200 OK a number of times the container will be killed and replaced with a new one. The readiness endpoint is used to make the container part of a service and send traffic its way, like adding it to and removing it from a load balancer.
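Purely as an illustration of that split (not Grafana code; the paths and the dependency-check function are placeholders), the two endpoints could be as simple as:

```go
package main

import (
	"log"
	"net/http"
)

// liveness only proves the process is up and able to serve HTTP;
// a failure here tells the orchestrator to restart the container.
func liveness(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readiness additionally checks dependencies (database, auth backend, ...)
// and answers 503 so the orchestrator stops routing traffic here
// without killing the process.
func readiness(checkDeps func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := checkDeps(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	noDeps := func() error { return nil } // plug real dependency checks in here
	http.HandleFunc("/liveness", liveness)
	http.HandleFunc("/readiness", readiness(noDeps))
	log.Fatal(http.ListenAndServe(":3001", nil))
}
```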

@marco-hoyer commented Oct 12, 2016

+1

@bigkraig commented Nov 3, 2016

what about adding a /metrics Prometheus endpoint?

@bergquist added this to the 4.1.0 milestone Nov 3, 2016

@vinhlh commented Nov 8, 2016

+1

@vinhlh commented Nov 8, 2016

For whoever needs health checks on services like Amazon ECS:
Use this hack: path /public/img/grafana_icon.svg, expected HTTP code 200.

@philip-wernersbach commented Nov 14, 2016

+1

@envintus commented Dec 5, 2016

In the meantime, if you're only looking for a simple HTTP 200, just use /login. My colleague and I just deployed Grafana to a Kubernetes cluster, and using that endpoint worked just fine for the liveness/readiness probes. It also works for the Google Compute Engine load balancer.

@andyfeller commented Dec 5, 2016

@philip-wernersbach commented Dec 6, 2016

I'd like to add our specific use case: we need a simple HTTP endpoint for checking whether a user can log in and display graphs. I know that we can use static resources and endpoints such as /login to work around the absence of this, but we really need something that checks that the Grafana internals are running as expected. We don't necessarily need status checks for retrieving data from data sources, as we have separate health checks for those.

@envintus commented Dec 6, 2016

@torkelo removed this from the 4.1.0 milestone Dec 14, 2016

@torkelo (Member) commented Dec 14, 2016

So there is currently, in 4.0, a /api/metrics endpoint with some internal metrics.

But the issue requests something like this:

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

It would be good to have a more detailed description of what is expected here. Should the health API call do a live check against all data sources in all orgs? Should it be done on the fly as the /health API call is made?
What does "authorization ok" mean?

@andyfeller commented Dec 14, 2016

@torkelo Going to toss out an idea, but I definitely think /health should allow both grafana-server and installed plugins to register arbitrary things to report on:

{
	"ok": false,
	"items": {
		"datasources": {
			"ok": true
		},
		"database": {
			"ok": false,
			"msg": "Cannot communicate ###.###.###.###/XXXXXXX"
		},
		...
	}
}

By default, health checks perform live checks of all things when the endpoint is called. If people want to isolate health checks to specific things, you could do something like Elasticsearch does for cluster health. When a thing is an external service (authorization, database, etc.), at minimum a connectivity test is done, plus any other sanity check that is reasonable for that thing (e.g. SELECT 1 for the database, an LDAP bind test for authorization, etc.).

Having output like this allows monitoring checks to look holistically for issues while still finding the specific problems, and to report accordingly.
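A hedged sketch of what such a pluggable set of checks could look like in Go; the Check interface, field names, and handler below are invented for illustration and are not an actual Grafana or plugin API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Check is one registered health item, e.g. "database" or "datasources".
// grafana-server and plugins would both register implementations of it.
type Check interface {
	Name() string
	Run() error
}

type item struct {
	OK  bool   `json:"ok"`
	Msg string `json:"msg,omitempty"`
}

// healthHandler runs every registered check live on each request and
// aggregates the results, so one probe can report per-item detail.
func healthHandler(checks []Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := struct {
			OK    bool            `json:"ok"`
			Items map[string]item `json:"items"`
		}{OK: true, Items: map[string]item{}}

		for _, c := range checks {
			result := item{OK: true}
			if err := c.Run(); err != nil {
				result = item{OK: false, Msg: err.Error()}
				resp.OK = false
			}
			resp.Items[c.Name()] = result
		}

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	}
}

// databaseCheck is a stand-in; a real implementation would run SELECT 1,
// and an authorization check would do an LDAP bind, and so on.
type databaseCheck struct{}

func (databaseCheck) Name() string { return "database" }
func (databaseCheck) Run() error   { return nil }

func main() {
	http.HandleFunc("/health", healthHandler([]Check{databaseCheck{}}))
	log.Fatal(http.ListenAndServe(":3001", nil))
}
```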

@aseppala commented Jan 24, 2017

+1

@jaimegago (Contributor) commented Jan 24, 2017

@torkelo Sorry for the delayed answer, I just saw your questions.

TL;DR: @andyfeller did a good job in his comment and it's pretty much what I had in mind.

The endpoint (or endpoints) used to monitor Grafana should answer 2 questions, with details:
A) Is this Grafana instance up and ready?
B) Is this Grafana instance running as expected according to its configuration intents?

"configuration intents" is key here, what I mean by intent is that when for example the admin adds as a data source she expects it to be available regardless of whether or not the saved configuration is right. Thus if a configured data source is not available to Grafana the monitoring end point should say so and why, in the same fashion the extremely useful "test" button works.

It helps me to think in terms of a plane taking off: first I need to know the plane has finished taking off and is in the air, then I need to know the plane is flying towards its destination as expected (let's not get into what "reaching cruise altitude" means ;-) ).

This can be somewhat compared to the /live and /ready endpoints others have pointed out, or to /health (1) and /state (2) of the Elasticsearch model, or /health and /info of Sensu (3).
IMHO one endpoint is enough, but seeing 2 endpoints in most modern tools is kinda changing my mind; let's just say I'm not persuaded yet, as I think B is a subset of A, so I'd make the returned JSON reflect that instead of having 2 endpoints. Then one day, when Grafana can be clustered, a /cluster_state can be added.

Now regarding the details of each answer, here are my (non-exhaustive) initial thoughts:
A details:

  • Status (e.g. red/yellow/green)
  • Status comment (e.g. "All is good"/"Couldn't start component Foo"/"Starting")
  • Version (e.g. v4.1.1-1)

B details:

  • DB Status (e.g. red/yellow/green)
  • DB details (e.g. "couldn't connect, bad auth", or "connection ok to MySQL v4.1 at xxx.yyy.zzz:3306, schema version v34132"; yes, SQL schemas should be versioned (4))
  • Authentication/Authorization (e.g. LDAP connection to xx.xx.xx:389 ok)
  • Data sources (e.g. Datasource 1, type Graphite, status Red, status comment "auth failure"; Datasource 2, type Elasticsearch, status Green, status comment "all good")

There is much more that can go in B, which is why breaking the monitoring into 2 endpoints might make more sense, meh (a rough sketch of the two answers as types is at the end of this comment).

As to how to go about what happens when the endpoint is being queried (on the fly, APIs, etc.), I would defer to whoever ends up implementing it.

A couple of (obvious?) pieces of advice though:

  • be very mindful of the resources used to collect monitoring data and be very "protective" with the instrumentation code; help Grafana admins avoid "my monitoring of Grafana took Grafana down" or "Grafana has slowed down by X% since I started monitoring it" situations.

  • be as certain as you can about the provided monitoring data; alert fatigue is a plague.

(1) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
(2) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html
(3) https://sensuapp.org/docs/0.23/api/health-and-info-api.html#the-info-api-endpoint
(4) https://blog.codinghorror.com/get-your-database-under-version-control/
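
To make the A and B answers above a bit more concrete, here is a rough sketch of the two payloads as Go types; every field name is illustrative, nothing here is an agreed API:

```go
package health

// InstanceStatus answers question A: is this Grafana instance up and ready?
type InstanceStatus struct {
	Status  string `json:"status"`         // e.g. "red", "yellow", "green"
	Comment string `json:"status_comment"` // e.g. "All is good", "Starting"
	Version string `json:"version"`        // e.g. "v4.1.1-1"
}

// ConfigStatus answers question B: is the instance running according to its
// configuration intents (database, auth, data sources)?
type ConfigStatus struct {
	Database    ComponentStatus   `json:"database"`
	Auth        ComponentStatus   `json:"authentication"`
	DataSources []ComponentStatus `json:"data_sources"`
}

// ComponentStatus describes one configured dependency and how it is doing.
type ComponentStatus struct {
	Name    string `json:"name,omitempty"` // e.g. "Datasource 1"
	Type    string `json:"type,omitempty"` // e.g. "graphite", "elasticsearch"
	Status  string `json:"status"`         // red/yellow/green
	Comment string `json:"status_comment"` // e.g. "auth failure", "all good"
}
```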

@dynek commented Mar 23, 2017

So 4.2.0 just came out and there still is no way to probe the service? (think k8s cluster)

@jaimegago (Contributor) commented Mar 23, 2017

@torkelo I think @dynek has a point; this is not optional anymore. Whether it's a new section in the docs dedicated to "how to monitor Grafana", documenting what can be done today with the existing instrumentation (e.g. leveraging the admin or metrics pages), or a fully fledged dedicated API like in this proposal, we need something yesterday.
Please don't take this the wrong way, I don't mean to tell you what the priorities should be. It's just that it's a tough sell for an application to be "Enterprise Ready" without a dedicated answer for how to monitor it.

@torkelo added this to the 4.3.0 milestone Mar 27, 2017

@al-joshwilliams commented Apr 7, 2017

+1

torkelo added a commit that referenced this issue Apr 25, 2017

feat: added api health endpoint that does not require auth and never creates sessions, returns db status as well. #3302
@torkelo (Member) commented Apr 25, 2017

Added a simple HTTP endpoint to check Grafana health:

GET /api/health 
{
  "commit": "349f3eb",
  "database": "ok",
  "version": "4.1.0"
}

If the database (mysql/postgres/sqlite3) is not reachable, it will return "failing" in the database field. Grafana will still answer with status code 200. Not sure what is correct in that case.

The most important thing about this endpoint is that it will never cause sessions to be created (something other API calls might do if you do not call them with an API key or basic auth).

@torkelo closed this Apr 25, 2017

@ConorNevin commented Apr 25, 2017

Wouldn't it be best to return with status code 503 when the database is unreachable?

@adamcstephens commented Apr 25, 2017

Kubernetes uses:

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

@torkelo (Member) commented Apr 25, 2017

Yes, I think a 503 status code when the db check fails is best, will update.
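
A hedged sketch of the shape of that change (this is not the actual Grafana handler; the ping function and payload values are placeholders): keep the JSON body, but answer 503 when the database check fails so load balancers and Kubernetes probes can act on the status code alone.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// apiHealth returns the usual health JSON, but answers 503 instead of 200
// when the database ping fails.
func apiHealth(pingDB func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		payload := map[string]string{
			"commit":   "349f3eb", // placeholder build info
			"version":  "4.1.0",
			"database": "ok",
		}

		w.Header().Set("Content-Type", "application/json")
		if err := pingDB(); err != nil {
			payload["database"] = "failing"
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		json.NewEncoder(w).Encode(payload)
	}
}

func main() {
	alwaysUp := func() error { return nil } // replace with a real DB ping
	http.HandleFunc("/api/health", apiHealth(alwaysUp))
	log.Fatal(http.ListenAndServe(":3001", nil))
}
```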

daniellee added a commit that referenced this issue May 10, 2017

Update CHANGELOG.md
ref #8277, ref #8250, ref #8262, ref #8165, ref #8093, ref #8056, ref #8043, ref #7970, ref #7914, ref #7864, ref #7750, ref #7740, ref #7697, ref #7619, ref #5619, ref #4030, ref #5278, ref #3302, ref #2524

daniellee added a commit that referenced this issue May 10, 2017

@JorritSalverda commented Oct 26, 2017

The 503 means the /api/health endpoint is best used only for the readiness check in Kubernetes. If this check is used for liveness, a database issue will lead to all pods getting killed. Is there a query parameter to leave out the database check?

@bedrin (Contributor) commented Nov 1, 2017

@JorritSalverda you could probably use tcpSocket check in livenessProbe

@bergquist (Contributor) commented Nov 1, 2017

/metrics will not create sessions or issue a db request.

@micachen commented Aug 21, 2018

We typically have aggressive readiness checks and relaxed liveness checks: 1 second and 1 failure for readiness, versus 60 seconds, 10 failures and 1 success for liveness. This allows for responsive rerouting when there is an issue while, if self-recovery is possible, preventing unnecessary pod restarts. But a persistent DB issue would cause a restart, which might actually help if it was due to some bad container state.
