Monitoring Grafana #3302

Closed
jaimegago opened this issue Nov 21, 2015 · 36 comments

Comments

Contributor

@jaimegago jaimegago commented Nov 21, 2015

It's time to monitor the monitoring! It'd be great to have a /status or /health endpoint that returns Grafana health data as JSON.

Things I'd like to get from a status endpoint are:

  • configured data sources are reachable (when I configure a new Graphite source I can test the connection; I'd love to have that via the /status API)
  • DB is available
  • configured authorization sources are reachable
  • version

e.g:

/status

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

Contributor

@anryko anryko commented Nov 21, 2015

++

Contributor

@kjedamzik kjedamzik commented Nov 27, 2015

👍

Member

@torkelo torkelo commented Dec 8, 2015

Make sure the health URL does not generate sessions.

Contributor

@mattttt mattttt commented Jan 8, 2016

👍

Contributor

@williamjoy williamjoy commented Jan 11, 2016

+1, this would be very useful for running Grafana behind a load balancer: the load balancer would call the /health HTTP endpoint to verify that Grafana returns HTTP 200 OK.

Contributor

@theangryangel theangryangel commented Jun 4, 2016

I've put together something dead simple, but I'm not particularly happy with it at the moment.

If anyone would like to take a look at current state vs master: master...theangryangel:feature/health_check

It returns something like:

{"current_timestamp":"2016-06-04T18:43:49+01:00","database_ok":true,"session_ok":true,"version":{"built":1464981754,"commit":"v3.0.4+158-g7cbaf06-dirty","version":"3.1.0"}}

For the database check I was originally returning some stats, but I've cut that out. I could switch the query to something much simpler, like "select 1", and check that it doesn't error. Not sure if it's worth it.

I'm not particularly happy with the session check either. There doesn't seem to be an easy way to test it without standing up a test macaron server and recover()ing from the panic that it would throw when starting a session provider, or modifying macaron/session to add a test feature to each of the providers. As it is right now it irritatingly returns a Set-Cookie header, which I don't particularly want. I'd appreciate some input on where to take this from someone more experienced with macaron 😞

Checking data sources doesn't seem particularly sane to attempt through this endpoint, given how Grafana is written. It is probably more sane to add that to your regular monitoring system.

@wpt1313 wpt1313 commented Jun 10, 2016

I was facing the same issue. As a workaround, I use an API call from the load balancer with a dedicated authentication API key. I'm using HAProxy, which has a useful "hidden" feature: setting custom HTTP headers in option httpchk:

option httpchk GET /api/org HTTP/1.0\r\nAccept:\ application/json\r\nContent-Type:\ application/json\r\nAuthorization:\ Bearer\ your_api_key\r\n

(I need to use HTTP/1.0 rather than 1.1, since the latter requires setting the Host header and I can't get it dynamically in the HAProxy config.)

/api/org seems to be the simplest request with little overhead and returns HTTP 200, which is exactly what the load balancer needs -- and does not create any new sessions.
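
For anyone scripting the same workaround outside HAProxy, here is a minimal Python sketch of the request described above: GET /api/org with a Bearer API key. The base URL and key are placeholders, and `build_org_check` and `org_check_passes` are hypothetical helper names, not part of Grafana or HAProxy.

```python
import urllib.error
import urllib.request


def build_org_check(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the GET /api/org request with the same headers as the httpchk line."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/org",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="GET",
    )


def org_check_passes(request: urllib.request.Request, timeout: float = 2.0) -> bool:
    """Return True only if the request is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connection, timeout, or an HTTP error status: treat as down.
        return False
```

Usage would look like `org_check_passes(build_org_check("http://grafana.example:3000", "your_api_key"))`, where the host name is an assumption for illustration.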

@iceycake iceycake commented Jul 7, 2016

Any progress or PR on this issue?

@tuxtek tuxtek commented Sep 29, 2016

+1

@JorritSalverda JorritSalverda commented Sep 29, 2016

I would split this into separate /liveness and /readiness endpoints, as is best practice in Kubernetes. /liveness only indicates whether Grafana itself is up and running; /readiness indicates whether it's ready to receive traffic and checks whether its dependencies are reachable.

In Kubernetes the liveness endpoint is probed, and when it fails a number of times to respond with 200 OK the container is killed and replaced with a new one. The readiness endpoint is used to make the container part of a service and send traffic its way, like adding it to and removing it from a load balancer.
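
To make the split concrete, here is a small Python sketch of the two handlers as plain functions returning a status code and a body. It is only an illustration of the idea, not Grafana code: the function names, the body shapes, and the use of 503 for "not ready" are all assumptions.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """The process is running and able to answer: no dependency is consulted."""
    return 200, {"alive": True}


def readiness(dependency_checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every dependency check; ready only if all of them pass."""
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that blows up counts as an unreachable dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503  # 503 is an assumed choice
    return status, results
```

With this shape, a database outage flips readiness to 503 (traffic is routed away) while liveness keeps answering 200 (the container is not restarted).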

@marco-hoyer marco-hoyer commented Oct 12, 2016

+1

@bigkraig bigkraig commented Nov 3, 2016

What about adding a /metrics Prometheus endpoint?

@bergquist bergquist added this to the 4.1.0 milestone Nov 3, 2016

@vinhlh vinhlh commented Nov 8, 2016

+1

@vinhlh vinhlh commented Nov 8, 2016

For whoever needs health checks on some services like Amazon ECS:
Use this hack: Path /public/img/grafana_icon.svg, HTTP Code: 200.

@philip-wernersbach philip-wernersbach commented Nov 14, 2016

+1

@envintus envintus commented Dec 5, 2016

In the meantime, if you're only looking for a simple HTTP 200, just use /login. My colleague and I just deployed Grafana to a Kubernetes cluster, and using that endpoint worked just fine for the liveness/readiness probes. It also works for the Google Compute Engine load balancer.

@andyfeller andyfeller commented Dec 5, 2016

@philip-wernersbach philip-wernersbach commented Dec 6, 2016

I'd like to add our specific use case: we need a simple HTTP endpoint for checking whether a user can log in and display graphs. I know that we can use the static resources and endpoints such as /login to work around the absence of this, but we really need something that checks that the Grafana internals are running as expected. We don't necessarily need status checks for retrieving data from data sources, as we have separate health checks for those.

@envintus envintus commented Dec 6, 2016

@torkelo torkelo removed this from the 4.1.0 milestone Dec 14, 2016
Member

@torkelo torkelo commented Dec 14, 2016

So in 4.0 there is currently an /api/metrics endpoint with some internal metrics.

But the issue requests something like this:

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

It would be good to have a more detailed description of what is expected here. Should the health API call do a live check of all data sources in all orgs? Should it be done on the fly as the /health API call is made?
What does "authorization ok" mean?

@andyfeller andyfeller commented Dec 14, 2016

@torkelo going to toss out an idea, but I definitely think /health should allow both grafana-server and installed plugins to register arbitrary things to report on:

{
	"ok": false,
	"items": {
		"datasources": {
			"ok": true
		},
		"database": {
			"ok": false,
			"msg": "Cannot communicate ###.###.###.###/XXXXXXX"
		},
		...
	}
}

By default, health checks perform live checks of everything when the endpoint is called. If people want to isolate health checks to specific things, you can do something like Elasticsearch does for cluster health. When the thing is an external service (authorization, database, etc.), a connectivity test is done at minimum, plus any other sanity check that is reasonable for it (e.g. SELECT 1 for the database, an LDAP bind test for authorization, etc.).

Having output like this will allow monitoring checks to look for issues holistically while still pinpointing specific problems and reporting accordingly.
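
A rough Python sketch of the registration idea, to show the shape rather than propose an implementation. `HealthRegistry`, `register`, and `report` are hypothetical names; each registered check is assumed to return an `(ok, msg)` pair, and `only` is an assumed way of isolating the report to specific things.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

CheckResult = Tuple[bool, Optional[str]]


class HealthRegistry:
    """Collects named checks from the server and from plugins."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], CheckResult]] = {}

    def register(self, name: str, check: Callable[[], CheckResult]) -> None:
        self._checks[name] = check

    def report(self, only: Optional[Iterable[str]] = None) -> dict:
        """Run the checks live and build the {"ok": ..., "items": ...} document."""
        selected = set(only) if only is not None else None
        items = {}
        for name, check in self._checks.items():
            if selected is not None and name not in selected:
                continue
            try:
                ok, msg = check()
            except Exception as exc:
                # A crashing check is reported as a failure, not propagated.
                ok, msg = False, str(exc)
            entry = {"ok": ok}
            if msg:
                entry["msg"] = msg
            items[name] = entry
        # Overall status is true only when every reported item is true.
        return {"ok": all(entry["ok"] for entry in items.values()), "items": items}
```

A plugin would call `register("my_plugin", some_check)` at start-up, and the /health handler would serialize `report()` as JSON.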

@aseppala aseppala commented Jan 24, 2017

+1

Contributor Author

@jaimegago jaimegago commented Jan 24, 2017

@torkelo sorry for the delayed answer, I just saw your questions.

TL;DR
@andyfeller did a good job in his comment and it's pretty much what I had in mind.

The endpoint (or endpoints) used to monitor Grafana should answer 2 questions with details:
A) Is this Grafana instance up and ready?
B) Is this Grafana instance running as expected according to its configuration intents?

"Configuration intents" is key here. What I mean by intent is that when, for example, the admin adds a data source, she expects it to be available regardless of whether or not the saved configuration is right. Thus if a configured data source is not available to Grafana, the monitoring endpoint should say so and why, in the same fashion as the extremely useful "test" button works.

It helps me to think in terms of a plane taking off: first I need to know the plane has finished taking off and is in the air, then I need to know the plane is flying towards its destination as expected (let's not get into what "reaching cruise altitude" means ;-) )

This can be somewhat compared to the /live and /ready endpoints others have pointed out, to /health (1) and /state (2) in the Elasticsearch model, or to /health and /info in Sensu (3).
IMHO one endpoint is enough, but seeing 2 endpoints in most modern tools is kinda changing my mind; let's just say I'm not persuaded yet, as I think B is a subset of A, so I'd make the returned JSON reflect that instead of having 2 endpoints. Then one day, when Grafana can be clustered, a "/cluster_state" can be added.

Now regarding the details of each answer, here are my (non-exhaustive) initial thoughts:
A details:

  • Status (e.g. red/yellow/green)
  • Status comment (e.g. "All is good"/"Couldn't start component Foo"/"Starting")
  • Version (e.g. v4.1.1-1)

B details:

  • DB Status (e.g. red/yellow/green)
  • DB details (e.g. "couldn't connect, bad auth", or "connection ok to MySQL v4.1 at xxx.yyy.zzz:3306, schema version v34132"; yes, SQL schemas should be versioned (4))
  • Authentication/Authorization (e.g. LDAP connection to xx.xx.xx:389 ok)
  • Data sources (e.g. Datasource 1, type Graphite, status Red, status comment "auth failure"; Datasource 2, type Elasticsearch, status Green, status comment "all good")

There is much more that can go in B which is why breaking the monitoring into 2 end points might make more sense, meh.
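
One way the red/yellow/green statuses in the A and B lists could fit together is a "worst status wins" roll-up from the B components into the single A status. That rule is an assumption made for this sketch and is not something the thread settles on; `overall_status` is a hypothetical helper.

```python
# Higher number means worse; the ordering itself is an assumption.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def overall_status(component_statuses: dict) -> str:
    """Return the worst colour among the components, or green if there are none."""
    if not component_statuses:
        return "green"
    return max(component_statuses.values(), key=SEVERITY.__getitem__)
```

For example, a green database with one red data source would roll up to red, which is what a monitoring check would alert on before drilling into the per-component details.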

As to how to go about what happens when the endpoint is queried (on the fly, APIs, etc.), I would defer to whoever ends up implementing it.

A couple of (obvious?) pieces of advice, though:

  • be very mindful of the resources used to collect monitoring data and be very "protective" with the instrumentation code; help Grafana admins avoid "my monitoring of Grafana took Grafana down" or "Grafana has slowed down by X % since I started monitoring it" situations.

  • be as certain as you can of the monitoring data provided; alert fatigue is a plague

(1) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
(2) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html
(3) https://sensuapp.org/docs/0.23/api/health-and-info-api.html#the-info-api-endpoint
(4) https://blog.codinghorror.com/get-your-database-under-version-control/

@dynek dynek commented Mar 23, 2017

So 4.2.0 just came out and there is still no way to probe the service? (think k8s cluster)

Contributor Author

@jaimegago jaimegago commented Mar 23, 2017

@torkelo I think @dynek has a point: this is not optional anymore. Whether it's a new section in the docs dedicated to "how to monitor Grafana", documenting what can be done today with the existing instrumentation (e.g. leveraging the admin or metrics page), or a fully fledged dedicated API like in this proposal, we need something yesterday.
Please don't take this the wrong way; I don't mean to tell you what the priorities should be. It's just that it's a tough sell for an application to be "Enterprise Ready" without a dedicated section on how to monitor it.

@torkelo torkelo added this to the 4.3.0 milestone Mar 27, 2017

@al-joshwilliams al-joshwilliams commented Apr 7, 2017

+1

torkelo added a commit that referenced this issue Apr 25, 2017
…creates sessions, returns db status as well. #3302
Member

@torkelo torkelo commented Apr 25, 2017

Added a simple HTTP endpoint to check Grafana health:

GET /api/health 
{
  "commit": "349f3eb",
  "database": "ok",
  "version": "4.1.0"
}

If the database (mysql/postgres/sqlite3) is not reachable, it will return "failing" in the database field. Grafana will still answer with status code 200. Not sure what is correct in that case.

The most important thing about this endpoint is that it will never cause sessions to be created (something other API calls might do if you do not call them with an API key or basic auth).
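
Since the endpoint as described in this comment answers 200 even when the database is failing, a caller has to look at the body rather than the status code. A minimal Python sketch of that check follows; `database_ok` is a hypothetical helper, and it relies only on the `database` field shown in the example response above.

```python
import json


def database_ok(health_body: str) -> bool:
    """Return True only if the /api/health body reports "database": "ok"."""
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        # An unparsable body is treated as unhealthy rather than raising.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"
```

A monitoring script would fetch GET /api/health and pass the response text to this function.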

@torkelo torkelo closed this Apr 25, 2017

@ConorNevin ConorNevin commented Apr 25, 2017

Wouldn't it be best to return with status code 503 when the database is unreachable?

@adamcstephens adamcstephens commented Apr 25, 2017

Kubernetes uses:

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

Member

@torkelo torkelo commented Apr 25, 2017

Yes, I think a 503 status code when the DB check fails is best; will update.
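
To illustrate why the status code matters, here is a small Python sketch combining the Kubernetes success rule quoted above with the 503 proposal. Both function names are hypothetical, and the mapping from the `database` field to a status code is a reading of this discussion, not Grafana's actual code.

```python
def health_http_status(database_field: str) -> int:
    """Proposed behaviour: 200 when the database is ok, 503 otherwise."""
    return 200 if database_field == "ok" else 503


def kubernetes_probe_succeeded(status_code: int) -> bool:
    """Any code >= 200 and < 400 counts as success; anything else is a failure."""
    return 200 <= status_code < 400
```

With the earlier always-200 behaviour, `kubernetes_probe_succeeded(200)` is true even when the body says "failing", so the probe would never notice a database outage; with 503 it does.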

daniellee added a commit that referenced this issue May 10, 2017
ref #8277, ref #8250, ref #8262, ref #8165, ref #8093, ref #8056, ref #8043, ref #7970, ref #7914, ref #7864, ref #7750, ref #7740, ref #7697, ref #7619, ref #5619, ref #4030, ref #5278, ref #3302, ref #2524
daniellee added a commit that referenced this issue May 10, 2017

@JorritSalverda JorritSalverda commented Oct 26, 2017

The 503 means the /api/health endpoint is best used only for the readiness check in Kubernetes. If this check is used for liveness, a database issue will lead to all pods getting killed. Is there a query parameter to leave out the database check?

Contributor

@bedrin bedrin commented Nov 1, 2017

@JorritSalverda you could probably use a tcpSocket check in the livenessProbe.
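
A tcpSocket probe only verifies that a TCP connection to the port can be opened, so it does not depend on the database. A rough Python equivalent, for illustration only: `tcp_port_open` is a hypothetical helper, and the host and port are whatever your deployment uses.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: the probe fails.
        return False
```

This says nothing about whether Grafana can serve requests, only that the process is listening, which is why it suits liveness better than readiness.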

Contributor

@bergquist bergquist commented Nov 1, 2017

/metrics will not create sessions or issue a db request.

@micachen micachen commented Aug 21, 2018

We typically have aggressive readiness checks and relaxed liveness checks: 1 second and 1 failure for readiness, versus 60 seconds, 10 failures and 1 success for liveness. This allows responsive rerouting when there is an issue while preventing unnecessary pod restarts when self-recovery is possible. A persistent DB issue would still cause a restart, which might actually help if it was due to some bad container state.
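
The effect of those thresholds can be sketched with a simplified probe model in Python. This is not how the kubelet is implemented: `ProbeState` is a hypothetical class that only counts consecutive results, and `seconds_until_action` ignores probe timeouts and initial delays.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    period_seconds: int
    failure_threshold: int
    success_threshold: int = 1
    healthy: bool = True
    _fails: int = 0
    _successes: int = 0

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return the resulting healthy flag."""
        if ok:
            self._successes += 1
            self._fails = 0
            if self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._fails += 1
            self._successes = 0
            if self._fails >= self.failure_threshold:
                self.healthy = False
        return self.healthy

    def seconds_until_action(self) -> int:
        """Rough time of continuous failure before the probe is considered failed."""
        return self.period_seconds * self.failure_threshold


# The numbers from the comment above.
readiness = ProbeState(period_seconds=1, failure_threshold=1)
liveness = ProbeState(period_seconds=60, failure_threshold=10, success_threshold=1)
```

Under this model a failing instance is pulled out of rotation after about 1 second, but is only restarted after roughly 600 seconds of continuous failure.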

finkr added a commit to finkr/grafana that referenced this issue Jan 25, 2019
Document the health check implemented in grafana#3302 (and grafana#935), see  grafana#3302 (comment)
This was referenced Jan 25, 2019
jschill added a commit that referenced this issue Jan 28, 2019
Document the health check implemented in #3302 (and #935), see  #3302 (comment)
dghubble added a commit to poseidon/typhoon that referenced this issue Mar 24, 2019
@cnouguier cnouguier mentioned this issue May 10, 2019
5 of 6 tasks complete

@suridaddy suridaddy commented Mar 26, 2021

@finkr /api/health takes too long to respond with 503. Is there any way to make it respond more quickly?

@andyfeller andyfeller commented Mar 26, 2021

@finkr /api/health takes too long to respond with 503. Is there any way to make it respond more quickly?

@suridaddy: it might be easier to visit the Grafana community forums or the more interactive support channels, along with more information to troubleshoot your problem. This issue was a feature/improvement request and is closed.
