
/graph returns 503 but /-/healthy returns 200 #4624

Open
geekdave opened this Issue Sep 17, 2018 · 8 comments

geekdave commented Sep 17, 2018

Bug Report

What did you do?

I started up a new Prometheus server, and copied over snapshot data from an old server. It ran fine for about a day, and then started crashlooping with:

level=error ts=2018-09-17T20:49:33.422356518Z caller=main.go:596 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1536948000000, maxt: 1536953793542, range: 1h36m33s, blocks: 304]: <ulid: 01CQD3N1Z6BEY5CMJ2X666C9AB, mint: 1536948000000, maxt: 1536955200000, range: 2h0m0s>, <ulid: 01CQD3N4Q1PEF617RSKN4XJBEG, mint: 1536948000000, maxt: 1536955200000, range: 2h0m0s>, <ulid: 01CQD3N8FTCA791FC4TXWY2AJH, mint: 1536948000000,

I understand that I probably did something wrong, procedure-wise, with how I copied historical snapshot data into an existing Prometheus server. I started a thread on the mailing list for this, and would be grateful for any advice about how to avoid this in the future.

This issue is about something different though. The above crashloop is causing the Prometheus /graph route to be unavailable:

$ curl -v localhost:9090/graph
[...snip...]
< HTTP/1.1 503 Service Unavailable

Yet the Prometheus health check endpoint returns "healthy":

$ curl -v localhost:9090/-/healthy
[...snip...]
< HTTP/1.1 200 OK
Prometheus is Healthy.

What did you expect to see?

The health check should fail when the Prometheus /graph endpoint returns 503.

What did you see instead? Under which circumstances?

The health check returned 200.

Environment

  • System information:
$ uname -srm
Linux 4.4.0-1066-aws x86_64
  • Prometheus version:
$ prometheus --version
prometheus, version 2.3.2 (branch: HEAD, revision: 71af5e29e815795e9dd14742ee7725682fa14b7b)
  build user:       root@5258e0bd9cc1
  build date:       20180712-14:02:52
  go version:       go1.10.3
  • Alertmanager version:

N/A

  • Prometheus configuration file:

N/A

  • Alertmanager configuration file:

N/A

  • Logs:

See above


simonpasquier commented Sep 17, 2018

Probably curl -v localhost:9090/-/ready would have returned 503 too, as it is wrapped by the same middleware handler:

prometheus/web/web.go

Lines 370 to 380 in f2d43af

// Checks if server is ready, calls f if it is, returns 503 if it is not.
func (h *Handler) testReady(f http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if h.isReady() {
			f(w, r)
		} else {
			w.WriteHeader(http.StatusServiceUnavailable)
			fmt.Fprintf(w, "Service Unavailable")
		}
	}
}

Looking at the code, the /-/healthy endpoint always returns 200.

EDIT: the initial version mentioned localhost:9090/-/healthy while I meant localhost:9090/-/ready


grobie commented Sep 18, 2018

See #3650 and #3816.


sevagh commented Sep 27, 2018

Just adding my two cents here: the /graph endpoint is a better measure. Here's a log from a very slow poller startup.

Starting at 08:37:38, the moment the poller was running (and doing some tsdb crunching), it reported healthy on /-/healthy:

Sep 27 08:37:38 clip prometheus[47977]: level=info ts=2018-09-27T15:37:38.01670204Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1538028000000 maxt=1538049600000 ulid=01CRDRX6XH4XG0TCFFAWKKQ0G8

[...snip...]
Sep 27 09:04:04 clip prometheus[47977]: level=warn ts=2018-09-27T16:04:04.379424045Z caller=head.go:320 component=tsdb msg="unknown series references in WAL samples" count=30

Sep 27 09:04:29 clip prometheus[47977]: level=info ts=2018-09-27T16:04:29.460790665Z caller=main.go:543 msg="TSDB started"
Sep 27 09:04:29 clip prometheus[47977]: level=info ts=2018-09-27T16:04:29.461029004Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus2/prometheus.yml
Sep 27 09:04:30 clip prometheus[47977]: level=info ts=2018-09-27T16:04:30.811795863Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus2/prometheus.yml
Sep 27 09:04:30 clip prometheus[47977]: level=info ts=2018-09-27T16:04:30.812430812Z caller=main.go:502 msg="Server is ready to receive web requests."

It's only at 09:04:30 that it was truly healthy, at which point /graph started working.


simonpasquier commented Sep 28, 2018

@sevagh in your case, /-/ready should return the same status as /graph.


Starefossen commented Oct 23, 2018

Same issue here using Prometheus v2.4.3: /graph and /-/ready return 503 while /-/healthy returns 200. This is what I get from the log; no errors, though:

level=info ts=2018-10-23T11:43:10.072191358Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2018-10-23T11:43:10.072279818Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2018-10-23T11:43:10.072310302Z caller=main.go:240 host_details="(Linux 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018 x86_64 prometheus-v2-dccf4b8fb-wbzdg (none))"
level=info ts=2018-10-23T11:43:10.072338089Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-10-23T11:43:10.072365978Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-10-23T11:43:10.073647975Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-10-23T11:43:10.073159772Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-10-23T11:43:12.443841325Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539151200000 maxt=1539172800000 ulid=01CSF7W1XP7R4EP4A8KJTPGTGZ
level=info ts=2018-10-23T11:43:12.445919405Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539172800000 maxt=1539194400000 ulid=01CSFWF6A44700VKWH6QXCQCMA
level=info ts=2018-10-23T11:43:12.447519087Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539194400000 maxt=1539216000000 ulid=01CSGH2BQTBN8EKP7MZ6EGD0QN
level=info ts=2018-10-23T11:43:12.449876475Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539216000000 maxt=1539237600000 ulid=01CSH5NK16BHR5GFSVJ4GZF6GY
level=info ts=2018-10-23T11:43:12.451505191Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539237600000 maxt=1539259200000 ulid=01CSHT8RY1WCSN5YARHMX8FG69
level=info ts=2018-10-23T11:43:12.453299674Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539259200000 maxt=1539280800000 ulid=01CSJEVZ12VF40YJHMFY60Q8QH
level=info ts=2018-10-23T11:43:12.455097513Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539280800000 maxt=1539302400000 ulid=01CSK3F2CS08MKX20J3D0C3NZ1
level=info ts=2018-10-23T11:43:12.456697014Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539302400000 maxt=1539324000000 ulid=01CSKR2AP0B78YHKE3XJCGFQYY
level=info ts=2018-10-23T11:43:12.45838266Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539324000000 maxt=1539345600000 ulid=01CSMCNEMHXAQB1QG1HYBKKFYE
level=info ts=2018-10-23T11:43:12.460698718Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539345600000 maxt=1539367200000 ulid=01CSN18MP5S3118RW1A7N4J8SJ
level=info ts=2018-10-23T11:43:12.462381043Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539367200000 maxt=1539388800000 ulid=01CSNNVSV8QSC8RS8DGZ6RQE0T
level=info ts=2018-10-23T11:43:12.463871598Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539388800000 maxt=1539410400000 ulid=01CSPAF02ZQG5RG22A7T7AH2WK
level=info ts=2018-10-23T11:43:12.465443166Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539410400000 maxt=1539432000000 ulid=01CSPZ25JHAKZ2NXZMT6BHC27Z
level=info ts=2018-10-23T11:43:12.467058109Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539432000000 maxt=1539453600000 ulid=01CSQKNBDH3JEWJQKT17ZDWAE2
level=info ts=2018-10-23T11:43:12.468843178Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539453600000 maxt=1539475200000 ulid=01CSR88HJNFN05RFFWE4ZABWPM
level=info ts=2018-10-23T11:43:12.470416737Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539475200000 maxt=1539496800000 ulid=01CSRWVPXEP9SAV5AGF77P856X
level=info ts=2018-10-23T11:43:12.471972953Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539496800000 maxt=1539518400000 ulid=01CSSHEVZ1WSRBFMWTE84GVAVD
level=info ts=2018-10-23T11:43:12.473762356Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539518400000 maxt=1539540000000 ulid=01CST621SG1ZE5TV6TKW8A4VN6
level=info ts=2018-10-23T11:43:12.475417471Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539540000000 maxt=1539561600000 ulid=01CSTTN7NWAGPG789DEW0EJQFD
level=info ts=2018-10-23T11:43:12.477107527Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539583200000 maxt=1539590400000 ulid=01CSVF86FCDXCJJHRRN8RR5549
level=info ts=2018-10-23T11:43:12.478721738Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539561600000 maxt=1539583200000 ulid=01CSVF8DT8V5J7CF0MWF0CMMH0
level=info ts=2018-10-23T11:43:12.480309711Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1539590400000 maxt=1539597600000 ulid=01CSVP3XS3NHHVKZCTSCHASJ1T
level=info ts=2018-10-23T11:43:12.486722289Z caller=wal.go:1262 component=tsdb msg="migrating WAL format"

shamimgeek commented Dec 14, 2018

@ALL: I am also hitting this: the /-/healthy endpoint returns "Prometheus is Healthy." but the /-/ready endpoint returns "Service Unavailable".

There is no error reported, though. What is the issue, and how can I solve it?


simonpasquier commented Dec 14, 2018

@shamimgeek can you share the logs? If your Prometheus isn't ready, it means that it is still in the initialization phase (usually checking the TSDB).


frateralexander commented Mar 29, 2019

Hi all,

I get this error when I try to remove a line in my ../node_exporter/nodes.json and restart my service:

curl -v localhost:9090/graph
Trying 127.0.0.1...
< HTTP/1.1 503 Service Unavailable

I didn't try /-/healthy. Could someone help me?
