
Prometheus 2 failing to start after delete_series API call #3788

Closed
kerneljack opened this Issue Feb 2, 2018 · 8 comments


kerneljack commented Feb 2, 2018

What did you do?
We used curl to call the following endpoint:

curl --silent -H "Content-Type: application/json" -X POST -d '{"matchers":[{"type":"EQ","name":"service","value":"our-service"}]}' 'http://prometheus:9090/api/v2/admin/tsdb/delete_series' -vvv

Prometheus immediately crashed, and is now outputting the following to the logs:

level=info ts=2018-02-02T17:02:39.973235181Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-02-02T17:02:39.973330921Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-02-02T17:02:39.973358073Z caller=main.go:217 host_details="(Linux 4.4.86+ #1 SMP Tue Nov 21 22:05:47 PST 2017 x86_64 prometheus-server-67d547c877-djv5w (none))"
level=info ts=2018-02-02T17:02:39.9758817Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-02-02T17:02:39.976424177Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-02-02T17:02:39.976684574Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-02-02T17:02:40.248346254Z caller=main.go:323 msg="Opening storage failed" err="open block /data/01C4X4EWET82S507XRJ0EXPG7B: open /data/01C4X4EWET82S507XRJ0EXPG7B/tombstones: no such file or directory"

What did you expect to see?
Prometheus should have deleted the relevant data and continued working as normal.

What did you see instead? Under which circumstances?
The Prometheus pod is now continuously crashing (on Kubernetes).

Environment
Kubernetes pod on GKE (1.8.5)

kerneljack (Author) commented Feb 5, 2018

Any updates on this issue? I managed to get it started again by deleting the offending directory, but isn't it a bug that an API call can crash Prometheus this way? I'm no longer confident about running further delete_series calls.

Would updating to version 2.1 help with this?

krasi-georgiev (Member) commented Feb 5, 2018

@gouthamve I think this one is for you :)

gouthamve (Member) commented Feb 5, 2018

Hmm, sorry for not responding earlier. I need to see the logs of the service on which the delete_series was run. The above case can happen only if the rename (mv) call fails, and I am not sure why that would ever happen.

Are you running on NFS or something similar?

kerneljack (Author) commented Feb 6, 2018

Hi @gouthamve - when you refer to the service, do you mean the Prometheus service or the application that was generating the metrics? I'm not sure I can get the logs for the old Prometheus pod now, as it was destroyed and had to be recreated.

We are not using NFS, no. The Prometheus pod is using an attached PV though, which is just a standard external volume type mounted into it.

Where is this rename call happening in Prometheus?

gouthamve (Member) commented Feb 6, 2018

When you call delete, we rewrite the tombstones file (which is now missing, causing the error): https://github.com/prometheus/tsdb/blob/master/block.go#L463-L465

That in turn calls the rename: https://github.com/prometheus/tsdb/blob/d0982ac4d5057a45050c6cedf2d730ed50e5a19d/tombstones.go#L100, which essentially does an mv.
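
For illustration, here is a minimal sketch of that write-to-temp-then-rename pattern, using illustrative file and function names rather than the real prometheus/tsdb code. If the process dies after writing the temporary file but before the rename completes, the block is left with a tombstones.tmp and, depending on whether the destination was removed first, possibly no tombstones file at all - which matches what was later found on disk here.

// Sketch only: illustrates the write-to-temp-then-rename pattern described
// above. Names are illustrative, not the actual prometheus/tsdb code.
package tombstonesketch

import (
	"os"
	"path/filepath"
)

// writeTombstonesAtomically writes the new tombstones content to a temporary
// file and then renames it over the final path. If the process crashes
// between the write and the rename, only tombstones.tmp is left behind.
func writeTombstonesAtomically(blockDir string, data []byte) error {
	tmp := filepath.Join(blockDir, "tombstones.tmp")
	dst := filepath.Join(blockDir, "tombstones")

	if err := os.WriteFile(tmp, data, 0o666); err != nil {
		return err
	}
	// os.Rename is atomic on POSIX filesystems, but a crash (or an error)
	// before or during this call leaves the block in a partial state.
	return os.Rename(tmp, dst)
}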

kerneljack (Author) commented Feb 6, 2018

I wasn't able to get the logs of the previous pod directly, but I found some of the relevant entries in Stackdriver and I'll paste them here.

I think these are the first errors that were thrown when that API call was made. The quoting might look a bit strange because I'm copy/pasting from Stackdriver:

"panic: runtime error: index out of range

goroutine 3443 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*memSeries).minTime(...)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:1150
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*Head).Delete(0xc420144280, 0x80003883122cdf10, 0x7fffc77cedd324d7, 0xc43a547800, 0x1, 0x1, 0xc44ca23790, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:593 +0x5fc
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).Delete.func2(0x8, 0x1c3bb88)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:681 +0x5e
github.com/prometheus/prometheus/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1(0xc460079a80, 0xc4691d4040)
	/go/src/github.com/prometheus/prometheus/vendor/golang.org/x/sync/errgroup/errgroup.go:58 +0x57

  created by github.com/prometheus/prometheus/vendor/golang.org/x/sync/errgroup.(*Group).Go
	/go/src/github.com/prometheus/prometheus/vendor/golang.org/x/sync/errgroup/errgroup.go:55 +0x66

level=info ts=2018-02-02T16:05:40.968045039Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-02-02T16:05:40.968141984Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-02-02T16:05:40.968194025Z caller=main.go:217 host_details="(Linux 4.4.86+ #1 SMP Tue Nov 21 22:05:47 PST 2017 x86_64 prometheus-server-67d547c877-t5dh5 (none))"
level=info ts=2018-02-02T16:05:40.971042922Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-02-02T16:05:40.97111811Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-02-02T16:05:40.971144013Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-02-02T16:05:41.199404951Z caller=main.go:323 msg="Opening storage failed" err="open block /data/01C4S8VS454P8RKPBCP5NPCVQ5: open /data/01C4S8VS454P8RKPBCP5NPCVQ5/tombstones: no such file or directory"

It looks like it crashed as soon as the API call was made, and when it restarted it couldn't find the tombstones file. I can confirm that when I looked inside the directories at the time, a tombstones.tmp file was indeed present - as if it had been left there mid-stream before the move could complete. I don't remember its size, etc.

So it's possible something went wrong inside the writeTombstoneFile function, but I'm not sure.
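
For reference, the panic trace above points at memSeries.minTime in head.go. Below is a minimal sketch, assuming the failure mode is an unguarded slice access, of how such a method panics with "index out of range" when a series has no chunks; it is not the actual Prometheus source.

// Sketch only: an assumed failure mode for the "index out of range" panic in
// memSeries.minTime, not the real head.go implementation.
package headsketch

type chunk struct {
	minTime, maxTime int64
}

type memSeries struct {
	chunks []chunk
}

// minTime reads chunks[0] without checking the slice length, so it panics
// for a series that currently has no chunks.
func (s *memSeries) minTime() int64 {
	return s.chunks[0].minTime
}

// minTimeSafe is a guarded variant that reports whether any chunks exist.
func (s *memSeries) minTimeSafe() (int64, bool) {
	if len(s.chunks) == 0 {
		return 0, false
	}
	return s.chunks[0].minTime, true
}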

simonpasquier (Member) commented Feb 7, 2018

I have a test reproducing it. Will send a PR soon.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
