panic: runtime error: invalid memory address or nil pointer dereference #2955
Comments
brian-brazil added the dev-2.0 and kind/bug labels on Jul 17, 2017.
gouthamve self-assigned this on Jul 21, 2017.
tomwilkie added and then removed the component/remote storage label on Jul 25, 2017.
@brian-brazil This is coming from staleness; specifically, it is panicking here: https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L536-L538. I can't come up with an explanation for how that could be nil, though. Do you have any pointers?
That panic looks to be from https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L759. I'd say the lsets cache isn't being properly maintained at some point; I don't see anything wrong from a quick look.
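For readers following along, here is a minimal, self-contained Go sketch of the failure mode being described (the names and types are hypothetical, not the actual Prometheus scrape-loop code): the scrape loop keeps a per-target cache of scraped series, and a later pass dereferences the cached entries. If the cache ever holds a key whose entry is nil, that dereference produces exactly this kind of nil pointer panic.

```go
package main

import "fmt"

// cacheEntry stands in for a per-target scrape cache entry:
// a storage reference plus the label set it was created for.
type cacheEntry struct {
	ref    uint64
	labels map[string]string
}

// scrapeCache stands in for the cache the scrape loop consults on each pass.
type scrapeCache struct {
	entries map[string]*cacheEntry // keyed by the scraped metric line
}

func (c *scrapeCache) get(met string) (*cacheEntry, bool) {
	e, ok := c.entries[met]
	return e, ok
}

func main() {
	c := &scrapeCache{entries: map[string]*cacheEntry{
		`up{job="node"}`: nil, // an entry the cache failed to maintain properly
	}}

	if e, ok := c.get(`up{job="node"}`); ok {
		// ok is true, but e is nil: this line panics with
		// "invalid memory address or nil pointer dereference".
		fmt.Println(e.labels)
	}
}
```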
bkupidura commented on Aug 22, 2017:
Similar issue (not sure if it's the same). Nothing special in the config: static targets plus remote write to a remote storage adapter. prometheus, version 2.0.0-beta.2 (branch: HEAD, revision: a52f082)
tudor commented on Aug 29, 2017:
@igorcanadi and I are hitting what looks like the same thing. Relevant code:
tudor commented on Aug 29, 2017:
(also,
johrstrom commented on Sep 1, 2017:
I'm not sure if this matches other users' experience, but I get this every 2 hours like clockwork, at the top of the hour. It's almost surely something to do with the Kubernetes discovery and/or just high load on the server, because it only happens in our large environment and not on the smaller clusters.
igorcanadi commented on Sep 1, 2017:
Our prometheus died 120 times in the last 2 days with this stack trace, so it's a bit more often than every two hours:
aecolley commented on Sep 4, 2017:
I'm getting this every 2-3 minutes. This prometheus (2.0.0-beta.2) is doing nothing but scraping a busier v1.3.0 prometheus on localhost. Here's its config file in its entirety:
The stack trace is practically identical:
The only thing out of the ordinary is that the v1.3.0 prometheus (the federation source for the crashing prometheus) regularly logs "Error on ingesting out-of-order samples". That's expected, because it's federating data from two fully redundant prometheus servers, so some merging of the timeseries is normal. Other than the crashes, beta.2 looks great.
igorcanadi commented on Sep 5, 2017:
@tudor and I traced this down to headAppender.Add() producing duplicate refs for different label sets (see the referenced issue below). Here is a sample of the logging we added:
As you can see, both label sets end up with the same ref. We also looked at where the uniqueness of refs is supposed to be guaranteed.
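To make "duplicate refs for different label sets" concrete, here is a hedged Go sketch (hypothetical names, not the instrumentation actually added above) of the kind of consistency check that trips when an ID generator hands out the same ref for two distinct series:

```go
package main

import "fmt"

// labelsKey is a flattened label set used as a map key for this example.
type labelsKey string

// head mimics an append head that indexes series by a numeric ref.
type head struct {
	byRef map[uint64]labelsKey
}

// add registers a series under a ref and reports a collision if the ref
// is already taken by a *different* label set.
func (h *head) add(ref uint64, ls labelsKey) {
	if prev, ok := h.byRef[ref]; ok && prev != ls {
		fmt.Printf("duplicate ref %d: already used by %q, now requested for %q\n",
			ref, prev, ls)
		return
	}
	h.byRef[ref] = ls
}

func main() {
	h := &head{byRef: map[uint64]labelsKey{}}

	// A buggy ID generator hands out the same ref for two different series.
	h.add(42, `{__name__="up",job="node"}`)
	h.add(42, `{__name__="up",job="cadvisor"}`) // collision is reported here
}
```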
igorcanadi referenced this issue on Sep 6, 2017: headAppender.Add() produces duplicate refs for different label-sets, causing prometheus to crash #137 (closed).
Thanks for the extensive debugging! That sounds like a very probable cause, though I cannot think of a race in that particular case off the top of my head. Regardless, current work in prometheus/tsdb reworks that section entirely, so this bug should disappear. Let's hope it doesn't get replaced with a similarly severe one :)
igorcanadi commented on Sep 8, 2017:
Awesome! When do you think it'll be ready? Let us know if you'd like some beta testers.
aecolley commented on Sep 13, 2017:
Sadly, v2.0.0-beta.3 isn't faring any better:
igorcanadi commented on Sep 13, 2017:
Same here :(
fabxc referenced this issue on Sep 14, 2017: Prometheus 2.0: panic: runtime error: index out of range #3097 (closed).
Okay, so after several people took a look, the cache maintenance seems to be correct. So the issue of duplicate series IDs that you debugged, @igorcanadi, is still a likely reason for this. Before, the ID was generated from the length of known series, which should have been safe but apparently wasn't. Now we have an atomic counter, which is only touched here for ID assignment. I cannot think of anything generating duplicate IDs at this point. It would be great if you could verify that this is in fact what is happening under the new implementation, though, @igorcanadi. (We could just make the panic disappear by adding another check, of course. But I believe this should actually panic, because it shouldn't be possible unless there's a bug.)
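To illustrate the difference, here is a simplified Go sketch (hypothetical names, not the prometheus/tsdb code): deriving the next series ID from the number of known series can repeat an ID as soon as any series is removed, or when two assignments race on the same length, whereas a monotonically increasing atomic counter never hands out the same ID twice.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// lengthBasedID derives the next ID from how many series are known.
// If a series is ever deleted (or two goroutines race on the same length),
// the same ID can be handed out twice.
func lengthBasedID(series map[uint64]string) uint64 {
	return uint64(len(series)) + 1
}

func main() {
	series := map[uint64]string{}

	id1 := lengthBasedID(series)
	series[id1] = `up{job="a"}`

	id2 := lengthBasedID(series)
	series[id2] = `up{job="b"}`

	delete(series, id1) // one series goes away...

	id3 := lengthBasedID(series) // ...and the "new" ID collides with id2
	fmt.Println(id2 == id3)      // true: duplicate ID

	// An atomic counter only ever moves forward, so IDs never repeat,
	// regardless of deletions or concurrent assignment.
	var ctr uint64
	a := atomic.AddUint64(&ctr, 1)
	b := atomic.AddUint64(&ctr, 1)
	fmt.Println(a, b) // 1 2
}
```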
@igorcanadi @aecolley an update on which configurations you are experiencing this with would be neat – just to be sure it's not some path that's only hit when using certain features.
aecolley commented on Sep 14, 2017:
@fabxc my configuration is the same as the one in my comment from 10 days ago.
fabxc added a commit that referenced this issue on Sep 15, 2017.
I made some slight changes that should prevent the panicking and log the various cases that make the cache go bad. It would be really neat if people encountering this could run this version, as I cannot reproduce the issue myself. Container: or build it yourself from the debug-2955 branch. This doesn't solve anything, but it would at least confirm that it is an issue with duplicated IDs, even with the latest changes in prometheus/tsdb.
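As a rough idea of what "prevent the panicking and log the cases that make the cache go bad" can look like, here is a hedged sketch with hypothetical names (not the actual debug-2955 patch): the blind dereference is replaced with a check that logs the inconsistency and drops the entry, so the series is simply re-added on the next scrape.

```go
package main

import "log"

type cacheEntry struct {
	ref    uint64
	labels map[string]string
}

// markStale walks the cached entries and would normally dereference each one.
// Instead of letting a nil entry crash the whole scrape loop, it logs the
// inconsistency and drops the key so the series gets re-added next scrape.
func markStale(entries map[string]*cacheEntry) {
	for met, e := range entries {
		if e == nil || e.labels == nil {
			log.Printf("scrape cache inconsistent for %q, dropping entry", met)
			delete(entries, met)
			continue
		}
		// ... append a staleness marker for e.ref here ...
		_ = e.ref
	}
}

func main() {
	entries := map[string]*cacheEntry{
		`up{job="node"}`:     {ref: 1, labels: map[string]string{"job": "node"}},
		`up{job="cadvisor"}`: nil, // the kind of entry that used to panic
	}
	markStale(entries)
}
```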
igorcanadi commented on Sep 15, 2017:
@fabxc I no longer get the crash with your build, but now prometheus is not responsive on its HTTP port. Log:
I think you are hitting the one other known bug there, which is some form of deadlock on startup. It's not panicking because it's not even scraping :) I will have to investigate that one separately. Can you check whether it starts listening after another restart, or otherwise try with a clean storage directory?
bkupidura commented on Sep 16, 2017:
@fabxc I can also run the "fixed" binary, but instead of a Docker image, can you just share the prometheus binary?
fabxc added a commit that referenced this issue on Sep 18, 2017.
aecolley commented on Sep 18, 2017:
The debug binary doesn't crash. I ran it for 21 minutes and 43 seconds, in which time it wrote 279320 lines to standard error. 236385 of them were of the form
42923 of them were of the form
The remaining 12 lines were the familiar startup/shutdown lines.
Thanks @aecolley, that's extremely helpful. Do I understand correctly that it's the exact same series, just in a different ordering (as exposed by the target)? That's bad, because it literally means the code we are logging from doesn't fully do what its docs specify it should do – at least not correctly. We definitely have to handle targets that expose the same series in an inconsistent order, but we should encourage all client libs etc. to stay consistent. Do you know what kind of target/client lib the metric comes from?
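Since inconsistent label ordering across scrapes is the suspected trigger, here is a hedged Go sketch (hypothetical, not the Prometheus scrape cache) of why keying a cache on the raw exposed line is order-sensitive, and how sorting the labels first makes both orderings of the same series map to one canonical key:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalKey builds an order-independent cache key from label pairs by
// sorting them before joining, so the same series always maps to one entry
// no matter which order the target exposed its labels in.
func canonicalKey(pairs []string) string {
	sorted := append([]string(nil), pairs...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

func main() {
	// The same series, exposed with its labels in two different orders.
	first := []string{`code="200"`, `handler="/"`}
	second := []string{`handler="/"`, `code="200"`}

	// Keying on the raw exposed order treats them as two different series.
	fmt.Println(strings.Join(first, ",") == strings.Join(second, ",")) // false

	// Sorting first yields one canonical key for both orderings.
	fmt.Println(canonicalKey(first) == canonicalKey(second)) // true
}
```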
aecolley commented on Sep 19, 2017:
Yes, the labels are in a different order. Here's a typical log line:
I grepped the same log file for the other lines corresponding to the same timeseries:
fabxc added a commit that referenced this issue on Sep 19, 2017.
I think I've got this fixed – essentially just getting rid of some complexity we no longer need. The changes so far should also make us a bit more resilient against misbehaving targets like cAdvisor and – in my benchmark run so far – save another ~10% of memory for a scraping-only Prometheus server. I will file a PR once everything looks good. This also contains other important fixes we've made since beta.4.
fabxc referenced this issue on Sep 19, 2017: Fix cache maintenance on changing metric representations #3192 (merged).
aecolley commented on Sep 19, 2017:
After only 15 minutes of running the new build (
Awesome, thanks everyone for helping with debugging!
igorcanadi commented on Sep 19, 2017:
I'm also deploying this build to our clusters, will report back in 24h!
aecolley commented on Sep 20, 2017:
24h later, still no crashes. Here's what the log looks like:
igorcanadi commented on Sep 20, 2017:
No crashes for us either. Great stuff, thanks @fabxc!
Preemptively closing here. All issues listed here were addressed in separate PRs and should be fixed in beta.5.
fabxc closed this on Sep 22, 2017.
iofxl commented on Jul 17, 2017 (original report):
What did you do?
What did you expect to see?
What did you see instead? Under which circumstances?
I see the panic in the title.
Environment
System information:
Linux 2.6.32-358.el6.x86_64 x86_64
Prometheus version:
2.0.0-beta.0, official build.
Alertmanager version:
insert output of `alertmanager -version` here (if relevant to the issue)
Prometheus configuration file: