panic: runtime error: integer divide by zero #2953
Comments
brian-brazil added component/local storage, kind/bug labels Jul 16, 2017
beorn7 self-assigned this Jul 17, 2017
That's caused by data corruption. I can certainly introduce a guard against it.
That guard is already in place. That panic is truly a "must never happen" thing. There must be something weirder going on here.
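For readers following along: the panic in the issue title comes from an integer division whose denominator is unexpectedly zero. A guard of the kind discussed here might look like the following minimal Go sketch; the names are hypothetical and this is not the actual Prometheus v1.x storage code:

```go
// Minimal sketch of a divide-by-zero guard; names are illustrative,
// not the actual Prometheus storage code.
package main

import (
	"errors"
	"fmt"
)

var errCorruptChunk = errors.New("corrupt chunk: sample count is zero")

// avgSampleSize returns an error for a zero denominator instead of
// letting the division panic with "integer divide by zero".
func avgSampleSize(totalBytes, numSamples int) (int, error) {
	if numSamples == 0 {
		return 0, errCorruptChunk
	}
	return totalBytes / numSamples, nil
}

func main() {
	if _, err := avgSampleSize(1024, 0); err != nil {
		fmt.Println("guard triggered:", err)
	}
}
```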
beorn7 added a commit that referenced this issue Jul 17, 2017
@Cplo This is odd, as it appears to happen in your case in a reproducible way (so it doesn't seem to be random data corruption), but it isn't seen anywhere else (or at least not frequently enough that we would see other reports). I have created https://github.com/prometheus/prometheus/tree/beorn7/storage with a guard that quarantines the series in the case you encountered above, together with an error message. Could you build a binary from that branch and run it under the same conditions as above? When the error occurs, you will get a message like the following instead of a panic:

If you could then post that message here, that would be great.
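To illustrate the quarantine idea described above, here is a hedged sketch under assumed names; the real implementation lives on the linked beorn7/storage branch:

```go
// Sketch of the quarantine pattern: a broken series is sidelined and
// logged so the server keeps running, rather than crashing. Names are
// hypothetical.
package main

import (
	"errors"
	"log"
)

// store tracks series found to be broken so subsequent reads can skip
// them and a later cleanup pass can remove them.
type store struct {
	quarantined map[string]error
}

func (s *store) quarantine(seriesID string, cause error) {
	if s.quarantined == nil {
		s.quarantined = make(map[string]error)
	}
	s.quarantined[seriesID] = cause
	log.Printf("quarantining series %q: %v", seriesID, cause)
}

func main() {
	var s store
	s.quarantine(`up{instance="a"}`, errors.New("corrupt chunk detected"))
}
```

The design point is that corruption confined to one series costs only that series, not the whole server.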
@beorn7 Very pleased to receive your reply. I will build the binary from the beorn7/storage branch. If Prometheus crashes again, I will post the related logs here.
Hi @beorn7, the error occurred again. The related logs:
Now you've tickled the same bug in a different way. Perhaps that will actually give us a hint about what's going on. I'll investigate. Maybe I have to give you yet another debug binary. We'll see... :o/
beorn7 added a commit that referenced this issue Jul 25, 2017
beorn7 added a commit that referenced this issue Jul 25, 2017
@Cplo I have pushed another commit to https://github.com/prometheus/prometheus/tree/beorn7/storage . Could you build from there again and try it out? This time, the server will panic again when it encounters the problem, but it will dump the chunk as part of the error message. That way, we'll at least get an idea whether it is a chunk with corrupted data or a zero'd chunk that somehow snuck in.
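For illustration, dumping the chunk bytes in the panic message can be done with the standard library; a minimal sketch (not the code on the beorn7/storage branch):

```go
// Sketch of including a hex dump of a chunk in a panic message, so the
// reporter can paste the bytes back into the issue. Illustrative only.
package main

import (
	"encoding/hex"
	"fmt"
)

func main() {
	chunk := make([]byte, 32) // stands in for the suspicious chunk
	// The dump lets us see whether the bytes are garbage or all zeros.
	panic(fmt.Sprintf("unexpected chunk encountered, dump follows:\n%s", hex.Dump(chunk)))
}
```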
@beorn7 ok
@beorn7 The latest crash logs:
Thanks a lot. We got a smoking gun, but without the gunner. Or in other words: thanks to your dump above, we know it's an uninitialized chunk rather than a chunk with corrupted data. That's valuable information. I have no clue at the moment how an uninitialized chunk can slip in. It might be something very specific to your setup, as we haven't got any other report of this kind of crash. I'll take another stab at staring at the code at my next convenience. In the meantime, please let me know if there is anything special about your setup. This could be weird stuff like the problem only occurring on one special machine or one type of machine.
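For context, the distinction drawn above (zero'd versus corrupted-but-nonzero bytes) is mechanical to check; a hedged sketch of such a test:

```go
// Sketch of telling a zero'd (uninitialized) chunk apart from one
// holding corrupted-but-nonzero data; purely illustrative.
package main

import (
	"bytes"
	"fmt"
)

// isAllZero reports whether every byte in the chunk is zero.
func isAllZero(chunk []byte) bool {
	return bytes.Count(chunk, []byte{0}) == len(chunk)
}

func main() {
	fmt.Println(isAllZero(make([]byte, 16))) // true: uninitialized
	fmt.Println(isAllZero([]byte{0, 7, 0}))  // false: nonzero corruption
}
```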
Sooo… I did some more thorough code analysis using all my Go Guru skills (slowly becoming a Go Guru guru ;). I couldn't find any entry point that would let a completely null'd chunk slip in. @Cplo Did you think about my previous question, i.e. "Please let me know if there is anything special about your setup. This could be weird stuff like the problem only occurring on one special machine or one type of machine."? I'm at a point where I start suspecting faulty hardware. Of course, it would be helpful if anybody else in the world saw the same problem.
brian-brazil added kind/more-info-needed, priority/P3 labels Aug 21, 2017
Closing this due to the release of 2.0: there is only a single report of this panic, still no idea how it can happen, and no response for 3 months.
grobie closed this Nov 12, 2017
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Cplo commented Jul 16, 2017
What did you do?
Ran Prometheus on k8s.
What did you expect to see?
No more crashes.
What did you see instead? Under which circumstances?
The panic occurs in both k8s clusters after Prometheus has been running for a few days.
Environment
System information:
uname -srm
Linux 4.4.64-1.el7.elrepo.x86_64 x86_64
Prometheus version:
prometheus, version 1.7.1 (branch: master, revision: 3afb3ff)
build user: root@0aa1b7fc430d
build date: 20170612-11:44:05
go version: go1.8.3
Alertmanager version:
N/A
Prometheus configuration file:
Alertmanager configuration file:
N/A
Logs: