Fatal Error: concurrent map read and map write #1786
Comments
brian-brazil added the kind/bug label on Jul 4, 2016
Thanks for reporting! Does it always crash with the same error ("concurrent map read and map write")? This seems very odd. Each query gets its own copy of the […]
Does it always crash in […]?
The only possible source, to me, seems to be that we modify the underlying […]
Right - that's why I'm suspecting either general memory corruption or a Go bug. If the error message is always the same, it's unlikely to be the former, but could still be the latter.
FWIW, I cannot imagine this being a Go bug. Maps are heavily used and the panic on unlocked concurrent access has been around for a while now.
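For readers who haven't hit it before, the error in the issue title comes from the Go runtime's built-in concurrent-map check. Here is a minimal, self-contained sketch (generic Go, not Prometheus code) that reproduces it:

```go
// Minimal reproduction of Go's concurrent-map check: one goroutine writes a
// plain map while another reads it, with no synchronization. The runtime
// eventually detects this and aborts the whole process with
// "fatal error: concurrent map read and map write".
// Illustrative only; this is not Prometheus code.
package main

func main() {
	m := map[string]int{"x": 0}

	// Writer: mutates the shared map without any locking.
	go func() {
		for i := 0; ; i++ {
			m["x"] = i
		}
	}()

	// Reader: races with the writer from the main goroutine until the
	// runtime check fires and kills the process.
	for {
		_ = m["x"]
	}
}
```

The usual fixes are to guard every access with a sync.RWMutex (RLock around reads, Lock around writes) or to hand out copies of the map so readers never share the writer's instance, which is the pattern the discussion below is probing.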
Yeah, that'd be really unlikely too. Let's see what @prfalken says about the error messages.
But we only make a copy of each metric on returning the result, or if we have to modify the metric during querying. Otherwise, during querying we directly reference the map from the storage, if I recall/checked correctly.
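To make the copy-vs-reference distinction concrete, here is a hedged sketch of that pattern; Metric, Series, and metricForQuery are hypothetical stand-ins, not the actual Prometheus storage types:

```go
// Hedged sketch of the copy-on-return / copy-on-modify pattern described
// above. Metric and Series are hypothetical stand-ins, not the real
// Prometheus storage types.
package sketch

// Metric is a label set stored as a plain map.
type Metric map[string]string

// clone returns an independent copy that the caller may mutate freely.
func (m Metric) clone() Metric {
	c := make(Metric, len(m))
	for k, v := range m {
		c[k] = v
	}
	return c
}

// Series owns the stored metric map.
type Series struct {
	metric Metric
}

// metricForQuery hands the metric to query evaluation. If the query will
// modify it, it gets a private copy; otherwise it gets a direct reference
// to the shared map, which is only safe while nothing writes to that map
// concurrently. A "concurrent map read and map write" crash would indicate
// that this invariant is broken somewhere.
func (s *Series) metricForQuery(willModify bool) Metric {
	if willModify {
		return s.metric.clone()
	}
	return s.metric
}
```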
I'm digging through the logs. Crash recoveries on 1.2M metrics make them difficult to find :)
Another crash today, totally different trace: […]
@beorn7 ^
In theory, this could be a result of a crash recovery gone wild. @prfalken, could you check more crashes? If each shows you a different trace, the odds are you are having a corrupt binary, broken hardware, or wildly corrupted storage data. (Among those, arguably the latter should not cause crashes.)
beorn7 referenced this issue on Jul 5, 2016: storage: memorySeries.headChunkClosed not set correctly? #1790 (closed)
I've filed #1790 to investigate the assumed crash recovery issue.
Yeah, have you tried running the same Prometheus server on different hardware? Also, try the latest version, 0.20.0.
New crash today: […]
I'm going to upgrade to 0.20.
These are 3 very different crashes, and none of them have been observed anywhere before, even on high-load servers. I wouldn't be surprised if something outside of our code and the Go compiler is broken here.
I found a plausible reason for the 2nd stack trace reported above. See #1798.
So the first crash could have caused the 2nd one via broken crash recovery. Still doesn't explain the 3rd.
Now Prometheus finishes its crash recovery after more than half an hour and then crashes again, over and over: […]
Is there any way to fix this without trashing the whole data directory?
It looks like your data is corrupted beyond hope. The above can only happen if the string length read from the checkpoint is negative. Arguably, we should not crash in that case. (I filed #1800 about it.) However, with this kind of corruption, there is little hope that anything useful will come out of a crash-free decoding. (As well as giving you a negative number for the string length, it could have given you a string length of one trillion or something, and you would simply run out of memory. Also, after reading a wrong string length, all the other data is misaligned, and you will essentially read noise from the checkpoint.)
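To illustrate that failure mode, here is a hedged sketch of length-prefixed string decoding (generic Go; the function name and the maxStringLen bound are assumptions, not the actual Prometheus checkpoint code). A bounds check turns a negative or absurd length into a clean error instead of a crash, but as noted above, the remaining bytes are misaligned either way:

```go
// Hedged sketch of decoding a length-prefixed string from a checkpoint-like
// stream. Not the actual Prometheus code; names and the sanity bound are
// assumptions for illustration.
package sketch

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
)

// maxStringLen is an arbitrary sanity bound for this sketch.
const maxStringLen = 1 << 20

// readLengthPrefixedString reads a signed varint length and then that many
// bytes. A corrupted prefix can come back negative (which would otherwise
// crash the slice allocation) or absurdly large (which would exhaust
// memory); validating it first yields an error instead of a crash. Even so,
// once the length is wrong, everything after it in the stream is noise.
func readLengthPrefixedString(r *bufio.Reader) (string, error) {
	n, err := binary.ReadVarint(r)
	if err != nil {
		return "", err
	}
	if n < 0 || n > maxStringLen {
		return "", fmt.Errorf("invalid string length %d in checkpoint", n)
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}
```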
@prfalken, did this problem reoccur on a server that was not most likely suffering from data corruption?
Ran a new Prometheus server on older hardware from another manufacturer. Prometheus works like a charm and never crashes. Looks like our Dell servers have hardware issues, and Prometheus is not the only thing to crash. How disappointing. Sorry for wasting your time. Hope I could help discover new errors to catch.
Thanks! Sorry about your servers...
juliusv closed this on Aug 16, 2016
lock bot commented on Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
prfalken commented on Jul 4, 2016
Running Prometheus for a while, 1.2M metrics.
The server randomly crashes with a huge trace. This occurs about every 10 hours or so.
Note: this is a federation server, but the crash occurred the same way when there was a single server scraping everything.
The process runs in an RKT container on CoreOS (Linux 4.5.0-coreos-r1 x86_64).
prometheus, version 0.19.2 (branch: master, revision: 23ca13c)
build user: root@134dc6bbc274
build date: 20160529-18:58:00
go version: go1.6.2