Restart-corruption-restart 30-minute OOM cycle in 2.2.1 #4018
Comments
Why was Prometheus sent a SIGKILL? Can you indicate how many samples/s you're ingesting, and roughly how much CPU power that machine has?
@brian-brazil I don't know why it was killed forcibly. I am using a default systemd unit file, with:
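As a rough illustration of what such a unit typically contains (hypothetical paths and flags, not the reporter's actual unit):

```ini
# Illustrative sketch of a typical Prometheus systemd unit
[Unit]
Description=Prometheus
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=15d
Restart=on-failure

[Install]
WantedBy=multi-user.target
```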
@fabxc Sounds like this might be prometheus/tsdb#21 again. The SIGKILL is likely something else on your system interfering.
The machine is overpowered, over 30 CPUs and over 150 GB RAM. There is no I/O or CPU or memory overload that I can see.
Hmm, yep, this is after the TSDB started and is trying to compact a block. Are you sure there is no memory overload or an external process killing Prometheus? It's just ultra weird that it is dying with no crash message. This line is appearing in both logs, making me curious whether another process is killing Prometheus:
That's quite a busy Prometheus at 1.3M samples/s. That'd be a ~10GB WAL worst case.
@gouthamve I found the restarter: it was the OOM killer... I thought it was a normal configuration reload and/or a Salt+systemd misfire.
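For anyone confirming the same thing: the kernel logs an entry each time the OOM killer fires, so a quick check is to grep the kernel log (generic sketch; exact message wording varies by kernel):

```sh
# Look for OOM-killer activity in the kernel log (either works on most distros):
dmesg -T | grep -i -E 'out of memory|killed process'
journalctl -k --since "2 days ago" | grep -i -E 'out of memory|killed process'
# A hit looks roughly like:
#   Out of memory: Kill process 12345 (prometheus) score 987 or sacrifice child
#   Killed process 12345 (prometheus) total-vm:... anon-rss:... file-rss:...
```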
In normal operation I saw Prometheus stable below 150 GB of memory before 2.2.1. That's not bad on a 250 GB system. With 2.2.1 it looks like it has started to grow beyond that, and it's getting OOM-killed when anon-rss hits 250 GB. So I have the option of splitting up Prometheus to alleviate the memory issues. I'm not sure if that's sufficient to close this issue, or if you think an OOM-kill situation should be handled better.
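On splitting up Prometheus: one common way to do it without hand-partitioning targets is hashmod sharding in relabel_configs, where each instance keeps only its own shard of the discovered targets. A minimal sketch (job name and modulus are placeholders, not from this report):

```yaml
scrape_configs:
  - job_name: my_job            # placeholder job
    # ... service discovery config ...
    relabel_configs:
      # Hash each target's address into one of 2 shards.
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      # This instance keeps shard 0; the second instance runs with regex: 1.
      - source_labels: [__tmp_hash]
        regex: 0
        action: keep
```

Each shard then runs as an otherwise identical Prometheus with its own storage, so per-instance memory scales with its share of the targets.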
tzz changed the title from "Restart-corruption-restart 30-minute cycle in 2.2.1" to "Restart-corruption-restart 30-minute OOM cycle in 2.2.1" on Mar 28, 2018.
It's likely that Prometheus is not able to recover because it is reading the WAL back in too fast. @fabxc Any chance we can optimise (make it slower, really) the WAL reading?
Reading the WAL shouldn't in principle take more memory than normal ingestion.
Over the last few days I dug into this more. I found one nasty issue: the NOFILE limit in the systemd unit was being ignored. This was not obvious because I had a comment line that ended with a continuation character, so the comment was continued onto the next line and the limit was ignored silently. I noticed it accidentally in the Prometheus logs (the limit it logged was 4K or something like that) and fixed it 2 days ago.

I also set the retention to 24h, cut down the number of metrics, and lowered the scrape frequency, following advice in the IRC channel (thanks @brian-brazil, @SuperQ and others). Since then, memory usage has been stable; I'll keep watching it and raising the retention. So it's possible that the NOFILE limit was the cause of this problem. I'll report back after a few more days.
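To spell out the pitfall, since it is easy to hit: in systemd unit files a trailing backslash continues the line, and (depending on systemd version) that applies to comment lines too, so a directive on the next line can be silently swallowed. Illustrative sketch, not the reporter's exact unit:

```ini
[Service]
# Raise the file-descriptor limit for Prometheus \
LimitNOFILE=65536
# With the stray backslash above, LimitNOFILE becomes part of the comment and is
# silently ignored, so Prometheus runs with the default limit (the ~4k value that
# shows up in its startup log). Dropping the backslash makes the directive take effect.
```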
I think this has been resolved for us. The file limit issue was the only thing notable. With that, plus collecting fewer metrics less frequently, we seem to have stabilized. Thanks to everyone for the assistance.
tzz closed this on Apr 5, 2018.
lock bot commented on Mar 22, 2019: This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Original issue description (tzz commented on Mar 27, 2018, edited):
What did you do?
Run Prometheus normally.
What did you expect to see?
Normal behavior.
What did you see instead? Under which circumstances?
Prometheus grows its RSS memory wildly. The OOM killer visits and kills it.
I have been getting Prometheus TSDB lockups periodically, about once every 2-5 days, since 2.2.1 (I was getting daily lockups with 2.0.x due to the index reader overflow bug).
The symptom, paraphrasing the logs below, is: the prom service gets killed by the kernel on OOM. There are no errors, just "Start listening for connections" and "Starting TSDB".
25-35 minutes later comes "WAL corruption detected; truncating", followed immediately by "unknown series references in WAL samples", and 8 seconds later "TSDB started", "Loading configuration file", and "Server is ready to receive web requests."
So the TSDB startup is taking a very long time and seems to crash out due to WAL corruption. Once it happens, I have no recourse but to wipe the data.
I can't trigger this behavior explicitly. I also can't share the data files because they contain proprietary information. I'd like to be able to inject metrics into Prometheus quickly to try to trigger this bug, or a way to anonymize the metrics so I can share the originals.
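On generating load to reproduce this: a throwaway scrape target that exposes a large number of synthetic series is one way to push samples in quickly without touching real data. A rough sketch using the Python prometheus_client library (series counts and port are arbitrary):

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Gauge, start_http_server

# One metric with many label combinations => many series per scrape.
SYNTH = Gauge("synthetic_metric", "Synthetic series for load testing",
              ["shard", "instance_id"])

if __name__ == "__main__":
    start_http_server(9200)  # scrape this target from a test Prometheus
    while True:
        for shard in range(50):
            for inst in range(200):  # 50 * 200 = 10k series
                SYNTH.labels(shard=str(shard),
                             instance_id=str(inst)).set(random.random())
        time.sleep(5)
```

Pointing a test Prometheus at a handful of these (or raising the label counts) approximates a high ingest rate without exposing proprietary metrics.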
Environment
System information:
Linux 4.4.70 x86_64
Prometheus version:
2.2.1
Alertmanager version: N/A
Prometheus configuration file:
Alertmanager configuration file: N/A
Logs:
I've changed the hostname and domain. The below sequence has happened twice, once starting on March 22 as you see below, and once starting on March 26 with the exact same behavior and logs.
Stage 1: a normal restart causes corruption.
Stage 2: the server is restarted and dies again. This repeats every 30 minutes or so.
Stage 3: only restarts are noted; they repeat with the same corruption message until I wipe the data.