Restart-corruption-restart 30-minute OOM cycle in 2.2.1 #4018

Closed
tzz opened this issue Mar 27, 2018 · 14 comments · 3 participants

@tzz commented Mar 27, 2018

What did you do?

Run Prometheus normally.

What did you expect to see?

Normal behavior.

What did you see instead? Under which circumstances?

Prometheus grows RSS memory wildly. The OOM killer visits and kills it.

I have Prometheus TSDB lockups periodically, about once every 2-5 days, since 2.2.1 (I was having daily lockups with 2.0.x due to the index reader overflow bug).

The symptom, paraphrasing the logs below: the prometheus service gets killed by the kernel OOM killer. It restarts with no errors, logging "Start listening for connections" and "Starting TSDB".

25-35 minutes later comes "WAL corruption detected; truncating", followed immediately by "unknown series references in WAL samples", and about 8 seconds after that "TSDB started", "Loading configuration file", and "Server is ready to receive web requests."

So TSDB startup is taking a very long time and hits WAL corruption along the way. Once this cycle starts I have no recourse but to wipe the data.

I can't trigger this behavior explicitly, and I can't share the data files because they contain proprietary information. I'd like either a way to inject metrics into Prometheus quickly to try to trigger this bug, or a way to anonymize the metrics so I can share the originals.

Environment

  • System information:

    Linux 4.4.70 x86_64

  • Prometheus version:

    2.2.1

  • Alertmanager version: N/A

  • Prometheus configuration file:

One alertmanager, inventory loaded from static JSON file.
Scrape and evaluation intervals = 15s
  • Alertmanager configuration file: N/A

  • Logs:

I've changed the hostname and domain. The sequence below has happened twice: once starting on March 22, as shown, and once starting on March 26 with exactly the same behavior and logs.

Stage 1: normal restart causes corruption

Mar 22 12:04:10 prometheus-server.nodomain prometheus[88230]: level=info ts=2018-03-22T16:04:10.657452001Z caller=main.go:596 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Mar 22 12:04:10 prometheus-server.nodomain systemd[1]: Reloaded The Prometheus 2 monitoring system and time series database..
Mar 22 12:51:37 prometheus-server.nodomain systemd[1]: prometheus.service: main process exited, code=killed, status=9/KILL
Mar 22 12:51:37 prometheus-server.nodomain systemd[1]: Unit prometheus.service entered failed state.
Mar 22 12:51:37 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 12:51:38 prometheus-server.nodomain systemd[1]: prometheus.service holdoff time over, scheduling restart.
Mar 22 12:51:38 prometheus-server.nodomain systemd[1]: Started The Prometheus 2 monitoring system and time series database..
Mar 22 12:51:38 prometheus-server.nodomain systemd[1]: Starting The Prometheus 2 monitoring system and time series database....
Mar 22 12:51:38 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T16:51:38.387687723Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.0-rc.1, branch=HEAD, revision=cd5e2fe68741a60ba90b80fbc161d0581be9e8bd)"
Mar 22 12:51:38 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T16:51:38.387758677Z caller=main.go:221 build_context="(go=go1.9.2, user=root@7b1d536fcfb9, date=20180221-16:31:11)"
Mar 22 12:51:38 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T16:51:38.387775937Z caller=main.go:222 host_details="(Linux 4.4.70-1.el7.elrepo.x86_64 #1 SMP Fri May 26 07:43:23 EDT 2017 x86_64 prometheus-server.nodomain prod.nodomain)"
Mar 22 12:51:38 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T16:51:38.387789977Z caller=main.go:223 fd_limits="(soft=500000, hard=500000)"
Mar 22 12:51:38 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T16:51:38.391583609Z caller=web.go:384 component=web msg="Start listening for connections" address=0.0.0.0:9090
Mar 22 12:51:38 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T16:51:38.391567047Z caller=main.go:510 msg="Starting TSDB ..."
Mar 22 13:23:55 prometheus-server.nodomain prometheus[63357]: level=error ts=2018-03-22T17:23:55.84761499Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum 48c5d6b, want 0" file=/export/data/prometheus/wal/037530 pos=90219722
Mar 22 13:23:57 prometheus-server.nodomain prometheus[63357]: level=warn ts=2018-03-22T17:23:57.202336531Z caller=head.go:320 component=tsdb msg="unknown series references in WAL samples" count=7390
Mar 22 13:24:10 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T17:24:10.766418605Z caller=main.go:520 msg="TSDB started"
Mar 22 13:24:10 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T17:24:10.766815555Z caller=main.go:596 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Mar 22 13:24:10 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T17:24:10.819765288Z caller=main.go:497 msg="Server is ready to receive web requests."
Mar 22 13:24:15 prometheus-server.nodomain prometheus[63357]: level=info ts=2018-03-22T17:24:15.904810795Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1521727200000 maxt=1521734400000

Stage 2: the server is restarted and dies again. This repeats every 30 minutes or so.

Mar 22 14:02:26 prometheus-server.nodomain systemd[1]: prometheus.service: main process exited, code=killed, status=9/KILL
Mar 22 14:02:26 prometheus-server.nodomain systemd[1]: Unit prometheus.service entered failed state.
Mar 22 14:02:26 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 14:02:26 prometheus-server.nodomain systemd[1]: prometheus.service holdoff time over, scheduling restart.
Mar 22 14:02:26 prometheus-server.nodomain systemd[1]: Started The Prometheus 2 monitoring system and time series database..
Mar 22 14:02:26 prometheus-server.nodomain systemd[1]: Starting The Prometheus 2 monitoring system and time series database....
Mar 22 14:02:26 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:02:26.94263017Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.0-rc.1, branch=HEAD, revision=cd5e2fe68741a60ba90b80fbc161d0581be9e8bd)"
Mar 22 14:02:26 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:02:26.942714917Z caller=main.go:221 build_context="(go=go1.9.2, user=root@7b1d536fcfb9, date=20180221-16:31:11)"
Mar 22 14:02:26 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:02:26.94274427Z caller=main.go:222 host_details="(Linux 4.4.70-1.el7.elrepo.x86_64 #1 SMP Fri May 26 07:43:23 EDT 2017 x86_64 prometheus-server.nodomain prod.nodomain)"
Mar 22 14:02:26 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:02:26.942768677Z caller=main.go:223 fd_limits="(soft=500000, hard=500000)"
Mar 22 14:02:26 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:02:26.94600165Z caller=web.go:384 component=web msg="Start listening for connections" address=0.0.0.0:9090
Mar 22 14:02:26 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:02:26.945991007Z caller=main.go:510 msg="Starting TSDB ..."
Mar 22 14:35:24 prometheus-server.nodomain prometheus[152254]: level=warn ts=2018-03-22T18:35:24.243877755Z caller=head.go:320 component=tsdb msg="unknown series references in WAL samples" count=7390
Mar 22 14:35:37 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:35:37.96359858Z caller=main.go:520 msg="TSDB started"
Mar 22 14:35:37 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:35:37.964784428Z caller=main.go:596 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Mar 22 14:35:38 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:35:38.113670684Z caller=main.go:497 msg="Server is ready to receive web requests."
Mar 22 14:35:43 prometheus-server.nodomain prometheus[152254]: level=info ts=2018-03-22T18:35:43.620707604Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1521727200000 maxt=1521734400000

Stage 3: only the restarts are shown here; they repeat, with the same corruption message, until I wipe the data.

Mar 22 14:37:01 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 15:14:23 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 15:50:06 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 16:25:30 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 16:59:11 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 17:35:13 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 18:09:57 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 18:44:00 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 19:20:13 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 19:55:42 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 20:30:02 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 21:06:10 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 21:45:15 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 22:22:56 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 22:56:25 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 22 23:32:44 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 00:08:01 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 00:43:25 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 01:24:21 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 02:02:11 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 02:37:22 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 03:12:11 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 03:47:29 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 04:23:21 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 04:58:49 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 05:33:39 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 06:08:25 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 06:45:25 prometheus-server.nodomain systemd[1]: prometheus.service failed.
Mar 23 07:07:43 prometheus-server.nodomain systemd[1]: prometheus.service failed.
@brian-brazil (Member) commented Mar 27, 2018

Why was Prometheus sent a SIGKILL?

Can you indicate how many samples/s you're ingesting, and roughly how much CPU power that machine has?

@tzz (Author) commented Mar 27, 2018

@brian-brazil I don't know why it was killed forcibly. I am using a default systemd unit file, with ExecReload specified to send a HUP. I have Salt set to reload, not restart, the service on a configuration change. It may have been another administrator or process. Below are the systemd unit and the sysconfig.

[Unit]
Description=The Prometheus 2 monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target

[Service]
EnvironmentFile=-/etc/sysconfig/prometheus
User=svc_prometheus
ExecStart=/usr/local/prometheus-2.2.1.linux-amd64/prometheus \
          $PROMETHEUS_OPTS
LimitNOFILE=500000
LimitNPROC=500000
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

/etc/sysconfig/prometheus:

PROMETHEUS_OPTS='--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/export/data/prometheus --storage.tsdb.retention=15d --web.external-url prometheus-server.nodomain:9090 --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --web.enable-lifecycle --web.enable-admin-api'
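
For completeness, a couple of checks (illustrative for this setup, not part of the original report) that show what systemd actually applies for the unit, and the difference between a reload and a restart:

systemctl cat prometheus.service                                # the unit file as systemd sees it, including drop-ins
systemctl show prometheus.service -p LimitNOFILE -p ExecReload -p Restart
systemctl reload prometheus.service                             # sends SIGHUP via ExecReload (configuration reload only)
systemctl restart prometheus.service                            # stops the process and starts a fresh one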
@brian-brazil (Member) commented Mar 27, 2018

@fabxc Sounds like this might be prometheus/tsdb#21 again.

The SIGKILL is likely something else on your system interfering.

@tzz (Author) commented Mar 27, 2018

The machine is overpowered, over 30 CPUs and over 150 GB RAM. There is no I/O or CPU or memory overload that I can see.
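
For reference, spot checks like the following (illustrative commands, assuming the process name from the unit above) are what such a statement can be based on:

ps -o pid,rss,vsz,cmd -p "$(pidof prometheus)"   # resident and virtual memory of the running server
free -h                                          # overall memory pressure on the host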

@tzz (Author) commented Mar 27, 2018

rate(prometheus_tsdb_head_samples_appended_total[1h]) = 1353272.8825662483

@gouthamve (Member) commented Mar 27, 2018

Hmm, yep, this is after the TSDB started and it is trying to compact a block. Are you sure there is no memory overload or an external process killing Prometheus?

It's just ultra weird that it is dying with no crash message. This line appears in both logs, which makes me wonder whether another process is killing Prometheus:
Mar 22 14:02:26 prometheus-server.nodomain systemd[1]: prometheus.service: main process exited, code=killed, status=9/KILL
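
One way to confirm whether the kernel OOM killer is responsible (illustrative commands, not from the original report; the kernel logs a "Killed process ... anon-rss:..." line when it fires):

dmesg -T | grep -iE 'out of memory|oom'
journalctl -k --since "2018-03-22" | grep -i 'killed process'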

@brian-brazil (Member) commented Mar 27, 2018

rate(prometheus_tsdb_head_samples_appended_total[1h]) = 1353272.8825662483

That's quite a busy Prometheus at 1.3M samples/s. That'd be a ~10GB WAL worst case.
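
As a rough cross-check of that estimate, the actual on-disk WAL size can be measured directly (path taken from the logs above):

du -sh /export/data/prometheus/wal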

@tzz (Author) commented Mar 27, 2018

@gouthamve I found the restarter: it was the OOM killer... I had thought it was a normal configuration reload and/or a Salt+systemd misfire.

@tzz (Author) commented Mar 27, 2018

I found the restarter, it was the OOM killer

In normal operation I saw Prometheus stable below 150GB memory before 2.2.1. That's not bad on a 250GB system. With 2.2.1 it looks like it has started to grow beyond that and it's getting OOM killed when anon-rss hits 250GB.

So I have the option of splitting up Prometheus to alleviate the memory issues. I'm not sure if that's sufficient to close this issue, or if you think an OOM-kill situation should be handled better.

tzz changed the title from "Restart-corruption-restart 30-minute cycle in 2.2.1" to "Restart-corruption-restart 30-minute OOM cycle in 2.2.1" on Mar 28, 2018

@gouthamve (Member) commented Mar 30, 2018

It's likely that Prometheus is not able to recover because it reads the WAL back in too fast.

@fabxc Any chance we can optimise (make it slower, really) the WAL reading?

@brian-brazil (Member) commented Mar 30, 2018

Reading the WAL shouldn't in principle take more memory than normal ingestion.

@tzz (Author) commented Mar 31, 2018

Over the last few days I dug into this more. I found one nasty issue: the NOFILE limit in the systemd unit was being ignored. This was not obvious because I had

...other things
# commented out old option \
LimitNOFILE=500000

so the trailing backslash continued the comment line onto the next one, and the LimitNOFILE setting was silently ignored. I noticed it by accident in the Prometheus logs (the limit it logged was 4K or so) and fixed it 2 days ago. I also set the retention to 24h, cut down the number of metrics, and lowered the scrape frequency, following advice in the IRC channel (thanks @brian-brazil, @SuperQ and others).

Since then, the memory usage has been stable and I'll keep watching it and raising the retention. So it's possible that the NOFILE limit was the cause of this problem. I'll report back after a few more days.
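
For anyone hitting the same thing, a few ways (illustrative, assuming the unit name used above) to verify which limit is actually in effect:

journalctl -u prometheus.service | grep fd_limits             # Prometheus logs its own limits at startup
systemctl show prometheus.service -p LimitNOFILE              # what systemd parsed from the unit file
grep 'Max open files' /proc/"$(pgrep -o prometheus)"/limits   # what the running process actually got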

@tzz (Author) commented Apr 5, 2018

I think this has been resolved for us.

The file limit issue was the only notable thing. With that fixed, plus collecting fewer metrics less frequently, we seem to have stabilized.

Thanks to everyone for the assistance.

tzz closed this Apr 5, 2018

@lock (lock bot) commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
