Apparent memory leak & WAL file accumulation after unclean shutdown #4842

Closed
dswarbrick opened this Issue Nov 8, 2018 · 45 comments

dswarbrick commented Nov 8, 2018

Bug Report

What did you do?
Reboot host running Prometheus.

What did you expect to see?
Normal operation.

What did you see instead? Under which circumstances?
Apparent memory leak and WAL file accumulation.

I suspect that we had a few unclean shutdowns, and after the crash recovery (which incidentally eats a LOT of memory), things seemed to be back to normal, until the first WAL checkpoint.

We run retention=30d, min-block-duration=1h, max-block-duration=24h. The WAL checkpoint appeared to become somehow deadlocked. Metrics were still being ingested and could still be queried, however RAM was being steadily consumed, and the WAL files were accumulating.
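
For clarity, that corresponds roughly to the following invocation (flag spellings assumed from the 2.x CLI of that era; the min/max block duration flags are hidden tuning flags):

```
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention=30d \
  --storage.tsdb.min-block-duration=1h \
  --storage.tsdb.max-block-duration=24h
```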

Prometheus inevitably crashed when it ran out of memory or disk space, whichever happened first.

The only workaround I found was to catch Prometheus in this state before it crashed and perform a clean shutdown / restart. Once restarted, it began to process the backlog of WAL files, disk usage receded, and eventually everything got back to normal.

Environment

  • System information:
    Linux 4.18.0-0.bpo.1-amd64 x86_64

  • Prometheus version:
    prometheus, version 2.4.3+ds (branch: debian/sid, revision: 2.4.3+ds-2)
    build user: pkg-go-maintainers@lists.alioth.debian.org
    build date: 20181022-04:09:52
    go version: go1.10.4

  • Alertmanager version:
    n/a

  • Prometheus configuration file:
    n/a (but can be added if really necessary)

  • Alertmanager configuration file:
    n/a

  • Logs:

(originally responding to @krasi-georgiev in #4695)

I didn't quite understand why you needed to lower the default block duration?

This was partly motivated by the article written by @PeterZaitsev (https://www.percona.com/blog/2018/09/20/prometheus-2-times-series-storage-performance-analyses/). We found that RAM usage was about 10-20% lower with a fixed max-block-duration=24h than with the default of 10% of the retention period. Cold startup time also appeared to be lower.

any unusual log messages when this happens?

The systemd unit file in the Debian Prometheus package has TimeoutStopSec=20s, which I suspect is too short for a graceful shutdown of a large instance. It also has SendSIGKILL=false, so at least it will allow Prometheus to continue its shutdown despite overshooting the timeout. However, if this is occurring as part of a reboot, it is likely that the host will already be trying to unmount filesystems and reboot. I think this is where we are getting a corrupt DB / WAL.
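
If the short timeout is indeed part of the problem, a systemd drop-in override along these lines would give Prometheus more time to flush on shutdown (values are illustrative only):

```
# /etc/systemd/system/prometheus.service.d/override.conf  (illustrative)
[Service]
TimeoutStopSec=600
SendSIGKILL=no
```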

Upon restarting, we see the usual log messages, but nothing too scary:

Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834318726Z caller=main.go:231 msg="Starting Prometheus" version="(version=2.4.3+ds, branch=debian/sid, revision=2.4.3+ds-2)"
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834442207Z caller=main.go:232 build_context="(go=go1.10.4, user=pkg-go-maintainers@lists.alioth.debian.org, date=20181022-04:09:52)"
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834483654Z caller=main.go:233 host_details="(Linux 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 fra-prom1 (none))"
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834519284Z caller=main.go:234 fd_limits="(soft=8192, hard=8192)"
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.834550991Z caller=main.go:235 vm_limits="(soft=unlimited, hard=unlimited)"
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.835817375Z caller=web.go:392 component=web msg="Start listening for connections" address=0.0.0.0:9090
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.835822928Z caller=main.go:547 msg="Starting TSDB ..."
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.909715042Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538676000000 maxt=1538870400000 ulid=01CS67N83WJ7322TS7T0BTBGJT
Nov  5 14:53:35 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:53:35.912787882Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538870400000 maxt=1539064800000 ulid=01CSC11QHWV4BZNNEP46DTZR30
...
Nov  5 14:56:17 fra-prom1 prometheus[465]: level=warn ts=2018-11-05T14:56:17.020693598Z caller=head.go:371 component=tsdb msg="unknown series references" count=25
Nov  5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.951110468Z caller=main.go:557 msg="TSDB started"
Nov  5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.951267769Z caller=main.go:617 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Nov  5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.989881165Z caller=main.go:643 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
Nov  5 14:56:17 fra-prom1 prometheus[465]: level=info ts=2018-11-05T14:56:17.990048657Z caller=main.go:516 msg="Server is ready to receive web requests."

Even the next WAL commit and compactions look normal:

Nov  5 15:31:28 fra-prom1 prometheus[465]: level=info ts=2018-11-05T15:31:28.199164694Z caller=compact.go:398 component=tsdb msg="write block" mint=1541426400000 maxt=1541430000000 ulid=01CVJ7XC6W73TH0TAK9JQS09QB
Nov  5 15:31:40 fra-prom1 prometheus[465]: level=info ts=2018-11-05T15:31:40.624026Z caller=head.go:446 component=tsdb msg="head GC completed" duration=6.465479154s
Nov  5 15:32:13 fra-prom1 prometheus[465]: level=info ts=2018-11-05T15:32:13.964443157Z caller=head.go:477 component=tsdb msg="WAL checkpoint complete" low=8290 high=8304 duration=33.339447257s
...
Nov  5 16:31:26 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:31:26.39099609Z caller=compact.go:398 component=tsdb msg="write block" mint=1541430000000 maxt=1541433600000 ulid=01CVJBB7TXVYDQ9AS7GPKFJWR8
Nov  5 16:31:36 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:31:36.482639095Z caller=head.go:446 component=tsdb msg="head GC completed" duration=4.919497363s
Nov  5 16:32:06 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:32:06.97950364Z caller=head.go:477 component=tsdb msg="WAL checkpoint complete" low=8305 high=8319 duration=30.496711647s
Nov  5 16:33:31 fra-prom1 prometheus[465]: level=info ts=2018-11-05T16:33:31.835765448Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1541419200000 maxt=1541430000000 ulid=01CVJBF8YW56AME8TYY1A9AVW0 sources="[01CVJ11MZFNB1ZSRWTTXP11Z1M 01CVJ4FGKEFY42CEADCAAZR88N 01CVJ7XC6W73TH0TAK9JQS09QB]"

But by this time, it is already starting to "leak" memory and the WAL files are piling up. Several hours later, Prometheus runs out of memory:

Nov  6 00:58:09 fra-prom1 prometheus[465]: fatal error: runtime: out of memory
Nov  6 00:58:09 fra-prom1 prometheus[465]: runtime stack:
Nov  6 00:58:09 fra-prom1 prometheus[465]: runtime.throw(0x1449017, 0x16)
Nov  6 00:58:09 fra-prom1 prometheus[465]: #011/usr/lib/go-1.10/src/runtime/panic.go:616 +0x81
...

Can you try to describe a minimal setup and steps to reproduce this, so I can spend some time finding the culprit for this weird behaviour?

I think you should be able to reproduce it if you SIGKILL Prometheus whilst it is still flushing / shutting down. It may take a while after restarting before you actually see the issue however.

krasi-georgiev commented Nov 8, 2018

Thanks. It would really help to get the output of the promtool debug command (promtool is in the same release folder as the Prometheus binary), run at the time this happens. This will give us a clue as to what starts consuming so much memory.
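
Something along these lines should produce the archive (the server URL is just an example for a local instance):

```
# Writes a debug archive (debug.tar.gz) with pprof profiles and runtime info.
./promtool debug all http://localhost:9090/
```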

krasi-georgiev commented Nov 12, 2018

But by this time, it is already starting to "leak" memory and the WAL files are piling up. Several hours later, Prometheus runs out of memory:

When it reaches this state, are there no errors in the logs? I suspect some WAL checkpointing loop.

dswarbrick commented Nov 12, 2018

When it reaches this state, are there no errors in the logs? I suspect some WAL checkpointing loop.

The only log output is what I've already posted in this ticket. In other words, there is no indication from the logs that anything weird is happening, until it goes boom.

I have not had time yet to provoke the issue on a test system, and have just been extremely cautious about ensuring that instances are cleanly shut down in the meantime. This seems to avoid the issue.

krasi-georgiev commented Nov 12, 2018

That is weird. Any corruptions should be detected and repaired at start-up.

OK, I will try to replicate it with many dirty shutdowns.

dswarbrick commented Nov 18, 2018

@krasi-georgiev I just upgraded a test instance to v2.5.0, which recently landed as a package in debian/sid. Within a couple of hours it was exhibiting the familiar symptoms I've described above. I grabbed the debug info you requested from promtool.
debug.tar.gz

dswarbrick commented Nov 18, 2018

Additionally, trying to do a clean shutdown of this "bugged" instance is failing. It is ignoring further SIGTERMs, and the web UI is still reachable. Logs below, and another debug pprof taken whilst it was in that state:

Nov 18 22:36:43 fkb-prom-test systemd[1]: Stopping Monitoring system and time series database...
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=warn ts=2018-11-18T21:36:43.440651758Z caller=main.go:399 msg="Received SIGTERM, exiting gracefully..."
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.441326852Z caller=main.go:424 msg="Stopping scrape discovery manager..."
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.441407926Z caller=main.go:438 msg="Stopping notify discovery manager..."
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.441451042Z caller=main.go:460 msg="Stopping scrape manager..."
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.441539996Z caller=main.go:420 msg="Scrape discovery manager stopped"
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.442274972Z caller=main.go:434 msg="Notify discovery manager stopped"
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.452287635Z caller=manager.go:657 component="rule manager" msg="Stopping rule manager..."
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.453145358Z caller=manager.go:663 component="rule manager" msg="Rule manager stopped"
Nov 18 22:36:43 fkb-prom-test prometheus[21407]: level=info ts=2018-11-18T21:36:43.452316184Z caller=main.go:454 msg="Scrape manager stopped"
Nov 18 22:37:03 fkb-prom-test systemd[1]: prometheus.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Nov 18 22:37:23 fkb-prom-test systemd[1]: prometheus.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Nov 18 22:37:23 fkb-prom-test systemd[1]: Stopped Monitoring system and time series database.
Nov 18 22:37:23 fkb-prom-test systemd[1]: prometheus.service: Unit entered failed state.
Nov 18 22:37:23 fkb-prom-test systemd[1]: prometheus.service: Failed with result 'timeout'.

debug-wont-shutdown.tar.gz

krasi-georgiev commented Nov 19, 2018

Thanks, that should help. I will look into these today.

krasi-georgiev commented Nov 19, 2018

Unfortunately the block and mutex profiles are empty, as these are enabled only in debug mode:

if os.Getenv("DEBUG") != "" {
    runtime.SetBlockProfileRate(20)     // sample goroutine blocking events
    runtime.SetMutexProfileFraction(20) // sample mutex contention
}

I will continue looking into the other profiles, but if you can start another instance and gather the profiles with the DEBUG environment variable set, that would also help.
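
For example, something like this (paths and flags are illustrative):

```
# Launch with the DEBUG env var set so the block/mutex profiles above get populated,
# then capture the profiles again once the WAL starts piling up.
DEBUG=1 ./prometheus --config.file=prometheus.yml --storage.tsdb.path=data/
```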

AndorCS commented Nov 19, 2018

The same issue happens to us.
After removing some chunks, the rest of the WAL was checkpointed and Prometheus worked smoothly for ~4 days until it happened again.

We have --min-block-duration=30m and storage.tsdb.max-block-duration=24h

There is a screenshot from Grafana: https://i.imgur.com/207elUK.png

krasi-georgiev commented Nov 19, 2018

Looking at the profile taken before the shutdown, it seems that there is some silent panic in github.com/modern-go/reflect2, caused somewhere in the k8s service discovery, most likely in k8s.io/apimachinery/pkg/runtime/serializer/json/json.go.

The reflect2 package is used in the Prometheus API as well, but the profile doesn't show anything about that package.

stuck 11 33

dswarbrick commented Nov 19, 2018

That's odd. I don't use k8s SD, although I am using Consul SD.

krasi-georgiev commented Nov 19, 2018

That is strange indeed. The k8s API is not involved anywhere in your configs?

Can you also post the configs so we can look for some clues there?

krasi-georgiev commented Nov 19, 2018

@AndorCS can you help with steps to reproduce?

After removing some chunks, the rest of the WAL was checkpointed and Prometheus worked smoothly for ~4 days until it happened again.

Chunks are inside the actual blocks. Are you sure that you actually removed chunks? These are not related to the WAL.

Could you post the full logs so we can check whether there is anything unusual?

It would also help if you could run promtool debug all when it happens (before shutting it down), to see if it looks similar to the one posted by @dswarbrick.

dswarbrick commented Nov 19, 2018

@krasi-georgiev Here is the config from the instance where I ran promtool debug. Nothing too exotic; it's simply a test instance. The file is renamed to *.yml.txt because GitHub doesn't support attaching .yml files.

prometheus.yml.txt

krasi-georgiev commented Nov 19, 2018

I find the collapsible block a good way to add long text directly in the comment.

<details>
  <summary>Config</summary>
  
	```
	config
	```
</details>
Your config:

global:
  scrape_interval:     30s
  evaluation_interval: 15s

  external_labels:
    monitor: fkb-prom-test

alerting:
  alert_relabel_configs:
  - source_labels: [state]
    regex: DOWN|MAINTAINED
    action: drop

  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - rules.yml

scrape_configs:
- job_name: prometheus
  scrape_interval: 5s
  scrape_timeout: 5s
  static_configs:
  - targets:
    - fkb-prom-test:9090

- job_name: node
  static_configs:
  - targets:
    - fkb-prom-test:9100

- job_name: node-dc
  consul_sd_configs:
  - server: fra-alerta:8500
    services: [prometheus-node]
    datacenter: fkb
  relabel_configs:
  - source_labels: [__meta_consul_dc]                       # Set dc label to Consul datacenter
    target_label: dc
  - source_labels: [__meta_consul_metadata_state]           # Copy state from Consul metadata
    target_label: state
  - source_labels: [__meta_consul_metadata_state_reason]    # Copy state_reason from Consul metadata
    target_label: state_reason
  - source_labels: [__meta_consul_metadata_chassis]         # Copy chassis from Consul metadata
    target_label: chassis
  - source_labels: [__meta_consul_metadata_cluster_name]    # Extract cluster number from cluster_name
    target_label: cluster
    regex: \S+-\S+-cluster-0*(\d+)
  - source_labels: [__meta_consul_metadata_host_type]       # Copy host_type from Consul metadata
    target_label: host_type
  - source_labels: [host_type]
    regex: storage
    action: keep
  - source_labels: [cluster]
    regex: 17
    action: keep

- job_name: blackbox-icmp
  metrics_path: /probe
  params:
    module: [icmp]
  consul_sd_configs:
  - server: fra-alerta:8500
    services: [icmp]
    datacenter: fkb
  relabel_configs:
  - source_labels: [__address__]                            # Target param for Blackbox exporter
    regex: ([a-z0-9\.\-]+)(:\d+)?
    target_label: __param_target
  - source_labels: [__param_target]                         # Set instance label to that of target
    target_label: instance
  - target_label: __address__                               # Address to scrape (Blackbox exporter)
    replacement: localhost:9115
  - source_labels: [__meta_consul_dc]                       # Set dc label to Consul datacenter
    target_label: dc
  - source_labels: [__meta_consul_metadata_state]           # Copy state from Consul metadata
    target_label: state
  - source_labels: [__meta_consul_metadata_state_reason]    # Copy state_reason from Consul metadata
    target_label: state_reason
  - source_labels: [__meta_consul_metadata_chassis]         # Copy chassis from Consul metadata
    target_label: chassis
  - source_labels: [__meta_consul_metadata_cluster_name]    # Extract cluster number from cluster_name
    target_label: cluster
    regex: \S+-\S+-cluster-0*(\d+)

- job_name: dell_snmp
  scrape_interval: 1m
  file_sd_configs:
  - files:
    - targets/chassis-dell.yml
  metrics_path: /snmp
  params:
    module: [dell_cmc]
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9116

- job_name: test-proc-exporter
  static_configs:
  - targets:
    - 10.3.1.59:9256
    - git03.pb.local:9113

- job_name: ebpf
  static_configs:
  - targets:
    - localhost:9123

# vim ft=yaml ts=2 sts=2

AndorCS commented Nov 19, 2018

@AndorCS can you help with steps to reproduce?

After removing some chunks, the rest of the WAL was checkpointed and Prometheus worked smoothly for ~4 days until it happened again.

It is happening again right now; I am trying to profile it.

chunks are inside the actual blocks. Are you sure that you actually removed chunks as these are not related to the WAL.

Yes, I'm sure, I removed some chunks, not WAL.

Could you post the full logs to check if there is anything unusual.

it would help if you could run promtool debug all when it happens(before shutting it down) to see if it looks similar to the one posted by @dswarbrick

Will try right now.

AndorCS commented Nov 19, 2018

krasi-georgiev commented Nov 19, 2018

@AndorCS it is good that it happened again; this should help track down the culprit.
Mind posting the logs? Posting the config would also help, as would any other info that might be relevant.

Yes, I'm sure, I removed some chunks, not WAL.

Hm, that is strange. WAL checkpointing involves only the files in the data/wal dir. Which dir did you remove these chunks from?
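
For reference, the data directory looks roughly like this (segment and checkpoint names are illustrative, and the layout differs slightly between versions); chunks live under the per-block ULID directories, while checkpointing only ever touches data/wal:

```
data/
  01CVJBF8YW56AME8TYY1A9AVW0/   # a persisted block (ULID taken from the logs above)
    chunks/                     # the actual chunk files
    index
    meta.json
    tombstones
  wal/                          # write-ahead log, truncated by checkpoints
    000123                      # segment files (names illustrative)
    checkpoint.000122/
```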

AndorCS commented Nov 19, 2018

Mind posting the logs? Also maybe posting the config should help. Or any other info that might be relevant.

The logs do not exist anymore; the last occurrence was ~1 week ago. Sorry.
Right now the last messages in the logs are:

level=info ts=2018-11-19T12:15:13.946072707Z caller=compact.go:398 component=tsdb msg="write block" mint=1542627000000 maxt=1542628800000 ulid=01CWNYACP8DJ8P923SP1JRTRET
level=info ts=2018-11-19T12:15:19.244860164Z caller=head.go:446 component=tsdb msg="head GC completed" duration=657.814851ms
level=info ts=2018-11-19T12:45:13.738489984Z caller=compact.go:398 component=tsdb msg="write block" mint=1542628800000 maxt=1542630600000 ulid=01CWP01AG911766GRS4AKWQ7ZY
level=info ts=2018-11-19T12:45:17.062075489Z caller=head.go:446 component=tsdb msg="head GC completed" duration=618.669213ms
level=info ts=2018-11-19T12:45:22.078614939Z caller=head.go:477 component=tsdb msg="WAL checkpoint complete" low=1152 high=1153 duration=5.016442605s
level=info ts=2018-11-19T12:45:38.411126593Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1542623400000 maxt=1542628800000 ulid=01CWP0224NB3P1ETYQ43DS6S0M sources="[01CWNTWGXEYQ3TSGMZ2BPCFX12 01CWNWTTXWJQ1PC3HSKFXDZF3B 01CWNYACP8DJ8P923SP1JRTRET]"

And graph
screenshot 2018-11-19 at 15 49 01

As you can see, the checkpoint happens twice and after that it is stuck again.

krasi-georgiev commented Nov 19, 2018

@AndorCS I am unable to open the archive.

AndorCS commented Nov 19, 2018

Sorry, that archive is broken
debug.tar.gz

krasi-georgiev commented Nov 19, 2018

@AndorCS any way to save the logs for the next time? Seeing some errors in the logs would be the easiest way to troubleshoot this.

The WAL graph shows that something happened between 13:30 and 14:00. Do you know what it was?

Do you know if this container was OOM-killed at some point? I already fixed a bug caused by this in prometheus/tsdb#451.

Can you share the actual file paths you deleted to make Prometheus work again?

krasi-georgiev commented Nov 19, 2018

@AndorCS your profiles look normal, so these won't help to troubleshoot your issue.

Anyway, if these were taken at the time the WAL started misbehaving, please open a new issue and ping me there, as this is most likely caused by a different issue than the one we are troubleshooting here.

krasi-georgiev commented Nov 19, 2018

@dswarbrick can you ping me on the #prometheus-dev IRC channel? It might be quicker that way.

dswarbrick commented Nov 19, 2018

@krasi-georgiev I tried to, but I seem to be read-only in that channel.

krasi-georgiev commented Nov 19, 2018

I think it requires a registration.
https://freenode.net/kb/answer/registration

AndorCS commented Nov 19, 2018

@AndorCS any way to save the logs for the next time? Seeing some errors in the logs would be the easiest way to troubleshoot this.

I checked; there was nothing suspicious in the logs.

The WAL graph shows that something happened between 13:30 and 14:00. Do you know what it was?
Do you know if this container was OOM-killed at some point? I already fixed a bug caused by this in prometheus/tsdb#451.

Those were some restarts related to configuration changes, not related to the current issue.

Can you share the actual file paths you deleted to make Prometheus work again?

Unfortunately, I didn't save it.

Here is fresh data: nothing new in the logs, not a single log line at all.

screenshot 2018-11-19 at 19 00 55

debug.tar.gz

AndorCS commented Nov 20, 2018

Today we faced the same issue again. The red annotations are "unknown series references" messages in the logs during startup.

screenshot 2018-11-20 at 13 13 13

krasi-georgiev commented Nov 20, 2018

It looks like the head is not being truncated.
Can you add the details from ...:9090/flags?

krasi-georgiev commented Nov 20, 2018

Could you also ping me on IRC for a quick chat?

dswarbrick commented Nov 21, 2018

@krasi-georgiev I upgraded another instance from 2.4.3 to 2.5.0, which appeared to run ok for a while, but alas ultimately showed the same deadlock-like symptom, and the head chunk kept growing in the absence of WAL checkpoints. The only thing that stuck out in the logs was 40 unknown series references. After another clean shutdown / restart, it has been running fine for about 12 hours now.

I have just now upgraded another, slightly larger instance. When attempting to shut down the 2.4.3 process, it seemed to hang. ss -t showed that it still had established TCP connections to various node_exporters and the blackbox_exporter. I'm not sure if these were "stuck" TCP connections, or whether it was in fact continuing to scrape. These are the log entries prior to me SIGKILL'ing it:

Nov 21 12:14:32 las-prom1 systemd[1]: Stopping Monitoring system and time series database...
Nov 21 12:14:32 las-prom1 prometheus[472]: level=warn ts=2018-11-21T12:14:32.421020395Z caller=main.go:391 msg="Received SIGTERM, exiting gracefully..."
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.421132576Z caller=main.go:416 msg="Stopping scrape discovery manager..."
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.421236338Z caller=main.go:430 msg="Stopping notify discovery manager..."
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.421273643Z caller=main.go:452 msg="Stopping scrape manager..."
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.421331346Z caller=main.go:412 msg="Scrape discovery manager stopped"
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.422865819Z caller=main.go:426 msg="Notify discovery manager stopped"
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.449589402Z caller=main.go:446 msg="Scrape manager stopped"
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.449361161Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.451496662Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.541673673Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
Nov 21 12:14:32 las-prom1 prometheus[472]: level=error ts=2018-11-21T12:14:32.541987879Z caller=notifier.go:473 component=notifier alertmanager=http://fkb-alertmgr:9093/api/v1/alerts count=64 msg="Error sending alert" err="context canceled"
Nov 21 12:14:32 las-prom1 prometheus[472]: level=error ts=2018-11-21T12:14:32.54221133Z caller=notifier.go:473 component=notifier alertmanager=http://fra-alertmgr:9093/api/v1/alerts count=64 msg="Error sending alert" err="context canceled"
Nov 21 12:14:32 las-prom1 prometheus[472]: level=error ts=2018-11-21T12:14:32.542184693Z caller=notifier.go:473 component=notifier alertmanager=http://ewr-alertmgr:9093/api/v1/alerts count=64 msg="Error sending alert" err="context canceled"
Nov 21 12:14:32 las-prom1 prometheus[472]: level=info ts=2018-11-21T12:14:32.542367079Z caller=main.go:601 msg="Notifier manager stopped"
Nov 21 12:14:52 las-prom1 systemd[1]: prometheus.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Nov 21 12:15:12 las-prom1 systemd[1]: prometheus.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Nov 21 12:15:12 las-prom1 systemd[1]: Stopped Monitoring system and time series database.
Nov 21 12:15:12 las-prom1 systemd[1]: prometheus.service: Unit entered failed state.
Nov 21 12:15:12 las-prom1 systemd[1]: prometheus.service: Failed with result 'timeout'.

krasi-georgiev commented Nov 21, 2018

The stuck shutdown is most likely a side effect of the WAL issue, so I wouldn't focus too much on that yet.

I am still waiting for the other maintainers to have a look at some PRs with WAL fixes; once we merge those, we can make another release and test it again.

dswarbrick commented Nov 21, 2018

Debug pprof from the latest instance to exhibit the symptom, as discussed on IRC.
debug.tar.gz

dswarbrick commented Nov 21, 2018

@krasi-georgiev I finally managed to get a debug stack pprof from an instance with DEBUG=1. This instance was only set up for the first time yesterday; it has only ever run v2.5.0. I nuked the DB earlier today, so it started with a fresh DB, and within a few hours it was showing the deadlock-like symptoms. This is also a very low-volume instance (17K series head chunk, ~1350 samples per second), so the issue does not appear to only affect high-traffic instances.
debug.tar.gz

dswarbrick commented Nov 22, 2018

A couple more details about the aforementioned instance: despite several attempts, I was not able to get it to shut down cleanly. Every time I attempted this, it would hang during the shutdown. After each restart, it would compact any loose blocks, but it appeared that it was simply not going to do a WAL checkpoint, and in the end I just left it running for several hours. When I checked back later, it had indeed performed a checkpoint, but checkpoints are not occurring regularly or predictably (for the most part) like on the other instances. It seems that on a fresh DB with sufficiently low traffic, there is a much longer gap between checkpoints. The last checkpoint for this new instance was nearly six hours ago.

In any case, the shutdown hangs are pretty concerning, and if they are as easy to provoke on a new instance as this one was, it's surprising that this got through automated testing (or even manual testing by developers).

dswarbrick commented Nov 22, 2018

Logs from the aforementioned hanging instance:
txl-prom1.log.gz

dswarbrick commented Nov 22, 2018

I will most likely have to defer to others who know the tsdb code much better, but I have so far traced the bug as far as the pendingReaders WaitGroup in block.go:

// Close closes the on-disk block. It blocks as long as there are readers reading from the block.
func (pb *Block) Close() error {
        pb.mtx.Lock()
        pb.closing = true
        pb.mtx.Unlock()

        pb.pendingReaders.Wait()

        var merr MultiError

        merr.Add(pb.chunkr.Close())
        merr.Add(pb.indexr.Close())
        merr.Add(pb.tombstones.Close())

        log.Println("block closed")

        return merr.Err()
}

I have seen goroutines hang there indefinitely. It seems to be more likely to occur at shutdown, possibly due to blocks being closed whilst reads are still pending. It can, however, happen spontaneously during normal operation if the race conditions (?) are met.
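
To illustrate the failure mode, here is a minimal, self-contained sketch of the pending-readers pattern (not the actual tsdb code): every reader registers with the WaitGroup and must release it when done, so a single leaked reader makes Close() block forever on Wait(), which is exactly the kind of hang shown above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type block struct {
	mtx            sync.Mutex
	closing        bool
	pendingReaders sync.WaitGroup
}

// startRead registers a reader; the caller must invoke the returned release func.
func (b *block) startRead() (release func(), err error) {
	b.mtx.Lock()
	defer b.mtx.Unlock()
	if b.closing {
		return nil, errors.New("block is closing")
	}
	b.pendingReaders.Add(1)
	return b.pendingReaders.Done, nil
}

// close waits for all registered readers before releasing resources.
func (b *block) close() {
	b.mtx.Lock()
	b.closing = true
	b.mtx.Unlock()
	b.pendingReaders.Wait() // hangs forever if any release() was never called
	fmt.Println("block closed")
}

func main() {
	b := &block{}
	release, _ := b.startRead()
	go func() {
		time.Sleep(100 * time.Millisecond)
		release() // comment this out and close() below never returns
	}()
	b.close()
}
```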

krasi-georgiev commented Nov 28, 2018

Fixed in #4922.

@dswarbrick, @AndorCS thanks for the teamwork!

simonpasquier modified the milestone: v2.6.0 Nov 28, 2018

spjspjspj commented Dec 7, 2018

We faced the same issue after a probably unclean restart. It seems the Next GC target is way too high, and GC did not take place for days until an OOM (followed by a crash loop):
image

Head Time Series / Chunks kept growing as well:

image

image

Using prometheus_tsdb_head_series_created_total, I can see that Head Series were being created at the same pace as they used to, but their removal stopped.
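
For comparison, the creation and removal rates can be put side by side; the removed-series counter name below is assumed to exist in this version as the counterpart of the created one:

```
# On a healthy instance these should roughly track each other once head truncation runs.
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])
```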

I wiped the data/wal dir and the Prometheus server started up again.

2.4.3 - 167a4b4

I wonder if there is a way to fix this without losing the data in the WAL.

krasi-georgiev commented Dec 7, 2018

The 2.6.0 RC is out. Upgrade to it and run with the same data folder; this should clear the problem without losing any data.

Andor commented Dec 17, 2018

@krasi-georgiev I accidentally deployed the "stable" version 2.5.0 (i.e. the buggy one) after your custom patched version and faced this issue again a couple of days ago.
Now when I try to run 2.6.0-rc.1, it starts and shortly afterwards gets OOM-killed. This has already repeated multiple times, and it looks like I can't achieve a clean shutdown at all.

Log: http://dpaste.com/27594H5

/prometheus $ du -sh /prometheus/wal/
18.0G   /prometheus/wal/
/prometheus $ ls -l /prometheus/wal/ | wc -l
145

Andor commented Dec 17, 2018

OK, after cleaning up some of the latest WAL data, my instance started to work.

krasi-georgiev commented Dec 17, 2018

Yeah, if the WAL grows very big you can't avoid an OOM even after a restart, as it needs to read the whole thing into memory.

FANLONGFANLONG commented Dec 18, 2018

I ran into the same issue. I am using the 2.6.0-rc.1 version but the issue happened again; any suggestion? Prometheus keeps eating memory.

krasi-georgiev commented Dec 18, 2018

You are most probably hitting some other issue; please open a new one with more details.
I am locking this issue to limit the noise.

prometheus locked as resolved and limited conversation to collaborators Dec 18, 2018
