TSDB compaction failures lead to OOM shutdown #4757

Closed
ashepelev opened this Issue Oct 18, 2018 · 22 comments


ashepelev commented Oct 18, 2018

Bug Report

What did you do?
Using Prometheus with 2000+ targets, 41k+ samples/sec, and a 10 Mbit/s downlink.

What did you expect to see?
Stable operation.

What did you see instead? Under which circumstances?
Compaction failures, increasing RAM consumption, WAL growth, and eventually an OOM shutdown.

Environment

  • System information:
Linux 4.4.0-137-generic x86_64

4CPU, 32GB RAM, Local TSDB data (SSD)
  • Prometheus version:
prometheus, version 2.4.3 (branch: HEAD, revision: 167a4b4e73a8eca8df648d2d2043e21bdb9a7449)
  build user:       root@1e42b46043e9
  build date:       20181004-08:42:02
  go version:       go1.11.1
  • TSDB launch arguments
--storage.tsdb.min-block-duration=30m --storage.tsdb.max-block-duration=2h --storage.tsdb.retention=1d
  • Prometheus configuration file:
---
# yamllint disable rule:line-length
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 20s

alerting:
  alertmanagers:
    - timeout: 1s
      static_configs:
        - targets:
          - alertmanager-1:9093
          - alertmanager-2:9093             

rule_files:
  - /opt/prometheus/rules/dynamic-rules.yml

scrape_configs:
  - job_name: consul-services
    consul_sd_configs:
     - server: consul-server
       token: token
       datacenter: prod
       tag: prometheus_exporter
[  - job_name: job-name
    scrape_interval: 30s
    static_configs:
      - targets:
          - target-#
        labels:
          group: project
          env: prod
          service: service
    relabel_configs:
      - source_labels: [__address__]
        regex: ^(.*):\d+$
        target_label: instance
        replacement: $1
        action: replace
]
  • Logs:
Oct 18 09:16:52 vm-prometheus-prod-app01 prometheus[17227]: level=error ts=2018-10-18T06:16:52.967557383Z caller=db.go:305 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: symbol entry for \"service_acquisition_latency_ms><>Mean: 0.015672376193069618\" does not exist"
  • Additional Info
    We use two Prometheus servers with the same configuration (targets, consul_sd_config, alertmanagers, etc.). Luckily compaction breaks on only one of them at a time, so both are never affected simultaneously.
    [graph attachment]

For the period covered by the following graphs we didn't detect any "compaction failed" errors in the logs:
[graph attachments]

We ran for a long time with 16GB RAM per node, and then memory usage went out of control.

If any additional metrics are needed, we can provide them.

SuperQ commented Oct 19, 2018

This seems related to #4110. Getting pprof heap profiles during the compaction would help to debug the issues.

ashepelev commented Oct 19, 2018

@SuperQ Hello, thanks for your answer. Generally we see the compaction failures at about 16GB of RAM consumption (50% of total). Do you think the compaction process really needs that much memory?

SuperQ commented Oct 19, 2018

Yes, right now compactions do need quite a lot of memory.

One thing you're doing that's making it worse is setting --storage.tsdb.min-block-duration=30m. I suggest you remove this flag and use the default 2h. Sample chunks hold up to 120 samples. With a 30m block duration and your 30s scrape_interval you only write 60 samples (30m / 30s) into each 120-sample chunk, which wastes a ton of memory and storage space.

Also, if you set min-block and max-block to the same number, you will spend no time doing additional compactions.
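
For illustration, one possible reading of that advice, keeping the reporter's 1d retention (a sketch only, to be adapted to your own retention needs):

--storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h --storage.tsdb.retention=1d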

metri commented Oct 22, 2018

@SuperQ, hi! I am a colleague of @ashepelev.

How can I send you the archive with the pprof debug data?

SuperQ commented Oct 22, 2018

The easiest way is to attach it as a file to this issue.

metri commented Oct 23, 2018

prom_debug.tar.gz
Here is the pprof debug data for both nodes.

krasi-georgiev commented Oct 29, 2018

@SuperQ I think the problem here is different.

The error occurs when trying to persist the head into a block, not when trying to join two or more blocks together.

Oct 18 09:16:52 vm-prometheus-prod-app01 prometheus[17227]: level=error ts=2018-10-18T06:16:52.967557383Z caller=db.go:305 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: symbol entry for \"service_acquisition_latency_ms><>Mean: 0.015672376193069618\" does not exist"

I think the problem here is with some strange character in your series labels.

service_acquisition_latency_ms><>

krasi-georgiev commented Oct 29, 2018

If that is the case please let me know the series label that caused this so we can add some sanitization somewhere.

metri commented Oct 30, 2018

@krasi-georgiev hi!

The thing is that this happens with different labels each time. E.g., yesterday:

prometheus[9131]: level=error ts=2018-10-29T15:47:27.462714041Z caller=db.go:305 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: symbol entry for \"/var/lib/docker/overlay2/ea206854728cc1ff29dfc5a6f7c7b2e5cb98dd767c96cef3ba1b6481371070b2/merged\" does not exist"

This was due to a different label.

krasi-georgiev commented Oct 30, 2018

Hm, that looks like something very specific to your setup.

/var/lib/docker/overlay2/ea206854728cc1ff29dfc5a6f7c7b2e5cb98dd767c96cef3ba1b6481371070b2/merged
Is this really a label?

When did this start happening and did you change anything in your setup around that time?

ashepelev commented Oct 30, 2018

@krasi-georgiev, hello,

I guess the compaction failed errors are not the root cause but a consequence of the problem. We worked out that, in theory, we should be fine with a 32GB RAM installation for our amount of metrics (30m min/max block, 1d retention, 60s global scrape interval -> 23k samples/s). But the truth is we cannot put any more pressure on Prometheus: changing any of these settings causes an OOM. After a restart it starts reading the WAL, starts compaction again, and OOMs again. I've tried to find Prometheus metrics that would help me resolve this issue, but all of them look just fine: I see the head flushing every 30m, and the head time window size behaves normally. I don't see any exceptions in Prometheus debug mode. I've looked across other Prometheus RAM issues and figured out several things to check:

  1. Heavy rules. I've searched across the rules that were added when the problem started (yes, the problem appeared immediately; it was not a slow growth of RAM over time). The heaviest one takes 125ms, while the others are at 10-20ms evaluation time.
  2. Pushgateway. We use it, but there are only a few metrics in total (~100).
  3. Dynamic labels, i.e. an exporter that sends different label key/value pairs all the time. That constantly creates new time series and might be hard on Prometheus (a sketch of dropping such labels at scrape time follows below).

I've tried to determine the current in-memory size and the number of samples with the help of Prometheus metrics, but I couldn't calculate that. Anyway, I believe the problem is somewhere around samples/time series and the processing of them, but I can't diagnose which one causes this. Maybe you have some advice?
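
As a hedged illustration for point 3: a minimal metric_relabel_configs sketch that strips a high-churn label at scrape time. The job, target and label name (request_id) are hypothetical; the real offending labels would have to be identified first, and note that series differing only in the dropped label will collide after the drop.

scrape_configs:
  - job_name: some-job                  # hypothetical job
    static_configs:
      - targets: ['some-target:9100']   # hypothetical target
    metric_relabel_configs:
      # drop the constantly-changing label so each scrape maps onto the same series
      - action: labeldrop
        regex: request_id
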
krasi-georgiev commented Oct 30, 2018

The problem I see is completely different.

The head is the in-memory holder of the ingested metrics, and it is flushed to disk periodically to free up memory.
Now this flush fails, so the head grows indefinitely until it eventually eats all the memory and the process gets killed.

We need to figure out why you are getting the symbol entry error.

Back to my original question: did you change anything recently? A new exporter, a new app, anything new on the metrics side, relabelling, etc. Anything in relation to series labels.

Symbols are all the strings of the series label name/value pairs, so I think the problem is that somehow these get messed up, either when we add them or when we read them from the head. It is probably caused by some weird character somewhere which we need to sanitise.
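
To make the symbol-table idea concrete, a hedged illustration (not the on-disk index format) of the strings a single hypothetical series contributes to the symbol table; every label name and label value, including __name__ and the metric name, becomes a symbol:

# hypothetical series: http_requests_total{env="prod",job="api"}
symbols:
  - __name__
  - http_requests_total
  - env
  - prod
  - job
  - api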

krasi-georgiev commented Oct 30, 2018

Adding an additional series label name/value pair to an existing application could be another culprit.

krasi-georgiev commented Oct 30, 2018

Another thing that could help: if you could upload the data/wal folder somewhere and share the link with me privately at kgeorgie at redhat.com.

krasi-georgiev commented Oct 30, 2018

@fabxc, @gouthamve any other suggestions on what might cause this?

ashepelev commented Oct 30, 2018

@krasi-georgiev
Thank you for your answer. I'd like to note that we are not seeing any compaction failed problems now, so the initial bug description might be confusing.
On the question of recently added exporters/apps: we use static targets under VCS control as well as dynamic targets coming from consul_sd, plus Pushgateway metrics. I don't control all of the new targets and metric labels that end up in Prometheus, but I would be glad to have an analysis tool so I can find which ones have the biggest impact.
About the symbols table, maybe this graph will be valuable:
[graph attachment]
One more interesting thing is the number of currently loaded TSDB blocks: it has been constant at 50 the whole time. I guess it's some kind of in-code limit; maybe I should pay attention to this.
And about the WAL: I've been checking the data/wal size periodically. During heavy usage (90-100% RAM) it was 5-6GB on both nodes. Now it's something like 1.5-2.5GB.
I'd like to share data/wal but unfortunately I cannot. Maybe you can point me to some tools to analyse it myself?

ashepelev commented Nov 1, 2018

I guess we figured out the problem.
After a long investigation of the scraped metrics and of Prometheus' own metrics, we looked inside data/<block_id>/index and found that it is 3x the size of the chunks, which is abnormal. Thanks to its text format we were able to inspect the contents. We found that some labels have lots of values (hash sums, ids, etc.). We also found metrics coming from a misconfigured JMX exporter, which put the metric value inside a label value. After configuring dropping of these metrics we saw that RAM usage decreased significantly.
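
For anyone hitting the same problem, a minimal sketch of dropping such metrics at scrape time with metric_relabel_configs; the metric name in the regex is hypothetical and would need to be replaced with the actual offending metrics:

metric_relabel_configs:
  # drop metrics whose labels carry per-scrape values (ids, hash sums, measured values)
  - source_labels: [__name__]
    regex: 'jmx_http_metric_.*'   # hypothetical metric name pattern
    action: drop
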
We have now returned to our pre-failure configuration: min-block=2h, max-block=12h, 30d retention and a 30s global scrape interval. Both of our nodes are fine, using about 30% of RAM, whereas before the fix we were hitting OOM even with min-block=30m, max-block=30m, retention=1d and scrape_interval=60s.
Thanks everyone for the assistance and your tips.
Hope it helps someone else figure out the reasons for their high RAM usage.

krasi-georgiev commented Nov 1, 2018

Increased memory usage is expected, but not the original error you posted, so I still think that some sanitization is missing somewhere.

Can you maybe post a sample of a scrape in the text format (removing all confidential info) so it gives me a starting point to dig deeper into this?

krasi-georgiev commented Nov 9, 2018

ping @ashepelev

ashepelev commented Nov 9, 2018

@krasi-georgiev
Sorry for not answering.
As I said before, we resolved the problem by blacklisting metrics that have lots of label values.
Here's an example:

http_metric{env="env",group="proj",instance="host.company.com",job="jobname",name="200><>FifteenMinuteRate: 0.0011530746427113085",service="service",type="http_metric"}

As you can see, the name label is filled with the metric value.
This happened because of a wrong jmx_exporter config:

rules:
-   labels:
        name: $1
        type: http_metric
    name: http_metric
    pattern: ^com.*, name=(.*).*$

We had lots of such samples in every scrape, which led to enormous growth of the block index file.
That correlates with the initial error in the Prometheus logs:

Oct 18 09:16:52 vm-prometheus-prod-app01 prometheus[17227]: level=error ts=2018-10-18T06:16:52.967557383Z caller=db.go:305 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: symbol entry for \"service_acquisition_latency_ms><>Mean: 0.015672376193069618\" does not exist"

Essentially, each scrape produced new label values. I guess Prometheus wasn't ready for that.
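
For reference, a hedged sketch of how such a rule might be tightened. jmx_exporter matches rule patterns against strings of the form domain<prop1=value1, ...><keys>attrName: value, so stopping the capture group at the closing '>' keeps the attribute name and value out of the label; the exact pattern and naming would need to be adapted to the real beans:

rules:
  - pattern: '^com\..*<name=([^>,]+)><>(\w+):'
    name: http_metric_$2        # e.g. http_metric_FifteenMinuteRate
    labels:
      name: $1                  # now only the bean's name property, not the sample value
      type: http_metric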

simonpasquier commented Nov 9, 2018

Closing as it was due to unbounded label cardinality.

krasi-georgiev commented Nov 9, 2018

Still curious about those error log messages.

Could it be that the system didn't have any more memory to accept more symbol entries, so adding one failed, which then caused the subsequent read of it to fail as well? Sounds like a weird way to OOM.
