
Getting many "compaction failed ... invalid argument" errors in the Prometheus server log #5362

Closed
dcvtruong opened this Issue Mar 14, 2019 · 8 comments

dcvtruong commented Mar 14, 2019

Bug Report

What did you do?
I noticed many compaction errors after the Prometheus server pod was rescheduled onto a different k8s worker node.
What did you expect to see?
Not so many compaction errors in the Prometheus server log.
What did you see instead? Under which circumstances?
Repeated compaction errors with "invalid argument".
Environment

  • System information:

    Linux 3.10.0-957.1.3.el7.x86_64 x86_64

  • Prometheus version:

    2.6.1

  • Prometheus configuration file:

      global:
        scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
        evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
        # scrape_timeout is set to the global default (10s).

      # Alertmanager configuration
      alerting:
        alertmanagers:
        - static_configs:
          - targets:
            # - alertmanager:9093

      # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
      rule_files:
        # - "first_rules.yml"
        # - "second_rules.yml"

      # A scrape configuration containing exactly one endpoint to scrape:
      # Here it's Prometheus itself.
      scrape_configs:
        # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
        - job_name: 'prometheus'

          # metrics_path defaults to '/metrics'
          # scheme defaults to 'http'.

          static_configs:
          - targets: ['localhost:9090']
  • Logs:
          level=info ts=2019-03-14T01:54:50.533936079Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.1, branch=HEAD, revision=b639fe140c1f71b2cbad3fc322b17efe60839e7e)"
          level=info ts=2019-03-14T01:54:50.534059479Z caller=main.go:244 build_context="(go=go1.11.4, user=root@4c0e286fe2b3, date=20190115-19:12:04)"
          level=info ts=2019-03-14T01:54:50.534131879Z caller=main.go:245 host_details="(Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 prometheus-server-844f844bf8-vtdk6 (none))"
          level=info ts=2019-03-14T01:54:50.534159179Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
          level=info ts=2019-03-14T01:54:50.534194279Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
          level=info ts=2019-03-14T01:54:50.535053582Z caller=main.go:561 msg="Starting TSDB ..."
          level=info ts=2019-03-14T01:54:50.535301382Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
          level=info ts=2019-03-14T01:55:16.467744497Z caller=main.go:571 msg="TSDB started"
          level=info ts=2019-03-14T01:55:16.467867997Z caller=main.go:631 msg="Loading configuration file" filename=/etc/config/prometheus.yml
          level=info ts=2019-03-14T01:55:16.471057806Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
          level=info ts=2019-03-14T01:55:16.47246501Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
          level=info ts=2019-03-14T01:55:16.473748513Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
          level=info ts=2019-03-14T01:55:16.474883016Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
          level=info ts=2019-03-14T01:55:16.476965822Z caller=kubernetes.go:201 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
          level=info ts=2019-03-14T01:55:16.479123528Z caller=main.go:657 msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml
          level=info ts=2019-03-14T01:55:16.479218928Z caller=main.go:530 msg="Server is ready to receive web requests."
          level=error ts=2019-03-14T01:55:43.025096323Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WYJEEX59YRGA29BPE8WAXY.tmp: invalid argument"
          level=error ts=2019-03-14T01:55:48.766084444Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WYK48SCPZCB83G3VV6VSFK.tmp: invalid argument"
          level=error ts=2019-03-14T01:55:49.022951547Z caller=federate.go:184 component=web msg="federation failed" err="write tcp 10.233.108.33:9090->10.233.108.0:52696: write: broken pipe"
          level=error ts=2019-03-14T01:57:27.830604926Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WYKAVEYZP90XZ45SC9BZ5F.tmp: invalid argument"
          level=error ts=2019-03-14T01:59:28.531418658Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WYPDHP3GRAAVCC3ZDBZG5W.tmp: invalid argument"
          level=error ts=2019-03-14T02:01:29.327471451Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WYT7AKK66SAPJW3C92V66V.tmp: invalid argument"
          level=error ts=2019-03-14T02:03:30.1805298Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WYY53F5BYS095Q4PBZCGTY.tmp: invalid argument"
          level=error ts=2019-03-14T02:05:31.049499893Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZ2AR42R59P33A7439264X.tmp: invalid argument"
          level=error ts=2019-03-14T02:07:31.713056789Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZ6W49WP23BAR2Z8TQ8V32.tmp: invalid argument"
          level=error ts=2019-03-14T02:09:32.49053092Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZAHZ13XYQNJ9HY2CN3464.tmp: invalid argument"
          level=error ts=2019-03-14T02:11:33.16561207Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZE7XABXD41FC83PNPJ5ER.tmp: invalid argument"
          level=error ts=2019-03-14T02:13:33.870413801Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZHXRER4A1N97CW05ZTEXJ.tmp: invalid argument"
          level=error ts=2019-03-14T02:15:34.520194682Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZNKMEVP1DGZHDW3R7G0W8.tmp: invalid argument"
          level=error ts=2019-03-14T02:17:35.354180368Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZS9ERCBRE36SAMH39JC5T.tmp: invalid argument"
          level=error ts=2019-03-14T02:19:36.082133862Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5WZWZETN1T238CAKQ2SW49M.tmp: invalid argument"
          level=error ts=2019-03-14T02:21:36.810232358Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5X00NBJXWJBCM2N2EG5NH9T.tmp: invalid argument"
simonpasquier commented Mar 15, 2019

Do you use persistent storage for the data? If yes, what kind of storage?

dcvtruong commented Mar 15, 2019

Yes, using a storage class backed by Azure Files.

Here is the log of the compaction errors occurring late yesterday. The Prometheus server seemed normal for 3+ hrs after it came up, then the compaction messages started occurring. There's a different *.tmp dir for every new scrape interval? Does this mean the storage is corrupted? I can see the *.tmp dir being created in the Azure storage and then getting deleted dynamically.

The files in the wal dir stop at 00000006 with 0 MiB. Why do they stop there?

The latest data points are being collected regularly.

The TSDB head series count remains about the same, at ~195K.

      level=error ts=2019-03-15T04:11:42.779880204Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5ZRPZJQYRWJ0C98P2ZKH4Z6.tmp: invalid argument"
      level=error ts=2019-03-15T04:13:43.30694988Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5ZRTNCWR9ZNCRF0NSJ97WT4.tmp: invalid argument"
      level=error ts=2019-03-15T04:15:44.134036152Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5ZRYB3BXN6Q7KN4JSANCKCV.tmp: invalid argument"
      level=error ts=2019-03-15T04:15:54.340765719Z caller=federate.go:184 component=web msg="federation failed" err="write tcp 10.233.88.77:9090->10.233.108.0:61243: write: broken pipe"
      level=error ts=2019-03-15T04:17:11.828685297Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5ZS21366VDT6T9XTNH90KXH.tmp: invalid argument"
      level=error ts=2019-03-15T04:19:12.498811794Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5ZS4PQT53FHZVT2KP87BBV0.tmp: invalid argument"
      level=error ts=2019-03-15T04:21:13.289395879Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5ZS8CJK4GVNR6TEP3ZM9J70.tmp: invalid argument"
      level=error ts=2019-03-15T04:23:13.889615642Z caller=db.go:324 component=tsdb msg="compaction failed" err="persist head block: open index writer: sync dir: sync /data/01D5ZSC2JN2P9CGQ53FQ6GMFJP.tmp: invalid argument"
simonpasquier commented Mar 15, 2019

Hmm, I'm not familiar with Azure Files, but from its documentation it is based on the SMB protocol and probably isn't 100% POSIX compliant, as the Prometheus documentation mandates. For instance, we often see similar issues reported for NFS setups. You're better off asking Azure support directly.
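
As a quick check (a sketch, not something discussed in this thread), you can test whether the mounted volume supports fsync on a directory at all, since that is the call behind the "sync dir: ... invalid argument" error. A minimal Go sketch, run against the data mount path; the syncdir.go name and usage are made up:

      // syncdir.go (hypothetical helper, not part of Prometheus): check whether
      // a filesystem supports fsync on a directory. On POSIX filesystems this
      // succeeds; on SMB/CIFS mounts such as Azure Files (and some NFS setups)
      // it typically fails with EINVAL, i.e. "invalid argument".
      package main

      import (
          "fmt"
          "os"
      )

      func main() {
          if len(os.Args) != 2 {
              fmt.Fprintln(os.Stderr, "usage: syncdir <directory>")
              os.Exit(2)
          }

          // Open the directory itself, not a file inside it.
          d, err := os.Open(os.Args[1])
          if err != nil {
              fmt.Fprintln(os.Stderr, "open:", err)
              os.Exit(1)
          }
          defer d.Close()

          // fsync the directory, roughly what the TSDB does after creating the
          // temporary block directory during compaction.
          if err := d.Sync(); err != nil {
              fmt.Fprintln(os.Stderr, "directory fsync failed:", err)
              os.Exit(1)
          }
          fmt.Println("directory fsync succeeded")
      }

If this fails with "invalid argument" on the /data mount but succeeds on a local disk, the storage layer is the culprit rather than Prometheus.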

simonpasquier commented Mar 15, 2019

Also found a similar report for etcd: etcd-io/etcd#6984 (comment)


dcvtruong commented Mar 15, 2019

@simonpasquier, thanks for responding with the helpful info.

I'm trying to understand the behavior of the failure and, if possible, find the root cause. Why does the compaction error occur only 3+ hrs in, after prometheus-server has been running normally? How do I find out why the WAL stopped being written? Why does a *.tmp directory get created and then deleted dynamically on each scrape interval?

I've been doing a lot of searching in /var/log/messages and journalctl -xu kubelet but haven't found anything concrete in the logs.

simonpasquier commented Mar 18, 2019

Why does the compaction error occur only 3+ hrs in, after prometheus-server has been running normally?

The error is triggered when Prometheus wants to persist the data from the WAL to a block on disk, which only happens after a few hours of data have accumulated.

How do I find out why the WAL stopped being written?

As the compaction fails, the WAL is never reinitialized.

Why does a *.tmp directory get created and then deleted dynamically on each scrape interval?

Prometheus will clean up after itself if/when something goes wrong.

dcvtruong closed this Mar 18, 2019

dcvtruong reopened this Mar 18, 2019

dcvtruong closed this Mar 18, 2019

dcvtruong commented Mar 18, 2019

Hi @simonpasquier,

A follow-up to the WAL-related question: if the WAL is never reinitialized due to the compaction error, the block of samples is not persisted. But the metrics are still available to Grafana?

simonpasquier commented Mar 18, 2019

The write-ahead log (WAL) contains the metrics for the last 2 hours, which are also kept in memory. If Prometheus can't persist the WAL, those metrics will be lost on a restart/crash. In any case, running a Prometheus server on a filesystem known to cause problems isn't recommended.
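
One way to surface this failure mode earlier is to alert on Prometheus' own compaction-failure counter. A minimal rule sketch (an illustrative example, not something from this thread), using the stock prometheus_tsdb_compactions_failed_total metric and loaded via rule_files in prometheus.yml:

      groups:
        - name: tsdb
          rules:
            - alert: PrometheusCompactionFailing
              # Any failed compaction in the last 3 hours means new blocks are
              # not being persisted and in-memory data is at risk on restart.
              expr: increase(prometheus_tsdb_compactions_failed_total[3h]) > 0
              for: 10m
              labels:
                severity: warning
              annotations:
                summary: "Prometheus TSDB compactions are failing"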
