
Prometheus 2.0.0-beta5 doesn't recover nicely when running out of disk space #3283

Open
EdSchouten opened this Issue Oct 12, 2017 · 21 comments

EdSchouten (Contributor) commented Oct 12, 2017

What did you do?

I run a Prometheus server that briefly ran out of disk space earlier today. A colleague of mine made the volume larger.

What did you expect to see?

As soon as disk space becomes available again, Prometheus should continue its business.

What did you see instead? Under which circumstances?

Prometheus was unable to scrape any targets from then on. The targets page showed "WAL log samples: log series: write /prometheus/wal/000024: file already closed" next to every target in the table.

I tried restarting Prometheus, but it then refused to start, terminating almost immediately with the message below:

Oct 12 11:48:20 ... docker[2709]: level=error ts=2017-10-12T09:48:20.103205196Z caller=main.go:317 msg="Opening storage failed" err="validate meta \"/prometheus/wal/000025\": EOF"

/prometheus/wal/000025 was a zero-byte file. After doing an rm /prometheus/wal/000025, Prometheus continued as usual.

In short, there may be two issues here:

  • Prometheus cannot recover after disk space becomes available again.
  • Prometheus doesn't like empty files in the wal/ directory.
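To make the second point concrete, here is a minimal Go sketch (not the actual tsdb code; pruneEmptySegments and the /prometheus/wal path are purely illustrative) of the kind of startup check that would make zero-byte segment files harmless: scan the WAL directory and drop empty files before the WAL is opened.

```go
// Hypothetical sketch: remove zero-length segment files from a WAL directory
// so that a truncated segment left behind by a full disk does not abort startup.
// This is NOT the actual tsdb behaviour, just an illustration of the idea.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pruneEmptySegments deletes zero-byte files found directly inside walDir.
func pruneEmptySegments(walDir string) error {
	entries, err := os.ReadDir(walDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			return err
		}
		if info.Size() == 0 {
			path := filepath.Join(walDir, e.Name())
			fmt.Printf("removing empty WAL segment %s\n", path)
			if err := os.Remove(path); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	if err := pruneEmptySegments("/prometheus/wal"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```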

Environment

  • System information:

Linux 3.16.0-4-amd64 x86_64

  • Prometheus version:

2.0.0-beta5

mdlayher (Member) commented Oct 18, 2017

smarterclayton commented Oct 19, 2017

Also seen this:

level=warn ts=2017-10-19T13:01:52.734909006Z caller=scrape.go:702 component="target manager" scrape_pool=kubernetes-cadvisor target=https://172.31.25.22:10250/metrics/cadvisor msg="append failed" err="WAL log samples: log series: write data/wal/000131: file already closed"
level=warn ts=2017-10-19T13:01:53.189146997Z caller=scrape.go:698 component="target manager" scrape_pool=kubernetes-nodes target=https://172.31.28.57:10250/metrics msg="append failed" err="WAL log samples: log series: write data/wal/000131: file already closed"
level=warn ts=2017-10-19T13:01:53.190981663Z caller=scrape.go:702 component="target manager" scrape_pool=kubernetes-nodes target=https://172.31.28.57:10250/metrics msg="append failed" err="WAL log samples: log series: write data/wal/000131: file already closed"
level=warn ts=2017-10-19T13:01:53.347227315Z caller=scrape.go:698 component="target manager" scrape_pool=kubernetes-nodes target=https://172.31.19.15:10250/metrics msg="append failed" err="WAL log samples: log series: write data/wal/000131: file already closed"
EdSchouten (Contributor, Author) commented Oct 19, 2017

@smarterclayton Did this also happen as a result of Prometheus running out of disk space, or did it occur arbitrarily?

smarterclayton commented Oct 19, 2017

It was very close to being out of disk (174GB used out of 187GB), so I suspect it was. But I didn't actually see anything closer to 187GB.

EdSchouten (Contributor, Author) commented Oct 19, 2017

It may well be the case that you did hit the limit, though. Many file systems keep a certain percentage of free space reserved to prevent excessive block fragmentation.
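For context, one way to see that reserve from Go on Linux (a hedged, Linux-only sketch; the /prometheus mount point is just an example) is to compare Bfree, the blocks free overall, with Bavail, the blocks available to unprivileged processes; the difference is the reserved space that Prometheus, running as an ordinary user, cannot use.

```go
// Sketch: report how much of a filesystem's free space is reserved for
// privileged use. Linux-only; the mount point below is an example.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	const mount = "/prometheus" // example path

	var st syscall.Statfs_t
	if err := syscall.Statfs(mount, &st); err != nil {
		panic(err)
	}

	bs := uint64(st.Bsize)
	free := st.Bfree * bs   // free blocks, including reserved ones
	avail := st.Bavail * bs // free blocks available to non-root processes

	fmt.Printf("free: %d bytes, available to non-root: %d bytes, reserved: %d bytes\n",
		free, avail, free-avail)
}
```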

fabxc (Member) commented Oct 20, 2017

@smarterclayton do you recall whether 000131 was the most recent WAL segment? Any other messages related to the WAL in the logs before?

smarterclayton commented Oct 23, 2017

Recreated prom.walcrash.tar.gz

Included both the list of files and the full log.

smarterclayton commented Oct 23, 2017

level=info ts=2017-10-23T17:37:37.945952146Z caller=head.go:335 component=tsdb msg="head GC completed" duration=5.144256688s
level=info ts=2017-10-23T17:38:01.515384619Z caller=head.go:344 component=tsdb msg="WAL truncation completed" duration=23.56936641s
level=info ts=2017-10-23T17:38:07.592599758Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508765400000 maxt=1508765520000
level=info ts=2017-10-23T17:39:53.754682751Z caller=head.go:335 component=tsdb msg="head GC completed" duration=6.13891679s
level=error ts=2017-10-23T17:40:05.511584981Z caller=head.go:346 component=tsdb msg="WAL truncation failed" err="write data/wal/compact.tmp: no space left on device" duration=11.756848013s
level=info ts=2017-10-23T17:40:11.710384646Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508765520000 maxt=1508765640000
level=error ts=2017-10-23T17:40:11.726462581Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: mkdir data/01BX5542SY4TB7Q6Z2SJT8YRHQ.tmp: no space left on device"
level=info ts=2017-10-23T17:40:13.015288124Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508765520000 maxt=1508765640000

and files:

...
./data/wal/000148
./data/wal/000149
./data/wal/000150
./data/wal/000151
./data/wal/compact.tmp
kargakis commented Nov 2, 2017

I saw Prometheus cluttered with the following logs yesterday in an instance we run in our CI cluster:

level=error ts=2017-11-01T17:18:22.102842875Z caller=wal.go:683 component=tsdb msg="sync failed" err="flush buffer: write data/wal/000017: file already closed"
level=warn ts=2017-11-01T17:18:24.617463917Z caller=scrape.go:673 component="target manager" scrape_pool=kubernetes-nodes target=https://10.128.0.8:10250/metrics msg="append failed" err="WAL log samples: log series: write data/wal/000017: file already closed"

Seems like prom was hosed and stopped collecting metrics. I couldn't access the console, but that may be due to our proxy.

Running openshift/prometheus:v2.0.0-rc.2.

I've since restarted prom because of a different issue I wanted to debug, but I guess it is going to recur.

TimSimmons commented Nov 14, 2017

Note: this was a case of mistaken ncdu usage; never use it when looking for exact sizes. When I ran out of space, I did have similar issues, though.

I saw this with Prometheus 2.0.0 today

log series: write /data/prometheus/wal/000815: file already closed"

Presumably because the disk was reporting full; I saw the same with df.

So I started looking at what was using disk space with du and ncdu, but I only found enough usage to account for about 30% of the disk. Eventually I deleted a bunch of files (maybe 2-5% of the disk), restarted Prometheus, and everything returned to normal. df reported ~30% usage again.

I then looked at some of my other 2.0 servers and noticed the same gap in reporting from du and df that I don't see with 1.6.x.

For example:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       630G  248G  357G  41% /
$ ncdu
...
Total disk usage:  82.1GiB 

Looking at a 1.6.1 server.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       630G   88G  517G  15% /
$ ncdu
....
 Total disk usage:  87.5GiB

Looking into that discrepancy led me to https://serverfault.com/questions/490704/ubuntu-server-hard-drive-always-full

The usual cause for what you are seeing is a file that has been deleted but hasn't yet been closed by the process that is writing to it.
As the file has been deleted, du won't see a directory entry for it and therefore cannot include it in its report.
As the file is still open, the blocks it uses aren't free so df will report that they are in use.
You should be able to track down the file that is causing the problem with
lsof +L1

However, I don't find anything with lsof +L1, or with find /proc/*/fd -ls | grep '(deleted)' from https://unix.stackexchange.com/questions/68523/find-and-remove-large-files-that-are-open-but-have-been-deleted

Is it possible that Prometheus is keeping those files open longer than needed? But maybe somehow I can't find them?
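For what it's worth, the same check as the find one-liner above can be done from Go (a rough, Linux-only sketch, not something Prometheus itself does): walk /proc/*/fd and report descriptors whose link targets are marked "(deleted)".

```go
// Rough Linux-only sketch: list file descriptors that point at deleted files,
// roughly equivalent to `find /proc/*/fd -ls | grep '(deleted)'`.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	fds, err := filepath.Glob("/proc/[0-9]*/fd/*")
	if err != nil {
		panic(err)
	}
	for _, fd := range fds {
		target, err := os.Readlink(fd)
		if err != nil {
			continue // process or fd went away, or permission denied
		}
		if strings.HasSuffix(target, "(deleted)") {
			fmt.Printf("%s -> %s\n", fd, target)
		}
	}
}
```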

ankon commented Nov 16, 2017

I just hit the same issue with prometheus 2.0.0: my disk ran full, prometheus started to complain about that, and eventually got itself into the 'file already closed' loop, continuously reporting this error for the same file. I have a log available if you need it.

Mattias- commented Jan 12, 2018

Saw this as well today

version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
host_details="(Linux 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 hostname (none))"

Data dir filled up pretty fast with .tmp dirs at a rate of 200G per hour.

efahale commented Jan 24, 2018

Hi,
I have seen this as well and I couldn't get it to work even after I removed the file it was complaining about. Has this been fixed in v2.1.0?

stefreak commented Feb 12, 2018

I have a similar issue with prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0), but in my case the disk isn't full!

/dev/mapper/data_hdd-prometheus     3.5T  1.5T  2.0T  42% /srv/prometheus

Inode usage is also only 1%:

/dev/mapper/data_hdd-prometheus      175M  1.6M  174M    1% /srv/prometheus

Also ncdu reports:

Total disk usage:   1.4 TiB  Apparent size:   1.4 TiB  Items: 1657696

Prometheus stopped collecting metrics with the error WAL log samples: log series: write /srv/prometheus/metrics/wal/009807: file already closed

We'll go ahead and test v2.1.0; there is a changelog entry that sounds like it could potentially be related: [BUGFIX] tsdb: Close WAL while shutting down.

shomron commented Feb 12, 2018

@stefreak this sounds like a separate issue I've encountered of exceeding the process ulimit for open files. Search your logs for too many open files.
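As a hedged aside, the process's open-file limit can be checked with ulimit -n in the shell, or programmatically; a minimal Go sketch (Linux) that prints the soft and hard limits:

```go
// Minimal sketch: print the current and maximum open-file limits of this
// process (the values `ulimit -n` / `ulimit -Hn` would show).
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft limit %d, hard limit %d\n", rl.Cur, rl.Max)
}
```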

iksaif (Contributor) commented Apr 9, 2018

Same issue, see:

Apr 09 07:04:22 mesos-slave005-pa4 prometheus_meta_local:  [ 1901bc3a327011e8a5c05065f387ef80 ] level=error ts=2018-04-09T07:04:22.782402003Z caller=db.go:281 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
Apr 09 07:05:22 mesos-slave005-pa4 prometheus_meta_local:  [ 1901bc3a327011e8a5c05065f387ef80 ] level=error ts=2018-04-09T07:05:22.9001564Z caller=db.go:281 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
Apr 09 07:43:30 mesos-slave005-pa4 prometheus_meta_local:  [ 1901bc3a327011e8a5c05065f387ef80 ] level=error ts=2018-04-09T07:43:30.747493475Z caller=db.go:281 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"

Would be great if the health check could catch that (cc: @simonpasquier)

krasi-georgiev (Member) commented Jul 9, 2018

Replying to the original issue. Everyone else, please open a new one and we will troubleshoot those as well.

Prometheus doesn't like empty files in the wal/ directory.

This is fixed now in tsdb and also added to Prometheus. I just tested this locally and it works as expected: the empty WAL files are deleted.

Prometheus cannot recover after disk space becomes available again.

Looking at the code, the errors are returned when making OS calls, so I don't think we can do anything to handle this, and I also can't find an easy way to test it locally. I will test again locally to see if there is anything we can do to recover after freeing some space.

TroubleshooteRJ commented Aug 7, 2018

I'm running prometheus, version 2.3.1, and I keep getting the following error: WAL log samples: log series: write /var/lib/prometheus/wal/001530: disk quota exceeded
After increasing the LXC size from 32GB to 80GB we still get the same error for the same old WAL file:

msg="append failed" err="WAL log samples: log series: write /var/lib/prometheus/wal/001530: disk quota exceeded"

rakesh.jain@labs-monitor:/var/lib/prometheus/wal$ ls -ltr
total 446203
-rw-r--r-- 1 prometheus prometheus 268428668 Jul 20 12:06 001527
-rw-r--r-- 1 prometheus prometheus 268434456 Jul 20 12:31 001528
-rw-r--r-- 1 prometheus prometheus 268428273 Jul 20 12:56 001529
-rw-r--r-- 1 prometheus prometheus 268435456 Jul 20 12:59 001530
-rw-r--r-- 1 prometheus prometheus 62241086 Aug 7 11:27 000001

rakesh.jain@labs-monitor:/var/lib/prometheus/wal$ du -sh *
6.8M 000001
137M 001527
137M 001528
137M 001529
19M 001530

Do I need to restart the Prometheus service?
I am scared to restart Prometheus.

VLZZZ commented Feb 8, 2019

Same here on 2.4.3

/prometheus $ prometheus --version
prometheus, version 2.4.3 (branch: HEAD, revision: 167a4b4e73a8eca8df648d2d2043e21bdb9a7449)
  build user:       root@1e42b46043e9
  build date:       20181004-08:42:02
  go version:       go1.11.1

We ran out of space, then resized the Kubernetes PVC and restarted Prometheus, and got the same behavior.

df -h | grep vol_a00ededc1b710723062d01c73d1bd0bd
10.105.11.124:vol_a00ededc1b710723062d01c73d1bd0bd   30G   23G  8.0G  74% /var/lib/kubelet/pods/989c20f5-2baf-11e9-b704-068b1510facd/volumes/kubernetes.io~glusterfs/pvc-f629933a-dde3-11e8-b704-068b1510facd

Prometheus logs:

level=info ts=2019-02-08T14:42:23.067747049Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1549548000000 maxt=1549555200000 ulid=01D34EDS3ZJ6E1FXZNYCK62D1V
level=info ts=2019-02-08T14:42:24.991005996Z caller=main.go:423 msg="Stopping scrape discovery manager..."
level=info ts=2019-02-08T14:42:24.991176943Z caller=main.go:437 msg="Stopping notify discovery manager..."
level=info ts=2019-02-08T14:42:24.991223154Z caller=main.go:459 msg="Stopping scrape manager..."
level=info ts=2019-02-08T14:42:24.991269578Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-02-08T14:42:24.991330922Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-02-08T14:42:24.991377881Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2019-02-08T14:42:24.991422804Z caller=main.go:608 msg="Notifier manager stopped"
level=info ts=2019-02-08T14:42:24.991486672Z caller=main.go:419 msg="Scrape discovery manager stopped"
level=info ts=2019-02-08T14:42:24.991547567Z caller=main.go:433 msg="Notify discovery manager stopped"
level=info ts=2019-02-08T14:42:24.991606663Z caller=main.go:453 msg="Scrape manager stopped"
level=error ts=2019-02-08T14:42:24.99180431Z caller=main.go:617 err="opening storage failed: stat /prometheus/wal/00001371: no space left on device"

wal file on node:

/prometheus-db/wal# ll 00001371
-rw-r--r-- 1 1000 2018 132960256 Feb  8 16:41 00001371

After removing this file and restarting Prometheus, everything works fine.
Are there any updates on this issue?

krasi-georgiev (Member) commented Feb 8, 2019

Nothing yet; still waiting for @nipuntalukdar to find the time to tackle this in #3283, otherwise I might give it a try after clearing some other pending PRs.
Any help, as always, is appreciated.

krasi-georgiev (Member) commented Apr 15, 2019

I just did, and tested prometheus/tsdb#582.

Once the file system has more free space, Prometheus will continue to write to the WAL.

I am not sure whether we should make it hard-fail instead of keeping on retrying and flooding the error log with messages, which might cause other issues.

Option 1 - continue retrying until it succeeds or Prometheus is terminated.
Option 2 - exit at the first write failure, or maybe exit after, let's say, 100 attempts.

I think I am leaning towards option 2 - hard fail after some number of attempts.
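As a rough sketch of option 2 (not the actual tsdb implementation; writeWithRetry and maxAttempts are made up for illustration), the write path could retry a bounded number of times while the error is ENOSPC, on the assumption that an operator may free disk space, and hard-fail after that:

```go
// Hypothetical sketch of option 2: retry a WAL write while the disk is full,
// but give up (hard fail) after a bounded number of attempts.
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

const maxAttempts = 100 // illustrative bound from the comment above

// writeWithRetry retries write only when it fails with ENOSPC.
func writeWithRetry(write func() error) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := write()
		if err == nil {
			return nil
		}
		if !errors.Is(err, syscall.ENOSPC) {
			return err // not a disk-full condition; fail immediately
		}
		fmt.Printf("WAL write failed (attempt %d/%d): %v; retrying\n", attempt, maxAttempts, err)
		time.Sleep(time.Second)
	}
	return fmt.Errorf("WAL write failed %d times with ENOSPC; giving up", maxAttempts)
}

func main() {
	// Example use with a stand-in write function that always reports ENOSPC.
	err := writeWithRetry(func() error { return syscall.ENOSPC })
	fmt.Println(err)
}
```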
