Getting many compaction failed with invalid argument error in prometheus server log #5362
Comments
Do you use persistent storage for the data? If yes, what kind of storage?
Yes, using a storage class via Azure Files. Here is the log of the compaction error occurring late yesterday. The Prometheus server seems to run normally for 3+ hours after it comes up, then the compaction message starts occurring. There's a different *.tmp dir for every new scrape interval? Does this mean the storage is corrupted? I can see the *.tmp dir being created in the Azure storage and then getting deleted dynamically. The files in the wal dir stop at 00000006 with 0 MiB. Why does it stop there? The latest data points are being collected regularly. The TSDB head series count remains at roughly the same level, about 195K.
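For reference, one way to inspect the WAL segments and leftover compaction temp dirs is a quick listing like the sketch below. The `/prometheus` path is an assumption; substitute whatever your `--storage.tsdb.path` points at.

```shell
# /prometheus is an assumption: substitute your --storage.tsdb.path.
DATA_DIR=${DATA_DIR:-/prometheus}

# WAL segments; a healthy server keeps appending new numbered segments.
ls -lh "$DATA_DIR/wal"

# Leftover compaction temp dirs; these should not accumulate.
ls -d "$DATA_DIR"/*.tmp 2>/dev/null || echo "no leftover .tmp dirs"
```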
Hmm, I'm not familiar with Azure Files, but from the documentation it is based on the SMB protocol, so it probably isn't 100% POSIX-compliant as mandated by the Prometheus documentation. For instance, we often see similar issues reported for NFS setups. You're better off asking Azure support directly.
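One way to confirm which filesystem backs the data directory is a quick `stat` check, as in the sketch below (GNU coreutils assumed; the `/prometheus` path is an assumption, substitute your `--storage.tsdb.path`).

```shell
# /prometheus is an assumption: substitute your --storage.tsdb.path.
DATA_DIR=${DATA_DIR:-/prometheus}

# Print the filesystem type backing the data directory.
# Network filesystems such as cifs/smb (Azure Files) or nfs are
# not fully POSIX-compliant and are known to cause problems.
stat -f -c 'filesystem type: %T' "$DATA_DIR"
```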
Also found a similar report for etcd: etcd-io/etcd#6984 (comment)
@simonpasquier, thanks for responding with the helpful info. I'm trying to understand the behavior of the failure and, if possible, find the root cause. Why does the compaction error occur only 3+ hours in, after prometheus-server has been running normally? How do I find out why the WAL stopped being written? Why does the *.tmp directory get created and deleted dynamically on each scrape interval? I'm doing a lot of searching through /var/log/messages and journalctl -xu kubelet but not finding concrete info in the logs.
The error is triggered when Prometheus wants to persist the WAL to disk, which only happens after a while.
As the compaction fails, the WAL is never reinitialized.
Prometheus will clean up after itself if/when something goes wrong.
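For context on the timing question: in Prometheus 2.x the TSDB cuts its first persisted block once the in-memory head spans more than 1.5x the block range (2 hours by default), i.e. roughly 3 hours after startup, which lines up with the reported "normal for 3+ hrs, then errors" behavior. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope: when does the first head compaction fire?
# Prometheus TSDB (v2.x) cuts a persisted block once the in-memory
# head spans more than 1.5x the block range (default: 2 hours).
BLOCK_RANGE_HOURS = 2.0
first_compaction_hours = BLOCK_RANGE_HOURS * 1.5
print(first_compaction_hours)  # 3.0 hours after the first sample
```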
dcvtruong closed this Mar 18, 2019
dcvtruong reopened this Mar 18, 2019
dcvtruong closed this Mar 18, 2019
Hi @simonpasquier, a follow-up to the WAL question. If the WAL is never reinitialized due to the compaction error, the block of samples is not persisted. But are the metrics still available to Grafana?
The write-ahead log (WAL) contains the metrics for the last 2 hours, which are also kept in memory. If Prometheus can't persist the WAL, those metrics will be lost on a restart or crash. In any case, running a Prometheus server on a filesystem known to cause problems isn't recommended.
dcvtruong commented Mar 14, 2019
Bug Report
What did you do?
I noticed many compaction errors after the Prometheus server pod got scheduled over to a different k8s worker node.
What did you expect to see?
Not so many compaction errors in the Prometheus server log.
What did you see instead? Under which circumstances?
Repeated compaction errors with invalid argument.
Environment
System information:
Linux 3.10.0-957.1.3.el7.x86_64 x86_64
Prometheus version:
2.6.1
Prometheus configuration file: