TSDB Compaction issue - 2.2.1 #4108

VR6Pete commented Apr 23, 2018

As discussed on the Prometheus users group:
https://groups.google.com/forum/#!topic/prometheus-users/uAZ6ALu9AbU

All storage was consumed on a 500GB volume shortly after midnight.

Version: 2.2.1

I have posted all the information requested by Ben in that thread.

Comments

Please attach any log files you can to this issue.

Unfortunately I cannot locate any Prometheus logs on my server. I had to power off the instance as the server became unresponsive, and I cannot find any service logs for Prometheus.

Hi there @SuperQ, I have set it to expire after 7 days:
https://snapshot.raintank.io/dashboard/snapshot/MJJf2xYBrfsGjh4YCXcLyGb2A6XrpbOE

@VR6Pete That's not exactly what I was looking for, but I think it's close enough to show that this is indeed an OOM-related failure, possibly related to the large compaction. Having a snapshot of this dashboard would be useful:

colonha commented Apr 25, 2018

Also confirming that this is happening to me: Prometheus v2.2.1 with a fresh new database.

shuz commented Apr 30, 2018

We are seeing the same issue and the same error log in our cluster.

Can you give some info and steps on how to replicate?

matejkloska commented May 8, 2018

This is from our production logs. I hope it helps with the investigation.

@matejkloska how often does it happen, and do you remember any details about what triggered it?

The panic you see @matejkloska is coming from a different place and has been fixed; we need to update the vendored TSDB. As for the out-of-order compaction, I'll take a deeper look now.

Also, are any of you doing deletes on the servers?

colonha commented May 9, 2018

What kind of deletes? Since my server crashed and I didn't want to start with an empty DB, I deleted the corrupted series in order to get my Prometheus setup back up and running. Not sure if this is the kind of delete you are talking about. Please let me know if you require more detailed information.

shuz commented May 9, 2018

Yes, we had an incident where it kept compacting the same set of blocks, and we deleted that folder as a workaround. It has only happened to us 2-3 times in the past 2 months, and we usually end up either finding the bad block and deleting it or, if it's not production data, just starting clean. But we also don't know the repro steps to trigger it.

matejkloska commented May 9, 2018

@krasi-georgiev I see this error every time I try to remove the series mentioned in the compaction error. @gouthamve please, how can I update the vendored tsdb in a Prometheus that is deployed in k8s?

@colonha I meant the Delete Series API. One of the reasons I am asking is that this error "should" never happen at all: we are sorting the series before adding them, so it is a little weird that this is happening, and I'm trying to figure out why.

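For reference, a minimal sketch of calling that delete API from Go. It assumes a Prometheus that exposes the v1 admin API, listens on localhost:9090, and was started with --web.enable-admin-api; the series selector is only an example, so adjust all of these to your setup:

```go
// Sketch: call the TSDB admin "delete series" endpoint.
// Assumes Prometheus runs on localhost:9090 and was started with
// --web.enable-admin-api; the metric selector is only an example.
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Add("match[]", `up{job="example"}`) // example selector, not from this issue

	resp, err := http.Post(
		"http://localhost:9090/api/v1/admin/tsdb/delete_series?"+params.Encode(),
		"application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status) // expect a 2xx status on success
}
```

The same request can of course be made with curl; the point is only that deletes go through the admin API rather than by touching files on disk.
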
gouthamve added the component/local storage label on May 10, 2018

After spending several hours scratching my head, I still have no idea why this is happening, and I am not able to reproduce it. I would really appreciate it if you could provide the data directory when you hit the error again, so that I can reproduce it locally and see why this is happening. I would only need the

strowi commented May 10, 2018

Maybe this is related to #3487?

mysteryegg commented May 16, 2018

Running the docker image prom/prometheus:v2.2.1 in a pod in OpenShift Origin 3.7, with /prometheus mounted to a local hostPath (backing storage is provided to the hypervisor by a SAN, so NFS is not involved), I encountered OOM errors due to Prometheus exceeding the 3Gi of RAM it had been allocated. After resolving the OOM condition (not sure if the RAM consumption is tied to the same issue), I saw this in the logs:

I recovered over 10GB of disk space by deleting the .tmp directories Prometheus had generated, but the compaction errors continue to be logged. I can provide some data if this is considered the same issue.

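For anyone cleaning up the same leftover files, a rough Go sketch that only lists `*.tmp` directories under the data directory and their sizes, so they can be reviewed before anything is removed (and removal should only happen with Prometheus stopped). The `/prometheus` path is just the mount point described above, not something mandated by Prometheus:

```go
// Sketch: list leftover *.tmp block directories and their sizes so they
// can be reviewed before removal. Only delete them with Prometheus stopped.
// The data directory path is an assumption; adjust it to your mount point.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func dirSize(path string) (int64, error) {
	var size int64
	err := filepath.Walk(path, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			size += info.Size()
		}
		return nil
	})
	return size, err
}

func main() {
	dataDir := "/prometheus" // assumed mount point
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if e.IsDir() && strings.HasSuffix(e.Name(), ".tmp") {
			p := filepath.Join(dataDir, e.Name())
			size, _ := dirSize(p)
			fmt.Printf("%s\t%d bytes\n", p, size)
		}
	}
}
```
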
mysteryegg commented May 16, 2018

Reproduced the compaction errors after upgrading another instance from prom/prometheus:v2.0.0 to prom/prometheus:v2.2.1 in Docker Swarm Mode (Docker 17.12.0-ce on CentOS 7.4), with a bind mount to a host path for the /prometheus data directory; the block device (a virtual hard disk) is formatted as XFS. In this instance, it seemed to be related to deleting the /prometheus data while preserving the rest of the Prometheus data. I falsely assumed that wiping /prometheus would result in a fresh TSDB, but in the logs I saw:

If you still have the data I would like to take a look: kgeorgie@redhat.com

mysteryegg commented May 17, 2018

Is it possible that an out-of-order series could be introduced by an orchestrator running a second replica of Prometheus writing to a common location?

The file locking on the tsdb should catch that.

@mysteryegg Actually, this might be a very good suggestion. I had a quick look, and it seems that the lock check will ignore the lock file when the PID from the file doesn't match a running process, which might be the case when running inside two different containers.

Well, that's a bug. We should be able to rely on just normal file locking rather than inventing our own thing.

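For context on the distinction being discussed: a PID-file check only works if the stale PID is visible to the current process, which is not the case across container PID namespaces, whereas an OS-level advisory lock is enforced by the kernel on the shared file itself. A minimal sketch of the flock-style approach, simplified, Linux-only, and not the actual tsdb implementation:

```go
// Sketch of an OS-level advisory lock on the data directory (Linux-only,
// simplified; not the code used by tsdb). Unlike a PID file, the kernel
// enforces this even when the other holder runs in a different container
// sharing the same local volume (flock semantics on network filesystems vary).
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func lockDataDir(dir string) (*os.File, error) {
	f, err := os.OpenFile(filepath.Join(dir, "lock"), os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	// Non-blocking exclusive lock: fails immediately if another process holds it.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("data directory %q is already locked: %w", dir, err)
	}
	return f, nil
}

func main() {
	f, err := lockDataDir("./data") // example path
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	defer f.Close() // the lock is released when the descriptor is closed
	fmt.Println("acquired lock; safe to open the TSDB")
}
```
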
we are using

See also prometheus/tsdb#178 and #2689.

+1

@mysteryegg I have checked the WAL you sent me and it replicates the error. What I found so far confirms your suggestion that more than one Prometheus instance has written to the WAL files at the same time. The twist here is that there is an incorrect matching between the series hashes and the sample references. I will try to replicate it by running more than one Prometheus instance to confirm.

krasi-georgiev referenced this issue on May 21, 2018: Not Deleting Old Data After TSDB Retention Passed #4176 (closed)

Although the logs look similar, I think we have a few different issues here. It bugs me how to find a better way to replicate these, since sharing the data is always an issue because it can expose company details.

Hi, I managed to replicate it thanks to the data sent to me by @mysteryegg (sorry I couldn't take a look earlier). This could be caused by two Prometheus instances writing to the same directory. From your data, what I see is that there are two different series with the same series ID in the WAL. What happens then is that this gets called twice with the same [...], so we essentially create [...]. Again, huge thanks to @mysteryegg for shipping me the data! I am pretty sure there is no way that two series could have the same ID in the WAL unless two different Prometheus instances are writing to it, though I'll keep digging to confirm that this is the case.

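To make that failure mode concrete, a loose illustration using hypothetical types (not the real tsdb structures): two writers each produce an ordered stream of samples but hand out the same series ID for different series, and once the two streams are merged under that one ID the result is no longer ordered, which is exactly what a later compaction would reject:

```go
// Loose illustration only: hypothetical types, not the real tsdb code.
// Two writers each produce an ordered sample stream, but both use series
// ID 1 for *different* series. Replaying the combined WAL attributes all
// samples to one series, and the result is out of order.
package main

import "fmt"

type sample struct {
	t int64   // timestamp (ms)
	v float64 // value
}

func main() {
	// Writer A's samples for its series with ID 1 (ordered).
	a := []sample{{t: 1000, v: 1}, {t: 3000, v: 2}, {t: 5000, v: 3}}
	// Writer B's samples for a *different* series, also stored under ID 1 (ordered).
	b := []sample{{t: 2000, v: 10}, {t: 4000, v: 20}}

	// Replay as it would appear in a shared WAL: A's segment, then B's.
	merged := map[uint64][]sample{}
	merged[1] = append(merged[1], a...)
	merged[1] = append(merged[1], b...)

	// A compaction-style pass expects strictly increasing timestamps per series.
	for id, samples := range merged {
		for i := 1; i < len(samples); i++ {
			if samples[i].t <= samples[i-1].t {
				fmt.Printf("series %d: out-of-order sample at %d (previous %d)\n",
					id, samples[i].t, samples[i-1].t)
			}
		}
	}
}
```
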
shuz commented May 24, 2018

It feels like we hit the same issue, because when it happens, we get some wrong query results.

@shuz the issue might be caused by more than one Prometheus instance writing to the same files at some point, and not necessarily at the time you are seeing the error logs. In other words, if more than one Prometheus instance ever wrote to the same folder, expect the unexpected at any time after that.

shuz commented May 31, 2018

@krasi-georgiev Thank you. We reviewed the Prometheus metrics and did find that another Prometheus 2.0.0 was running for a short period of time before disappearing. We finally tracked down an issue in our k8s cluster that led to some unexpected pods being run on our cluster.

@shuz glad we could resolve it.

Added to the wiki FAQ section.

I think we managed to find the cause for this, so closing now. As a side note, Fabian replaced the file locking with a different package, so in theory this should prevent starting more than one Prometheus instance with the same data folder.

krasi-georgiev closed this on Jun 5, 2018

amitsingla commented Jun 5, 2018

@krasi-georgiev I have a Prometheus federation running which gathers metrics from 5 different Prometheus servers running on different clusters. I am only persisting data on the federation Prometheus, so all metric data is written to one persistent volume.

Logs snippet:

@amitsingla this issue got a bit mixed up and covers reports of a few different issues, as per my comment above. In your case it doesn't matter that the Prometheus server gathers data from different servers: as long as only a single server writes the collected data to the disk folder, it will all work as planned. In your case I suspect you are hitting the first issue in my comment.

If you believe your issue is different, would you mind opening a new issue with a bit more detail, as this one is already quite mixed up.

amitsingla commented Jun 25, 2018

@krasi-georgiev As you suggested, I have increased the memory for my Prometheus federation from 8GB to 12GB and monitored it for 2 weeks. I did not face the *.tmp files issue in the last 2 weeks. Prometheus server version: 2.2.0. I also upgraded my Prometheus federation from 2.2.0 to 2.3.0 a few days back, and so far there have been no issues with 2.3.0.

@amitsingla thanks for the update, much appreciated. We have also started discussing how to decrease the memory usage during compaction.

amitsingla commented Jun 26, 2018

@krasi-georgiev After 3 weeks and the upgrade to 2.3.0, I faced a CLBO issue today with the error below. Rolling back to 2.2.0 fixed it, so I will wait to upgrade again until the above issue is fixed.

{"log":"level=info ts=2018-06-26T16:06:30.927238649Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530005400000 maxt=1530007200000 ulid=01CGY3J66ZFC9GXQXYV2T1CKN

What is CLBO? Was this caused by another OOM?

krasi-georgiev reopened this on Jun 26, 2018

@amitsingla is not the OP, and their log file is different. This is a different issue. Please file new issues as new bugs rather than chiming in on unrelated closed bugs.

brian-brazil closed this on Jun 26, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.