Prometheus 2.0 fails to start up after couple of restarts #3191
Comments
Thanks for reporting. You are likely hitting the startup deadlock reported previously in #3185. It got fixed in prometheus/tsdb#146. We fixed that and a couple more things over the last few days. I'm currently testing things with this container image: quay.io/fabxc/prometheus:v2.0.0-beta.4-debug-2955. You can try it out yourself in the meantime. Those changes also include the web server being reachable while we are restoring the database on restart.
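For anyone trying this out on Kubernetes, a rough sketch of pointing a pod template at the debug image mentioned above; only the image tag comes from this thread, the container name and surrounding spec are illustrative assumptions:

```yaml
# Hypothetical pod template fragment; only the image tag is taken from the
# comment above, everything else is illustrative.
spec:
  containers:
    - name: prometheus                                           # assumed container name
      image: quay.io/fabxc/prometheus:v2.0.0-beta.4-debug-2955   # debug image from this thread
```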
@fabxc I've rolled out your version across all of our non-production clusters to see whether the issue goes away and things run well. So far things seem okay, apart from having to wipe out all data. I hope the data format stabilizes quickly :)
So far so good with the debug version, while the beta.3 instances were hanging again.
We experienced the same problem. Deploying the debug version made it work again.
I do still run into the following issue(s) with this debug version:
Can you share the exact logs you are seeing? Also, did you run this against storage written by a beta.4?
Never mind, I also just hit that one now. I'll investigate.
It was indeed fresh data since upgrading to your debug version. Hope you find the cause and a fix.
Thinking about it – this may actually be intended behavior and we are just forgetting to skip the logging. The log line was added just recently: prometheus/tsdb@162a48e
Eventually the instance outputting these logs started; it just took quite long, about 5 minutes.
Thanks @JorritSalverda – this was a legitimate semantic race happening. Fix in prometheus/tsdb#150.
I hope you have a test version soon; our production instances have gone into CrashLoopBackOff state, probably because recovery takes too long. Happy to test your fix.
Upgrading to
@fabxc so far all of our 24 Prometheus instances are running fine. Feel free to close the issue. I'll reopen or create a new one if anything new pops up. Thanks for your super fast fixes!
Awesome, thank you so much!
fabxc closed this Sep 22, 2017
cauwulixuan commented Jan 20, 2018
Hi, I ran into the same issue here. The pod status turned CrashLoopBackOff and it restarts over and over again, but still fails. From the logs I still can't figure out what is going on here; any suggestions? Many thanks.
cauwulixuan referenced this issue Jan 20, 2018 (closed):
Opening storage failed" err="invalid block sequence" #3714
Kirchen99 commented Apr 3, 2018
Prometheus 2.2.1 fails to start up with a similar issue:
adilnaimi commented Apr 4, 2018
Same issue.
Hi @adilnaimi @Kirchen99, depending on the size of the data it might take a minute or two to start. If it never starts, please open a new issue describing the problem in detail.
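In case it helps others who hit CrashLoopBackOff while the storage is being restored, a rough sketch of relaxing the liveness probe in a Kubernetes pod spec so the kubelet does not kill Prometheus during a long startup; the endpoint path, port and timings are illustrative assumptions, not values from this thread:

```yaml
# Illustrative liveness probe fragment for a Prometheus container.
# Path, port and timings are assumptions; tune them to your own startup time.
livenessProbe:
  httpGet:
    path: /-/healthy          # Prometheus 2.x health endpoint, assuming it is exposed
    port: 9090
  initialDelaySeconds: 300    # allow several minutes for WAL replay / crash recovery
  periodSeconds: 15
  failureThreshold: 20        # tolerate a long recovery before restarting the pod
```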
adilnaimi commented Apr 4, 2018
Maybe this can help someone else: please check your configuration files. In my case, it was my configuration file.
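For anyone sanity-checking their configuration as suggested above, a minimal sketch of a prometheus.yml that should load cleanly; the job name and target are placeholders, not values from this thread:

```yaml
# Minimal illustrative prometheus.yml; job name and target are placeholders.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```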
@adilnaimi That should error out rather than hang Prometheus. Could you share whether the process is exiting or hanging? If it is hanging, which entry caused this?
adilnaimi commented Apr 4, 2018
@gouthamve it was hanging. I had this entry in the log file:
yanc0 commented May 30, 2018
Our Prometheus had a lot of restarts too. We added 500MB to the resource request and the resource limit; the values were previously 1024MB each. There are no more restarts now, but I think it would be nice to log this behaviour on stdout or stderr, stating that Prometheus had to kill itself because of the lack of memory.
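A rough sketch of the kind of Kubernetes resources change described above; the exact values mirror the comment (previously 1024Mi each, raised by roughly 500Mi) and are an assumption for illustration, not a general recommendation:

```yaml
# Illustrative container resources fragment; values mirror the comment above
# and are not a sizing recommendation.
resources:
  requests:
    memory: "1536Mi"
  limits:
    memory: "1536Mi"
```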
AFAIC Prometheus is killed by the OS, so it can't show any logs. Any other problems are reported by a log or a goroutine trace.
yanc0 commented May 30, 2018
I looked at the Kubernetes events and there is no trace of OOMKilling. I suspect a bad catch of the return code of malloc or something like that. I really don't know...
So far I have not seen any reports of silent restarts, so I would advise digging further into the k8s logs; if you are 100% sure this is a bug, please open a new issue with steps to reproduce.
kaspernissen commented Jun 20, 2018
I think I'm experiencing the same issue. Last week I updated our k8s cluster from
Watching the status of I was previously on Prometheus version
kaspernissen commented Jun 20, 2018
A small follow-up: since this was happening in one of our test environments, deleting and recreating the PVC was a possible solution, and it works. However, that is not a solution for production environments.
@kaspernissen Please don't "me too" on unrelated (and closed) issues. If you believe you have found a bug, file a new issue.
kaspernissen commented Jun 20, 2018
@brian-brazil Sure, I can create a new issue. However, this issue seemed like the most relevant place because of the recent activity, despite being closed.
Neru007 commented Jun 21, 2018
I'm also facing a similar kind of issue after upgrading to 2.3. It's taking an eternity to start.
sdeehring commented Jun 28, 2018
@brian-brazil that doesn't seem like a very constructive response to someone providing evidence of what is a related issue. I am also experiencing the same behavior on 2.3.1, and this is the only issue that seems to be relevant in my searches. Each time a new pod is brought up in Kubernetes my UI just returns "Service Unavailable". This is also with a persistent volume.
@sdeehring We want to fix bugs that users report; however, it is very difficult to do so when an issue ends up with, in this case, 5 distinct behaviours being discussed, which is unfortunately common. Accordingly, when an unrelated issue is spotted we ask the reporter to file a new bug so that both it and the original issue can get appropriate attention. A deadlock issue that was fixed in 2.0.0-beta.4 is unlikely to be related to an OOM issue in 2.3.1.
This sounds like a 6th distinct behaviour. If you think this is a bug, please file a new one so it can be looked into.
tvvignesh commented Jul 12, 2018
Hi. Facing the same issue with the latest version of Prometheus. It was working well for 2 days, but the next day I got this error and even using the tsdb tool does not help. Am I doing something wrong?
[ramcoazure@HCMDEMAZIND01 data]$ ls
Block 01CJ4XPDNXD5AWTB44DZ0QTGQG contains highest samples count and is ommited from the deletion list!
overlapping blocks : 18/07/11 08:00-10:00 18/07/11
Block 01CJ4FYXFG0V5VM0G2FQXT6G3A contains highest samples count and is ommited from the deletion list!
overlapping blocks : 18/07/11 14:00-16:00 18/07/11
Block 01CJ54J5V66TX4YSMV3RDMK917 contains highest samples count and is ommited from the deletion list!
BLOCK ULID  MIN TIME  MAX TIME  NUM SAMPLES  NUM CHUNKS  NUM SERIES  PATH
tvvignesh commented Jul 12, 2018
Oh, it started working. I had to run the tool again and again, and it kept deleting blocks. Now it works. Any way to prevent this from happening again?
The 2.3.2 release has a fix for that.
Misterhex commented Aug 13, 2018
Using 2.3.2 but still getting the same issue.
@Misterhex would you mind opening a new issue and including a more detailed report with a minimal config and steps to reproduce? I can then look into it.
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
JorritSalverda commented Sep 19, 2017
What did you do?
We run 2 Prometheus instances on Google Container Engine using preemptibles. This means the instances are relocated at least every 24 hours (the max lifetime of a preemptible).
With version 1.7.1 this caused issues because the graceful shutdown sometimes took too long and didn't fully finish. After that, startup took longer than the initialDelaySeconds of 1200s, causing Kubernetes to restart Prometheus over and over and making Prometheus unavailable. Deleting the instances and their data was the only simple way to get things up and running again.
With the much faster storage system in Prometheus 2 I had high hopes that this would no longer happen, but we seem to experience another cause of failure. After some restarts Prometheus does start and shows the following 4 log lines, but it fails to respond to the liveness check using the /status endpoint.
What did you expect to see?
Prometheus to cope well with restarts and to be able to gracefully shut down within the max 30 seconds a preemptible shutdown allows.
What did you see instead? Under which circumstances?
The restarts / relocations work fine most of the time, but every now and then they lead to failure. This usually seems to happen for both instances at pretty much the same time (or at least on the same day).
Environment
Linux 4.4.64+ x86_64
Last logs before it starts to fail
Hundreds of log lines similar to
Logs when restarting after failure