Prometheus 2.0 fails to start up after couple of restarts #3191
Thanks for reporting. You are likely hitting the startup deadlock reported previously in #3185. It got fixed in prometheus-junkyard/tsdb#146. We fixed a couple more things over the last few days. I'm currently testing things with this container image: quay.io/fabxc/prometheus:v2.0.0-beta.4-debug-2955. You can try it out yourself in the meantime. Those changes also include the web server being reachable while the database is being restored on restart.
@fabxc I've rolled out your version across all of our non-production clusters to see if the issue goes away and things run well. So far things seem okay, apart from having to wipe out all data. I hope the data format stabilizes quickly :)
So far so good with the debug version, while the beta 3 instances were hanging again.
We experienced the same problem. Deploying the debug version made it work again.
I do still run into the following issue(s) with this debug version:
Can you share the exact logs you are seeing? Also, did you run this against storage written by a beta.4?
Nevermind, also just hit that one now. I'll investigate.
It was indeed fresh data since upgrading to your debug version. Hope you find the cause and a fix.
Thinking about it – this may actually be intended behavior and we are just forgetting to skip the logging. The log line was added just recently: prometheus-junkyard/tsdb@162a48e
Eventually the instance outputting these logs did start; it just took quite long, about 5 minutes.
Thanks @JorritSalverda – this was a legitimate semantic race happening. Fix in prometheus-junkyard/tsdb#150
I hope you have a test version soon; our production instances have gone into CrashLoopBackOff state, probably because recovery takes too long. Happy to test your fix.
Upgrading to
@fabxc so far all of our 24 Prometheus instances are running fine. Feel free to close the issue. I'll reopen or create a new one if anything new pops up. Thanks for your super fast fixes!
Awesome, thank you so much!
Hi, I came across the same issue here.
The pod status turned CrashLoopBackOff and it restarts over and over again, but still fails.
From the logs here, I still can't figure out what is going on; any suggestions? Many thanks.
Prometheus 2.2.1 fails to start up with a similar issue:
Same issue:
Hi @adilnaimi @Kirchen99 depending on the size of the data, it might take a minute or two to start. If it never starts, please open a new issue describing the issue in detail.
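Whether a slow start merely delays availability or spirals into CrashLoopBackOff comes down to the liveness probe budget. As a rough sketch (simplified: it ignores `timeoutSeconds` and probe jitter; the parameter names mirror the Kubernetes `livenessProbe` fields, but the formula is an approximation, not kubelet source):

```python
def liveness_budget(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Approximate seconds a container has to become live before the kubelet
    restarts it: the first probe fires after initial_delay_s, probes repeat
    every period_s, and the container is killed after failure_threshold
    consecutive failures."""
    return initial_delay_s + failure_threshold * period_s

# With the 1200s initialDelaySeconds mentioned later in this thread and
# typical defaults (periodSeconds=10, failureThreshold=3):
print(liveness_budget(1200, 10, 3))  # 1230
```

If WAL replay regularly takes longer than this budget, the pod is restarted mid-recovery and never catches up, which matches the symptom described above.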
Maybe this can help someone else: please check your configuration files. In my case, it was my configuration file.
@adilnaimi That should error out rather than hang Prometheus. Could you share whether the process is exiting or hanging? If hanging, which entry caused this?
@gouthamve it was hanging. I had this entry in the log file:
Our Prometheus had a lot of restarts too. We added 500MB to the resourceRequest and the resourceLimit; values were previously 1024MB each. There are no more restarts now, but I think it would be nice to log this behaviour on stdout or stderr, saying that Prometheus had to kill itself because of a lack of memory.
AFAIC Prometheus is killed by the OS, so it can't show any logs. Otherwise, any other problems are reported by a log or a goroutine trace.
I looked at the Kubernetes events and there is no trace of OOMKilling. I suspect a bad catch of the return code of malloc or something like that. I really don't know...
So far I have not seen any report of silent restarts, so I would advise to keep digging through the k8s logs, and if you are 100% sure this is a bug, please open a new issue with steps to reproduce.
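When digging through the k8s side, one concrete thing to look for is OOM-kill events. A small illustrative helper for sifting event output (the sample lines below are made up for the example, not real cluster output; in practice `kubectl get events` or the node's kernel log is the source):

```python
def find_oom_lines(event_lines):
    """Return only the event lines that mention the kernel OOM killer."""
    return [line for line in event_lines if "OOMKill" in line]

sample = [
    "2m  Normal   Pulled      pod/prometheus-0  Container image already present on machine",
    "1m  Warning  OOMKilling  node/node-1       Out of memory: Kill process 1234 (prometheus)",
]

for line in find_oom_lines(sample):
    print(line)
```

An empty result here (as the commenter above reports) suggests the restarts are not memory-driven, which is why a separate issue with full logs is the right next step.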
I think I'm experiencing the same issue. Last week I updated our k8s cluster from
Watching the status of … I was previously on Prometheus version …
A small follow-up. Since this was happening in one of our test environments, deleting and recreating the PVC was a possible solution, which works. However, that's not a solution for production environments.
@kaspernissen Please don't "me too" on unrelated (and closed) issues. If you believe you have found a bug, file a new issue.
@brian-brazil Sure, I can create a new issue. However, this issue seemed to be the most relevant place because of the recent activity despite being closed.
I'm also facing a similar issue after upgrading to 2.3. It's taking an eternity to start.
@brian-brazil that doesn't seem like a very constructive response to someone providing evidence of what is a related issue. I am also experiencing the same behavior on 2.3.1, and this is the only issue that seems relevant in my searches. Each time a new pod is brought up in Kubernetes, my UI just returns "Service Unavailable". This is while using a persistent volume, too.
@sdeehring We want to fix bugs that users report; however, it is very difficult to do so when an issue ends up with, in this case, 5 distinct behaviours being discussed, which is unfortunately common. Accordingly, when an unrelated issue is spotted, we ask the reporter to file a new bug so that both it and the original issue can get appropriate attention. A deadlock issue that was fixed in 2.0.0-beta.4 is unlikely to be related to an OOM issue in 2.3.1.
This sounds like a 6th distinct behaviour. If you think this is a bug, please file a new one so it can be looked into.
Hi. Facing the same issue with the latest version of Prometheus. It was working well for 2 days, but the next day I got this error, and even using the tsdb tool does not help. Am I doing something wrong?

    [ramcoazure@HCMDEMAZIND01 data]$ ls
    Block 01CJ4XPDNXD5AWTB44DZ0QTGQG contains highest samples count and is ommited from the deletion list!
    overlapping blocks : 18/07/11 08:00-10:00 18/07/11
    Block 01CJ4FYXFG0V5VM0G2FQXT6G3A contains highest samples count and is ommited from the deletion list!
    overlapping blocks : 18/07/11 14:00-16:00 18/07/11
    Block 01CJ54J5V66TX4YSMV3RDMK917 contains highest samples count and is ommited from the deletion list!
    BLOCK ULID  MIN TIME  MAX TIME  NUM SAMPLES  NUM CHUNKS  NUM SERIES  PATH
Oh. It started working. I had to run the tool again and again and again, and it kept deleting blocks. Now it works. Any way to prevent this from happening again?
The 2.3.2 release has a fix for that.
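For anyone wanting to see whether their data directory contains overlapping blocks before reaching for the tsdb tool, the check can be sketched from each block's `meta.json`, which records the block's time range in milliseconds (the `minTime`/`maxTime` field names follow the TSDB on-disk format; the helper itself is an illustration, not official Prometheus tooling):

```python
import json
import os

def load_ranges(data_dir):
    """Return (block_name, minTime, maxTime) for every block under data_dir."""
    ranges = []
    for name in sorted(os.listdir(data_dir)):
        meta_path = os.path.join(data_dir, name, "meta.json")
        if os.path.isfile(meta_path):
            with open(meta_path) as f:
                meta = json.load(f)
            ranges.append((name, meta["minTime"], meta["maxTime"]))
    return ranges

def overlaps(ranges):
    """Return pairs of blocks whose [minTime, maxTime) intervals intersect."""
    out = []
    for i, (a, a_min, a_max) in enumerate(ranges):
        for b, b_min, b_max in ranges[i + 1:]:
            if a_min < b_max and b_min < a_max:
                out.append((a, b))
    return out
```

Usage would be `overlaps(load_ranges("/prometheus/data"))`; an empty list means the on-disk blocks are cleanly separated in time.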
Using 2.3.2 but still getting the same issue.
@Misterhex would you mind opening a new issue and including a more detailed report with a minimal config and steps to reproduce, and I can look into it.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
What did you do?
We run 2 Prometheus instances on Google Container Engine using preemptibles. This means the instances are relocated at least every 24 hours (the max lifetime of a preemptible).
With version 1.7.1 this caused issues because the graceful shutdown sometimes took too long and didn't fully finish. After that, startup took longer than the `initialDelaySeconds` of `1200s`, causing Kubernetes to restart Prometheus over and over and making Prometheus unavailable. Deleting the instances and their data was the only simple way to get things up and running again.
With the much faster storage system in Prometheus 2, I had high hopes that this would no longer happen, but we seem to experience another cause of failure. After some restarts Prometheus does start and shows the following 4 log lines, but it fails to respond to the liveness check using the `/status` endpoint.
What did you expect to see?
Prometheus to cope well with restarts and to be able to shut down gracefully within the max 30 seconds a preemptible shutdown allows.
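The 30-second preemptible window means whatever runs in the pod must treat SIGTERM as a hard deadline. A minimal sketch of the pattern (illustrative only, not Prometheus's actual shutdown code; the names are made up for the example):

```python
import os
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    """Mark shutdown; a real server would flush and close storage here,
    keeping the work well under the 30s preemption deadline."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the preemption notice by signalling ourselves.
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)  # give the handler a chance to run

print(shutting_down)
```

The key point for the scenario above: if the flush work in the handler cannot finish inside the window, the process is killed mid-write, which is exactly the dirty state that makes the next startup slow.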
What did you see instead? Under which circumstances?
The restarts / relocations work fine most of the time, but every now and then they lead to failure. This usually seems to happen for both instances at pretty much the same time (or at least the same day).
Environment
Linux 4.4.64+ x86_64
Last logs before it starts to fail
Hundreds of log lines similar to
Logs when restarting after failure