Scheduler dies with "Schedulercache is corrupted" #50916
Comments
k8s-ci-robot added kind/bug, sig/scheduling labels on Aug 18, 2017
k8s-merge-robot added needs-sig and removed needs-sig labels on Aug 18, 2017
This seems to show that our retry logic is broken.
|
@wojtek-t I tried several times but can't reproduce. @julia-stripe Any tips on reproducing this issue?
|
So far I don't know how to reproduce this. It started happening when we upgraded from 1.7.0 to 1.7.4 (604542533d82f4bcb6a90b92a1425c9c89b6c886 to 793658f), and we also rebuilt our etcd cluster at the same time. I'm not sure whether it was caused by the upgrade, by rebuilding our etcd cluster, or by neither, but I will report any new findings!
|
I'll enqueue this, but I don't have time right now. IIRC there were a number of small fixes in the shared informer this cycle too that may affect this. /cc @ncdc
timothysc self-assigned this on Sep 5, 2017
timothysc added this to the v1.9 milestone on Sep 5, 2017
|
@timothysc I glanced at the "scheduler cache" and it appears to be unrelated to shared informers?
|
Agree. We are using a shared informer in the scheduler, but I would be surprised if it were related to that.
|
This scheduler panic has been happening in our production cluster (still running 1.7.4) several times a day, every day, for the last 30 days. It doesn't appear to cause any actual scheduling issues right now (the scheduler panics, restarts, and everything seems to be fine), but it worries me that I can't figure out what's going on. I just spent a bit more time staring at the logs to try and understand what's happening. Here are some logs:
This looks to me like the pod informer is giving the scheduler pods that aren't up to date (because the crash happens right after the log lines above). There is also something weird about the number 1355 in those logs. I'm very happy to debug/fix this, I'm just feeling a bit stuck and would love ideas / debugging strategies.
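As a rough illustration of that failure mode (a minimal sketch, not the actual Kubernetes scheduler source; every type and function name below is invented): informer events drive a local cache keyed by pod UID, and an event for a pod the cache already tracks is treated as corruption and is fatal.

```go
// Illustrative sketch only: a tiny informer-fed cache that dies the way the
// scheduler does when it receives a stale or duplicated event.
package main

import (
	"fmt"
	"log"
)

type pod struct {
	UID  string
	Node string
}

type schedulerCache struct {
	pods map[string]*pod // pods the cache believes exist, keyed by UID
}

func (c *schedulerCache) addPod(p *pod) error {
	if _, ok := c.pods[p.UID]; ok {
		return fmt.Errorf("pod %s is already in the cache", p.UID)
	}
	c.pods[p.UID] = p
	return nil
}

// onAdd mimics an informer AddFunc handler: any cache inconsistency is fatal,
// which is the "Schedulercache is corrupted" crash reported in this issue.
func onAdd(c *schedulerCache, p *pod) {
	if err := c.addPod(p); err != nil {
		log.Fatalf("Schedulercache is corrupted: %v", err)
	}
}

func main() {
	c := &schedulerCache{pods: map[string]*pod{}}
	onAdd(c, &pod{UID: "pod-uid-1", Node: "node-a"})
	// A watch event that re-delivers state the cache already has (for example
	// because the informer's view of the world is behind) trips the check:
	onAdd(c, &pod{UID: "pod-uid-1", Node: "node-a"})
}
```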
|
Could this be an etcd / apiserver locking issue, maybe one you only see when running in high-availability mode, if the status of the cache is being trampled on? What happens if two schedulers run at the same time without both having the lock?
|
We run both the scheduler & controller manager with leader election enabled.
|
@julia-stripe I would like to help investigate this issue. Given the sequence of events, a race condition between the two API servers could be the cause, but it is hard to tell without further investigation. The fact that it is happening kind of frequently in your cluster is a good thing for debugging. We can try a couple of scenarios, for example running a single API server if possible, or perhaps raising logging verbosity to see if we can find any more clues.
|
This is easy to test: run just one API server and the problem is probably solved. If so, you have a bug in the scheduler lock, and further investigation can figure out why the API servers are racing on the lock.
|
I updated our configuration today so that the scheduler talks to the API server on localhost, so now the scheduler always sends requests to the same API server. Previously the scheduler would choose a random apiserver through a load balancer. I was super hopeful that this would fix the issue (picking a random apiserver to talk to on every request is a pretty weird thing to do!), but sadly it didn't: the scheduler is still panicking in the same way as before. I don't think running only one API server in production would be an acceptable workaround for us.
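For illustration only, the same idea expressed as a client-go sketch (the real change was presumably to the scheduler's own flags or kubeconfig, and the address below is hypothetical): every request from this client goes to one fixed apiserver instead of whatever a load balancer picks.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Hypothetical local apiserver endpoint; TLS and auth settings are
	// omitted for brevity, so a real config would need them.
	cfg := &rest.Config{
		Host: "https://127.0.0.1:6443",
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// All list/watch traffic from this client now targets the single
	// local apiserver rather than a load-balanced pool.
	fmt.Println("client configured:", client != nil)
}
```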
|
Looking into logs from #50916 (comment) - the scheduler is trying to schedule the same pod
But I have a couple questions:
Also, can you please link the whole scheduler log? It's difficult to come up with a good understanding of the problem from just small snippets.
Yes, those are from the disruption controller; I included them because they mentioned the same pod. (The logs I pasted in that comment come from searching for that pod ID in our log aggregation system, so I could find any events that mentioned it.)
How do I figure out whether a re-list is being triggered in the informer?
Here are the logs from the scheduler for 2 crashes, from the time when the watch closes to when the scheduler crashes: https://gist.github.com/julia-stripe/55eac5af6f76043f7cb3c924b10aae21. Happy to send even more logs if that would help, but I need to write a script to redact them a bit if I'm going to send more complete logs :)
The easiest way I know is to look into the apiserver logs around that time and see if there is a list API call coming from the scheduler.
I looked into those and in both cases I'm seeing this at the beginning:
I'm really wondering why there are so many watch events (pods) delivered initially. We should be doing "LIST+WATCH" (right after acquiring the lock), and there shouldn't be many watch events. Something strange is happening here. I'm afraid we may actually need apiserver logs to debug further...
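For context, the expected LIST+WATCH sequence looks roughly like the sketch below; listPods and watchPods are hypothetical stand-ins rather than client-go calls. The informer does one full list, records the resourceVersion, and then opens a watch from that version, so only changes made after the list should arrive as watch events.

```go
package main

import "fmt"

// listPods stands in for the initial full LIST: it returns the current pods
// plus the resourceVersion the subsequent WATCH should start from.
func listPods() (pods []string, resourceVersion string) {
	return []string{"pod-a", "pod-b"}, "1000"
}

// watchPods stands in for the WATCH opened at that resourceVersion: a healthy
// watch delivers only changes made after the list, not a replay of all pods.
func watchPods(fromRV string, events chan<- string) {
	events <- "MODIFIED pod-a (after rv " + fromRV + ")"
	close(events)
}

func main() {
	pods, rv := listPods() // one full LIST, e.g. right after acquiring the lock
	fmt.Println("initial list:", pods, "resourceVersion:", rv)

	events := make(chan string)
	go watchPods(rv, events) // then WATCH from that resourceVersion
	for e := range events {
		fmt.Println("watch event:", e)
	}
	// A flood of watch events immediately after startup, as seen in these
	// logs, would mean the watch is replaying state the list should have
	// already covered.
}
```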
|
Here are all the API requests around crash #1 (time: 21:11:04.000), from the API server audit logs:
Crash #2 (time: 15:02:43.591276), API server audit logs:
Basically, at all times, as far as I can tell, there are 2 watches on pods:
There are usually list events happening around crashes (because there are lists happening every 7 minutes or so), but as you can see in the logs they don't necessarily happen immediately before the crashes.
|
All of those have the "?watch=true" bit, so those weren't relists. BTW, it's strange because the times are completely different here and in the previous file (11:31 and 16:26 vs 21:0* and 14:5*), so I guess those are different crashes. But this clearly shows that there weren't any relists.
|
Relists look like this, right?
|
OK, so my hypothesis about relists and some bug in the code around them isn't a valid one. In my opinion there are two possible hypotheses now:
I bet on the second, to be honest...
|
The 2nd scheduler hasn't printed any log lines at all for the last 3 weeks (with leader election, the non-leader stays idle).
k8s-merge-robot added milestone/incomplete-labels, milestone/removed and removed milestone/incomplete-labels labels on Oct 5, 2017
|
[MILESTONENOTIFIER] Milestone Removed. Important: This issue was missing labels required for the v1.9 milestone for more than 3 days: priority: Must specify exactly one of
k8s-merge-robot removed this from the v1.9 milestone on Oct 9, 2017
Is it possible you're hitting coreos/etcd#8411 (fixed in etcd 3.2.6+)? We saw this impact the apiserver watch cache (#43152 (comment), #43152 (comment)), but it could affect any watch client connected to an etcd member that dropped and then restored from the etcd cluster via a snapshot (any watch events contained in that snapshot are not sent to watchers) |
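To spell out the suspected mechanism, here is a toy model (not etcd code; the revisions and events below are invented) of how a snapshot-restored member can leave an already-established watcher permanently behind:

```go
package main

import "fmt"

// event is a toy stand-in for an etcd change at a given revision.
type event struct {
	rev int
	msg string
}

func main() {
	watcherRev := 100  // last revision the watch client has observed
	snapshotRev := 150 // the member rejoined by restoring a snapshot at rev 150

	// Changes that happened while the member was away arrive via the
	// snapshot, not the watch stream, so the watcher never sees them.
	missed := []event{{120, "pod deleted"}, {135, "pod binding updated"}}
	for _, e := range missed {
		if e.rev > watcherRev && e.rev <= snapshotRev {
			fmt.Printf("event %q (rev %d) applied by snapshot, never delivered to watcher\n", e.msg, e.rev)
		}
	}
	fmt.Println("watcher continues from rev", snapshotRev, "with a silently stale view")
}
```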
|
other questions:
|
Switched to using
embano1 commented on Oct 10, 2017
|
@julia-stripe I also accidentally stumbled across this setting, which is the default in recent Kubernetes environments (also with the improved support on the etcdv3 side). Not directly related, but have a look at this issue to see how a certain etcd version caused cluster trouble (#52498). Did you try downgrading etcd? Just a blind guess...
timothysc added the priority/important-soon label on Oct 26, 2017
timothysc removed their assignment on Oct 26, 2017
|
I also encountered the same problem in 1.7.8.
|
We're seeing this on 1.7.6 as well.
|
seeing this in our logs:
|
Reproducing test and fix in #55262 |
julia-stripe commented on Aug 18, 2017 (edited 1 time)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
/sig scheduling
What happened:
The scheduler panics with "Schedulercache is corrupted and can badly affect scheduling decisions". There are a few related issues reporting a similar bug (#46347, #47150), but in both of those issues the recommended fix is to upgrade to etcd 3.0.17. We're using etcd 3.2.0 which is newer than that. This happened in our production cluster twice today.
Some evidence that this is an issue with etcd:
Conversely, evidence that it isn't an issue with etcd: when we restart the scheduler (~30 minutes later), the system recovers without any changes to etcd (according to the apiserver logs). So etcd's contents are at least not permanently corrupted.
Here's an excerpt of the logs. Full logs are in this gist (all the relevant kubelet / controller manager / apiserver / scheduler logs).
Environment:
Kubernetes version (use kubectl version): 1.7.4