Kube-scheduler dies with "Schedulercache is corrupted" #46347
Comments
@kubernetes/sig-scheduling-bugs |
There are some PRs around this that came in recently. /cc @smarterclayton |
@pizzarabe - can you please provide the whole scheduler log? This is definitely some kind of race, and without logs it will be hard to debug (I think we fixed some issue with it in 1.5.x, but it seems there is something more here...) |
we were mutating the shared cache from a threaded environment, so my changes should in theory fix that. |
xref: #46223 |
Given this issue, it is important to backport Clayton's fix to 1.6 and 1.5. |
We can actually do a much smaller fix - just copy before you put things back in the cache. If that's the actual problem here (may not be).
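To make the suggestion concrete, here is a minimal sketch of "copy before you put things back in the cache". The podCache type and its methods are made-up names, not the actual kube-scheduler cache, and it uses the modern k8s.io/api types rather than the 1.6-era packages:

```go
// Illustrative sketch, not the real scheduler cache: the cache only ever
// stores and returns private copies, so callers that keep mutating their
// own *v1.Pod cannot corrupt the shared state.
package cache

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

type podCache struct {
	mu   sync.Mutex
	pods map[string]*v1.Pod // keyed by namespace/name
}

func newPodCache() *podCache {
	return &podCache{pods: make(map[string]*v1.Pod)}
}

// Update stores a deep copy, so later mutations of the caller's object
// (e.g. setting Status during binding) are invisible to other goroutines.
func (c *podCache) Update(key string, pod *v1.Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods[key] = pod.DeepCopy()
}

// Get returns another copy so readers cannot mutate the cached object.
func (c *podCache) Get(key string) (*v1.Pod, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	p, ok := c.pods[key]
	if !ok {
		return nil, false
	}
	return p.DeepCopy(), true
}
```

Copying on both write and read is more allocation-heavy than a real cache needs to be, but it makes the ownership rules trivial, which is the point of the "much smaller fix".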
|
@smarterclayton @timothysc - can you clarify why you think this PR can help with this problem? I looked into it again, and I don't see any place where we weren't copying before and are copying now. I agree it fixes watch semantics, but this is unrelated to this issue. |
We mutate the status of the pod when we call update without copying in 1.6 and earlier.
|
Aah ok - you mean this one, right: |
The problem I can see is that if it's just about Status (and we are not mutating Spec), it shouldn't matter - we are not using Status in the cache... |
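As an illustration of that reasoning, here is a hedged sketch of a cache that aggregates only Spec fields; nodeSummary and addPod are made-up names, not the real scheduler cache types. If only pod.Spec is read when building the per-node summary, a racy write to pod.Status alone would not change what the cache stores:

```go
// Hedged sketch, not the actual kube-scheduler cache: a per-node summary
// built purely from pod.Spec (resource requests).
package cache

import v1 "k8s.io/api/core/v1"

type nodeSummary struct {
	requestedMilliCPU int64
	requestedMemory   int64
	podKeys           map[string]struct{}
}

func newNodeSummary() *nodeSummary {
	return &nodeSummary{podKeys: make(map[string]struct{})}
}

func (n *nodeSummary) addPod(key string, pod *v1.Pod) {
	// Only Spec is read here; a concurrent write to pod.Status would not
	// change these sums, which is why a Status-only mutation by itself
	// would not be expected to corrupt such a cache.
	for _, c := range pod.Spec.Containers {
		n.requestedMilliCPU += c.Resources.Requests.Cpu().MilliValue()
		n.requestedMemory += c.Resources.Requests.Memory().Value()
	}
	n.podKeys[key] = struct{}{}
}
```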
According to the log, and what it says, it may be related to #45453, but there is no direct evidence :(. |
@k82cn in the past we had a problem (that would cause exactly this behavior) that:
I thought this was already fixed, but maybe we should revisit whether it really is... |
@wojtek-t, got the case :) Let me revisit it to see whether there is any clue here. |
@wojtek-t
[The dead scheduler]:
[The restarted scheduler]:
Not sure if this helps anybody, but here is the kube-scheduler manifest:
|
@pizzarabe - thanks a lot for those logs. Looking at the ones from the dead scheduler, they look extremely strange to me.
|
Or maybe it's because of flushing? The logs from cache.go are Error/Fatal, so they are flushed immediately; the others are Info, so they are not. Maybe the order is a red herring? |
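A tiny self-contained sketch of that flushing effect; this is not glog itself, it just shows how a buffered writer next to an unbuffered one can invert the on-disk order of log lines written by the same process:

```go
// Demonstrates why on-disk log order can be a red herring: the "info" line
// goes through a buffer that is flushed later, while the "error" line is
// written straight to the file, so it lands first even though it was
// logged second.
package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

func main() {
	f, err := os.Create("ordering-demo.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buffered := bufio.NewWriter(f) // stand-in for buffered Info logging
	defer buffered.Flush()

	fmt.Fprintln(buffered, "I0524 info: scheduled pod busybox")     // stays in the buffer
	fmt.Fprintln(f, "E0524 error: schedulercache is corrupted")     // written to the file immediately

	time.Sleep(100 * time.Millisecond)
	// Only now is the info line flushed, so it appears *after* the error
	// in the file even though it was logged first.
	buffered.Flush()
}
```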
Yeah, this is my test cluster. I reproduced the error before that; you can see the scheduler was started ~1 min before killing it with the
This was not caused by |
It seems both "assigned non-terminated ListerWatcher" and "un-assigned non-terminated ListerWatcher" got Pod |
I noticed you said your etcd version was v3.0.10 - could you be running into #45506, where the etcd version is too old to work correctly with Kubernetes? I had that issue occur on my cluster, and it caused all sorts of strange behaviours, as pod deletions were not correctly observed by the Kubernetes master components since they weren't getting the DELETED events. If it is this issue, you should see lines in your apiserver logs like |
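For anyone who wants to check this directly, here is a hedged diagnostic sketch that watches the pod keyspace in etcd and prints event types, so you can see whether DELETE events are being delivered at all. The /registry/pods/ prefix and the endpoint are assumptions about a default etcd3-backed setup; adjust for your cluster:

```go
// Diagnostic sketch (not apiserver code): watch the pod keyspace in etcd
// and print PUT/DELETE events as they arrive.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/mvcc/mvccpb"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // assumed local etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Assumed default apiserver key layout; some setups use a different prefix.
	for resp := range cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			switch ev.Type {
			case mvccpb.DELETE:
				fmt.Printf("DELETE %s\n", ev.Kv.Key)
			case mvccpb.PUT:
				fmt.Printf("PUT    %s\n", ev.Kv.Key)
			}
		}
	}
}
```

If pods are deleted in Kubernetes but no DELETE events show up here, that would point at the etcd-version problem described above rather than at the scheduler.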
@pizzarabe Are you saying that you know the reproduction steps? If so, it would help a lot if you could give us the steps so that we can investigate more and see if the issue still exists in the presence of recent fixes. |
@tinselspoon Yes, my Kube-API log says things like @bsalamat I can reproduce the scheduler dying (by creating a pod with the manifest I mentioned before). |
@pizzarabe - if you are using etcd in the 3.0.10 version, this is definitely possible. You should upgrade to at least 3.0.12 (preferably to 3.0.17). We should wait and see if this is also reproducible with etcd 3.0.17. I suspect it's not, and this is an etcd issue that we are aware of (and was fixed in higher versions). |
BUG REPORT
Kubernetes version (use kubectl version):
Environment:
Container Linux by CoreOS 1353.7.0 (Master + most nodes)
and
Container Linux by CoreOS 1298.5.0 (some nodes)
1353.7.0:
(4.9.24-coreos #1 SMP Wed Apr 26 21:44:23 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz GenuineIntel GNU/Linux)
1298.5.0:
(Linux alien8 4.9.9-coreos-r1 #1 SMP Tue Feb 28 00:06:10 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz GenuineIntel GNU/Linux)
Hyperkube (https://coreos.com/kubernetes/docs/latest/getting-started.html)
What happened:
After some testing with my private image registry (using the docs at https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/) I had some problems with the kube-scheduler.
Creating a new pod, e.g.
kills the kube-scheduler;
the kube-scheduler is restarted by the kubelet and the pod is started after that.
Some output from kubelet
The journal entry right before that shows:
May 24 15:29:33 alien1 dockerd[1757]: time="2017-05-24T15:29:33.460151779+02:00" level=error msg="Error closing logger: invalid argument"
Creating a new pod with something like
kubectl run busybox --image busybox /bin/sh
does not kill the scheduler!
What you expected to happen:
The Scheduler should not die :)
How to reproduce it:
Not sure; the cluster worked for ~30 days without problems. The problem started after working with my private image registry. Before that I updated k8s from 1.6.1 to 1.6.3.
Anything else we need to know:
I restarted the Master after the problem first occurred (that helped before with some other problems)