Scheduler dies with "Schedulercache is corrupted" #50916

Closed
julia-stripe opened this Issue Aug 18, 2017 · 31 comments

@julia-stripe
Contributor

julia-stripe commented Aug 18, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

/sig scheduling

What happened:

The scheduler panics with "Schedulercache is corrupted and can badly affect scheduling decisions". There are a few related issues reporting a similar bug (#46347, #47150), but in both of those issues the recommended fix is to upgrade to etcd 3.0.17. We're using etcd 3.2.0 which is newer than that. This happened in our production cluster twice today.

Some evidence that this is an issue with etcd:

  1. scheduler fails with "Schedulercache is corrupted and can badly affect scheduling decisions"
  2. another scheduler on another machine gets elected the leader
  3. that scheduler also fails with "Schedulercache is corrupted and can badly affect scheduling decisions"

Conversely, evidence that it isn't an issue with etcd: when we later restart the scheduler (~30 minutes later), the system recovers without any changes to etcd (according to the apiserver logs). So etcd's contents are at least not permanently corrupted.

Here's an excerpt of the logs. Full logs are in this gist (all the relevant kubelet/controller manager/apiserver/scheduler logs).

======== scheduler (kubemaster--0fd4bec98eba7656c) ======

I0818 09:16:25.678670       1 factory.go:511] About to try and schedule pod explorer-job-1503047460-7681l
I0818 09:16:25.678677       1 scheduler.go:261] Attempting to schedule pod: scheduled-jobs/explorer-job-1503047460-7681l
W0818 09:16:25.679040       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
I0818 09:16:25.679059       1 factory.go:725] Attempting to bind explorer-job-1503047460-7681l to ip-10-68-30-4.us-west-2.compute.internal
E0818 09:16:25.679061       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
W0818 09:16:25.679040       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
E0818 09:16:25.679061       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
F0818 09:16:25.679081       1 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions


>>>> stack trace redacted: panic <<<<

======== apiserver audit ======

2017-08-18T09:16:25.679953099Z AUDIT: id="1e0516d4-9a95-4b4b-aaee-98a0894b3625" ip="10.68.24.10" method="POST" user="kubemaster--0fd4bec98eba7656c.nw" groups="\"kubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="scheduled-jobs" uri="/api/v1/namespaces/scheduled-jobs/pods/explorer-job-1503047460-7681l/binding"

======== scheduler (kubemaster--0aa8e87a222ebc781.nw) ======

I0818 09:16:45.738257       1 leaderelection.go:189] successfully acquired lease kube-system/kube-scheduler
I0818 09:16:45.738413       1 event.go:218] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-scheduler", UID:"3d55f039-8367-11e7-8b13-02708673f6f6", APIVersion:"v1", ResourceVersion:"737096", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' kubemaster--0aa8e87a222ebc781.nw became leader
I0818 09:16:45.838560       1 shared_informer.go:116] caches populated

I0818 09:19:22.361105       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I0818 09:19:24.369669       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I0818 09:19:25.338146       1 reflector.go:405] external/com_github_kubernetes/plugin/cmd/kube-scheduler/app/server.go:100: Watch close - *v1.Pod total 21 items received
I0818 09:19:25.340488       1 factory.go:511] About to try and schedule pod explorer-job-1503047460-7681l
I0818 09:19:25.340495       1 scheduler.go:261] Attempting to schedule pod: scheduled-jobs/explorer-job-1503047460-7681l
W0818 09:19:25.340723       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
E0818 09:19:25.340741       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
F0818 09:19:25.340759       1 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions

Environment:

  • Kubernetes version (use kubectl version): 1.7.4
  • Etcd version: 3.2.0-rc.1+git
  • Cloud provider or hardware configuration: AWS
  • OS: Ubuntu 15.04
  • Kernel: 4.4.0
@resouer

Member

resouer commented Aug 19, 2017

@wojtek-t

Member

wojtek-t commented Aug 21, 2017

W0818 09:19:25.340723       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
E0818 09:19:25.340741       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.

This seems to show that our retry logic is broken.

@resouer

Member

resouer commented Aug 21, 2017

@wojtek-t I tried several times but can't reproduce this.

@julia-stripe Any tips about reproducing this issue?

@julia-stripe

Contributor

julia-stripe commented Aug 21, 2017

So far I don't know how to reproduce this. This started happening when we upgraded from 1.7.0 to 1.7.4 (604542533d82f4bcb6a90b92a1425c9c89b6c886 to 793658f), and we also rebuilt our etcd cluster at the same time. I'm not sure whether it was caused by the upgrade, by rebuilding our etcd cluster, or neither, but I will report any new findings!

@timothysc

Member

timothysc commented Sep 5, 2017

I'll enqueue, but I don't have time right now. IIRC there were a number of small fixes in the shared informer this cycle too that may affect this.

/cc @ncdc

@timothysc timothysc self-assigned this Sep 5, 2017

@timothysc timothysc added this to the v1.9 milestone Sep 5, 2017

@ncdc

Member

ncdc commented Sep 6, 2017

@timothysc I glanced at the "scheduler cache" and it appears to be unrelated to shared informers?

@wojtek-t

Member

wojtek-t commented Sep 6, 2017

Agreed. We are using a shared informer in the scheduler, but I would be surprised if it were related to that.
I guess something is wrong with the schedulercache itself, but I also don't have time to look into it now.

@julia-stripe

Contributor

julia-stripe commented Sep 25, 2017

This scheduler panic has been happening in our production cluster (still running 1.7.4) several times a day, every day, for the last 30 days. It doesn't appear to cause any actual scheduling issues right now (the scheduler panics, restarts, and everything seems to be fine), but it worries me that I can't figure out what's going on. I just spent a bit more time staring at the logs to try to understand what's happening. Here are some logs:

I0924 02:10:55.393785       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I0924 02:10:56.814434       1 factory.go:511] About to try and schedule pod misc-reaper-1506106440-dsnx2
I0924 02:10:56.814448       1 scheduler.go:261] Attempting to schedule pod: ops-scheduled-tasks/misc-reaper-1506106440-dsnx2
I0924 02:10:56.814724       1 factory.go:725] Attempting to bind misc-reaper-1506106440-dsnx2 to ip-10-68-25-61.us-west-2.compute.internal
I0924 02:10:56.823453       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ops-scheduled-tasks", Name:"misc-reaper-1506106440-dsnx2", UID:"99da90a2-a0cd-11e7-b720-02c5763b3f4e", APIVersion:"v1", ResourceVersion:"96758266", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned ops--misc--auto-check-reaper-1506106440-dsnx2 to ip-10-68-25-61.us-west-2.compute.internal
	
I0924 02:10:56.823581       1 disruption.go:349] updatePod called on pod "misc-reaper-1506106440-dsnx2"

I0924 02:11:03.125052       1 reflector.go:405] external/com_github_kubernetes/plugin/cmd/kube-scheduler/app/server.go:100: Watch close - *v1.Pod total 1355 items received

I0924 02:11:03.140644       1 factory.go:511] About to try and schedule pod misc-reaper-1506106440-dsnx2
I0924 02:11:03.140649       1 scheduler.go:261] Attempting to schedule pod: ops-scheduled-tasks/misc-reaper-1506106440-dsnx2
W0924 02:11:03.140834       1 cache.go:228] Pod ops-scheduled-tasks/misc-reaper-1506106440-dsnx2 assumed to a different node than added to.
I0924 02:11:03.140842       1 factory.go:725] Attempting to bind misc-reaper-1506106440-dsnx2 to ip-10-68-28-145.us-west-2.compute.internal
E0924 02:11:03.140849       1 cache.go:263] Pod ops-scheduled-tasks/misc-reaper-1506106440-dsnx2 updated on a different node than previously added to.
F0924 02:11:03.140862       1 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions
<stacktrace>

This looks to me like the pod informer is giving the scheduler pods that aren't up to date (because the crash happens right after Watch close - *v1.Pod total 1355 items received), but I don't really understand how that would happen. Also, if there were an issue with the shared informer, it seems like it would affect more people.

There is something weird about this number 1355 in *v1.Pod total 1355 items received -- at no time do we ever have more than about 300 pods in the cluster (according to kubectl get pods --all-namespaces -a | wc), so I don't understand where 1355 pods would come from. Probably unrelated though.

I'm very happy to debug/fix this; I'm just feeling a bit stuck and would love ideas / debugging strategies.

@jayunit100

Member

jayunit100 commented Sep 25, 2017

Could this be an etcd / apiserver locking issue, maybe one you only see when running in high-availability mode, where the state of the cache is being trampled on? What happens if two schedulers run at the same time without both holding a lock?

@julia-stripe

Contributor

julia-stripe commented Sep 25, 2017

We run both the scheduler & controller manager with --leader-elect=true, which should mean only one instance of each is running at a time. We do have two apiservers running at the same time though; maybe that's related? Will investigate.
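
For reference, leader election is what makes --leader-elect=true guarantee a single active scheduler. Below is a minimal sketch of the pattern using client-go's leaderelection package; this is an illustration only, not the actual scheduler wiring (the modern Lease-based lock is shown for clarity, while the 1.7-era scheduler used an Endpoints-based lock, and runSchedulingLoop is a hypothetical stand-in for the real scheduling loop):

```go
// Sketch: how --leader-elect gates the scheduling loop via client-go leader election.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling elided for brevity
	client := kubernetes.NewForConfigOrDie(cfg)

	// Every candidate scheduler contends for the same lock object; only the
	// current holder runs the scheduling loop.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "kube-scheduler", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				runSchedulingLoop(ctx) // hypothetical stand-in for the real scheduling loop
			},
			OnStoppedLeading: func() {
				os.Exit(1) // losing the lease: stop rather than schedule with a stale view
			},
		},
	})
}

func runSchedulingLoop(ctx context.Context) { <-ctx.Done() }
```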

@bsalamat

Contributor

bsalamat commented Sep 26, 2017

@julia-stripe I would like to help investigate this issue. Given the sequence of events, a race condition between the two API servers could be the cause, but it is hard to tell without further investigation. The fact that it is happening fairly frequently in your cluster is a good thing for debugging. We can try a couple of scenarios, for example running a single API server if possible, or perhaps raising logging verbosity to see if we can find any more clues.

@jayunit100

Member

jayunit100 commented Sep 27, 2017

This is easy to test: kill one of the API servers, and the problem is probably solved. If so, you have a bug in the scheduler lock, and further investigation can figure out why the API servers are racing on the lock.

@julia-stripe

Contributor

julia-stripe commented Sep 27, 2017

I updated our configuration today so that the scheduler talks to the API server on localhost. So now the scheduler always sends requests to the same API server. Previously the scheduler would choose a random apiserver through a load balancer.

I was super hopeful that this would fix the issue (picking a random apiserver to talk to on every request is a pretty weird thing to do!). But sadly it didn't; the scheduler is still panicking in the same way as before.

I don't think running only one API server in production would be an acceptable workaround for us.

@wojtek-t

Member

wojtek-t commented Oct 2, 2017

Looking into logs from #50916 (comment) - the scheduler is trying to schedule the same pod misc-reaper-1506106440-dsnx2 for the second time, even though it previously scheduled it:

I0924 02:10:56.823453       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ops-scheduled-tasks", Name:"misc-reaper-1506106440-dsnx2", UID:"99da90a2-a0cd-11e7-b720-02c5763b3f4e", APIVersion:"v1", ResourceVersion:"96758266", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned ops--misc--auto-check-reaper-1506106440-dsnx2 to ip-10-68-25-61.us-west-2.compute.internal

But I have a couple questions:

I0924 02:10:56.823581       1 disruption.go:349] updatePod called on pod "misc-reaper-1506106440-dsnx2"
  1. Why do we have this log line in the scheduler logs? It is coming from the disruption controller IIUC, so it shouldn't be part of the scheduler logs...

  2. Since this is happening right after the watch closed: @julia-stripe - do you know if this triggers a re-list in the informer, or just re-sends the watch request with a different resource version?

Also, can you please link the whole scheduler log? It's difficult to come up with a good understanding of the problem from just small snippets.

@julia-stripe

Contributor

julia-stripe commented Oct 2, 2017

Why do we have this log in scheduler? This is coming from disruption controller IIUC, so it shouldn't be part of scheduler logs...

Yes, those are from the disruption controller -- I included them because they mentioned the same pod. (The logs I pasted in that comment come from searching for that pod ID in our log aggregation system, so I could find any events that mentioned it.)

Since this is happening right after watch closed: @julia-stripe - do you know if this is triggering re-list in informer? Or just re-sends watch requests with a different resource version?

How do I figure out whether a re-list is being triggered in the informer?

can you please link the whole scheduler log

Here are the logs from the scheduler for 2 crashes, from the time when the watch closes to when the scheduler crashes: https://gist.github.com/julia-stripe/55eac5af6f76043f7cb3c924b10aae21. Happy to send even more logs if that would help, but I need to write a script to redact them a bit if I'm going to send more complete logs :)

@wojtek-t

Member

wojtek-t commented Oct 3, 2017

Since this is happening right after watch closed: @julia-stripe - do you know if this is triggering re-list in informer? Or just re-sends watch requests with a different resource version?

How do I figure out whether a re-list is being triggered in the informer?

The easiest way I know is to look into the apiserver logs around that time and see if there is a list API call coming from the scheduler.

Here are the logs from the scheduler for 2 crashes

I looked into those and in both cases I'm seeing this at the beginning:

I1002 11:31:23.299897       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I1002 11:31:23.664068       1 reflector.go:405] external/com_github_kubernetes/plugin/cmd/kube-scheduler/app/server.go:100: Watch close - *v1.Pod total 6891 items received

I'm really wondering why there are so many watch events (pods) delivered initially. We should be doing "LIST+WATCH" (right after acquiring the lock), and there shouldn't be many watch events.

Something strange is happening here.

I'm afraid we may actually need apiserver logs to debug further...
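
For reference, this is the LIST+WATCH pattern the reflector uses: list once, then watch from the returned resourceVersion, re-watching from the last seen version when a watch closes cleanly and only re-listing on error. A minimal sketch with client-go follows; it is an illustration under those assumptions, not the actual scheduler wiring, and the field selector simply mirrors the one visible in the audit logs later in this thread:

```go
// Sketch: a pod reflector/informer doing LIST+WATCH via client-go.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling elided for brevity
	client := kubernetes.NewForConfigOrDie(cfg)

	// Same shape of field selector as the scheduler's pod watch in the audit logs.
	selector := fields.ParseSelectorOrDie("status.phase!=Failed,status.phase!=Succeeded")
	lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "pods", v1.NamespaceAll, selector)

	// The informer LISTs once, then WATCHes from the returned resourceVersion;
	// on a clean watch close it re-watches from the last seen resourceVersion.
	_, controller := cache.NewInformer(lw, &v1.Pod{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("add", obj.(*v1.Pod).Name) },
		UpdateFunc: func(_, obj interface{}) { fmt.Println("update", obj.(*v1.Pod).Name) },
		DeleteFunc: func(obj interface{}) {},
	})

	stop := make(chan struct{})
	controller.Run(stop)
}
```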

@julia-stripe

Contributor

julia-stripe commented Oct 3, 2017

Here are all the API requests to /api/v1/pods from the scheduler for 10 minutes before a few crashes.

crash #1: time: 21:11:04.000

API server audit logs: (the results of sourcetype=kube-apiserver-audit method=GET uri=/api/v1/pods* NOT spec.nodeName NOT kube-state-metrics)

2017-10-02T21:08:14.238876191Z AUDIT: id="fe4ec695-6ac2-4e9c-afaf-4cd3d3d4271c" ip="10.68.24.10" method="GET" user="schedkubemaster--0fd4bec98eba7656c.northwest.stripe.io" groups="\"schedkubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/pods?resourceVersion=179093338&timeoutSeconds=330&watch=true"
2017-10-02T21:05:01.919774573Z AUDIT: id="fad970d9-7d73-4406-9975-68d634fe82b4" ip="10.68.24.122" method="GET" user="schedkubemaster--0aa8e87a222ebc781.northwest.stripe.io" groups="\"schedkubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=179080832&timeoutSeconds=456&watch=true"
2017-10-02T21:04:13.842486847Z AUDIT: id="276775ac-072c-476d-b0a0-7cd3e16f5631" ip="10.68.24.10" method="GET" user="schedkubemaster--0fd4bec98eba7656c.northwest.stripe.io" groups="\"schedkubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=179078533&timeoutSeconds=411&watch=true"

crash #2: time: 15:02:43.591276

API server audit logs: (the results of sourcetype=kube-apiserver-audit method=GET uri=/api/v1/pods* NOT spec.nodeName NOT kube-state-metrics)

2017-10-02T15:00:28.855716661Z AUDIT: id="58da304d-382a-4673-92f5-a93cca213ba7" ip="10.68.24.122" method="GET" user="schedkubemaster--0aa8e87a222ebc781.northwest.stripe.io" groups="\"schedkubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=176828942&timeoutSeconds=457&watch=true"
2017-10-02T14:57:22.863912862Z AUDIT: id="0a6a7628-4a9a-4199-a358-2ee1048d3f7d" ip="10.68.24.10" method="GET" user="schedkubemaster--0fd4bec98eba7656c.northwest.stripe.io" groups="\"schedkubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/pods?resourceVersion=176789786&timeoutSeconds=514&watch=true"
2017-10-02T14:55:33.587863002Z AUDIT: id="9175e135-f59b-4525-885f-d372301ff590" ip="10.68.24.10" method="GET" user="schedkubemaster--0fd4bec98eba7656c.northwest.stripe.io" groups="\"schedkubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=176766772&timeoutSeconds=430&watch=true"
2017-10-02T14:55:16.854385159Z AUDIT: id="d935f4a0-af12-46d8-970b-4d361d430676" ip="10.68.24.122" method="GET" user="schedkubemaster--0aa8e87a222ebc781.northwest.stripe.io" groups="\"schedkubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=176763150&timeoutSeconds=312&watch=true"

Basically, at all times as far as I can tell, there are 2 watches on pods:

  • one like /api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=183370640&timeoutSeconds=502&watch=true
  • one like /api/v1/pods?resourceVersion=176789786&timeoutSeconds=514&watch=true

There are usually list events happening around crashes (because there are lists happening every 7 minutes or so), but as you can see in the logs they don't necessarily happen immediately before the crashes.

@wojtek-t

Member

wojtek-t commented Oct 3, 2017

All of those have the "watch=true" bit, so those weren't relists.

BTW - it's strange because the times are completely different here and in the previous file (11:31 and 16:26 vs 21:0* and 14:5*), so I guess those are different crashes.

But this clearly shows that there weren't any relists.

@julia-stripe

Contributor

julia-stripe commented Oct 3, 2017

Relists look like this, right? GET /api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=0. There's always one of those right after a crash, which makes sense because the scheduler needs to reload everything from scratch, but not otherwise. I can explain all of the relists I see by either the scheduler starting or the controller manager starting.

@wojtek-t

Member

wojtek-t commented Oct 3, 2017

OK, so my hypothesis about relists and some bug in the code around them isn't a valid one.

In my opinion there are two possible hypotheses now:

  • there is some interaction between the two schedulers that you have
  • there is a bug in our retrying code (if a pod can't be scheduled initially, or its scheduling fails for some reason)

I bet on the second, to be honest...

@julia-stripe

Contributor

julia-stripe commented Oct 3, 2017

The 2nd scheduler hasn't printed any log lines at all for the last 3 weeks (with --v=4), so I don't believe it's doing anything.

@k8s-merge-robot

Contributor

k8s-merge-robot commented Oct 9, 2017

[MILESTONENOTIFIER] Milestone Removed

@julia-stripe @timothysc

Important: This issue was missing labels required for the v1.9 milestone for more than 3 days:

priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

@k8s-merge-robot k8s-merge-robot removed this from the v1.9 milestone Oct 9, 2017

@liggitt

Member

liggitt commented Oct 10, 2017

We're using etcd 3.2.0 which is newer than that. This happened in our production cluster twice today.

Conversely, evidence that it isn't an issue with etcd: when we later restart the scheduler (~30 minutes later), the system recovers without any changes to etcd (according to the apiserver logs).

Is it possible you're hitting etcd-io/etcd#8411 (fixed in etcd 3.2.6+)? We saw this impact the apiserver watch cache (#43152 (comment), #43152 (comment)), but it could affect any watch client connected to an etcd member that dropped and then restored from the etcd cluster via a snapshot (any watch events contained in that snapshot are not sent to watchers)

@liggitt

Member

liggitt commented Oct 10, 2017

other questions:

  • how many etcd members?
  • are you running api servers with --etcd-quorum-read?
@julia-stripe

Contributor

julia-stripe commented Oct 10, 2017

  1. We have 3 etcd members
  2. We're not running the apiservers with --etcd-quorum-read. I'd never heard of that option until right now (it's not mentioned in the documentation anywhere, though I found a lot of GitHub issues saying it's important to turn it on in an HA setup). Going to try that right away and see if it helps.
@julia-stripe

Contributor

julia-stripe commented Oct 10, 2017

Switched to using --etcd-quorum-read and the scheduler has crashed twice since then with this error. So it doesn't seem to have fixed the issue.

@embano1

embano1 commented Oct 10, 2017

@julia-stripe I also accidentally stumbled across this setting, which is the default in recent Kubernetes environments (also with the improved support on the etcdv3 side). Not directly related, but have a look at this issue about how a certain etcd version caused cluster trouble (#52498).

Did you try downgrading etcd? Just a blind guess...

@danielqsj

Contributor

danielqsj commented Oct 30, 2017

I also encountered the same problem in 1.7.8.

@liggitt

Member

liggitt commented Nov 7, 2017

we're seeing this on 1.7.6 as well

@liggitt

Member

liggitt commented Nov 7, 2017

seeing this in our logs:

I1103 08:24:31.633312  102351 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"mynamespace", Name:"mydeployment-2-6wcd9", UID:"1236b32d-1230-1237-1233-12372d41666a", APIVersion:"v1", ResourceVersion:"123123123", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned mydeployment-2-6wcd9 to mynode
W1103 08:25:03.927820  102351 cache.go:228] Pod mynamespace/mydeployment-2-6wcd9 assumed to a different node than added to.
E1103 08:25:03.927862  102351 cache.go:263] Pod mynamespace/mydeployment-2-6wcd9 updated on a different node than previously added to.
F1103 08:25:03.927869  102351 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions
@liggitt

Member

liggitt commented Nov 7, 2017

Reproducing test and fix in #55262

k8s-merge-robot added a commit that referenced this issue Nov 8, 2017

Merge pull request #55262 from liggitt/schedulercache
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Fix 'Schedulercache is corrupted' error

Fixes #50916

If an Assume()ed pod is Add()ed with a different nodeName, the podStates view of the pod is not corrected to reflect the actual nodeName. On the next Update(), the scheduler observes the mismatch and the process exits.

```release-note
Fixed 'Schedulercache is corrupted' error in kube-scheduler
```
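
To make the failure mode concrete, here is a minimal, self-contained sketch of the assume/add/update bookkeeping described in the PR text. The types and methods are simplified stand-ins, not the real plugin/pkg/scheduler/schedulercache code:

```go
// Sketch of the pre-#55262 bookkeeping: Assume records an optimistic binding,
// Add does not correct the stale nodeName, and Update then hits the fatal check.
package main

import "log"

type podState struct {
	nodeName string // node the cache currently believes the pod is on
	assumed  bool   // true while the pod is only optimistically bound
}

type schedulerCache struct {
	podStates map[string]*podState // keyed by namespace/name
}

// Assume records an optimistic binding before the API server confirms it.
func (c *schedulerCache) Assume(key, node string) {
	c.podStates[key] = &podState{nodeName: node, assumed: true}
}

// Add handles the informer's Add event for a scheduled pod.
func (c *schedulerCache) Add(key, node string) {
	if s, ok := c.podStates[key]; ok && s.assumed {
		if s.nodeName != node {
			log.Printf("Pod %s assumed to a different node than added to.", key)
			// Buggy behavior sketched here: the stale nodeName in podStates is
			// left as-is, so the next Update still sees a mismatch.
		}
		s.assumed = false
		return
	}
	c.podStates[key] = &podState{nodeName: node}
}

// Update handles the informer's Update event.
func (c *schedulerCache) Update(key, node string) {
	if s, ok := c.podStates[key]; ok && s.nodeName != node {
		log.Printf("Pod %s updated on a different node than previously added to.", key)
		log.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
	}
}

func main() {
	c := &schedulerCache{podStates: map[string]*podState{}}
	c.Assume("scheduled-jobs/explorer-job", "node-a") // scheduler's optimistic bind
	c.Add("scheduled-jobs/explorer-job", "node-b")    // watch event reports node-b
	c.Update("scheduled-jobs/explorer-job", "node-b") // stale node-a in cache -> fatal
}
```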