Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Fix 'Schedulercache is corrupted' error #55262
Conversation
k8s-merge-robot
assigned
bsalamat and
k82cn
Nov 7, 2017
k8s-ci-robot
added
cncf-cla: yes
release-note
and removed
do-not-merge/release-note-label-needed
labels
Nov 7, 2017
liggitt
referenced this pull request
Nov 7, 2017
Closed
Scheduler dies with "Schedulercache is corrupted" #50916
k8s-ci-robot
added
sig/scheduling
kind/bug
labels
Nov 7, 2017
|
/retest |
|
/assign @wojtek-t |
k8s-ci-robot
assigned
wojtek-t
Nov 7, 2017
timothysc
modified the milestone:
v1.9
Nov 7, 2017
timothysc
added
the
cherrypick-candidate
label
Nov 7, 2017
timothysc
added this to the v1.8 milestone
Nov 7, 2017
| + t.Fatalf("AddPod failed: %v", err) | ||
| + } | ||
| + } | ||
| + cache.cleanupAssumedPods(now.Add(2 * ttl)) |
liggitt
Nov 7, 2017
Member
this is simulating passing of time to ensure the added pod doesn't expire. was copy/paste from previous testcase, though, and not really relevant to this one, so I can remove it
k8s-merge-robot
added
the
approved
label
Nov 7, 2017
k8s-cherrypick-bot
commented
Nov 7, 2017
|
Removing label |
k8s-cherrypick-bot
removed
the
cherrypick-candidate
label
Nov 7, 2017
timothysc
added
the
cherrypick-candidate
label
Nov 7, 2017
|
updated to remove unrelated ttl bit from unit test |
This was referenced Nov 7, 2017
jpbetz
added
priority/critical-urgent
lgtm
cherrypick-approved
and removed
cherrypick-candidate
cherrypick-approved
labels
Nov 7, 2017
|
CRD Flake |
bsalamat
reviewed
Nov 7, 2017
Your change looks good to me, I just wonder if Scheduler resource accounting remains valid in such scenario where a pod is assumed to a different node that it is bound to.
| + } | ||
| + for _, podToUpdate := range tt.podsToUpdate { | ||
| + if err := cache.UpdatePod(podToUpdate[0], podToUpdate[1]); err != nil { | ||
| + t.Fatalf("AddPod failed: %v", err) |
|
fixed gofmt error and test message, re-tagging |
k8s-merge-robot
removed
the
lgtm
label
Nov 7, 2017
It was just the podStates' version of the pod that wasn't getting updated. The removePod(currState.pod)/addPod(pod) dance corrects the per-node accounting already (and the unit test I added demonstrates the per-node accounting is corrected): kubernetes/plugin/pkg/scheduler/schedulercache/cache.go Lines 234 to 241 in 454074d |
liggitt
added
the
lgtm
label
Nov 7, 2017
|
@liggitt looks like there's still a gofmt issue on |
|
Thanks, @liggitt! /lgtm |
k8s-merge-robot
removed
the
lgtm
label
Nov 7, 2017
|
actually fixed gofmt error |
|
reapplying LGTM |
k8s-ci-robot
assigned
cblecker
Nov 7, 2017
k8s-ci-robot
added
the
lgtm
label
Nov 7, 2017
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, cblecker, liggitt, timothysc Associated issue: 50916 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
|
uhhh..
That's an odd unit test failure. Let's retry. |
|
unit test failing: #55276 |
|
fingers crossed |
|
TestCRD flake |
added a commit
that referenced
this pull request
Nov 8, 2017
|
/test all [submit-queue is verifying that this PR is safe to merge] |
|
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
liggitt commentedNov 7, 2017
•
Edited 7 times
-
liggitt
Nov 7, 2017
-
liggitt
Nov 7, 2017
-
liggitt
Nov 7, 2017
-
liggitt
Nov 7, 2017
-
liggitt
Nov 7, 2017
-
liggitt
Nov 7, 2017
-
liggitt
Nov 7, 2017
Fixes #50916
If an Assume()ed pod is Add()ed with a different nodeName, the podStates view of the pod is not corrected to reflect the actual nodeName. On the next Update(), the scheduler observes the mismatch and process exits.