
scheduler: fix panic while removing node from imageStates cache #66224

Merged
merged 1 commit into kubernetes:master on Jul 16, 2018

Conversation

nikhita
Member

@nikhita nikhita commented Jul 16, 2018

Currently, when I run hack/local-up-cluster.sh, the scheduler encounters a panic. From /tmp/kube-scheduler.log:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x15e9988]

goroutine 55 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x4242880, 0x877d870)
	/usr/local/go/src/runtime/panic.go:502 +0x229
k8s.io/kubernetes/pkg/scheduler/cache.(*schedulerCache).removeNodeImageStates(0xc4203dfe50, 0xc420ae3b80)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/cache/cache.go:510 +0xe8
k8s.io/kubernetes/pkg/scheduler/cache.(*schedulerCache).UpdateNode(0xc4203dfe50, 0xc420ae3b80, 0xc420415340, 0x0, 0x0)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/cache/cache.go:442 +0xcd
k8s.io/kubernetes/pkg/scheduler/factory.(*configFactory).updateNodeInCache(0xc420d2ca00, 0x4b680c0, 0xc420ae3b80, 0x4b680c0, 0xc420415340)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/factory/factory.go:794 +0x9a
k8s.io/kubernetes/pkg/scheduler/factory.(*configFactory).(k8s.io/kubernetes/pkg/scheduler/factory.updateNodeInCache)-fm(0x4b680c0, 0xc420ae3b80, 0x4b680c0, 0xc420415340)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/scheduler/factory/factory.go:248 +0x52
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(0xc4209fc8f0, 0xc4209fc900, 0xc4209fc910, 0x4b680c0, 0xc420ae3b80, 0x4b680c0, 0xc420415340)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:202 +0x5d
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0x42cf8f, 0xc4215035a0, 0x0)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:552 +0x18a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0xc4214addf0, 0x42cad9, 0xc421598f30)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203 +0x9c
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548 +0x81
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc421503768)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4214adf68, 0xdf8475800, 0x0, 0x40fdd01, 0xc4215d2360)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc421503768, 0xdf8475800, 0xc4215d2360)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc420187100)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546 +0x78
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).(k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.run)-fm()
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390 +0x2a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc4209b4840, 0xc42025e370)
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/home/nraghunath/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62

#65745, which was merged recently, introduced:

state, ok := cache.imageStates[name]
if ok {
	state.nodes.Delete(node.Name)
}
if len(state.nodes) == 0 {

If !ok, i.e. state is nil, the unguarded len(state.nodes) check dereferences a nil pointer and panics.
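One way to guard against this, and roughly the shape of the change in this PR, is to move the length check inside the ok branch so that a missing imageStates entry becomes a no-op. All identifiers below already appear in the snippet above; this is a sketch, not the exact merged diff.

state, ok := cache.imageStates[name]
if ok {
	state.nodes.Delete(node.Name)
	// Only consult state when the image entry actually exists;
	// if the entry is missing there is nothing to clean up.
	if len(state.nodes) == 0 {
		delete(cache.imageStates, name)
	}
}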

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 16, 2018
@nikhita
Member Author

nikhita commented Jul 16, 2018

/cc @silveryfu @resouer @bsalamat @ravisantoshgudimetla

@k8s-ci-robot
Contributor

@nikhita: GitHub didn't allow me to request PR reviews from the following users: silveryfu.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @silveryfu @resouer @bsalamat @ravisantoshgudimetla

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment


Thanks @nikhita for finding and working on this. This LGTM.

/lgtm

// imageStates represents the total number of different
// images on all nodes
delete(cache.imageStates, name)
if len(state.nodes) == 0 {
Contributor


I think the previous line doesn't make much sense now: calling state.nodes.Delete(node.Name) when len(state.nodes) == 0. You could change the order. I'd prefer a !ok pattern: at least log something in the !ok block, and then handle this condition in the else branch.
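For illustration, a minimal sketch of the !ok pattern suggested here, assuming the rest of removeNodeImageStates is unchanged (the glog call is illustrative, not code from this PR):

state, ok := cache.imageStates[name]
if !ok {
	// Nothing to clean up for this image; log for visibility.
	glog.V(4).Infof("image %q not found in imageStates cache", name)
} else {
	state.nodes.Delete(node.Name)
	if len(state.nodes) == 0 {
		delete(cache.imageStates, name)
	}
}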

Member Author


I think there are still some nits and tests to be fixed as a follow-up for #65745. Maybe this can be incorporated into that, or I'll fix it as a follow-up later. 👍

@ravisantoshgudimetla
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 16, 2018
@k82cn
Member

k82cn commented Jul 16, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k82cn, nikhita, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 16, 2018
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 66203, 66224). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 72440a1 into kubernetes:master Jul 16, 2018
@nikhita nikhita deleted the fix-scheduler-panic branch July 16, 2018 16:30
@bsalamat
Member

Thanks @nikhita for fixing the issue. Any fix like this should be accompanied by a test that reproduces the bug. That would verify that the fix does not miss corner cases and also raise the probability of catching similar bugs in other areas of the code.
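For illustration only, a self-contained sketch of the kind of regression test suggested here. It does not exercise the real schedulerCache; it models imageStates as a plain map, and imageStateSummary and removeImageState are hypothetical stand-ins that mirror the guarded cleanup pattern.

package cache

import "testing"

// imageStateSummary is a hypothetical stand-in for the scheduler's
// per-image bookkeeping: the set of node names that have the image.
type imageStateSummary struct {
	nodes map[string]struct{}
}

// removeImageState mirrors the guarded cleanup: only touch state
// when the image name is actually present in the cache.
func removeImageState(imageStates map[string]*imageStateSummary, imageName, nodeName string) {
	state, ok := imageStates[imageName]
	if !ok {
		return // image unknown to the cache; nothing to clean up
	}
	delete(state.nodes, nodeName)
	if len(state.nodes) == 0 {
		delete(imageStates, imageName)
	}
}

// TestRemoveMissingImageDoesNotPanic reproduces the shape of the bug:
// cleaning up an image name that is absent from the cache must not
// dereference a nil state.
func TestRemoveMissingImageDoesNotPanic(t *testing.T) {
	imageStates := map[string]*imageStateSummary{}
	removeImageState(imageStates, "nginx:latest", "node-1")

	// The normal path should still remove an image once no node references it.
	imageStates["nginx:latest"] = &imageStateSummary{nodes: map[string]struct{}{"node-1": {}}}
	removeImageState(imageStates, "nginx:latest", "node-1")
	if _, ok := imageStates["nginx:latest"]; ok {
		t.Fatal("expected image entry to be removed once no nodes reference it")
	}
}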
