This repository has been archived by the owner on Dec 5, 2017. It is now read-only.

nil pointer dereference error in mesos procurement.go #768

Closed
ravilr opened this issue Feb 5, 2016 · 8 comments

ravilr commented Feb 5, 2016

@sttts @jdef
version: v0.7.2-v1.1.5

It looks like a race condition arises when there are pods using nodeSelector and Mesos slave attributes are exposed as k8s node labels.

I0205 02:16:14.065151 6249 errorhandler.go:59] Error scheduling k8s-router-mv96c: No suitable offers for pod/task; retrying
I0205 02:16:14.165429 6249 queuer.go:164] attempting to yield a pod
E0205 02:16:15.065531 6249 util.go:82] Recovered from panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/pkg/util/util.go:76
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/pkg/util/util.go:64
/usr/local/go/src/runtime/asm_amd64.s:402
/usr/local/go/src/runtime/panic.go:387
/usr/local/go/src/runtime/panic.go:42
/usr/local/go/src/runtime/sigpanic_unix.go:26
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:132
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:96
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:108
:11
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/fcfs.go:100
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/components/scheduler.go:101
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/offers/offers.go:472
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/offers/offers.go:508
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/runtime/util.go:115
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/runtime/util.go:116
/usr/local/go/src/runtime/asm_amd64.s:2232

ravilr commented Feb 5, 2016

Apart from the nil pointer dereference of the *api.Node labels, I am also observing that the node registrator doesn't seem to add a Mesos slave host back into the k8s node registry after it went offline and came back some time later.
The slave host gets deleted from the k8s API on the slave-down event:
I0205 02:03:05.514811 6249 service.go:710] deleting node "s1.www.com" from registry
but it never seems to get registered again when it comes back up and starts sending offers. I had to restart the k8sm-scheduler to recover from this.

ravilr changed the title from "nil pointer deference error in mesos procurement.go" to "nil pointer dereference error in mesos procurement.go" on Feb 5, 2016
jdef added this to the v0.7.3 milestone on Feb 5, 2016

ravilr commented Feb 6, 2016

Also seeing this in the controller-manager logs:
E0205 23:53:43.761186 1 statusupdater.go:68] Error listing slaves without kubelet: Get http://master1.www.com:5050/state: dial tcp 1.1.1.1:5050: connection refused

Mesos cluster version: 0.24. But it looks like the code falls back to /state.json, so is the above message harmless?

jdef commented Feb 6, 2016

yes, it should be falling back to state.json

jdef commented Feb 9, 2016

Pretty sure the problem is that Fit is being called with a nil *api.Node and the procurement funcs aren't checking for that:

$ find contrib/mesos -type f -exec grep -Hn -e 'Fit(' \{\} \;
contrib/mesos/pkg/scheduler/components/scheduler.go:102:                                return !task.Has(podtask.Launched) && ps.Fit(task, offer, nil)
contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/types.go:40:     Fit(*podtask.T, *mesosproto.Offer, *api.Node) bool
contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/fcfs.go:99:func (fps *fcfsPodScheduler) Fit(t *podtask.T, offer *mesosproto.Offer, n *api.Node) bool {
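
To illustrate the kind of nil check described above, here is a minimal, self-contained sketch. It is not the code from the repository and not necessarily how the actual fix works: the Node type and the nodeSelectorMatches function are hypothetical stand-ins for *api.Node and a procurement step, and rejecting a nil node when a selector is set is an assumption made for the example.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for api.Node; only the labels matter here.
type Node struct {
	Labels map[string]string
}

// nodeSelectorMatches is a hypothetical procurement-style check. It reports
// whether every key/value pair in selector is present in the node's labels.
// A nil node is handled explicitly instead of being dereferenced:
//   - with an empty selector there is nothing to match, so any node fits;
//   - with a non-empty selector and no node, the check fails (assumption).
func nodeSelectorMatches(selector map[string]string, n *Node) bool {
	if len(selector) == 0 {
		return true
	}
	if n == nil {
		// Without this guard, n.Labels below would panic with
		// "invalid memory address or nil pointer dereference".
		return false
	}
	for k, want := range selector {
		if got, ok := n.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"rack": "r1"}

	// The caller passes nil for the node, as scheduler.go does in the grep output.
	fmt.Println(nodeSelectorMatches(selector, nil)) // false

	// A node whose labels satisfy the selector.
	fmt.Println(nodeSelectorMatches(selector, &Node{Labels: map[string]string{"rack": "r1"}})) // true

	// No selector: a nil node is harmless.
	fmt.Println(nodeSelectorMatches(nil, nil)) // true
}
```

The point of the sketch is only that the guard sits before the first field access on the node, so a pod with a nodeSelector no longer triggers a panic when no node object is supplied.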

jdef commented Feb 9, 2016

fixed kubernetes/kubernetes#20936

ravilr commented Feb 10, 2016

Pulled in the above fix, and it seems to be working in my cluster, which runs a couple of pods with nodeSelector, without any panics.

jdef commented Feb 10, 2016

awesome - thanks for verifying!

jdef commented Feb 20, 2016

cherry-picked into 0.7.3, removing label
