vSphere Cloud Provider triggers panic in controller-manager pod #36295

Closed
KingJ opened this issue Nov 6, 2016 · 16 comments

KingJ commented Nov 6, 2016

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): vsphere, ResourcePool, ComputeResource


Is this a BUG REPORT or FEATURE REQUEST? (choose one): Bug Report

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.5+coreos.0", GitCommit:"f70c2e5b2944cb5d622621a706bdec3d8a5a9c5e", GitTreeState:"clean", BuildDate:"2016-10-31T19:16:47Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment: CoreOS on vSphere

What happened:
I configured the controller-manager to use vsphere as the cloud provider and supplied it with a provider cloud configuration. When the controller-manager pod started, it panicked with the following stack trace:

2016-11-06T00:34:22.670422172Z panic: reflect.Set: value of type mo.ResourcePool is not assignable to type mo.ComputeResource
2016-11-06T00:34:22.670477454Z 
2016-11-06T00:34:22.670494551Z goroutine 68 [running]:
2016-11-06T00:34:22.670592707Z panic(0x38d5200, 0xc820d2b310)
2016-11-06T00:34:22.670754360Z 	/usr/local/go/src/runtime/panic.go:481 +0x3e6
2016-11-06T00:34:22.670990355Z reflect.Value.assignTo(0x4a676e0, 0xc8200a5600, 0x99, 0x4df7650, 0xb, 0x4a66e60, 0x0, 0x0, 0x0, 0x0)
2016-11-06T00:34:22.671019041Z 	/usr/local/go/src/reflect/value.go:2164 +0x3be
2016-11-06T00:34:22.671380809Z reflect.Value.Set(0x4a66e60, 0xc82049e780, 0x199, 0x4a676e0, 0xc8200a5600, 0x99)
2016-11-06T00:34:22.671462864Z 	/usr/local/go/src/reflect/value.go:1334 +0x95
2016-11-06T00:34:22.671715875Z k8s.io/kubernetes/vendor/github.com/vmware/govmomi/vim25/mo.LoadRetrievePropertiesResponse(0xc820d308e0, 0x46ecda0, 0xc82049e780, 0x0, 0x0)
2016-11-06T00:34:22.672055230Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/vmware/govmomi/vim25/mo/retrieve.go:128 +0xe21
2016-11-06T00:34:22.672431372Z k8s.io/kubernetes/vendor/github.com/vmware/govmomi/property.(*Collector).Retrieve(0xc820b54418, 0x7f8a572ab340, 0xc8202d2540, 0xc820d3ca80, 0x1, 0x1, 0xc820cecfd0, 0x1, 0x1, 0x46ecda0, ...)
2016-11-06T00:34:22.672467148Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/vmware/govmomi/property/collector.go:167 +0x52f
2016-11-06T00:34:22.672606970Z k8s.io/kubernetes/vendor/github.com/vmware/govmomi/property.(*Collector).RetrieveOne(0xc820b54418, 0x7f8a572ab340, 0xc8202d2540, 0xc820cec9c0, 0xc, 0xc820cec9e0, 0xa, 0xc820cecfd0, 0x1, 0x1, ...)
2016-11-06T00:34:22.672681600Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/vmware/govmomi/property/collector.go:173 +0x10e
2016-11-06T00:34:22.672871709Z k8s.io/kubernetes/vendor/github.com/vmware/govmomi/object.Common.Properties(0x0, 0x0, 0xc820b1e500, 0xc8202cedd0, 0xb, 0xc8202cee00, 0xb, 0x7f8a572ab340, 0xc8202d2540, 0xc820cec9c0, ...)
2016-11-06T00:34:22.672904730Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/vmware/govmomi/object/common.go:97 +0x19f
2016-11-06T00:34:22.673072861Z k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere.readInstance(0xc82025b0e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
2016-11-06T00:34:22.673104582Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere/vsphere.go:223 +0xe8f
2016-11-06T00:34:22.673118861Z k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere.newVSphere(0xc820252200, 0x18, 0xc8204c25f0, 0xc, 0xc82011fd60, 0x9, 0xc8204c2048, 0x3, 0x1, 0xc8204c2ae8, ...)
2016-11-06T00:34:22.673128111Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere/vsphere.go:237 +0x7e
2016-11-06T00:34:22.673204822Z k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere.init.1.func1(0x7f8a572c9488, 0xc82014c668, 0x0, 0x0, 0x0, 0x0)
2016-11-06T00:34:22.673274571Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere/vsphere.go:153 +0xdf
2016-11-06T00:34:22.673365023Z k8s.io/kubernetes/pkg/cloudprovider.GetCloudProvider(0x7ffc9a344abf, 0x7, 0x7f8a572c9488, 0xc82014c668, 0x0, 0x0, 0x0, 0x0)
2016-11-06T00:34:22.673457368Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/cloudprovider/plugins.go:62 +0x112
2016-11-06T00:34:22.673526314Z k8s.io/kubernetes/pkg/cloudprovider.InitCloudProvider(0x7ffc9a344abf, 0x7, 0x7ffc9a344ad6, 0x1c, 0x0, 0x0, 0x0, 0x0)
2016-11-06T00:34:22.673582280Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/cloudprovider/plugins.go:84 +0x3e2
2016-11-06T00:34:22.673593710Z k8s.io/kubernetes/cmd/kube-controller-manager/app.StartControllers(0xc82062f900, 0xc8203aea80, 0xc8201de340, 0xc8203af860, 0x7f8a572d6790, 0xc820418640, 0x0, 0x0)
2016-11-06T00:34:22.673674949Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go:225 +0x719
2016-11-06T00:34:22.673686774Z k8s.io/kubernetes/cmd/kube-controller-manager/app.Run.func2(0xc8203af860)
2016-11-06T00:34:22.673696396Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go:166 +0x6a
2016-11-06T00:34:22.673755265Z created by k8s.io/kubernetes/pkg/client/leaderelection.(*LeaderElector).Run
2016-11-06T00:34:22.673766424Z 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/client/leaderelection/leaderelection.go:177 +0x91

What you expected to happen:
I expected the controller-manager pod to run as normal.

How to reproduce it (as minimally and precisely as possible):
Configure the controller-manager pod to use the vsphere cloud provider and pass the following provider cloud configuration file:

[Global]
server = vcenter
port = 443
user = administrator@vsphere.local
password = removed
insecure-flag = true
datacenter = FC

[Network]
public-network = External

Anything else we need to know:

  • This cluster was created by following the CoreOS + Kubernetes Step by Step guide. The only deviation from the guide was to adjust the /etc/kubernetes/manifests/kube-controller-manager.yaml file to include configuration flags for cloud-provider and cloud-config, and to add an additional volume and volume mount for the provider cloud config file.
  • The CoreOS VMs are part of a single resource group (K8) on the same ESXi host.
  • My base image for all of the hyperkube containers is quay.io/coreos/hyperkube:v1.4.5_coreos.0
  • Before adding the additional configuration to use the vsphere cloud provider, this cluster was working as expected. The only change that was made to produce the stack trace above was to enable the vsphere cloud provider and pass the provider cloud config above.
  • I am still fairly new to Kubernetes, so I'm open to the possibility that I have made a fatal error somewhere, but as far as I can tell this does appear to be a genuine bug. Apologies if it turns out to be a mistake on my part!
@pdhamdhere

@kerneltime regression from last regression-fix?

@kerneltime

Not a regression, but the init code needs to be vetted for varying deployment scenarios. A similarly themed panic was hit by the Red Hat folks as well, at https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/cloudprovider/providers/vsphere/vsphere.go#L220

erinboyd commented Nov 7, 2016

Today the resource pool isn't a parameter of the cloud config. Should it be set there, rather than via a govc export?

@kerneltime

I think the code in question needs to scan the hierarchy correctly; today there are strong assumptions about what type a parent, or the node itself, can be. cc @vipulsabhaya
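
As an illustration of that concern, here is a minimal, hypothetical sketch (not the provider's actual code; the helper name ownerName and the package name are invented) of a hierarchy scan with govmomi that checks the managed object reference's type before asking the property collector to fill a concrete struct. Passing a mismatched struct is exactly what produces the reflect.Set panic in the trace above.

package vspheresketch

import (
    "context"

    "github.com/vmware/govmomi/property"
    "github.com/vmware/govmomi/vim25/mo"
    "github.com/vmware/govmomi/vim25/types"
)

// ownerName inspects ref.Type before choosing which concrete mo.* struct
// the property collector should fill. Asking the collector to fill a
// mo.ComputeResource for a ResourcePool reference is what triggers
// "reflect.Set: value of type mo.ResourcePool is not assignable to type
// mo.ComputeResource".
func ownerName(ctx context.Context, pc *property.Collector, ref types.ManagedObjectReference) (string, error) {
    switch ref.Type {
    case "ClusterComputeResource":
        var ccr mo.ClusterComputeResource
        if err := pc.RetrieveOne(ctx, ref, []string{"name"}, &ccr); err != nil {
            return "", err
        }
        return ccr.Name, nil
    case "ComputeResource":
        var cr mo.ComputeResource
        if err := pc.RetrieveOne(ctx, ref, []string{"name"}, &cr); err != nil {
            return "", err
        }
        return cr.Name, nil
    default:
        // A VM on a standalone host may sit under a bare ResourcePool;
        // fall back to the generic ManagedEntity instead of assuming a
        // cluster-shaped parent.
        var me mo.ManagedEntity
        if err := pc.RetrieveOne(ctx, ref, []string{"name", "parent"}, &me); err != nil {
            return "", err
        }
        return me.Name, nil
    }
}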

@kerneltime

kerneltime commented Nov 8, 2016

Some of the relevant documentation:

A resource pool can contain child resource pools, virtual machines, or both. 
You can create a hierarchy of shared resources. 
The resource pools at a higher level are called parent resource pools. 
Resource pools and virtual machines that are at the same level are called siblings. 
The cluster itself represents the root resource pool. 
If you do not create child resource pools, only the root resource pools exist.

Will try to get a fix in this week.

@kerneltime

The resource pool is not important here; the code tries to discover the information it should return from the GetZone() API. It has a prescriptive notion of what the deployment should look like and tries to discover the cluster it is in to return zone information. I suggest that the selection of zone be done at install time, and that no assumption be made within the cloud provider code as to what constitutes a zone when deploying vSphere on premises. For certain customers a single ESXi box might constitute an availability zone, while for others it might be a vSphere cluster. @vipulsabhaya any comments?
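
As a hedged sketch of that install-time approach (the Region and Zone config fields below are hypothetical and do not exist in the 1.4 provider; the cloudprovider.Zone type and the GetZone() signature are the real 1.4-era API), the provider could simply echo values that the deployment wrote into the cloud config instead of discovering them from the VC:

package vspheresketch

import "k8s.io/kubernetes/pkg/cloudprovider"

// VSphereConfig mirrors the shape of the provider's gcfg-parsed config;
// the Region and Zone keys are invented for this sketch.
type VSphereConfig struct {
    Global struct {
        // ... existing fields (server, user, datacenter, ...) ...
        Region string `gcfg:"region"` // hypothetical: set by the installer
        Zone   string `gcfg:"zone"`   // hypothetical: set by the installer
    }
}

type VSphere struct {
    cfg VSphereConfig
}

// GetZone reports whatever the deployment declared, with no VC round trip
// and no assumptions about clusters versus standalone ESXi hosts.
func (vs *VSphere) GetZone() (cloudprovider.Zone, error) {
    return cloudprovider.Zone{
        Region:        vs.cfg.Global.Region,
        FailureDomain: vs.cfg.Global.Zone,
    }, nil
}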

@erinboyd

@kerneltime So, the code assumes you are providing the path off of /Datacenter/host/

Thus:
[root@ose3-nfs-0 ~]# govc ls -l 'host/*'
/Boston/host/devel/Resources (ResourcePool)
/Boston/host/devel/10.19.114.222 (HostSystem)
/Boston/host/devel/10.19.114.223 (HostSystem)
[root@ose3-nfs-0 ~]#

Should my resource pool be "/devel/Resources", or just "Resources", or may I specify the full path?

@kerneltime

Give the full path. Here is what it looks like for my setup using kube-up:
export GOVC_RESOURCE_POOL='/Datacenter/host/10.20.104.41/Resources'
That said, the assumptions about what constitutes a zone are very prescriptive here. It should be up to the deployment logic to label the nodes with their availability zones rather than the nodes trying to figure it out.

KingJ commented Nov 12, 2016

After moving my ESXi host into a cluster in vSphere, instead of having it as a standalone host inside a datacentre, I no longer receive this error. My datacentre config is unchanged and is set to the name of the datacentre as shown in vSphere; I've not used the full path that govc ls outputs.

@erinboyd

@kerneltime So I am thinking the full path is:
/Boston/host/devel/Resources
But from the example you give, I am wondering if I shouldn't have an IP (host) rather than 'devel' in my path.
Thoughts?

dav1x commented Nov 14, 2016

@erinboyd I think in @kerneltime's example he is using a single node; 'devel' is the cluster name. We could always take a node out of the cluster and test with a single node by itself to see if that works.

@kerneltime

About the code in question: I was hoping that the HPE team would chime in, but so far they have not, so I will put my 2 cents in. Right now my understanding is that the cloud provider is trying to discover the values to be returned for GetZone(), and it is very prescriptive about the deployment and how it maps to regions and zones, which might not be true for every deployment. If someone wants, they can remove the code currently in place and pick up the values from a config file that the deployment engine (the entity that should be aware of regions and zones) can populate. At least, that is the code change I plan to make when I get to this issue.

@kerneltime

@dav1x yes that is true my setup has only one vSphere node.

BaluDontu commented Nov 29, 2016

In the current vSphere CP code, the zone is populated with the region and the failure domain. vSphere has a concept in which a VM can be assigned to a failure domain. If we can populate this value by querying the VC, K8s can use this information to create pods in multiple failure domains.

However, I see that the current govmomi/govc has no support for fault domain requests, so we can't query the VC for the fault domain info.
There is a second option, along the lines of what @kerneltime has proposed: make the region and fault domain configurable by the user, and use this information at install time. However, providing a single user-supplied fault domain ID for the vSphere CP to use offers no benefit at all with respect to how K8s creates pods on these nodes, as it will be consistent across all the nodes.
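
For what it's worth, a deployment can already express per-node failure domains without any CP support by labeling nodes at install time with the standard beta labels the scheduler understands; a hypothetical usage example (the node name and label values are invented):

kubectl label node coreos-worker-1 \
  failure-domain.beta.kubernetes.io/region=FC \
  failure-domain.beta.kubernetes.io/zone=zone-a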

@kerneltime

@BaluDontu The deployment logic can decide what the availability zone should be for a node; as long as it is set correctly, the CP will report it and k8s should take advantage of it.

kerneltime pushed a commit to vmware-archive/kubernetes-archived that referenced this issue Dec 9, 2016
kerneltime pushed a commit to vmware-archive/kubernetes-archived that referenced this issue Dec 9, 2016
k8s-github-robot pushed a commit that referenced this issue Dec 10, 2016
Automatic merge from submit-queue (batch tested with PRs 34002, 38535, 37330, 38522, 38423)

Fix panic in vSphere cloud provider

Currently the vSphere Cloud Provider triggers a panic in the controller-manager pod. This is because it queries the VC for the cluster name. We have eliminated that code from the vSphere cloud provider.

Fixes #36295
kerneltime pushed a commit to vmware-archive/kubernetes-archived that referenced this issue Jan 11, 2017
k8s-github-robot pushed a commit that referenced this issue Jan 20, 2017
…-kubernetes-release-1.5

Automatic merge from submit-queue

Automated cherry pick of #38423

Cherry pick of #38423 on release-1.5.

#38423: Fix panic in vSphere cloud provider. Fixes #36295
@nagavenkatab

Getting the same issue for kube-controller-manager and kubelet on this version:

kubectl version

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:52:34Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
