
gce: getInstancesByNames lookup by canonicalized names #26897

Closed
glerchundi opened this issue Jun 6, 2016 · 5 comments
Labels
area/os/coreos, lifecycle/rotten, sig/node

Comments

@glerchundi

Hi guys,

I'm having a lot of trouble getting my cluster up with provider load balancers, specifically with GCE. The controller-manager fails when trying to retrieve instances by hostname (using the getInstancesByNames method):

I0606 16:49:08.399340       1 gce.go:474] EnsureLoadBalancer(aa271f4332bfa11e69cfe42010a00000, us-central1, <nil>, [TCP/80 TCP/443], [k8s-worker-lx53.c.mr-potato.internal k8s-worker-xog8.c.mr-potato.internal k8s-worker-1pg4.c.mr-potato.internal k8s-worker-buvc.c.mr-potato.internal], default/lb, map[])
E0606 16:49:08.464763       1 gce.go:2368] Failed to retrieve instance: "k8s-worker-lx53.c.mr-potato.internal"
E0606 16:49:08.464894       1 servicecontroller.go:196] Failed to process service delta. Retrying in 5m0s: Failed to create load balancer for service default/lb: instance not found

I compared the GCE access implementation used in 1.1.4 (I ran into this while trying to upgrade a cluster from 1.1.4 to 1.2.3) with the master implementation, and something caught my attention.

Why doesn't this piece of code use canonicalizeInstanceName-ed names instead of the ones provided by the caller (by caller I mean the full machine hostname)?

https://github.com/kubernetes/kubernetes/blob/v1.2.3/pkg/cloudprovider/providers/gce/gce.go#L2364-L2372

Maybe this is the reason why I'm getting 'instance not found' all the time.
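To illustrate what I mean, here is a small standalone sketch. canonicalizeInstanceName mirrors the helper that already exists in gce.go; the main function and the sample names are just mine, not code from the repo:

package main

import (
	"fmt"
	"strings"
)

// canonicalizeInstanceName mirrors the helper of the same name in
// pkg/cloudprovider/providers/gce/gce.go: it reduces a full internal
// hostname like "k8s-worker-lx53.c.mr-potato.internal" to the bare GCE
// instance name "k8s-worker-lx53".
func canonicalizeInstanceName(name string) string {
	if ix := strings.Index(name, "."); ix != -1 {
		name = name[:ix]
	}
	return name
}

func main() {
	// These are the hostnames the service controller hands to
	// getInstancesByNames on my CoreOS nodes; the GCE API only knows the
	// short names, so the lookup could canonicalize them before comparing.
	names := []string{
		"k8s-worker-lx53.c.mr-potato.internal",
		"k8s-worker-xog8.c.mr-potato.internal",
	}
	for _, n := range names {
		fmt.Printf("%s -> %s\n", n, canonicalizeInstanceName(n))
	}
}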

Thanks in advance,

P.S.: Before filing this I searched for something related on Stack Overflow and asked in Slack, with no answers at all...

@glerchundi
Author

glerchundi commented Jun 7, 2016

I firmly believe the issue is exactly what I reported: I was finally able to work around it by changing the instance templates to override the hostname (via the kubelet --hostname-override=$(hostname -s) arg), which turns the current CoreOS machine FQDN (k8s-worker-lx53.c.mr-potato.internal) into its shortened form (k8s-worker-lx53), and after that everything worked properly.

Apparently the difference lies in how CoreOS handles uname -n, which outputs the FQDN. In contrast, Debian (GKE?) shortens the hostname: https://github.com/kubernetes/kubernetes/blob/v1.2.3/pkg/util/node/node.go#L29-L38.
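For reference, here is my rough reading of how the node name gets picked; this is only an approximation of the GetHostname helper in node.go, not a verbatim copy:

package main

import (
	"fmt"
	"os"
	"strings"
)

// getHostname approximates what pkg/util/node/node.go does: if
// --hostname-override is set, use it as-is; otherwise fall back to the OS
// hostname, which on CoreOS/GCE is the FQDN
// (k8s-worker-lx53.c.mr-potato.internal) rather than the short name.
func getHostname(hostnameOverride string) (string, error) {
	hostname := hostnameOverride
	if hostname == "" {
		h, err := os.Hostname()
		if err != nil {
			return "", err
		}
		hostname = h
	}
	return strings.ToLower(strings.TrimSpace(hostname)), nil
}

func main() {
	// With --hostname-override=$(hostname -s) the override wins, so the
	// node registers as "k8s-worker-lx53" and the GCE lookup succeeds.
	name, err := getHostname("k8s-worker-lx53")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(name)
}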

I can send a PR fixing this if you can confirm the behaviour on your side.

@j3ffml j3ffml added area/os/coreos sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 28, 2016
@shahidhk

I have also faced this issue: k8s v1.3.4 on CoreOS on GCE. The instance name is set to instance-1, but the hostname comes out as instance-1.c.project-1.internal.

$ hostname
instance-1.c.project-1.internal
$ hostname -s
instance-1
$ uname -n
instance-1.c.project-1.internal

Now when I try to create a load balancer for nginx, this happens (log from kube-controller-manager):

E0827 20:11:36.650762       1 gce.go:2609] Failed to retrieve instance: "instance-1.c.project-1.internal"
E0827 20:11:36.650867       1 servicecontroller.go:201] Failed to process service delta. Retrying in 5m0s: Failed to create load balancer for service default/nginx: instance not found

As @glerchundi mentioned, can somebody please confirm this behaviour?

@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 16, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/close
