
Multi AZ on existing VPC not working with Calico #4466

Closed
jeffutter opened this issue Feb 20, 2018 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jeffutter commented Feb 20, 2018

1. What kops version are you running?

1.8.1

2. What Kubernetes version are you running?

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:54Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Sorry, this isn't a very simple setup. I'm installing kops into existing VPCs using terraform output.

I generate the initial config with:

kops create cluster --master-zones "us-east-1a" --zones="us-east-1a,us-east-1b,us-east-1c" --topology=private --dns-zone="MY_ZONE_ID" --networking=calico --vpc="vpc-00000000" --state="s3://my-state-bucket" --node-size=t2.medium --master-size=t2.small --node-count=4 --master-count=1 --target=terraform --out=. mycluster.com

Then I run kops edit cluster and set

networking:
    calico:
      crossSubnet: true

and alter the subnet section to reference my existing VPCs/Subnets.
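
For context, the edited subnets section ends up looking roughly like this (the subnet IDs and names below are placeholders rather than my real values, and I've omitted the Utility subnets that the private topology also uses):

```yaml
# Illustrative placeholders only - the real IDs come from the pre-existing subnets
subnets:
- id: subnet-0aaaaaaa
  name: us-east-1a
  type: Private
  zone: us-east-1a
- id: subnet-0bbbbbbb
  name: us-east-1b
  type: Private
  zone: us-east-1b
- id: subnet-0ccccccc
  name: us-east-1c
  type: Private
  zone: us-east-1c
```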

Then I run kops update cluster --out . --target=terraform mycluster.com, and finally terraform plan and terraform apply.

5. What happened after the commands executed?

The cluster was created, but containers were unable to network with each other across availability zones. This mostly manifested as DNS errors, since containers couldn't reach kube-dns instances running on other nodes. Containers on nodes in the same AZ could talk to each other.

I redid the above process with weave networking and everything worked ok.

6. What did you expect to happen?

I expected containers across availability zones to be able to talk to each other.

7. Please provide your cluster manifest.

I have replaced the cluster with one running Weave to get unblocked. If the manifest would be really helpful, I can kill my cluster and recreate a broken one.

@gambol99 (Contributor)

I wonder if this has anything to do with source/destination checking in AWS... Reading the docs, they mention a custom deployment to switch this off. Can you check that it has been switched off, or share the output from that pod?
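
Something along these lines should show it from the AWS CLI (the instance ID is just a placeholder):

```sh
# Is source/destination checking still enabled on the node?
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute sourceDestCheck

# Disable it manually if the controller hasn't done so
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check
```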

@mikesplain (Contributor) commented Feb 26, 2018

I'm actually running into the same issue after upgrading to 1.8.1. I'm rolling back now to see if that fixes it. My src/dest check was on and the k8s-ec2-srcdst was blowing up:

```
I0223 20:09:12.447857       1 main.go:42] k8s-ec2-srcdst: v0.2.1
W0223 20:11:02.309232       1 reflector.go:334] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: watch of *v1.Node ended with: very short watch: github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Unexpected watch close - watch lasted less than a second and no items received
E0223 20:11:03.344003       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0223 20:11:04.345590       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0223 20:11:35.346182       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0223 20:12:06.352246       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0223 20:12:37.353185       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0223 20:13:08.354588       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
I0223 20:13:10.049928       1 srcdst_controller.go:96] Marking node ip-10-24-97-86.ec2.internal with SrcDstCheckDisabledAnnotation
E0223 20:13:10.064341       1 runtime.go:66] Observed a panic: &runtime.TypeAssertionError{interfaceString:"interface {}", concreteString:"cache.DeletedFinalStateUnknown", assertedString:"*v1.Node", missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/asm_amd64.s:509
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/panic.go:491
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/iface.go:172
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/pkg/controller/srcdst_controller.go:64
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/pkg/controller/srcdst_controller.go:51
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:209
<autogenerated>:1
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:320
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:451
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:150
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:124
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:124
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/proc.go:185
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/asm_amd64.s:2337
E0226 20:24:36.103527       1 reflector.go:315] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to watch *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=81101379&timeoutSeconds=582&watch=true: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0226 20:24:37.104560       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0226 20:24:38.105463       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0226 20:25:09.106146       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0226 20:25:40.107225       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0226 20:26:11.114946       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0226 20:26:42.115518       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
```

I also run kube2iam so it may be related... looking into it

@gambol99 (Contributor)

cc @ottoyiu

@mikesplain (Contributor)

I was able to replicate this. It looks like once k8s-ec2-srcdst hits a panic like the one above, it locks up and neither restarts nor continues to fulfill its duties. Once I restarted it, it continued marking nodes as it should.

The only other useful info I can provide is that I was rolling-updating the masters from 1.8.0 to 1.8.1 when I noticed this. Based on the timing, I think the master that k8s-ec2-srcdst was talking to was rolled over, causing the exception, but the controller never attempted to reconnect.

@ottoyiu (Contributor) commented Feb 26, 2018

Seems like k8s-ec2-srcdst can't access 100.64.0.1, the Kubernetes apiserver service IP. Is it being scheduled on one of the masters? If so, is kube-proxy working as expected?
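
A rough way to check, assuming the usual kube-system deployment (pod names will differ):

```sh
# Where did the srcdst controller land?
kubectl -n kube-system get pods -o wide | grep k8s-ec2-srcdst

# From that node: is the apiserver service VIP reachable at all?
# Any HTTP response (even 401/403) means kube-proxy is forwarding;
# connection refused or a timeout means the VIP isn't being programmed.
curl -k https://100.64.0.1/healthz

# Are there iptables rules for the VIP on that node?
sudo iptables-save | grep 100.64.0.1
```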

That said, this controller is due for a rewrite to something more modern since it's written in a very old informer controller style...

@mikesplain (Contributor)

@ottoyiu yep, it's on the masters. Nothing odd being logged in kube-proxy, everything else seems fine. This doesn't seem to recover unless it is recreated, for me at least. Any way we can have the container fail when an error like that occurs so kubernetes will restart it?
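
Maybe something as simple as registering a panic handler that exits, so the kubelet restarts the pod? A rough sketch, assuming the vendored k8s.io/apimachinery runtime helpers that show up in the stack trace above (not tested against the actual controller):

```go
package main

import (
	"os"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

// exitOnPanic registers an extra handler with the shared crash machinery
// (the same HandleCrash that logged "Observed a panic" above) so that any
// recovered panic terminates the process; the kubelet then restarts the pod
// with a fresh watch instead of leaving the controller wedged.
func exitOnPanic() {
	utilruntime.PanicHandlers = append(utilruntime.PanicHandlers, func(r interface{}) {
		os.Exit(1)
	})
	// Belt and braces: have HandleCrash re-panic after running the handlers.
	utilruntime.ReallyCrash = true
}

func main() {
	exitOnPanic()
	// ... existing k8s-ec2-srcdst controller setup would follow here ...
}
```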

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label on May 28, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) label and removed the lifecycle/stale label on Jun 27, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
