
Multi AZ on existing VPC not working with Calico #4466

Closed
jeffutter opened this issue Feb 20, 2018 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jeffutter commented Feb 20, 2018

1. What kops version are you running?

1.8.1

2. What Kubernetes version are you running?

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:54Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Sorry, this isn't a very simple setup. I'm installing kops into existing VPCs using terraform output.

I generate the initial config with:

kops create cluster --master-zones "us-east-1a" --zones="us-east-1a,us-east-1b,us-east-1c" --topology=private --dns-zone="MY_ZONE_ID" --networking=calico --vpc="vpc-00000000" --state="s3://my-state-bucket" --node-size=t2.medium --master-size=t2.small --node-count=4 --master-count=1 --target=terraform --out=. mycluster.com

Then I run kops edit cluster and set

networking:
    calico:
      crossSubnet: true

and alter the subnet section to reference my existing VPCs/Subnets.
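
For context, the edited subnets section ends up looking roughly like this (the subnet IDs and names below are placeholders rather than my real values, and I've omitted the Utility subnets that the private topology also uses):

```yaml
# Illustrative placeholders only - the real IDs come from the pre-existing subnets
subnets:
- id: subnet-0aaaaaaa
  name: us-east-1a
  type: Private
  zone: us-east-1a
- id: subnet-0bbbbbbb
  name: us-east-1b
  type: Private
  zone: us-east-1b
- id: subnet-0ccccccc
  name: us-east-1c
  type: Private
  zone: us-east-1c
```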

Then I run kops update cluster --out . --target=terraform mycluster.com, and finally terraform plan and terraform apply.

5. What happened after the commands executed?

The cluster was created, but containers were unable to network with each other across availability zones. This mostly manifested as DNS errors, since containers couldn't reach kube-dns instances running on other nodes. Containers on nodes in the same AZ could talk to each other.

I redid the above process with weave networking and everything worked ok.

6. What did you expect to happen?

I expected containers across availability zones to be able to talk to each other.

7. Please provide your cluster manifest.

I have replaced the cluster with one running Weave to get unblocked. If the manifest would be really helpful, I can kill my cluster and recreate a broken one.

@gambol99 (Contributor)

I wonder if this has anything to do with source/destination checking in AWS... Reading the docs, they mention a custom deployment to switch this off. Can you check that it has been switched off, or share the output from that pod?
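
Something along these lines should show it from the AWS CLI (the instance ID is just a placeholder):

```sh
# Is source/destination checking still enabled on the node?
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute sourceDestCheck

# Disable it manually if the controller hasn't done so
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check
```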

@mikesplain (Contributor) commented Feb 26, 2018

I'm actually running into the same issue after upgrading to 1.8.1. I'm rolling back now to see if that fixes it. My src/dest check was on and the k8s-ec2-srcdst was blowing up:

```
I0223 20:09:12.447857       1 main.go:42] k8s-ec2-srcdst: v0.2.1
W0223 20:11:02.309232       1 reflector.go:334] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: watch of *v1.Node ended with: very short watch: github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Unexpected watch close - watch lasted less than a second and no items received
E0223 20:11:03.344003       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0223 20:11:04.345590       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0223 20:11:35.346182       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0223 20:12:06.352246       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0223 20:12:37.353185       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0223 20:13:08.354588       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
I0223 20:13:10.049928       1 srcdst_controller.go:96] Marking node ip-10-24-97-86.ec2.internal with SrcDstCheckDisabledAnnotation
E0223 20:13:10.064341       1 runtime.go:66] Observed a panic: &runtime.TypeAssertionError{interfaceString:"interface {}", concreteString:"cache.DeletedFinalStateUnknown", assertedString:"*v1.Node", missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/asm_amd64.s:509
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/panic.go:491
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/iface.go:172
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/pkg/controller/srcdst_controller.go:64
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/pkg/controller/srcdst_controller.go:51
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:209
<autogenerated>:1
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:320
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:451
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:150
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:124
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/vendor/k8s.io/client-go/tools/cache/controller.go:124
/home/travis/gopath/src/github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/proc.go:185
/home/travis/.gimme/versions/go1.9.linux.amd64/src/runtime/asm_amd64.s:2337
E0226 20:24:36.103527       1 reflector.go:315] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to watch *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=81101379&timeoutSeconds=582&watch=true: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0226 20:24:37.104560       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0226 20:24:38.105463       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: connection refused
E0226 20:25:09.106146       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0226 20:25:40.107225       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0226 20:26:11.114946       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0226 20:26:42.115518       1 reflector.go:205] github.com/ottoyiu/k8s-ec2-srcdst/cmd/k8s-ec2-srcdst/main.go:48: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
```

I also run kube2iam so it may be related... looking into it

@gambol99 (Contributor)

cc @ottoyiu

@mikesplain (Contributor)

I was able to replicate this. It looks like once k8s-ec2-srcdst hits a panic like the one above, it locks up and neither restarts nor continues to fulfill its duties. Once I restarted it, it continued marking nodes as it should.

The only other useful info I can provide is that I was rolling-updating the masters from 1.8.0 to 1.8.1 when I noticed this. Based on the timing, I think the master that k8s-ec2-srcdst was talking to was rolled over, causing the exception, but the controller never attempted to reconnect.

@ottoyiu (Contributor) commented Feb 26, 2018

Seems like k8s-ec2-srcdst can't access 100.64.0.1, the Kubernetes apiserver service IP. Is it being scheduled on one of the masters? If so, is kube-proxy working as expected?
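
A rough way to check, assuming the usual kube-system deployment (pod names will differ):

```sh
# Where did the srcdst controller land?
kubectl -n kube-system get pods -o wide | grep k8s-ec2-srcdst

# From that node: is the apiserver service VIP reachable at all?
# Any HTTP response (even 401/403) means kube-proxy is forwarding;
# connection refused or a timeout means the VIP isn't being programmed.
curl -k https://100.64.0.1/healthz

# Are there iptables rules for the VIP on that node?
sudo iptables-save | grep 100.64.0.1
```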

That said, this controller is due for a rewrite to something more modern since it's written in a very old informer controller style...

@mikesplain (Contributor)

@ottoyiu yep, it's on the masters. Nothing odd being logged in kube-proxy, everything else seems fine. This doesn't seem to recover unless it is recreated, for me at least. Any way we can have the container fail when an error like that occurs so kubernetes will restart it?
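
Maybe something as simple as registering a panic handler that exits, so the kubelet restarts the pod? A rough sketch, assuming the vendored k8s.io/apimachinery runtime helpers that show up in the stack trace above (not tested against the actual controller):

```go
package main

import (
	"os"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

// exitOnPanic registers an extra handler with the shared crash machinery
// (the same HandleCrash that logged "Observed a panic" above) so that any
// recovered panic terminates the process; the kubelet then restarts the pod
// with a fresh watch instead of leaving the controller wedged.
func exitOnPanic() {
	utilruntime.PanicHandlers = append(utilruntime.PanicHandlers, func(r interface{}) {
		os.Exit(1)
	})
	// Belt and braces: have HandleCrash re-panic after running the handlers.
	utilruntime.ReallyCrash = true
}

func main() {
	exitOnPanic()
	// ... existing k8s-ec2-srcdst controller setup would follow here ...
}
```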

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label on May 28, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) label and removed the lifecycle/stale label on Jun 27, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
