PodCIDR not set on the master after moving to a different EC2 instance #5437

Closed
tsuna opened this issue Jul 12, 2018 · 11 comments
tsuna commented Jul 12, 2018

1. What kops version are you running?

Version 1.10.0-alpha.1

2. What Kubernetes version are you running?
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T22:29:25Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
$ kops upgrade cluster $NAME --yes
W0712 15:49:11.154329   67809 s3context.go:210] Unable to read bucket encryption policy: will encrypt using AES256
ITEM	PROPERTY		OLD	NEW
Cluster	KubernetesVersion	1.9.6	1.10.3

Updates applied to configuration.
You can now apply these changes, using `kops update cluster foo`
$ kops update cluster $NAME --yes
W0712 15:50:01.445591   67811 s3context.go:210] Unable to read bucket encryption policy: will encrypt using AES256
I0712 15:50:03.724755   67811 executor.go:103] Tasks: 0 done / 73 total; 31 can run
I0712 15:50:04.132758   67811 executor.go:103] Tasks: 31 done / 73 total; 24 can run
I0712 15:50:04.618563   67811 executor.go:103] Tasks: 55 done / 73 total; 16 can run
I0712 15:50:05.900642   67811 executor.go:103] Tasks: 71 done / 73 total; 2 can run
I0712 15:50:06.270333   67811 executor.go:103] Tasks: 73 done / 73 total; 0 can run
I0712 15:50:06.270454   67811 dns.go:153] Pre-creating DNS records
I0712 15:50:06.585097   67811 update_cluster.go:290] Exporting kubecfg for cluster
kops has set your kubectl context to foo

Cluster changes have been applied to the cloud.


Changes may require instances to restart: kops rolling-update cluster
$ kops rolling-update cluster $NAME --yes
W0712 15:50:25.<snip>   67814 s3context.go:210] Unable to read bucket encryption policy: will encrypt using AES256
NAME			STATUS		NEEDUPDATE	READY	MIN	MAX	NODES
master-us-west-2a	NeedsUpdate	1		0	1	1	1
nodes			NeedsUpdate	2		0	2	2	2
I0712 15:50:27.331775   67814 instancegroups.go:157] Draining the node: "ip-<snip>.us-west-2.compute.internal".
node "ip-<snip>.us-west-2.compute.internal" cordoned
node "ip-<snip>.us-west-2.compute.internal" cordoned
node "ip-<snip>.us-west-2.compute.internal" drained
I0712 15:50:28.300408   67814 instancegroups.go:333] Waiting for 1m30s for pods to stabilize after draining.
I0712 15:51:58.306755   67814 instancegroups.go:273] Stopping instance "i-xxx", node "ip-<snip>.us-west-2.compute.internal", in group "master-us-west-2a.masters.k8s.example.com" (this may take a while).
I0712 15:56:58.743357   67814 instancegroups.go:188] Validating the cluster.
I0712 15:57:28.924945   67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.
I0712 15:58:29.080624   67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.
I0712 15:58:59.462089   67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.
I0712 15:59:29.603733   67814 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.k8s.example.com/api/v1/nodes: dial tcp 1.2.3.4:443: i/o timeout.

In the above output, the IP 1.2.3.4 is the public IP of the old EC2 instance where the old master was running, which kops had just terminated.

5. What happened after the commands executed?

I had to go into Route53 and update the A record in the k8s.example.com zone to point it at the new public IP of the EC2 instance running the new master. Shortly after updating the A record, I finally saw:

I0712 15:59:30.758285   67814 instancegroups.go:249] Cluster validated.
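
For what it's worth, the manual Route53 fix can also be scripted with the AWS CLI; a rough sketch, where the hosted zone ID and the IP below are placeholders rather than my real values:

# Upsert the api A record so it points at the new master's public IP (placeholder values)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.k8s.example.com.",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
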
6. What did you expect to happen?

Validation should have succeeded; kops should have updated the A record in Route53 itself.
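
If I understand the kops architecture correctly, that A record is maintained by the dns-controller addon, so its logs are probably the place to look for why the record wasn't updated; something like:

# Check the dns-controller logs (deployment name as shipped by the stock kops addon)
kubectl -n kube-system logs deployment/dns-controller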

7. Please provide your cluster manifest.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-06-22T23:09:49Z
  name: k8s.example.com
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://example.com-k8s-state-store/k8s.example.com
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.10.3
  masterPublicName: api.k8s.example.com
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-06-22T23:09:49Z
  labels:
    kops.k8s.io/cluster: k8s.example.com
  name: master-us-west-2a
spec:
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-06-22T23:09:49Z
  labels:
    kops.k8s.io/cluster: k8s.example.com
  name: nodes
spec:
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  machineType: t2.small
  maxSize: 2
  minSize: 2
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-west-2a

8. Anything else we need to know?

The master still didn't come up successfully. I posted logs for kubelet, api-server, and controller-manager here: https://gist.github.com/tsuna/594fef65be39ecd7e0ffe05bf8113998

Of interest is Unable to update cni config: No networks found in /etc/cni/net.d/ (the directory is indeed empty), which I think led to a stream of errors like

Jul 12 22:53:38 ip-172-x-y-z kubelet[1635]: E0712 22:53:38.225953 1635 kubelet.go:2130] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR

as well as other connection errors trying to reach the api-server.

tsuna commented Jul 16, 2018

The fact that the master's API DNS entry wasn't updated in Route53 seems to be a dup of #5289, but there is still the other issue (which may or may not be related) of Kubenet does not have netConfig, which prevents the controller-manager from starting.

tsuna changed the title from "kops didn't update Route53 record pointing to api-server, upgrade failed" to "PodCIDR not set on the master after moving to a different EC2 instance" on Jul 19, 2018
tsuna commented Jul 19, 2018

So I figured it out. There were two issues:

  1. The Route53 entry for the apiserver's external IP wasn't updated. Workaround: go into Route53 and manually update the DNS record to point to the IP of the new EC2 instance.
  2. The PodCIDR wasn't set on the new master node.

Since the first problem is already covered by #5289, I'm making this issue only about problem 2.

kubectl get nodes was still showing the old masters, and kubectl get node ip-172-x-y-z.us-west-2.compute.internal --template={{.spec.podCIDR}} returned <no value> for all of them except the oldest one (initially provisioned by kops), which was the only one correctly returning 100.96.0.0/24.
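
A quicker way to spot the affected nodes is to print every node's podCIDR in one go; a small sketch (the column headers are just labels I picked):

# Show each node alongside its assigned pod CIDR; nodes that never got one show <none>
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR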

To fix this, I had to kubectl edit node ip-172-x-y-z.us-west-2.compute.internal and manually set podCIDR: 100.96.0.0/24 in the spec. As soon as I did this, kubelet reacted to the change:

Jul 19 05:23:31 ip-<snip> kubelet[1441]: E0719 05:23:31.296647    1441 kubelet.go:2130] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:34.<snip>    1441 kuberuntime_manager.go:917] updating runtime config through cri with podcidr 100.96.0.0/24
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:34.<snip>    1441 docker_service.go:340] docker cri received runtime config &RuntimeConfig{NetworkConfig:&NetworkConfig{PodCidr:100.96.0.0/24,},}
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:34.<snip>    1441 kubenet_linux.go:258] CNI network config set to {
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "cniVersion": "0.1.0",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "name": "kubenet",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "type": "bridge",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "bridge": "cbr0",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "mtu": 9001,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "addIf": "eth0",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "isGateway": true,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "ipMasq": false,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "hairpinMode": false,
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "ipam": {
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "type": "host-local",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "subnet": "100.96.0.0/24",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "gateway": "100.96.0.1",
Jul 19 05:23:34 ip-<snip> kubelet[1441]: "routes": [
Jul 19 05:23:34 ip-<snip> kubelet[1441]: { "dst": "0.0.0.0/0" }
Jul 19 05:23:34 ip-<snip> kubelet[1441]: ]
Jul 19 05:23:34 ip-<snip> kubelet[1441]: }
Jul 19 05:23:34 ip-<snip> kubelet[1441]: }
Jul 19 05:23:34 ip-<snip> kubelet[1441]: I0719 05:23:34.<snip>    1441 kubelet_network.go:196] Setting Pod CIDR:  -> 100.96.0.0/24
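
If anyone else hits this, the same fix can be applied non-interactively with kubectl patch instead of kubectl edit; a minimal sketch using the node name and CIDR from my cluster (adjust both for yours):

# Set the missing podCIDR directly on the node object (this only works while the field is still unset)
kubectl patch node ip-172-x-y-z.us-west-2.compute.internal -p '{"spec":{"podCIDR":"100.96.0.0/24"}}'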

H/T kubernetes/kubernetes#32900 for putting me on the right track.

Now the question is why did kops not set this properly on the new master node?
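
A guess on my part rather than something I've verified on the broken master: podCIDR is normally handed out by the controller-manager's node IPAM, so one thing worth checking is whether it is running with CIDR allocation enabled. For example:

# On the master: check the controller-manager flags (path assumes the usual kops static pod layout)
grep -E 'allocate-node-cidrs|cluster-cidr' /etc/kubernetes/manifests/kube-controller-manager.manifest

# Or inspect the live process
ps aux | grep '[k]ube-controller-manager'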

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Oct 17, 2018
tsuna commented Oct 17, 2018

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Oct 17, 2018
@aelmanaa

Had exactly the same issue. Did anyone find the root cause?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Feb 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 23, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@antoninbeaufort

/reopen
/remove-lifecycle rotten

@k8s-ci-robot

@antoninbeaufort: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot removed the lifecycle/rotten label on Jul 24, 2019