
Block changing nonMasqueradeCIDR #1738

Open
justinsb opened this issue Feb 1, 2017 · 10 comments
Labels
lifecycle/frozen, lifecycle/rotten, P0

Comments

justinsb (Member) commented Feb 1, 2017

Changing nonMasqueradeCIDR on an existing cluster does not end well, because the service IPs end up out of range and cannot be changed.

We should either come up with a way to rejig the service IPs, or just prohibit this change entirely in validation.

@blakebarnett

For some additional context: I discovered that Calico is the problem when changing this CIDR in our clusters. It should be possible to change it and do a rolling update so that all pods come up cleanly on the new network, but for some reason, even if you re-run Calico's config-calico one-time job again, it ADDS the new CIDR to the configuration rather than replacing the old one, and all the entries in etcd for Calico's pod assignments stay the same.
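A quick way to confirm this state is to list the configured IP pools; a minimal sketch, assuming a reasonably recent calicoctl:

    # After the CIDR change, both the old and the new pool CIDRs show up here.
    calicoctl get ippool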

@caseydavenport (Member)

> even if you re-run Calico's config-calico one-time job again, it ADDS the new CIDR to the configuration rather than replacing the old one

Yeah, that's expected. You can still delete the old one, but it's an extra calicoctl command that needs to be run.

> and all the entries in etcd for calico's pod assignments stay the same.

A rolling-update of all Pods in the cluster will fix this so long as it's done after adding the new IP Pool and deleting the old one.

It seems reasonable to block changing this on a live cluster. It's going to require re-configuring a number of components and restarting lots of pods, so it's a pretty disruptive operation.
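In command form that's roughly the following; a minimal sketch only, where the pool name default-ipv4-ippool, the file name new-ip-pool.yaml, and rolling the nodes as the way to recreate pods are assumptions, not details confirmed in this thread:

    # Add a pool covering the new CIDR, then delete the pool for the old one.
    calicoctl apply -f new-ip-pool.yaml          # pool with the new CIDR
    calicoctl delete ippool default-ipv4-ippool  # pool with the old CIDR

    # Recreate every pod so it gets an address from the new pool, for
    # example by rolling all of the nodes:
    kops rolling-update cluster --yes --force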

@blakebarnett

Yeah, I got it to work by doing as you said, but it was definitely not a simple/clean process :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 25, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 24, 2018
@chrislovecnm (Contributor)

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label Jan 24, 2018
@rifelpet removed this from the 1.5.2 milestone Apr 11, 2020
gootik commented Mar 2, 2023

@blakebarnett I know this is very old, so it's a long shot, but do you perhaps still have the steps you went through to make the CIDR change? I'm in the same spot and am wondering how I can make it work.

@blakebarnett

My memory of it is pretty fuzzy, but I'm pretty sure that after making the change and then removing the old CIDR from the Calico configuration, we just did a forced update of all the nodes and things came back online.

gootik commented Mar 2, 2023

@blakebarnett Thank you! Will give it a go and hope for the best :D. Thanks again.

sin-ack commented Aug 6, 2023

As a data point, I had to do this because the default nonMasqueradeCIDR overlaps with Tailscale's IP range (almost exactly, actually), which prevented pods from communicating while Tailscale was running. The procedure was painful, so I'm noting it down here for any future travellers who must change their cluster CIDR despite the warnings. This assumes you're running Calico; I haven't tested with other networking plugins (I'm just happy it's running again). You will have downtime.

  • Change nonMasqueradeCIDR to whatever range you need. I set it to 10.244.0.0/16 since I remembered it being a "safe" range from Flannel.
  • kops update cluster --yes
  • Install calicoctl
  • calicoctl get ippool default-ipv4-ippool -oyaml > new-ip-pool.yaml
  • Edit new-ip-pool.yaml to point to the second half of the new range; for whatever reason, Calico only claims half of the cluster CIDR. In my case I set it to 10.244.128.0/17 (see the sketch after this list).
  • calicoctl delete ippool default-ipv4-ippool
  • calicoctl apply -f new-ip-pool.yaml

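For reference, here are steps 1-2 and 4-7 as commands; a minimal sketch (GNU sed syntax, with the pool name and the illustrative CIDRs from the list above), not a drop-in script:

    # Steps 1-2: `kops edit cluster`, set spec.nonMasqueradeCIDR: 10.244.0.0/16,
    # then push the change out:
    kops update cluster --yes

    # Steps 4-7: export the current default pool, rewrite its cidr to the
    # upper half of the new range, and swap the pool:
    calicoctl get ippool default-ipv4-ippool -o yaml > new-ip-pool.yaml
    sed -i 's|cidr: .*|cidr: 10.244.128.0/17|' new-ip-pool.yaml
    calicoctl delete ippool default-ipv4-ippool
    calicoctl apply -f new-ip-pool.yaml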
At this point your cluster is going to start acting wonky. Press forward.

  • ⚠️ Nuke your entire cluster: kops rolling-update cluster --yes --cloudonly --force (make sure you have backups! I'd also recommend shutting down any ingresses first so the system stops receiving requests)

  • The cluster will eventually come back up, but you'll notice that the output of things like kubectl -n kube-system get po doesn't reflect reality. That's because kube-controller-manager can't start, failing with an error like this:
    failed to mark cidr[100.64.4.0/24] at idx [0] as occupied for node: i-abcdef0123456: cidr 100.64.4.0/24 is out the range of cluster cidr 10.244.0.0/16
    You need to manually remove all the nodes except the new master node (don't worry, your nodes can't join the cluster yet anyway); this lets kube-controller-manager start and try to sync the world back into sanity. (Commands are sketched after this list.)

  • Now calico-node will enter a crash loop because the install-cni container is trying to connect to the old Kubernetes endpoint:
    2023-08-06 19:17:52.822 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://100.64.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-node/token": x509: certificate is valid for 10.244.0.1, 127.0.0.1, not 100.64.0.1
    This is because when you changed the cluster CIDR, the service cluster IPs did not change. Those are very sticky (a Service's clusterIP is immutable), and I don't know of any good way to "reset" it short of recreating the Service. You will need to manually recreate the following services to proceed:

    • default/kubernetes
    • kube-system/kube-dns

    You're gonna have to do this by copying the manifest, deleting the Service via kubectl, and re-applying it with clusterIP and clusterIPs updated to point into the new CIDR (see the sketch after this list).
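The two manual recovery steps above, sketched as commands. Node names, file names, and the example service IPs are illustrative, not taken from a real cluster; pick addresses inside your new service range, and drop metadata fields like resourceVersion and uid from the exported manifests before re-applying:

    # Let kube-controller-manager start: delete every stale Node object
    # except the new control-plane node (names below are placeholders).
    kubectl get nodes
    kubectl delete node i-0aaaaaaaaaaaaaaaa i-0bbbbbbbbbbbbbbbb

    # Recreate the services whose clusterIP is still in the old range.
    kubectl -n default get svc kubernetes -o yaml > kubernetes-svc.yaml
    kubectl -n kube-system get svc kube-dns -o yaml > kube-dns-svc.yaml
    # ...edit both files: set clusterIP/clusterIPs to addresses in the new
    # range (e.g. 10.244.0.1 for kubernetes, 10.244.0.10 for kube-dns) and
    # remove resourceVersion/uid...
    kubectl -n default delete svc kubernetes
    kubectl -n kube-system delete svc kube-dns
    kubectl apply -f kubernetes-svc.yaml -f kube-dns-svc.yaml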

After all of that, your nodes will join the cluster and everything should start working again. One interesting thing to note: even services that still point at a cluster IP in the old range keep working, because kube-proxy seems happy to route any IP. Just to be safe, I'd recommend going through kubectl get -A svc and fixing up all the services so each gets a new IP inside the new cluster CIDR. Good luck!
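A hedged one-liner for that final sweep, assuming the old range started with 100.x (adjust the pattern to whatever your old service CIDR was):

    # List every Service whose clusterIP is still in the old range so you
    # know which ones to recreate with an address in the new range.
    kubectl get svc -A \
      -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP' \
      | grep ' 100\.'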
