
Block changing nonMasqueradeCIDR #1738

Open
justinsb opened this issue Feb 1, 2017 · 10 comments
Labels
lifecycle/frozen, lifecycle/rotten, P0

Comments

justinsb (Member) commented Feb 1, 2017

Changing nonMasqueradeCIDR on an existing cluster does not end well, because the service IPs end up out of range and cannot be changed.

We should either come up with a way to rejig the service IPs, or just prohibit this change entirely in validation.

@blakebarnett

For some additional context: I discovered that Calico is the problem when changing this CIDR in our clusters. It should be possible to change it and do a rolling update so that all pods come up cleanly on the new network, but for some reason, even if you re-run Calico's config-calico one-time job again, it ADDS the new CIDR to the configuration rather than replacing the old one, and all the entries in etcd for Calico's pod assignments stay the same.
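A quick way to confirm this state is to list the configured IP pools; a minimal sketch, assuming a reasonably recent calicoctl:

    # After the CIDR change, both the old and the new pool CIDRs show up here.
    calicoctl get ippool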

@caseydavenport (Member)

> even if you re-run Calico's config-calico one-time job again, it ADDS the new CIDR to the configuration rather than replacing the old one

Yeah, that's expected. You can still delete the old one, but it's an extra calicoctl command that needs to be run.

> and all the entries in etcd for calico's pod assignments stay the same.

A rolling-update of all Pods in the cluster will fix this so long as it's done after adding the new IP Pool and deleting the old one.

It seems reasonable to block changing this on a live cluster. It's going to require re-configuring a number of components and restarting lots of pods, so it's a pretty disruptive operation.
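In command form that's roughly the following; a minimal sketch only, where the pool name default-ipv4-ippool, the file name new-ip-pool.yaml, and rolling the nodes as the way to recreate pods are assumptions, not details confirmed in this thread:

    # Add a pool covering the new CIDR, then delete the pool for the old one.
    calicoctl apply -f new-ip-pool.yaml          # pool with the new CIDR
    calicoctl delete ippool default-ipv4-ippool  # pool with the old CIDR

    # Recreate every pod so it gets an address from the new pool, for
    # example by rolling all of the nodes:
    kops rolling-update cluster --yes --force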

@blakebarnett

Yeah, I got it to work by doing as you said, but it was definitely not a simple/clean process :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 25, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 24, 2018
@chrislovecnm (Contributor)

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label Jan 24, 2018
@rifelpet removed this from the 1.5.2 milestone Apr 11, 2020
gootik commented Mar 2, 2023

@blakebarnett I know this is very old, so it's a long shot, but do you perhaps still have the steps you went through to make the CIDR change? I'm in the same spot and am wondering how I can make it work.

@blakebarnett

My memory of it is pretty fuzzy, but I'm pretty sure that after making the change and then removing the old CIDR from the Calico configuration, we just did a forced update of all the nodes and things came back online.

gootik commented Mar 2, 2023

@blakebarnett Thank you! Will give it a go and hope for the best :D. Thanks again.

sin-ack commented Aug 6, 2023

As a data point, I had to do this because the default nonMasqueradeCIDR overlaps with Tailscale's IP range (almost exactly, actually), which prevented pods from communicating while Tailscale was running. The procedure was painful, so I'm noting it down here for any future travellers who must change their cluster CIDR despite the warnings. This assumes you're running Calico; I haven't tested with other networking plugins (I'm just happy it's running again). You will have downtime.

  • Change nonMasqueradeCIDR to whatever range you need. I set it to 10.244.0.0/16 since I remembered it being a "safe" range from Flannel.
  • kops update cluster --yes
  • Install calicoctl
  • calicoctl get ippool default-ipv4-ippool -oyaml > new-ip-pool.yaml
  • Edit new-ip-pool.yaml to point to the second half of the new range; for whatever reason, Calico only claims half of the cluster CIDR. In my case I set it to 10.244.128.0/17 (see the sketch after this list).
  • calicoctl delete ippool default-ipv4-ippool
  • calicoctl apply -f new-ip-pool.yaml

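For reference, here are steps 1-2 and 4-7 as commands; a minimal sketch (GNU sed syntax, with the pool name and the illustrative CIDRs from the list above), not a drop-in script:

    # Steps 1-2: `kops edit cluster`, set spec.nonMasqueradeCIDR: 10.244.0.0/16,
    # then push the change out:
    kops update cluster --yes

    # Steps 4-7: export the current default pool, rewrite its cidr to the
    # upper half of the new range, and swap the pool:
    calicoctl get ippool default-ipv4-ippool -o yaml > new-ip-pool.yaml
    sed -i 's|cidr: .*|cidr: 10.244.128.0/17|' new-ip-pool.yaml
    calicoctl delete ippool default-ipv4-ippool
    calicoctl apply -f new-ip-pool.yaml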
At this point your cluster is going to start acting wonky. Press forward.

  • ⚠️ Nuke your entire cluster: kops rolling-update cluster --yes --cloudonly --force (make sure you have backups! I'd also recommend shutting down any ingresses first so the system stops receiving requests)

  • The cluster will eventually come back up, but you'll notice that the output of things like kubectl -n kube-system get po doesn't reflect reality. That's because kube-controller-manager can't start, failing with an error like this:
    failed to mark cidr[100.64.4.0/24] at idx [0] as occupied for node: i-abcdef0123456: cidr 100.64.4.0/24 is out the range of cluster cidr 10.244.0.0/16
    You need to manually remove all the nodes except the new master node (don't worry, your nodes can't join the cluster yet anyway); this lets kube-controller-manager start and try to sync the world back into sanity. (Commands are sketched after this list.)

  • Now calico-node will enter a crash loop because the install-cni container is trying to connect to the old Kubernetes endpoint:
    2023-08-06 19:17:52.822 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://100.64.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-node/token": x509: certificate is valid for 10.244.0.1, 127.0.0.1, not 100.64.0.1
    This is because when you changed the cluster CIDR, the service cluster IPs did not change. Those are very sticky (a Service's clusterIP is immutable), and I don't know of any good way to "reset" it short of recreating the Service. You will need to manually recreate the following services to proceed:

    • default/kubernetes
    • kube-system/kube-dns

    You're gonna have to do this by copying the manifest, deleting the Service via kubectl, and re-applying it with clusterIP and clusterIPs updated to point into the new CIDR (see the sketch after this list).
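The two manual recovery steps above, sketched as commands. Node names, file names, and the example service IPs are illustrative, not taken from a real cluster; pick addresses inside your new service range, and drop metadata fields like resourceVersion and uid from the exported manifests before re-applying:

    # Let kube-controller-manager start: delete every stale Node object
    # except the new control-plane node (names below are placeholders).
    kubectl get nodes
    kubectl delete node i-0aaaaaaaaaaaaaaaa i-0bbbbbbbbbbbbbbbb

    # Recreate the services whose clusterIP is still in the old range.
    kubectl -n default get svc kubernetes -o yaml > kubernetes-svc.yaml
    kubectl -n kube-system get svc kube-dns -o yaml > kube-dns-svc.yaml
    # ...edit both files: set clusterIP/clusterIPs to addresses in the new
    # range (e.g. 10.244.0.1 for kubernetes, 10.244.0.10 for kube-dns) and
    # remove resourceVersion/uid...
    kubectl -n default delete svc kubernetes
    kubectl -n kube-system delete svc kube-dns
    kubectl apply -f kubernetes-svc.yaml -f kube-dns-svc.yaml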

After all of that, your nodes will join the cluster and everything should start working again. One interesting thing to note: even services that still point at a cluster IP in the old range keep working, because kube-proxy seems happy to route any IP. Just to be safe, I'd recommend going through kubectl get -A svc and fixing up all the services so each gets a new IP inside the new cluster CIDR. Good luck!
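A hedged one-liner for that final sweep, assuming the old range started with 100.x (adjust the pattern to whatever your old service CIDR was):

    # List every Service whose clusterIP is still in the old range so you
    # know which ones to recreate with an address in the new range.
    kubectl get svc -A \
      -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP' \
      | grep ' 100\.'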
