
Bootstrap channel DaemonSets have updateStrategy: RollingUpdate #7971

Closed
stevenjm opened this issue Nov 20, 2019 · 11 comments · Fixed by #10167

Comments

@stevenjm
Contributor

stevenjm commented Nov 20, 2019

1. What kops version are you running? The command kops version will display
this information.

Version 1.14.1 (git-b7c25f9a9)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"archive", BuildDate:"2019-09-21T09:43:39Z", GoVersion:"go1.12.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.10", GitCommit:"e3c134023df5dea457638b614ee17ef234dc34a6", GitTreeState:"clean", BuildDate:"2019-07-08T03:40:54Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops update cluster --target cloudformation --out /tmp using kops 1.14.1, on a cluster using weave for networking that was last updated with kops 1.12.3.

5. What happened after the commands executed?

As soon as the new version of weave was installed, the cluster initiated a rolling restart of the weave pods, which caused a brief but visible network interruption on each node.

6. What did you expect to happen?

I would not expect any weave pods to be replaced until I ask kops to perform a rolling update, given that they are essential for the pods on a node to function correctly.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

https://gist.github.com/stevenjm/c7eccd8b3fc20c675df61dfa3d6e9ecc

9. Anything else we need to know?

In the case of weave, this is an explicit setting that appears to have been inherited from upstream. I suspect this makes sense for weave upstream, as they can't assume the cluster is managed in such a way that nodes can easily be updated in place, but kops allows us to make this assumption and set updateStrategy to OnDelete. This will let a kops rolling-update replace these pods for us.

For this to work correctly, the node's user data must include a checksum of the bootstrap channel so that kops knows that nodes need to be replaced when weave is updated. See also my suggested approach to fixing #7970.
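For illustration, the proposed change to the weave-net DaemonSet manifest would look something like the fragment below (a sketch against the apps/v1 DaemonSet API; the manifest the bootstrap channel actually renders has many more fields):

```yaml
# Sketch only: switch the DaemonSet from RollingUpdate to OnDelete.
# With OnDelete, the controller only creates a replacement pod after the
# old one is deleted -- e.g. when kops rolling-update replaces the node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: weave-net
  namespace: kube-system
spec:
  updateStrategy:
    type: OnDelete   # instead of the current RollingUpdate
```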

@rifelpet
Member

I think that this is a good idea, but we should look more into the implications of doing so. We'd need to make it clear to users that DaemonSet updates won't take effect on their nodes until they have performed a rolling-update, but in general I think it would be good to have the new pod definitions only used on new nodes. The only concern might be if a CNI upgrade isn't backwards compatible; this would increase the amount of time that the cluster is "split-brained", but I suppose we could deal with that when the time comes.

Another situation where this would have helped is #7926

cc: @mikesplain

@rifelpet
Copy link
Member

Additionally, we could add logic to kops rolling-update that marks nodes as NeedsUpdate not only if they aren't on the latest LaunchConfiguration or LaunchTemplate, but also if the node's DaemonSet pods are not from the latest version of the DaemonSet.
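One way that check could be sketched, assuming we rely on the controller-revision-hash label the DaemonSet controller places on its pods (illustrative only; not how kops currently does this):

```shell
# List each weave pod's node alongside the controller revision it was
# created from; pods whose hash doesn't match the newest ControllerRevision
# would be candidates for marking the node NeedsUpdate.
kubectl -n kube-system get pods -l name=weave-net -o wide -L controller-revision-hash

# The revisions known for the DaemonSet, newest last.
kubectl -n kube-system get controllerrevisions -l name=weave-net --sort-by=.revision
```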

@stevenjm
Contributor Author

> The only concern might be if a CNI upgrade isnt backwards compatible, this would increase the amount of time that the cluster is "split brained" but I suppose we could deal with that when the time comes.

In the specific case of weave, they guarantee backwards compatibility between adjacent major versions, per https://www.weave.works/docs/net/latest/operational-guide/tasks/#cluster-upgrade.

I am less familiar with the other CNI plugins as we don't use them here. It's possible this isn't a suitable approach for all of them.

@rifelpet
Member

I would guess that most CNIs make a similar guarantee, but to be safe we could consider making this change on a per-DaemonSet basis. I can add it to the agenda for the next office hours and we can solidify a plan there.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2020
@stevenjm
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 9, 2020
@stevenjm
Contributor Author

stevenjm commented Aug 5, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 5, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2020
@rifelpet
Member

rifelpet commented Nov 3, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2020