
Node upgrades: basic mechanism #6082

Closed
mbforbes opened this issue Mar 27, 2015 · 8 comments
Assignees
mbforbes
Labels
area/upgrade
priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.)
sig/cluster-lifecycle (Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.)

Comments

@mbforbes
Contributor

See issue #6079 as the roll-up for node upgrades, including the specification of the minimum requirement that this issue will fulfill, as well as improvements that will come after.

This is the worklist item for a minimum viable node upgrade mechanism for 1.0.

Plan:

  • in-place: yes; restart the node
  • provider: GCE, then GKE

Outline:

Provisioning a new node is blocked on allowing new nodes to dynamically join a running cluster (#6087). In the meantime, the mechanism will do an in-place upgrade, blowing away as much node state as necessary (a rough shell sketch follows the step list below).

  • For each node:
    1. Stop running binaries
    2. Get the node back to as close to a "blank" state as possible. This involves trying to un-configure what salt did; ideally, this looks like running salt in reverse.
    3. Grab the latest config from the metadata server (including the new startup script)
    4. Upgrade the kernel/OS
    5. Reboot the machine, which also runs the new startup script, which also runs salt
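
A minimal sketch of that per-node sequence, assuming a GCE node that re-reads its startup script from the metadata server on boot. The service names, metadata keys, and paths below are illustrative assumptions, not the actual cluster scripts:

```bash
#!/usr/bin/env bash
# Hypothetical in-place node upgrade sketch (GCE). Service names, metadata
# keys, and paths are assumptions for illustration only.
set -o errexit
set -o pipefail

# 1. Stop running binaries.
sudo service kubelet stop || true
sudo service kube-proxy stop || true

# 2. Best-effort "un-salt": remove state that salt laid down. Truly running
#    salt in reverse isn't possible, so this is only approximate.
sudo rm -rf /srv/salt /srv/pillar /etc/salt/minion.d

# 3. Grab the latest config (including the new startup script) from the
#    GCE metadata server and stash a copy.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/startup-script" \
  | sudo tee /var/run/new-startup-script.sh > /dev/null

# 4. Upgrade the kernel/OS.
sudo apt-get update && sudo apt-get -y dist-upgrade

# 5. Reboot; the new startup script runs on boot, which re-runs salt.
sudo reboot
```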
@mbforbes mbforbes added the priority/important-soon and area/upgrade labels Mar 27, 2015
@mbforbes mbforbes self-assigned this Mar 27, 2015
@mbforbes mbforbes added this to the v1.0 milestone Mar 27, 2015
@mbforbes mbforbes mentioned this issue Mar 27, 2015
@mbforbes mbforbes added the sig/cluster-lifecycle label Mar 27, 2015
@alex-mohr
Contributor

I thought salt was supposed to drive things to a specified end-state. Ideally, each salt stanza is idempotent, such that running it from, e.g., a base container VM results in the same state as re-running it a second time on its previous output?

If so, would it also be reasonable to audit those salt blocks and make sure they can also take the output of previous versions of the salt config and drive the node to the same state as running against a bare container VM? Would that be more robust than an "unsalt" config?

That would be something like the following (a rough shell sketch follows the list):

  • For each node:
    1. Stop running binaries
    2. Grab the latest salt config from the metadata server
    3. Run salt
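
A minimal sketch of that flow, plus an idempotency audit, assuming masterless salt-call on the node; the metadata key, archive format, and service names are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical "re-run salt in place" flow (masterless salt-call assumed).
set -o errexit

# 1. Stop running binaries so salt can safely rewrite their configs.
sudo service kubelet stop || true
sudo service kube-proxy stop || true

# 2. Grab the latest salt config from the metadata server (illustrative key
#    and tarball format).
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/salt-tree" \
  | sudo tar -xz -C /srv

# Audit: a dry run on a node built from an older config should report only the
# changes needed to converge; a second dry run after the real run should report
# none at all if the stanzas are truly idempotent.
sudo salt-call --local state.highstate test=True

# 3. Run salt for real.
sudo salt-call --local state.highstate
```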

I wonder, if we really wanted to bet on salt, whether we would make each stanza also responsible for stopping any relevant server before it twiddles the config when there's a change. That might be too prone to race conditions, though, so pausing the various servers first might also be easier.

@alex-mohr
Contributor

I realize I may have misunderstood: does "stop running binaries" mean e.g. kubelet, or do you mean all containers as well?

Also, if the kernel version is changing, we'll need a reboot in there somewhere...?

@multilinear

This sounds much, much harder than just destroying and re-creating the machine from scratch. Idempotency in machine setup is a minefield no one has yet crossed: there are a lot of gotchas, and it won't work at all if you happen to hit an inconsistent intermediate state, if a machine hangs, etc. There are a lot of ways it can break.

Given that Kubernetes is designed primarily for virtualized environments, what's the advantage of trying to reuse already-set-up machines instead of just chucking them and getting new ones?

@multilinear

Basically, I'm arguing that fast machine setup is a far easier problem than idempotent machine setup.

@zmerlynn
Member

We wanted to start with the "throw each node away" plan, but as pointed out in the initial comment, it's blocked by dynamic clustering right now. We can't add nodes dynamically for a few reasons (on GCE, for example, they need a cbr0 allocation). Those restrictions are going away soon.

At the limit, though, we believe that people aren't going to want to deal with downtime, so this (upgrade existing) may be the plan we need anyway.

@multilinear

Aaahh, thanks for the explanation. That makes some sense.

It would certainly be nice if you can get it to work... it's just that something that works 98% of the time would be very, very bad.

@zmerlynn
Member

@alex-mohr: Salt is only as state-aware as the stanza files dictate. The biggest problem is that this is very brittle. If you do something like add a file in version X and forget to add a file.delete for it in version Y, and it's no longer applicable yet still has a systemic effect, you've just failed. So now you have to figure out how to test for that, etc.

This is why I was talking about crazy things like ephemeral boot disks. But let me finish #6070 so we stop talking about generalities.

@mbforbes
Contributor Author

mbforbes commented Apr 1, 2015

After a couple of days of implementation, we discussed this path yesterday, and it turns out this actually isn't what we want to build.

The following matrix has rows of "what we're going to upgrade" and columns of "how:"

| what to upgrade | in-place | new node |
| --- | --- | --- |
| kubelet, kube-proxy | #6099 | --- |
| kubelet, kube-proxy, docker, kernel/OS | #6082 | #6088 |

(This issue is the one in the bottom-left.)

The problem is that, because we will have a fixed set of supported version tuples (#4855), doing a kernel/OS upgrade with apt-get dist-upgrade or a Docker update will likely move us onto an unsupported version tuple unless there are explicit mechanisms limiting what can be upgraded to. Of course, #6099 has exactly the same problem, in that newer versions of kubelet and kube-proxy must be supported against the same Docker and kernel/OS versions, but the scope is smaller and we control all of the pieces.
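
A sketch of the kind of explicit guard that would be needed, assuming a known list of supported (docker, kernel) pairs; the version numbers and the check itself are illustrative, not taken from #4855:

```bash
#!/usr/bin/env bash
# Hypothetical post-upgrade guard: fail loudly if the node's (docker, kernel)
# pair is not in the supported set. All version numbers here are made up.
set -o errexit

SUPPORTED_TUPLES=(
  "docker=1.5 kernel=3.16"
  "docker=1.6 kernel=3.19"
)

docker_version=$(docker --version | awk '{print $3}' | cut -d. -f1,2)
kernel_version=$(uname -r | cut -d. -f1,2)
current="docker=${docker_version} kernel=${kernel_version}"

for tuple in "${SUPPORTED_TUPLES[@]}"; do
  if [[ "${tuple}" == "${current}" ]]; then
    echo "OK: ${current} is a supported version tuple."
    exit 0
  fi
done

echo "ERROR: ${current} is not a supported version tuple." >&2
exit 1
```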

There are also the problems that @multilinear pointed out (and heavily discussed in #3333) about leftover node state. #6099 won't entirely avoid this either, but given we'll only be upgrading the kube-* binaries, the considerations will be smaller in scope.

I'm closing this issue now as it's a non-priority, and moving work over to #6099 (as #6088 is still blocked by #6087).

@mbforbes mbforbes closed this as completed Apr 1, 2015
@mbforbes mbforbes removed this from the v1.0 milestone Apr 1, 2015