This repository has been archived by the owner on May 12, 2021. It is now read-only.

Kata Components should support "Live-Upgrade" #492

Closed
WeiZhang555 opened this issue Jul 17, 2018 · 5 comments

Comments

@WeiZhang555
Member

WeiZhang555 commented Jul 17, 2018

This is from the mailing list earlier; I think it's better to sync it up to a GitHub issue for easier tracking.

====================
Actually I also mentioned this in Vancouver; in my opinion, a breakage between kata-agent and kata-runtime should always be considered a backward-compatibility breakage.
This breakage is a "gap" between "project" and "product" for Kata Containers; I'll elaborate on why here.

Starting from our requirements for a mature cloud product built on Kata: we have an SLA with our customers, which means we can't shut down customers' services while we are updating Kata components. This capability is called "live-upgrade", and it means running a kata-runtime and agent of different versions is very likely to happen:

  • 1) New runtime + old agent: kata-runtime is updated while the VM and old agent are still running; kata-runtime shouldn't issue a command that will crash the agent.
  • 2) Old runtime + new agent: rollback when the new Kata version has issues; in this case some services could already be started, so the new agent must always handle commands from the old runtime.

So what happens if we miss 1) and 2)? We need to shut down users' running workloads whenever we want to upgrade/downgrade the Kata components, which would make our SLA a joke.
(Of course we could instead notify users and let them shut down their workloads themselves, but we definitely hope to do better and go further.)

So to guarantee the "live-upgrade" ability of the Kata components (meaning installing Kata rpm packages while workloads are still running), what we need to do for each component is:

1) kata-runtime

A. Issue "versioned" commands to kata-agent, so it can always communicate correctly with an old kata-agent. (MUST)
B. Persisted on-disk data should be "versioned", so kata-runtime can always handle an old "version" of persisted data to restore sandbox/container structures from disk to memory. (MUST)
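As a sketch of requirement B, an explicit version field in the persisted state lets a newer runtime recognise and migrate data written by an older one instead of failing on it. All field names and the migration step below are illustrative, not the actual kata-runtime persist schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// persistVersion is the version the current runtime writes to disk.
const persistVersion = 2

// sandboxState is a hypothetical on-disk sandbox state with an explicit
// version field so a newer runtime can recognise and migrate old data.
type sandboxState struct {
	Version int    `json:"version"`
	ID      string `json:"id"`
	// Added in version 2; absent (zero value) in version-1 files.
	CgroupPath string `json:"cgroup_path,omitempty"`
}

// loadState restores a sandboxState from disk bytes, migrating older
// versions forward rather than rejecting them.
func loadState(data []byte) (*sandboxState, error) {
	var s sandboxState
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	switch {
	case s.Version == 0:
		// Pre-versioning file: treat it as version 1.
		s.Version = 1
		fallthrough
	case s.Version == 1:
		// Migrate v1 -> v2: fill in the field that v1 never wrote.
		s.CgroupPath = "/kata/" + s.ID
		s.Version = persistVersion
	case s.Version > persistVersion:
		// Downgraded runtime reading newer state: fail loudly.
		return nil, fmt.Errorf("state version %d newer than runtime (%d)", s.Version, persistVersion)
	}
	return &s, nil
}

func main() {
	// Simulate a file written by an old runtime, with no version field.
	old := []byte(`{"id":"sb-1"}`)
	s, err := loadState(old)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Version, s.CgroupPath) // migrated to the current version
}
```

Note that the rollback direction (old runtime reading new state) cannot be migrated this way, which is why the error branch exists; that case needs either a compatibility window or a strict version match.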

2) kata-agent:

The protocol needs to be versioned, so the agent can always handle commands from an old kata-runtime ("versioned" may be achieved by leveraging protobuf). (MUST)
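Protobuf already provides much of this for free, since unknown fields are ignored and field numbers are stable across versions. As a complementary illustration of the explicit-check variant, a hedged sketch (all names here are invented, not the real agent API) where a new agent accepts requests from an equal-or-older runtime and rejects ones it cannot understand instead of crashing:

```go
package main

import "fmt"

// agentAPIVersion is the protocol version this (hypothetical) agent implements.
const agentAPIVersion = 3

// request carries the API version the calling runtime was built against.
type request struct {
	APIVersion int
	Command    string
}

// handle accepts any request from an equal-or-older runtime (the
// "old runtime + new agent" rollback case), and cleanly rejects a
// request from a newer runtime rather than misinterpreting it.
func handle(r request) (string, error) {
	if r.APIVersion > agentAPIVersion {
		return "", fmt.Errorf("request API v%d is newer than agent v%d", r.APIVersion, agentAPIVersion)
	}
	return "ok: " + r.Command, nil
}

func main() {
	// Old runtime (v1) talking to a new agent (v3): must keep working.
	out, err := handle(request{APIVersion: 1, Command: "CreateContainer"})
	fmt.Println(out, err)

	// Newer runtime (v4): the agent refuses instead of crashing.
	_, err = handle(request{APIVersion: 4, Command: "NewFancyCall"})
	fmt.Println(err != nil)
}
```

The mirror-image check belongs in the runtime: a new runtime should query the agent's version before issuing commands the old agent does not implement.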

3) kata-shim/kata-proxy

These are daemon processes, so there's no need to shut them down while updating the Kata rpm package. I don't see a problem currently, but we need to guarantee the interaction between kata-runtime and shim/proxy. (MUST)

4) qemu:

A. Current status: there is NO WAY to upgrade now; running workloads must be shut down before installing a newer version of the qemu rpm package. (IMPOSSIBLE)
B. In the future: qemu live-migration, live-replacement, live-patching, etc. (BETTER HAVE)

5) guest kernel:

A. Current status: after installing a Kata rpm package with a newer VM image, old workloads keep running with the old kernel, and newly started workloads use the new VM kernel. That's fine. (ALREADY HAVE)
B. In the future: live patching. (BETTER HAVE)

summary

  1. We already break backward compatibility, and we will definitely break a lot more in the near future. Actually, in Vancouver the participants all agreed that we can't promise the API won't be broken, and that the current API isn't a stable version.
  2. Before we claim that Kata supports "live upgrade" and is truly production ready, I'm fine with the breakage, and also fine with 1.0.1 or 1.1.0; maybe the latter looks better.
  3. After we claim that Kata supports "live upgrade", we should reject any modification that breaks running workloads unless it is truly inevitable, in which case we need to bump the Kata version from x.0.0 to y.0.0.
    But I hope our Kata developers understand what a disaster this could be for a cloud provider like us :-(, and I hope this never happens.
  4. Better to document that we don't support "live upgrade" yet, and tell users that if they want to upgrade to a new kata-containers version, they must stop all their running Kata containers, or there will be unanticipated issues.
@caoruidong
Member

I think the config file should also be considered.

@egernst
Member

egernst commented Nov 2, 2018

@WeiZhang555 Can we put in a requirement that we only need to support rollback by one version? With this, we'd be able to assume that the runtime will only ever be one version behind the agent that is running. What do you think?

@WeiZhang555
Member Author

@egernst it's hard to say that's always OK. Regarding "one version behind", I assume you're talking about the minor version (1.x.0 --> 1.y.0); supporting only one version behind means we must update Kata components step by step, e.g. 1.1.0 --> 1.6.0 needs 5 upgrades for safety.

Another situation: in the future we will have LTS versions. Suppose 1.5.0, 2.0.0 and 2.5.0 are LTS versions; then it's very likely that users will want to use only LTS versions, which means they will want to upgrade from 1.5.0 to 2.0.0. Considering this, I hope we can support skipping more versions in a single upgrade or downgrade.

But rolling back by one version does make the situation easier; we can start from this if we don't have a better choice.
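The cost of a strict one-minor-version guarantee is that every upgrade path has to pass through each intermediate minor release. A trivial sketch of the 1.1.0 --> 1.6.0 example above (the version scheme is illustrative, with the major fixed at 1 and the patch at 0):

```go
package main

import "fmt"

// planUpgrades lists the intermediate minor releases an operator must
// step through when only one-minor-version compatibility is guaranteed.
func planUpgrades(fromMinor, toMinor int) []string {
	var steps []string
	for m := fromMinor + 1; m <= toMinor; m++ {
		steps = append(steps, fmt.Sprintf("1.%d.0", m))
	}
	return steps
}

func main() {
	// Zhang's example: 1.1.0 -> 1.6.0 requires five separate upgrades.
	fmt.Println(planUpgrades(1, 6))
}
```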

@YvesChan
Contributor

running kata-runtime and agent of different versions will very likely happen:

1). New runtime + old agent: updating kata-runtime when VM+old agent is running, kata-runtime shouldn't issue a command which will crash the agent.
2). old runtime + new agent: rollback when new kata version has issues, in this case, some service could be started already, new agent should always handle commands from old runtime.

While using containerd-shim-kata-v2, it looks like this situation won't happen, since we no longer need a separate kata-runtime for sandbox/container management. Instead, we have a daemon process (containerd-shim-kata-v2) which supports the v2 Task API. Like the old containerd-shim, an rpm upgrade of kata-runtime won't affect the running containerd-shim-kata-v2 process. Can we conclude that there's no need for the runtime component to support live-upgrade?

Since this issue has been open for a while, I'm just a little bit confused about the current status :)

zklei pushed a commit to zklei/runtime that referenced this issue Jun 13, 2019
According to the host link rawflags, setting link NOARP when needed

Fixes: kata-containers#492

Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
@egernst egernst added this to To do in release 1.10 Sep 19, 2019
@egernst egernst moved this from To do to backlog in release 1.10 Sep 19, 2019
@jodh-intel jodh-intel added this to To do in Issue backlog Aug 10, 2020
Old Backlog automation moved this from live-upgrade support to Done Apr 7, 2021
Issue backlog automation moved this from To do to Done Apr 7, 2021