This repository has been archived by the owner on May 12, 2021. It is now read-only.

add vm factory support #303

Merged
merged 11 commits into kata-containers:master from the vmfactory branch on Jul 21, 2018

Conversation

bergwolf
Member

@bergwolf bergwolf commented May 14, 2018

This PR adds vm factory support to the kata runtime, per the design in the VM Factory plugin section. The vm factory controls how a new vm is created:

  • direct: the vm is created directly
  • template: the vm is created from a vm template. A template vm is pre-created and saved. A new vm is then just a clone of the template vm, so that the clones share a read-only portion of the initial memory (including the kernel, initramfs and the kata agent). CPU and memory are hot plugged when necessary.
  • cache: the vm is created from a vm cache. A set of cached vms is pre-created and kept alive. New vms are created by simply picking a cached vm. CPU and memory are hot plugged when necessary.

To support vm factory, virtcontainers is refactored so that:

  1. hypervisor and sandbox are decoupled
  2. a new VM layer is added. A "VM" is a guest without any sandbox association and can be assigned to a sandbox when necessary.
  3. the builtin proxy based agent can run internally without being associated with a sandbox.

A new kata CLI subcommand, kata-runtime factory, is added. It is used to manage vm factory init and destroy. Only the direct and template factories are supported by the kata CLI, because the cache factory requires a running daemon to maintain the cached vm channels. OTOH, all factories are available at the kata API level.
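
A minimal usage sketch of the new subcommand (exact flags and output are not shown in this PR and may differ):

  # initialize the vm factory per the runtime configuration (e.g. pre-create the template vm)
  kata-runtime factory init

  # destroy the vm factory (e.g. tear down the template vm state)
  kata-runtime factory destroy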

VMs created by the vm factory still lack network support, because right now network devices are configured before the vm is created. We need network hotplug support (#287) to enable network configuration after the vm is created. That said, this patch works otherwise and is large enough as it is, so I'd like to get it merged and continue working on the network support when #287 is done.
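
At the kata API level, the wiring is roughly as follows. This is a hedged sketch based on the APIs visible in this PR (vf.NewFactory with its fetchOnly flag, and VCImpl.SetFactory); the import paths, the FactoryConfig fields and the fetchOnly semantics are assumptions, not confirmed by the merged code:

package sketch

import (
	vc "github.com/kata-containers/runtime/virtcontainers"
	vf "github.com/kata-containers/runtime/virtcontainers/factory"
)

func setupFactory() error {
	// Fields are elided here; the config carries the factory mode
	// (direct/template/cache) and the base VM configuration.
	factoryConfig := vf.FactoryConfig{}

	// fetchOnly=false presumably creates the factory state (e.g. the template
	// VM) if it does not exist yet; the CLI create path passes true to only
	// load an already-initialized factory.
	f, err := vf.NewFactory(factoryConfig, false)
	if err != nil {
		return err
	}

	// Once set, virtcontainers uses the factory whenever it needs a new VM.
	vc.SetFactory(f)
	return nil
}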

Contributor

@jodh-intel jodh-intel left a comment

@sboeuf, @devimc - ptal.

[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel memory by mapping
# it readonly. When disabled, new VMs are created from scrash.
Contributor

  • Can you explain why users would want to use this feature (to save memory, speed up boot, etc)?
  • s/scrash/scratch/.

Member Author

Good point! I'll add some text.

Contributor

Also, we either need a new enable_debug= option for this section, or a comment explaining that the runtime enable_debug option affects factory logging?

Member

@bergwolf @jodh-intel I think it will be useful to add the security implications of this approach as well here so that users are aware of the side-effects of enabling this option.

Member Author

@jodh-intel @amshinde The purpose of the PR is to add vm factory functionality. The CLI side change is just minimal to use vm factory. I can add more CLI changes after the PR is merged.

#enable_template = false

# VM caching support. When set to larger than 0, number of cache empty VMs
# will maintained alive. New VMs are created by just using the cached VMs.
Contributor

  • Can you explain the impact of setting this? For example, if I set this to 5, does that mean the first container I start after a fresh boot will result in 6 instances of qemu being started (which will take longer than normal), but subsequently created containers will be fast (since they can make use of the running instances)?

  • Also, what happens if those instances are running and qemu gets upgraded "underneath" them? Presumably, we should recommend all instances are stopped before upgrading.

  • How do you stop these instances (we need to document this)?

@@ -1126,6 +1126,10 @@ type Memory struct {
// MaxMem is the maximum amount of memory that can be made available
// to the guest through e.g. hot pluggable memory.
MaxMem string

Contributor

There is no corresponding change to Gopkg.toml.

Member Author

The govmm PR is not merged yet. I'll update vendor properly once that one gets merged.


@bergwolf could you please go to the govmm repo and address the comments. I'd like to see it merged before we move forward with this PR.

} else if config.Knobs.FileBackedMem == true {
if config.Memory.Size != "" && config.Memory.Path != "" {
dimmName := "dimm1"
objMemParam := "memory-backend-ram,id=" + dimmName + ",size=" + config.Memory.Size + ",mem-path=" + config.Memory.Path
Contributor

Has this been tested with qemu-lite and qemu v1.11 vanilla?


func New(count uint, b base.FactoryBase) base.FactoryBase {
if count < 1 {
return b
Contributor

Should count == 0 result in a log warning?

Member Author

It's actually OK to have count == 0 in which case we still return the base factory.

for {
vm, err := b.GetBaseVM()
if err != nil {
//glog.Errorf("cache factory get error when allocate vm: %v", err)
Contributor

Commented out code here and below.


func (f *FactoryConfig) validate() error {
if f.Cache == 0 {
return fmt.Errorf("invalid factory cache number %d", f.Cache)
Contributor

Is it worth displaying the entire FactoryConfig to help debugging here?

}

func (v *VM) assignSandbox(s *Sandbox) error {
// TODO: add vm symlinks
Contributor

Incomplete?

@bergwolf bergwolf changed the title [wip] add vm factory support add vm factory support Jun 8, 2018
@bergwolf
Member Author

bergwolf commented Jun 8, 2018

Removing wip and the do-not-merge tag. The PR is working and ready for review now. @sboeuf @jodh-intel @egernst @WeiZhang555 @laijs PTAL. In the meantime, I'll fix up CI and post perf numbers.

@katabuilder

PSS Measurement:
Qemu: 161919 KB
Proxy: 6752 KB
Shim: 10851 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 1999732 KB

@bergwolf
Member Author

bergwolf commented Jun 8, 2018

Script to run: echo 3 |sudo tee /proc/sys/vm/drop_caches; free -h; for I in $(seq 100); do echo {$I}th; docker run -d --runtime kata busybox top;done;free -h

I've changed kata config to create 128MB guests by default.

w/o vm templating:

before test:
              total        used        free      shared  buff/cache   available
Mem:            31G        1.0G         29G        173M        722M         29G
Swap:          9.3G          4K        9.3G
after test:
              total        used        free      shared  buff/cache   available
Mem:            31G         15G         14G        190M        1.1G         15G
Swap:          9.3G          4K        9.3G

w/ vm templating:

before test:
              total        used        free      shared  buff/cache   available
Mem:            31G        1.2G         29G        253M        833M         29G
Swap:          9.3G          4K        9.3G
after test:
              total        used        free      shared  buff/cache   available
Mem:            31G        6.2G         23G        271M        1.1G         23G
Swap:          9.3G          4K        9.3G

So vm templating helped save about 9GB for 100 guests (used memory grew by ~14GB without templating vs ~5GB with it), which is around 90MB per 128MB guest. It's clear that vm templating can be very useful for density workloads.

@egernst ^^

@WeiZhang555
Member

Wow, this looks like an amazing optimization. And it's also quite a huge patch, need some time to read~

@katabuilder

PSS Measurement:
Qemu: 161626 KB
Proxy: 8801 KB
Shim: 8899 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 1996880 KB

@grahamwhaley
Contributor

Memory savings look good :-)
We have a script over in CC (that at some point I'll move over into the kata repo) that took a bunch of measurements for looking at size/density: https://github.com/clearcontainers/tests/blob/master/metrics/density/footprint_data.md

I presume this measurement was taken with KSM turned off. I have a feeling that some of those savings could also be found with KSM, but I suspect:

  • KSM would take longer and consume more CPU resource to find and merge the pages
  • some of the shared pages in QEMU are on the host side only, where KSM is not generally enabled (the host side pages are not marked as mergeable) - so I think we will get gains here that KSM cannot currently get.

So, those memory savings look good to me :-)

@bergwolf
Member Author

@grahamwhaley Yes, these numbers were taken with KSM off. And yes, one advantage of vm templating over KSM is that it does not consume extra CPU resources. However, it does not share the host-side qemu memory either, same as KSM. All the sharing is done at the guest memory level, mostly covering the kernel, initramfs, kata-agent and other process memory such as systemd if that is enabled.

Contributor

@grahamwhaley grahamwhaley left a comment

I like the whole idea, and the structure to my eyes looks right.
I've added some general feedback, but we need some core VM and VC devs to review this...

@@ -1126,6 +1126,10 @@ type Memory struct {
// MaxMem is the maximum amount of memory that can be made available
// to the guest through e.g. hot pluggable memory.
MaxMem string

Contributor

In the patch title, the hack in hack/vendor made me pause.

Can you drop the hack from the subject pls?

Member Author

That commit will be reworked once the govmm PR gets merged. The hack in the commit subject is to remind myself of the rework ;)

# using vm cloning. They will share the same initial kernel memory by mapping
# it readonly. When disabled, new VMs are created from scrash.
#
#enable_template = false
Contributor

As this is commented out, and I presume the default is therefore false/off, you may as well make the commented out line:

#enable_template = true

as that is the most likely line to be added or uncommented I think.

Member Author

Good point!


var factorySubCmds = []cli.Command{
initFactoryCommand,
destroyFactoryCommand,
Contributor

Will it be possible to have some sort of 'status' command as well - to show how many runtimes are shared, or, even more importantly once we have the cache/pool, how many are cached?

Member Author

It is possible, but it needs corresponding virtcontainers-side changes to support it. At this point, I would defer further improvements to future PRs since this one is already pretty big.

return q.hotplugAddMemory(slot, memMB)
}

func (q *qemu) hotplugAddMemory(slot uint32, amount int) error {
Contributor

Just a thought - is this tied to the vmtemplating PR, or could it (if need be, to simplify the PRs) be pulled out into a separate PR?
Also, /cc @devimc as this feels like it might overlap or interact with some of his work?

Member Author

VM templating needs cpu and memory hotplug support so I added them together.


I understand this is needed, but this is something that is not related to the whole refactoring, and I think you should submit this as a separate PR on top of which you should rebase this current PR.
This will help to split this huge PR and people will be more motivated in reviewing it ;)

Member Author

It is not about refactoring. It is a dependency of the vm factory functionality, and the vm factory is in fact the only consumer of the internal hotplug APIs ATM. I thought reviewers would lose the whole picture if I split it out.

As for reviewing, I thought smaller commits in a large PR would actually help. But it seems you are right that people would rather review it as a whole than look into each commit, and the PR is too huge to be reviewed as a whole (hence I got no comments on the core idea/API/refactoring part of the PR for a week :( ). I'll split some dependency commits into another PR to see if that helps.


@bergwolf Thanks :)
And don't worry, I'm gonna spend some time today reviewing this PR. This is a huge piece of work and I need to be focused to be able to properly review it.
Also, don't hesitate to ping people that should review this PR. Usually if you don't ask, they will not spend the time on such big PRs 😃

Contributor

Agreed. I plan to re-review this tomorrow AM too fwiw.

Member

I've been planning to review this for a whole week, in the end I just started this work a few minutes ago 😆
Shame on myself :-)

@@ -6,6 +6,6 @@
package virtcontainers

type Factory interface {
GetVM(hypervisorType HypervisorType, hypervisorConfig HypervisorConfig) (*VM, error)
Contributor

Can this whole commit be folded into the one that added factory.go in the first place? Seems this fixes something effectively 'missing' from that prior commit?

@@ -359,7 +359,7 @@ func newQemuHypervisorConfig(h hypervisor) (vc.HypervisorConfig, error) {
}
Contributor

Ideally I would have liked to have seen the static check (lint, cyclo) fixes in (merged into) the original commits. Not a biggie, just noting.

Member Author

Done.

@katabuilder

PSS Measurement:
Qemu: 163764 KB
Proxy: 4632 KB
Shim: 11067 KB

Memory inside container:
Total Memory: 2045972 KB
Free Memory: 1995640 KB

@jshachm
Member

jshachm commented Jun 12, 2018

Just took a brief look into the code (quite a lot of code... great job!).

Decoupling hypervisor and sandbox is a really good idea ~
IMO decoupling leaves more chances for resources to be left behind, so cleaning up resources after exceptions and proper rollback are really needed. I will cherry-pick and do a little more testing on this to find out if any resources remain.

Will give feedback later~

@sboeuf sboeuf left a comment

More comments later...

@@ -62,7 +60,9 @@ type qemu struct {

const qmpCapErrMsg = "Failed to negoatiate QMP capabilities"

const defaultConsole = "console.sock"
const qmpSocket = "qmp.sock"

I am fine with using only one socket, but you have increased the number of characters in the socket name now. controlSocket and monitorSocket were shorter.

Member Author

The main point is that there is no need for two qmp sockets. If socket file name length is a concern, I can just use qmp instead. But we've already shortened the socket parent path and qmp.sock survives the BuildSocketPath check, so I'd prefer making the naming clear so that a socket actually ends with a .sock suffix.


Yes, fair enough. We have shortened the parent path, and I agree I prefer a socket name ending with .sock.

@@ -1126,6 +1126,10 @@ type Memory struct {
// MaxMem is the maximum amount of memory that can be made available
// to the guest through e.g. hot pluggable memory.
MaxMem string


@bergwolf could you please go to the govmm repo and address the comments. I'd like to see it merged before we move forward with this PR.

}

// SetLogger implements the VC function of the same name.
func (impl *VCImpl) SetLogger(logger logrus.FieldLogger) {
SetLogger(logger)
}

// SetFactory implements the VC function of the same name.
func (impl *VCImpl) SetFactory(factory Factory) {

You need the mock implementation to implement this new API.

Member Author

The mock implementation has SetFactory.

return q.hotplugAddMemory(slot, memMB)
}

func (q *qemu) hotplugAddMemory(slot uint32, amount int) error {

I understand this is needed, but this is something that is not related to the whole refactoring, and I think you should submit this as a separate PR on top of which you should rebase this current PR.
This will help to split this huge PR and people will be more motivated in reviewing it ;)

kataLog.WithField("factory", factoryConfig).Info("load vm factory")
f, err := vf.NewFactory(factoryConfig, true)
if err != nil {
kataLog.WithError(err).Info("load vm factory failed")
Contributor

I think it's worth adding a comment here explaining that if vm factory support fails, the system will "degrade" automatically to using the non-factory codepath.

In fact, I think we might want to add another variable to the config file to avoid this graceful degradation as some users only want to use VM factories:

[factory]

# If set, fail if VM factory support is not available.
## strict=true

Member Author

Yes, agreed. But again, I'd like to push all these UX changes to later PRs. This PR is mostly to add core vm factory functionalities.

[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel memory by mapping
# it readonly. When disabled, new VMs are created from scrash.
Contributor

Also, we either need a new enable_debug= option for this section, or a comment explaining that the runtime enable_debug option affects factory logging?

f, err := vf.NewFactory(factoryConfig, true)
if err != nil {
kataLog.WithError(err).Error("load vm factory failed")
// ignore error
Contributor

Why?

Member Author

It happens when we destroy a factory that does not exist in the first place.

return err
}
flags := uintptr(syscall.MS_NOSUID | syscall.MS_NODEV)
opts := fmt.Sprintf("size=%dM", t.config.HypervisorConfig.DefaultMemSz+8)
Contributor

Why +8?

Member Author

It's an empirical value: a vm device state file is smaller than 8MB, usually around 3MB, and we need to align the size up by 8.
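
For context, a self-contained sketch of the tmpfs mount being discussed; the mount-point path and the function name here are assumptions for illustration, not the PR's actual code:

package sketch

import (
	"fmt"
	"syscall"
)

// mountTemplateTmpfs mounts a tmpfs sized to hold the template vm memory plus
// 8MB of headroom for the saved device state file (empirically ~3MB).
func mountTemplateTmpfs(defaultMemSzMB uint32, path string) error {
	flags := uintptr(syscall.MS_NOSUID | syscall.MS_NODEV)
	opts := fmt.Sprintf("size=%dM", defaultMemSzMB+8)
	return syscall.Mount("tmpfs", path, "tmpfs", flags, opts)
}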

fmt.Println("vm factory initialized")
} else {
kataLog.Error("vm factory is not enabled")
fmt.Println("vm factory is not enabled")

twice?

Member Author

kataLog does not print to stdout. But I think we should tell the user whether the command succeeded or not. Any suggestion on how to do that properly with kataLog?

f.CloseFactory()
}
}
fmt.Println("vm factory destroyed")

why not use kataLog?

Member Author

kataLog does not print to stdout.

@sboeuf sboeuf left a comment

@bergwolf The PR looks pretty good, but I don't think we need to introduce a new abstraction layer with this new VM interface. The factory should call directly into the hypervisor, there is no reason to add an extra layer here. And about the sandbox decoupling, it's not needed since we want the complexity to be hidden behind sandbox API.
We can do the proper parallelization from Frakti by using CreateSandbox() with the appropriate factory, and hotplug later by creating new containers. Each container will be responsible for hotplugging the resources that we need.
That being said, the same way we're going to introduce sandbox.AddStorage(), we could introduce sandbox.AddResources() if really needed.

@@ -62,7 +60,9 @@ type qemu struct {

const qmpCapErrMsg = "Failed to negoatiate QMP capabilities"

const defaultConsole = "console.sock"
const qmpSocket = "qmp.sock"

Yes, fair enough. We have shortened the parent path, and I agree I prefer a socket name ending with .sock.

@@ -253,15 +250,6 @@ func (q *qemuArchBase) memoryTopology(memoryMb, hostMemoryMb uint64) govmmQemu.M
return memory
}

func (q *qemuArchBase) append9PVolumes(devices []govmmQemu.Device, volumes []Volume) []govmmQemu.Device {

The removal of this function is not part of the decoupling between sandbox and hypervisor, is it?
It should be part of its own commit, I think.

Member Author

Yes, agreed.


package virtcontainers

type VM struct {

I don't really get the need for such an additional abstraction layer. Could you explain the rationale behind this? My understanding so far is that it actually duplicates the hypervisor interface and includes the agent handler, meaning it looks a lot like what sandbox is doing.

@@ -340,8 +346,7 @@ func (q *qemu) createSandbox(sandboxConfig SandboxConfig) error {
// bridge gets the first available PCI address i.e bridgePCIStartAddr
devices = q.arch.appendBridges(devices, q.state.Bridges)

devices = q.arch.append9PVolumes(devices, sandboxConfig.Volumes)
Member

@bergwolf Can you add in the commit message that you are removing this method and the reason for doing this.


package virtcontainers

type Factory interface {
Member

@bergwolf Can you add documentation comments for Factory here? It would also be useful to include the purpose of Factory and FactoryBase in the commit message.


@bergwolf I agree with @amshinde here. The difference between Factory and FactoryBase is very thin and confusing, and unless there's a very good reason to keep it, you should simplify this with Factory interface only (basically what you've defined in FactoryBase right now).

Member Author

@sboeuf That would push the vm config comparison and CPU/memory hotplug down to each of the factory implementations. factory/factory.go is not just a broker but also handles vm config validation/comparison and possible CPU/memory hotplugs. It's better to do it at the factory level instead of doing the same work in each of the factory implementations.


@bergwolf I had missed this. This makes sense, and it'd be great to explain it explicitly in the code with a comment. Otherwise, it's not obvious what difference there is between the Factory and FactoryBase interfaces.
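
To summarize the distinction being discussed, here is a rough, self-contained sketch of the two layers. Method sets are partial and signatures are taken loosely from the diff context; the placeholder types stand in for the real virtcontainers ones and are not the merged API:

package sketch

// Placeholder types standing in for the virtcontainers types referenced in the PR.
type (
	VM               struct{}
	VMConfig         struct{}
	HypervisorType   string
	HypervisorConfig struct{}
)

// FactoryBase is what the direct, template and cache factories implement.
// It only knows how to hand out a base VM built from its fixed base config.
type FactoryBase interface {
	// Config returns the base vm configuration the factory was created with.
	Config() VMConfig
	// GetBaseVM returns a paused VM created by the base factory.
	GetBaseVM() (*VM, error)
}

// Factory is what virtcontainers consumers use. The wrapper in
// factory/factory.go validates and compares the requested config against the
// base config, and hot plugs CPU/memory when they differ, so each FactoryBase
// implementation does not have to repeat that work.
type Factory interface {
	GetVM(hypervisorType HypervisorType, hypervisorConfig HypervisorConfig) (*VM, error)
}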


import vc "github.com/kata-containers/runtime/virtcontainers"

type FactoryBase interface {
Member

Same as above, can you add documentation comments for FactoryBase and its purpose?

//
// cache implements base vm factory on top of other base vm factory.

package cache
Member

@bergwolf Can you add a description of the cache and template factories? I have no idea what these are or how they differ from each other. I know you are planning to add the implementation later, but comments about the general flow, giving an idea of what is stored and what is constructed at runtime for each factory type, would be useful for a general understanding.

Member Author

It's in the PR description.

[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel memory by mapping
# it readonly. When disabled, new VMs are created from scrash.
Member

@bergwolf @jodh-intel I think it will be useful to add the security implications of this approach as well here so that users are aware of the side-effects of enabling this option.

@@ -100,6 +101,25 @@ func create(containerID, bundlePath, console, pidFilePath string, detach bool,
return err
}

if runtimeConfig.FactoryConfig.Template {
Member

@bergwolf Can we handle this in virtcontainers CreateSandbox itself rather than doing this in the cli?
All these implementation details about the factory could be handled by virtcontainers itself keeping the api simple.

Member Author

The kata CLI is the one reading the configuration files. And for frakti, unlike the kata CLI, we would initialize the vm factory only once, instead of doing it for every CreateSandbox. That is the reason the factory config is not pushed into CreateSandbox.

},
}

var initFactoryCommand = cli.Command{
Member

Are these commands intended to be called by frakti? I am trying to understand the use case here.

Member Author

No, not at all. frakti does not call any of the kata CLI commands. The CLI commands are the minimum needed for the kata CLI to use the vm factory, and they help with testing.

@@ -103,3 +116,13 @@ func (t *template) createFromTemplateVM() (*vc.VM, error) {

return vc.NewVM(config)
}

func (t *template) checkTemplateVM() error {
Member

This needs some comments explaining what you are doing here and what those files mean.

Member Author

Will do.

}

type factory struct {
base base.FactoryBase
}

func NewFactory(config FactoryConfig) (vc.Factory, error) {
func NewFactory(config FactoryConfig, fetchOnly bool) (vc.Factory, error) {
Member

Please add comments for the fetchOnly flag here.

Member Author

Will do.

@bergwolf
Member Author

@sboeuf The VM layer is needed because the vm factory needs to work on a guest vm abstraction without any sandbox association. The VM layer sits on top of the hypervisor layer because hypervisor is just a set of driver interfaces with no in-memory abstraction. We could add an abstraction data structure in hypervisor.go, but it would just look like the VM layer and would not be as clean as a separate VM layer.

With the VM layer, we have a proper layering that:

  • A VM is a guest without sandbox association
  • hypervisor.go is the hypervisor driver interface that implements a VM
  • A sandbox is a collection of containers
  • A VM can be assigned to a sandbox and once that is done, the sandbox becomes operational

And about the sandbox decoupling, it's not needed since we want the complexity to be hidden behind sandbox API.

The sandbox API cannot cover the vm factory. For one thing, the vm factory needs to be initialized before any sandbox is created. For another, unlike the kata CLI, frakti would initialize the factory only once and use it as long as the daemon is running. So it needs to be a separate API that is not coupled with the sandbox APIs.

We can do the proper parallelization from Frakti by using CreateSandbox() with the appropriate factory, and hotplug later by creating new containers. Each container will be responsible for hotplugging the resources that we need.

I agree that it makes sense to let each container hotplug what it needs. But doing hotplug with vm factory still allows the factory to prepare VMs with different sizes. I would like to keep such ability. WDYT?

Config() vc.VMConfig

// GetBaseVM returns a paused VM created by the base factory.
GetBaseVM() (*vc.VM, error)
Contributor

The main advantage of a cache is of course speed. However, we need to be able to identify when the cache should be invalidated, for example when a new rootfs image containing a security fix is released.

As such, I wonder if we need to add a timestamp into vc.VM that shows when the cache item was created (not when it was used). That will give us a basic way to look for "stale images" I think?

Member Author

Generally, I think the system upgrade best practice will need to include destroying the vm factory and re-initializing it after the upgrade. This would apply to all vm factory users, including the kata cli and frakti. For example in frakti, we would destroy the vm factory whenever the daemon quits and always initialize it upon startup.

You are right that it is possible to handle it automatically inside vm factory but it can be very tricky and error prone since an upgrade can happen at any time. Given that frakti is the only user of cache vm factory, I would prefer to push the burden up to frakti where it is already properly handled.

}
}

err = q.qmpMonitorCh.qmp.ExecSetMigrateArguments(q.qmpMonitorCh.ctx, fmt.Sprintf("exec:cat>%s", q.config.DevicesStatePath))
Contributor

Are there any issues around creating the image with "version X" of qemu and then trying to use it with "version Y" of qemu? I'm thinking about build-support as well (qemu-lite compared with qemu-vanilla compared with distro-packaged qemu).

Member Author

In that case, the qmp command might fail. And I would classify it as a configuration mistake... It's just like the version mismatch when you upgrade kata-runtime but do not correspondingly upgrade kata-agent and end up with certain grpc failures.

@sboeuf

sboeuf commented Jun 19, 2018

@bergwolf

The sandbox API cannot cover the vm factory. For one thing, the vm factory needs to be initialized before any sandbox is created. For another, unlike the kata CLI, frakti would initialize the factory only once and use it as long as the daemon is running. So it needs to be a separate API that is not coupled with the sandbox APIs.

I clearly understand the need that you have here, and this means you need access to the hypervisor interface so that you can run a VM and stop it to save its state. But I thought it would be better to make sure that we can do that through the hypervisor interface itself, and because you have removed the dependency on the sandbox from the hypervisor interface, you should be able to do that.

I agree that it makes sense to let each container hotplug what it needs. But doing hotplug with vm factory still allows the factory to prepare VMs with different sizes. I would like to keep such ability. WDYT?

Ok, but if you need different sizes of VMs, why don't you simply create them with the expected amount directly? IMO the hotplug should be reserved for the containers since they are the variable in the equation (we don't know what the CRI layer will ask for in advance).

@bergwolf
Member Author

@sboeuf

you need access to the hypervisor interface so that you can run a VM and stop it to save its state

Besides accessing the hypervisor interface, we also need an in-memory data structure to represent the guest without a sandbox, so that we can create it (vm.NewVM()) and assign it to a sandbox (vm.assignSandbox()). The other VM APIs (Pause, Save, Stop, Resume, Kill, and hotplug) are just wrappers around the hypervisor interface so that we do not export the entire hypervisor interface.

why don't you simply create them with the expected amount directly

For performance reasons, it is much slower to create a new VM and wait for the agent to be up than to simply hotplug the requested CPU/memory.
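
A minimal, self-contained sketch of the VM layering described above; the method names come from this discussion, while the struct fields and the hypervisor driver method names are placeholders, not the merged virtcontainers code:

package sketch

// hypervisorDriver stands in for the per-hypervisor driver interface
// (e.g. the qemu implementation); method names here are placeholders.
type hypervisorDriver interface {
	pauseVM() error
	resumeVM() error
	stopVM() error
	saveVM() error
}

// VM represents a guest with no sandbox association yet.
type VM struct {
	id string
	hv hypervisorDriver
}

// The VM APIs are thin wrappers around the hypervisor driver, so the whole
// driver interface does not need to be exported.
func (v *VM) Pause() error  { return v.hv.pauseVM() }
func (v *VM) Resume() error { return v.hv.resumeVM() }
func (v *VM) Stop() error   { return v.hv.stopVM() }
func (v *VM) Save() error   { return v.hv.saveVM() }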

@sboeuf

sboeuf commented Jun 20, 2018

@bergwolf

Besides accessing the hypervisor interface, we also need an in-memory data structure to represent the guest without a sandbox, so that we can create it (vm.NewVM()) and assign it to a sandbox (vm.assignSandbox()). The other VM APIs (Pause, Save, Stop, Resume, Kill, and hotplug) are just wrappers around the hypervisor interface so that we do not export the entire hypervisor interface.

vm.assignSandbox() is a consequence of the way you've decoupled things. If you think the other way around, the sandbox gets the info regarding an existing hypervisor and that's all it needs. About vm.NewVM(), I can see that you expect to check on the agent after you started the VM, but you don't strictly need to do that. The VM could be a simple entity started through the hypervisor interface without the need to know about the agent. And it's only when you create the sandbox that you provide the VM you want for your hypervisor, which leaves the sandbox taking care of calling into the agent to check if it properly started inside the VM.

For performance reasons, it is much slower to create a new VM and wait for the agent to be up than to simply hotplug the requested CPU/memory.

That's because you assume you have to wait for the agent to be up inside the VM when you're creating it from the factory. I don't think we need this, and as explained above, this is more something that should be checked in the context of the sandbox.

@sboeuf

sboeuf commented Jul 17, 2018

@amshinde @devimc @grahamwhaley @egernst it'd be nice if you could take a look ;)

It is not used, and we actually cannot append multiple 9pfs volumes to
a guest.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
A hypervisor implementation does not need to depend on a sandbox
structure. Decouple them in preparation for vm factory.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
The boolean return value is not necessary.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
There are a few changes we need on kata agent to introduce vm factory
support:
1. decouple agent creation from sandbox config
2. setup agent without creating a sandbox
3. expose vm storage path and share mount point

Signed-off-by: Peng Tao <bergwolf@gmail.com>
1. support qemu migration save operation
2. setup vm templating parameters per hypervisor config
3. create vm storage path when it does not exist. This can happen when
an empty guest is created without a sandbox.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
As a representation of a guest without an actual sandbox attached to it.
This prepares for vm factory support.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add vm factory support per design in the VM Factory plugin section.
The vm factory controls how a new vm is created:

1. direct: vm is created directly
2. template: vm is created via vm template. A template vm is pre-created
   and saved. Later vm is just a clone of the template vm so that they
   readonly share a portion of initial memory (including kernel, initramfs
   and the kata agent). CPU and memory are hot plugged when necessary.
3. cache: vm is created via vm caches. A set of cached vm are pre-created
   and maintained alive. New vms are created by just picking a cached vm.
   CPU and memory are hot plugged when necessary.

Fixes: kata-containers#303

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add SetFactory to allow virtcontainers consumers to set a vm factory.
And use it to create new VMs whenever the factory is set.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add enable_template option to the config file.
When it is set, enable the vm template factory.

cache factory cannot be used by kata cli directly because
it requires a running daemon to maintain the cache VMs.

`kata-runtime factory init` would initialize the vm factory and
`kata-runtime factory destroy` would destroy the vm factory.

When configured, a vm factory is loaded before creating new sandboxes.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
For one thing, it is not used by any kata components. For another thing,
it breaks vm factory hypervisor config check.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add UTs to all factory components.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
@katacontainersbot
Contributor

PSS Measurement:
Qemu: 162966 KB
Proxy: 5090 KB
Shim: 8883 KB

Memory inside container:
Total Memory: 2045968 KB
Free Memory: 2007096 KB

@opendev-zuul

opendev-zuul bot commented Jul 19, 2018

Build failed (third-party-check pipeline) integration testing with
OpenStack. For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@bergwolf
Member Author

This is now blocked by kata-containers/tests/issues/513 and kata-containers/tests/issues/506. Restarting CI did not help... I know it all passed yesterday, but codecov was not happy so I had to update again.

@bergwolf
Member Author

recheck

@opendev-zuul

opendev-zuul bot commented Jul 20, 2018

Build succeeded (third-party-check pipeline).

@sboeuf

sboeuf commented Jul 20, 2018

@kata-containers/runtime PTAL, it is important to review this PR so that we can move forward with it. It's been around for a long time, and @bergwolf has reworked it many times. PTAL!

@laijs
Contributor

laijs commented Jul 21, 2018

LGTM.
Well done @bergwolf, thanks for having done it.

Approved with PullApprove

@laijs
Contributor

laijs commented Jul 21, 2018

It seems that codecov/patch is not reliable. see kata-containers/tests#508
And, as @sboeuf said, it's been around for a long time and @bergwolf has reworked it many times, so I am merging it despite the codecov/patch result.

@laijs laijs merged commit 14d25b8 into kata-containers:master Jul 21, 2018
@egernst egernst mentioned this pull request Aug 22, 2018
@grahamwhaley
Contributor

Hi @bergwolf - I know this is an old and merged item (the base VM templating/factory is merged already). I was wondering, are you planning at some point to run some metrics tests and maybe publish the vm factory gains for density and boot time? I think having these numbers available for Kata could be a great thing.
I've not had time to run up the factory and try it - but the report generation might help as an easy first-pass look?

Maybe we should open an Issue to track this, and then we might be able to have some collab to come up with some sort of blog etc.?
/cc @annabellebertooch @brecky777

@sboeuf sboeuf added the feature New functionality label Sep 12, 2018
@bergwolf bergwolf deleted the vmfactory branch September 13, 2018 03:28
zklei pushed a commit to zklei/runtime that referenced this pull request Jun 13, 2019
agent: do not quit on grpc serve errors