Decentralized Libvirt #663

Merged
merged 21 commits into kubevirt:master on Feb 1, 2018

Conversation

@davidvossel
Member

davidvossel commented Jan 18, 2018

Overview

This patch moves KubeVirt's design from one that depends on a centralized libvirtd per node to one that uses a decentralized libvirtd per VM pod.

With this new decentralized approach, each VM's qemu process now lives directly within the VM Pod's cgroups and namespaces which means that any storage/network devices in the Pod are available to the VM.

Notable Changes

Cloud-init and Registry Disks

Generation and lifecycle management of ephemeral disks have moved from virt-handler to virt-launcher.

This data is now completely self-contained within the VM Pod (no shared host mounts), which means cleanup occurs automatically as part of the kubelet tearing down the Pod's environment.

Notifications Server and Domain Informer

Previously virt-handler received events about lifecycle changes to a domain through a libvirt event callback.

Now virt-handler receives domain lifecycle events through a notification server. Virt-handler starts this notification server listening on a unix socket, and each virt-launcher acts as a client that forwards domain lifecycle events to it.

The virt-handler domain informer uses this notification server for its Watch function. The informer's List function iterates over every known virt-launcher present on the local host and requests the latest information about all defined domains.
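As a rough sketch of how that event path could look (assuming Go's net/rpc over a unix domain socket, which the gob/MarshalBinary discussion later in this review suggests; the type, method, and socket names here are illustrative, not the actual KubeVirt API):

package notifyserver

import (
	"net"
	"net/rpc"
	"os"
	"path/filepath"
)

// DomainEventArgs is an illustrative payload: virt-launcher serializes the
// domain it manages along with the kind of lifecycle change it observed.
type DomainEventArgs struct {
	DomainJSON string
	EventType  string // e.g. "Added", "Modified", "Deleted"
}

// Notify is the RPC receiver virt-handler registers; every call pushes an
// event onto the channel consumed by the domain informer's Watch side.
type Notify struct {
	EventChan chan DomainEventArgs
}

func (n *Notify) DomainEvent(args *DomainEventArgs, reply *int) error {
	n.EventChan <- *args
	return nil
}

// RunServer listens on a unix socket in the shared directory and serves
// events sent by the virt-launcher clients on this node.
func RunServer(virtShareDir string, events chan DomainEventArgs) error {
	sock := filepath.Join(virtShareDir, "domain-notify.sock")
	os.Remove(sock) // clear a stale socket left over from a previous run
	listener, err := net.Listen("unix", sock)
	if err != nil {
		return err
	}
	server := rpc.NewServer()
	if err := server.Register(&Notify{EventChan: events}); err != nil {
		return err
	}
	server.Accept(listener) // blocks, serving each launcher connection
	return nil
}

The informer's Watch function would then drain the event channel, while List is served by the per-launcher command sockets described below.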

Virt-launcher Command Server

Each virt-launcher starts a command server that listens on a unix socket as part of the virt-launcher process's initialization.

By design, Virt-launcher has no connection to the k8s api server. The command server allows virt-handler to manage the VM's lifecycle by posting VM specs to virt-launcher to start/stop.

This command server is also how virt-handler's domain informer performs its List function. There's a directory of unix sockets, one belonging to each virt-launcher. The domain informer's List function iterates over each of these sockets and creates a cache of all the active domains on the local node.
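A sketch of that List path under the same assumptions (Domain stands in for the virtwrap api.Domain type, and the RPC method name mirrors the GetDomain call discussed later in this review):

package cmdclient

import (
	"net/rpc"
	"os"
	"path/filepath"
)

// Domain stands in for the virtwrap api.Domain type in this sketch.
type Domain struct {
	Name   string
	Status string
}

// GetDomainArgs carries no real data; it exists so the RPC call has a
// concrete, encodable args type.
type GetDomainArgs struct {
	Unused bool
}

// listAllDomains walks the directory of per-launcher command sockets and
// asks each virt-launcher for the domain it manages. The informer's List
// function would use a result like this to rebuild its cache of active
// domains on the local node.
func listAllDomains(socketDir string) ([]*Domain, error) {
	entries, err := os.ReadDir(socketDir)
	if err != nil {
		return nil, err
	}
	var domains []*Domain
	for _, entry := range entries {
		sock := filepath.Join(socketDir, entry.Name())
		client, err := rpc.Dial("unix", sock)
		if err != nil {
			continue // the launcher may already be gone; the watchdog covers cleanup
		}
		domain := &Domain{}
		if err := client.Call("Launcher.GetDomain", &GetDomainArgs{}, domain); err == nil {
			domains = append(domains, domain)
		}
		client.Close()
	}
	return domains, nil
}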

Migrations have been disabled

The reasoning for this is that migrations depend on network access to the libvirtd process managing the VM. Our plan for networking has the IP provided to each VM pod being taken over by the VM itself, which means processes running in the pod (other than the qemu process) will not have network access in the near future.

This doesn't mean we are abandoning live migrations. It just means we are accepting that migrations are a feature we're willing to sacrifice in the short term in order to simplify the move to a more desirable overall KubeVirt design.

Re-enabling migrations is being tracked in issue #676.

Libvirtd and Virtlogd

The libvirtd and virtlogd processes are now launched as part of virt-launcher's initialization sequence.

Originally I had libvirtd and virtlogd in their own respective containers in the VM pod, but this caused issues with startup and shutdown ordering.

Virt-launcher intercepts POSIX termination signals and uses them as the cue to begin gracefully shutting down the VM. We need to ensure that the libvirtd process does not shut down until after the VM has exited, which was hard to guarantee when libvirtd was not in the same container and not controlled by virt-launcher.
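A minimal sketch of that ordering, with hypothetical helper names rather than the real virt-launcher code:

package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// startProcess launches a child process (libvirtd, virtlogd) that lives and
// dies with virt-launcher, since it shares the same container.
func startProcess(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start %s: %v", name, err)
	}
	return cmd
}

// gracefullyShutdownDomain is a stand-in for the real logic, which would ask
// libvirt to shut the domain down and wait for the qemu process to exit.
func gracefullyShutdownDomain() {}

func main() {
	startProcess("virtlogd")
	libvirtd := startProcess("libvirtd")

	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)

	<-sigChan                  // the kubelet asked the pod to stop
	gracefullyShutdownDomain() // the VM must exit first...
	libvirtd.Process.Kill()    // ...only then is it safe to take libvirtd down
}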

Code Removal and Relocation

  • Everything under /pkg/virt-handler/virtwrap moved to /pkg/virt-launcher/virtwrap
  • The isolation pkg was used for detecting cgroups and namespaces of a qemu process. We don't need this anymore
  • The config-disk pkg was used as a job wrapper around cloud-init to prevent blocking virt-handler. Cloud-init disk creation has now moved to virt-launcher.
  • libvirt daemonset manifests have been removed
  • libvirt-kubevirt container has been removed. Virt-launcher now consumes the libvirt base container.
  • Migrations CRD manifest has been removed. We can reintroduce this CRD once migrations are enabled again.
  • Virt-handler's migration info rest endpoint has been removed.

Networking

Nothing involved with networking was impacted by this patch series. The VM pods remain in the host network namespace for now simply because the Pod network work hasn't been completed yet.

Testing Changes

  • The introduction of libvirtd per pod impacted the VM startup time slightly. As a result, some functional tests' timeouts have been increased.
  • Unit tests for all new functionality (command server, notify server, domain informer) have been introduced.
  • No loss of test coverage is expected. Any unit tests that were removed had their functionality moved to another unit test or were dropped entirely as a result of the functionality no longer existing.
  • Migration Functional tests remain in the src tree but are inactive.
  • Tests that involve multiple VMs launching in parallel have "fail test on warning" disabled. Libvirt sometimes complains when creating macvtap devices because two libvirtds are attempting to do so in parallel. The VMs still launch fine; it just involves a retry. This will get fixed with the networking changes.

Issues resolved by these changes.

Fixes #421, fixes #364, fixes #196

@davidvossel davidvossel requested review from fabiand, vladikr and rmohr Jan 18, 2018

// - storage prep
// - network prep
// - cloud-init
func (l *LibvirtDomainManager) preStartHook(vm *v1.VirtualMachine, domain *api.Domain) error {


@davidvossel

davidvossel Jan 18, 2018

Member

@vladikr This is an entry point you can use for setting up anything in the environment that needs to be set up for networking. It occurs right before the VM starts.

@davidvossel


Member

davidvossel commented Jan 19, 2018

retest this please

@davidvossel


Member

davidvossel commented Jan 24, 2018

@fabiand @rmohr I don't consider this WIP anymore. All the functionality and tests are present now.

@davidvossel


Member

davidvossel commented Jan 24, 2018

The description has been updated with a detailed account of what has been changed in this patch series.

@fabiand

Thanks for the extensive description.

Overall this looks good!

There are a few things (i.e. talking to launcher) where I am unsure how we want to do this in the long run, but for now all of this works for me.

I think we should merge this to really expose this work to broader testing.

@rmohr


Member

rmohr commented Jan 25, 2018

retest this please

@vladikr

It looks good to me and I didn't run into any issues so far while working on networking. I'd prefer to have it merged before posting the networking PR.

@davidvossel


Member

davidvossel commented Jan 25, 2018

retest this please

@davidvossel


Member

davidvossel commented Jan 29, 2018

retest this please

@rmohr

Some initial thoughts and questions.

ln -s cpu,cpuacct cpuacct,cpu
mount -o remount,ro /host-sys/fs/cgroup
fi
#if [ ! -d "/host-sys/fs/cgroup/cpuacct,cpu" ]; then


@rmohr

rmohr Jan 30, 2018

Member

Since we don't need to bind mount the cgroups anymore, we can remove the code.


@davidvossel

davidvossel Jan 30, 2018

Member

yep, i'll remove that now. it was commented out for that reason


@davidvossel
// is down, the domain needs to be deleted from the cache.
err = d.domainInformer.GetStore().Delete(domain)
if err != nil {
return err


@rmohr

rmohr Jan 30, 2018

Member

That sounds conceptually wrong. I think that we need to make sure that the listwatcher transforms a "down" to a cache delete.


@davidvossel

davidvossel Jan 30, 2018

Member

Under normal operation, the listwatcher will process all of this correctly. Delete events come in and domains are removed from the cache by the informer.

This is an edge case where the informer never receives the Delete event from the VM Pod. The watchdog expires and we remove the cache entry here. This condition could occur if virt-handler is down when the final delete notification is sent, or if virt-launcher forcibly exits in a way that results in the final delete notification not being sent.


@rmohr

rmohr Jan 30, 2018

Member

This condition could occur if virt-handler is down when the final delete notification is sent or if virt-launcher forcibly exits in a way that results in the final delete notification not being sent.

That should normally not be necessary. A few mechanisms should be sufficient to prevent that:

  • warming up the domain cache with the vm cluster-state. If we then don't find a domain matching the expectation based on the vm, the informer should delete the object from the cache automatically
  • If we know which socket belongs to which vm, we can infer which vm got removed when the socket connection gets closed or when no one is listening on the socket.

@davidvossel can you think of a scenario where we can miss a delete if the above conditions are met?


@davidvossel

davidvossel Jan 30, 2018

Member

i'll give this part some more attention this afternoon. There were some things i was trying to avoid with regards to coupling the client connection to the domain's status. I think this can be simplified though.


@davidvossel

davidvossel Jan 30, 2018

Member

@rmohr

I've removed the watchdog informer entirely. The domain informer now polls for stale watchdog files and fires a delete event when one is encountered.

I'm hesitant to tie any of this Delete event logic directly to the unix socket, which is why I'm still using the watchdog files. I don't want the presence, or absence, of the unix socket file to imply that a stale domain needs to be processed.
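Roughly, the polling looks like this (a sketch assuming one watchdog file per domain that virt-launcher periodically touches; names are illustrative, not the actual code):

package watchdog

import (
	"os"
	"time"
)

// staleWatchdogFiles is a sketch of the polling side: any watchdog file not
// touched within the timeout is treated as a dead virt-launcher, and the
// informer fires a Delete event for the corresponding domain.
func staleWatchdogFiles(watchdogDir string, timeout time.Duration) ([]string, error) {
	entries, err := os.ReadDir(watchdogDir)
	if err != nil {
		return nil, err
	}
	var stale []string
	for _, entry := range entries {
		info, err := entry.Info()
		if err != nil {
			continue
		}
		if time.Since(info.ModTime()) > timeout {
			stale = append(stale, entry.Name()) // file name maps to a domain
		}
	}
	return stale, nil
}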


@rmohr

rmohr Jan 31, 2018

Member

works for me. Just modifying the cache outside of the informer is a no go afaik (locking, ...).

return err
}
func (c *VirtLauncherClient) ListDomains() ([]*api.Domain, error) {


@rmohr

rmohr Jan 30, 2018

Member

Shouldn't that be ListDomain?


@davidvossel

davidvossel Jan 30, 2018

Member

This translates to ListAllDomains on the backend.

*
*/
package cache


@rmohr

rmohr Jan 30, 2018

Member

Looks like a better place for this file would be inside virt-handler?


@davidvossel

davidvossel Jan 30, 2018

Member

yup, i agree. I'll move this


@davidvossel
ShutdownVirtualMachine(vm *v1.VirtualMachine) error
KillVirtualMachine(vm *v1.VirtualMachine) error
SyncSecret(vm *v1.VirtualMachine, usageType string, usageID string, secretValue string) error
ListDomains() ([]*api.Domain, error)


@rmohr

rmohr Jan 30, 2018

Member

Shouldn't that be more something like GetDomain()(*api.Domain, error)?


@davidvossel

davidvossel Jan 30, 2018

Member

I wanted to use the ListAllDomains libvirt api call rather than having the cmd server maintain any state related to the name of the domain it is managing. I could hide that we're using the ListAllDomains function and just return the first domain entry for the command client's GetDomain function. I didn't see a reason to do that though.


@rmohr

rmohr Jan 30, 2018

Member

That is very confusing, since we know that it can always only ever return one domain (I think we even know the name of the domain to look up).


@davidvossel

davidvossel Jan 30, 2018

Member

how do we know the name of the domain to look up? This function is used by the domain informer's List function to sync the cache. There's no knowledge on the client side as to what domain they're requesting at that point.


@rmohr

rmohr Jan 30, 2018

Member

Still, it can just return one domain, or am I at the wrong place in the code?


@davidvossel

davidvossel Jan 30, 2018

Member

yes, we can hide the fact that the command server is using the ListAllDomains libvirt api function and just call the command client's function GetDomain. I'm fine with that


@davidvossel

davidvossel Jan 30, 2018

Member

fixed :)

}
args := &Args{
VMJSON: string(vmJSON),


@rmohr

rmohr Jan 30, 2018

Member

I would feel better if we understood the issue we are facing, which we are working around by using JSON.


@davidvossel

davidvossel Jan 30, 2018

Member
type ResourceList map[ResourceName]resource.Quantity

Anything that uses that map (like memory or cpu requests/limits) gets lost if we don't serialize/deserialize the VirtualMachine as a JSON object.


@rmohr

rmohr Jan 30, 2018

Member

Could you give me a before RPC and after RPC example?


@davidvossel

davidvossel Jan 30, 2018

Member

before

apiVersion: kubevirt.io/v1alpha1
kind: VirtualMachine
metadata:
  name: testvm
spec:   
  terminationGracePeriodSeconds: 0                                  
  domain:
    resources:
      requests:
        memory: 64M

after

kind: VirtualMachine
metadata:
  name: testvm
spec:   
  terminationGracePeriodSeconds: 0                                  
  domain:
    resources:
      requests:
        memory:

the memory quantity gets lost


@rmohr

rmohr Jan 31, 2018

Member

ok, so the reason is that the Quantity type has hidden fields, and the serializer by default only takes public fields. Therefore it keeps the format, but not the actual value ... A custom MarshalBinary() might solve that. However not sure if it is easy to do ...
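A small self-contained demo of that behavior, using a stand-in type rather than the real resource.Quantity (gob, the default net/rpc encoding, only serializes exported fields):

package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// quantity mimics the shape of resource.Quantity for illustration only: the
// actual value lives in unexported fields, and only the format is exported.
type quantity struct {
	value  int64  // unexported: silently skipped by gob
	Format string // exported: survives the round trip
}

func main() {
	in := quantity{value: 64000000, Format: "DecimalSI"}

	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		panic(err)
	}
	var out quantity
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		panic(err)
	}
	// out.value is 0: the format is kept but the actual value is lost.
	fmt.Printf("in: %+v\nout: %+v\n", in, out)
}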


@rmohr

rmohr Jan 31, 2018

Member

ok, so I think I have a solution. Could you implement "MarshalBinary()" and "UnmarshalBinary()" on our VM type? In there just marshal the whole struct into json and convert it to bytes. Then you can remove all custom json marshalling actions regarding to RPC. That would give me a better feeling.
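A minimal sketch of that suggestion (the actual change landed later as the "Add Binary Marshaler to VM object" commit; this assumes the v1.VirtualMachine type seen elsewhere in this PR):

package v1

import "encoding/json"

// MarshalBinary lets gob-based transports (such as Go's net/rpc) round-trip
// the VirtualMachine through JSON, which preserves resource.Quantity values
// via their custom JSON marshalling.
func (vm VirtualMachine) MarshalBinary() ([]byte, error) {
	return json.Marshal(vm)
}

// UnmarshalBinary restores a VirtualMachine from the JSON produced above.
func (vm *VirtualMachine) UnmarshalBinary(data []byte) error {
	return json.Unmarshal(data, vm)
}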


@rmohr

rmohr Jan 31, 2018

Member

I think in the long run we should use protobuf for rpc, since k8s takes care that their types work with protobuf (and apparently don't care about normal rpc).


@davidvossel

davidvossel Jan 31, 2018

Member

Could you implement "MarshalBinary()" and "UnmarshalBinary()" on our VM type

that seems a little overkill. Is there a simple way of doing this? The binary encoding that golang provides is the same thing that rpc uses, which stripped the memory requests.


@davidvossel

davidvossel Jan 31, 2018

Member

I see what you're getting at now. We'll just use a MarshalBinary function and wrap the JSON encode in it. I'm fine with doing that.


@davidvossel
srvErr := make(chan error)
go func() {
defer close(srvErr)
err := notifyserver.RunServer(d.virtShareDir, d.stopChan, d.eventChan)


@rmohr

rmohr Jan 31, 2018

Member

The whole notify server might also be better placed in the virt-handler package.


@davidvossel

davidvossel Jan 31, 2018

Member

i'm torn about that one. The notify client depends on libvirt. I could move the server to virt-handler but I'd want to keep the client package under virt-launcher. i thought it was better to keep them both together in one place.


@davidvossel

davidvossel Jan 31, 2018

Member

fixed. I moved notify server and cmd client to virt-handler

return nil
}
func (s *Launcher) Start(args *cmdclient.Args, reply *cmdclient.Reply) error {


@rmohr

rmohr Jan 31, 2018

Member

I think that Start is misleading. Having something like Sync in the name is more correct. Also, right now virt-handler would try to unpause a VM if it is in an unexpected pause mode, so to some degree it really already tries to synchronize the vm and the domain state.


davidvossel added some commits Jan 19, 2018

Switch to decentralized libvirtd-per-pod
Signed-off-by: David Vossel <davidvossel@gmail.com>
Move ListAllDomains to DomainManager
Signed-off-by: David Vossel <davidvossel@gmail.com>
remove unused domain connection from cmd server and update mock generated

Signed-off-by: David Vossel <davidvossel@gmail.com>
Command client/server unit test suite
Signed-off-by: David Vossel <davidvossel@gmail.com>
notify client/server unit tests
Signed-off-by: David Vossel <davidvossel@gmail.com>
Unit tests for virtwrap/cache
Signed-off-by: David Vossel <davidvossel@gmail.com>
add advanced domain notify tests
Signed-off-by: David Vossel <davidvossel@gmail.com>
DomainManager ListAllDomains unit tests
Signed-off-by: David Vossel <davidvossel@gmail.com>
adjust console test timeouts
Signed-off-by: David Vossel <davidvossel@gmail.com>

davidvossel added some commits Jan 24, 2018

Remove private shared directory now that libvirtd is in each VM pod
Signed-off-by: David Vossel <davidvossel@gmail.com>
adjust timeout for config functional tests
Signed-off-by: David Vossel <davidvossel@gmail.com>
allow domain xml to be modified during preStartHook
Signed-off-by: David Vossel <davidvossel@gmail.com>
ensure cmd socket is cleaned up
Signed-off-by: David Vossel <davidvossel@gmail.com>
Replaces ListDomains function with GetDomain in cmd-server client
Signed-off-by: David Vossel <davidvossel@gmail.com>
Removes dead code from libvirt.sh startup file
Signed-off-by: David Vossel <davidvossel@gmail.com>
Move domain informer to virt-handler
Signed-off-by: David Vossel <davidvossel@gmail.com>
Remove watchdog informer and wrap watchdog logic entirely into domain informer

Signed-off-by: David Vossel <davidvossel@gmail.com>
move command client to virt-handler
Signed-off-by: David Vossel <davidvossel@gmail.com>
move notify-server to virt-handler
Signed-off-by: David Vossel <davidvossel@gmail.com>
Rename StartVirtualMachine to SyncVirtualMachine
Signed-off-by: David Vossel <davidvossel@gmail.com>
Add Binary Marshaler to VM object
Signed-off-by: David Vossel <davidvossel@gmail.com>
@rmohr

rmohr approved these changes Feb 1, 2018

@rmohr rmohr merged commit 8a77fdc into kubevirt:master Feb 1, 2018

4 of 6 checks passed

check-patch.el7.x86_64: Running test
standard-ci: Running tests
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage decreased (-0.6%) to 66.119%
kubevirt-functional-tests/jenkins/pr/vagrant-dev: All is well, still in progress...
kubevirt-functional-tests/jenkins/pr/vagrant-release: All is well, still in progress...

@rmohr rmohr referenced this pull request Feb 1, 2018

Merged

Decentralized pod networking #686

fabiand added a commit to fabiand/kubevirt that referenced this pull request Feb 4, 2018

docs: Update architecture diagram to reflect kubevirt#663
Fixes kubevirt#698

Signed-off-by: Fabian Deutsch <fabiand@fedoraproject.org>

rmohr added a commit that referenced this pull request Feb 5, 2018

Merge pull request #707 from fabiand/archDiagUpdate
docs: Update architecture diagram to reflect #663

rmohr added a commit to fromanirh/kubevirt that referenced this pull request Feb 7, 2018

docs: Update architecture diagram to reflect kubevirt#663
Fixes kubevirt#698

Signed-off-by: Fabian Deutsch <fabiand@fedoraproject.org>

@fabiand fabiand referenced this pull request Feb 14, 2018

Closed

Move to a decentralized approach #645

3 of 3 tasks complete