Decentralized Libvirt #663
Conversation
// - storage prep
// - network prep
// - cloud-init
func (l *LibvirtDomainManager) preStartHook(vm *v1.VirtualMachine, domain *api.Domain) error {
davidvossel (Author, Member) commented Jan 18, 2018
@vladikr This is an entry point you can use for setting up anything in the environment that needs to be set up for networking. It occurs right before the VM starts.
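For illustration only, here is a minimal, self-contained sketch of how such a networking step could plug into this hook. The setupNetworking helper and the simplified VirtualMachine/Domain types are hypothetical stand-ins, not part of this patch:

```go
package main

import "fmt"

// Hypothetical stand-ins for the real *v1.VirtualMachine and *api.Domain types.
type VirtualMachine struct{ Name string }
type Domain struct{ Name string }

// setupNetworking is a hypothetical placeholder for whatever in-pod network
// preparation ends up being needed before libvirt starts the domain.
func setupNetworking(vm *VirtualMachine, domain *Domain) error {
	fmt.Printf("preparing network for %s\n", vm.Name)
	return nil
}

// preStartHook mirrors the entry point discussed above: storage prep,
// cloud-init generation, and (eventually) network prep all run right
// before the domain is started.
func preStartHook(vm *VirtualMachine, domain *Domain) error {
	// ... storage prep and cloud-init generation would run here ...
	if err := setupNetworking(vm, domain); err != nil {
		return fmt.Errorf("network prep failed for %s: %v", vm.Name, err)
	}
	return nil
}

func main() {
	vm := &VirtualMachine{Name: "testvm"}
	if err := preStartHook(vm, &Domain{Name: "testvm"}); err != nil {
		panic(err)
	}
}
```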
Force-pushed from f562b9d to 879d939
retest this please
Force-pushed from 879d939 to 0105418
Force-pushed from dfbe2c5 to 86cbcdf
The description has been updated with a detailed account of what has been changed in this patch series.
Thanks for the extensive description. Overall this looks good! There are a few things (i.e. talking to the launcher) where I am unsure how we want to do this in the long run, but for now all of this works for me. I think we should merge this to really expose this work to broader testing.
retest this please
It looks good to me and I didn't run into any issues so far while working on networking. I'd prefer to have it merged before posting the networking PR.
retest this please
Force-pushed from acc6550 to 2c6795e
retest this please
Some initial thoughts and questions.
ln -s cpu,cpuacct cpuacct,cpu
mount -o remount,ro /host-sys/fs/cgroup
fi
#if [ ! -d "/host-sys/fs/cgroup/cpuacct,cpu" ]; then
rmohr (Member) commented Jan 30, 2018
Since we don't need to bind mount the cgroups anymore, we can remove the code.
davidvossel (Author, Member) commented Jan 30, 2018
Yep, I'll remove that now. It was commented out for that reason.
// is down, the domain needs to be deleted from the cache.
err = d.domainInformer.GetStore().Delete(domain)
if err != nil {
	return err
rmohr (Member) commented Jan 30, 2018
That sounds conceptually wrong. I think that we need to make sure that the listwatcher transforms a "down" to a cache delete.
davidvossel (Author, Member) commented Jan 30, 2018
Under normal operation, the listwatcher will process all of this correctly. Delete events come in and domains are removed from the cache by the informer.
This is an edge case where the informer never receives the Delete event from the VM Pod. The watchdog expires and we remove the cache entry here. This condition could occur if virt-handler is down when the final delete notification is sent, or if virt-launcher forcibly exits in a way that results in the final delete notification not being sent.
rmohr (Member) commented Jan 30, 2018
> This condition could occur if virt-handler is down when the final delete notification is sent, or if virt-launcher forcibly exits in a way that results in the final delete notification not being sent.
That should normally not be necessary. A few mechanisms should be sufficient to prevent that:
- Warming up the domain cache with the VM cluster-state. If we then don't find a domain matching the expectation based on the VM, the informer should delete the object from the cache automatically.
- If we know which socket belongs to which VM, then when the socket connection gets closed, or when no one is listening on the socket, we can infer which VM got removed.
@davidvossel can you think of a scenario where we can miss a delete if the above conditions are met?
davidvossel (Author, Member) commented Jan 30, 2018
I'll give this part some more attention this afternoon. There were some things I was trying to avoid with regard to coupling the client connection to the domain's status. I think this can be simplified, though.
davidvossel (Author, Member) commented Jan 30, 2018
I've removed the watchdog informer entirely. The domain informer now polls for stale watchdog files and fires a delete event when one is encountered.
I'm hesitant to tie any of this Delete event logic directly to the unix socket, which is why I'm still using the watchdog files. I don't want the presence, or absence, of the unix socket file to imply that a stale domain needs to be processed.
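For illustration, a minimal sketch of the stale-watchdog polling described above. The directory layout, one-file-per-VM convention, and timeout are assumptions, not the actual virt-handler code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// watchdogTimeout is an assumed expiration window; the real value is
// configurable in virt-handler.
const watchdogTimeout = 30 * time.Second

// staleWatchdogFiles returns every watchdog file that has not been touched
// within the timeout; each one identifies a domain that should receive a
// synthetic Delete event from the informer's List/poll pass.
func staleWatchdogFiles(watchdogDir string) ([]string, error) {
	files, err := filepath.Glob(filepath.Join(watchdogDir, "*"))
	if err != nil {
		return nil, err
	}
	var stale []string
	now := time.Now()
	for _, f := range files {
		info, err := os.Stat(f)
		if err != nil {
			continue // file vanished between Glob and Stat
		}
		if now.Sub(info.ModTime()) > watchdogTimeout {
			stale = append(stale, f)
		}
	}
	return stale, nil
}

func main() {
	// Hypothetical layout: one watchdog file per VM, touched periodically
	// by the corresponding virt-launcher.
	stale, err := staleWatchdogFiles("/var/run/kubevirt/watchdog-files")
	if err != nil {
		panic(err)
	}
	for _, f := range stale {
		fmt.Printf("watchdog expired, firing Delete for %s\n", filepath.Base(f))
	}
}
```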
rmohr (Member) commented Jan 31, 2018
Works for me. It's just that modifying the cache outside of the informer is a no-go AFAIK (locking, ...).
	return err
}

func (c *VirtLauncherClient) ListDomains() ([]*api.Domain, error) {
 *
 */

package cache
ShutdownVirtualMachine(vm *v1.VirtualMachine) error
KillVirtualMachine(vm *v1.VirtualMachine) error
SyncSecret(vm *v1.VirtualMachine, usageType string, usageID string, secretValue string) error
ListDomains() ([]*api.Domain, error)
davidvossel (Author, Member) commented Jan 30, 2018
I wanted to use the ListAllDomains libvirt API call rather than having the cmd server maintain any state related to the name of the domain it is managing. I could hide that we're using the ListAllDomains function and just return the first domain entry for the command client's GetDomain function. I didn't see a reason to do that, though.
rmohr (Member) commented Jan 30, 2018
That is very confusing, since we know that it can only ever return one domain (I think we even know the name of the domain to look up).
davidvossel (Author, Member) commented Jan 30, 2018
How do we know the name of the domain to look up? This function is used by the domain informer's List function to sync the cache. There's no knowledge on the client side as to which domain is being requested at that point.
rmohr (Member) commented Jan 30, 2018
Still, it can just return one domain, or am I at the wrong place in the code?
davidvossel (Author, Member) commented Jan 30, 2018
Yes, we can hide the fact that the command server is using the ListAllDomains libvirt API function and just call the command client's function GetDomain. I'm fine with that.
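As a sketch of that wrapping (simplified types and a hypothetical client shape, not the real command client API), GetDomain on the client side could simply hide the list call:

```go
package main

import (
	"errors"
	"fmt"
)

// Domain is a simplified stand-in for the real *api.Domain type.
type Domain struct{ Name string }

type cmdClient struct {
	// listAll stands in for the RPC call that is backed by libvirt's
	// ListAllDomains on the command-server side.
	listAll func() ([]*Domain, error)
}

// GetDomain hides the list call: it returns the single domain a
// virt-launcher manages, or nil if none is defined yet.
func (c *cmdClient) GetDomain() (*Domain, error) {
	domains, err := c.listAll()
	if err != nil {
		return nil, err
	}
	if len(domains) == 0 {
		return nil, nil
	}
	if len(domains) > 1 {
		return nil, errors.New("a virt-launcher should only ever manage one domain")
	}
	return domains[0], nil
}

func main() {
	c := &cmdClient{listAll: func() ([]*Domain, error) {
		return []*Domain{{Name: "testvm"}}, nil
	}}
	d, _ := c.GetDomain()
	fmt.Println(d.Name)
}
```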
}

args := &Args{
	VMJSON: string(vmJSON),
rmohr (Member) commented Jan 30, 2018
I would feel better if we understood the issue we are facing, which we are working around by using JSON.
davidvossel (Author, Member) commented Jan 30, 2018
type ResourceList map[ResourceName]resource.Quantity
Anything that uses that map (like memory or cpu requests/limits) gets lost if we don't serialize/deserialize the VirtualMachine as a JSON object.
davidvossel (Author, Member) commented Jan 30, 2018
before:

apiVersion: kubevirt.io/v1alpha1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  terminationGracePeriodSeconds: 0
  domain:
    resources:
      requests:
        memory: 64M

after:

kind: VirtualMachine
metadata:
  name: testvm
spec:
  terminationGracePeriodSeconds: 0
  domain:
    resources:
      requests:
        memory:

the memory quantity gets lost
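A small stand-alone reproduction of the symptom, assuming k8s.io/apimachinery is available: it round-trips a resource.Quantity through gob (the default net/rpc codec) and through JSON, to show that only the JSON round trip is expected to keep the value:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	q := resource.MustParse("64M")

	// JSON keeps the value, because Quantity implements MarshalJSON.
	j, _ := json.Marshal(q)
	fmt.Println("json:", string(j))

	// gob only sees exported fields, so the amount stored in Quantity's
	// unexported fields is expected to be dropped.
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(q); err != nil {
		panic(err)
	}
	var out resource.Quantity
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("gob:", out.String())
}
```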
rmohr (Member) commented Jan 31, 2018
Ok, so the reason is that the Quantity type has hidden fields, and the serializer by default only takes public fields. Therefore it keeps the format, but not the actual value... A custom MarshalBinary() might solve that. However, I'm not sure it is easy to do...
rmohr (Member) commented Jan 31, 2018
Ok, so I think I have a solution. Could you implement "MarshalBinary()" and "UnmarshalBinary()" on our VM type? In there, just marshal the whole struct into JSON and convert it to bytes. Then you can remove all custom JSON marshalling actions related to RPC. That would give me a better feeling.
rmohr (Member) commented Jan 31, 2018
I think in the long run we should use protobuf for rpc, since k8s takes care that their types work with protobuf (and apparently doesn't care about normal rpc).
davidvossel (Author, Member) commented Jan 31, 2018
> Could you implement "MarshalBinary()" and "UnmarshalBinary()" on our VM type
That seems a little overkill. Is there a simpler way of doing this? The binary encoding that Go provides is the same thing that rpc uses, and it stripped the memory requests.
davidvossel (Author, Member) commented Jan 31, 2018
I see what you're getting at now. We'll just use a MarshalBinary function and wrap the JSON encode in it. I'm fine with doing that.
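A minimal sketch of that approach, shown on a simplified stand-in for the VirtualMachine type rather than the real API type: gob (and therefore net/rpc) calls MarshalBinary/UnmarshalBinary when they are implemented, so the struct is transparently sent as JSON and map-backed fields survive:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"encoding/json"
	"fmt"
)

// VirtualMachine is a hypothetical, simplified stand-in for the real API type.
type VirtualMachine struct {
	Name     string            `json:"name"`
	Requests map[string]string `json:"requests,omitempty"`
}

// MarshalBinary lets gob/net-rpc send the struct as JSON bytes.
func (vm *VirtualMachine) MarshalBinary() ([]byte, error) {
	return json.Marshal(vm)
}

// UnmarshalBinary restores the struct from the JSON bytes produced above.
func (vm *VirtualMachine) UnmarshalBinary(data []byte) error {
	return json.Unmarshal(data, vm)
}

func main() {
	in := &VirtualMachine{Name: "testvm", Requests: map[string]string{"memory": "64M"}}

	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		panic(err)
	}
	out := &VirtualMachine{}
	if err := gob.NewDecoder(&buf).Decode(out); err != nil {
		panic(err)
	}
	// The whole struct, including the map, round-trips through gob.
	fmt.Printf("%+v\n", out)
}
```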
Force-pushed from 1fdb091 to f99aebf
srvErr := make(chan error)
go func() {
	defer close(srvErr)
	err := notifyserver.RunServer(d.virtShareDir, d.stopChan, d.eventChan)
rmohr (Member) commented Jan 31, 2018
The whole notify server might also be better placed in the virt-handler package.
davidvossel (Author, Member) commented Jan 31, 2018
I'm torn about that one. The notify client depends on libvirt. I could move the server to virt-handler, but I'd want to keep the client package under virt-launcher. I thought it was better to keep them both together in one place.
	return nil
}

func (s *Launcher) Start(args *cmdclient.Args, reply *cmdclient.Reply) error {
rmohr (Member) commented Jan 31, 2018
I think that Start is misleading. Having something like Sync in the name is more correct. Also, right now virt-handler would try to unpause a VM if it is in an unexpected pause mode, so to some degree it really already tries to synchronize the VM and the domain state.
Force-pushed from f99aebf to 0dff512
Merged commit 8a77fdc into kubevirt:master
Fixes kubevirt#698 Signed-off-by: Fabian Deutsch <fabiand@fedoraproject.org>
davidvossel commented Jan 18, 2018 (edited)
Overview
This patch moves KubeVirt's design from one that depends on a centralized libvirtd per node to one that uses a decentralized libvirtd per VM pod.
With this new decentralized approach, each VM's qemu process now lives directly within the VM Pod's cgroups and namespaces, which means that any storage/network devices in the Pod are available to the VM.
Notable Changes
Cloud-init and Registry Disks
Generation and lifecycle management of ephemeral disks have moved from virt-handler to virt-launcher.
This data is now completely self-contained (no shared host mounts) within the VM Pod, which means cleanup occurs automatically as part of the kubelet tearing down the Pod's environment.
Notifications Server and Domain Informer
Previously virt-handler received events about lifecycle changes to a domain through a libvirt event callback.
Now virt-handler receives domain lifecycle events through its notification server. Virt-handler starts a notification server that listens on a unix socket. Each virt-launcher acts as a client to this notification server and forwards domain lifecycle events to it.
The virt-handler domain informer uses this notification server for its Watch function. The informer's List function iterates over every known virt-launcher present on the local host and requests the latest information about all defined domains.
Virt-launcher Command Server
Each virt-launcher starts a command server that listens on a unix socket as part of the virt-launcher process's initialization.
By design, Virt-launcher has no connection to the k8s api server. The command server allows virt-handler to manage the VM's lifecycle by posting VM specs to virt-launcher to start/stop.
This command server is also how virt-handler's domain informer performs its List function. There is a directory of unix sockets, one per virt-launcher. The domain informer's List function iterates over each of these sockets and builds a cache of all the active domains on the local node.
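For illustration, a sketch of that List step with simplified types; the socket directory, naming pattern, and client call are assumptions rather than the actual implementation:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Domain is a simplified stand-in for the real *api.Domain type.
type Domain struct{ Name string }

// listDomainsVia stands in for dialing a virt-launcher's unix socket and
// issuing the command client's list/get call.
func listDomainsVia(socket string) ([]*Domain, error) {
	return []*Domain{{Name: filepath.Base(socket)}}, nil // placeholder
}

// listAllDomains aggregates domains from every virt-launcher command socket
// found under socketDir (one socket per VM pod on this node).
func listAllDomains(socketDir string) ([]*Domain, error) {
	sockets, err := filepath.Glob(filepath.Join(socketDir, "*_sock"))
	if err != nil {
		return nil, err
	}
	var all []*Domain
	for _, s := range sockets {
		domains, err := listDomainsVia(s)
		if err != nil {
			continue // a dead launcher shouldn't fail the whole List
		}
		all = append(all, domains...)
	}
	return all, nil
}

func main() {
	domains, err := listAllDomains("/var/run/kubevirt/sockets")
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d domains\n", len(domains))
}
```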
Migrations have been disabled
The reasoning for this is that migrations depend on network access to the libvirtd process managing the VM. Our plan for networking is for the IP provided to each VM pod to be taken over by the VM itself, which means processes running in the pod (other than the qemu process) will not have network access in the near future.
This doesn't mean we are abandoning live migrations. It just means we are accepting that migrations are a feature we're willing to sacrifice in the short term in order to simplify the move to a more desirable overall KubeVirt design.
Re-enabling migrations is being tracked in issue #676.
Libvirtd and Virtlogd
The libvirtd and virtlogd processes are now launched as part of virt-launcher's initialization sequence.
Originally I had libvirtd and virtlogd in their own respective containers in the VM pod; however, this caused issues with startup and shutdown ordering.
Virt-launcher intercepts POSIX termination signals and uses them as the cue to begin gracefully shutting down the VM. We need to ensure that the libvirtd process does not shut down until after the VM has exited. This was hard to guarantee when libvirtd was not in the same container and under virt-launcher's control.
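For illustration, a sketch of that shutdown ordering; the helper functions are hypothetical placeholders for the real domain-manager and process-management code:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// Hypothetical placeholders for the real domain and process management.
func shutdownDomainGracefully() { fmt.Println("asking the domain to shut down ...") }
func waitForDomainGone()        { fmt.Println("domain has exited") }
func stopLibvirtd()             { fmt.Println("stopping libvirtd and virtlogd") }

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	// ... libvirtd, virtlogd and the command server are started here ...

	<-sigs
	// The domain must be gone before libvirtd stops, otherwise the
	// graceful shutdown has no daemon left to talk to.
	shutdownDomainGracefully()
	waitForDomainGone()
	stopLibvirtd()
}
```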
Code Removal and Relocation
Networking
Nothing involving networking was impacted by this patch series. The VM pods remain in the host network namespace for now, simply because the Pod network work hasn't been completed yet.
Testing Changes
Issues resolved by these changes.
Fixes #421, fixes #364, fixes #196