
proposal: Live migration for bridged pod network #182

Closed

Conversation

@kubevirt-bot kubevirt-bot added the dco-signoff: no Indicates the PR's author has not DCO signed all their commits. label Jul 6, 2022
@kvaps
Member Author

kvaps commented Jul 6, 2022

/assign @maiqueb @rthallisey

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps force-pushed the live-migration-for-bridged-pod-network branch from af65532 to 17fb66c Compare July 6, 2022 13:55
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jul 6, 2022
@maiqueb
Contributor

maiqueb commented Jul 6, 2022

/cc @AlonaKaplan

Contributor

@maiqueb maiqueb left a comment


Thanks for taking the time to come up w/ this proposal.

I'm mostly worried / concerned about workloads that are bound to specific addresses.

I'm unsure we can accept this without offering a way to protect them.

@AlonaKaplan would you review this proposal as well?

Comment on lines 12 to 14
The live-migration of virtual machines between pods with different MAC addresses will invoke a NIC reattach procedure. This might affect some applications that bind to a specific interface inside the VM.

The live-migration of virtual machines between pods with the same MAC address will invoke a link-down/link-up procedure so the VM renews its DHCP lease, IP address, and routes. This is a less destructive operation, but it may still affect some workloads.
Contributor

I think we care a lot about this type of workload. I don't think you can have this as a non-goal - you need to account for it in your design, imo.

@AlonaKaplan can you chime in?

Member

Well yeah, the IP is changed, so processes inside the guest that are using the IP will be affected, same as external processes. Documenting it as a first stage should be enough.

design-proposals/live-migration-for-bridged-pod-network.md (resolved review threads, outdated)
@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jul 8, 2022
@kvaps kvaps force-pushed the live-migration-for-bridged-pod-network branch 2 times, most recently from 47b1b30 to eb0d5dd Compare July 13, 2022 14:58
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jul 13, 2022
Member

@EdDev EdDev left a comment

Thank you for this interesting proposal!

I like it and at the same time am a bit worried.
It is not very clear to me how much adoption it may get, given the disruptions and limitations. But perhaps this is a good opportunity to try it and see.

I have commented inline about a few questions and concerns.

Thank you for bringing this to the table.


# Design

Add a new feature gate `NetworkAwareLiveMigration`, which enables two additional steps in the live-migration procedure:
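For context, a hedged sketch of how such a gate would typically be switched on, assuming the standard KubeVirt feature-gate mechanism (the gate name is the one proposed here, not an existing upstream gate):

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - NetworkAwareLiveMigration   # proposed gate; does not exist in upstream KubeVirt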
Member

  • I think the name should be more specific here, e.g. DisruptiveBridgeBindingMigration.
  • Could you please elaborate on when in the (migration) flow you plan to execute the detach/attach operation?
  • Could you please elaborate on when in the (migration) flow you plan to execute the link down/up operation?
  • If I understand correctly, the IP address is always expected to change, whether or not the MAC is changed. Right?

Member Author

@kvaps kvaps Jul 22, 2022

I think the name should be more specific here, e.g. DisruptiveBridgeBindingMigration.

I wouldn't encode a specific mode in the feature gate name. An upcoming PR will bring a macvtap binding mode that will also work with this feature.

  • Could you please elaborate on when in the (migration) flow you plan to execute the detach/attach operation?
  • Could you please elaborate on when in the (migration) flow you plan to execute the link down/up operation?

That was described here:

- If the MAC address changed: detach / attach the interface after live migration so the correct MAC address is set
The live-migration of virtual machines between pods with different MAC addresses will invoke a NIC reattach procedure. This may affect some applications that bind to a specific interface inside the VM.
- If the MAC address is not changed: link down / link up the interface to force the VM to request a new IP address and routes from DHCP
The live-migration of virtual machines between pods with the same MAC address will invoke a link-down/link-up procedure so the VM renews its DHCP lease, IP address, and routes. This is a less destructive operation, but it may still affect some workloads listening on particular IP addresses.

If you're asking when exactly this is going to happen: sure, I will add the information that it happens after a successful live migration.

If I understand correctly, the IP address is always expected to change, whether or not the MAC is changed. Right?

Not exactly: some CNIs allow specifying the IP address in an annotation, in which case the IP address will not change, but we still need to link down and link up the interface to make sure the VM has correct routes.
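For illustration only, a rough sketch of the equivalent libvirt-level operations (KubeVirt would drive these through the libvirt API inside virt-launcher rather than through virsh; the domain name, source bridge, interface, and MAC values below are placeholders):

# MAC address changed: reattach the NIC once the migration has completed
virsh detach-interface my-vm bridge --mac 52:54:00:aa:bb:01 --live
virsh attach-interface my-vm bridge k6t-eth0 --mac 52:54:00:aa:bb:02 --model virtio --live

# MAC address unchanged: flap the link so the guest renews its DHCP lease and routes
virsh domif-setlink my-vm vnet0 down
virsh domif-setlink my-vm vnet0 up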

Contributor

What's the criteria for this feature to graduate from a feature flag to being default enabled?

Member

What's the criteria for this feature to graduate from a feature flag to being default enabled?

It would need to function for the majority of guest OSes and not cause undesirable behavior for existing installs. I'm unsure whether this will reach a point where we can make it on by default, but it's a possibility.


## Motivation
Masquerade and slirp are not so performant compared to bridge mode.
We want to use the lower-latency modes while keeping the ability to live-migrate the virtual machines.
Member

I would be interested to understand the motivation to have a disruptive migration.
If such a migration can cause up to a minute of disruption, aren't there other means to recover a VM on a different node? I was under the impression that live migration with high connectivity disruption can cause more harm than good (applications keep running while the disruption occurs, vs. the whole VM freezing and being restored).

Member Author

If such a migration can cause up to a minute of disruption, aren't there other means to recover a VM on a different node?

From my tests this happens within a few seconds, not minutes, so even TCP sessions do not have time to break.

Member

How do TCP connections survive an IP address change?

Member Author

Some CNIs support specifying the IP address per pod. In that case the IP address is not changed, but the VM still needs to renew its DHCP lease to update the routes inside the VM.

Member

They won't survive. They would not survive even a link flicker; those TCP connections will just reset.
But this is the case with masquerade as well, so it is not something new.
I was more worried about the duration of the downtime (e.g. in the attach/detach case).

Member Author

@kvaps kvaps Sep 13, 2022

Well, I just performed a test. I used the following commands:

server vm:

mkdir cgi-bin
printf '%s\n' '#!/bin/sh' 'cat /dev/urandom' > cgi-bin/1.sh
chmod +x cgi-bin/1.sh
python -m http.server --cgi

client vm:

curl 10.10.10.1:8000/cgi-bin/1.sh > /dev/null

Now I can confirm that the TCP connections survive when the IP address is not changed.

I also checked a flood ping in both cases, when the MAC address is changed and when it is not. Results:

When the MAC address is the same (link down and link up the interface):

47 packets lost
[fedora@vm1 ~]$ sudo ping -f 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
...............................................
--- 10.10.10.2 ping statistics ---
72544 packets transmitted, 72497 received, 0.0647883% packet loss, time 14943ms
rtt min/avg/max/mdev = 0.069/0.117/8.138/0.160 ms, ipg/ewma 0.205/0.106 ms

When the MAC address is changed (reattach the network card):

54 packets lost
[fedora@vm1 ~]$ sudo ping -f 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
......................................................
--- 10.10.10.2 ping statistics ---
102443 packets transmitted, 102389 received, 0.0527122% packet loss, time 20785ms
rtt min/avg/max/mdev = 0.069/0.116/39.293/0.222 ms, pipe 4, ipg/ewma 0.202/0.355 ms

- live-migrate client VM
- Run ping from client to server

# Implementation Phases
Member

There is the topic of maintaining this special flow.
Unfortunately, our code is not well suited to introducing a new option while keeping it isolated and centralized. When the additions are scattered across many areas, it becomes harder to maintain.

It may well be worth sig-network investing in maintaining it if it has enough traction, interest and, in the end, real usage.
I think it is worth raising these points now, so we are clear that even if this feature is accepted as Alpha, its existence in the long run depends on adoption and usage. I guess this is similar to the Kubernetes process, but we should have such a process as well, especially for controversial features like this.

Comment on lines +17 to +22
## User Stories
* As a user / admin, I want the ability to live-migrate a VM with a bridged pod network.
Member

I'd like to have some user stories that focus on why we'd want to live migrate with bridge mode, given the limitations this functionality imposes on applications.

So, for example, what kinds of applications and scenarios tolerate this kind of live migration, where the IP and MAC of a VM change at runtime?

Contributor

I would actually be more interested in what apps / scenarios do not tolerate this kind of live migration.

Member Author

This may affect some applications that bind to a specific interface inside the VM.

Isn't that enough? Or should I describe cases with specific applications (e.g. an apache2 server configured to bind to the pod IP instead of 0.0.0.0, and so on)?
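To make that example concrete, a hypothetical Apache snippet (illustrative only, not from the proposal): a server whose Listen directive pins the current pod IP stops serving after that IP changes, while a wildcard bind keeps working:

# httpd.conf fragment (hypothetical addresses)
Listen 10.10.10.5:80   # breaks if the pod IP changes during migration
# Listen 80            # wildcard bind; unaffected by an IP change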

Comment on lines +45 to +54
- Create two VMs: client and server in bridge mode
- Wait for launch
- Get IP address for server VM
- Run ping from client to server
- live-migrate client VM
- Run ping from client to server
Member

Like I mentioned in the PR, I'm curious how pinging the same server IP after the live migration works. My expectation was that after the migration the VMI responds on a new IP address.

Contributor

Likewise. I share your expectation, and I think all established TCP connections will go down.

Member Author

@kvaps kvaps Jul 27, 2022

This is the same approach we have in the macvtap-cni tests. It's not the same ping command, but two different ones:

e.g.:

  • Run ping from client to server (client has an IP 10.10.10.1, server has IP 10.10.10.20, server responding to 10.10.10.1)
  • live-migrate client VM (client has an IP 10.10.10.2, server has IP 10.10.10.20, ping is not running)
  • Run ping from client to server (client has an IP 10.10.10.2, server has IP 10.10.10.20, server responding to 10.10.10.2)

Member

  • The server is not the one migrated, it is the client (line 49). So initiating a ping from the client should work fine.
  • The test is on ICMP packets, not TCP.
  • This is not much different from the masquerade binding tests. In masquerade binding, the visible IP also changes, so this is not something new.
  • TCP connections are dropped/reset in masquerade binding as well; our existing migration is not smooth either.

I would add TCP/UDP tests here as well, including ones with a Service (to show that even if the IP changes, access through the Service eventually recovers); see the sketch after this list.
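A minimal sketch of that Service-based check, assuming the server VMI template carries an app: server-vm label (label, name, and port are placeholders; the port matches the python http.server test above):

apiVersion: v1
kind: Service
metadata:
  name: server-vm
spec:
  selector:
    app: server-vm      # assumed label on the server VMI pod template
  ports:
    - name: http
      port: 8000
      targetPort: 8000  # port used by python -m http.server in the test above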


@AlonaKaplan
Member

Hi @kvaps thank you very much for the proposal and the PR. The idea generally looks good.
We're currently working on moving the binding mechanism outside of kubevirt and implementing each binding as a separate CNI.
This will allow easily integrating/updating new/existing bindings.
Until this change is introduced, we try to avoid introducing dramatic changes to the existing bindings.

@maiqueb is currently working on it and we will update this PR once moving the bridge binding to a CNI is ready.

@maiqueb
Contributor

maiqueb commented Jul 27, 2022

Hi @kvaps thank you very much for the proposal and the PR. The idea generally looks good. We're currently working on moving the binding mechanism outside of kubevirt and implementing each binding as a separate CNI. This will allow easily integrating/updating new/existing bindings. Until this change is introduced, we try to avoid introducing dramatic changes to the existing bindings.

@maiqueb is currently working on it and we will update this PR once moving the bridge binding to a CNI is ready.

I want to make it clear that you should adjust your expectations: I started to look into this very recently, and currently can't provide a timeline for it. I'll CC you both on the design document once I have a plan set up for it. Unfortunately, this is not my current primary focus, meaning it will take some time.

The motivation behind it is to enable you (and anyone interested in providing another binding type / custom binding behavior) to do so and experiment with it, while keeping the maintainability of the KubeVirt project manageable.

I would welcome any sort of input / collaboration for this task, going forward @kvaps .

@kvaps
Member Author

kvaps commented Jul 27, 2022

@AlonaKaplan @maiqueb this PR does not bring any new binding modes to KubeVirt; it just allows live migration when you use bridge binding.
Maybe you're talking about my other PR kubevirt/kubevirt#7648, or do you mean both of them?


## Motivation
Masquerade and slirp are not so performant compared to bridge and macvtap modes.
Masquerade and slirp are not so performant compared to bridge mode.
Member

To be honest, I never thought about performance as a reason to use bridge binding on the primary network.
If you need better performance, why not use a secondary network?
Can you please elaborate on the use case?

Member Author

@kvaps kvaps Jul 27, 2022

We have no other option to expose a VM as a Kubernetes Service.
We still need to communicate with the other pods in the cluster and reuse cloud load balancers or MetalLB.
We are forced to use the pod network whenever we want to communicate with Kubernetes.

Pod networking also provides unified security policies that just work out of the box.

Member

@AlonaKaplan AlonaKaplan Jul 27, 2022

Did you try using masquerade with multiqueue?
These are the results of a comparison we did in the past (masquerade binding, 8 vCPUs, 8 queues):

Masquerade overhead is minimized with multiqueue.
Masquerade vs. bridge binding
TCP Stream: -37% (nomq) -13% (mq)
UDP Stream:  -2% (nomq) -6% (mq)
TCP Latency: -24% (nomq) -3% (mq)
UDP Latency: -6% (nomq) -2% (mq)
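For reference, a minimal sketch of the multiqueue setting being referred to, using the standard KubeVirt field (VM name and sizes are placeholders):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: perf-test-vm
spec:
  template:
    spec:
      domain:
        cpu:
          cores: 8
        devices:
          networkInterfaceMultiqueue: true   # one virtio queue pair per vCPU
          interfaces:
            - name: default
              masquerade: {}
      networks:
        - name: default
          pod: {}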

Member Author

@AlonaKaplan which tool did you use to benchmark this?

@AlonaKaplan
Member

@AlonaKaplan @maiqueb this PR does not bring any new binding modes to KubeVirt; it just allows live migration when you use bridge binding. Maybe you're talking about my other PR kubevirt/kubevirt#7648, or do you mean both of them?

This and the other PR made us think about it, and we realized there is a need to allow easily adding new bindings and extending existing ones.

Bridge binding on the primary network is a legacy binding. We still support it and fix bugs in it, but we try to avoid adding new functionality to it.
For example, IPv6 support was never added to it.
We encourage users to use masquerade as the primary binding. If better performance is needed, masquerade can be used with multiqueue.

However, once each binding is implemented as a separate CNI, it will be possible to add whatever binding CNI you need, or extend an existing one, without needing approval from the KubeVirt team.

@AlonaKaplan
Member

@kvaps throwing out another idea that you may use until the CNI binding mechanism is introduced.
We have a hook point for additional setup on the target pod.
You can add the plug/unplug/link-down/link-up code to a hook sidecar, roughly as sketched below.
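As a hedged sketch of that suggestion (the sidecar image is hypothetical, and the Sidecar feature gate must be enabled): hook sidecars are declared through an annotation on the VMI template, and the sidecar container then implements the hook callbacks. As noted further down in the thread, KubeVirt currently blocks such migrations anyway, so this only illustrates the hook mechanism itself:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: bridged-vm
spec:
  template:
    metadata:
      annotations:
        # hypothetical image implementing the NIC reattach / link-flap logic
        hooks.kubevirt.io/hookSidecars: '[{"image": "registry.example.com/nic-reattach-hook:latest"}]'
    spec:
      domain:
        devices:
          interfaces:
            - name: default
              bridge: {}
      networks:
        - name: default
          pod: {}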

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2022
@rthallisey
Contributor

/remove-lifecycle stale

@kvaps I think this is close. Do you plan on doing another revision?

@kubevirt-bot kubevirt-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@kvaps
Member Author

kvaps commented Dec 27, 2022

This patch works nicely in our fork.

It can't be implemented using hooks, as KubeVirt blocks all migrations with bridge binding on the pod network.

So I'm waiting for the plug-in interface to implement this as an external plugin for KubeVirt.

The design of this feature is not going to change anyway, so I'd like to keep this proposal open.

Co-authored-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Co-authored-by: Ryan Hallisey <rhallisey@nvidia.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps force-pushed the live-migration-for-bridged-pod-network branch from eb0d5dd to f8fc617 Compare March 2, 2023 08:49
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

Once this PR has been reviewed and has the lgtm label, please assign davidvossel for approval by writing /assign @davidvossel in a comment. For more information see the Kubernetes Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xagent003

I would like this too... we are using Calico with a dedicated IPPool for VMs and assigning a fixed IP from it via a Calico annotation. So the VM has the same IP as the virt-launcher pod, and this IP stays the same even if the pod is recreated on a new node.

We don't want masquerade because of the extra NAT, even between two VMs talking on the same network, and because the VMs won't see their actual IP configured inside.

With Multus, an L2 model, DHCP, and kubemacpool, this is fine for some cases, but it doesn't let us scale network creation, since it requires us to tag infra and add physical gateways for each L2 network. We tried kube-ovn to allow virtual networks, but hit many issues; apparently it doesn't work well and isn't tested as a secondary CNI nor with KubeVirt. Multus networks also aren't first-class K8s citizens, and there are no real firewalling/security-group features.

So we'd really like to use a primary CNI like Calico for its feature set and L3 networking, use a fixed IP for VMs, and have it in bridge mode so the VM doesn't get a private IP and extra NAT. Why would live migration be an issue in this case?
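For context, a hedged sketch of the setup being described (pool name and address are placeholders): the dedicated pool and fixed IP are requested through Calico's standard annotations on the VMI pod template, so the virt-launcher pod keeps the same address wherever it is scheduled:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: fixed-ip-vm
spec:
  template:
    metadata:
      annotations:
        cni.projectcalico.org/ipv4pools: '["vm-ippool"]'   # dedicated IPPool for VMs
        cni.projectcalico.org/ipAddrs: '["10.123.0.15"]'   # fixed IP, kept across rescheduling
    spec:
      domain:
        devices:
          interfaces:
            - name: default
              bridge: {}
      networks:
        - name: default
          pod: {}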

@xuzhenglun

xuzhenglun commented Apr 18, 2023

Does anybody know why a pod network with bridge mode would be a problem, or why Multus with bridge works fine? What if I use the Multus default interface with bridge mode?

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/os: linux
  name: test-vm
  namespace: kubevirt
spec:
  runStrategy: Manual
  template:
    spec:
      domain:
        cpu:
          cores: 2
        devices:
          disks:
          - disk: {}
            name: disk0
          interfaces:
          - bridge: {}
            name: default
        machine:
          type: q35
        resources:
          requests:
            memory: 8192M
      networks:
      - multus:
          default: true                   # apply the Multus interface to the default eth0
          networkName: calico-default     # underlay CNI, which is used by the pod network by default
        name: default
...

In the current KubeVirt code, the usage above is allowed to live-migrate.

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2023
@kvaps
Member Author

kvaps commented Jul 26, 2023

I migrated my fork from kvaps/community to deckhouse/3p-kubevirt-community; it will now be supported by the Deckhouse project.
FYI

@kvaps
Member Author

kvaps commented Jul 26, 2023

/remove-lifecycle stale

@kubevirt-bot kubevirt-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2023
@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2023
@rthallisey
Contributor

/remove-lifecycle stale

@kubevirt-bot kubevirt-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2023
@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@aburdenthehand
Contributor

@kvaps Heya - now that plugins have landed, what's the state of this proposal, and do you need something to get it over the line?

@kvaps
Member Author

kvaps commented Feb 21, 2024

Hi @aburdenthehand, I think this approach can be implemented using the new pluggable API, but it will not work with the default in-tree plugins or third-party out-of-tree plugins.

Right now I have no sponsors interested in this feature, so I think it can be closed.
But before that, I'd like to pose a question to the community: do you need this functionality?

@kvaps
Member Author

kvaps commented Mar 13, 2024

Okay, I'll close it for now.
If someone is interested in this feature, just leave a comment here.

/close

@kubevirt-bot

@kvaps: Closed this PR.

In response to this:

Okay, I'll close it for now.
If someone is interested in this feature, just leave a comment here.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
