
masquerade, migration: hardcode the bridge MAC addr #5000

Conversation

@maiqueb (Contributor) commented Feb 10, 2021

What this PR does / why we need it:
Since the bridge MAC address no longer changes, the ARP tables do
not need to stabilize, which preserves connectivity from the
outside world to the guest.

This is exercised by the tests in this commit, since the Eventually block in the
vmi_networking_test for masquerade binding migration can now be safely
removed.

Reserve the hardcoded MAC address when masquerade binding is used; accordingly, throw an error whenever a user attempts to
specify that MAC on a masquerade interface.
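
Conceptually, the reservation is just a well-known constant plus a lookup the create-VMI admitter can call. A minimal sketch of the idea (the names `staticMasqueradeBridgeMAC` and `IsReserved` mirror the review discussion below; this is not the PR's exact code):

```go
package network

import "strings"

// staticMasqueradeBridgeMAC is the lowest locally administered unicast MAC
// address; it is assigned to the in-pod bridge used by masquerade binding.
const staticMasqueradeBridgeMAC = "02:00:00:00:00:00"

// IsReserved reports whether the given MAC collides with the reserved bridge
// address, so the admission webhook can reject it on a masquerade interface.
func IsReserved(macAddress string) bool {
	return strings.EqualFold(macAddress, staticMasqueradeBridgeMAC)
}
```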

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4999

Special notes for your reviewer:

Release note:

NONE

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/S labels Feb 10, 2021
@maiqueb (Contributor, author) commented Feb 10, 2021

/assign @EdDev

@maiqueb maiqueb force-pushed the lost-ip-connectivity-after-migration-masquerade-binding branch from 6960209 to 8eb934a on February 10, 2021 12:16
@maiqueb maiqueb changed the title masquerade, ipv6m, migration: hardcode the MAC addr masquerade, migration: hardcode the bridge MAC addr Feb 10, 2021
@maiqueb (Contributor, author) commented Feb 10, 2021

/retest

@EdDev (Member) left a comment

I like the simplicity, but one point comes up: what prevents this static MAC address from being used by:

  • The pod interface (eth0).
  • A secondary interface CNI (multus).
  • The VM/VMI user on the interface spec.
  • The KMP.

Comment on lines 49 to 52
// the hardcoded MAC must be as low as possible, to prevent libvirt from
// assigning a lower MAC to a tap device attached to the bridge - which
// would trigger the bridge's MAC to update. This also applies for the
// dummy connected to the bridge on masquerade binding.
Member:
I am unable to recreate this on my 3.10.0-1160.6.1.el7.x86_64 kernel.
Once I set the bridge iface MAC, it is not changed if I attach a dummy.
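
For reference, this is roughly what setting the bridge MAC explicitly looks like with netlink; once a MAC is set explicitly, the kernel stops deriving the bridge address from the lowest-addressed attached port. The bridge name `k6t-eth0` and the code itself are illustrative assumptions, not the PR's implementation:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// pinBridgeMAC assigns an explicit MAC to the in-pod bridge. After an explicit
// assignment, attaching tap/dummy ports with lower addresses no longer changes
// the bridge MAC.
func pinBridgeMAC(bridgeName, mac string) error {
	hwAddr, err := net.ParseMAC(mac)
	if err != nil {
		return err
	}
	link, err := netlink.LinkByName(bridgeName)
	if err != nil {
		return err
	}
	return netlink.LinkSetHardwareAddr(link, hwAddr)
}

func main() {
	// "k6t-eth0" is only an example bridge name.
	if err := pinBridgeMAC("k6t-eth0", "02:00:00:00:00:00"); err != nil {
		log.Fatal(err)
	}
}
```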

maiqueb (author):

@AlonaKaplan do you remember the reference we had for this?

// assigning a lower MAC to a tap device attached to the bridge - which
// would trigger the bridge's MAC to update. This also applies for the
// dummy connected to the bridge on masquerade binding.
const hardcodedMasqueradeMAC = "02:00:00:00:00:00"
Member:

How about `staticMasqueradeBridgeMAC`?

maiqueb (author):

Can do.

maiqueb (author):

Done.

@maiqueb (Contributor, author) commented Feb 12, 2021

> I like the simplicity, but one point comes up: what prevents this static MAC address from being used by:
>
>   • The pod interface (eth0).
>   • A secondary interface CNI (multus).
>   • The VM/VMI user on the interface spec.
>   • The KMP.

Nothing. In any of those.

@maiqueb (Contributor, author) commented Feb 19, 2021

@EdDev I'd like to point out that (unless my math is seriously wrong) this PR leaves exactly (2/(2^40))*100 % of the scenarios unaddressed if multus is not involved, and ((2+M)/2^40)*100 % for M multus interfaces.

I.e. this solves a real problem for an insanely high percentage of possibilities. Yes, it does have a couple of corner cases. I think we could document this limitation (not sure where / how).

In the current refactoring of the pod interfaces I'm planning, I intend to propose a way to also cache the bridge interface between the handler / launcher processes, which would allow us to revert this, and use whatever generated MAC the kernel happily assigned for us.

2^40 => number of possible MAC addresses if we force the first octet to be 02
2 + M => number of 'collision domains' (pod interface + VM interface + M multus interfaces). I'm leaving KubeMacPool out of this for now, since I'm too ignorant to model it.
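
Spelled out, for M = 0 that fraction works out to:

$$\frac{2+M}{2^{40}} \times 100\,\% \;\Big|_{M=0} \;=\; \frac{2}{2^{40}} \times 100\,\% \;\approx\; 1.8 \times 10^{-10}\,\%$$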

@RamLavi : one question: does kubemacpool allow you to 'black-list' MAC addresses? I.e. "dear kubemacpool; whatever you do, do not assign this particular MAC". This black list would preferably be 'scoped' per interface.

@maiqueb (Contributor, author) commented Feb 19, 2021

> @RamLavi : one question: does kubemacpool allow you to 'black-list' MAC addresses? I.e. "dear kubemacpool; whatever you do, do not assign this particular MAC". This black list would preferably be 'scoped' per interface.

If I understand correctly, in kubemacpool you have to specify the prefix from which you (as in KMP) are assigning MACs. It would then be a matter of using whatever prefix does not start with 02:00:00. I think that is a reasonable assumption.

@RamLavi (Contributor) commented Feb 21, 2021

> @RamLavi : one question: does kubemacpool allow you to 'black-list' MAC addresses? I.e. "dear kubemacpool; whatever you do, do not assign this particular MAC". This black list would preferably be 'scoped' per interface.

@maiqueb, Kubemacpool assigns MACs from the pool, defined by the range set when KMP is deployed. BUT, it does allow you to set a MAC address manually, which may or may not be outside of this range. In other words, there is no "black-list" currently implemented.

@EdDev (Member) commented Feb 21, 2021

> @EdDev I'd like to point out that (unless my math is seriously wrong) this PR leaves exactly (2/(2^40))*100 % of the scenarios unaddressed if multus is not involved, and ((2+M)/2^40)*100 % for M multus interfaces.

I think it is enough to consider that some engineer will just think of the same address you managed to invent for similar reasons, ending up with a bad setup and little help from the system to understand what is going on.

Playing with probabilities can explain away concurrency bugs, leaks and other one-in-a-million issues, but I do not think it should be accepted here.

> I.e. this solves a real problem for an insanely high percentage of possibilities. Yes, it does have a couple of corner cases. I think we could document this limitation (not sure where / how).

When the poor operator and/or support engineer has to debug this, I think we will have a hard time reasoning about it.

The moment a static MAC address is introduced and reserved, it makes sense to add a validation check to avoid MAC duplication.
Especially for cases where the MAC can be explicitly set by the user in the spec (including KMP).
In the context of this change, I think the minimum is to assert that no one can set this address on an interface.

@maiqueb (Contributor, author) commented Feb 22, 2021

> @EdDev I'd like to point out that (unless my math is seriously wrong) this PR leaves exactly (2/(2^40))*100 % of the scenarios unaddressed if multus is not involved, and ((2+M)/2^40)*100 % for M multus interfaces.
>
> I think it is enough to consider that some engineer will just think of the same address you managed to invent for similar reasons, ending up with a bad setup and little help from the system to understand what is going on.
>
> Playing with probabilities can explain away concurrency bugs, leaks and other one-in-a-million issues, but I do not think it should be accepted here.
>
> I.e. this solves a real problem for an insanely high percentage of possibilities. Yes, it does have a couple of corner cases. I think we could document this limitation (not sure where / how).
>
> When the poor operator and/or support engineer has to debug this, I think we will have a hard time reasoning about it.
>
> The moment a static MAC address is introduced and reserved, it makes sense to add a validation check to avoid MAC duplication.

Makes sense.

> Especially for cases where the MAC can be explicitly set by the user in the spec (including KMP).
> In the context of this change, I think the minimum is to assert that no one can set this address on an interface.

@RamLavi (Contributor) left a comment

I'm afraid this may collide with Kubemacpool.
02:00:00:00:00:00 is inside KMP's default range (the first one, actually), and although it is advised to change the range prefix to be random, we can't assume people will change that.

I think every MAC reservation should also be supported on KMP's side; otherwise we might end up with this MAC colliding with the MAC on the virt-launcher pod bridge.

@maiqueb (Contributor, author) commented Feb 25, 2021

> I'm afraid this may collide with Kubemacpool.
> 02:00:00:00:00:00 is inside KMP's default range (the first one, actually), and although it is advised to change the range prefix to be random, we can't assume people will change that.
>
> I think every MAC reservation should also be supported on KMP's side; otherwise we might end up with this MAC colliding with the MAC on the virt-launcher pod bridge.

That's half the problem. How will KMP play along with anything that requires a reserved MAC address? My short understanding is: it doesn't. And that includes whatever the users have running on their VMs.

Furthermore, I'm unsure if we can assume KMP is a dependency for kubevirt. AFAIU, it is not.

@phoracek would you shed some light on this? How can anything with a specific MAC address requirement (for whatever weird reason) co-exist with KMP? What's KubeVirt's take on this?

@EdDev (Member) commented Mar 4, 2021

/sig network

@kubevirt-bot kubevirt-bot added sig/network needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 4, 2021
@phoracek (Member) commented Mar 8, 2021

> I'm afraid this may collide with Kubemacpool.
> 02:00:00:00:00:00 is inside KMP's default range (the first one, actually), and although it is advised to change the range prefix to be random, we can't assume people will change that.
> I think every MAC reservation should also be supported on KMP's side; otherwise we might end up with this MAC colliding with the MAC on the virt-launcher pod bridge.
>
> That's half the problem. How will KMP play along with anything that requires a reserved MAC address? My short understanding is: it doesn't. And that includes whatever the users have running on their VMs.
>
> Furthermore, I'm unsure if we can assume KMP is a dependency for kubevirt. AFAIU, it is not.

Correct.

> @phoracek would you shed some light on this? How can anything with a specific MAC address requirement (for whatever weird reason) co-exist with KMP? What's KubeVirt's take on this?

I believe that if the MAC is from the managed pool, we would either reject it (if taken) or register it as taken. If it is outside the pool, we don't care.

@RamLavi please keep me honest here.

@maiqueb (Contributor, author) commented Apr 14, 2021

@RamLavi / @phoracek / @EdDev let's restart this discussion; the problem won't go away on its own.

AFAIU, we either need to implement a mechanism to force the bridge MAC to persist across migration / reboots, or hard-code it, and be done with the issue. If we hard-code, we'd need a mechanism in place to make sure that MAC is not used by anything on the list in #5000 (review).

It would require kubemacpool to have an exception list of sorts (I think it is far from a bad idea having it anyway).

I would rather hard-code it, since it is a lot simpler.

Is there any alternative I am not seeing?

@EdDev (Member) commented Apr 14, 2021

> @RamLavi / @phoracek / @EdDev let's restart this discussion; the problem won't go away on its own.
>
> AFAIU, we either need to implement a mechanism to force the bridge MAC to persist across migration / reboots, or hard-code it, and be done with the issue. If we hard-code, we'd need a mechanism in place to make sure that MAC is not used by anything on the list in #5000 (review).
>
> It would require kubemacpool to have an exception list of sorts (I think it is far from a bad idea having it anyway).
>
> I would rather hard-code it, since it is a lot simpler.
>
> Is there any alternative I am not seeing?

I would suggest defining a reserved range and placing it in a configmap (on the kubevirt CR?).
Then whoever needs to sync on what this range is will read it from there (kubevirt, KMP, kubevirtci services, etc.).
It also has the advantage that an operator can change this range if it somehow collides with some network devices in their network.
What do you think?

Then it is a matter of just using a MAC in this range in our code (in our case, just the first address in the pool).

But I am also OK with starting with a single reserved MAC address in kubevirt. If anyone, including KMP, tries to allocate it, kubevirt should reject it.

@@ -1450,6 +1450,28 @@ var _ = Describe("Validating VMICreate Admitter", func() {
Expect(len(causes)).To(Equal(1))
Expect(causes[0].Field).To(Equal("fake.domain.devices.interfaces[0].name"))
})
It("should reject a masquerade interface with a specified MAC address which is reserved by the BindMechanism", func() {
vm := v1.NewMinimalVMI("testvm")
Contributor:

this is a vmi, not a vm

Suggested change
vm := v1.NewMinimalVMI("testvm")
vmi := v1.NewMinimalVMI("testvmi")

maiqueb (author):

Done.

@RamLavi (Contributor) commented Apr 20, 2021

> I'm afraid this may collide with Kubemacpool.
> 02:00:00:00:00:00 is inside KMP's default range (the first one, actually), and although it is advised to change the range prefix to be random, we can't assume people will change that.
> I think every MAC reservation should also be supported on KMP's side; otherwise we might end up with this MAC colliding with the MAC on the virt-launcher pod bridge.
>
> That's half the problem. How will KMP play along with anything that requires a reserved MAC address? My short understanding is: it doesn't. And that includes whatever the users have running on their VMs.
> Furthermore, I'm unsure if we can assume KMP is a dependency for kubevirt. AFAIU, it is not.
>
> Correct.
>
> @phoracek would you shed some light on this? How can anything with a specific MAC address requirement (for whatever weird reason) co-exist with KMP? What's KubeVirt's take on this?
>
> I believe that if the MAC is from the managed pool, we would either reject it (if taken) or register it as taken. If it is outside the pool, we don't care.
>
> @RamLavi please keep me honest here.

Well, since KMP only monitors the VM, and the proposed change seems to hardcode the MAC on the VMI, I fear that KMP won't catch/reject/register it.

@maiqueb (Contributor, author) commented May 5, 2021

> @RamLavi / @phoracek / @EdDev let's restart this discussion; the problem won't go away on its own.
> AFAIU, we either need to implement a mechanism to force the bridge MAC to persist across migration / reboots, or hard-code it, and be done with the issue. If we hard-code, we'd need a mechanism in place to make sure that MAC is not used by anything on the list in #5000 (review).
> It would require kubemacpool to have an exception list of sorts (I think it is far from a bad idea having it anyway).
> I would rather hard-code it, since it is a lot simpler.
> Is there any alternative I am not seeing?
>
> I would suggest defining a reserved range and placing it in a configmap (on the kubevirt CR?).
> Then whoever needs to sync on what this range is will read it from there (kubevirt, KMP, kubevirtci services, etc.).
> It also has the advantage that an operator can change this range if it somehow collides with some network devices in their network.
> What do you think?
>
> Then it is a matter of just using a MAC in this range in our code (in our case, just the first address in the pool).
>
> But I am also OK with starting with a single reserved MAC address in kubevirt. If anyone, including KMP, tries to allocate it, kubevirt should reject it.

The current proposal does this; we can build on it and afterwards introduce a range of reserved MACs on the kubevirt CR.

@maiqueb (Contributor, author) commented May 6, 2021

/retest

@oshoval (Contributor) left a comment

Thanks, please see inside

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2021
@maiqueb maiqueb force-pushed the lost-ip-connectivity-after-migration-masquerade-binding branch from afe3bb4 to 78f831b on May 7, 2021 10:58
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2021
pkg/network/reserved_macs.go (outdated review thread, resolved)

return nil
}, 120*time.Second).Should(Succeed())
Expect(ping(podIP)).To(Succeed())
Member:

Is it possible to check the MAC address of the bridge on the virt-launcher pod?
It will be ugly (because we do not assert on the API, but on some inner detail), but depending on this ping timing is unsafe.

maiqueb (author), May 12, 2021:

I don't follow the meaning of unsafe here; the point of preserving the MAC between migrations is to remove the need for the ARP tables to stabilize, which should preserve connectivity from the outside world to the guest, once the VM is considered running.

Is there something I am missing here?

As a side note, yes, it is possible to check the MAC of the bridge. And yes, it is ugly. And it doesn't prove anything about preserving the connectivity of the VM once migrated, which is this PR's objective.

EDIT: above I mean that "preserving the connectivity of the VM once migrated" is the issue this PR is trying to fix.

maiqueb (author), May 12, 2021:

FWIW, I'm running these 2 tests in a loop w/ untilItFails flag.

$ FUNC_TEST_ARGS='-untilItFails -race' make functest
...
Ran 2 of 813 Specs in 197.550 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 811 Skipped
PASS | FOCUSED

All tests passed...
Will keep running them until they fail.
This was attempt #19
Maybe you should stop now?
...
Ran 2 of 813 Specs in 188.358 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 811 Skipped
PASS | FOCUSED

All tests passed...
Will keep running them until they fail.
This was attempt #25
Dave, this conversation can serve no purpose anymore. Goodbye.

Member:

> Is there something I am missing here?

The ping may fail for other reasons, like the guest CPU right after the migration or MAC address duplication on the switch (the MAC appears on a different port, so traffic will not reach the right destination).
I do not think it is safe to assume this is the only reason to lose echo packets.

> As a side note, yes, it is possible to check the MAC of the bridge. And yes, it is ugly. And it doesn't prove anything about preserving the connectivity of the VM once migrated, which is this PR's objective.
>
> EDIT: above I mean that "preserving the connectivity of the VM once migrated" is the issue this PR is trying to fix.

I would use "improve" not "solve" here. I am just pointing out that there may be other factors that affect a smooth migration and connectivity.

There is also a logical problem with what you are saying: if the MAC is the same, we can be sure the ping will work fine (let's assume this is true). However, you cannot assume the other direction, right? So what protects us from someone breaking this feature eventually?

maiqueb (author):

What feature? That you migrate and immediately ping once the VM is reported as ready?

This test does.

What test are you missing? I don't think there's a need to test that we get a hard-coded MAC address from the bridge...

Member:

I am just worried that we are not testing this behavior (that the MAC is explicitly set to a hard-coded address and is expected to be the same on all similar virt-launchers).

I guess my logical claim is this: if this gets broken and the MAC address is no longer the same as we expect, then the ping may still pass from time to time, becoming flaky and leaving us wondering what happened.

If you are not worried about that, then let's proceed; not covering this explicitly is not worse than what we have today.
If we encounter this in the future, we can handle it then.

@maiqueb maiqueb force-pushed the lost-ip-connectivity-after-migration-masquerade-binding branch from 78f831b to f2e37c9 on May 12, 2021 09:22
@EdDev (Member) left a comment

I am good with this change, but some questions popped up and it will be nice to verify this before we proceed.

For the reserved MAC, I would reserve it for all interfaces in general, no matter the binding. It is easier to do so and can simplify KMP logic. If we need to reserve another MAC in the future, it will enter the same category and the logic will be kept simple.
WDYT?

@@ -258,6 +259,8 @@ func validateInterfaceNetworkBasics(field *k8sfield.Path, networkExists bool, id
causes = appendStatusCauseForSlirpNotEnabled(field, causes, idx)
} else if iface.Masquerade != nil && networkData.Pod == nil {
causes = appendStatusCauseForMasqueradeWithourPodNetwork(field, causes, idx)
} else if iface.Masquerade != nil && network.IsReserved(iface.MacAddress) {
Member:

What will happen in case we have 1 interface with masq and the 2nd with bridge binding?
Initially I thought it may be a problem, but on the other hand it is on a different LAN. Will this cause problems with the ARP resolution? (I think we need to check this)

This also raises another point which I do not know how we can mitigate: if the secondary network uses, for example, bridge binding, this MAC may already exist in the network (e.g. as the default gateway), and in such a case we may also have a problem with the ARP table (not sure if the ARP table accepts having a single MAC mapped to two different IP addresses).

Can you check this?

maiqueb (author):

> What will happen in case we have 1 interface with masq and the 2nd with bridge binding?
> Initially I thought it may be a problem, but on the other hand it is on a different LAN. Will this cause problems with the ARP resolution? (I think we need to check this)

I don't see how, since those are different L2 networks.

> This also raises another point which I do not know how we can mitigate: if the secondary network uses, for example, bridge binding, this MAC may already exist in the network (e.g. as the default gateway), and in such a case we may also have a problem with the ARP table (not sure if the ARP table accepts having a single MAC mapped to two different IP addresses).

Again, I don't see how. Furthermore, I don't see how this differs from the standard situation of having the same MAC address in different L2 networks, which I don't think is a problem.

> Can you check this?

Nevertheless, I can. Below you can find the spec I used; it uses 2 VMs, both with the masquerade pod interface and a secondary macvtap network.

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvtap
  annotations:
    k8s.v1.cni.cncf.io/resourceName: macvtap.network.kubevirt.io/eth0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvtap",
      "mtu": 1500
    }'
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vm1
spec:
  domain:
    devices:
      disks:
      - disk:
          bus: virtio
        name: rootfs
      - disk:
          bus: virtio
        name: cloudinit
      interfaces:
      - masquerade: {}
        name: default
        ports:
        - name: ssh
          port: 22
      - macvtap: {}
        name: macvtap
        macAddress: 02:00:00:00:00:00
    machine:
      type: ""
    resources:
      requests:
        memory: 1024M
  networks:
  - name: default
    pod: {}
  - multus:
      networkName: macvtap
    name: macvtap
  terminationGracePeriodSeconds: 0
  volumes:
  - containerDisk:
      image: registry:5000/kubevirt/fedora-cloud-container-disk-demo:devel
    name: rootfs
  - cloudInitNoCloud:
      userData: |
        #!/bin/bash
        echo "fedora" |passwd fedora --stdin
        dhclient eth1
    name: cloudinit
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vm2
spec:
  domain:
    devices:
      disks:
      - disk:
          bus: virtio
        name: rootfs
      - disk:
          bus: virtio
        name: cloudinit
      interfaces:
      - masquerade: {}
        name: default
        ports:
        - name: ssh
          port: 22
      - macvtap: {}
        name: macvtap
    machine:
      type: ""
    resources:
      requests:
        memory: 1024M
  networks:
  - name: default
    pod: {}
  - multus:
      networkName: macvtap
    name: macvtap
  terminationGracePeriodSeconds: 0
  volumes:
  - containerDisk:
      image: registry:5000/kubevirt/fedora-cloud-container-disk-demo:devel
    name: rootfs
  - cloudInitNoCloud:
      userData: |
        #!/bin/bash
        echo "fedora" |passwd fedora --stdin
        dhclient eth1
    name: cloudinit

You can find below the MACs / IPs for the 2 VMs, along with their neighbor tables.

VM1

$ [fedora@vm1 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:19:fa:ee brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86296654sec preferred_lft 86296654sec
    inet6 fe80::5054:ff:fe19:faee/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff   <===== same MAC the gateway of eth0 has
    altname enp2s0
    inet 192.168.66.139/24 brd 192.168.66.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::e03c:bcdd:7d90:7e45/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

# ping gateway - has 02:00:00:... mac address
[fedora@vm1 ~]$ ping 10.0.2.1
PING 10.0.2.1 (10.0.2.1) 56(84) bytes of data.
64 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=1.56 ms
64 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=0.927 ms

--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
rtt min/avg/max/mdev = 0.927/1.244/1.562/0.317 ms
[fedora@vm1 ~]$ ip n
192.168.66.131 dev eth1 lladdr 62:89:2e:cd:75:d7 STALE
10.0.2.1 dev eth0 lladdr 02:00:00:00:00:00 REACHABLE <===== as you can see here
192.168.66.2 dev eth1 lladdr 16:c1:6d:e8:62:dc STALE
fe80::b440:bff:fe6e:c1df dev eth1 lladdr 16:c1:6d:e8:62:dc router STALE

VM2

[fedora@vm2 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:3f:06:10 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86296796sec preferred_lft 86296796sec
    inet6 fe80::5054:ff:fe3f:610/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 62:89:2e:cd:75:d7 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 192.168.66.131/24 brd 192.168.66.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::dcf9:be0:d53b:d725/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

# 2 shells, one pinging vm1's IP - 192.168.66.139 - another pinging the gateway - 10.0.2.1
[fedora@vm2 ~]$ ping -q 192.168.66.139
[fedora@vm2 ~]$ ping -q 10.0.2.1

# gets us this neighbor table
[fedora@vm2 ~]$ ip n
192.168.66.2 dev eth1 lladdr 16:c1:6d:e8:62:dc STALE
192.168.66.139 dev eth1 lladdr 02:00:00:00:00:00 REACHABLE
10.0.2.1 dev eth0 lladdr 02:00:00:00:00:00 REACHABLE
fe80::b440:bff:fe6e:c1df dev eth1 lladdr 16:c1:6d:e8:62:dc router STALE

The ticks to update the neighbor state happen every BASE_REACHABLE_TIME/2, and the entries remain in the REACHABLE state for a while, between BASE_REACHABLE_TIME/2 and 3/2 * BASE_REACHABLE_TIME, as seen in the implementation.

In the system above, with a 5.6.6 kernel, the base_reachable_time for the interfaces is 30 seconds:

[fedora@vm2 ~]$ cat /proc/sys/net/ipv4/neigh/eth0/base_reachable_time
30
[fedora@vm2 ~]$ cat /proc/sys/net/ipv4/neigh/eth1/base_reachable_time
30
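
With base_reachable_time = 30 s, that window works out to:

$$\frac{30\,\mathrm{s}}{2} = 15\,\mathrm{s} \;\leq\; t_{\mathrm{REACHABLE}} \;\leq\; \frac{3}{2} \cdot 30\,\mathrm{s} = 45\,\mathrm{s}$$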

I see that the neighbors persist in the REACHABLE state for around 10-30 seconds, which, given the numbers described above, makes sense to me.

And the traffic works, which I think is the most important metric here.

Member:

Thanks!

@maiqueb (Contributor, author) commented May 13, 2021

> I am good with this change, but some questions popped up and it will be nice to verify this before we proceed.
>
> For the reserved MAC, I would reserve it for all interfaces in general, no matter the binding. It is easier to do so and can simplify KMP logic. If we need to reserve another MAC in the future, it will enter the same category and the logic will be kept simple.
> WDYT?

I personally don't see that as a reason for imposing restrictions on bindings that do not require them. As such, I would avoid doing it.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2021
@EdDev (Member) left a comment

Thank you!
This needs rebase now.

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2021
Since the bridge MAC address no longer changes, the ARP tables do
not need to stabilize, which preserves connectivity from the
outside world to the guest.

This is tested in this commit since the `Eventually` block in the
vmi_networking_test for masquerade binding migration can be safely
removed.

The lowest locally administered unicast MAC address was chosen for
the bridge.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
For now, we reserve a single hardcoded MAC address,
02:00:00:00:00:00, the lowest locally administered unicast address.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
@maiqueb maiqueb force-pushed the lost-ip-connectivity-after-migration-masquerade-binding branch from f2e37c9 to d55538d on May 17, 2021 09:30
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 17, 2021
@kubevirt-bot commented May 17, 2021

@maiqueb: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubevirt-e2e-k8s-1.17 | c69ac25 | link | /test pull-kubevirt-e2e-k8s-1.17 |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@maiqueb (Contributor, author) commented May 17, 2021

/retest

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label May 17, 2021
@AlonaKaplan (Member) commented

/approve

@kubevirt-bot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlonaKaplan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 20, 2021
@kubevirt-bot kubevirt-bot merged commit d3c5f2c into kubevirt:master May 20, 2021
@maiqueb maiqueb deleted the lost-ip-connectivity-after-migration-masquerade-binding branch May 20, 2021 12:23
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/network size/M
Development

Successfully merging this pull request may close these issues.

masquerade binding: connectivity lost after VM migration
7 participants