masquerade, migration: hardcode the bridge MAC addr #5000
Conversation
/assign @EdDev
6960209 → 8eb934a (force-push)
/retest
I like the simplicity, but then one point comes up: What reserves this static MAC address from being used by:
- The pod interface (eth0).
- A secondary interface CNI (multus).
- The VM/VMI user on the interface spec.
- The KMP.
// the hardcoded MAC must be as low as possible, to prevent libvirt from
// assigning a lower MAC to a tap device attached to the bridge - which
// would trigger the bridge's MAC to update. This also applies for the
// dummy connected to the bridge on masquerade binding.
I am unable to recreate this on my 3.10.0-1160.6.1.el7.x86_64 kernel.
Once I set the bridge iface MAC, it is not changed if I attach a dummy.
@AlonaKaplan do you remember the reference we had for this ?
// assigning a lower MAC to a tap device attached to the bridge - which
// would trigger the bridge's MAC to update. This also applies for the
// dummy connected to the bridge on masquerade binding.
const hardcodedMasqueradeMAC = "02:00:00:00:00:00"
How about staticMasqueradeBridgeMAC?
Can do.
Done.
Nothing. In any of those.
@EdDev I'd like to point out that (unless my math is seriously wrong) this PR addresses exactly (2/(2^40))*100 % of the scenarios if multus is not involved, and ((2+M)/2^40)*100 % of the scenarios for M multus interfaces. I.e. this solves a real problem for an insanely high percentage of possibilities. Yes, it does have a couple of corner cases. I think we could document this limitation (not sure where / how).

In the current refactoring of the pod interfaces I'm planning, I intend to propose a way to also cache the bridge interface between the handler / launcher processes, which would allow us to revert this, and use whatever generated MAC the kernel happily assigned for us.

2^40 => number of possible MAC addresses if we force the first octet to be

@RamLavi : one question: does kubemacpool allow you to 'black-list' mac addresses ? I.e. "dear kubemacpool; whatever you do, do not assign this particular MAC". This black list would preferably be 'scoped' per interface.
If I understand correctly, in kubemacpool you have to specify the prefix from which you're (you as in KMP) assigning MACs. It would be a matter of using whatever prefix does not start w/
@maiqueb, Kubemacpool assigns MACs from the pool, defined by the range set when KMP is deployed. BUT, it does allow you to set a MAC Address manually, that may/may not be outside of this range. In other words, there is no "black-list" currently implemented.
I think it is enough to consider that some engineer will just think of the same address you managed to invent, for similar reasons, ending up in a bad setup with little help from the system to understand what is going on. Playing with probabilities can excuse concurrency bugs, leaks and other one-in-a-million issues, but I do not think it should be accepted here.
When the poor operator and/or support engineer has to debug this, I think we will have a hard time reasoning about it. The moment a static MAC address is introduced and reserved, it makes sense to add a validation check to avoid MAC duplication.
Makes sense.
I'm afraid this may collide with Kubemacpool.
02:00:00:00:00:00 is inside KMP's default range (the first one actually), and although it is advised to change the range prefix to be random, we can't assume ppl will change that.
I think every mac reservation should also be supported on KMP's side, otherwise we might end up with this mac colliding with the mac on the virt-launcher pod bridge.
That's half the problem. How will KMP play along with anything that requires a reserved mac address ? My short understanding is: it doesn't. And that includes whatever it is the users have running on their VMs. Furthermore, I'm unsure if we can assume KMP is a dependency for kubevirt. AFAIU, it is not. @phoracek would you shed some light into this ? How can anything with a specific mac address requirement (for whatever weird reason) co-exist with KMP ? What's KubeVirt's take on this ?
/sig network
Correct.
I believe that if the MAC is from the managed pool, we would either reject it (if taken) or register it as taken. If it is outside the pool, we don't care. @RamLavi please keep me honest here.
@RamLavi / @phoracek / @EdDev let's re-start this discussion; the problem won't go away on its own. AFAIU, we either need to implement a mechanism to force the bridge MAC to persist across migration / reboots, or hard-code it, and be done with the issue. If we hard-code, we'd need a mechanism in place to make sure that MAC is not used by the list in #5000 (review). It would require kubemacpool to have an exception list of sorts (I think it is far from a bad idea having it anyway). I would rather hard-code it, since it is a lot simpler. Is there any alternative I am not seeing ?
I would suggest defining a reserved range and placing it in a configmap (on the kubevirt CR?). Then it is a matter of just using a mac in this range in our code (in our case, just the first address in the pool). But I am also ok with starting with a single reserved MAC address in kubevirt. If anyone, including KMP, tries to allocate it, kubevirt should reject it.
@@ -1450,6 +1450,28 @@ var _ = Describe("Validating VMICreate Admitter", func() {
	Expect(len(causes)).To(Equal(1))
	Expect(causes[0].Field).To(Equal("fake.domain.devices.interfaces[0].name"))
})
It("should reject a masquerade interface with a specified MAC address which is reserved by the BindMechanism", func() { | |||
vm := v1.NewMinimalVMI("testvm") |
this is a vmi, not a vm
vm := v1.NewMinimalVMI("testvm")
vmi := v1.NewMinimalVMI("testvmi")
Done.
Well, since KMP only monitors the vm and the change proposed seems to hardcode the mac to the vmi, I fear that KMP won't catch/reject/register it.
The current proposal does this; we can then build on it, and afterwards introduce a range of reserved MACs on the kubevirt CR.
/retest
Thanks, please see inside
pkg/virt-api/webhooks/validating-webhook/admitters/vmi-create-admitter_test.go
afe3bb4 → 78f831b (force-push)
	return nil
}, 120*time.Second).Should(Succeed())
Expect(ping(podIP)).To(Succeed())
Is it possible to check the mac address of the bridge on the virt-launcher pod?
It will be ugly (because we do not assert the API, but some inner detail), but depending on this ping timing is unsafe.
I don't follow the meaning of unsafe here; the point of preserving the MAC between migrations is to remove the need for the ARP tables to stabilize, which should preserve connectivity from the outside world to the guest, once the VM is considered running.
Is there something I am missing here ?
As a side note, yes, it is possible to check the mac of the bridge. And yes, it is ugly. And it doesn't prove anything about preserving the connectivity of the VM once migrated, which is this PR's objective.
EDIT: above I mean that "preserving the connectivity of the VM once migrated" is the issue this PR is trying to fix.
FWIW, I'm running these 2 tests in a loop w/ the untilItFails flag.
$ FUNC_TEST_ARGS='-untilItFails -race' make functest
...
Ran 2 of 813 Specs in 197.550 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 811 Skipped
PASS | FOCUSED
All tests passed...
Will keep running them until they fail.
This was attempt #19
Maybe you should stop now?
...
Ran 2 of 813 Specs in 188.358 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 811 Skipped
PASS | FOCUSED
All tests passed...
Will keep running them until they fail.
This was attempt #25
Dave, this conversation can serve no purpose anymore. Goodbye.
Is there something I am missing here ?
The ping may fail for other reasons, like guest CPU load right after the migration or MAC address duplication on the switch (the mac appears on a different port, so it will not reach the right destination).
I do not think it is safe to assume this is the only reason to have echo packets lost.
As a side note, yes, it is possible to check the mac of the bridge. And yes, it is ugly. And it doesn't prove anything about preserving the connectivity of the VM once migrated, which is this PR's objective.
EDIT: above I mean that "preserving the connectivity of the VM once migrated" is the issue this PR is trying to fix.
I would use "improve", not "solve", here. I am just pointing out that there may be other factors that affect a smooth migration and connectivity.
There is also a logical problem with what you are saying: if the mac is equivalent, we can be sure the ping will work fine (let's assume this is true). However, you cannot assume the other direction, right? So what protects us from someone breaking this feature eventually?
What feature ? That you migrate & immediately ping once the VM is reported as ready ?
This test does.
What test are you missing ? I don't think there's a need to test that we get a hard-coded mac address from the bridge ...
I am just worried that we are not testing this behavior (that the mac is explicitly set to a hard-coded address and is expected to be the same on all similar virt-launchers).
I guess my logical claim is this: if this gets broken and the mac address is no longer the same as we expect, then the ping may still pass from time to time, becoming flaky and leaving us to wonder what happened.
If you are not worried about that, then let's proceed; not covering this explicitly is not worse than what we have today.
If we encounter this in the future, we can handle it then.
78f831b → f2e37c9 (force-push)
I am good with this change, but some questions popped up and it will be nice to verify this before we proceed.
For the reserved mac, I would reserve it for all interfaces in general, no matter the binding. It is easier to do so and can simplify KMP logic. If we need to reserve another mac in the future, it will enter the same category and the logic will be kept simple.
WDYT?
pkg/virt-api/webhooks/validating-webhook/admitters/vmi-create-admitter_test.go
@@ -258,6 +259,8 @@ func validateInterfaceNetworkBasics(field *k8sfield.Path, networkExists bool, id
	causes = appendStatusCauseForSlirpNotEnabled(field, causes, idx)
} else if iface.Masquerade != nil && networkData.Pod == nil {
	causes = appendStatusCauseForMasqueradeWithourPodNetwork(field, causes, idx)
} else if iface.Masquerade != nil && network.IsReserved(iface.MacAddress) {
What will happen in case we have 1 interface with masq and the 2nd with bridge binding?
Initially I thought it may be a problem, but on the other hand it is on a different LAN. Will this cause problems with the ARP resolution? (I think we need to check this)
This also raised another point which I do not know how we can mitigate: if the secondary network uses, for example, bridge binding, this mac may already exist in the network (e.g. as the default gateway), and in such a case we may also have a problem with the ARP table (not sure if the ARP table accepts having a single mac mapped to two different IP addresses).
Can you check this?
What will happen in case we have 1 interface with masq and the 2nd with bridge binding?
Initially I thought it may be a problem, but on the other hand it is on a different LAN. Will this cause problems with the ARP resolution? (I think we need to check this)
I don't see how, since those are different L2 networks.
This also raised another point which I do not know how we can mitigate: if the secondary network uses, for example, bridge binding, this mac may already exist in the network (e.g. as the default gateway), and in such a case we may also have a problem with the ARP table (not sure if the ARP table accepts having a single mac mapped to two different IP addresses).
Again, I don't see how. And furthermore, I don't see how this differs from the standard situation of having the same mac address in different L2 networks, which I don't think is a problem.
Can you check this?
Nevertheless, I can. Below you can find the spec I used; it uses 2 VMs, both with the masquerade pod interface and a secondary macvtap network.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvtap
  annotations:
    k8s.v1.cni.cncf.io/resourceName: macvtap.network.kubevirt.io/eth0
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvtap",
    "mtu": 1500
  }'
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vm1
spec:
  domain:
    devices:
      disks:
      - disk:
          bus: virtio
        name: rootfs
      - disk:
          bus: virtio
        name: cloudinit
      interfaces:
      - masquerade: {}
        name: default
        ports:
        - name: ssh
          port: 22
      - macvtap: {}
        name: macvtap
        macAddress: 02:00:00:00:00:00
    machine:
      type: ""
    resources:
      requests:
        memory: 1024M
  networks:
  - name: default
    pod: {}
  - multus:
      networkName: macvtap
    name: macvtap
  terminationGracePeriodSeconds: 0
  volumes:
  - containerDisk:
      image: registry:5000/kubevirt/fedora-cloud-container-disk-demo:devel
    name: rootfs
  - cloudInitNoCloud:
      userData: |
        #!/bin/bash
        echo "fedora" |passwd fedora --stdin
        dhclient eth1
    name: cloudinit
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vm2
spec:
  domain:
    devices:
      disks:
      - disk:
          bus: virtio
        name: rootfs
      - disk:
          bus: virtio
        name: cloudinit
      interfaces:
      - masquerade: {}
        name: default
        ports:
        - name: ssh
          port: 22
      - macvtap: {}
        name: macvtap
    machine:
      type: ""
    resources:
      requests:
        memory: 1024M
  networks:
  - name: default
    pod: {}
  - multus:
      networkName: macvtap
    name: macvtap
  terminationGracePeriodSeconds: 0
  volumes:
  - containerDisk:
      image: registry:5000/kubevirt/fedora-cloud-container-disk-demo:devel
    name: rootfs
  - cloudInitNoCloud:
      userData: |
        #!/bin/bash
        echo "fedora" |passwd fedora --stdin
        dhclient eth1
    name: cloudinit
You can find below the macs / ips for the 2 VMs, along with their neighbor tables.
VM1
$ [fedora@vm1 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:19:fa:ee brd ff:ff:ff:ff:ff:ff
altname enp1s0
inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
valid_lft 86296654sec preferred_lft 86296654sec
inet6 fe80::5054:ff:fe19:faee/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 02:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff <===== same MAC the gateway of eth0 has
altname enp2s0
inet 192.168.66.139/24 brd 192.168.66.255 scope global noprefixroute eth1
valid_lft forever preferred_lft forever
inet6 fe80::e03c:bcdd:7d90:7e45/64 scope link noprefixroute
valid_lft forever preferred_lft forever
# ping gateway - has 02:00:00:... mac address
[fedora@vm1 ~]$ ping 10.0.2.1
PING 10.0.2.1 (10.0.2.1) 56(84) bytes of data.
64 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=1.56 ms
64 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=0.927 ms
--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
rtt min/avg/max/mdev = 0.927/1.244/1.562/0.317 ms
[fedora@vm1 ~]$ ip n
192.168.66.131 dev eth1 lladdr 62:89:2e:cd:75:d7 STALE
10.0.2.1 dev eth0 lladdr 02:00:00:00:00:00 REACHABLE <===== as you can see here
192.168.66.2 dev eth1 lladdr 16:c1:6d:e8:62:dc STALE
fe80::b440:bff:fe6e:c1df dev eth1 lladdr 16:c1:6d:e8:62:dc router STALE
VM2
[fedora@vm2 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:3f:06:10 brd ff:ff:ff:ff:ff:ff
altname enp1s0
inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
valid_lft 86296796sec preferred_lft 86296796sec
inet6 fe80::5054:ff:fe3f:610/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 62:89:2e:cd:75:d7 brd ff:ff:ff:ff:ff:ff
altname enp2s0
inet 192.168.66.131/24 brd 192.168.66.255 scope global noprefixroute eth1
valid_lft forever preferred_lft forever
inet6 fe80::dcf9:be0:d53b:d725/64 scope link noprefixroute
valid_lft forever preferred_lft forever
# 2 shells, one pinging vm1's IP - 192.168.66.139 - another pinging the gateway - 10.0.2.1
[fedora@vm2 ~]$ ping -q 192.168.66.139
[fedora@vm2 ~]$ ping -q 10.0.2.1
# gets us this neighbor table
[fedora@vm2 ~]$ ip n
192.168.66.2 dev eth1 lladdr 16:c1:6d:e8:62:dc STALE
192.168.66.139 dev eth1 lladdr 02:00:00:00:00:00 REACHABLE
10.0.2.1 dev eth0 lladdr 02:00:00:00:00:00 REACHABLE
fe80::b440:bff:fe6e:c1df dev eth1 lladdr 16:c1:6d:e8:62:dc router STALE
The ticks to update the neighbor state happen every BASE_REACHABLE_TIME / 2, and the entries remain in the reachable state for a while, between BASE_REACHABLE_TIME / 2 and 3/2 * BASE_REACHABLE_TIME, as seen in the implementation. In the system above, w/ a 5.6.6 kernel, the base_reachable_time for the interfaces is 30 seconds:
[fedora@vm2 ~]$ cat /proc/sys/net/ipv4/neigh/eth0/base_reachable_time
30
[fedora@vm2 ~]$ cat /proc/sys/net/ipv4/neigh/eth1/base_reachable_time
30
I see that the neighbors persist in the reachable state for around 10-30 seconds, which, given the numbers described above, makes sense to me.
And the traffic works, which I think is the most important metric here.
Thanks!
I personally don't see that as a reason for imposing restrictions on bindings that do not require them. As such, I would avoid doing it.
Thank you!
This needs rebase now.
Since the bridge MAC address no longer changes, the ARP tables do not need to stabilize, which preserves connectivity from the outside world to the guest. This is tested in this commit since the `Eventually` block in the vmi_networking_test for masquerade binding migration can be safely removed. The lowest locally administrated unicast MAC address was chosen for the bridge. Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
For now, we reserve a single hardcoded MAC address, 02:00:00:00:00:00, the lowest locally administrated unicast address. Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
f2e37c9 → d55538d (force-push)
@maiqueb: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: AlonaKaplan
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What this PR does / why we need it:
Since the bridge MAC address no longer changes, the ARP tables do not need to stabilize, which preserves connectivity from the outside world to the guest.
This is tested in this commit since the Eventually block in the vmi_networking_test for masquerade binding migration can be safely removed.
Reserve the hardcoded MAC address when masquerade binding is used; as such, throw an error whenever a user attempts to specify that MAC on a masquerade interface.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4999
Special notes for your reviewer:
Release note: