Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incusd/forknet: Handle wifi detach #395

Merged
merged 1 commit into from
Jan 16, 2024
Merged

incusd/forknet: Handle wifi detach #395

merged 1 commit into from
Jan 16, 2024

Conversation

stgraber
Copy link
Member

Closes #385

Closes lxc#385

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
@stgraber
Copy link
Member Author

@mcondarelli this adds similar logic for detach as is used for attach.

So the wifi device name will be restored in the container before the phy is moved back, dragging the interface with it.

I'm also adding a check for whether the device even exists in the container's namespace.
That check will most likely trip in the OpenWRT case leading to a clearer error in the log at least for those cases where we can't properly detach the device due to it having been renamed.

@hallyn hallyn merged commit cf47ad6 into lxc:main Jan 16, 2024
25 checks passed
@mcondarelli
Copy link

@stgraber I don't know if change already found its way in daily, but I see no change in behavior.
Current device stanza is:

  wlan0:
    name: phy0-ap0
    nictype: physical
    parent: wlx00c0cab2bd76
    type: nic

At Container shutdown incusd logs:

time="2024-01-17T09:33:51+01:00" level=error msg="Failed to stop device" device=wlan0 err="Failed to detach interface: \"phy0-ap0\" to \"wlx00c0cab2bd76\": Failed to run: /opt/incus/bin/incusd forknet detach -- /proc/17966/fd/4 15907 phy0-ap0 wlx00c0cab2bd76: exit status 1 (Error: Couldn't restore host interface \"wlx00c0cab2bd76\" as container interface \"phy0-ap0\" couldn't be found)" instance=openwrt instanceType=container project=default

I will retry tomorrow.

@stgraber
Copy link
Member Author

Yeah, the log message has changed, it now tells you that the device can't be restored due to the renaming done by OpenWRT.

@mcondarelli
Copy link

@stgraber Sorry, I don't understand.

I understood naming "as OpenWrt wants" (i.e.: phy0-ap0) should have worked.

message says: "container interface "phy0-ap0" couldn't be found", but interface is there.

WAIT! Container interface name includes quotes?
If so I gave to redo all my tests!

In any case there's another glitch:

Apparently Container cannot access some of the Interface settings, I hit that wall with "reg" settings:

root@incus:~# incus start openwrt
root@incus:~# incus exec openwrt -- ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: br-lan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 00:16:3e:f2:5a:6f brd ff:ff:ff:ff:ff:ff
3: phy0-ap0: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN qlen 1000
    link/ether 00:c0:ca:b2:bd:76 brd ff:ff:ff:ff:ff:ff
7: eth0@br-lan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-lan state UP qlen 1000
    link/ether 00:16:3e:f2:5a:6f brd ff:ff:ff:ff:ff:ff
8: eth1@phy0-ap0: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 00:16:3e:57:99:42 brd ff:ff:ff:ff:ff:ff
9: eth2@if10: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 00:16:3e:ed:92:d6 brd ff:ff:ff:ff:ff:ff
root@incus:~# incus exec openwrt -- iw reg get
global
country 00: DFS-UNSET
	(755 - 928 @ 2), (N/A, 20), (N/A), PASSIVE-SCAN
	(2402 - 2472 @ 40), (N/A, 20), (N/A)
	(2457 - 2482 @ 20), (N/A, 20), (N/A), AUTO-BW, PASSIVE-SCAN
	(2474 - 2494 @ 20), (N/A, 20), (N/A), NO-OFDM, PASSIVE-SCAN
	(5170 - 5250 @ 80), (N/A, 20), (N/A), AUTO-BW, PASSIVE-SCAN
	(5250 - 5330 @ 80), (N/A, 20), (0 ms), DFS, AUTO-BW, PASSIVE-SCAN
	(5490 - 5730 @ 160), (N/A, 20), (0 ms), DFS, PASSIVE-SCAN
	(5735 - 5835 @ 80), (N/A, 20), (N/A), PASSIVE-SCAN
	(57240 - 63720 @ 2160), (N/A, 0), (N/A)

root@incus:~# incus exec openwrt -- iw reg set IT
command failed: Operation not permitted (-1)
root@incus:~# iw reg set IT
root@incus:~# incus exec openwrt -- iw reg get
global
country IT: DFS-ETSI
	(2400 - 2483 @ 40), (N/A, 20), (N/A)
	(5150 - 5250 @ 80), (N/A, 23), (N/A), NO-OUTDOOR, AUTO-BW
	(5250 - 5350 @ 80), (N/A, 20), (0 ms), NO-OUTDOOR, DFS, AUTO-BW
	(5470 - 5725 @ 160), (N/A, 26), (0 ms), DFS
	(5725 - 5875 @ 80), (N/A, 13), (N/A)
	(5945 - 6425 @ 160), (N/A, 23), (N/A), NO-OUTDOOR
	(57000 - 66000 @ 2160), (N/A, 40), (N/A)

root@incus:~# incus stop openwrt
root@incus:~# tail /var/log/incus/incusd.log
time="2024-01-17T09:21:06+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-01-17T09:33:51+01:00" level=error msg="Failed to stop device" device=wlan0 err="Failed to detach interface: \"phy0-ap0\" to \"wlx00c0cab2bd76\": Failed to run: /opt/incus/bin/incusd forknet detach -- /proc/17966/fd/4 15907 phy0-ap0 wlx00c0cab2bd76: exit status 1 (Error: Couldn't restore host interface \"wlx00c0cab2bd76\" as container interface \"phy0-ap0\" couldn't be found)" instance=openwrt instanceType=container project=default
time="2024-01-17T09:45:57+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
time="2024-01-17T09:45:57+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-01-17T15:33:01+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
time="2024-01-17T15:33:01+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-01-17T15:34:49+01:00" level=error msg="Failed to stop device" device=wlan0 err="Failed to detach interface: \"phy0-ap0\" to \"wlx00c0cab2bd76\": Failed to run: /opt/incus/bin/incusd forknet detach -- /proc/3324/fd/4 12891 phy0-ap0 wlx00c0cab2bd76: exit status 1 (Error: Couldn't restore host interface \"wlx00c0cab2bd76\" as container interface \"phy0-ap0\" couldn't be found)" instance=openwrt instanceType=container project=default
time="2024-01-17T15:38:18+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
time="2024-01-17T15:38:18+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-01-17T15:45:28+01:00" level=error msg="Failed to stop device" device=wlan0 err="Failed to detach interface: \"phy0-ap0\" to \"wlx00c0cab2bd76\": Failed to run: /opt/incus/bin/incusd forknet detach -- /proc/1569/fd/4 754 phy0-ap0 wlx00c0cab2bd76: exit status 1 (Error: Couldn't restore host interface \"wlx00c0cab2bd76\" as container interface \"phy0-ap0\" couldn't be found)" instance=openwrt instanceType=container project=default
root@incus:~# 

Now I have a meeting with my dentist, I'll be back later...

@stgraber
Copy link
Member Author

Hmm, that makes me wonder if OpenWRT does anything funny on shutdown.

Can you try incus stop -f so the container doesn't get to know it's being stopped.

@mcondarelli
Copy link

It indeed seems OpenWrt does some "late renaming".

Using incus stop -f openwrt with the "right name" restores Interface with no error message.

The following excerpt is right after incus server reboot:

mcon@cinderella:~$ ssh root@incus
Linux incus 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Jan 17 15:39:15 2024 from 192.168.7.12
root@incus:~# iw reg set IT
root@incus:~# incus start openwrt
root@incus:~# incus exec openwrt -- iw reg get
global
country IT: DFS-ETSI
	(2400 - 2483 @ 40), (N/A, 20), (N/A)
	(5150 - 5250 @ 80), (N/A, 23), (N/A), NO-OUTDOOR, AUTO-BW
	(5250 - 5350 @ 80), (N/A, 20), (0 ms), NO-OUTDOOR, DFS, AUTO-BW
	(5470 - 5725 @ 160), (N/A, 26), (0 ms), DFS
	(5725 - 5875 @ 80), (N/A, 13), (N/A)
	(5945 - 6425 @ 160), (N/A, 23), (N/A), NO-OUTDOOR
	(57000 - 66000 @ 2160), (N/A, 40), (N/A)

root@incus:~# incus stop -f openwrt
root@incus:~# tail /var/log/incus/incusd.log
time="2024-01-17T09:45:57+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
time="2024-01-17T09:45:57+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-01-17T15:33:01+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
time="2024-01-17T15:33:01+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-01-17T15:34:49+01:00" level=error msg="Failed to stop device" device=wlan0 err="Failed to detach interface: \"phy0-ap0\" to \"wlx00c0cab2bd76\": Failed to run: /opt/incus/bin/incusd forknet detach -- /proc/3324/fd/4 12891 phy0-ap0 wlx00c0cab2bd76: exit status 1 (Error: Couldn't restore host interface \"wlx00c0cab2bd76\" as container interface \"phy0-ap0\" couldn't be found)" instance=openwrt instanceType=container project=default
time="2024-01-17T15:38:18+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
time="2024-01-17T15:38:18+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
time="2024-01-17T15:45:28+01:00" level=error msg="Failed to stop device" device=wlan0 err="Failed to detach interface: \"phy0-ap0\" to \"wlx00c0cab2bd76\": Failed to run: /opt/incus/bin/incusd forknet detach -- /proc/1569/fd/4 754 phy0-ap0 wlx00c0cab2bd76: exit status 1 (Error: Couldn't restore host interface \"wlx00c0cab2bd76\" as container interface \"phy0-ap0\" couldn't be found)" instance=openwrt instanceType=container project=default
time="2024-01-17T18:39:53+01:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
time="2024-01-17T18:39:53+01:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
root@incus:~# ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 60:02:92:57:66:c1 brd ff:ff:ff:ff:ff:ff
3: enxa0cec8b43055: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether a0:ce:c8:b4:30:55 brd ff:ff:ff:ff:ff:ff
5: ORANGE: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:ed:dd:b3 brd ff:ff:ff:ff:ff:ff
6: incusbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:54:c8:5c brd ff:ff:ff:ff:ff:ff
11: wlx00c0cab2bd76: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:c0:ca:b2:bd:76 brd ff:ff:ff:ff:ff:ff
root@incus:~# 

I am asking advice to OpenWrt experts.

Note: what is incusd complaining about ("Couldn't find the CGroup...")? Should I care?

@stgraber
Copy link
Member Author

Nah, those warnings are basically normal on modern cgroup2 systems as cgroup2 just isn't quite at parity with cgroup1 on a few things.

@mcondarelli
Copy link

Anyways using a container with USB physical passthrough doesn't seem really ready for prime time, especially when WiFi is factored in.
I'll try to reshuffle my architecture to avoid it, if possible, otherwise I'll go back to VM.
Many thanks for your support.

@stgraber
Copy link
Member Author

With the recent change I made, things seems to be working as expected here. The main issue is OpenWRT renaming interfaces on startup and shut down which prevents proper interface restoration.

I don't know if there's some way to disable that behavior and then pass interfaces with the expected name right from the start, avoiding all those complications.

@mcondarelli
Copy link

I am not deep enough in the matter to make meaningful suggestions but I see some "rough edges" and I don't really understand where they come from.

  1. There are features unavailable to Container supposedly "owning" device (e.g.: trying to "iw reg set IT" in container results in "permission denied" while I can still do it on host even while I don't see the device anymore).
  2. People at OpenWrt say (IRC transcript, I hope I represented situation right):

07:59:39 PM - mcon: In the meantime: is there some documentation about how and when net devices get renamed by OpenWrt?
08:00:51 PM - Poster: for physical devices I don't think they get renamed (ethN, phyN-apX, etc)
08:01:09 PM - Poster: bridges are virtual devices that get named on creation
08:01:25 PM - Poster: though there are interfaces, which may be confused with devices
08:03:31 PM - Poster: Are you able to ping the DHCP server's IP address from the br-lan interface?
08:07:50 PM - mcon: Poster: wireless devices get surely renamed wlan0(or whatever) -> phy0-ap0 and they are also renamed at shutdown so phy0-ap0 is not available anymore.
08:08:44 PM - Poster: oh yeah I think that came with 23.05 and deals with virtual APs
08:10:21 PM - PaulFertser: mcon: renaming is done this way https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/network/services/hostapd/patches/601-ucode_support.patch;h=cfdb51f356cc1b4e0f09caf6d2f1c74e0ff403b3;hb=HEAD#l529
08:10:48 PM - mcon: Poster: If I can get detailed info then Incus people can "undo" interface pass-over.
08:11:25 PM - PaulFertser: mcon: called from https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/network/services/hostapd/files/hostapd.uc;h=dfddf8185ba8ef5ff72f43210aa4f1cfd32e7949;hb=HEAD#l484
08:13:02 PM - Poster: I am not familiar with Incus but I think the native behavior for connecting wireless to wired is to let OpenWRT create the bridge between them, if Incus is taking control of the wireless interface the bridge itself may not function
08:13:27 PM - PaulFertser: mcon: also are you sure it's a matter of renaming and availability? With cfg80211 you have a "phy" object and on each object you can have many different VIFs, you can create them, delete them, change type and rename them as you see fit, manually with "iw" if you like. You can see the list of current VIFs with "iw dev".
08:13:50 PM - PaulFertser: mcon: even if OpenWrt deletes all VIFs on wifi shutdown so what.
08:14:06 PM - Poster: you might want to play with adding a virtual ethernet device for Incus then add that device to the br-lan bridge membership
08:14:29 PM - Poster: but I don't know what Incus is doing to know for sure
08:15:32 PM - PaulFertser: mcon: btw, with mac80211 drivers you can always add a VIF of type monitor to any PHY no matter what other VIFs are up at the same time.
08:17:12 PM - Poster: Is this what we're working with? https://linuxcontainers.org/incus/docs/main/explanation/networks/
08:17:33 PM - mcon: Poster: Incus is handing over the whole interface, in this case (I don't know details either) and IFF I setup the same name OpwnWrt is expecting (phy0-ap0) AND I kill container without proper shutdown (and further renaming) interface is correctly restored.
08:18:33 PM - Poster: Are you attempting "Physical network" ?
08:19:26 PM - mcon: Poster: "nictype = physical" to be precise.
08:19:54 PM - PaulFertser: mcon: and for proper OpenWrt experience you need to allow OpenWrt to control VIFs in any way, including adding arbitrary amount of them with arbitrary names. E.g. to handle several SSIDs (for IoT, for guests etc) and probably to connect to other networks etc.
08:21:07 PM - Poster: ok yeah I am not sure that's the right use case, Bridge network utilizing the existing br-lan is probably what would be needed
08:22:40 PM - Poster: Is the intent for the container to get an IP address on the same network for br-lan ?
08:23:55 PM - mcon: PaulFertser: That is what I thought nictype: physical would do, but I see problems. Probably using a full VM would be better.
08:25:25 PM - PaulFertser: mcon: it's interesting to get containers working properly too
08:26:09 PM - Poster: I would replace physical with bridge, if you can work out adding the interface to br-lan I would guess things would then begin working
08:27:06 PM - Poster: br-lan would qualify as "underlying native Linux bridges"
08:27:44 PM - PaulFertser: mcon: did I answer your renaming question?
08:29:24 PM - mcon: PaulFertser: I still have to digest the info, but surely it seems the right spot to look at 😉
08:33:59 PM - PaulFertser: mcon: I think it's nothing OpenWrt specific, you can try renaming manually with "ip l s dev eth0 name eth0-new" and it will do the same operation.
08:37:44 PM - mcon: PaulFertser: No. I was unclear. Problem is (AFAIK!) Incus needs to hand over to Container interface and "remove" it from host; at container shutdown it has to do the reverse but current code relies on finding the same name it handed over to reverse operation. I don't know if there's a better way, but this is how it is done now. Changing the name in the container prevents returning interface to host.
08:39:30 PM - mcon: Further problem is mechanism should work for all kind of interfaces, not just for WiFi (which already is "special case" because of double interface)
08:45:17 PM - PaulFertser: mcon: it's not double, that's the point. It's arbitrary many VIFs.
08:45:27 PM - PaulFertser: (and phy isn't a network interface at all)
08:45:58 PM - PaulFertser: mcon: I think my point is that they shouldn't depend on interface names at all
08:48:50 PM - mcon: PaulFertser: I think you're right, but I'm not an Incus maintainer... (by good measure!) I can have a look to relevant code though.
08:49:14 PM - PaulFertser: mcon: they might be unaware about the multiple VIF thing
08:52:04 PM - mcon: PaulFertser: I'll have a look to the code (which is in go, I know next to nothing about) and then talk again with Incus developers.

My current plan is:

  • avoid passing WiFi adapter at all
  • prepare a "dumb Access Point" on host
  • prepare a bridge between enp2s0 and AP
  • pass bridge to Container as eth0

This should eradicate the problem at the expense of forfeiting OpenWrt AP handling.

Other possibility is to go the VM way (which is also acceptable now I solved the incus-agent puzzle). Reason not to do it was preserve resources, but that's marginal at best.

@stgraber
Copy link
Member Author

For things not working with permission errors and the like, those would be potential kernel bugs. All Incus does is move the interface and phy over to the container and do the reverse on shutdown, anything that happens between those two, Incus has no control over.

It's not too unusual to see those kind of issues, especially in areas that are not very commonly used in containers. The kernel developers may have put in capability checks that are not container aware, leading to such failures.

Occasionally it's also done on purpose, if the feature would for example allow altering the device firmware or cause large amounts of memory/cpu to be used on the system. Basically anything that we wouldn't expect a normal user to be able to do is likely to be blocked by the kernel, at least until someone has a good hard look at it.

@stgraber
Copy link
Member Author

About what was said on OpenWRT IRC, I'm aware of the fact that a phy can have many interfaces on top of it, but LXC and Incus don't have support for wireless phy as the main object that users interact with. Instead they rely on passing in network interface names, which would usually be one of the interfaces on the phy.

In any case, the problem still remains, we need to be able to restore both the phy and all attaches interfaces to the host system. The tools that are used for that unfortunately all rely on interface names. If those names have changed, then there's nothing we can really do other than fail and let the user pick up the pieces :(

@mcondarelli
Copy link

Understood (actually: it's way over my head to be really able to really understand it, but I surely accept your word on it ;) ).

Can you confirm I won't have this kind of problems with a VM?

That should be really able to hand over a USB device to VM's kernel for full handling, right?

@stgraber
Copy link
Member Author

Yeah, the VM case won't have this problem as it gets its own kernel with full permissions and just sees the adapter as a regular USB device.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

incus stop <instance> does not seem to release nictype=physical devices
3 participants