Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

network: Adds IPVLAN support #5716

Merged
merged 4 commits into from May 9, 2019

Conversation

4 participants
@tomponline
Copy link
Member

commented May 2, 2019

Signed-off-by: Thomas Parrott thomas.parrott@canonical.com

@tomponline tomponline force-pushed the tomponline:tp-ipvlan branch 8 times, most recently from 5ed399f to c3dde8e May 2, 2019

@tomponline tomponline marked this pull request as ready for review May 7, 2019

@tomponline tomponline force-pushed the tomponline:tp-ipvlan branch 5 times, most recently from 94a37de to 3145db7 May 7, 2019

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 7, 2019

jenkins: test this please

@tomponline tomponline force-pushed the tomponline:tp-ipvlan branch from 3145db7 to 1d2ecb5 May 7, 2019

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 7, 2019

@stgraber this is ready for review, tests seem fine (one rsync error which I think is unrelated).

@stgraber

This comment has been minimized.

Copy link
Member

commented May 7, 2019

Small nit, your git config should be set so the Signed-off-by uses your Firstname and Lastname

Show resolved Hide resolved doc/api-extensions.md Outdated
Show resolved Hide resolved doc/containers.md Outdated
Show resolved Hide resolved lxc/network.go Outdated
Show resolved Hide resolved lxc/network.go Outdated
Show resolved Hide resolved doc/containers.md Outdated

@tomponline tomponline force-pushed the tomponline:tp-ipvlan branch 3 times, most recently from 2cecdc6 to 2c79cee May 8, 2019

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 8, 2019

@stgraber ready for review again, thanks

@tomponline tomponline force-pushed the tomponline:tp-ipvlan branch from 2c79cee to 4212200 May 8, 2019

@stgraber

This comment has been minimized.

Copy link
Member

commented May 8, 2019

Ok, so I'm going to take the lxc_features part of this and move it into its own PR as it needs to be split into separate commits too and we need to move the storage of those flags to the State struct with a startup time check to avoid constant jump to cgo.

We can then rebase this branch onto master once merged and just fix the IPv4 and IPv6 addresses list to be comma separated (with optional spaces).

@stgraber stgraber force-pushed the tomponline:tp-ipvlan branch from b4d1dd8 to 344989d May 8, 2019

@stgraber

This comment has been minimized.

Copy link
Member

commented May 8, 2019

Did a rebase locally, switched to using OS.LXCFeatures and switched from space delimited to comma delimited with optional spaces.

Should be all good once the PR this now depends on are all merged.

@stgraber stgraber referenced this pull request May 8, 2019

Closed

add IPVLAN support #5182

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 8, 2019

Thanks @stgraber, ill take onboard your comments about calling repeatedly into cgo being slow and the standard use of commas rather than spaces for option delimiters.

@tomponline tomponline force-pushed the tomponline:tp-ipvlan branch from 344989d to 1772909 May 8, 2019

tomponline added some commits May 7, 2019

network: Adds multiple IP validation functions
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
doc: IPVLAN docs
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
api: IPVLAN support
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
network: IPVLAN support
Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>

@stgraber stgraber force-pushed the tomponline:tp-ipvlan branch from 1772909 to a3c277f May 8, 2019

@s3rj1k

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

@tomponline In #5182 was discussion about GARP and GNDP, I did not see this implemented in this PR, not sure that closing that issue was good idea.

@stgraber

This comment has been minimized.

Copy link
Member

commented May 9, 2019

@s3rj1k proxy ARP and proxy NDP are detected in the liblxc code and will fail with an error so long as not configured properly. Similar to what we discussed before, we will not have LXC or LXD alter those system wide settings.

@stgraber stgraber merged commit 8114c82 into lxc:master May 9, 2019

4 of 5 checks passed

Testsuite Testsuite failed
Details
Branch target Branch target is correct
Details
DCO All commits signed-off
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@stgraber

This comment has been minimized.

Copy link
Member

commented May 9, 2019

To clarify a bit, the liblxc ipvlan code does the needed sysctls on the generated interface itself but will not alter interfaces which are outside of liblxc's control, if those aren't set to the right value, an error is raised.

@tomponline is the error handling for this good enough for LXD or would we want to improve how we surface liblxc's errors on start or alternatively have a check for those sysctls directly in LXD?

@s3rj1k

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

@stgraber ProxyARP and ProxyNDP is not the same as GARP and GNDP.
@tomponline knows more, we had a discussion in the original PR about it.

@stgraber I am glad that IPVLAN finely is in the code base, but you are two quick to discard community efforts and issues. This makes community contribution to project almost impossible.

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 9, 2019

@s3rj1k Yep I know what you mean about GARP I'll post a follow up shortly :)

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 9, 2019

@stgraber the error handling could definitely be made clearer, currently if the required sysctls are missing then the container will not start, and you have to run the suggested lxd log retrieval command for the container to see the underlying LXC error. Its there, but takes some effort to get to.

Adding explicit checks for the same sysctls in LXD would certainly help the experience, so I can add a check for those no problem.

@s3rj1k the IPVLAN functionality has been implemented in the underlying LXC codebase so as to provide the functionality to a wider set of users. It was implemented as 3 discrete sets of functionality:

  • The core IPVLAN device type - along with exposing the mode (l3, l3s and l2) and isolation flags (bridge, private, vepa).
  • The layer 2 proxy (l2proxy) - this detects the sysctls required to allow Linux to generate proxy ARP/NDP responses (and generates errors if they are not set) and adds the necessary IP neigh proxy entries and static routes to allow Linux to generate the response packets. The intention with splitting this out into its own feature separate from IPVLAN itself is so that it can be used with a routed veth mode we are planning to add in the coming months.
  • Device route gateway support - this allows the containers to use a device rather than an IP address as its default gateway, as using an IP for the gateway doesn't make sense for IPVLAN.

Then in LXD we made use of this new functionality to implement a simplified IPVLAN mode similar to your original pull-request and the existing MACVLAN experience in LXD.

Specifically:

  • Only l3s mode with bridge isolation flag is currently supported in LXD, as this is required in order to work with the l2proxy mode (for some reason Linux will not generate proxy ARP/NDP packets if the IPVLAN device is operating in another mode). This is equivalent to the macvlan experience in LXD today.
  • l2proxy mode is enabled, which assumes that the normal usage scenario for this mode is to add IPs to the containers from the local subnet of the parent interface (as opposed to other subnets being routed to the host). Not that this precludes you from doing that if you want though.

Additionally, the specific sysctls required have been slightly modified from your original PR, as during my testing I found that only these 3 are specifically required when using l3s mode and proxy ARP/NDP:

net.ipv4.conf.<parent>.forwarding=1
net.ipv6.conf.<parent>.forwarding=1
net.ipv6.conf.<parent>.proxy_ndp=1

I found that these sysctls from the original pull-request weren't specifically needed:

net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
net.ipv4.conf.DEV.rp_filter=0

@s3rj1k the rp_filter one specifically I would be interested in knowing in which situations this one was needed?

Finally, @s3rj1k is correct that we did originally discuss adding gratuitous ARP and NDP adverts when a container boots so as to announce to the network which is slightly different to the proxy ARP/NDP mode in that the packets are sent out into the network irrespective of whether a device is asking "who has xxx?". This is primarily used for devices on the network that have already resolved the layer 2 address of the container and cached it in their IP neighbour cache. If the container is then migrated to a different host with a different MAC address then the devices on the network could continue to send packets to the old host as they had cached the MAC address (until such time as the cache expires or the OS notices there is no response and re-issues a "who has" request).

This could be added as an extension to the l2proxy mode in LXC (although the code I have today for it is written in Go rather than C so would need to be re-worked). This would then benefit both the IPVLAN mode (without code changes to LXD) and the future routed veth mode.

@tomponline tomponline deleted the tomponline:tp-ipvlan branch May 9, 2019

@s3rj1k

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

@tomponline Enabling IPForwarding is a good idea because of inconsistent kernel behaviour, from what I observed older kernels need IP Forwarding enabled, newer ones enable this behaviour then IPVLAN L3s mode is enabled with proxy ARP/NDP records.

also per the kernel docs net.ipv6.conf.all.forwarding=1 behaves differently from net.ipv4.ip_forward=1
for IPv4 it enables forwarding globally, for IPv6 it informs that you intend to forward traffic from specific interface (per interface forwarding).

Forwarding must be mentioned in documentation. But in my opinion this is needed in code to have defined behaviour across kernel versions.

rp_filter is needed for IPv4 then you have multiple Vlans on parent interface and inside CT.
with rp_filter disabled kernel will drop packets originated from different network.

As a minimum also must be in documentation.

I managed to get working GARP in my test environment, so this one is simple.

The NDP (IPv6) part is not so simple (does not work in my setup).

@lhorace

This comment has been minimized.

Copy link

commented May 9, 2019

@s3rj1k I agree with you re the community participation since I've been following this closely, nevertheless, it's implemented! So continue to contribute because it helps a lot of people too

@s3rj1k

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

@lhorace Agreed, @tomponline did a great job.
Nevertheless guys from canonical should extend contributors documentation on how to write properly styled code.

Maybe add linters to project, I am willing to assist with this. @stgraber

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 9, 2019

@s3rj1k RE the ipv4 forwarding sysctls, I made a mistake and missed out the net.ipv4.conf.<parent>.forwarding=1 one which is checked for in LXC/LXD and is in the docs here: https://linuxcontainers.org/lxc/manpages//man5/lxc.container.conf.5.html (search for l2proxy).

I am going to add better error output to LXD so that it checks for those 3 sysctls directly and lets the user know if they are incorrect.

I've updated my original post here:
#5716 (comment)

Regarding your note about older kernels requiring additional global forwarding sysctls enabled, I thought I would have a go at re-creating this using a CentOS 7 box which uses an older kernel 3.10.0-957.5.1.el7.x86_64 (it doesnt have IPVLAN support anyway, so its moot, but useful for testing the behaviour of proxy ARP/NDP settings).

For this experiment, the aim was to see if I could get a test host to response to ARP and NDP requests for an IP that it didn't have on its LAN interface, but did have a route to via the lo interface. As I am not testing "routed mode" proper (as that is not required for IPVLAN l3s mode), I just needed to check what the ping packets were arriving at the LAN interface of the test node to consider the test a success (as opposed to requiring a response to be generated) - as this will indicate that APR and NDP resolution has succeeded.

Here are the test steps I ran:

Disable proxy_arp and IPv4 forwarding globally so we have a baseline config:

sysctl net.ipv4.ip_forward=0
sysctl net.ipv4.conf.all.proxy_arp=0
sysctl net.ipv4.conf.all.forwarding=0

Enable forwarding on the LAN interface of the node:

sysctl net.ipv4.conf.enp2s0.forwarding=1

Add manual IP proxy entry on LAN interface (this is where IPv4 differs to IPv6 in the kernel because adding a manual proxy entry activates proxy ARP on that interface without needing to set it in the sysctls):

ip neigh add proxy 192.168.1.200 dev enp2s0

Add a route to the IP via the lo interface (this is to allow proxy ARP to work, using any non-LAN interface works the same):

ip r add 192.168.1.200/32 dev lo

Then run a tcpdump session:

tcpdump -i enp2s0 host 192.168.1.200 -nn

On another device connected to the same LAN, clear the local neighbour cache and ping 192.168.1.200

ip neigh delete 192.168.1.200 dev enp3s0
ping 192.168.1.200

We should expect to see this on TCPDUMP:

ARP resolution occurring, ICMP requests arriving at LAN interface (no ICMP reply expected though).

[root@rh-vz02 ~]# tcpdump -i enp2s0 host 192.168.1.200 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp2s0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:41:59.915189 ARP, Request who-has 192.168.1.200 tell 192.168.1.106, length 46
09:42:00.364437 ARP, Reply 192.168.1.200 is-at 28:92:4a:32:0c:e2, length 28
09:42:00.364557 IP 192.168.1.106 > 192.168.1.200: ICMP echo request, id 5567, seq 1, length 64
09:42:00.923362 IP 192.168.1.106 > 192.168.1.200: ICMP echo request, id 5567, seq 2, length 64
09:42:01.947367 IP 192.168.1.106 > 192.168.1.200: ICMP echo request, id 5567, seq 3, length 64

Similarly for IPv6:

sysctl net.ipv6.conf.all.forwarding=0
sysctl net.ipv6.conf.all.proxy_ndp=0
sysctl net.ipv6.conf.enp2s0.forwarding=1
sysctl net.ipv6.conf.enp2s0.proxy_ndp=1
ip -6 r add 2a02:xxx:76f4:1::200/128 dev lo
ip neigh add proxy 2a02:xxx:76f4:1::200 dev enp2s0
tcpdump -i enp2s0 host 2a02:xxx:76f4:1::200 -nn

On another device connected to the same LAN, clear the local neighbour cache and ping 2a02:xxx:76f4:1::200

ip neigh delete 2a02:xxx:76f4:1::200 dev enp3s0
ping6 2a02:xxx:76f4:1::200

We would expect to see this:

[root@rh-vz02 ~]# tcpdump -i enp2s0 host 2a02:xxx:76f4:1::200 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp2s0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:14:27.398523 IP6 2a02:xxx:76f4:1:d553:a5ec:8306:c7ce > 2a02:xxx:76f4:1::200: ICMP6, echo request, seq 1, length 64
10:14:28.095726 IP6 2a02:xxx:76f4:1:d553:a5ec:8306:c7ce > 2a02:xxx:76f4:1::200: ICMP6, echo request, seq 2, length 64
10:14:29.116336 IP6 2a02:xxx:76f4:1:d553:a5ec:8306:c7ce > 2a02:xxx:76f4:1::200: ICMP6, echo request, seq 3, length 64

So as IPVLAN mode doesn't actually use the host as a router (it just needs to get the packets to arrive at the LAN interface and the IPVLAN device takes it from there), these settings seem to be the only ones required, even on older kernels without IPVLAN support.

I'd be interested to know about specific kernel versions that do not behave like this.

RE rp_filter, I'd be hesitant to encourage people to disable that feature in the docs, as it is intended to filter out spoofed packets, and, as I understand it, only should be disabled in situations where you have the LXD node multi-honed (as you say on multiple VLANs perhaps). Presumably this is true whether you are using ipvlan or some other sort of container networking (such as bridging) where packets arrive at a different interface than the replies would be delivered out of. Have I misunderstood? Is there something specific to ipvlan and rp_filter?

Thanks

@s3rj1k

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

@tomponline You are correct about rp_filter usage, I am saying this should be a documented case then multiple VLANs are needed.

For IPv4 IP Forwarding per interface should be sufficient.

For IPv6 this is not the case, you need to enable net.ipv6.conf.all.forwarding=1 as per documentation
https://github.com/torvalds/linux/blob/80f232121b69cc69a31ccb2b38c1665d770b0710/Documentation/networking/ip-sysctl.txt#L1519

To check this, use IPVLAN L2/L3 (without L3S) mode with IPv6

Actually L3 mode is in fact a router mode, see https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/Documentation/networking/ipvlan.txt#L56

P.S. Do Ping6 checks against google IPv6 DNS for example. Also ND cache must be cleared on GW.

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 9, 2019

@s3rj1k Yes the docs imply the host is performing routing, and in a way it is, but its more like an L3-switch, because the host does not appear in MTR traces as a traditional router hop, also even in l3s mode, the packets are filtered by the INPUT and OUTPUT chains in iptables rather than the FORWARD chain like a normally routed packet would.

I found this out to my disappointment when I tried to use ipvlan on my home router which has a direct ppp connection to the Internet independent of the LAN the ipvlan device was connected to. Packets always went out of the LAN interface rather than use the host's routing table.

With that in mind, I've re-tested proxy NDP with net.ipv6.conf.all.forwarding both on and off and my tests show the same result, that it can be disabled safely assuming you still have net.ipv6.conf.enp3s0.forwarding=1 and it works for all modes, l2, l3 and l3s.

The problem comes when you bring IPv4 into the mix, as proxy ARP does not seem to work in l2 and l3 modes (and only l3s), even with net.ipv4.conf.all.forwarding and net.ipv4.ip_forward=1 enabled.

This is why I settled on forcing l3s mode for LXD's IPVLAN.

@s3rj1k

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

@tomponline actualy L2 mode is also usable, does not need ProxyARP or ProxyNDP, only IP Forwarding, look at output of ip -d l both parent and slave interfaces have ARP attribute.

I would consider adding l2 mode as an option. Works similar to MacVtap but saves you MAC table entries on a dumb hardware switch.

IPVLAN is designed to send packets ONLY from parent interface, so no wonder you had issues with home router, to overcome this you need to set up policy routing.

Plain L3 mode is useless in my opinion.

@s3rj1k

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

And I am actually a bit disappointed of @stgraber behaviour, did not even mentioned at all any work that was done by me, convincing him to actually do something about IPVLAN, spending lots of personal time rebasing, testing, rewriting initial PR.

May this be a warning to other members of community to think twice before contributing.

@tomponline thanks for your work.

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

@s3rj1k OK thanks will give l2 another go, as you say l3 seems to be fairly redundant given l3s mode exists.

Now that the l3s mode that your original PR instigated has been merged, which put in place a lot of the underlying 'plumbing' into LXC (including l2 mode), adding an l2 mode, and required gateway params shouldn't be too hard. It may be that we use the presence of the "gateway" setting to switch into l2 mode. Although that has security implications (in that the host's firewall is bypassed on the inbound). I'll have a think and create a card to track the additional features we discussed here.

@tomponline

This comment has been minimized.

Copy link
Member Author

commented May 15, 2019

@stgraber added better sysctl checks here #5760

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.