OVS with a full-routing table utilizing 100% cpu #185
Comments
I am also seeing this issue. Can anyone point to a solution?
It looks like there are lots of changes in the routing table, which isn't too surprising given that the system is running BGP. OVS tries to keep up with these. I'm surprised it's this expensive. If you stop the BGP daemon temporarily, does OVS CPU usage drop to near-zero? If so, then that would confirm the root of the issue and we can look into how to avoid the high CPU for this case.
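One quick way to run that experiment, sketched below under the assumption that the BGP daemon is bird and that sysstat's `pidstat` is installed (substitute your own daemon and service names, e.g. `frr`):

```shell
# Baseline CPU usage of ovs-vswitchd while BGP is churning routes.
pidstat -u -p "$(pidof ovs-vswitchd)" 1 5

# Temporarily stop the routing daemon (assumed to be bird here).
sudo systemctl stop bird

# If route churn is the cause, this should now show near-zero CPU.
pidstat -u -p "$(pidof ovs-vswitchd)" 1 5

sudo systemctl start bird
```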
I have the same issue: bird running BGP, and the server having a full table in the kernel is indeed the culprit.
@zejar have you found a solution to this problem? |
@ddominet I have not. Instead I switched to "regular" Linux bridges. |
If it is not necessary for OVS to be aware of these routes, the solution is to move the BGP daemon into a separate network namespace.
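A minimal sketch of that separation, assuming iproute2 and a bird daemon (the interface name, address, and config path are illustrative):

```shell
# Create a dedicated namespace for BGP and move the peering NIC into it.
sudo ip netns add bgp
sudo ip link set eth1 netns bgp

# Bring up loopback and the NIC inside the namespace.
sudo ip netns exec bgp ip link set lo up
sudo ip netns exec bgp ip link set eth1 up
sudo ip netns exec bgp ip addr add 192.0.2.2/24 dev eth1

# Run the BGP daemon inside the namespace; the full table now lives there,
# so ovs-vswitchd in the default namespace never sees the route churn.
sudo ip netns exec bgp bird -c /etc/bird/bird.conf
```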
I'm using it with kolla-ansible OpenStack; BGP is there to provide connectivity to a prefix that needs to be available within OpenStack. Is that something that will work with separate namespaces? It might be too separated. If it were a separate routing table, I could just policy-route it, but in this case I don't think so.
Unfortunately, I'm not enough of a BGP or OpenStack expert to tell whether it will work. FWIW, here is a link to a discussion of why OVS consumes a lot of CPU in this scenario: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052092.html
After reading the discussion that @igsilya linked in his last comment,
^this loop will continue for some time; you don't need to wait for it to complete to see the symptoms in the logs or in htop/top.
sudo grep blocked /var/log/openvswitch/ovs-vswitchd.log | sort -n -k2 | tail
In this example, all routing table entries sum up to 14k+:
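To get a comparable count on your own system (assuming iproute2 is installed), summing route entries over every kernel table:

```shell
# Count IPv4 and IPv6 route entries across all kernel routing tables.
ip route show table all | wc -l
ip -6 route show table all | wc -l
```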
bump? Any news? |
Unfortunately, the OVS team doesn't seem to take these issues seriously. And this one is a big issue to me, and to anyone working in an ISP environment.
Kind Regards,
Dominik
As I said, the issue doesn't have a good solution, only compromises. The reason is that Linux kernel routing change notifications are not a reliable source of information. Any application that tries to incrementally maintain the state of the routing table based on these notifications ends up implementing a huge amount of heuristics to guess which notification is good and which is not. And they are never fully correct, meaning old, already-removed entries may linger in the system, or some updates may not be delivered. The only approach that gives a correct view of the current state of a routing table is to dump it as a whole after each change, and that causes high CPU usage for obvious reasons if you have huge routing tables with a high rate of updates.

Most OVS users are not running BGP in the same network namespace as ovs-vswitchd, so it is not a problem for them. If you can change the architecture of your deployment to not run BGP in the same network namespace as OVS, that is still the preferred way to fix the problem.

One approach we can take is to delay the dumping of the routing table by a certain amount of time. This should alleviate some of the load, at the cost of not having the most up-to-date routing table for that amount of time (a few seconds?). Is that a reasonable compromise for your setup?
What exactly are these routes used for? Do you have an OVS tunnel that relies on them? An alternative, not recommended, way to resolve this issue is to just disable synchronization with the kernel routing table using a command-line argument we have for testing purposes:
I have a solution which worked for my environment.
@AswiniViswanathan I don't think the
@igsilya Yes, it's related to utilization. Give it a try.
@AswiniViswanathan it's great that it somehow worked for you, but that makes no sense to me. This option has nothing to do with routing or CPU usage, and you're suggesting adding it to a process that is not the problem in this issue. And I'm pretty sure other people in this thread already use this option, as it is the default way to run OVS services.
@igsilya Okay. Anyway, let's see if it helps anyone else.
@igsilya I just finished moving the BGP routing table into its own VRF, but the issue persists. The default routing table is now effectively empty, and the interfaces inside the OVS bridges are all part of their own dedicated VRF that only contains the on-link subnets and a default route. BGP with full tables is in its own routing table with ~2.4M entries. It seems that OVS doesn't care at all which routing table is used, or am I missing something obvious?
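For reference, a VRF setup along those lines might look roughly like this (device names and the table number are illustrative; per the maintainer's reply, this alone does not help, since OVS monitors every table):

```shell
# Create a VRF bound to routing table 100 and enslave the BGP-facing NIC.
sudo ip link add vrf-bgp type vrf table 100
sudo ip link set vrf-bgp up
sudo ip link set eth1 master vrf-bgp

# Full-table BGP routes now land in table 100 instead of the main table.
ip route show vrf vrf-bgp | head
```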
@f0o yes, OVS monitors all tables. You need a separate network namespace. |
@igsilya Gotcha, I foolishly assumed vrf would follow the same logic as netns. Let's see if I can replicate this whole thing with netns instead! |
@igsilya Couldn't get it all to work with netns, so I took the "easy" way and tried the testing option instead. I verified ovs-vswitchd was actually started with it:
@f0o , yeah, my bad. Looking through the code, OVS will not update its internal routing table if this option is enabled, but it will still dump the whole routing table from the kernel. So, yeah, it removes some of the work, but clearly not enough. This is clearly not an intended use case for this option. I'm working on a fix to only track changes from the main routing table and will post it for review, hopefully this week. It should help if you configure BGP with its own VRF. But for now, it seems, only a network namespace will work.
@igsilya Fair enough, I'm really excited for your VRF fix 😅 - Maybe, to make it more modular, add a parameter that lets you select a table instead of just defaulting to the default/local one. Could be useful sometimes. I'll continue to work towards a netns solution, just to get something into this issue that might work for others without waiting for new packages to be released.
@igsilya Although I can confirm that using netns solves the issue, there is no good way to "leak" routes across namespaces like there is with VRFs - meaning that moving traffic from OVS into the BGP netns is tedious at best and requires linknets, adding a lot more overhead than what should be a simple left-pocket -> right-pocket routing action. So I can't accept netns as a viable solution, unfortunately.
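For completeness, the linknet overhead mentioned above looks roughly like this, assuming a `bgp` namespace and RFC 5737 example subnets (all names and addresses illustrative):

```shell
# A veth "linknet" is needed to route between the default and bgp namespaces.
sudo ip link add veth-host type veth peer name veth-bgp
sudo ip link set veth-bgp netns bgp

sudo ip addr add 192.0.2.1/30 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec bgp ip addr add 192.0.2.2/30 dev veth-bgp
sudo ip netns exec bgp ip link set veth-bgp up

# Every prefix that must cross the boundary needs an explicit static route,
# unlike VRF route leaking which handles this in one table lookup.
sudo ip route add 198.51.100.0/24 via 192.0.2.2
sudo ip netns exec bgp ip route add 203.0.113.0/24 via 192.0.2.1
```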
@f0o there will be a cost of running in the same network namespace. Linux kernel doesn't provide a mechanism to receive routes for specific tables only. That means that for every BGP update ovs-vswitchd will still receive a notification. It will be parsed and discarded. This should not be very expensive to do, but it may take a bit of time. Edit: There might be a way to filter out the full dump. |
Currently, ovs-vswitchd is subscribed to all the routing changes in the kernel. On each change, it marks the internal routing table cache as invalid, then resets it and dumps all the routes from the kernel from scratch. The reason for that is kernel routing updates not being reliable in a sense that it's hard to tell which route is getting removed or modified. Userspace application has to track the order in which route entries are dumped from the kernel. Updates can get lost or even duplicated and the kernel doesn't provide a good mechanism to distinguish one route from another. To my knowledge, dumping all the routes from a kernel after each change is the only way to keep the cache consistent. Some more info can be found in the following never addressed issues:

https://bugzilla.redhat.com/1337860
https://bugzilla.redhat.com/1337855

It seems to be believed that NetworkManager "mostly" does incremental updates right. But it is still not completely correct, will re-dump the whole table in certain cases, and it takes a huge amount of very complicated code to do the accounting and route comparisons.

Going back to ovs-vswitchd, it currently dumps routes from all the routing tables. If it will get conflicting routes from multiple tables, the cache will not be useful. The routing cache in userspace is primarily used for checking the egress port for tunneled traffic and this way also detecting link state changes for a tunnel port. For userspace datapath it is used for actual routing of the packet after sending to a native tunnel. With kernel datapath we don't really have a mechanism to know which routing table will actually be used by the kernel after encapsulation, so our lookups on a cache may be incorrect because of this as well. So, unless all the relevant routes are in the standard tables, the lookup in userspace route cache is unreliable. Luckily, most setups are not using any complicated routing in non-standard tables that OVS has to be aware of.

It is possible, but unlikely, that standard routing tables are completely empty while some other custom table is not, and all the OVS tunnel traffic is directed to that table. That would be the only scenario where dumping non-standard tables would make sense. But it seems like this kind of setup will likely need a way to tell OVS from which table the routes should be taken, or we'll need to dump routing rules and keep a separate cache for each table, so we can first match on rules and then lookup correct routes in a specific table. I'm not sure if trying to implement all that is justified.

For now, stop considering routes from non-standard tables to avoid mixing different tables together and also wasting CPU resources. This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is running on a same host and in a same network namespace with OVS using its own custom routing table.

Unfortunately, there seems to be no way to tell the kernel to send updates only for particular tables. So, we'll still receive and parse all of them. But they will not result in a full cache invalidation in most cases.

Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. So, we can make use of it and dump only standard tables when we get a relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on the socket for filtering to work. There is no reason to not enable it by default, if supported. It is not used outside of NETLINK_ROUTE.

Fixes: f0e167f ("route-table: Handle route updates more robustly.")
Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb")
Reported-at: openvswitch/ovs-issues#185
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: 0-day Robot <robot@bytheb.org>
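The "custom table only" scenario described above corresponds to policy routing along these lines (addresses and the table number are illustrative), where tunnel traffic is steered into a non-standard table by an `ip rule` that OVS's userspace cache knows nothing about:

```shell
# A default route living only in a custom table, selected by a policy rule:
sudo ip route add default via 192.0.2.1 table 100
sudo ip rule add from 203.0.113.0/24 lookup 100

# The main table can stay empty while table 100 carries all tunnel traffic.
ip rule show
ip route show table 100
```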
@igsilya I'm running into an odd issue and I wonder if it's connected to this one. I can't seem to manage any bridges with ovs-vsctl anymore. I keep getting the error: Worth noting, all ovs-vsctl commands are run through ansible with a default timeout of 5s. //EDIT: Increasing the timeout worked, but I also notice a very large number of these errors in OVN: And I believe that's also caused by the full tables here... The moment I dropped this router out of the IBGP mesh, the errors stopped and ports were being added etc... So this 100% CPU utilization is not just an annoyance because it's pegging a core - it actually immobilizes OVS entirely!
@f0o I assume ovs-vswitchd reports long poll intervals, what are the largest numbers you see there? If they are 2.5+ seconds (2.5 before the connection is accepted + potentially 2.5 on the next iteration before the reply is sent), then yes, processes that do not wait longer than 5 seconds may disconnect. Some database operations may potentially wait for more than 2 poll intervals. For OVN, I assume you're using OVN 23.09 older than v23.09.1. The statctrl thread had an inactivity probe set to 5 seconds before commit ovn-org/ovn@bbd07439b . Update to latest v23.09.3 should fix the problem.
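One way to check for those long poll intervals, assuming the stock log location (adjust the path for your distro; the message pattern below is what recent OVS versions log from its poll-loop timing code):

```shell
# Show the most recent long-poll-interval warnings from ovs-vswitchd.
sudo grep -E 'Unreasonably long [0-9]+ms poll interval' \
    /var/log/openvswitch/ovs-vswitchd.log | tail
```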
@igsilya Thanks for the insight! I went ahead and built openvswitch-3.2.1 with the patch from 01f7582584354cf087924170723dd0838d8b34f3 and I can confidently say that it is no longer burning CPU nor being unresponsive. All traffic is flowing fine and ports/flows are being added/removed swiftly! I guess I'm your lab-bunny now! 🙈 |
@f0o Thanks a lot for testing! Now we just need to wait for some code review. Hopefully, it won't take too long. If you want to reply to the original patch email with a
I just ran a few tests by adding/removing flows, ports and bridges like a maniac and verifying that traffic flows while both routers remain in IBGP and EBGP with multiple full tables in different VRFs. Everything works like a breeze! ovs-vswitchd chills at 2% cpu and from the logs of OVS/OVN everything is operational and normal. I tried adding the mbox or using the mailto but all my mail clients strip away the Reply-To header "because of safety". So I'm unable to stamp it there. So please take my informal stamp of testing and approval here :) |
Great job, team! Sorry for my harsh words previously, but I'm happy that it's fixed.
Kind Regards,
Dominik
Currently, ovs-vswitchd is subscribed to all the routing changes in the kernel. On each change, it marks the internal routing table cache as invalid, then resets it and dumps all the routes from the kernel from scratch. The reason for that is kernel routing updates not being reliable in a sense that it's hard to tell which route is getting removed or modified. Userspace application has to track the order in which route entries are dumped from the kernel. Updates can get lost or even duplicated and the kernel doesn't provide a good mechanism to distinguish one route from another. To my knowledge, dumping all the routes from a kernel after each change is the only way to keep the cache consistent. Some more info can be found in the following never addressed issues: https://bugzilla.redhat.com/1337860 https://bugzilla.redhat.com/1337855 It seems to be believed that NetworkManager "mostly" does incremental updates right. But it is still not completely correct, will re-dump the whole table in certain cases, and it takes a huge amount of very complicated code to do the accounting and route comparisons. Going back to ovs-vswitchd, it currently dumps routes from all the routing tables. If it will get conflicting routes from multiple tables, the cache will not be useful. The routing cache in userspace is primarily used for checking the egress port for tunneled traffic and this way also detecting link state changes for a tunnel port. For userspace datapath it is used for actual routing of the packet after sending to a native tunnel. With kernel datapath we don't really have a mechanism to know which routing table will actually be used by the kernel after encapsulation, so our lookups on a cache may be incorrect because of this as well. So, unless all the relevant routes are in the standard tables, the lookup in userspace route cache is unreliable. Luckily, most setups are not using any complicated routing in non-standard tables that OVS has to be aware of. 
It is possible, but unlikely, that standard routing tables are completely empty while some other custom table is not, and all the OVS tunnel traffic is directed to that table. That would be the only scenario where dumping non-standard tables would make sense. But it seems like this kind of setup will likely need a way to tell OVS from which table the routes should be taken, or we'll need to dump routing rules and keep a separate cache for each table, so we can first match on rules and then lookup correct routes in a specific table. I'm not sure if trying to implement all that is justified. For now, stop considering routes from non-standard tables to avoid mixing different tables together and also wasting CPU resources. This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is running on a same host and in a same network namespace with OVS using its own custom routing table. Unfortunately, there seems to be no way to tell the kernel to send updates only for particular tables. So, we'll still receive and parse all of them. But they will not result in a full cache invalidation in most cases. Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. So, we can make use of it and dump only standard tables when we get a relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on the socket for filtering to work. There is no reason to not enable it by default, if supported. It is not used outside of NETLINK_ROUTE. Fixes: f0e167f ("route-table: Handle route updates more robustly.") Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb") Reported-at: openvswitch/ovs-issues#185 Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, ovs-vswitchd is subscribed to all the routing changes in the kernel. On each change, it marks the internal routing table cache as invalid, then resets it and dumps all the routes from the kernel from scratch. The reason for that is kernel routing updates not being reliable in a sense that it's hard to tell which route is getting removed or modified. Userspace application has to track the order in which route entries are dumped from the kernel. Updates can get lost or even duplicated and the kernel doesn't provide a good mechanism to distinguish one route from another. To my knowledge, dumping all the routes from a kernel after each change is the only way to keep the cache consistent. Some more info can be found in the following never addressed issues: https://bugzilla.redhat.com/1337860 https://bugzilla.redhat.com/1337855 It seems to be believed that NetworkManager "mostly" does incremental updates right. But it is still not completely correct, will re-dump the whole table in certain cases, and it takes a huge amount of very complicated code to do the accounting and route comparisons. Going back to ovs-vswitchd, it currently dumps routes from all the routing tables. If it will get conflicting routes from multiple tables, the cache will not be useful. The routing cache in userspace is primarily used for checking the egress port for tunneled traffic and this way also detecting link state changes for a tunnel port. For userspace datapath it is used for actual routing of the packet after sending to a native tunnel. With kernel datapath we don't really have a mechanism to know which routing table will actually be used by the kernel after encapsulation, so our lookups on a cache may be incorrect because of this as well. So, unless all the relevant routes are in the standard tables, the lookup in userspace route cache is unreliable. Luckily, most setups are not using any complicated routing in non-standard tables that OVS has to be aware of. 
It is possible, but unlikely, that standard routing tables are completely empty while some other custom table is not, and all the OVS tunnel traffic is directed to that table. That would be the only scenario where dumping non-standard tables would make sense. But it seems like this kind of setup will likely need a way to tell OVS from which table the routes should be taken, or we'll need to dump routing rules and keep a separate cache for each table, so we can first match on rules and then lookup correct routes in a specific table. I'm not sure if trying to implement all that is justified. For now, stop considering routes from non-standard tables to avoid mixing different tables together and also wasting CPU resources. This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is running on a same host and in a same network namespace with OVS using its own custom routing table. Unfortunately, there seems to be no way to tell the kernel to send updates only for particular tables. So, we'll still receive and parse all of them. But they will not result in a full cache invalidation in most cases. Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. So, we can make use of it and dump only standard tables when we get a relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on the socket for filtering to work. There is no reason to not enable it by default, if supported. It is not used outside of NETLINK_ROUTE. Fixes: f0e167f ("route-table: Handle route updates more robustly.") Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb") Reported-at: openvswitch/ovs-issues#185 Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: 0-day Robot <robot@bytheb.org>
Currently, ovs-vswitchd is subscribed to all the routing changes in the kernel. On each change, it marks the internal routing table cache as invalid, then resets it and dumps all the routes from the kernel from scratch. The reason for that is kernel routing updates not being reliable in a sense that it's hard to tell which route is getting removed or modified. Userspace application has to track the order in which route entries are dumped from the kernel. Updates can get lost or even duplicated and the kernel doesn't provide a good mechanism to distinguish one route from another. To my knowledge, dumping all the routes from a kernel after each change is the only way to keep the cache consistent. Some more info can be found in the following never addressed issues: https://bugzilla.redhat.com/1337860 https://bugzilla.redhat.com/1337855 It seems to be believed that NetworkManager "mostly" does incremental updates right. But it is still not completely correct, will re-dump the whole table in certain cases, and it takes a huge amount of very complicated code to do the accounting and route comparisons. Going back to ovs-vswitchd, it currently dumps routes from all the routing tables. If it will get conflicting routes from multiple tables, the cache will not be useful. The routing cache in userspace is primarily used for checking the egress port for tunneled traffic and this way also detecting link state changes for a tunnel port. For userspace datapath it is used for actual routing of the packet after sending to a native tunnel. With kernel datapath we don't really have a mechanism to know which routing table will actually be used by the kernel after encapsulation, so our lookups on a cache may be incorrect because of this as well. So, unless all the relevant routes are in the standard tables, the lookup in userspace route cache is unreliable. Luckily, most setups are not using any complicated routing in non-standard tables that OVS has to be aware of. 
It is possible, but unlikely, that standard routing tables are completely empty while some other custom table is not, and all the OVS tunnel traffic is directed to that table. That would be the only scenario where dumping non-standard tables would make sense. But it seems like this kind of setup will likely need a way to tell OVS from which table the routes should be taken, or we'll need to dump routing rules and keep a separate cache for each table, so we can first match on rules and then lookup correct routes in a specific table. I'm not sure if trying to implement all that is justified. For now, stop considering routes from non-standard tables to avoid mixing different tables together and also wasting CPU resources. This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is running on a same host and in a same network namespace with OVS using its own custom routing table. Unfortunately, there seems to be no way to tell the kernel to send updates only for particular tables. So, we'll still receive and parse all of them. But they will not result in a full cache invalidation in most cases. Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. So, we can make use of it and dump only standard tables when we get a relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on the socket for filtering to work. There is no reason to not enable it by default, if supported. It is not used outside of NETLINK_ROUTE. Fixes: f0e167f ("route-table: Handle route updates more robustly.") Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb") Reported-at: openvswitch/ovs-issues#185 Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, ovs-vswitchd is subscribed to all the routing changes in the kernel. On each change, it marks the internal routing table cache as invalid, then resets it and dumps all the routes from the kernel from scratch. The reason for that is kernel routing updates not being reliable in a sense that it's hard to tell which route is getting removed or modified. Userspace application has to track the order in which route entries are dumped from the kernel. Updates can get lost or even duplicated and the kernel doesn't provide a good mechanism to distinguish one route from another. To my knowledge, dumping all the routes from a kernel after each change is the only way to keep the cache consistent. Some more info can be found in the following never addressed issues: https://bugzilla.redhat.com/1337860 https://bugzilla.redhat.com/1337855 It seems to be believed that NetworkManager "mostly" does incremental updates right. But it is still not completely correct, will re-dump the whole table in certain cases, and it takes a huge amount of very complicated code to do the accounting and route comparisons. Going back to ovs-vswitchd, it currently dumps routes from all the routing tables. If it will get conflicting routes from multiple tables, the cache will not be useful. The routing cache in userspace is primarily used for checking the egress port for tunneled traffic and this way also detecting link state changes for a tunnel port. For userspace datapath it is used for actual routing of the packet after sending to a native tunnel. With kernel datapath we don't really have a mechanism to know which routing table will actually be used by the kernel after encapsulation, so our lookups on a cache may be incorrect because of this as well. So, unless all the relevant routes are in the standard tables, the lookup in userspace route cache is unreliable. Luckily, most setups are not using any complicated routing in non-standard tables that OVS has to be aware of. 
It is possible, but unlikely, that standard routing tables are completely empty while some other custom table is not, and all the OVS tunnel traffic is directed to that table. That would be the only scenario where dumping non-standard tables would make sense. But it seems like this kind of setup will likely need a way to tell OVS from which table the routes should be taken, or we'll need to dump routing rules and keep a separate cache for each table, so we can first match on rules and then lookup correct routes in a specific table. I'm not sure if trying to implement all that is justified. For now, stop considering routes from non-standard tables to avoid mixing different tables together and also wasting CPU resources. This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is running on a same host and in a same network namespace with OVS using its own custom routing table. Unfortunately, there seems to be no way to tell the kernel to send updates only for particular tables. So, we'll still receive and parse all of them. But they will not result in a full cache invalidation in most cases. Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. So, we can make use of it and dump only standard tables when we get a relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on the socket for filtering to work. There is no reason to not enable it by default, if supported. It is not used outside of NETLINK_ROUTE. Fixes: f0e167f ("route-table: Handle route updates more robustly.") Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb") Reported-at: openvswitch/ovs-issues#185 Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, ovs-vswitchd is subscribed to all the routing changes in the kernel. On each change, it marks the internal routing table cache as invalid, then resets it and dumps all the routes from the kernel from scratch. The reason for that is kernel routing updates not being reliable in a sense that it's hard to tell which route is getting removed or modified. Userspace application has to track the order in which route entries are dumped from the kernel. Updates can get lost or even duplicated and the kernel doesn't provide a good mechanism to distinguish one route from another. To my knowledge, dumping all the routes from a kernel after each change is the only way to keep the cache consistent. Some more info can be found in the following never addressed issues: https://bugzilla.redhat.com/1337860 https://bugzilla.redhat.com/1337855 It seems to be believed that NetworkManager "mostly" does incremental updates right. But it is still not completely correct, will re-dump the whole table in certain cases, and it takes a huge amount of very complicated code to do the accounting and route comparisons. Going back to ovs-vswitchd, it currently dumps routes from all the routing tables. If it will get conflicting routes from multiple tables, the cache will not be useful. The routing cache in userspace is primarily used for checking the egress port for tunneled traffic and this way also detecting link state changes for a tunnel port. For userspace datapath it is used for actual routing of the packet after sending to a native tunnel. With kernel datapath we don't really have a mechanism to know which routing table will actually be used by the kernel after encapsulation, so our lookups on a cache may be incorrect because of this as well. So, unless all the relevant routes are in the standard tables, the lookup in userspace route cache is unreliable. Luckily, most setups are not using any complicated routing in non-standard tables that OVS has to be aware of. 
It is possible, but unlikely, that standard routing tables are completely empty while some other custom table is not, and all the OVS tunnel traffic is directed to that table. That would be the only scenario where dumping non-standard tables would make sense. But it seems like this kind of setup will likely need a way to tell OVS from which table the routes should be taken, or we'll need to dump routing rules and keep a separate cache for each table, so we can first match on rules and then lookup correct routes in a specific table. I'm not sure if trying to implement all that is justified. For now, stop considering routes from non-standard tables to avoid mixing different tables together and also wasting CPU resources. This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is running on a same host and in a same network namespace with OVS using its own custom routing table. Unfortunately, there seems to be no way to tell the kernel to send updates only for particular tables. So, we'll still receive and parse all of them. But they will not result in a full cache invalidation in most cases. Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. So, we can make use of it and dump only standard tables when we get a relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on the socket for filtering to work. There is no reason to not enable it by default, if supported. It is not used outside of NETLINK_ROUTE. Fixes: f0e167f ("route-table: Handle route updates more robustly.") Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb") Reported-at: openvswitch/ovs-issues#185 Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
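For reference, the filtered RTM_GETROUTE dump that the commit message mentions can be sketched with a raw netlink socket. This is a minimal illustration, assuming a Linux kernel (filtering requires >= 4.20); the constants come from `linux/netlink.h` and `linux/rtnetlink.h`, and `dump_table` is a name invented for this sketch:

```python
import socket
import struct

SOL_NETLINK = 270
NETLINK_GET_STRICT_CHK = 12            # linux/netlink.h, kernel >= 4.20
RTM_GETROUTE, RTM_NEWROUTE, NLMSG_DONE = 26, 24, 3
NLM_F_REQUEST, NLM_F_DUMP = 0x01, 0x300
RT_TABLE_MAIN = 254

def dump_table(table=RT_TABLE_MAIN, family=socket.AF_INET):
    """Count routes in one table via a filtered RTM_GETROUTE dump."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                         socket.NETLINK_ROUTE)
    try:
        # Enable strict checking so the kernel honors the rtm_table filter
        # below; without it, the dump returns routes from every table.
        try:
            sock.setsockopt(SOL_NETLINK, NETLINK_GET_STRICT_CHK, 1)
        except OSError:
            pass  # pre-4.20 kernel: the dump will simply be unfiltered
        # struct rtmsg with only rtm_family and rtm_table set.
        rtmsg = struct.pack("=BBBBBBBBI", family, 0, 0, 0, table, 0, 0, 0, 0)
        nlh = struct.pack("=IHHII", 16 + len(rtmsg), RTM_GETROUTE,
                          NLM_F_REQUEST | NLM_F_DUMP, 1, 0)
        sock.send(nlh + rtmsg)
        routes = 0
        while True:
            data = sock.recv(65536)
            off = 0
            while off + 16 <= len(data):
                ln, ty, _, _, _ = struct.unpack_from("=IHHII", data, off)
                if ty == NLMSG_DONE:
                    return routes
                if ty == RTM_NEWROUTE:
                    routes += 1
                off += (ln + 3) & ~3   # netlink messages are 4-byte aligned
    finally:
        sock.close()

if __name__ == "__main__":
    print("IPv4 routes in the main table:", dump_table())
```

The point of the filter is that even with a full BGP table in some custom table, this dump only walks the standard table's handful of entries instead of the whole feed.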
@f0o, @ddominet, the patch is now applied to all branches down to 2.17 and will be part of the next set of stable releases. Not sure when distributions will pick those up. With the change you should be able to run BGP in a separate VRF without significant impact on OVS CPU usage, but running with the main routing table will still be problematic. Closing this issue for now. |
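As a rough illustration of the separate-VRF suggestion above — device and table names here (`vrf-bgp`, `eth1`, table `1000`) are hypothetical, and the bird2 fragment is only a sketch to adapt to your own configuration:

```shell
# Put the BGP daemon's routes into a dedicated VRF table so they never
# land in the main table that OVS watches:
ip link add dev vrf-bgp type vrf table 1000
ip link set dev vrf-bgp up
ip link set dev eth1 master vrf-bgp   # uplink carrying the BGP sessions

# Then point bird2's kernel protocol at that table (sketch only):
#   protocol kernel {
#       vrf "vrf-bgp";
#       kernel table 1000;
#       ipv6 { export all; };
#   }
```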
Amazing job @igsilya! Thank you so much! |
As the title says, I'm running Open vSwitch on a Debian box that is configured to be my router.
The box receives a full routing table (IPv6) via Bird2, and as soon as the full table starts coming in, OVS starts utilizing 100% of my CPU.
I let it run overnight, but 12 hours later it was still at 100%.
The program utilizing the CPU is "ovs-vswitchd".
This issue can be reproduced by setting up a BGP session with a full routing table and simply installing the `openvswitch-switch` package. No configuration is needed for OVS; it will consume the entire CPU whenever the full routing table is present on the machine and OVS is running.

Debian version:
Linux rtr1 5.6.0-1-cloud-amd64 #1 SMP Debian 5.6.7-1 (2020-04-29) x86_64 GNU/Linux
Bird version:
BIRD 2.0.7
OVS version:
ovs-vswitchd (Open vSwitch) 2.13.0
The `/var/log/openvswitch/ovs-vswitchd.log` file shows this: