OVS with a full-routing table utilizing 100% cpu #185

Closed
zejar opened this issue May 20, 2020 · 33 comments

@zejar

zejar commented May 20, 2020

As the title says, I'm running Open vSwitch on a Debian box that is configured to be my router.
The box receives a full routing table (IPv6) via Bird 2, and as soon as the full table starts coming in, OVS starts using 100% of my CPU.
I let it run overnight, but 12 hours later it was still at 100%.
The process using the CPU is "ovs-vswitchd".

This issue can be reproduced by setting up a BGP session with a full routing table and simply installing the openvswitch-switch package. No OVS configuration is needed; it will consume the entire CPU as long as the full routing table is present on the machine and OVS is running.

Debian version: Linux rtr1 5.6.0-1-cloud-amd64 #1 SMP Debian 5.6.7-1 (2020-04-29) x86_64 GNU/Linux
Bird version: BIRD 2.0.7
OVS version: ovs-vswitchd (Open vSwitch) 2.13.0

The /var/log/openvswitch/ovs-vswitchd.log file shows this:

2020-05-20T08:52:45.118Z|00155|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:45.542Z|00797|timeval|WARN|Unreasonably long 1424ms poll interval (668ms user, 736ms system)
2020-05-20T08:52:45.542Z|00798|timeval|WARN|context switches: 0 voluntary, 444 involuntary
2020-05-20T08:52:46.560Z|00156|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:52:47.037Z|00799|timeval|WARN|Unreasonably long 1477ms poll interval (749ms user, 699ms system)
2020-05-20T08:52:47.037Z|00800|timeval|WARN|faults: 256 minor, 0 major
2020-05-20T08:52:47.037Z|00801|timeval|WARN|context switches: 0 voluntary, 433 involuntary
2020-05-20T08:52:48.053Z|00157|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:48.507Z|00802|timeval|WARN|Unreasonably long 1454ms poll interval (576ms user, 849ms system)
2020-05-20T08:52:48.507Z|00803|timeval|WARN|context switches: 0 voluntary, 438 involuntary
2020-05-20T08:52:49.523Z|00158|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:50.060Z|00804|timeval|WARN|Unreasonably long 1536ms poll interval (651ms user, 784ms system)
2020-05-20T08:52:50.060Z|00805|timeval|WARN|faults: 256 minor, 0 major
2020-05-20T08:52:50.060Z|00806|timeval|WARN|context switches: 0 voluntary, 427 involuntary
2020-05-20T08:52:50.078Z|00807|poll_loop|INFO|Dropped 4 log messages in last 4 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:52:50.078Z|00808|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (95% CPU usage)
2020-05-20T08:52:51.077Z|00159|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:51.551Z|00809|timeval|WARN|Unreasonably long 1473ms poll interval (670ms user, 772ms system)
2020-05-20T08:52:51.551Z|00810|timeval|WARN|context switches: 0 voluntary, 442 involuntary
2020-05-20T08:52:52.567Z|00160|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:53.033Z|00811|timeval|WARN|Unreasonably long 1467ms poll interval (691ms user, 743ms system)
2020-05-20T08:52:53.033Z|00812|timeval|WARN|context switches: 0 voluntary, 444 involuntary
2020-05-20T08:52:54.050Z|00161|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:54.505Z|00813|timeval|WARN|Unreasonably long 1455ms poll interval (675ms user, 751ms system)
2020-05-20T08:52:54.505Z|00814|timeval|WARN|context switches: 0 voluntary, 410 involuntary
2020-05-20T08:52:55.523Z|00162|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:56.013Z|00815|timeval|WARN|Unreasonably long 1490ms poll interval (762ms user, 701ms system)
2020-05-20T08:52:56.013Z|00816|timeval|WARN|context switches: 0 voluntary, 454 involuntary
2020-05-20T08:52:56.029Z|00817|poll_loop|INFO|Dropped 3 log messages in last 4 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:52:56.029Z|00818|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (<->/var/run/openvswitch/db.sock) at ../lib/stream-fd.c:157 (97% CPU usage)
2020-05-20T08:52:57.029Z|00163|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:57.475Z|00819|timeval|WARN|Unreasonably long 1445ms poll interval (677ms user, 748ms system)
2020-05-20T08:52:57.475Z|00820|timeval|WARN|context switches: 0 voluntary, 440 involuntary
2020-05-20T08:52:58.492Z|00164|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:52:58.965Z|00821|timeval|WARN|Unreasonably long 1474ms poll interval (612ms user, 831ms system)
2020-05-20T08:52:58.965Z|00822|timeval|WARN|context switches: 0 voluntary, 434 involuntary
2020-05-20T08:52:59.986Z|00165|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:00.489Z|00823|timeval|WARN|Unreasonably long 1503ms poll interval (708ms user, 689ms system)
2020-05-20T08:53:00.489Z|00824|timeval|WARN|faults: 256 minor, 0 major
2020-05-20T08:53:00.489Z|00825|timeval|WARN|context switches: 0 voluntary, 401 involuntary
2020-05-20T08:53:01.786Z|00166|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:02.232Z|00826|timeval|WARN|Unreasonably long 1454ms poll interval (617ms user, 813ms system)
2020-05-20T08:53:02.232Z|00827|timeval|WARN|context switches: 0 voluntary, 425 involuntary
2020-05-20T08:53:02.236Z|00828|poll_loop|INFO|Dropped 5 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:02.237Z|00829|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (88% CPU usage)
2020-05-20T08:53:03.236Z|00167|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:03.670Z|00830|timeval|WARN|Unreasonably long 1434ms poll interval (672ms user, 742ms system)
2020-05-20T08:53:03.670Z|00831|timeval|WARN|context switches: 0 voluntary, 421 involuntary
2020-05-20T08:53:04.683Z|00168|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:05.162Z|00832|timeval|WARN|Unreasonably long 1480ms poll interval (633ms user, 825ms system)
2020-05-20T08:53:05.162Z|00833|timeval|WARN|context switches: 0 voluntary, 437 involuntary
2020-05-20T08:53:06.182Z|00169|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:06.637Z|00834|timeval|WARN|Unreasonably long 1455ms poll interval (735ms user, 697ms system)
2020-05-20T08:53:06.637Z|00835|timeval|WARN|context switches: 0 voluntary, 410 involuntary
2020-05-20T08:53:07.654Z|00170|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:08.092Z|00836|timeval|WARN|Unreasonably long 1438ms poll interval (630ms user, 787ms system)
2020-05-20T08:53:08.092Z|00837|timeval|WARN|faults: 243 minor, 0 major
2020-05-20T08:53:08.092Z|00838|timeval|WARN|context switches: 0 voluntary, 402 involuntary
2020-05-20T08:53:08.109Z|00839|poll_loop|INFO|Dropped 4 log messages in last 4 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:08.110Z|00840|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (97% CPU usage)
2020-05-20T08:53:09.109Z|00171|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:09.599Z|00841|timeval|WARN|Unreasonably long 1490ms poll interval (702ms user, 718ms system)
2020-05-20T08:53:09.599Z|00842|timeval|WARN|context switches: 0 voluntary, 399 involuntary
2020-05-20T08:53:10.617Z|00172|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:11.077Z|00843|timeval|WARN|Unreasonably long 1461ms poll interval (662ms user, 749ms system)
2020-05-20T08:53:11.078Z|00844|timeval|WARN|context switches: 0 voluntary, 394 involuntary
2020-05-20T08:53:12.093Z|00173|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:12.544Z|00845|timeval|WARN|Unreasonably long 1450ms poll interval (726ms user, 704ms system)
2020-05-20T08:53:12.544Z|00846|timeval|WARN|context switches: 0 voluntary, 403 involuntary
2020-05-20T08:53:13.560Z|00174|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:13.984Z|00847|timeval|WARN|Unreasonably long 1424ms poll interval (648ms user, 757ms system)
2020-05-20T08:53:13.984Z|00848|timeval|WARN|context switches: 0 voluntary, 399 involuntary
2020-05-20T08:53:14.002Z|00849|poll_loop|INFO|Dropped 4 log messages in last 4 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:14.002Z|00850|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (96% CPU usage)
2020-05-20T08:53:15.002Z|00175|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:15.419Z|00851|timeval|WARN|Unreasonably long 1417ms poll interval (699ms user, 701ms system)
2020-05-20T08:53:15.419Z|00852|timeval|WARN|context switches: 0 voluntary, 391 involuntary
2020-05-20T08:53:16.437Z|00176|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:16.874Z|00853|timeval|WARN|Unreasonably long 1437ms poll interval (623ms user, 796ms system)
2020-05-20T08:53:16.874Z|00854|timeval|WARN|context switches: 0 voluntary, 384 involuntary
2020-05-20T08:53:17.891Z|00177|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:18.348Z|00855|timeval|WARN|Unreasonably long 1458ms poll interval (698ms user, 731ms system)
2020-05-20T08:53:18.348Z|00856|timeval|WARN|context switches: 0 voluntary, 427 involuntary
2020-05-20T08:53:19.366Z|00178|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:19.883Z|00857|timeval|WARN|Unreasonably long 1517ms poll interval (689ms user, 736ms system)
2020-05-20T08:53:19.883Z|00858|timeval|WARN|context switches: 0 voluntary, 402 involuntary
2020-05-20T08:53:20.901Z|00179|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:21.362Z|00859|timeval|WARN|Unreasonably long 1462ms poll interval (683ms user, 750ms system)
2020-05-20T08:53:21.362Z|00860|timeval|WARN|faults: 128 minor, 0 major
2020-05-20T08:53:21.362Z|00861|timeval|WARN|context switches: 0 voluntary, 396 involuntary
2020-05-20T08:53:21.379Z|00862|poll_loop|INFO|Dropped 5 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:21.379Z|00863|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (97% CPU usage)
2020-05-20T08:53:22.380Z|00180|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:22.820Z|00864|timeval|WARN|Unreasonably long 1441ms poll interval (612ms user, 808ms system)
2020-05-20T08:53:22.820Z|00865|timeval|WARN|context switches: 0 voluntary, 387 involuntary
2020-05-20T08:53:23.838Z|00181|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:24.339Z|00866|timeval|WARN|Unreasonably long 1501ms poll interval (670ms user, 810ms system)
2020-05-20T08:53:24.339Z|00867|timeval|WARN|context switches: 0 voluntary, 422 involuntary
2020-05-20T08:53:25.356Z|00182|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:25.861Z|00868|timeval|WARN|Unreasonably long 1505ms poll interval (744ms user, 734ms system)
2020-05-20T08:53:25.861Z|00869|timeval|WARN|context switches: 0 voluntary, 413 involuntary
2020-05-20T08:53:25.861Z|00870|coverage|INFO|Dropped 39 log messages in last 59 seconds (most recently, 2 seconds ago) due to excessive rate
2020-05-20T08:53:25.861Z|00871|coverage|INFO|Skipping details of duplicate event coverage for hash=b67da075
2020-05-20T08:53:26.880Z|00183|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:27.294Z|00872|timeval|WARN|Unreasonably long 1415ms poll interval (647ms user, 749ms system)
2020-05-20T08:53:27.294Z|00873|timeval|WARN|context switches: 0 voluntary, 380 involuntary
2020-05-20T08:53:27.312Z|00874|poll_loop|INFO|Dropped 4 log messages in last 4 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:27.312Z|00875|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (98% CPU usage)
2020-05-20T08:53:28.312Z|00184|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:28.738Z|00876|timeval|WARN|Unreasonably long 1426ms poll interval (692ms user, 717ms system)
2020-05-20T08:53:28.738Z|00877|timeval|WARN|context switches: 0 voluntary, 394 involuntary
2020-05-20T08:53:29.758Z|00185|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:30.334Z|00878|timeval|WARN|Unreasonably long 1577ms poll interval (693ms user, 783ms system)
2020-05-20T08:53:30.334Z|00879|timeval|WARN|context switches: 0 voluntary, 400 involuntary
2020-05-20T08:53:31.351Z|00186|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:31.779Z|00880|timeval|WARN|Unreasonably long 1428ms poll interval (663ms user, 747ms system)
2020-05-20T08:53:31.779Z|00881|timeval|WARN|context switches: 0 voluntary, 394 involuntary
2020-05-20T08:53:32.797Z|00187|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:33.281Z|00882|timeval|WARN|Unreasonably long 1484ms poll interval (711ms user, 751ms system)
2020-05-20T08:53:33.281Z|00883|timeval|WARN|context switches: 0 voluntary, 396 involuntary
2020-05-20T08:53:33.298Z|00884|poll_loop|INFO|Dropped 3 log messages in last 4 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:33.298Z|00885|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (<->/var/run/openvswitch/db.sock) at ../lib/stream-fd.c:157 (96% CPU usage)
2020-05-20T08:53:34.298Z|00188|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:34.752Z|00886|timeval|WARN|Unreasonably long 1453ms poll interval (629ms user, 805ms system)
2020-05-20T08:53:34.752Z|00887|timeval|WARN|context switches: 0 voluntary, 398 involuntary
2020-05-20T08:53:35.771Z|00189|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:36.252Z|00888|timeval|WARN|Unreasonably long 1482ms poll interval (625ms user, 837ms system)
2020-05-20T08:53:36.252Z|00889|timeval|WARN|context switches: 0 voluntary, 417 involuntary
2020-05-20T08:53:37.270Z|00190|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:37.689Z|00890|timeval|WARN|Unreasonably long 1419ms poll interval (637ms user, 764ms system)
2020-05-20T08:53:37.689Z|00891|timeval|WARN|context switches: 0 voluntary, 390 involuntary
2020-05-20T08:53:38.707Z|00191|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:39.127Z|00892|timeval|WARN|Unreasonably long 1421ms poll interval (689ms user, 713ms system)
2020-05-20T08:53:39.127Z|00893|timeval|WARN|context switches: 0 voluntary, 397 involuntary
2020-05-20T08:53:39.146Z|00894|poll_loop|INFO|Dropped 4 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:39.146Z|00895|poll_loop|INFO|wakeup due to 18-ms timeout at ../ofproto/ofproto-dpif.c:1940 (99% CPU usage)
2020-05-20T08:53:40.241Z|00192|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:40.643Z|00896|timeval|WARN|Unreasonably long 1405ms poll interval (655ms user, 735ms system)
2020-05-20T08:53:40.643Z|00897|timeval|WARN|context switches: 0 voluntary, 384 involuntary
2020-05-20T08:53:41.646Z|00193|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:42.091Z|00898|timeval|WARN|Unreasonably long 1445ms poll interval (684ms user, 743ms system)
2020-05-20T08:53:42.091Z|00899|timeval|WARN|context switches: 0 voluntary, 400 involuntary
2020-05-20T08:53:43.106Z|00194|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:43.575Z|00900|timeval|WARN|Unreasonably long 1470ms poll interval (699ms user, 751ms system)
2020-05-20T08:53:43.575Z|00901|timeval|WARN|context switches: 0 voluntary, 393 involuntary
2020-05-20T08:53:44.594Z|00195|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:45.037Z|00902|timeval|WARN|Unreasonably long 1444ms poll interval (587ms user, 834ms system)
2020-05-20T08:53:45.038Z|00903|timeval|WARN|context switches: 0 voluntary, 408 involuntary
2020-05-20T08:53:45.055Z|00904|poll_loop|INFO|Dropped 5 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:45.055Z|00905|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (97% CPU usage)
2020-05-20T08:53:46.055Z|00196|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:46.508Z|00906|timeval|WARN|Unreasonably long 1453ms poll interval (680ms user, 747ms system)
2020-05-20T08:53:46.508Z|00907|timeval|WARN|context switches: 0 voluntary, 402 involuntary
2020-05-20T08:53:47.524Z|00197|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:47.954Z|00908|timeval|WARN|Unreasonably long 1430ms poll interval (653ms user, 760ms system)
2020-05-20T08:53:47.954Z|00909|timeval|WARN|context switches: 0 voluntary, 394 involuntary
2020-05-20T08:53:48.973Z|00198|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:49.479Z|00910|timeval|WARN|Unreasonably long 1506ms poll interval (744ms user, 701ms system)
2020-05-20T08:53:49.479Z|00911|timeval|WARN|context switches: 0 voluntary, 416 involuntary
2020-05-20T08:53:50.496Z|00199|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:50.989Z|00912|timeval|WARN|Unreasonably long 1493ms poll interval (708ms user, 717ms system)
2020-05-20T08:53:50.989Z|00913|timeval|WARN|context switches: 0 voluntary, 387 involuntary
2020-05-20T08:53:51.005Z|00914|poll_loop|INFO|Dropped 4 log messages in last 4 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-20T08:53:51.005Z|00915|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (NETLINK_ROUTE<->NETLINK_ROUTE) at ../lib/netlink-socket.c:1401 (97% CPU usage)
2020-05-20T08:53:52.005Z|00200|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:52.433Z|00916|timeval|WARN|Unreasonably long 1427ms poll interval (652ms user, 755ms system)
2020-05-20T08:53:52.433Z|00917|timeval|WARN|context switches: 0 voluntary, 394 involuntary
2020-05-20T08:53:53.453Z|00201|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2020-05-20T08:53:53.893Z|00918|timeval|WARN|Unreasonably long 1440ms poll interval (688ms user, 727ms system)
2020-05-20T08:53:53.893Z|00919|timeval|WARN|context switches: 0 voluntary, 405 involuntary
2020-05-20T08:53:54.909Z|00202|ovs_rcu(urcu3)|WARN|blocked 1001 ms waiting for main to quiesce
2020-05-20T08:53:55.340Z|00920|timeval|WARN|Unreasonably long 1431ms poll interval (665ms user, 744ms system)
2020-05-20T08:53:55.340Z|00921|timeval|WARN|context switches: 0 voluntary, 402 involuntary
2020-05-20T08:53:56.356Z|00203|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
@AswiniViswanathan

I am also seeing this issue. Can anyone point to a solution?

@blp

blp commented Jul 15, 2021

It looks like there are lots of changes in the routing table, which isn't too surprising given that the system is running BGP. OVS tries to keep up with these. I'm surprised it's this expensive.

If you stop the BGP daemon temporarily, does OVS CPU usage drop to near-zero? If so, then that would confirm the root of the issue and we can look into how to avoid the high CPU for this case.
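
For anyone wanting to run that check, a minimal sketch (assuming Bird's systemd unit is simply named "bird"; the unit name may differ per distro):

  # Stop the BGP daemon and watch whether ovs-vswitchd settles down.
  sudo systemctl stop bird
  top -p "$(pidof ovs-vswitchd)"
  # Re-enable BGP afterwards.
  sudo systemctl start bird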

@ddominet

It looks like there are lots of changes in the routing table, which isn't too surprising given that the system is running BGP. OVS tries to keep up with these. I'm surprised it's this expensive.

If you stop the BGP daemon temporarily, does OVS CPU usage drop to near-zero? If so, then that would confirm the root of the issue and we can look into how to avoid the high CPU for this case.

I have the same issue: Bird running BGP, and the server having a full table in the kernel, is indeed the culprit.
How can we make Open vSwitch behave normally in such a scenario?

@ddominet

@zejar have you found a solution to this problem?

@zejar
Author

zejar commented Sep 27, 2023

@ddominet I have not. Instead I switched to "regular" Linux bridges.

@igsilya
Member

igsilya commented Sep 27, 2023

If it is not necessary for OVS to be aware of these routes, the solution is to move the BGP daemon into a separate network namespace.
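
For illustration, a rough sketch of that layout; the namespace name, uplink name (eth1), address, and Bird paths below are placeholders, not a drop-in configuration:

  # Create a routing namespace and move the BGP-facing uplink into it.
  sudo ip netns add bgp
  sudo ip link set dev eth1 netns bgp
  sudo ip netns exec bgp ip link set dev eth1 up
  sudo ip netns exec bgp ip addr add 198.51.100.2/24 dev eth1
  # Run Bird inside that namespace so the full table never lands in the
  # namespace that ovs-vswitchd is watching.
  sudo ip netns exec bgp bird -c /etc/bird/bird-bgp.conf -s /run/bird-bgp.ctl

With this split, the route churn stays inside the "bgp" namespace and ovs-vswitchd never sees the netlink updates.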

@ddominet

ddominet commented Sep 27, 2023

@igsilya

I'm using it with Kolla Ansible OpenStack. BGP is there to provide connectivity to a prefix that needs to be available within OpenStack. Is that something that will work with separate namespaces? It might be too separated. If it were a separate routing table, then I could just policy-route it; in this case I don't think so.

@igsilya
Member

igsilya commented Sep 27, 2023

Unfortunately, I'm not enough of a BGP or OpenStack expert to tell whether it will work.

FWIW, here is a link to the discussion of why OVS consumes a lot of CPU in this scenario: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052092.html

@wilkmar

wilkmar commented Oct 19, 2023

After reading the discussion that @igsilya linked in his last comment,
I was able to recreate the issue on pure Ubuntu 20.04.6 LTS, kernel 5.4.0-162-generic, openvswitch-switch 2.13.8-0ubuntu1.2, 1 vCPU, without running a BGP daemon at all. The issue is purely related to a massive stream of routing table updates. One possible trigger is a BGP full-table sync; another example is as follows (without BGP/Bird running):

  1. sudo apt install openvswitch-switch
  2. for i in {10..25}; do for k in {0..20}; do for j in {1..254}; do sudo ip route add 10.1$i.$k.$j/32 via <gateway>; done; done; done

This loop will keep running for some time; there is no need to wait for it to complete to see the symptoms in the logs or in htop/top.

  3. Wait a few minutes and keep watching /var/log/openvswitch/ovs-vswitchd.log:
    sudo grep long /var/log/openvswitch/ovs-vswitchd.log | sort -n -k3 | tail
    2023-10-18T15:08:01.173Z|00282|timeval|WARN|Unreasonably long 1518ms poll interval (337ms user, 313ms system)
    2023-10-18T15:08:05.703Z|00289|timeval|WARN|Unreasonably long 1521ms poll interval (287ms user, 368ms system)
    2023-10-18T15:07:57.995Z|00278|timeval|WARN|Unreasonably long 1523ms poll interval (296ms user, 367ms system)
    2023-10-18T15:07:42.543Z|00255|timeval|WARN|Unreasonably long 1539ms poll interval (355ms user, 309ms system)
    2023-10-18T15:07:26.212Z|00230|timeval|WARN|Unreasonably long 1542ms poll interval (258ms user, 404ms system)
    2023-10-18T15:08:02.744Z|00285|timeval|WARN|Unreasonably long 1547ms poll interval (299ms user, 365ms system)
    2023-10-18T15:08:10.185Z|00295|timeval|WARN|Unreasonably long 1551ms poll interval (268ms user, 398ms system)
    2023-10-18T15:07:59.629Z|00280|timeval|WARN|Unreasonably long 1614ms poll interval (307ms user, 428ms system)
    2023-10-18T15:07:38.103Z|00246|timeval|WARN|Unreasonably long 1648ms poll interval (338ms user, 376ms system)
    2023-10-18T15:08:12.263Z|00297|timeval|WARN|Unreasonably long 2050ms poll interval (374ms user, 497ms system)

sudo grep blocked /var/log/openvswitch/ovs-vswitchd.log | sort -n -k2 | tail
2023-10-18T15:06:45.577Z|00019|ovs_rcu(urcu2)|WARN|blocked 1006 ms waiting for main to quiesce
2023-10-18T15:06:54.166Z|00027|ovs_rcu(urcu2)|WARN|blocked 1006 ms waiting for main to quiesce
2023-10-18T15:07:00.512Z|00033|ovs_rcu(urcu2)|WARN|blocked 1006 ms waiting for main to quiesce
2023-10-18T15:07:04.773Z|00037|ovs_rcu(urcu2)|WARN|blocked 1006 ms waiting for main to quiesce
2023-10-18T15:07:14.049Z|00045|ovs_rcu(urcu2)|WARN|blocked 1006 ms waiting for main to quiesce
2023-10-18T15:07:45.074Z|00067|ovs_rcu(urcu2)|WARN|blocked 1006 ms waiting for main to quiesce
2023-10-18T15:08:00.661Z|00078|ovs_rcu(urcu2)|WARN|blocked 1006 ms waiting for main to quiesce
2023-10-18T15:07:05.854Z|00038|ovs_rcu(urcu2)|WARN|blocked 1007 ms waiting for main to quiesce
2023-10-18T15:07:56.270Z|00075|ovs_rcu(urcu2)|WARN|blocked 1007 ms waiting for main to quiesce
2023-10-18T15:08:12.214Z|00086|ovs_rcu(urcu2)|WARN|blocked 2004 ms waiting for main to quiesce

In this example, all routing table entries sum up to 14k+:
sudo ip r l table 0 | wc -l
14650

@f0o

f0o commented Mar 14, 2024

bump?

Any news?

@ddominet

ddominet commented Mar 14, 2024 via email

@igsilya
Member

igsilya commented Mar 14, 2024

As I said, the issue doesn't have a good solution, only compromises. The reason is that Linux kernel routing change notifications are not a reliable source of information. Any application that tries to incrementally maintain the state of the routing table based on these notifications ends up implementing a huge number of heuristics to guess which notifications are good and which are not. And they are never fully correct, meaning that already-removed entries may linger in the system or some updates may not be delivered.
See:
https://bugzilla.redhat.com/show_bug.cgi?id=1337855
https://bugzilla.redhat.com/show_bug.cgi?id=1722728
thom311/libnl#226
thom311/libnl#224

The only approach that gives a correct view of the current state of a routing table is to dump it as a whole after each change, and that causes high CPU usage, for obvious reasons, if you have huge routing tables with a high rate of updates.

Most OVS users are not running BGP in the same network namespace as ovs-vswitchd, so it is not a problem for them. If you can change the architecture of your deployment to not run BGP in the same network namespace as OVS, that is still the preferred approach to fix the problem.

One approach we can take is to delay the dumping of the routing table by a certain amount of time. This should alleviate some of the load, at the cost of not having the most up-to-date routing table for that amount of time (a few seconds?). Is that a reasonable compromise for your setup?

BGP is there to provide connectivity to a prefix that needs to be available within OpenStack

What exactly are these routes used for? Do you have an OVS tunnel that relies on these routes?

An alternative and not recommended way to resolve this issue is to just disable synchronization with the kernel routing table via a command-line argument we have for testing purposes: --disable-system-route. Some OVS functionality will stop working. Most notably, OVS will not be able to determine link state changes for OVS tunnel ports, because it will not be able to look up the route and determine which interface the traffic will go out of after encapsulation. The tunnels themselves will keep working just fine, since they operate inside the kernel. Some other functionality, like checking the route table with ovs-appctl commands, will not work either, for obvious reasons. And this will completely break tunnels in the userspace datapath, but I'm assuming you're using the Linux kernel datapath. I may be missing a few things that will stop working, but I listed the most notable ones I remember.
This option is not officially supported in any form, but you can try it if it is the only option for your setup.

@AswiniViswanathan

I have a solution which worked in my environment.
While running the ovsdb process, try adding this daemon option:
--detach
Runs ovsdb-server as a background process. The process forks,
and in the child it starts a new session, closes the standard
file descriptors (which has the side effect of disabling logging
to the console), and changes its current directory to the root
(unless --no-chdir is specified). After the child completes its
initialization, the parent exits. ovsdb-server detaches only
after it starts listening on all configured remotes. At this
point, all standalone and active-backup databases are ready for
use. Clustered databases only become ready for use after they
finish joining their clusters (which could have already happened
in previous runs of ovsdb-server).
More information at this link: https://www.openvswitch.org//support/dist-docs/ovsdb-server.1.html
@igsilya @ddominet @wilkmar @zejar

@igsilya
Member

igsilya commented Mar 14, 2024

@AswiniViswanathan I don't think the --detach option on the ovsdb process is related to this issue.

@AswiniViswanathan

@igsilya Yes, it's related to utilization.
I faced the issue below, and after adding the detach option it fixed the problem.
2020-05-20T08:53:06.637Z|00834|timeval|WARN|Unreasonably long 1455ms poll interval (735ms user, 697ms system)

Give it a try.

@igsilya
Member

igsilya commented Mar 14, 2024

@AswiniViswanathan it's great that it somehow worked for you, but that makes no sense to me. This option has nothing to do with routing or CPU usage, and you're suggesting adding it to a process that is not the problem in this issue. And I'm pretty sure that other people in this thread already use this option, as it is the default way to run OVS services.

@AswiniViswanathan

@igsilya Okay. Anyway, let's see if it helps anyone else.

@f0o

f0o commented Mar 14, 2024

@igsilya I just finished moving the BGP routing table into its own VRF, but the issue persists.

The default routing table is now effectively empty, and the interfaces inside the OVS bridges are all part of their own dedicated VRF that only contains the on-link subnets and a default route.

BGP with full tables is in its own routing table with ~2.4M entries.

It seems that OVS doesn't care at all which routing table is used, or am I missing something obvious?

@igsilya
Member

igsilya commented Mar 14, 2024

@f0o yes, OVS monitors all tables. You need a separate network namespace.
There was a suggestion in the past to only dump the default one; we may explore this option.

@f0o

f0o commented Mar 14, 2024

@igsilya Gotcha, I foolishly assumed vrf would follow the same logic as netns.

Let's see if I can replicate this whole thing with netns instead!

@f0o

f0o commented Mar 14, 2024

@igsilya Couldn't get it all to work with netns, so I took the "easy" way and tried --disable-system-route; however, that parameter does not change the situation.

I verified ovs-vswitchd was actually started with it:

ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --disable-system-route --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

@igsilya
Member

igsilya commented Mar 14, 2024

@f0o, yeah, my bad. Looking through the code, OVS will not update its internal routing table if this option is enabled, but it will still dump the whole routing table from the kernel. So, yeah, it removes some of the work, but clearly not enough. This is not an intended use case for this option.

I'm working on a fix to only track changes from the main routing table; I will post it for review hopefully this week. It should help if you configure BGP with its own VRF. But for now, it seems, only a network namespace will work.
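
For reference, a rough sketch of the kind of VRF layout this refers to; the interface name, VRF name, and table number are illustrative:

  # Put the BGP-facing uplink into a VRF backed by routing table 100.
  sudo ip link add vrf-bgp type vrf table 100
  sudo ip link set dev vrf-bgp up
  sudo ip link set dev eth1 master vrf-bgp
  # Bird's kernel protocol is then pointed at that table (e.g. "kernel table 100;")
  # so the full-table churn lands in table 100 instead of "main".

With such a fix, route updates confined to table 100 would no longer force ovs-vswitchd to re-dump everything.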

@f0o

f0o commented Mar 14, 2024

@igsilya Fair enough, I'm really excited for your VRF fix 😅 - Maybe, to make it more modular, add a parameter that can be used to select a table instead of just defaulting to the default/local one. Could be useful sometimes.

I'll continue to work towards a netns solution, just to get a solution into this issue that might work for others without waiting for new packages to be released.

@f0o

f0o commented Mar 14, 2024

@igsilya Although I can confirm that using netns solves the issue, there is no good way to "leak" routes across like there is with VRFs - meaning that moving traffic from OVS into the BGP netns is tedious at best and requires linknets, adding a lot more overhead than what should be a simple left-pocket -> right-pocket routing action.

So I can't accept netns as a viable solution, unfortunately.
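
To make the trade-off concrete, this is roughly the linknet plumbing being described, assuming a namespace named "bgp" and illustrative addresses; every prefix that has to cross between the namespaces needs its own static route:

  # A veth pair acting as the linknet between the default and bgp namespaces.
  sudo ip link add veth-host type veth peer name veth-bgp
  sudo ip link set dev veth-bgp netns bgp
  sudo ip addr add 192.0.2.1/30 dev veth-host
  sudo ip link set dev veth-host up
  sudo ip netns exec bgp ip addr add 192.0.2.2/30 dev veth-bgp
  sudo ip netns exec bgp ip link set dev veth-bgp up
  # The default namespace sends outbound traffic via the bgp namespace ...
  sudo ip route add default via 192.0.2.2
  # ... and the bgp namespace needs return routes for the internal prefixes.
  sudo ip netns exec bgp ip route add 10.0.0.0/8 via 192.0.2.1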

@igsilya
Member

igsilya commented Mar 15, 2024

@f0o there will be a cost to running in the same network namespace. The Linux kernel doesn't provide a mechanism to receive routes for specific tables only. That means that for every BGP update, ovs-vswitchd will still receive a notification. It will be parsed and discarded. This should not be very expensive to do, but it may take a bit of time.
Furthermore, whenever there is a normal route update in the main/local/default table, OVS will have to dump all the routes from all the tables (no filtering is supported by the kernel), parse them, and choose the relevant ones. That will still be expensive. The impact will depend on how frequently the VRF updates the main routing table by leaking routes.

Edit: There might be a way to filter out the full dump.
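
A quick way to see that behavior, using an illustrative prefix and table number, is to watch the kernel's route notifications directly; they are delivered regardless of which table changed:

  # Terminal 1: print every route change the kernel announces.
  ip monitor route
  # Terminal 2: touch a non-default table; the monitor still reports it,
  # which is why ovs-vswitchd wakes up on churn in a VRF table.
  sudo ip route add 192.0.2.0/24 dev lo table 100
  sudo ip route del 192.0.2.0/24 dev lo table 100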

ovsrobot pushed a commit to ovsrobot/ovs that referenced this issue Mar 15, 2024
Currently, ovs-vswitchd is subscribed to all the routing changes in the
kernel.  On each change, it marks the internal routing table cache as
invalid, then resets it and dumps all the routes from the kernel from
scratch.  The reason for that is kernel routing updates not being
reliable in a sense that it's hard to tell which route is getting
removed or modified.  Userspace application has to track the order in
which route entries are dumped from the kernel.  Updates can get lost
or even duplicated and the kernel doesn't provide a good mechanism to
distinguish one route from another.  To my knowledge, dumping all the
routes from a kernel after each change is the only way to keep the
cache consistent.  Some more info can be found in the following never
addressed issues:
  https://bugzilla.redhat.com/1337860
  https://bugzilla.redhat.com/1337855

It seems to be believed that NetworkManager "mostly" does incremental
updates right.  But it is still not completely correct, will re-dump
the whole table in certain cases, and it takes a huge amount of very
complicated code to do the accounting and route comparisons.

Going back to ovs-vswitchd, it currently dumps routes from all the
routing tables.  If it will get conflicting routes from multiple
tables, the cache will not be useful.  The routing cache in userspace
is primarily used for checking the egress port for tunneled traffic
and this way also detecting link state changes for a tunnel port.
For userspace datapath it is used for actual routing of the packet
after sending to a native tunnel.
With kernel datapath we don't really have a mechanism to know which
routing table will actually be used by the kernel after encapsulation,
so our lookups on a cache may be incorrect because of this as well.

So, unless all the relevant routes are in the standard tables, the
lookup in userspace route cache is unreliable.

Luckily, most setups are not using any complicated routing in
non-standard tables that OVS has to be aware of.

It is possible, but unlikely, that standard routing tables are
completely empty while some other custom table is not, and all the OVS
tunnel traffic is directed to that table.  That would be the only
scenario where dumping non-standard tables would make sense.  But it
seems like this kind of setup will likely need a way to tell OVS from
which table the routes should be taken, or we'll need to dump routing
rules and keep a separate cache for each table, so we can first match
on rules and then lookup correct routes in a specific table.  I'm not
sure if trying to implement all that is justified.

For now, stop considering routes from non-standard tables to avoid
mixing different tables together and also wasting CPU resources.

This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is
running on a same host and in a same network namespace with OVS using
its own custom routing table.

Unfortunately, there seems to be no way to tell the kernel to send
updates only for particular tables.  So, we'll still receive and parse
all of them.  But they will not result in a full cache invalidation in
most cases.

Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps.
So, we can make use of it and dump only standard tables when we get a
relevant route update.  NETLINK_GET_STRICT_CHK has to be enabled on
the socket for filtering to work.  There is no reason to not enable it
by default, if supported.  It is not used outside of NETLINK_ROUTE.

Fixes: f0e167f ("route-table: Handle route updates more robustly.")
Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb")
Reported-at: openvswitch/ovs-issues#185
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: 0-day Robot <robot@bytheb.org>
@f0o

f0o commented Mar 18, 2024

@igsilya I'm running into an odd issue and I wonder if it's connected to this one.

I can't seem to manage any bridges with ovs-vsctl anymore.

I keep getting the error:
...00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)

Worth noting: all ovs-vsctl commands go through Ansible with a default timeout of 5s.

//EDIT:

Increasing the timeout worked, but I also notice a very large number of these errors in OVN:
2024-03-18T18:08:50.183Z|00159|rconn(ovn_statctrl2)|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)

And I believe that's also caused by the full tables here...

The moment I dropped this router out of the IBGP mesh, the errors stopped and ports were being added again, etc.

So this 100% CPU utilization is not just an annoyance because it's pegging a core; it actually immobilizes OVS entirely!
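
For others hitting the same symptom: signal 14 is most likely ovs-vsctl's own --timeout alarm firing (Ansible passes a 5-second timeout by default), so raising the timeout rides out the long poll intervals until a real fix lands. The value and port name below are illustrative:

  # Give ovs-vsctl more time to reach a busy ovs-vswitchd/ovsdb-server.
  sudo ovs-vsctl --timeout=60 add-port br-int vnet0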

@igsilya
Member

igsilya commented Mar 18, 2024

@f0o I assume ovs-vswitchd reports long poll intervals; what are the largest numbers you see there? If they are 2.5+ seconds (2.5 before the connection is accepted + potentially 2.5 on the next iteration before the reply is sent), then yes, processes that do not wait longer than 5 seconds may disconnect. Some database operations may potentially wait for more than 2 poll intervals.

For OVN, I assume you're using an OVN 23.09 release older than v23.09.1. The statctrl thread had an inactivity probe set to 5 seconds before commit ovn-org/ovn@bbd07439b . Updating to the latest v23.09.3 should fix the problem.

@f0o

f0o commented Mar 18, 2024

@igsilya Thanks for the insight!

I went ahead and built openvswitch-3.2.1 with the patch from 01f7582584354cf087924170723dd0838d8b34f3, and I can confidently say that it is no longer burning CPU or being unresponsive. All traffic is flowing fine, and ports/flows are being added/removed swiftly!

I guess I'm your lab-bunny now! 🙈
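
In case anyone wants to reproduce that build, a very rough sketch; the fork carrying the patch and a clean cherry-pick onto the 3.2 branch are assumptions, so expect to adapt it:

  git clone https://github.com/openvswitch/ovs.git && cd ovs
  git checkout v3.2.1
  # The commit lives in the author's fork while the patch is under review.
  git remote add igsilya https://github.com/igsilya/ovs.git && git fetch igsilya
  git cherry-pick 01f7582584354cf087924170723dd0838d8b34f3   # may need manual conflict resolution
  ./boot.sh && ./configure && make -j"$(nproc)"
  sudo make install   # installs under /usr/local by default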

@igsilya
Member

igsilya commented Mar 18, 2024

@f0o Thanks a lot for testing! Now we just need to wait for some code review. Hopefully, it won't take too long.

If you want to reply to the original patch email with a Tested-by: Name <email> tag, we can include it in the commit message before applying. Assuming you're not subscribed to the ovs-dev list, you may reply by importing an mbox from patchwork into your email client or by pressing the mailto link at the top of the page in the archive. Leaving the tag here in a GitHub comment is also fine.
But, of course, all that is completely optional. :)

@f0o

f0o commented Mar 19, 2024

I just ran a few tests by adding/removing flows, ports and bridges like a maniac and verifying that traffic flows while both routers remain in IBGP and EBGP with multiple full tables in different VRFs.

Everything works like a breeze! ovs-vswitchd chills at 2% CPU, and from the OVS/OVN logs everything is operational and normal.

I tried adding the mbox and using the mailto, but all my mail clients strip away the Reply-To header "because of safety", so I'm unable to stamp it there.

So please take my informal stamp of testing and approval here :)

@ddominet

ddominet commented Mar 19, 2024 via email

igsilya added a commit to igsilya/ovs that referenced this issue Mar 20, 2024
ovsrobot pushed a commit to ovsrobot/ovs that referenced this issue Mar 20, 2024
igsilya added a commit to igsilya/ovs that referenced this issue Mar 22, 2024
igsilya added a commit to igsilya/ovs that referenced this issue Mar 22, 2024
igsilya added a commit to igsilya/ovs that referenced this issue Mar 22, 2024
igsilya added a commit to igsilya/ovs that referenced this issue Mar 22, 2024
Currently, ovs-vswitchd is subscribed to all the routing changes in the
kernel.  On each change, it marks the internal routing table cache as
invalid, then resets it and dumps all the routes from the kernel from
scratch.  The reason for that is kernel routing updates not being
reliable in a sense that it's hard to tell which route is getting
removed or modified.  Userspace application has to track the order in
which route entries are dumped from the kernel.  Updates can get lost
or even duplicated and the kernel doesn't provide a good mechanism to
distinguish one route from another.  To my knowledge, dumping all the
routes from a kernel after each change is the only way to keep the
cache consistent.  Some more info can be found in the following never
addressed issues:
  https://bugzilla.redhat.com/1337860
  https://bugzilla.redhat.com/1337855

It seems to be believed that NetworkManager "mostly" does incremental
updates right.  But it is still not completely correct, will re-dump
the whole table in certain cases, and it takes a huge amount of very
complicated code to do the accounting and route comparisons.

Going back to ovs-vswitchd, it currently dumps routes from all the
routing tables.  If it will get conflicting routes from multiple
tables, the cache will not be useful.  The routing cache in userspace
is primarily used for checking the egress port for tunneled traffic
and this way also detecting link state changes for a tunnel port.
For userspace datapath it is used for actual routing of the packet
after sending to a native tunnel.
With kernel datapath we don't really have a mechanism to know which
routing table will actually be used by the kernel after encapsulation,
so our lookups on a cache may be incorrect because of this as well.

So, unless all the relevant routes are in the standard tables, the
lookup in userspace route cache is unreliable.

Luckily, most setups are not using any complicated routing in
non-standard tables that OVS has to be aware of.

It is possible, but unlikely, that standard routing tables are
completely empty while some other custom table is not, and all the OVS
tunnel traffic is directed to that table.  That would be the only
scenario where dumping non-standard tables would make sense.  But it
seems like this kind of setup will likely need a way to tell OVS from
which table the routes should be taken, or we'll need to dump routing
rules and keep a separate cache for each table, so we can first match
on rules and then lookup correct routes in a specific table.  I'm not
sure if trying to implement all that is justified.

For now, stop considering routes from non-standard tables to avoid
mixing different tables together and also wasting CPU resources.

This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is
running on a same host and in a same network namespace with OVS using
its own custom routing table.

Unfortunately, there seems to be no way to tell the kernel to send
updates only for particular tables.  So, we'll still receive and parse
all of them.  But they will not result in a full cache invalidation in
most cases.

Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps.
So, we can make use of it and dump only standard tables when we get a
relevant route update.  NETLINK_GET_STRICT_CHK has to be enabled on
the socket for filtering to work.  There is no reason to not enable it
by default, if supported.  It is not used outside of NETLINK_ROUTE.

Fixes: f0e167f ("route-table: Handle route updates more robustly.")
Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb")
Reported-at: openvswitch/ovs-issues#185
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
igsilya added a commit to igsilya/ovs that referenced this issue Mar 22, 2024
Currently, ovs-vswitchd is subscribed to all the routing changes in the
kernel.  On each change, it marks the internal routing table cache as
invalid, then resets it and dumps all the routes from the kernel from
scratch.  The reason for that is kernel routing updates not being
reliable in a sense that it's hard to tell which route is getting
removed or modified.  Userspace application has to track the order in
which route entries are dumped from the kernel.  Updates can get lost
or even duplicated and the kernel doesn't provide a good mechanism to
distinguish one route from another.  To my knowledge, dumping all the
routes from a kernel after each change is the only way to keep the
cache consistent.  Some more info can be found in the following never
addressed issues:
  https://bugzilla.redhat.com/1337860
  https://bugzilla.redhat.com/1337855

It seems to be believed that NetworkManager "mostly" does incremental
updates right.  But it is still not completely correct, will re-dump
the whole table in certain cases, and it takes a huge amount of very
complicated code to do the accounting and route comparisons.

Going back to ovs-vswitchd, it currently dumps routes from all the
routing tables.  If it will get conflicting routes from multiple
tables, the cache will not be useful.  The routing cache in userspace
is primarily used for checking the egress port for tunneled traffic
and this way also detecting link state changes for a tunnel port.
For userspace datapath it is used for actual routing of the packet
after sending to a native tunnel.
With kernel datapath we don't really have a mechanism to know which
routing table will actually be used by the kernel after encapsulation,
so our lookups on a cache may be incorrect because of this as well.

So, unless all the relevant routes are in the standard tables, the
lookup in userspace route cache is unreliable.

Luckily, most setups are not using any complicated routing in
non-standard tables that OVS has to be aware of.

It is possible, but unlikely, that standard routing tables are
completely empty while some other custom table is not, and all the OVS
tunnel traffic is directed to that table.  That would be the only
scenario where dumping non-standard tables would make sense.  But it
seems like this kind of setup will likely need a way to tell OVS from
which table the routes should be taken, or we'll need to dump routing
rules and keep a separate cache for each table, so we can first match
on rules and then lookup correct routes in a specific table.  I'm not
sure if trying to implement all that is justified.

For now, stop considering routes from non-standard tables to avoid
mixing different tables together and also wasting CPU resources.

This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is
running on a same host and in a same network namespace with OVS using
its own custom routing table.

Unfortunately, there seems to be no way to tell the kernel to send
updates only for particular tables.  So, we'll still receive and parse
all of them.  But they will not result in a full cache invalidation in
most cases.

Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps.
So, we can make use of it and dump only standard tables when we get a
relevant route update.  NETLINK_GET_STRICT_CHK has to be enabled on
the socket for filtering to work.  There is no reason to not enable it
by default, if supported.  It is not used outside of NETLINK_ROUTE.

Fixes: f0e167f ("route-table: Handle route updates more robustly.")
Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb")
Reported-at: openvswitch/ovs-issues#185
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
igsilya added a commit to igsilya/ovs that referenced this issue Mar 22, 2024
Currently, ovs-vswitchd is subscribed to all the routing changes in the
kernel.  On each change, it marks the internal routing table cache as
invalid, then resets it and dumps all the routes from the kernel from
scratch.  The reason for that is kernel routing updates not being
reliable in a sense that it's hard to tell which route is getting
removed or modified.  Userspace application has to track the order in
which route entries are dumped from the kernel.  Updates can get lost
or even duplicated and the kernel doesn't provide a good mechanism to
distinguish one route from another.  To my knowledge, dumping all the
routes from a kernel after each change is the only way to keep the
cache consistent.  Some more info can be found in the following never
addressed issues:
  https://bugzilla.redhat.com/1337860
  https://bugzilla.redhat.com/1337855

It seems to be believed that NetworkManager "mostly" does incremental
updates right.  But it is still not completely correct, will re-dump
the whole table in certain cases, and it takes a huge amount of very
complicated code to do the accounting and route comparisons.

Going back to ovs-vswitchd, it currently dumps routes from all the
routing tables.  If it will get conflicting routes from multiple
tables, the cache will not be useful.  The routing cache in userspace
is primarily used for checking the egress port for tunneled traffic
and this way also detecting link state changes for a tunnel port.
For userspace datapath it is used for actual routing of the packet
after sending to a native tunnel.
With kernel datapath we don't really have a mechanism to know which
routing table will actually be used by the kernel after encapsulation,
so our lookups on a cache may be incorrect because of this as well.

So, unless all the relevant routes are in the standard tables, the
lookup in userspace route cache is unreliable.

Luckily, most setups are not using any complicated routing in
non-standard tables that OVS has to be aware of.

It is possible, but unlikely, that standard routing tables are
completely empty while some other custom table is not, and all the OVS
tunnel traffic is directed to that table.  That would be the only
scenario where dumping non-standard tables would make sense.  But it
seems like this kind of setup will likely need a way to tell OVS from
which table the routes should be taken, or we'll need to dump routing
rules and keep a separate cache for each table, so we can first match
on rules and then lookup correct routes in a specific table.  I'm not
sure if trying to implement all that is justified.

For now, stop considering routes from non-standard tables to avoid
mixing different tables together and also wasting CPU resources.

This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is
running on a same host and in a same network namespace with OVS using
its own custom routing table.

Unfortunately, there seems to be no way to tell the kernel to send
updates only for particular tables.  So, we'll still receive and parse
all of them.  But they will not result in a full cache invalidation in
most cases.

Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps.
So, we can make use of it and dump only standard tables when we get a
relevant route update.  NETLINK_GET_STRICT_CHK has to be enabled on
the socket for filtering to work.  There is no reason to not enable it
by default, if supported.  It is not used outside of NETLINK_ROUTE.

Fixes: f0e167f ("route-table: Handle route updates more robustly.")
Fixes: ea83a2f ("lib: Show tunnel egress interface in ovsdb")
Reported-at: openvswitch/ovs-issues#185
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
@igsilya
Member

igsilya commented Mar 22, 2024

@f0o, @ddominet, the patch is now applied to all branches down to 2.17 and will be part of the next set of stable releases. I'm not sure when distributions will pick those up.

With the change you should be able to run BGP in a separate VRF without significant impact on OVS CPU usage. But running with the main routing table will still be problematic.

Closing this issue for now.

@igsilya igsilya closed this as completed Mar 22, 2024
@f0o

f0o commented Mar 24, 2024

Amazing job @igsilya! Thank you so much!
