
ovs-vswitchd deadlock #175

Closed
jharbott opened this issue Sep 18, 2019 · 13 comments

@jharbott

This is similar to #153, but the backtrace we are seeing is slightly different: it happens on the "else" branch and thus not with multicast traffic. We haven't found a way to trigger this, but have been seeing irregular occurrences since we upgraded our OpenStack setup from Xenial to Bionic. The version of openvswitch is unchanged by this upgrade, though, since on Xenial it was already pulled in from the Queens UCA: 2.9.2-0ubuntu0.18.04.3, kernel 4.15.0-55.60.

Thread 62 (Thread 0x7fd3d3fff700 (LWP 2619)):
#0  0x00007fd4d06bb37b in futex_abstimed_wait (private=<optimized out>, abstime=0x0, expected=10, futex_word=0x55b87dc503c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:172
#1  __pthread_rwlock_rdlock_full (abstime=0x0, rwlock=0x55b87dc503c8) at pthread_rwlock_common.c:321
#2  __GI___pthread_rwlock_rdlock (rwlock=rwlock@entry=0x55b87dc503c8) at pthread_rwlock_rdlock.c:27
#3  0x000055b87d47d9d8 in ovs_rwlock_rdlock_at (l_=0x55b87dc503c8, where=where@entry=0x55b87d519848 "../ofproto/ofproto-dpif-xlate.c:2878") at ../lib/ovs-thread.c:77
#4  0x000055b87d3d7794 in xlate_normal (ctx=0x7fd3d3ff9a30) at ../ofproto/ofproto-dpif-xlate.c:2878
#5  xlate_output_action (ctx=ctx@entry=0x7fd3d3ff9a30, port=<optimized out>, controller_len=<optimized out>, may_packet_in=may_packet_in@entry=true, is_last_action=is_last_action@entry=false, truncate=truncate@entry=false) at ../ofproto/ofproto-dpif-xlate.c:4803
#6  0x000055b87d3d8bda in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=<optimized out>, is_last_action=<optimized out>) at ../ofproto/ofproto-dpif-xlate.c:6183
#7  0x000055b87d3d2ef3 in xlate_recursively (actions_xlator=0x55b87d3d81b0 <do_xlate_actions>, is_last_action=false, deepens=<optimized out>, rule=0x55b87dcec7d0, ctx=0x7fd3d3ff9a30) at ../ofproto/ofproto-dpif-xlate.c:3927
#8  xlate_table_action (ctx=0x7fd3d3ff9a30, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=false, xlator=0x55b87d3d81b0 <do_xlate_actions>) at ../ofproto/ofproto-dpif-xlate.c:4055
#9  0x000055b87d3d876e in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=<optimized out>, is_last_action=<optimized out>) at ../ofproto/ofproto-dpif-xlate.c:6496
#10 0x000055b87d3d2ef3 in xlate_recursively (actions_xlator=0x55b87d3d81b0 <do_xlate_actions>, is_last_action=false, deepens=<optimized out>, rule=0x55b87de03e60, ctx=0x7fd3d3ff9a30) at ../ofproto/ofproto-dpif-xlate.c:3927
#11 xlate_table_action (ctx=0x7fd3d3ff9a30, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=false, xlator=0x55b87d3d81b0 <do_xlate_actions>) at ../ofproto/ofproto-dpif-xlate.c:4055
#12 0x000055b87d3d876e in do_xlate_actions (ofpacts=ofpacts@entry=0x55b87de54898, ofpacts_len=ofpacts_len@entry=8, ctx=ctx@entry=0x7fd3d3ff9a30, is_last_action=is_last_action@entry=true) at ../ofproto/ofproto-dpif-xlate.c:6496
#13 0x000055b87d3dd455 in xlate_actions (xin=xin@entry=0x7fd3d3ffa2d0, xout=xout@entry=0x7fd3d3ffabf0) at ../ofproto/ofproto-dpif-xlate.c:7107
#14 0x000055b87d3cc7f5 in xlate_key (key=<optimized out>, len=<optimized out>, push=push@entry=0x7fd3d3ffa670, ctx=ctx@entry=0x7fd3d3ffabd0, udpif=<optimized out>) at ../ofproto/ofproto-dpif-upcall.c:2049
#15 0x000055b87d3ccacc in xlate_ukey (ctx=0x7fd3d3ffabd0, tcp_flags=<optimized out>, ukey=<optimized out>, udpif=0x55b87dc1fa50) at ../ofproto/ofproto-dpif-upcall.c:2061
#16 revalidate_ukey__ (udpif=udpif@entry=0x55b87dc1fa50, ukey=ukey@entry=0x7fd398018980, tcp_flags=<optimized out>, odp_actions=0x7fd3d3ffaff0, recircs=recircs@entry=0x7fd3d3ffb070, xcache=<optimized out>) at ../ofproto/ofproto-dpif-upcall.c:2107
#17 0x000055b87d3ccdfd in revalidate_ukey (udpif=udpif@entry=0x55b87dc1fa50, ukey=ukey@entry=0x7fd398018980, stats=stats@entry=0x7fd3d3ffca58, odp_actions=odp_actions@entry=0x7fd3d3ffaff0, reval_seq=reval_seq@entry=1806239372, recircs=recircs@entry=0x7fd3d3ffb070) at ../ofproto/ofproto-dpif-upcall.c:2214
#18 0x000055b87d3cfa7f in revalidate (revalidator=<optimized out>) at ../ofproto/ofproto-dpif-upcall.c:2517
#19 0x000055b87d3d00f1 in udpif_revalidator (arg=0x55b87dd096e0) at ../ofproto/ofproto-dpif-upcall.c:913
#20 0x000055b87d47d916 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:348
#21 0x00007fd4d06b66db in start_thread (arg=0x7fd3d3fff700) at pthread_create.c:463
#22 0x00007fd4cfe3988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
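
For readers not familiar with this code path: frames #3-#4 show the revalidator thread blocked in pthread_rwlock_rdlock() while taking a read lock from xlate_normal(). Below is a minimal sketch of that pattern (the struct and function names are illustrative only, not the actual OVS structures); with a healthy rwlock the reader only waits while a writer is inside or queued, whereas here the thread never returns from the rdlock call.

#include <pthread.h>

/* Minimal sketch of the read-lock pattern in the backtrace above; names
 * are illustrative, not OVS's actual API. */
struct mac_table {
    pthread_rwlock_t rwlock;          /* many readers, occasional writer */
    /* ... learned MAC entries ... */
};

/* Reader path, as in xlate_normal(): take the lock shared, look up the
 * destination MAC, release.  The hang reported here is this rdlock call
 * never returning. */
void lookup_mac(struct mac_table *table)
{
    pthread_rwlock_rdlock(&table->rwlock);
    /* ... MAC lookup ... */
    pthread_rwlock_unlock(&table->rwlock);
}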
@ionutbiru

I also had a deadlock, but sadly I don't know how to debug it. Everything went back to normal after a restart.


openvswitch-common  2.11.0-0ubuntu2~cloud0  
openvswitch-switch 2.11.0-0ubuntu2~cloud0
Linux cj204 5.0.0-29-generic #31~18.04.1-Ubuntu SMP Thu Sep 12 18:29:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
2019-11-12T12:51:03.233Z|00446|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-11-12T17:35:24.902Z|00001|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:25.902Z|00002|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:27.168Z|00447|ovs_rcu|WARN|blocked 1000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:27.902Z|00003|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:28.167Z|00448|ovs_rcu|WARN|blocked 2000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:30.167Z|00449|ovs_rcu|WARN|blocked 4000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:31.901Z|00004|ovs_rcu(urcu4)|WARN|blocked 8000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:34.167Z|00450|ovs_rcu|WARN|blocked 8000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:39.901Z|00005|ovs_rcu(urcu4)|WARN|blocked 16000 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:42.168Z|00451|ovs_rcu|WARN|blocked 16001 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:55.902Z|00006|ovs_rcu(urcu4)|WARN|blocked 32001 ms waiting for revalidator95 to quiesce
2019-11-12T17:35:58.168Z|00452|ovs_rcu|WARN|blocked 32001 ms waiting for revalidator95 to quiesce
2019-11-12T17:36:27.902Z|00007|ovs_rcu(urcu4)|WARN|blocked 64000 ms waiting for revalidator95 to quiesce
2019-11-12T17:36:30.167Z|00453|ovs_rcu|WARN|blocked 64000 ms waiting for revalidator95 to quiesce
2019-11-12T17:37:31.902Z|00008|ovs_rcu(urcu4)|WARN|blocked 128000 ms waiting for revalidator95 to quiesce
2019-11-12T17:37:34.168Z|00454|ovs_rcu|WARN|blocked 128000 ms waiting for revalidator95 to quiesce
2019-11-12T17:39:39.901Z|00009|ovs_rcu(urcu4)|WARN|blocked 256000 ms waiting for revalidator95 to quiesce
2019-11-12T17:39:42.168Z|00455|ovs_rcu|WARN|blocked 256001 ms waiting for revalidator95 to quiesce
2019-11-12T17:43:55.904Z|00010|ovs_rcu(urcu4)|WARN|blocked 512003 ms waiting for revalidator95 to quiesce
2019-11-12T17:43:58.172Z|00456|ovs_rcu|WARN|blocked 512004 ms waiting for revalidator95 to quiesce
2019-11-12T17:52:27.902Z|00011|ovs_rcu(urcu4)|WARN|blocked 1024000 ms waiting for revalidator95 to quiesce
2019-11-12T17:52:30.168Z|00457|ovs_rcu|WARN|blocked 1024000 ms waiting for revalidator95 to quiesce
2019-11-12T18:09:31.908Z|00012|ovs_rcu(urcu4)|WARN|blocked 2048007 ms waiting for revalidator95 to quiesce
2019-11-12T18:09:34.177Z|00458|ovs_rcu|WARN|blocked 2048009 ms waiting for revalidator95 to quiesce
2019-11-12T18:43:39.942Z|00013|ovs_rcu(urcu4)|WARN|blocked 4096040 ms waiting for revalidator95 to quiesce
2019-11-12T18:43:42.210Z|00459|ovs_rcu|WARN|blocked 4096042 ms waiting for revalidator95 to quiesce
2019-11-12T19:51:56.000Z|00014|ovs_rcu(urcu4)|WARN|blocked 8192099 ms waiting for revalidator95 to quiesce
2019-11-12T19:51:58.268Z|00460|ovs_rcu|WARN|blocked 8192100 ms waiting for revalidator95 to quiesce
2019-11-12T22:08:28.002Z|00015|ovs_rcu(urcu4)|WARN|blocked 16384100 ms waiting for revalidator95 to quiesce
2019-11-12T22:08:30.267Z|00461|ovs_rcu|WARN|blocked 16384100 ms waiting for revalidator95 to quiesce
2019-11-13T02:41:32.001Z|00016|ovs_rcu(urcu4)|WARN|blocked 32768099 ms waiting for revalidator95 to quiesce
2019-11-13T02:41:34.230Z|00462|ovs_rcu|WARN|blocked 32768062 ms waiting for revalidator95 to quiesce
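
A note on reading this log: the pattern suggests each "blocked N ms" line is not a new event; the watcher warns once and then doubles the interval, so the 1000/2000/4000/... sequence appears to mean a single revalidator thread has been stuck continuously since 17:35. A minimal self-contained sketch of that doubling-warning pattern (my approximation, not the actual OVS ovs_rcu code):

#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static atomic_bool quiesced;             /* would be set by the watched thread */

static long long now_msec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
}

static void wait_for_quiesce(const char *name)
{
    long long start = now_msec();
    long long warn_at = 1000;            /* first warning after 1 s */

    while (!atomic_load(&quiesced)) {
        long long blocked = now_msec() - start;
        if (blocked >= warn_at) {
            fprintf(stderr, "blocked %lld ms waiting for %s to quiesce\n",
                    blocked, name);
            warn_at *= 2;                /* 1000, 2000, 4000, ... */
        }
        usleep(10 * 1000);               /* poll every 10 ms */
    }
}

int main(void)
{
    /* Nothing ever sets 'quiesced', so this prints warnings at doubling
     * intervals forever, just like the log above. */
    wait_for_quiesce("revalidator95");
}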

@zdenekjanda

zdenekjanda commented Nov 27, 2019

I can confirm this bug; the manifestation is exactly the same. We can reproduce it easily, as it occurs several times a day. It is also an OpenStack environment, after the update to Bionic; we are at 2.12.0-0ubuntu1 (compiled from the latest Ubuntu dsc) and 5.0.0-31-generic. Nothing has helped so far.
Stuck thread:

40 Thread 0x7fb88f7fe700 (LWP 4457) "revalidator519" 0x00007fb91f544bf9 in __GI___poll (fds=fds@entry=0x7fb8cc029150, nfds=nfds@entry=2, timeout=timeout@entry=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
41 Thread 0x7fb88effd700 (LWP 4458) "revalidator520" 0x00007fb92007d37b in futex_abstimed_wait (private=<optimized out>, abstime=0x0, expected=10, futex_word=0x5614a8a33378) at ../sysdeps/unix/sysv/linux/futex-internal.h:172
42 Thread 0x7fb88e7fc700 (LWP 4459) "revalidator521" 0x00007fb91f544bf9 in __GI___poll (fds=fds@entry=0x7fb8b412f6e0, nfds=nfds@entry=2, timeout=timeout@entry=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29

Backtrace of the stuck thread:

Thread 41 (Thread 0x7fb88effd700 (LWP 4458)):
#0 0x00007fb92007d37b in futex_abstimed_wait (private=<optimized out>, abstime=0x0, expected=10, futex_word=0x5614a8a33378) at ../sysdeps/unix/sysv/linux/futex-internal.h:172
        __ret = -512 err = <optimized out> err = <optimized out> __ret = <optimized out> resultvar = <optimized out> __arg6 = <optimized out> __arg5 = <optimized out> __arg4 = <optimized out> __arg3 = <optimized out> __arg2 = <optimized out> __arg1 = <optimized out> _a6 = <optimized out> _a5 = <optimized out> _a4 = <optimized out> _a3 = <optimized out> _a2 = <optimized out> _a1 = <optimized out>
#1 __pthread_rwlock_rdlock_full (abstime=0x0, rwlock=0x5614a8a33378) at pthread_rwlock_common.c:321
        private = <optimized out> err = <optimized out> r = 10 wpf = <optimized out> ready = <optimized out> r = <optimized out> wpf = <optimized out> ready = <optimized out> __value = <optimized out> private = <optimized out> err = <optimized out> private = <optimized out> private = <optimized out> err = <optimized out>
#2 __GI___pthread_rwlock_rdlock (rwlock=rwlock@entry=0x5614a8a33378) at pthread_rwlock_rdlock.c:27
        result = <optimized out>
#3 0x00005614a66e9ed8 in ovs_rwlock_rdlock_at (l_=0x5614a8a33378, where=where@entry=0x5614a6791c60 "../ofproto/ofproto-dpif-xlate.c:3114") at ../lib/ovs-thread.c:76
        l = 0x5614a8a33378 error = <optimized out> __func__ = "ovs_rwlock_rdlock_at"
#4 0x00005614a663294c in xlate_normal (ctx=0x7fb88eff7850) at ../ofproto/ofproto-dpif-xlate.c:3114
        in_xbundle = <optimized out> mac = <optimized out> xvlan = {v = {{tpid = 33024, vid = 290, pcp = 0}, {tpid = 0, vid = 0, pcp = 0}}} wc = 0x7fb88eff8750 flow = 0x7fb88eff8100 in_port = 0x5614a8ddb840 mac_port = <optimized out> in_xvlan = {v = {{tpid = 33024, vid = 290, pcp = 0}, {tpid = 0, vid = 0, pcp = 0}}} vlan = 290 is_grat_arp = <optimized out> wc = <optimized out> flow = <optimized out> in_xbundle = <optimized out> in_port = <optimized out> mac = <optimized out> mac_port = <optimized out> in_xvlan = <optimized out> xvlan = <optimized out> vlan = <optimized out> is_grat_arp = <optimized out> __v = <optimized out> __x = <optimized out> __v = <optimized out> __x = <optimized out> entry = <optimized out> ms = <optimized out> grp = <optimized out> out = <optimized out> out = <optimized out> out = <optimized out> __v = <optimized out> __x = <optimized out> __v = <optimized out> __x = <optimized out> mac_xbundle = <optimized out>
#5 xlate_output_action (ctx=ctx@entry=0x7fb88eff7850, port=<optimized out>, controller_len=<optimized out>, may_packet_in=may_packet_in@entry=true, is_last_action=<optimized out>, truncate=truncate@entry=false, group_bucket_action=false) at ../ofproto/ofproto-dpif-xlate.c:5119
        prev_nf_output_iface = 65533
#6 0x00005614a6634bfb in do_xlate_actions (ofpacts=ofpacts@entry=0x5614a8c05f08, ofpacts_len=ofpacts_len@entry=16, ctx=ctx@entry=0x7fb88eff7850, is_last_action=<optimized out>, group_bucket_action=group_bucket_action@entry=false) at ../ofproto/ofproto-dpif-xlate.c:6630
        controller = <optimized out> last = <optimized out> mf = <optimized out> set_field = <optimized out> metadata = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __v = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> __x = <optimized out> wc = <optimized out> flow = <optimized out> a = 0x5614a8c05f08 __func__ = "do_xlate_actions"
#7 0x00005614a6637583 in clone_xlate_actions (actions=0x5614a8c05f08, actions_len=16, ctx=0x7fb88eff7850, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ../ofproto/ofproto-dpif-xlate.c:5643
        old_stack = {base = 0x7fb88eff5560, data = 0x7fb88eff5560, size = 0, allocated = 1024, header = 0x0, msg = <optimized out>, list_node = {prev = <optimized out>, next = <optimized out>}, source = <optimized out>} new_stack = {{u8 = '\000' <repeats 127 times>, be16 = {0 <repeats 64 times>}, be32 = {0 <repeats 32 times>}, be64 = {0 <repeats 16 times>}, be128 = {{be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}, {be32 = {0, 0, 0, 0}, be64 = {hi = 0, lo = 0}}}},

Reported to the OVS mailing list as well, with the full backtrace attached. Now, any idea what we can do about this as a first shot? We can definitely reproduce it; the bug occurs pretty frequently. From what I see, the thread appears to be blocked for no apparent reason. What is also very strange is the way we mitigate this problem: we simply take a gcore coredump of ovs-vswitchd, which unblocks the stuck thread, and ovs-vswitchd continues to work normally. Even with this workaround it is extremely annoying, especially if you hit this while creating many VMs, etc. Does anyone have a suggestion on what we can do now?
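
One plausible explanation for the gcore observation (an assumption on my side, not something verified in the glibc sources): attaching with ptrace interrupts the stuck thread's futex wait; when the wait returns, the waiter re-reads the lock word and notices the state change it had missed, so a lost-wakeup style hang clears itself. A minimal self-contained sketch of that effect, using a hypothetical flag and the raw futex syscall rather than glibc's rwlock code:

#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int flag;                      /* 0 = keep waiting, 1 = go */
static pthread_t waiter_thread;

static void on_sigusr1(int sig) { (void) sig; }

static void *waiter(void *arg)
{
    (void) arg;
    while (!atomic_load(&flag)) {
        /* Sleep until 'flag' changes from 0; only a wake-up or an
         * interruption makes us re-check it. */
        if (syscall(SYS_futex, &flag, FUTEX_WAIT, 0, NULL, NULL, 0) == -1
            && errno == EINTR) {
            puts("wait interrupted; re-checking the flag");
        }
    }
    puts("waiter unblocked");
    return NULL;
}

static void *fixer(void *arg)
{
    (void) arg;
    sleep(1);
    atomic_store(&flag, 1);                  /* simulate a lost wake-up: no FUTEX_WAKE */
    sleep(1);
    pthread_kill(waiter_thread, SIGUSR1);    /* stand-in for the gcore/ptrace interruption */
    return NULL;
}

int main(void)
{
    struct sigaction sa = { .sa_handler = on_sigusr1 };  /* no SA_RESTART */
    pthread_t f;

    sigaction(SIGUSR1, &sa, NULL);
    pthread_create(&waiter_thread, NULL, waiter, NULL);
    pthread_create(&f, NULL, fixer, NULL);
    pthread_join(waiter_thread, NULL);
    pthread_join(f, NULL);
    return 0;
}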

@zdenekjanda

Also to note, we tested 4.x kernels and even lower versions of OVS; nothing seems to help. So the problem is most probably an interaction between Ubuntu Bionic and some C libraries OVS is using. We will need help from somebody familiar with the OVS code to figure out what is going on here - we are ready to patch and test, but so far have no idea where to begin.

@zdenekjanda

zdenekjanda commented Nov 27, 2019

@jharbott @ionutbiru anything new regarding this from your side?

@igsilya
Member

igsilya commented Nov 27, 2019

This issue might be related, and it seems that it is not fixed yet in Ubuntu 18.04:
https://sourceware.org/bugzilla/show_bug.cgi?id=23844
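
For anyone who wants to poke at it locally: as I read that report (hedging here; the authoritative reproducer is in the bug itself), the problematic pattern is pthread_rwlock_trywrlock being used alongside plain read locks, which on an affected glibc can leave a later pthread_rwlock_rdlock stuck. A rough, self-contained sketch of that usage pattern follows; on a fixed glibc it simply spins forever without hanging, so run it under a timeout.

#include <pthread.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

/* Writers only ever use trywrlock, as described in the glibc report. */
static void *writer(void *arg)
{
    (void) arg;
    for (;;) {
        if (pthread_rwlock_trywrlock(&lock) == 0) {
            pthread_rwlock_unlock(&lock);
        }
    }
    return NULL;
}

/* Readers take the lock shared; on an affected glibc this rdlock is where
 * a thread can end up stuck, much like the revalidator backtraces above. */
static void *reader(void *arg)
{
    (void) arg;
    for (;;) {
        pthread_rwlock_rdlock(&lock);
        pthread_rwlock_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    pthread_create(&t[0], NULL, writer, NULL);
    pthread_create(&t[1], NULL, writer, NULL);
    pthread_create(&t[2], NULL, reader, NULL);
    pthread_create(&t[3], NULL, reader, NULL);
    pthread_join(t[0], NULL);   /* threads loop forever; run under a timeout */
    return 0;
}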

@zdenekjanda

@igsilya thanks for the lead; from what I see this could be the most likely cause. I will look into testing the patch in our environment.

@jharbott
Author

@igsilya @zdenekjanda That looks pretty interesting and would match what we see here. I'll also trigger some testing on our side, but it will take a couple of days.

@jharbott
Author

jharbott commented Dec 5, 2019

So I did some testing with the patch for the glibc bug applied to the Ubuntu Bionic version of glibc (2.27-3ubuntu1), but it did not resolve the OVS lockups for us. Going to try newer glibc versions now; maybe there were some similar issues that got fixed.

@zdenekjanda

@jharbott I have tested the same and it didn't resolve the lockup. But two days ago we deployed 2.30-0ubuntu2, which is the first version to have the patch in natively (a brave step: added the repos from eoan and installed libc-dev-bin libc6-dbg libc6-dev multiarch-support libc-bin libc6 libidn2-0 locales), and since then there have been no occurrences of the bug (before, we had a stable 10+ occurrences a day). The issue is resolved; indeed it is not just this patch but also something else, so I strongly suggest avoiding openvswitch with glibc 2.27-2.29. Will keep it for a few more days to see whether it appears again or whether there are any issues with such a mixed system.

@jharbott
Author

jharbott commented Dec 6, 2019

@zdenekjanda that is good news. We started a similar test with 2.29-0ubuntu2 from disco before I read your reply; we'll see on Monday whether that might already be good enough. If not, we will switch to eoan like you did.

@jharbott
Author

So 2.29-0ubuntu2 solves the issue for us, too. With that, I'm closing this issue, as it isn't caused by OVS; we will work with Ubuntu to try to get the fix(es) into Bionic.

@jshen28

jshen28 commented Jul 13, 2020

Do you have pretty busy network traffic? We also seem to be hit by the same issue in production, but cannot reproduce it in our dev environment. Curious to find the trigger...

@jshen28

jshen28 commented Jul 28, 2020

I think this bug report is pretty interesting: https://sourceware.org/bugzilla/show_bug.cgi?id=23861
