Finite LINGER for Zeromq blocks #1132

Closed
jaredd opened this issue Dec 6, 2016 · 3 comments
jaredd (Contributor) commented Dec 6, 2016

I have a dynamic distributed application where multiple flowgraphs communicate using zeromq blocks. From test to test the set of running flowgraphs often changes, requiring complete teardown of one flowgraph and construction of a new one in its place. I cannot reliably produce a minimal example of this behavior, but after some number of reconfigurations (almost always fewer than 10) a flowgraph will fail to exit. The prevailing theory is that ZMQ is hanging during destruction due to unserved messages in its queue. The theory was somewhat validated by placing a finite LINGER sockopt in the zeromq base_impl constructor; after this change flowgraphs reliably cleaned up. LINGER seems like a reasonable option to make available to users.
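
For reference, a minimal sketch of the kind of change described above, assuming the cppzmq (zmq.hpp) API that gr-zeromq uses; the class and member names here are illustrative, not the actual base_impl code:

// Illustrative sketch only: set a finite ZMQ_LINGER on the socket so that
// destroying the zmq::context_t cannot block forever on unsent messages.
#include <zmq.hpp>
#include <string>

class example_zmq_sink   // hypothetical stand-in for a gr-zeromq block
{
public:
    explicit example_zmq_sink(const std::string &address)
        : d_context(1), d_socket(d_context, ZMQ_PUSH)
    {
        // A finite linger (here 1000 ms) bounds how long zmq_ctx_term()
        // waits for pending messages when the flowgraph is torn down.
        const int linger_ms = 1000; // illustrative value
        d_socket.setsockopt(ZMQ_LINGER, &linger_ms, sizeof(linger_ms));
        d_socket.connect(address.c_str());
    }

private:
    zmq::context_t d_context;
    zmq::socket_t d_socket;
};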

jmcorgan added the ZMQ label Dec 6, 2016
jmcorgan (Contributor) commented

Can you do a pull request with your proposed changes?

jaredd (Contributor, Author) commented Feb 15, 2017

Just a small update on this issue: when the GUI froze while trying to tear down the flowgraph, I attached to the process with gdb and got the following backtrace, which suggests that it is zmq waiting on unsent messages:
#0 0x00007f235424bfdd in poll () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f22f9630a6a in ?? () from /usr/lib/x86_64-linux-gnu/libzmq.so.3
#2 0x00007f22f961d2f7 in ?? () from /usr/lib/x86_64-linux-gnu/libzmq.so.3
#3 0x00007f22f961463c in ?? () from /usr/lib/x86_64-linux-gnu/libzmq.so.3
#4 0x00007f22f9645388 in zmq_ctx_term () from /usr/lib/x86_64-linux-gnu/libzmq.so.3
#5 0x00007f22f987054e in close (this=0x37adbc0) at /usr/include/zmq.hpp:288
#6 ~context_t (this=0x37adbc0, __in_chrg=) at /usr/include/zmq.hpp:281
#7 gr::zeromq::base_impl::~base_impl (this=0x334e5f8, __vtt_parm=, __in_chrg=) at /usr/local/src/gnuradio/gr-zeromq/lib/base_impl.cc:54
#8 0x00007f22f987cdb9 in ~base_sink_impl (__vtt_parm=0x7f22f9a95778 <VTT for gr::zeromq::push_sink_impl+24>, this=, __in_chrg=) at /usr/local/src/gnuradio/gr-zeromq/lib/base_impl.h:47
#9 ~push_sink_impl (this=0x334e5f0, __in_chrg=, __vtt_parm=) at /usr/local/src/gnuradio/gr-zeromq/lib/push_sink_impl.h:34
#10 gr::zeromq::push_sink_impl::~push_sink_impl (this=0x334e5f0, __in_chrg=, __vtt_parm=) at /usr/local/src/gnuradio/gr-zeromq/lib/push_sink_impl.h:34

Again, I cannot reliably reproduce this behavior. It appears to occur randomly after some number of flowgraph builds and teardowns.

MattMills (Contributor) commented Oct 17, 2020

Experiencing a similar issue: flowgraph teardowns during top_block.stop() are locking up processes. Related portion of the backtrace:

#0  0x00007efcd77efaff in __GI___poll (fds=0x7ffe480eee20, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007efc09e4e56b in ?? () from /usr/lib/x86_64-linux-gnu/libzmq.so.5
#2  0x00007efc09e292be in ?? () from /usr/lib/x86_64-linux-gnu/libzmq.so.5
#3  0x00007efc09e1715c in ?? () from /usr/lib/x86_64-linux-gnu/libzmq.so.5
#4  0x00007efc09e6f7ce in zmq_ctx_term () from /usr/lib/x86_64-linux-gnu/libzmq.so.5
#5  0x00007efc09eba413 in ?? () from /usr/lib/x86_64-linux-gnu/libgnuradio-zeromq.so.3.8.1
#6  0x00007efc09ecaf1d in ?? () from /usr/lib/x86_64-linux-gnu/libgnuradio-zeromq.so.3.8.1
#7  0x00007efcd690eeea in gr::edge::~edge() () from /usr/lib/x86_64-linux-gnu/libgnuradio-runtime.so.3.8.1
#8  0x00007efcd690f3cc in gr::flowgraph::clear() () from /usr/lib/x86_64-linux-gnu/libgnuradio-runtime.so.3.8.1
#9  0x00007efcd691c487 in gr::hier_block2_detail::disconnect_all() () from /usr/lib/x86_64-linux-gnu/libgnuradio-runtime.so.3.8.1
#10 0x00007efcd691ad81 in gr::hier_block2::~hier_block2() () from /usr/lib/x86_64-linux-gnu/libgnuradio-runtime.so.3.8.1
#11 0x00007efcd693f5ed in gr::top_block::~top_block() () from /usr/lib/x86_64-linux-gnu/libgnuradio-runtime.so.3.8.1
#12 0x00007efcd6a2e30a in ?? () from /usr/lib/python3/dist-packages/gnuradio/gr/_runtime_swig.so
#13 0x00007efcd69f8d37 in ?? () from /usr/lib/python3/dist-packages/gnuradio/gr/_runtime_swig.so
#14 0x00000000005d31e8 in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
... remainder of backtrace excluded...

@jmcorgan I'm not a C dev, but I believe a const int zero = 0; and a d_socket.setsockopt(ZMQ_LINGER, &zero, sizeof(zero)); would be appropriate somewhere around here: https://github.com/gnuradio/gnuradio/blob/master/gr-zeromq/lib/base_impl.cc#L74, according to the documentation for ZMQ_LINGER, http://api.zeromq.org/4-2:zmq-setsockopt#toc24, specifically:

A value of -1 specifies an infinite linger period. Pending messages shall not be discarded after a call to zmq_disconnect() or zmq_close(); attempting to terminate the socket's context with zmq_ctx_term() shall block until all pending messages have been sent to a peer.

I believe this is related to this behavior as well: zeromq/libzmq@90ea11c (as the actual timeout value in my backtrace is showing as -1, which would be consistent with the indefinite blocking behavior listed above).

Related test case showing example usage of zmq_setsockopt with ZMQ_LINGER: https://github.com/zeromq/libzmq/blob/master/tests/test_connect_rid.cpp#L62
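
For anyone who wants to see the behavior in isolation, here is a small self-contained sketch (assuming the cppzmq zmq.hpp API; this is not GNU Radio code): with the default linger of -1 the context destructor would block on the queued message, while a linger of 0 lets it return immediately.

// Standalone illustration of how ZMQ_LINGER affects zmq_ctx_term() blocking.
// Assumes the cppzmq header-only API (zmq.hpp); not GNU Radio code.
#include <zmq.hpp>
#include <cstring>

int main()
{
    zmq::context_t ctx(1);
    zmq::socket_t sock(ctx, ZMQ_PUSH);

    // No peer is listening on this port, so the message stays queued.
    sock.connect("tcp://127.0.0.1:5555");

    zmq::message_t msg(5);
    std::memcpy(msg.data(), "hello", 5);
    sock.send(msg);

    // With the default ZMQ_LINGER of -1, destroying ctx (zmq_ctx_term) would
    // block forever waiting for the queued message to be delivered.
    // A linger of 0 tells libzmq to drop pending messages on close instead.
    const int linger = 0;
    sock.setsockopt(ZMQ_LINGER, &linger, sizeof(linger));

    return 0; // sock and ctx are destroyed here without hanging
}

If the setsockopt line is removed, the program hangs in zmq_ctx_term(), matching the backtraces above.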

MattMills added a commit to MattMills/gnuradio that referenced this issue Oct 19, 2020
Re: gnuradio#1132. The default value of ZMQ_LINGER was incorrect in the ZMQ docs and is actually -1, which causes sockets with pending messages to block indefinitely during teardown, locking up flowgraphs during top_block.stop() if outbound data is pending.
MattMills added a commit to MattMills/gnuradio that referenced this issue Oct 27, 2020
Closes: gnuradio#1132
Per the ZMQ documentation update, the docs originally listed the default
of ZMQ_LINGER as 30 seconds; however, the real default was -1.

This caused top_block.stop() to block indefinitely while the socket
waited for abandoned messages to be read by a client.

Ideally this value should be configurable; I've opened gnuradio#3872 as a follow-up.
mormj closed this as completed in 15efb1e Oct 28, 2020
MattMills added a commit to MattMills/gnuradio that referenced this issue Nov 2, 2020
Backport of gnuradio#3866 to maint-3.8
Closes: gnuradio#1132
Per the ZMQ documentation update, the docs originally listed the default
of ZMQ_LINGER as 30 seconds; however, the real default was -1.

This caused top_block.stop() to block indefinitely while the socket
waited for abandoned messages to be read by a client.

Ideally this value should be configurable; I've opened gnuradio#3872 as a follow-up.
dkozel pushed a commit that referenced this issue Dec 16, 2020
Backport of #3866 to maint-3.8
Closes: #1132
Per the ZMQ documentation update, the docs originally listed the default
of ZMQ_LINGER as 30 seconds; however, the real default was -1.

This caused top_block.stop() to block indefinitely while the socket
waited for abandoned messages to be read by a client.

Ideally this value should be configurable; I've opened #3872 as a follow-up.