Drone blocks when stream statistics is switched off in port configuration #234

Closed
ghost opened this issue Dec 11, 2017 · 13 comments

Comments

@ghost

commented Dec 11, 2017

Version 0.9 revision f32c50e on Debian 9.2 (Stretch)

Simple test:

  • enable stream statistics on port 0 (eth0)
  • disable stream statistics on port 0 (eth0)

Enabling works fine, but disabling port statistics leads to a blocked GUI (busy cursor remains active). The drone doesn't use CPU time in that situation, so it is not in an infinite loop.

The output during disabling of the stream statistics:

In PortConfigDialog
reservedBy_ = 0
modifyPort: portIndex = 0
client(CallMethod) sending 16 bytes <----
0001000200000008
method = 2:modifyPort
 req = PortConfigList
port {
  port_id {
    id: 0
  }
  is_tracking_stream_stats: false
}

---->
[127.0.0.1:46862] Server(on_clientSock_dataAvail): successfully received/parsed msg <----
[127.0.0.1:46862] method = 2
[127.0.0.1:46862] req = 
[127.0.0.1:46862] port {
[127.0.0.1:46862]   port_id {
[127.0.0.1:46862]     id: 0
[127.0.0.1:46862]   }
[127.0.0.1:46862]   is_tracking_stream_stats: false
[127.0.0.1:46862] }
[127.0.0.1:46862] ---->
[127.0.0.1:46862] In virtual void MyService::modifyPort(google::protobuf::RpcController*, const OstProto::PortConfigList*, OstProto::Ack*, google::protobuf::Closure*)
RpcChannel: queueing rpc since method 2 is pending;<----
 queued method = 13:getStats
queued message = 
port_id {
  id: 0
}
port_id {
  id: 1
}
port_id {
  id: 2
}
port_id {
  id: 3
}

---->
pendingCallList size = 1
@pstavirs

Owner

commented Dec 11, 2017

@ehlers Not able to repro this on Ubuntu or Windows. Busy cursor means Ostinato didn't get a response from Drone. Can you check what is happening on the Drone side? See if you can run gdb on Drone to check.

@ghost

Author

commented Dec 11, 2017

It is quite unfortunate that you can't reproduce it. As debugging this on my side will be very time-consuming for both of us, I suggest closing this issue. Maybe it's just my environment that is unusual, and I currently can't spend a lot of time on it.

@pstavirs

Owner

commented Dec 11, 2017

No problem.

Though if you could get me the drone-side log, that may help.

@ghost

Author

commented Dec 11, 2017

No problem, see the attached file. I only started Ostinato, enabled stream statistics, then disabled it.
As the drone/GUI blocked, I killed the drone (SIGINT); that's where the last two lines in the log come from.
ostinato-drone.log.txt

@ghost

Author

commented Dec 11, 2017

Two observations:

  • Having traffic on the interface doesn't change anything; it still blocks.
  • When I disable the port (ifconfig eth0 down) while the drone is blocked, it immediately unblocks. The drone log then continues with:

virtual void PcapRxStats::run(): error reading packet (-1): The interface went down
user requested receiver stop
@ghost

Author

commented Dec 11, 2017

Well, I have compiled the drone with -config debug, started the drone in gdb, waited for the drone to lock up (it happens also without optimization) and interrupted with ^C. Here is just a backtrace; please tell me what you want me to debug.

[127.0.0.1:48140] Server(on_clientSock_dataAvail): successfully received/parsed msg <----
[127.0.0.1:48140] method = 2
[127.0.0.1:48140] req = 
[127.0.0.1:48140] port {
[127.0.0.1:48140]   port_id {
[127.0.0.1:48140]     id: 0
[127.0.0.1:48140]   }
[127.0.0.1:48140]   is_tracking_stream_stats: false
[127.0.0.1:48140] }
[127.0.0.1:48140] ---->
[127.0.0.1:48140] In virtual void MyService::modifyPort(google::protobuf::RpcController*, const OstProto::PortConfigList*, OstProto::Ack*, google::protobuf::Closure*)
^C
Thread 1 "drone" received signal SIGINT, Interrupt.
0x00007ffff5a5e6ad in poll () at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff5a5e6ad in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ffff52979f6 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007ffff5297b0c in g_main_context_iteration ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff6993854 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#4  0x00007ffff69617ef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#5  0x00007ffff6961b55 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#6  0x00007ffff6967bd9 in QCoreApplication::exec() ()
   from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#7  0x00005555555837b7 in main (argc=1, argv=0x7fffffffecc8)
    at drone_main.cpp:97
@pstavirs

Owner

commented Dec 12, 2017

That BT is most likely of the main thread which listens for incoming connections. The actual processing for a client/controller is done in a different thread. It is the latter that I suspect is getting stuck somewhere.

Now the problem is that it might be difficult to isolate the specific thread in question because there are multiple threads in Drone. What might be easier is to put a breakpoint on MyService::modifyPort(). The breakpoint will be hit first when you enable; you can just choose to continue. The second time it will be hit for the disable case - do a single-step execution from there.

The problem is somewhere in that function (or in some function called by it).

Thanks a bunch for your help!

@ghost

Author

commented Dec 12, 2017

Here are my two GDB logs; the drone hangs in PcapRxStats::stop().
trace1.txt
trace2.txt

As I wrote previously, there is some (light) traffic coming in on that interface.

@ghost

Author

commented Dec 12, 2017

I think I found the reason for the blocking.

I've added a debug message to PcapRxStats::run() to show me when (and with which status) pcap_next_ex returns. With libpcap v1.8.1-3 from Debian Stretch it returns only when at least one packet is available. It never returns with status 0 (timeout).

On a normal pcap capture this is also an issue (see #215), but normally some packets are always floating around, so it just takes a while until one is received and the capture can be stopped. With the capture filter in PcapRxStats::run(), however, it returns only when a statistics packet is received, which is not the case in my test setup. Therefore pcap_next_ex never returns.
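
To illustrate the failure mode, here is a minimal sketch of a receive loop in the style of PcapRxStats::run() (the loop body and the timeout value are assumptions; only the pcap_next_ex() call and the case labels mirror the diff further down). With a working read timeout, pcap_next_ex() periodically returns 0 so the stop flag gets re-checked; if the timeout is never honoured, the loop can only make progress when a packet passing the filter arrives:

#include <pcap.h>

// Sketch only - simplified stand-in for the loop in server/pcaprxstats.cpp.
// 'handle' is assumed to be opened with a non-zero read timeout, e.g.
// pcap_open_live(device, 65535, 0, 1000 /* ms */, errbuf).
static void rxStatsLoop(pcap_t *handle, volatile bool *stop)
{
    while (!*stop) {
        struct pcap_pkthdr *hdr;
        const u_char *data;

        int ret = pcap_next_ex(handle, &hdr, &data);
        switch (ret) {
        case 1:   // packet matching the filter received - update stats here
            break;
        case 0:   // read timeout expired - loop around and re-check *stop
            break;
        default:  // -1 (error) or -2 (no more packets) - leave the loop
            *stop = true;
            break;
        }
    }
}

If the 'case 0' branch is never taken (as observed with v1.8.1-3), the thread only wakes up when a packet passes the BPF filter, so a stop request is never noticed.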

Here is my change to print the pcap_next_ex debug output:

diff --git a/server/pcaprxstats.cpp b/server/pcaprxstats.cpp
index 139ed65..a37ff7d 100644
--- a/server/pcaprxstats.cpp
+++ b/server/pcaprxstats.cpp
@@ -111,6 +111,8 @@ _skip_filter:
         const uchar *data;
 
         ret = pcap_next_ex(handle_, &hdr, &data);
+        qDebug("RxStats: pcap_next_ex(%s) returns %d, stop_ = %d\n",
+                qPrintable(device_), ret, stop_);
         switch (ret) {
             case 1: {
                 uint guid;
@ghost

Author

commented Dec 12, 2017

I had the idea that the stop function could call some pcap function to unblock the pcap receive thread. But https://www.tcpdump.org/manpages/pcap_breakloop.3pcap.txt states:

Note also that, in a multi-threaded application, if one thread is blocked in pcap_dispatch(), pcap_loop(), pcap_next(), or pcap_next_ex(), a call to pcap_breakloop() in a different thread will not unblock that thread; you will need to use whatever mechanism the OS provides for breaking a thread out of blocking calls in order to unblock the thread, such as thread cancellation or thread signaling in systems that support POSIX threads.

@ghost

Author

commented Dec 13, 2017

Just tested with libpcap v1.6.2; that works fine, no blocking. For me this is a good workaround.

Update: libpcap v1.7.4 works as well, so only the newest v1.8 seems problematic.

@pstavirs

Owner

commented Dec 13, 2017

@ehlers Thank you so much for the detailed debugging and investigation. Yes, the workaround of using an older version of libpcap should suffice for now. But I do need to find a way to fix this for the future - the rub is to find a solution that works across Linux, Mac and Windows as I said in #215. Now that 0.9 is out, I'll put on my thinking cap to find a way. Thanks once again!

@pstavirs pstavirs added the bug label Jan 13, 2018

pstavirs added a commit that referenced this issue Aug 10, 2019
Fix infinite loop when stopping capture etc.
On some platforms and/or some libpcap versions, libpcap doesn't support a
timeout, which makes an interactive stop impossible. So we now use a UNIX
signal to break out. Obviously this works only on *nix platforms - which
includes MacOS. For now the problem is not seen on Windows with WinPCAP,
so we should be fine. May need to revisit when we add Npcap support.

Fixes #215, #234
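
The commit message above describes breaking the blocked read with a UNIX signal. As a rough sketch of that technique (not the actual Drone code; the signal number, function names and thread handle are placeholders): install a no-op handler without SA_RESTART so the blocking read inside pcap_next_ex() is interrupted with EINTR instead of being transparently restarted, then send the signal to the receiver thread when a stop is requested.

#include <pcap.h>
#include <pthread.h>
#include <signal.h>

// Sketch only, assuming POSIX threads; SIGUSR1 and all names are placeholders.

static void noopHandler(int /*sig*/)
{
    // Intentionally empty: delivery alone interrupts the blocking read.
}

static void installBreakSignal()
{
    struct sigaction sa;
    sa.sa_handler = noopHandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            // no SA_RESTART: the interrupted syscall fails
    sigaction(SIGUSR1, &sa, 0); // with EINTR, so pcap_next_ex() returns
}

// Called from the control thread when the user disables stream stats.
static void requestStop(pthread_t rxThread, volatile bool *stop)
{
    *stop = true;
    pthread_kill(rxThread, SIGUSR1); // wake the thread blocked in pcap_next_ex()
}

As the commit message notes, this only works on *nix platforms (including MacOS); Windows needs a different mechanism.
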
@pstavirs

Owner

commented Aug 10, 2019

Fixed by 64d1525

@pstavirs pstavirs closed this Aug 10, 2019
