Skip to content
This repository has been archived by the owner on Apr 14, 2024. It is now read-only.

Drone blocks, when stream statistics is switched off in port configuration #234

Closed
ghost opened this issue Dec 11, 2017 · 13 comments
Closed
Labels

Comments

@ghost
Copy link

ghost commented Dec 11, 2017

Version 0.9 revision f32c50e on Debian 9.2 (Stretch)

Simple test:

  • enable stream statistics on port 0 (eth0)
  • disable stream statistics on port 0 (eth0)

Enabling works fine, but disabling port statistics leads to a blocked GUI (busy cursor remains active). The drone doesn't use CPU time in that situation, so it is not in an infinite loop.

The output during disabling of the stream statistics:

In PortConfigDialog
reservedBy_ = 0
modifyPort: portIndex = 0
client(CallMethod) sending 16 bytes <----
0001000200000008
method = 2:modifyPort
 req = PortConfigList
port {
  port_id {
    id: 0
  }
  is_tracking_stream_stats: false
}

---->
[127.0.0.1:46862] Server(on_clientSock_dataAvail): successfully received/parsed msg <----
[127.0.0.1:46862] method = 2
[127.0.0.1:46862] req = 
[127.0.0.1:46862] port {
[127.0.0.1:46862]   port_id {
[127.0.0.1:46862]     id: 0
[127.0.0.1:46862]   }
[127.0.0.1:46862]   is_tracking_stream_stats: false
[127.0.0.1:46862] }
[127.0.0.1:46862] ---->
[127.0.0.1:46862] In virtual void MyService::modifyPort(google::protobuf::RpcController*, const OstProto::PortConfigList*, OstProto::Ack*, google::protobuf::Closure*)
RpcChannel: queueing rpc since method 2 is pending;<----
 queued method = 13:getStats
queued message = 
port_id {
  id: 0
}
port_id {
  id: 1
}
port_id {
  id: 2
}
port_id {
  id: 3
}

---->
pendingCallList size = 1
@pstavirs
Copy link
Owner

@Ehlers Not able to repro this on Ubuntu or Windows. Busy cursor means Ostinato didn't get a response from Drone. Can you check what is happening on the Drone side? See if you can run gdb on Drone to check.

@ghost
Copy link
Author

ghost commented Dec 11, 2017

Quite unfortunate, that you can't reproduce it. As debugging this on my side will be very time-consuming for both of us, I suggest to close this issue. Maybe it's just my environment, that is unusual and I currently can't spend a lot of time to it.

@pstavirs
Copy link
Owner

No problem.

Though, if you could get me the drone side log that may help.

@ghost
Copy link
Author

ghost commented Dec 11, 2017

No problem, see attached file. I only started Ostinato, enabled stream statistics, then disabled it.
As the drone/GUI blocks I killed the drone (SIGINT), that's were the last 2 lines in the logs are coming from.
ostinato-drone.log.txt

@ghost
Copy link
Author

ghost commented Dec 11, 2017

Two observation:
Having traffic on the interface doesn't change anything, it still blocks.
When I disable the port (ifconfig eth0 down) while the drone is blocked, it immediately unblocks. The drone log then continues with:

virtual void PcapRxStats::run(): error reading packet (-1): The interface went down
user requested receiver stop

@ghost
Copy link
Author

ghost commented Dec 11, 2017

Well, I have compiled the drone with -config debug, started the drone in gdb, waited for the drone to lock (it happens also without optimization) and interrupted with ^C. Here just a backtrace, please tell me what you want be to debug.

[127.0.0.1:48140] Server(on_clientSock_dataAvail): successfully received/parsed msg <----
[127.0.0.1:48140] method = 2
[127.0.0.1:48140] req = 
[127.0.0.1:48140] port {
[127.0.0.1:48140]   port_id {
[127.0.0.1:48140]     id: 0
[127.0.0.1:48140]   }
[127.0.0.1:48140]   is_tracking_stream_stats: false
[127.0.0.1:48140] }
[127.0.0.1:48140] ---->
[127.0.0.1:48140] In virtual void MyService::modifyPort(google::protobuf::RpcController*, const OstProto::PortConfigList*, OstProto::Ack*, google::protobuf::Closure*)
^C
Thread 1 "drone" received signal SIGINT, Interrupt.
0x00007ffff5a5e6ad in poll () at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff5a5e6ad in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ffff52979f6 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007ffff5297b0c in g_main_context_iteration ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff6993854 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#4  0x00007ffff69617ef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#5  0x00007ffff6961b55 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#6  0x00007ffff6967bd9 in QCoreApplication::exec() ()
   from /usr/lib/x86_64-linux-gnu/libQtCore.so.4
#7  0x00005555555837b7 in main (argc=1, argv=0x7fffffffecc8)
    at drone_main.cpp:97

@pstavirs
Copy link
Owner

That BT is most likely of the main thread which listens for incoming connections. The actual processing for a client/controller is done in a different thread. It is the latter that I suspect is getting stuck somewhere.

Now the problem is that it might be difficult to isolate the specific thread in question because there are multiple threads in Drone. What might be easier is to put a breakpoint on MyService::modifyPort(). The breakpoint will be hit first when you enable, you can just choose to continue. The second time it will hit for the disable case - do a single step execution from there.

The problem is somewhere in that function only (or some function called by that function)

Thanks a bunch for your help!

@ghost
Copy link
Author

ghost commented Dec 12, 2017

Here my two GDB logs, the drone hangs in PcapRxStats::stop()
trace1.txt
trace2.txt

As I wrote previously, there is some (light) traffic coming in into that interface.

@ghost
Copy link
Author

ghost commented Dec 12, 2017

I think. I found the reason for blocking.

I've added a debug message to PcapRxStats::run() to show me, when (and with which status) pcap_next_ex returns. With the libpcap v1.8.1-3 from debian stretch it returns only, when at least one packet is available. It never returns with status 0 (timeout).

On a normal pcap capture this is also an issue (see #215), but normally some packets are always floating around. So it just takes a while until one is received and the capture can be stopped. But with the capture filter in PcapRxStats::run() it returns only, when a statistics packet is received, what's not the case in my test setup. Therefore pcap_next_ex will never return.

Here my change to print the pcap_next_ex debug:

diff --git a/server/pcaprxstats.cpp b/server/pcaprxstats.cpp
index 139ed65..a37ff7d 100644
--- a/server/pcaprxstats.cpp
+++ b/server/pcaprxstats.cpp
@@ -111,6 +111,8 @@ _skip_filter:
         const uchar *data;
 
         ret = pcap_next_ex(handle_, &hdr, &data);
+        qDebug("RxStats: pcap_next_ex(%s) returns %d, stop_ = %d\n",
+                qPrintable(device_), ret, stop_);
         switch (ret) {
             case 1: {
                 uint guid;

@ghost
Copy link
Author

ghost commented Dec 12, 2017

I had the idea, that the stop function could issue some pcap function to unblock the pcap receive thread. But https://www.tcpdump.org/manpages/pcap_breakloop.3pcap.txt states:

Note also that, in a multi-threaded application, if one thread is blocked in pcap_dispatch(), pcap_loop(), pcap_next(), or pcap_next_ex(), a call to pcap_breakloop() in a different thread will
not unblock that thread; you will need to use whatever mechanism the OS provides for breaking a thread out of blocking calls in order to unblock the thread, such as thread cancellation or thread signaling in systems that support POSIX threads.

@ghost
Copy link
Author

ghost commented Dec 13, 2017

Just tested with libpcap v1.6.2, that works fine, no blocking. For me this is a good workaround.

Update: libpcap v1.7.4 works as well, so only the newest v1.8 seems problematic.

@pstavirs
Copy link
Owner

@Ehlers Thank you so much for the detailed debugging and investigation. Yes, the workaround of using an older version of libpcap should suffice for now. But I do need to find a way to fix this for the future - the rub is to find a solution that works across Linux, Mac and Windows as I said in #215. Now that 0.9 is out, I'll put on my thinking cap to find a way. Thanks once again!

@pstavirs pstavirs added the bug label Jan 13, 2018
pstavirs added a commit that referenced this issue Aug 10, 2019
On some platforms and/or some libpcap verisons, libpcap doesn't support a
timeout which makes interactive stop not possible. So we now use a UNIX
signal to break out. Obviously this works only on *nix platforms - which
includes MacOS. For now the problem is not seen on Windows with WinPCAP,
so we should be fine. May need to revisit when we add Npcap support.

Fixes #215, #234
@pstavirs
Copy link
Owner

Fixed by 64d1525

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

1 participant