Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick connections/reconnections lead to TSAN errors with gcc #77

Closed
bsergean opened this issue May 14, 2019 · 15 comments
Closed

Quick connections/reconnections lead to TSAN errors with gcc #77

bsergean opened this issue May 14, 2019 · 15 comments

Comments

@bsergean
Copy link
Collaborator

No description provided.

@AlexandreK38
Copy link
Contributor

This is the test we commented, it’s more like I said: you stop before connection is marked OPEN and the one after has enough time, then you have your tsan errors. Several stop before being open work, several stop after being connected work too; it seems the transition fwils

@bsergean
Copy link
Collaborator Author

You're right it's a good test to use to track this down.
There's _requestInitCancellation which we can use to interrupt the code that tries to reconnect.
We probably need more synchronization between the 2 threads, it's clearly a race condition between stop and connect.

WARNING: ThreadSanitizer: data race (pid=3630)
  Write of size 8 at 0x7ffc53c08db8 by thread T5:
    #0 void std::swap<ix::Socket*>(ix::Socket*&, ix::Socket*&) /usr/include/c++/5/bits/move.h:187 (ixwebsocket_unittest+0x0000005d9444)
    #1 std::__shared_ptr<ix::Socket, (__gnu_cxx::_Lock_policy)2>::swap(std::__shared_ptr<ix::Socket, (__gnu_cxx::_Lock_policy)2>&) /usr/include/c++/5/bits/shared_ptr_base.h:1076 (ixwebsocket_unittest+0x0000005d90ce)
    #2 std::__shared_ptr<ix::Socket, (__gnu_cxx::_Lock_policy)2>::operator=(std::__shared_ptr<ix::Socket, (__gnu_cxx::_Lock_policy)2>&&) /usr/include/c++/5/bits/shared_ptr_base.h:1000 (ixwebsocket_unittest+0x0000005d8ef6)
    #3 std::shared_ptr<ix::Socket>::operator=(std::shared_ptr<ix::Socket>&&) /usr/include/c++/5/bits/shared_ptr.h:294 (ixwebsocket_unittest+0x0000005d8b90)
    #4 ix::WebSocketTransport::connectToUrl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocketTransport.cpp:154 (ixwebsocket_unittest+0x0000005ee93e)
    #5 ix::WebSocket::connect(int) /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocket.cpp:169 (ixwebsocket_unittest+0x0000005e36a1)
    #6 ix::WebSocket::checkConnection(bool) /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocket.cpp:251 (ixwebsocket_unittest+0x0000005e3f1d)
    #7 ix::WebSocket::run() /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocket.cpp:284 (ixwebsocket_unittest+0x0000005e45ad)
    #8 void std::_Mem_fn_base<void (ix::WebSocket::*)(), true>::operator()<, void>(ix::WebSocket*) const /usr/include/c++/5/functional:600 (ixwebsocket_unittest+0x0000005e9794)
    #9 void std::_Bind_simple<std::_Mem_fn<void (ix::WebSocket::*)()> (ix::WebSocket*)>::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/include/c++/5/functional:1531 (ixwebsocket_unittest+0x0000005e96ab)
    #10 std::_Bind_simple<std::_Mem_fn<void (ix::WebSocket::*)()> (ix::WebSocket*)>::operator()() /usr/include/c++/5/functional:1520 (ixwebsocket_unittest+0x0000005e94dc)
    #11 std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (ix::WebSocket::*)()> (ix::WebSocket*)> >::_M_run() /usr/include/c++/5/thread:115 (ixwebsocket_unittest+0x0000005e9414)
    #12 <null> <null> (libstdc++.so.6+0x0000000b8c7f)
  Previous read of size 8 at 0x7ffc53c08db8 by main thread:
    #0 std::__shared_ptr<ix::Socket, (__gnu_cxx::_Lock_policy)2>::operator->() const /usr/include/c++/5/bits/shared_ptr_base.h:1055 (ixwebsocket_unittest+0x0000005aa4c0)
    #1 ix::WebSocketTransport::close(unsigned short, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, bool) /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocketTransport.cpp:1035 (ixwebsocket_unittest+0x0000005f320a)
    #2 ix::WebSocket::close() /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocket.cpp:217 (ixwebsocket_unittest+0x0000005e3d6a)
    #3 ix::WebSocket::stop() /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocket.cpp:147 (ixwebsocket_unittest+0x0000005e3502)
    #4 stop /home/travis/build/machinezone/IXWebSocket/test/cmd_websocket_chat.cpp:95 (ixwebsocket_unittest+0x0000005c4b42)
    #5 ____C_A_T_C_H____T_E_S_T____0 /home/travis/build/machinezone/IXWebSocket/test/cmd_websocket_chat.cpp:326 (ixwebsocket_unittest+0x0000005c6ab8)
    #6 Catch::TestInvokerAsFunction::invoke() const /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:11868 (ixwebsocket_unittest+0x000000426050)
    #7 Catch::TestCase::invoke() const /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:11769 (ixwebsocket_unittest+0x000000425462)
    #8 Catch::RunContext::invokeActiveTestCase() /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:10622 (ixwebsocket_unittest+0x00000041e755)
    #9 Catch::RunContext::runCurrentTest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:10595 (ixwebsocket_unittest+0x00000041e3ba)
    #10 Catch::RunContext::runTest(Catch::TestCase const&) /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:10365 (ixwebsocket_unittest+0x00000041c4da)
    #11 runTests /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:10924 (ixwebsocket_unittest+0x000000420138)
    #12 Catch::Session::runInternal() /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:11119 (ixwebsocket_unittest+0x000000421831)
    #13 Catch::Session::run() /home/travis/build/machinezone/IXWebSocket/test/Catch2/single_include/catch.hpp:11076 (ixwebsocket_unittest+0x0000004214cd)
    #14 int Catch::Session::run<char>(int, char const* const*) <null> (ixwebsocket_unittest+0x00000048ca3f)
    #15 main /home/travis/build/machinezone/IXWebSocket/test/test_runner.cpp:25 (ixwebsocket_unittest+0x00000043c694)
  As if synchronized via sleep:
    #0 nanosleep <null> (libtsan.so.0+0x000000043c7a)
    #1 void std::this_thread::sleep_for<double, std::ratio<1l, 1000l> >(std::chrono::duration<double, std::ratio<1l, 1000l> > const&) /usr/include/c++/5/thread:292 (ixwebsocket_unittest+0x000000562956)
    #2 ix::WebSocket::checkConnection(bool) /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocket.cpp:247 (ixwebsocket_unittest+0x0000005e3eed)
    #3 ix::WebSocket::run() /home/travis/build/machinezone/IXWebSocket/ixwebsocket/IXWebSocket.cpp:284 (ixwebsocket_unittest+0x0000005e45ad)
    #4 void std::_Mem_fn_base<void (ix::WebSocket::*)(), true>::operator()<, void>(ix::WebSocket*) const /usr/include/c++/5/functional:600 (ixwebsocket_unittest+0x0000005e9794)
    #5 void std::_Bind_simple<std::_Mem_fn<void (ix::WebSocket::*)()> (ix::WebSocket*)>::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/include/c++/5/functional:1531 (ixwebsocket_unittest+0x0000005e96ab)
    #6 std::_Bind_simple<std::_Mem_fn<void (ix::WebSocket::*)()> (ix::WebSocket*)>::operator()() /usr/include/c++/5/functional:1520 (ixwebsocket_unittest+0x0000005e94dc)
    #7 std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (ix::WebSocket::*)()> (ix::WebSocket*)> >::_M_run() /usr/include/c++/5/thread:115 (ixwebsocket_unittest+0x0000005e9414)
    #8 <null> <null> (libstdc++.so.6+0x0000000b8c7f)

@AlexandreK38
Copy link
Contributor

AlexandreK38 commented May 14, 2019 via email

@AlexandreK38
Copy link
Contributor

I may have an idea of the problem and how to fix it, I will give it a try tomorrow

@bsergean
Copy link
Collaborator Author

If you can quickly describe your idea that would be great.

I was thinking about having 2 different socket objects, but that sounds like a bandaid and a bad idea.

@AlexandreK38
Copy link
Contributor

The idea is to prevent the checkConnection() or at least the connect() inside it to set the _socket var before the close() is done (as the close is called on the other thread).
My idea is to have a mutex in Websocket.h and lock it in .cpp, in close() method before the call to _ws.close(), and also in the checkConnection(); the place in the latter would probably be at the first line in the loop but to be check for the behavior
Anyway that’s the idea :)

@AlexandreK38
Copy link
Contributor

The place of the mutex in the checkConnection has to be placed carefully because we need to be sure that we sent data on the background thread when closing on the old socket

@AlexandreK38
Copy link
Contributor

And if this is working, it may also solve the same tsan errors we had in the other test of close code and reason, when server is closing; if not the fix will looks like this one but adapted for no thread

@bsergean
Copy link
Collaborator Author

bsergean commented May 15, 2019 via email

@AlexandreK38
Copy link
Contributor

Okay, great 👍

@AlexandreK38
Copy link
Contributor

I’ll be happy if all the errors preventing the PR to be merged and so on are finally gone

@bsergean
Copy link
Collaborator Author

So I added a mutex (recursive_mutex actually), I've enabled the test that was causing trouble, and now the tsan errors are gone. I've put the mutex in WebSocketTransport, so that they are kindof fine grained.

The sad news is that I'd like not to use a recursive mutex. Now the failure on travis seems to be on macOS. If you can try to make a PR and try different strategies.

I think that some of the complications relate to the fact that we are more correct now when processing the close code. But I'm not too sure about that. Maybe that problem has always been there.

@bsergean
Copy link
Collaborator Author

9315eb5

@bsergean
Copy link
Collaborator Author

I have a 100% reproducible hang on macOS now (in master ...). That should make the bug easy to repro.

@bsergean
Copy link
Collaborator Author

So the hang was happening because we are polling in the background thread, without a timeout, and the control thread which is closing is not exiting of its main loop before calling poll.

(control thread is waiting on _thread.join()).

Last commit fixes this and we're back to a reliable green build -> 671c9f8

Unfortunately the close and friends unittest are not working now ... I think they are relying on code which can potentially hang. I wonder how we should fix this, but now we have some reliable code that triggers those problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants