rpc_mclient tests fail occasionally #646

kmaehashi opened this Issue · 7 comments

2 participants

Jubatus member

In the CI environment, test failures are consistently detected in about 1% of runs while testing ./configure option variations. The failure is almost always in the rpc_mclient unit test:

(Red bar indicates number of test failures)

Jubatus member

Maybe we should move rpc_client_test from the waf unit tests to jubatest (as with client_test).

@kmaehashi kmaehashi self-assigned this
Jubatus member

I'm not sure whether this is a bug; I'll run this test repeatedly (heat run) in my local environment and see what happens.

@kmaehashi kmaehashi modified the milestone: Near Future, 0.5.3
Jubatus member

I tested this on my local environment:

build/jubatus/server/common/mprpc/rpc_client_test --gtest_filter=rpc_mclient.small --gtest_repeat=-1

and got this:

Repeating all tests (iteration 22) . . .

Note: Google Test filter = rpc_mclient.small
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from rpc_mclient
[ RUN      ] rpc_mclient.small
thread terminated with throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
terminate called after throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
zsh: abort (core dumped)  build/jubatus/server/common/mprpc/rpc_client_test  --gtest_repeat=-1

Still investigating.

Jubatus member

Other variations I saw:

terminate called after throwing an instance of 'msgpack::rpc::system_error'
  what():  Connection reset by peer
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to resolve host name

(Using "" instead of "localhost" in the test code seems to work around the latter case.)

Jubatus member

I found that 3 file descriptors leak (2 epoll and 1 eventfd) when the close() method of jubatus::server::common::mprpc::rpc_server is called while a client is connected to the server. I also confirmed that the destructor of the mpio kernel does not run in this case. This seems like a bug in mpio, but more investigation is needed.

For the Jubatus use case, this issue is NOT fatal, as we only call the close() method when shutting down the process.

However, it is still not clear to me why this happens in the unit test programs; unit test processes are reinvoked every time, so an fd leak should not accumulate.
To see what is actually happening on the CI server, I proposed recording stderr in waf tests (#726).
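A leak like the one described above can be checked by counting the process's open file descriptors before and after the suspect call. The following is only a minimal sketch, assuming Linux with /proc mounted; `count_open_fds` is a hypothetical helper written for illustration and is not part of Jubatus or mpio.

```cpp
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

// Hypothetical helper (not a Jubatus or mpio API).
// Counts the open file descriptors of the current process by listing
// /proc/self/fd (Linux only). Returns -1 if the directory cannot be opened.
long count_open_fds() {
  DIR* dir = opendir("/proc/self/fd");
  if (dir == NULL) {
    return -1;
  }
  // opendir() itself holds a descriptor; exclude it from the count.
  const int self_fd = dirfd(dir);
  long count = 0;
  while (struct dirent* ent = readdir(dir)) {
    if (ent->d_name[0] == '.') {
      continue;  // skip "." and ".."
    }
    if (std::atoi(ent->d_name) == self_fd) {
      continue;  // skip the descriptor used for this listing
    }
    ++count;
  }
  closedir(dir);
  return count;
}
```

Calling this once before the server is created and once after close() returns would show a difference of 3 if the leak reported above occurs. Which descriptors remain open can be seen by listing /proc/<pid>/fd for the test process.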

Jubatus member

Discussion from the meeting on 2014-03-25:

  • It seems that the fd leak problem is not related to the test failure.
    • Apply #726 and see the actual error message in the CI environment.
    • The CI environment is not configured with sysctl -w net.ipv4.tcp_tw_reuse=1. This may be the root cause.
  • Raise a separate issue on jubatus-mpio for the leak problem.

I've changed the milestone of this issue to Pending (until it reproduces in the CI environment).

@kmaehashi kmaehashi modified the milestone: Pending, 0.5.3
Jubatus member

It reproduced in the CI environment:

The stderr was as follows:

terminate called after throwing an instance of 'mp::system_error'
  what():  bind failed: Address already in use

I think this indicates that the test failures are caused by net.ipv4.tcp_tw_reuse not being set to 1 (sysctl -w net.ipv4.tcp_tw_reuse=1) on the CI server.

I set sysctl -w net.ipv4.tcp_tw_reuse=1 on the CI server on 2014-04-22 at 17:38. I'll close this issue once we confirm that these test failures have disappeared in the CI environment (monitoring for about 1 week should be enough).
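For reference, the "bind failed: Address already in use" error corresponds to bind() failing with EADDRINUSE. The sketch below only illustrates that error with plain POSIX sockets; it is not the fix applied here (that was the sysctl change), and it does not use the mpio or msgpack-rpc APIs. The helper names `listen_on_loopback` and `bound_port` are hypothetical. Passing port 0 lets the kernel pick a free port, which is a different technique that tests sometimes use to avoid fixed-port collisions.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>

// Hypothetical helper: bind and listen on 127.0.0.1:port.
// port == 0 asks the kernel to choose a free port.
// Returns the listening fd, or -1 with errno set by the failing call.
int listen_on_loopback(unsigned short port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }
  sockaddr_in addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      listen(fd, 16) != 0) {
    const int saved = errno;  // close() may overwrite errno
    close(fd);
    errno = saved;
    return -1;
  }
  return fd;
}

// Hypothetical helper: the port a socket is actually bound to, or 0 on failure.
unsigned short bound_port(int fd) {
  sockaddr_in addr;
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    return 0;
  }
  return ntohs(addr.sin_port);
}
```

Calling listen_on_loopback(0) and then listen_on_loopback() again with the port reported by bound_port() makes the second call fail with errno == EADDRINUSE while the first socket is still listening.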
