rpc_mclient tests fail occasionally #646

kmaehashi · 2014-02-04T02:58:50Z

In CI environment, about 1% of test failures are constantly detected while testing ./configure option variations. It almost always fails in rpc_mclient unit test:

http://ci.jubat.us/job/develop_configure/723/testReport/%28root%29/configure/__enable_debug___enable_zookeeper___enable_re2___enable_mecab/#footer

(Red bar indicates number of test failures)

suma · 2014-02-05T02:34:07Z

Maybe we should move rpc_client_test to jubatest (such as client_test) from waf unittest.

kmaehashi · 2014-02-10T05:32:36Z

I'm not sure if this is a bug or not; I'll heat-run this test in my local env and see what happens.

kmaehashi · 2014-03-19T01:14:35Z

I tested this on my local environment:

build/jubatus/server/common/mprpc/rpc_client_test --gtest_filter=rpc_mclient.small --gtest_repeat=-1

and got this:

Repeating all tests (iteration 22) . . .

Note: Google Test filter = rpc_mclient.small
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from rpc_mclient
[ RUN      ] rpc_mclient.small
thread terminated with throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
terminate called after throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
zsh: abort (core dumped)  build/jubatus/server/common/mprpc/rpc_client_test  --gtest_repeat=-1

Still investigating on it.

kmaehashi · 2014-03-19T02:16:33Z

Other variations I saw:

terminate called after throwing an instance of 'msgpack::rpc::system_error'
  what():  Connection reset by peer

terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to resolve host name

(using "127.0.0.1" instead of "localhost" in test codes seems to work for the latter case)

kmaehashi · 2014-03-24T01:25:48Z

I found that 3 file desciptor leak (2 epoll and 1 eventfd) when close() method of jubatus::server::common::mprpc::rpc_server is called while client is connected to the server. I also confirmed that destructor of mpio kernel does not run in this case. Seems like a bug in mpio, but more investigation is needed.

As for Jubatus usecase, this issue is NOT fatal, as we only call close() method when shutting down the process.

However, I'm still not clear why this happens in unit test programs; unit test processes are reinvoked every time, so fd leak should not happen.
To see what actually happening in CI server, I proposed to record stderr in waf tests (#726).

kmaehashi · 2014-03-25T05:54:59Z

Discussion from the meeting on 2014-03-25:

It seems that the fd leak problem is not related to the test failure.
- Apply Print stderr in waf-unittest #726 and see the actual error message in CI environment.
- CI environment is not configured as sysctl -w net.ipv4.tcp_tw_reuse=1. This may be a root cause.
Raise another issue on jubatus-mpio for the leak problem.

I've changed milestone of this issue to Pending (until it reproduces in CI environment).

kmaehashi · 2014-04-22T08:41:32Z

It reproduced in the CI environment: http://ci.jubat.us/job/develop_configure/860/testReport/junit/%28root%29/configure/__enable_debug___disable_eigen___enable_ux/

The stderr was as follows:

terminate called after throwing an instance of 'mp::system_error'
  what():  bind failed: Address already in use

I think this indicates that not setting sysctl -w net.ipv4.tcp_tw_reuse=1 on CI server is the cause of these test failures.

I set sysctl -w net.ipv4.tcp_tw_reuse=1 on 2014-04-22, at 17:38. I'll close this issue once we confirm that these test failures disappear in CI environment (monitoring about 1 week may be enough).

kmaehashi · 2016-06-03T11:30:21Z

I don't see this error log for 2 years in CI environment or locally.

kmaehashi self-assigned this Feb 10, 2014

kmaehashi modified the milestones: Near Future, 0.5.3 Feb 24, 2014

kmaehashi modified the milestones: Pending, 0.5.3 Mar 25, 2014

kmaehashi mentioned this issue Mar 25, 2014

file descriptors leak when stopping server while client is still connected jubatus/jubatus-mpio#10

Closed

kmaehashi closed this as completed Jun 3, 2016

kmaehashi removed this from the Pending milestone Jun 3, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rpc_mclient tests fail occasionally #646

rpc_mclient tests fail occasionally #646

kmaehashi commented Feb 4, 2014

suma commented Feb 5, 2014

kmaehashi commented Feb 10, 2014

kmaehashi commented Mar 19, 2014

kmaehashi commented Mar 19, 2014

kmaehashi commented Mar 24, 2014

kmaehashi commented Mar 25, 2014

kmaehashi commented Apr 22, 2014

kmaehashi commented Jun 3, 2016

rpc_mclient tests fail occasionally #646

rpc_mclient tests fail occasionally #646

Comments

kmaehashi commented Feb 4, 2014

suma commented Feb 5, 2014

kmaehashi commented Feb 10, 2014

kmaehashi commented Mar 19, 2014

kmaehashi commented Mar 19, 2014

kmaehashi commented Mar 24, 2014

kmaehashi commented Mar 25, 2014

kmaehashi commented Apr 22, 2014

kmaehashi commented Jun 3, 2016