rpc_mclient tests fail occasionally #646

Closed
kmaehashi opened this Issue Feb 4, 2014 · 8 comments

Comments

Projects
None yet
2 participants
@kmaehashi
Member

kmaehashi commented Feb 4, 2014

In CI environment, about 1% of test failures are constantly detected while testing ./configure option variations. It almost always fails in rpc_mclient unit test:

http://ci.jubat.us/job/develop_configure/723/testReport/%28root%29/configure/__enable_debug___enable_zookeeper___enable_re2___enable_mecab/#footer

Graph
(Red bar indicates number of test failures)

@suma

This comment has been minimized.

Show comment
Hide comment
@suma

suma Feb 5, 2014

Member

Maybe we should move rpc_client_test to jubatest (such as client_test) from waf unittest.

Member

suma commented Feb 5, 2014

Maybe we should move rpc_client_test to jubatest (such as client_test) from waf unittest.

@kmaehashi kmaehashi self-assigned this Feb 10, 2014

@kmaehashi

This comment has been minimized.

Show comment
Hide comment
@kmaehashi

kmaehashi Feb 10, 2014

Member

I'm not sure if this is a bug or not; I'll heat-run this test in my local env and see what happens.

Member

kmaehashi commented Feb 10, 2014

I'm not sure if this is a bug or not; I'll heat-run this test in my local env and see what happens.

@kmaehashi kmaehashi modified the milestones: Near Future, 0.5.3 Feb 24, 2014

@kmaehashi

This comment has been minimized.

Show comment
Hide comment
@kmaehashi

kmaehashi Mar 19, 2014

Member

I tested this on my local environment:

build/jubatus/server/common/mprpc/rpc_client_test --gtest_filter=rpc_mclient.small --gtest_repeat=-1

and got this:

Repeating all tests (iteration 22) . . .

Note: Google Test filter = rpc_mclient.small
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from rpc_mclient
[ RUN      ] rpc_mclient.small
thread terminated with throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
terminate called after throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
zsh: abort (core dumped)  build/jubatus/server/common/mprpc/rpc_client_test  --gtest_repeat=-1

Still investigating on it.

Member

kmaehashi commented Mar 19, 2014

I tested this on my local environment:

build/jubatus/server/common/mprpc/rpc_client_test --gtest_filter=rpc_mclient.small --gtest_repeat=-1

and got this:

Repeating all tests (iteration 22) . . .

Note: Google Test filter = rpc_mclient.small
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from rpc_mclient
[ RUN      ] rpc_mclient.small
thread terminated with throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
terminate called after throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
zsh: abort (core dumped)  build/jubatus/server/common/mprpc/rpc_client_test  --gtest_repeat=-1

Still investigating on it.

@kmaehashi

This comment has been minimized.

Show comment
Hide comment
@kmaehashi

kmaehashi Mar 19, 2014

Member

Other variations I saw:

terminate called after throwing an instance of 'msgpack::rpc::system_error'
  what():  Connection reset by peer
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to resolve host name

(using "127.0.0.1" instead of "localhost" in test codes seems to work for the latter case)

Member

kmaehashi commented Mar 19, 2014

Other variations I saw:

terminate called after throwing an instance of 'msgpack::rpc::system_error'
  what():  Connection reset by peer
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to resolve host name

(using "127.0.0.1" instead of "localhost" in test codes seems to work for the latter case)

@kmaehashi

This comment has been minimized.

Show comment
Hide comment
@kmaehashi

kmaehashi Mar 24, 2014

Member

I found that 3 file desciptor leak (2 epoll and 1 eventfd) when close() method of jubatus::server::common::mprpc::rpc_server is called while client is connected to the server. I also confirmed that destructor of mpio kernel does not run in this case. Seems like a bug in mpio, but more investigation is needed.

As for Jubatus usecase, this issue is NOT fatal, as we only call close() method when shutting down the process.

However, I'm still not clear why this happens in unit test programs; unit test processes are reinvoked every time, so fd leak should not happen.
To see what actually happening in CI server, I proposed to record stderr in waf tests (#726).

Member

kmaehashi commented Mar 24, 2014

I found that 3 file desciptor leak (2 epoll and 1 eventfd) when close() method of jubatus::server::common::mprpc::rpc_server is called while client is connected to the server. I also confirmed that destructor of mpio kernel does not run in this case. Seems like a bug in mpio, but more investigation is needed.

As for Jubatus usecase, this issue is NOT fatal, as we only call close() method when shutting down the process.

However, I'm still not clear why this happens in unit test programs; unit test processes are reinvoked every time, so fd leak should not happen.
To see what actually happening in CI server, I proposed to record stderr in waf tests (#726).

@kmaehashi

This comment has been minimized.

Show comment
Hide comment
@kmaehashi

kmaehashi Mar 25, 2014

Member

Discussion from the meeting on 2014-03-25:

  • It seems that the fd leak problem is not related to the test failure.
    • Apply #726 and see the actual error message in CI environment.
    • CI environment is not configured as sysctl -w net.ipv4.tcp_tw_reuse=1. This may be a root cause.
  • Raise another issue on jubatus-mpio for the leak problem.

I've changed milestone of this issue to Pending (until it reproduces in CI environment).

Member

kmaehashi commented Mar 25, 2014

Discussion from the meeting on 2014-03-25:

  • It seems that the fd leak problem is not related to the test failure.
    • Apply #726 and see the actual error message in CI environment.
    • CI environment is not configured as sysctl -w net.ipv4.tcp_tw_reuse=1. This may be a root cause.
  • Raise another issue on jubatus-mpio for the leak problem.

I've changed milestone of this issue to Pending (until it reproduces in CI environment).

@kmaehashi

This comment has been minimized.

Show comment
Hide comment
@kmaehashi

kmaehashi Apr 22, 2014

Member

It reproduced in the CI environment: http://ci.jubat.us/job/develop_configure/860/testReport/junit/%28root%29/configure/__enable_debug___disable_eigen___enable_ux/

The stderr was as follows:

terminate called after throwing an instance of 'mp::system_error'
  what():  bind failed: Address already in use

I think this indicates that not setting sysctl -w net.ipv4.tcp_tw_reuse=1 on CI server is the cause of these test failures.

I set sysctl -w net.ipv4.tcp_tw_reuse=1 on 2014-04-22, at 17:38. I'll close this issue once we confirm that these test failures disappear in CI environment (monitoring about 1 week may be enough).

Member

kmaehashi commented Apr 22, 2014

It reproduced in the CI environment: http://ci.jubat.us/job/develop_configure/860/testReport/junit/%28root%29/configure/__enable_debug___disable_eigen___enable_ux/

The stderr was as follows:

terminate called after throwing an instance of 'mp::system_error'
  what():  bind failed: Address already in use

I think this indicates that not setting sysctl -w net.ipv4.tcp_tw_reuse=1 on CI server is the cause of these test failures.

I set sysctl -w net.ipv4.tcp_tw_reuse=1 on 2014-04-22, at 17:38. I'll close this issue once we confirm that these test failures disappear in CI environment (monitoring about 1 week may be enough).

@kmaehashi

This comment has been minimized.

Show comment
Hide comment
@kmaehashi

kmaehashi Jun 3, 2016

Member

I don't see this error log for 2 years in CI environment or locally.

Member

kmaehashi commented Jun 3, 2016

I don't see this error log for 2 years in CI environment or locally.

@kmaehashi kmaehashi closed this Jun 3, 2016

@kmaehashi kmaehashi removed this from the Pending milestone Jun 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment