Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc_mclient tests fail occasionally #646

Closed
kmaehashi opened this issue Feb 4, 2014 · 8 comments
Closed

rpc_mclient tests fail occasionally #646

kmaehashi opened this issue Feb 4, 2014 · 8 comments
Assignees

Comments

@kmaehashi
Copy link
Member

In CI environment, about 1% of test failures are constantly detected while testing ./configure option variations. It almost always fails in rpc_mclient unit test:

http://ci.jubat.us/job/develop_configure/723/testReport/%28root%29/configure/__enable_debug___enable_zookeeper___enable_re2___enable_mecab/#footer

Graph
(Red bar indicates number of test failures)

@suma
Copy link
Member

suma commented Feb 5, 2014

Maybe we should move rpc_client_test to jubatest (such as client_test) from waf unittest.

@kmaehashi kmaehashi self-assigned this Feb 10, 2014
@kmaehashi
Copy link
Member Author

I'm not sure if this is a bug or not; I'll heat-run this test in my local env and see what happens.

@kmaehashi kmaehashi modified the milestones: Near Future, 0.5.3 Feb 24, 2014
@kmaehashi
Copy link
Member Author

I tested this on my local environment:

build/jubatus/server/common/mprpc/rpc_client_test --gtest_filter=rpc_mclient.small --gtest_repeat=-1

and got this:

Repeating all tests (iteration 22) . . .

Note: Google Test filter = rpc_mclient.small
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from rpc_mclient
[ RUN      ] rpc_mclient.small
thread terminated with throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
terminate called after throwing an instance of 'mp::pthread_error'
  what():  failed to lock pthread mutex: Unknown error -22
zsh: abort (core dumped)  build/jubatus/server/common/mprpc/rpc_client_test  --gtest_repeat=-1

Still investigating on it.

@kmaehashi
Copy link
Member Author

Other variations I saw:

terminate called after throwing an instance of 'msgpack::rpc::system_error'
  what():  Connection reset by peer
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to resolve host name

(using "127.0.0.1" instead of "localhost" in test codes seems to work for the latter case)

@kmaehashi
Copy link
Member Author

I found that 3 file desciptor leak (2 epoll and 1 eventfd) when close() method of jubatus::server::common::mprpc::rpc_server is called while client is connected to the server. I also confirmed that destructor of mpio kernel does not run in this case. Seems like a bug in mpio, but more investigation is needed.

As for Jubatus usecase, this issue is NOT fatal, as we only call close() method when shutting down the process.

However, I'm still not clear why this happens in unit test programs; unit test processes are reinvoked every time, so fd leak should not happen.
To see what actually happening in CI server, I proposed to record stderr in waf tests (#726).

@kmaehashi
Copy link
Member Author

Discussion from the meeting on 2014-03-25:

  • It seems that the fd leak problem is not related to the test failure.
    • Apply Print stderr in waf-unittest #726 and see the actual error message in CI environment.
    • CI environment is not configured as sysctl -w net.ipv4.tcp_tw_reuse=1. This may be a root cause.
  • Raise another issue on jubatus-mpio for the leak problem.

I've changed milestone of this issue to Pending (until it reproduces in CI environment).

@kmaehashi
Copy link
Member Author

It reproduced in the CI environment: http://ci.jubat.us/job/develop_configure/860/testReport/junit/%28root%29/configure/__enable_debug___disable_eigen___enable_ux/

The stderr was as follows:

terminate called after throwing an instance of 'mp::system_error'
  what():  bind failed: Address already in use

I think this indicates that not setting sysctl -w net.ipv4.tcp_tw_reuse=1 on CI server is the cause of these test failures.

I set sysctl -w net.ipv4.tcp_tw_reuse=1 on 2014-04-22, at 17:38. I'll close this issue once we confirm that these test failures disappear in CI environment (monitoring about 1 week may be enough).

@kmaehashi
Copy link
Member Author

I don't see this error log for 2 years in CI environment or locally.

@kmaehashi kmaehashi removed this from the Pending milestone Jun 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants