[leo_rpc]Performance degradation due to leaking tcp connections #141

Closed
mocchira opened this Issue Feb 20, 2014 · 12 comments

Owner

mocchira commented Feb 20, 2014

Under very high load, LeoFS throughput can degrade because of leaked TCP connections, each tied to an Erlang process.
But since leo_rpc is used under relatively low load, this is not critical for now.

mocchira added this to the 1.0.0 milestone Feb 20, 2014

mocchira added the Bug label Feb 20, 2014

mocchira self-assigned this Feb 20, 2014

@yosukehara yosukehara modified the milestone: Future works, 1.0.0 Feb 26, 2014

@mocchira mocchira modified the milestone: 1.2.8, Future works Mar 27, 2015

Owner

mocchira commented Mar 27, 2015

It turned out that bulk-loading 10,000 objects with the default MDC (multi-datacenter replication) settings reproduces this issue easily, so we are looking into it now.

Not using a TCP connection pool might be the easiest solution (in my local environment, this works).
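The pool-free workaround can be sketched as follows (a minimal Python illustration, not leo_rpc's actual Erlang code; `rpc_call_unpooled` is a hypothetical name): each call dials, sends, and closes, so no connection outlives a single request.

```python
import socket

def rpc_call_unpooled(addr, payload):
    # Hypothetical helper (not leo_rpc's API): dial a fresh TCP
    # connection for a single request and close it on the way out,
    # so nothing can leak between calls. The cost is one TCP
    # handshake per request, acceptable at relatively low load.
    with socket.create_connection(addr) as sock:
        sock.sendall(payload)
        return sock.recv(4096)
```

The trade-off is latency per call versus the pool's bookkeeping bugs; under the low-load assumption stated above, the handshake cost is tolerable.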

Owner

mocchira commented Mar 27, 2015

This feature branch has fixed the issue by disabling the TCP connection pool:
https://github.com/leo-project/leo_rpc/tree/feature/disable_connection_pool

Owner

yosukehara commented Mar 30, 2015

I've fixed this issue in leo-project/leo_rpc@92e28ee. When leo_rpc encounters errors, it cannot reconnect to a remote node because it does not close the old connection. Next, we need to check the MDC-replication mechanism further through stress tests and benchmarks.
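The reconnect bug described above can be sketched like this (illustrative Python, not leo_rpc's actual Erlang code; `RpcConn` is a hypothetical name): forgetting to close the old socket before reconnecting leaks one descriptor per failure.

```python
import socket

class RpcConn:
    # Hypothetical sketch of the bug: reconnecting without closing the
    # previous socket leaks one file descriptor per failed request.

    def __init__(self, host, port):
        self.host, self.port = host, port
        self.sock = None

    def reconnect_leaky(self):
        # BUG: the old socket object is dropped without close(), so its
        # fd (and the server's half of the connection) stays open.
        self.sock = socket.create_connection((self.host, self.port))

    def reconnect_fixed(self):
        # FIX: close the stale connection before dialing a new one.
        if self.sock is not None:
            self.sock.close()
        self.sock = socket.create_connection((self.host, self.port))
```

Under sustained errors the leaky variant exhausts fds (and server-side acceptor slots); the fixed variant keeps exactly one live connection.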

Owner

mocchira commented Mar 31, 2015

@yosukehara We still have TCP connection leaks on the latest develop branch.
The error below continues to be generated on the leo_storage node(s):

 [W] storage_0@127.0.0.1 2015-03-31 01:46:45.543806 +0000    1427766405  leo_sync_remote_cluster:defer_stack/1        82  key:test/1662, cause:sending_data_to_remote
Owner

yosukehara commented Mar 31, 2015

@mocchira Thank you for sharing the info. I'll try to check this issue again.

Owner

yosukehara commented Mar 31, 2015

I've benchmarked leo_rpc with large files; the result is as follows.
I could not reproduce the issue, so I'll check the behavior of LeoFS Storage as the next step.

leo_rpc's benchmark result:

(benchmark result image: leo_rpc_bench_result_32mb)

basho_bench configuration file:

{mode, max}.
{duration, 10}.
{concurrent, 64}.
{remote_node_ip,   "127.0.0.1"}.
{remote_node_port, 13076}.
{driver, basho_bench_driver_leo_rpc}.
{key_generator, {int_to_bin_bigendian, {uniform_int, 1}}}.
{value_generator, {fixed_bin, 33554432}}. %% 32MB
Owner

mocchira commented Mar 31, 2015

I found out the root causes of this issue.
There are two problems.

Process leak

Processes for RPC connections managed by leo_pod couldn't stop properly.
leo-project/leo_pod@391607c has fixed part of this issue.

Erlang's accept/connect behaviour

If the number of active TCP connections exceeds rpc.num_of_acceptor,
leo_rpc_client_conns can succeed in connecting to a server (the kernel completes the handshake via the listen backlog) but cannot send or receive any data on that socket, because no acceptor process ever picks it up.
A quick fix is to set rpc.num_of_acceptor to a sufficiently large value, which depends on the number of storage nodes.
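The accept/connect mismatch is easy to demonstrate (an illustrative Python sketch, not Erlang): a TCP `connect()` completes as soon as the kernel finishes the handshake via the listen backlog, even if the application never calls `accept()`, so the client looks connected but its requests are never read.

```python
import socket

# A connect() completes as soon as the kernel's listen backlog accepts
# the TCP handshake, even if the server application never calls accept().
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)                          # a backlog exists; we never accept()
addr = srv.getsockname()

cli = socket.create_connection(addr)   # succeeds anyway
cli.sendall(b"request")                # sits unread in kernel buffers
cli.settimeout(0.2)
stuck = False
try:
    cli.recv(16)                       # no acceptor, so no reply ever comes
except socket.timeout:
    stuck = True                       # "connected", yet effectively dead
```

This is exactly why raising rpc.num_of_acceptor papers over the problem: with enough acceptors, every backlogged connection is eventually picked up.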

Owner

yosukehara commented Mar 31, 2015

With this solution, I've found errors and performance degradation as below, so I'll consider a better solution to this issue.

17:40:55.749 [error] gen_server leo_rpc_client_manager terminated with reason: {{timeout,{gen_server,call,[<0.235.0>,stop]}},{gen_server,call,[<0.230.0>,raw_status]}} in gen_server:call/2 line 182
17:40:55.749 [error] CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: {{timeout,{gen_server,call,[<0.235.0>,stop]}},{gen_server,call,[<0.230.0>,raw_status]}} in gen_server:terminate/7 line 804
17:40:55.750 [error] Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.99.0> exit with reason {{timeout,{gen_server,call,[<0.235.0>,stop]}},{gen_server,call,[<0.230.0>,raw_status]}} in context child_terminated
Owner

yosukehara commented Mar 31, 2015

I've provided another solution, which disconnects a connection when the request count reaches rpc.client.max_requests_for_reconnection.

## RPC-Client's max number of requests for reconnection to a remote-node
## * 1..<integer>: specialized value
rpc.client.max_requests_for_reconnection = 64
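The policy behind this setting can be sketched as follows (hypothetical Python; the names are illustrative, not leo_rpc's API): count requests per connection and reconnect once the limit is reached, so a single connection cannot accumulate stale state indefinitely.

```python
class ReconnectingClient:
    # Hypothetical sketch of the max_requests_for_reconnection policy:
    # after max_requests calls on one connection, close it and dial a
    # fresh one, bounding how long any connection can live.

    def __init__(self, connect, max_requests=64):
        self._connect = connect        # callable returning a fresh conn
        self._max = max_requests
        self._conn = None
        self._count = 0

    def request(self, payload):
        if self._conn is None or self._count >= self._max:
            if self._conn is not None:
                self._conn.close()     # proactively drop the old link
            self._conn = self._connect()
            self._count = 0
        self._count += 1
        return self._conn.send(payload)
```

Crucially, the old connection is closed before the new one is dialed, which also avoids the reconnect leak fixed earlier in this thread.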
Owner

yosukehara commented Mar 31, 2015

Also, I've taken a measure on the RPC-server side - leo-project/leo_rpc@2decc07 - which closes unnecessary connections when the timeout is reached.
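The server-side measure can be sketched like this (illustrative Python; leo_rpc's actual implementation is Erlang): each connection handler applies an idle timeout and closes the socket when it fires, so dead or silent clients cannot pin acceptor slots forever.

```python
import socket
import threading

def handle(conn, idle_timeout):
    # Per-connection handler: if no data arrives within idle_timeout
    # seconds, give up and close, reclaiming the acceptor slot.
    conn.settimeout(idle_timeout)
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break                  # peer closed normally
            conn.sendall(data)         # echo stands in for RPC work
    except socket.timeout:
        pass                           # idle too long: drop the client
    finally:
        conn.close()

def serve_with_idle_timeout(srv, idle_timeout=5.0):
    # Accept loop: one daemon thread per connection.
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn, idle_timeout),
                         daemon=True).start()
```

Combined with the client-side reconnection limit, both ends now bound a connection's lifetime, which is what finally stops the leak from accumulating.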

Owner

mocchira commented Apr 1, 2015

We have finally fixed this issue with leo-project/leo_rpc@3be4287.

Owner

yosukehara commented Apr 2, 2015

I've checked this issue again and confirmed that it has been fixed.

  • Server configuration:
{num_of_acceptors, 2}. %% default:128
{listen_port,    13076}.
{listen_timeout,  5000}.
{max_requests_for_reconnection, 64}.
  • Client(bench marker) configuration:
{mode, max}.
{duration, 10}.
{concurrent, 8}.
{remote_node_ip,   "127.0.0.1"}.
{remote_node_port, 13076}.
{driver, basho_bench_driver_leo_rpc}.
{key_generator, {int_to_bin_bigendian, {uniform_int, 1}}}.
{value_generator, {fixed_bin, 1024}}.
  • Result:
    (benchmark result image: leo_rpc_benchmark_1kb_20150402_1)

yosukehara closed this Apr 2, 2015
