[leo_rpc] Performance degradation due to leaking TCP connections #141

Closed
mocchira opened this Issue · 12 comments

2 participants

@mocchira
Owner

Under very high load, there is a possibility that LeoFS throughput degrades due to leaking TCP connections, each corresponding to an Erlang process.
But since leo_rpc is used under a relatively low load, this is not critical for now.

@mocchira mocchira added this to the 1.0.0 milestone
@mocchira mocchira added the Bug label
@mocchira mocchira self-assigned this
@yosukehara yosukehara modified the milestone: Future works, 1.0.0
@mocchira mocchira modified the milestone: 1.2.8, Future works
@mocchira
Owner

It turned out that bulk-loading 10,000 objects with the default MDC (multi-datacenter replication) settings easily reproduces this issue, so we are looking into it now.

Not using a TCP connection pool might be the easiest solution (in my local environment, this works).
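For illustration, a minimal sketch of the pool-less approach (rpc_no_pool and its option values are hypothetical, not the actual leo_rpc code): each call opens a fresh socket and unconditionally closes it when done, so a crashed caller can never strand a pooled connection.

%% Hypothetical sketch: one short-lived socket per RPC call instead of a pooled one
-module(rpc_no_pool).
-export([call/3]).

call(Host, Port, Payload) ->
    {ok, Sock} = gen_tcp:connect(Host, Port,
                                 [binary, {packet, 4}, {active, false}]),
    try
        ok = gen_tcp:send(Sock, Payload),
        gen_tcp:recv(Sock, 0, 5000)   %% {ok, Reply} | {error, Reason}
    after
        gen_tcp:close(Sock)           %% release the socket unconditionally
    end.

The trade-off is one TCP handshake per call, which should be acceptable at the relatively low request rates leo_rpc targets.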

@mocchira
Owner

This feature branch fixes the issue by disabling the TCP connection pool:
https://github.com/leo-project/leo_rpc/tree/feature/disable_connection_pool

@yosukehara
Owner

I've fixed this issue in leo-project/leo_rpc@92e28ee. When leo_rpc encountered errors, it could not reconnect to a remote node because it did not close the old connection. Next, we need to check the MDC replication mechanism more thoroughly through stress tests and benchmarks.
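The essence of that fix, sketched with hypothetical function names (not the actual commit's code): the stale socket must be closed before a new connection is opened, otherwise every failed reconnect leaks one socket.

%% Sketch of the close-before-reconnect pattern (hypothetical names)
reconnect(OldSock, Host, Port) ->
    catch gen_tcp:close(OldSock),   %% drop the stale socket first; ignore errors
    gen_tcp:connect(Host, Port, [binary, {packet, 4}, {active, false}]).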

@mocchira
Owner

@yosukehara We still have TCP leaking issues on the latest develop branch.
The error below is continuously generated on the leo_storage node(s).

 [W] storage_0@127.0.0.1 2015-03-31 01:46:45.543806 +0000    1427766405  leo_sync_remote_cluster:defer_stack/1        82  key:test/1662, cause:sending_data_to_remote
@yosukehara
Owner

@mocchira Thank you for sharing the info. I'll try to check this issue again.

@yosukehara
Owner

I've benchmarked leo_rpc with large files; the result is as follows.
I could not reproduce the issue, so I'll check the behavior of LeoFS Storage as the next step.

leo_rpc's benchmark result:

[Image: leo_rpc_bench_result_32mb]

basho_bench configuration file:

{mode, max}.
{duration, 10}.
{concurrent, 64}.
{remote_node_ip,   "127.0.0.1"}.
{remote_node_port, 13076}.
{driver, basho_bench_driver_leo_rpc}.
{key_generator, {int_to_bin_bigendian, {uniform_int, 1}}}.
{value_generator, {fixed_bin, 33554432}}. %% 32MB
@mocchira
Owner

I found out the root causes of this issue.
There are two problems.

Process leak

Processes for RPC connections managed by leo_pod couldn't stop properly.
leo-project/leo_pod@391607c fixes part of this issue.

Erlang's accept/connect behaviour

If the number of active TCP connections exceeds rpc.num_of_acceptor,
leo_rpc_client_conns can succeed in connecting to a server but can't send or receive any data on the socket; a standalone reproduction is sketched below.
A quick fix is to set rpc.num_of_acceptor to a sufficiently large value, which depends on the number of storage nodes.
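The accept/connect behaviour can be reproduced with a few lines of standalone Erlang (independent of leo_rpc): the kernel completes the TCP handshake against the listen backlog, so connect/3 succeeds even though no process ever calls accept/1, and the client only notices when its recv times out.

%% Standalone reproduction of the backlog behaviour (not leo_rpc code)
-module(backlog_demo).
-export([run/0]).

run() ->
    {ok, LSock} = gen_tcp:listen(0, [binary, {active, false}, {backlog, 128}]),
    {ok, Port}  = inet:port(LSock),
    %% Nobody ever calls gen_tcp:accept/1, yet connecting succeeds:
    {ok, Sock}  = gen_tcp:connect("127.0.0.1", Port, [binary, {active, false}]),
    ok = gen_tcp:send(Sock, <<"ping">>),             %% sits in the kernel buffer
    {error, timeout} = gen_tcp:recv(Sock, 0, 1000). %% no acceptor will ever reply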

@yosukehara
Owner

With this solution, I've found errors and performance degradation, as shown below. So I'll look for a better solution to this issue.

17:40:55.749 [error] gen_server leo_rpc_client_manager terminated with reason: {{timeout,{gen_server,call,[<0.235.0>,stop]}},{gen_server,call,[<0.230.0>,raw_status]}} in gen_server:call/2 line 182
17:40:55.749 [error] CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: {{timeout,{gen_server,call,[<0.235.0>,stop]}},{gen_server,call,[<0.230.0>,raw_status]}} in gen_server:terminate/7 line 804
17:40:55.750 [error] Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.99.0> exit with reason {{timeout,{gen_server,call,[<0.235.0>,stop]}},{gen_server,call,[<0.230.0>,raw_status]}} in context child_terminated
@yosukehara
Owner

I've provided another solution, which disconnects a connection when the number of requests reaches rpc.client.max_requests_for_reconnection:

## RPC-Client's max number of requests for reconnection to a remote-node
## * 1..<integer>: specialized value
rpc.client.max_requests_for_reconnection = 64
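A hypothetical sketch of the counting logic behind this setting (record and function names are illustrative, not the actual leo_rpc implementation): the client counts requests on the current socket and proactively reconnects once the configured maximum is reached.

%% Illustrative request-counted reconnection (hypothetical, not leo_rpc code)
-record(state, {sock, host, port, count = 0, max = 64}).

call(#state{sock = Sock, count = N, max = Max} = S, Payload) when N >= Max ->
    gen_tcp:close(Sock),             %% retire the aged connection
    {ok, NewSock} = gen_tcp:connect(S#state.host, S#state.port,
                                    [binary, {packet, 4}, {active, false}]),
    call(S#state{sock = NewSock, count = 0}, Payload);
call(#state{sock = Sock, count = N} = S, Payload) ->
    ok = gen_tcp:send(Sock, Payload),
    {ok, Reply} = gen_tcp:recv(Sock, 0, 5000),
    {Reply, S#state{count = N + 1}}.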
@yosukehara
Owner

Also, I've taken a measure on the RPC server side (leo-project/leo_rpc@2decc07), which closes unnecessary connections when a timeout is reached.
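Sketched as a hypothetical per-connection loop (not the actual commit's code): a receive with an after clause closes the socket once it has been idle longer than the configured timeout.

%% Hypothetical per-connection loop: close sockets idle for Timeout ms
serve(Sock, Timeout) ->
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, Data} ->
            ok = gen_tcp:send(Sock, Data),   %% echo placeholder for real handling
            serve(Sock, Timeout);
        {tcp_closed, Sock} ->
            ok
    after Timeout ->
        gen_tcp:close(Sock)                  %% reclaim the idle connection
    end.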

@mocchira
Owner

We have finally fixed this issue with leo-project/leo_rpc@3be4287.

@yosukehara
Owner

I've checked this issue again and confirmed that it has been fixed.

  • Server configuration:
{num_of_acceptors, 2}. %% default:128
{listen_port,    13076}.
{listen_timeout,  5000}.
{max_requests_for_reconnection, 64}.
  • Client(bench marker) configuration:
{mode, max}.
{duration, 10}.
{concurrent, 8}.
{remote_node_ip,   "127.0.0.1"}.
{remote_node_port, 13076}.
{driver, basho_bench_driver_leo_rpc}.
{key_generator, {int_to_bin_bigendian, {uniform_int, 1}}}.
{value_generator, {fixed_bin, 1024}}.
  • Result: [Image: leo_rpc_benchmark_1kb_20150402_1]
@yosukehara yosukehara closed this