
acceptor-conns_sup pairs #198

Closed
wants to merge 2 commits

Conversation

@juhlig (Contributor) commented Jul 17, 2018

one conns_sup for each acceptor

@essen (Member) commented Feb 25, 2019

Last night on IRC:

22:05:22          petrohi | essen: I was running test to see the difference between ranch 1.7 and juhlig2's PR. The workload is 100K "devices" each cycling through opening connection for 1 second and then closing it. 1.7 got maxed out with ~200ms latency to open connection. juhlig2's maxed out at 130ms.
22:06:18            essen | sounds promising                                                                                                                                                                                      
22:06:22          petrohi | There is clear breaking point at about 60-70K devices when latency jumps from single digit millisecond to 100s in both cases.                                                                         
22:07:25          petrohi | Further, the Golang webserver maxes out at 110ms latency with 70-80K breakpoint.                                                                                                                      
22:07:59          petrohi | Which to me looks like I hit some bottleneck in the kernel.                                                                                                                                           
22:08:51            essen | keep the test around to try it with NIF sockets once that's implemented in OTP, it'll probably make Erlang closer to Go                                                                               
22:09:24          petrohi | When this is coming out?                                                                                                                                                                              
22:09:59            essen | OTP 22 or 23, so this or next year, probably next year                                                                                                                                                
22:10:19          petrohi | OK                                                                                                                                                                                                    
22:10:32            essen | still the proposed Ranch 2 changes look like a clear improvement so we can go with that                                                                                                               
22:10:42            essen | did you try establishing 100K connections and closing them all at once?                                                                                                                               
22:11:58          petrohi | Yes, tried that with 1M connections. In 1.7 case got supervisor message queue to 100K messages for few seconds.                                                                                       
22:12:02            essen | oh and considering the PR is fairly naive still there's probably some more improvements we can achieve, ideally there's only one process instead of an acceptor/sup pair                              
22:12:39          petrohi | With 32 acceptors on 32 CPU machine juhlig2's had no backup in message queues.                                                                                                                        
22:12:51            essen | impressive                                                                                                                                                                                            
22:14:00          petrohi | But, I was unable to reproduce case when I had backup for 10s of seconds. Even 1.7 was ready to accept within seconds.                                                                                
22:14:33            essen | probably takes longer if it happens on a real system with other busy processes around                                                                                                                 
22:15:01          petrohi | True.                                                                                                                                                                                                 
22:15:38          petrohi | Anyway, I think simple approach with acceptor pairing is big improvement.                                                                                                                             
22:16:42          petrohi | Any ideas on how to work around kernel bottleneck? Without it stressing the ranch would be much more intense.                                                                                         
22:16:49            essen | yeah, probably going to start from there, get the API right and then try to improve                                                                                                                   
22:17:33            essen | hm don't know specifically, just the usual sysctl flags that need tweaking                                                                                                                            
22:18:25          petrohi | Yes. Tried somaxconn, tcp_max_syn_backlog and netdev_max_backlog to no avail.                                                                                                                         
22:18:58          petrohi | Basically no difference from defaults. Just removed SYN flood warning in syslog.                                                                                                                      
22:34:56            essen | probably need some linux support :-) i'm off, thanks for the heads up!                                                                                                                                
-- Mon, 25 Feb 2019 --                                                                                                                                                                                                            
07:33:13          petrohi | After digging some more it turned out that breaking point in connection time from sub 1ms to 100s is caused by SYN buffer overflow. This causes SYNs to be dropped and clients transparently resending SYNs, which results in such a big hike in latency. SYN buffer overflows because accept is not picking up connections fast enough. I played with the number of acceptors from 1x number of CPUs to 16x without much
07:33:15          petrohi | difference. So my conclusion at this point is that accept has natural contention that gets saturated at about 70K accepts per second. Juhlig's PR has a meaningful effect only between 65K and 70K connections per second. With PR latency stays below 1ms. With mainline 1.7 it climbs to about 10ms. Above 70K difference in latency is 130ms vs 200ms correspondingly. But at that point, the server is dropping SYNs
07:33:17          petrohi | in millions.
09:30:55            essen | hm i see                                                                                                                                                                                              
09:45:23            essen | petrohi: by accept has natural contention do you mean Erlang's? i can't imagine the syscall having this problem and nobody fixing it since                                                            
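The sysctl knobs petrohi mentions trying (somaxconn, tcp_max_syn_backlog, netdev_max_backlog) are the usual first stops for listen-queue tuning on Linux. As a hedged illustration only (the values below are arbitrary examples, not settings recommended anywhere in this thread), a sysctl config fragment could look like:

```
# /etc/sysctl.d/99-accept-backlog.conf -- illustrative values only
# Cap on the accept (listen) queue length
net.core.somaxconn = 65535
# Size of the half-open (SYN_RECV) queue
net.ipv4.tcp_max_syn_backlog = 65535
# Per-CPU network device ingress queue
net.core.netdev_max_backlog = 65535
```

Note that the effective accept queue is also capped by the backlog argument passed to listen(); in Ranch that is the {backlog, N} transport option (default 1024), so raising somaxconn alone may not change anything.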

So I think we will start work on master starting from this branch (after review of course) and then see what more improvements we can add. The important part is getting the API right as soon as possible, the internals can be changed afterwards, even after releasing 2.0, if for example we find a single process to work better than separate acceptor/sup.

/cc @petrohi

@juhlig (Contributor, Author) commented Feb 25, 2019

So basically you're saying I should brush off the dust this PR has acquired since I put it up (i.e., resolve the merge conflicts), so it can be properly reviewed, yes? :)

@essen (Member) commented Feb 25, 2019

I'm saying I will review it. There's no hurry though, I probably won't have time before next week anyway.

@juhlig (Contributor, Author) commented Feb 25, 2019

Yeah, same here. In theory, I have plenty of time since I'm on vacation this week, but family has already claimed it all, so there =^^=

@petermm commented Mar 5, 2019

FYI https://stressgrid.com/blog/100k_cps_with_elixir/ - they use this PR and subsequently SO_REUSEPORT to push past the SYN bottleneck.
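For context, the SO_REUSEPORT approach in that post lets several listening sockets bind the same port, so accepts spread across independent accept queues. A minimal configuration sketch of passing it to Ranch 1.7 through a raw inet option follows; the listener ref, port, acceptor count, and echo_protocol module are placeholders, and the numeric constants (1 = SOL_SOCKET, 15 = SO_REUSEPORT) are Linux-specific, not portable:

```erlang
%% Sketch only: start one of several listeners sharing the same port
%% via SO_REUSEPORT. Linux constants: SOL_SOCKET = 1, SO_REUSEPORT = 15.
{ok, _} = ranch:start_listener(tcp_echo_1,
    ranch_tcp, [
        {port, 8080},
        {num_acceptors, 32},
        %% Raw socket option: {raw, Level, Opt, ValueBin}
        {raw, 1, 15, <<1:32/native>>}
    ],
    echo_protocol, []).
```

Starting several such listeners (each its own ref) on the same port is what spreads the kernel-side accept load.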

@essen (Member) commented Mar 5, 2019

Yep, mentioned it in #110.

@essen (Member) commented Apr 25, 2019

I will begin work on getting this merged next week.

src/ranch.erl Outdated
[_, _, _, Protocol, _] = ranch_server:get_listener_start_args(Ref),
Protocol;
info(Ref, protocol_options) ->
get_protocol_options(Ref).
essen (Member) commented:
This change seems completely unrelated and not necessary (I didn't see any change other than the active/all_connections retrieval method). Correct? So it's probably better to keep it all as it was and just replace the two relevant calls.

@juhlig (Contributor, Author) commented Apr 29, 2019:

Huh, now how did that end up in there... You're right, it is unrelated.

I had been playing with this quite a long time ago, then got distracted, then forgot. IIRC, my reason for this was so one could query a very specific piece of info for a listener instead of all of it, which takes some extra time to collect only to be ignored.

essen (Member) commented:

Feel free to update the PR to master and remove this, I'll get to it on Wednesday myself.

{'DOWN', MonitorRef, process, SupPid, Reason} ->
error(Reason)
end.

-spec start_protocol(pid(), inet:socket()) -> ok.
start_protocol(SupPid, Socket) ->
essen (Member) commented:
This function does not look necessary anymore; every call now uses the start_protocol/3 function.

src/ranch.erl Outdated
ListenerSup = ranch_server:get_listener_sup(Ref),
{_, ConnsSupSup, _, _} = lists:keyfind(ranch_conns_sup_sup, 1,
supervisor:which_children(ListenerSup)),
_ = [ConnsSup ! {remove_connection, Ref, self()} || {_, ConnsSup, _, _} <- supervisor:which_children(ConnsSupSup)],
essen (Member) commented:
Many lines are too long, for example this one. Please try to limit them to around 100 columns. (Basically, if it's not fully visible in the GitHub UI, it's probably too long.)

@essen (Member) commented May 5, 2019

You forgot to add a copyright header to the new file. Can you write the copyright line for you in a comment, and I'll add it to my copy?

@juhlig (Contributor, Author) commented May 6, 2019

Ok, sure. Do you want your name in there or mine (both are ok with me), and 2011-2019 as the year or just 2019?

@juhlig (Contributor, Author) commented May 6, 2019

Jan Uhlig <ju@mailingwork.de>

@essen (Member) commented May 6, 2019

Merged, thanks!

@essen closed this May 6, 2019