[Issue Tracker] PyTorch distributed RPC #96

Open
XuehaiPan opened this issue Oct 8, 2022 · 1 comment
Labels: bug, distributed, pytorch, upstream
@tornadoyi

@XuehaiPan
I found the key point of the issue where init_rpc with a world size larger than N causes "Resource temporarily unavailable".
Intuitively, the cause of this issue is the large number of connections initiated to rank 0 "simultaneously".
Based on the hypothesis above, I tried adding one simple line to PyTorch's distributed RPC code and reran my test code, and I got the correct result without any error.
My change is inserted here: after _init_rpc_states I sleep for a while (the sleep time equals the rank). This means all processes connect to the leader one by one instead of connecting at the same time.
(screenshot of the one-line patch: a sleep inserted after _init_rpc_states)
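
A rough user-side sketch of the same staggering idea (an approximation, not the exact change inside torch.distributed.rpc; it assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set by the launcher):

# Sketch only: delay each rank by its rank number so the workers do not all
# dial rank 0 at the same instant.
import os
import time

import torch.distributed.rpc as rpc

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

time.sleep(rank)  # rank 0 starts immediately, later ranks connect one by one

rpc.init_rpc(name=f"worker{rank}", rank=rank, world_size=world_size)

rpc.shutdown()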

But why does the issue occur? I checked all of my kernel TCP configuration and resource limits, listed below. They all look right to me.

net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_syn_backlog = 16384

net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120

-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) unlimited
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes unlimited
-n: file descriptors 1048576
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 2061498
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
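
A small helper (a sketch, not part of the original test code) to print the limits above on each host, so they are easy to compare:

# Sketch: dump the kernel settings and file-descriptor limit discussed above.
import resource
from pathlib import Path

def read_sysctl(name):
    # sysctl keys map to /proc/sys paths with dots replaced by slashes
    return Path("/proc/sys/" + name.replace(".", "/")).read_text().strip()

print("net.ipv4.tcp_max_syn_backlog =", read_sysctl("net.ipv4.tcp_max_syn_backlog"))
print("net.core.somaxconn =", read_sysctl("net.core.somaxconn"))  # also caps the listen() backlog
print("net.ipv4.ip_local_port_range =", read_sysctl("net.ipv4.ip_local_port_range"))

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("file descriptors (soft/hard) =", soft, hard)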

Actually, the maximum number of simultaneous connections depends on two things. One is the net.ipv4.tcp_max_syn_backlog setting, and my value of 16384 is more than enough for these simultaneous connections. The other is the backlog argument of listen(fd, backlog) in the C++ code.
So I searched for listen calls in the tensorpipe code and found that backlog = 128 in every one of them, as shown below (a small sketch after the list illustrates what the backlog limits):
ibv listener
shm listener
uv listener
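
To make the role of that backlog concrete, here is a minimal, self-contained sketch with plain sockets (not tensorpipe code): a listener with a tiny backlog that never accepts will leave most of a simultaneous connection burst hanging, and those clients eventually time out. The exact client-side symptom depends on kernel settings such as net.core.somaxconn and net.ipv4.tcp_abort_on_overflow.

# Sketch: 64 simultaneous connects against a listener with backlog 4 and no accept().
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # let the OS pick a free port
server.listen(4)                # tiny backlog to make the effect visible
port = server.getsockname()[1]

def connect(i):
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2)
    try:
        client.connect(("127.0.0.1", port))
        print(f"client {i}: connected")
    except OSError as exc:
        print(f"client {i}: {exc}")  # clients beyond the accept queue time out
    finally:
        client.close()

threads = [threading.Thread(target=connect, args=(i,)) for i in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.close()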

I don't know if my analysis is right, but after applying my patch above to the RPC code, my test code works every time.
