[Issue Tracker] PyTorch distributed RPC #96

Open
XuehaiPan opened this issue Oct 8, 2022 · 1 comment
Labels: bug, distributed, pytorch, upstream
@tornadoyi

@XuehaiPan
I found the key point of the issue where init_rpc with a world size larger than N causes "Resource temporarily unavailable".
Intuitively, the cause of this issue is the large number of connections initiated to rank 0 "simultaneously".
Based on the hypothesis above, I tried adding one simple line to PyTorch's distributed RPC code and reran my test code, and I got the correct result without any error.
My change is inserted here: after _init_rpc_states I sleep for a while (the sleep time equals the rank). This means all processes connect to the leader one by one instead of connecting at the same time.
(screenshot of the one-line patch: a sleep inserted after _init_rpc_states)
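
A rough user-side sketch of the same staggering idea (an approximation, not the exact change inside torch.distributed.rpc; it assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set by the launcher):

# Sketch only: delay each rank by its rank number so the workers do not all
# dial rank 0 at the same instant.
import os
import time

import torch.distributed.rpc as rpc

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

time.sleep(rank)  # rank 0 starts immediately, later ranks connect one by one

rpc.init_rpc(name=f"worker{rank}", rank=rank, world_size=world_size)

rpc.shutdown()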

But why does the issue occur? I checked all of my kernel TCP configuration and resource limits, listed below. They all look right to me.

net.ipv4.ip_local_port_range = 10000 65535
net.ipv4.tcp_max_syn_backlog = 16384

net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120

-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) unlimited
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes unlimited
-n: file descriptors 1048576
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 2061498
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
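
A small helper (a sketch, not part of the original test code) to print the limits above on each host, so they are easy to compare:

# Sketch: dump the kernel settings and file-descriptor limit discussed above.
import resource
from pathlib import Path

def read_sysctl(name):
    # sysctl keys map to /proc/sys paths with dots replaced by slashes
    return Path("/proc/sys/" + name.replace(".", "/")).read_text().strip()

print("net.ipv4.tcp_max_syn_backlog =", read_sysctl("net.ipv4.tcp_max_syn_backlog"))
print("net.core.somaxconn =", read_sysctl("net.core.somaxconn"))  # also caps the listen() backlog
print("net.ipv4.ip_local_port_range =", read_sysctl("net.ipv4.ip_local_port_range"))

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("file descriptors (soft/hard) =", soft, hard)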

Actually, the maximum number of simultaneous connections depends on two things. One is the net.ipv4.tcp_max_syn_backlog setting, and my value of 16384 is more than enough for these simultaneous connections. The other is the backlog argument of listen(fd, backlog) in the C++ code.
So I searched for listen calls in the tensorpipe code and found that backlog = 128 in every one of them, as shown below (a small sketch after the list illustrates what the backlog limits):
ibv listener
shm listener
uv listener
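
To make the role of that backlog concrete, here is a minimal, self-contained sketch with plain sockets (not tensorpipe code): a listener with a tiny backlog that never accepts will leave most of a simultaneous connection burst hanging, and those clients eventually time out. The exact client-side symptom depends on kernel settings such as net.core.somaxconn and net.ipv4.tcp_abort_on_overflow.

# Sketch: 64 simultaneous connects against a listener with backlog 4 and no accept().
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # let the OS pick a free port
server.listen(4)                # tiny backlog to make the effect visible
port = server.getsockname()[1]

def connect(i):
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2)
    try:
        client.connect(("127.0.0.1", port))
        print(f"client {i}: connected")
    except OSError as exc:
        print(f"client {i}: {exc}")  # clients beyond the accept queue time out
    finally:
        client.close()

threads = [threading.Thread(target=connect, args=(i,)) for i in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.close()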

I don't know if my analysis is right, but after applying my patch above to the RPC code, my test code works every time.
