After submitting 1k+ jobs with -V argument to qsub , it is failing with error "'pbs_iff: cannot connect to host', 'pbs_iff: all reserved ports in use" #2576
Comments
Do you have any news on that? I think I am experiencing the same on my new cluster. I wanted to move from PBS Pro 19.2.4 to OpenPBS 22.05.11.
PBS uses reserved ports by default for authentication. Those are ports <1024, hence you cannot extend that range.
Ok, that would make sense, but since when does PBS behave like that? On my old cluster (PBS Pro 19.2.4) I have no problems like that and I am not using munge. Also why would anybody use reserved ports if it renders your PBS almost unusable with a cluster of four nodes and 30 users? That is something I don't understand. I just saw a difference between my two headnodes. On the PBS Pro 19.2.4 head node the port of the local address is always 15001 and the port of the foreign address is never 15001.
And on the OpenPBS 22.05.11 head node the port of the local address is either 15001 or a port lower than 1024. And when it is lower than 1024, the port of the foreign address is 15001.
From forever (20+ years back) :)
Reserved ports are the most basic form of authentication. Ports <1024 can only be bound by privileged processes, so this is one way to confirm that the connecting party (qsub/qstat) is actually being run by the user it claims to be. PBS uses a setuid program called pbs_iff, which works on behalf of the non-root user: it binds to a port <1024 and connects to pbs_server to tell it that the user who is about to connect via a normal port is a "legitimate" user of the source machine. You will see the same problem on any version of PBS; how quickly the ports get exhausted depends on the other system services running on that host and on the TCP TIME_WAIT settings (some distros have different defaults).
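To illustrate where an "all reserved ports in use" error can come from, here is a minimal Python sketch of a reserved-port bind loop. It is not pbs_iff's actual code: the function name `bind_reserved_port` is made up for this example, and the 512 to 1023 scan range is an assumption borrowed from the classic BSD rresvport() convention.

```python
import errno
import socket

# Assumed range (classic rresvport() convention); pbs_iff may scan differently.
RESERVED_LOW, RESERVED_HIGH = 512, 1023


def bind_reserved_port(sock, low=RESERVED_LOW, high=RESERVED_HIGH):
    """Bind sock to the first free port in [low, high], scanning downward."""
    for port in range(high, low - 1, -1):
        try:
            sock.bind(("", port))
            return port
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                continue  # port is busy (possibly held in TIME_WAIT); try the next one
            raise  # e.g. EACCES when the process is not privileged
    # Every candidate port was busy: this is the exhaustion case.
    raise RuntimeError("all reserved ports in use")


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            print("bound to reserved port", bind_reserved_port(s))
        except (PermissionError, RuntimeError) as exc:
            # Binding below 1024 needs root or CAP_NET_BIND_SERVICE.
            print("could not bind a reserved port:", exc)
```

Because the candidate range is small, a burst of short-lived connections can leave every port in it busy at once, which matches the failure reported in this issue.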
Ok. So then CentOS 7 behaves differently from Rocky Linux 8. As written above, I can't see these reserved ports being used on CentOS 7 and never had that problem, while on Rocky Linux it happens with even the smallest setup. But I will give Munge a try and see if it changes something.
I have the same problem with OpenPBS 22.05.11 on RHEL 8: when the number of ports in TIME_WAIT is more than 1024, pbs_iff fails with the errors "pbs_iff: cannot connect to host" and "pbs_iff: all reserved ports in use". Are there any solutions other than switching to Munge?
Maybe try tuning the net.ipv4.* sysctls, such as net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, and net.ipv4.tcp_fin_timeout. Try to find values that fit your needs.
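Before changing any of these, it may help to see what the host currently uses. The sketch below reads the three tunables named above from /proc/sys; the helper name `read_sysctl` is made up for this example. A missing entry is reported as None, which is what I would expect for net.ipv4.tcp_tw_recycle on recent kernels, since that option was removed in Linux 4.12.

```python
from pathlib import Path

# The tunables mentioned in this thread.
TUNABLES = (
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_fin_timeout",
)


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a string, or None if it is absent."""
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None  # not present on this kernel, or not readable


if __name__ == "__main__":
    for name in TUNABLES:
        value = read_sysctl(name)
        print(f"{name} = {value if value is not None else '(not available)'}")
```

This only reads values; whether changing any of them actually relieves the reserved-port exhaustion would still need to be tested on the affected head node.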
Description:
After submitting 1k+ jobs with the -V argument to qsub, submission fails with the errors "pbs_iff: cannot connect to host" and "pbs_iff: all reserved ports in use".
When I ran netstat -pant, I saw that most of the ports were held in the TIME_WAIT state:
[root@openpbs opc]# netstat -pant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 5628/sshd
tcp 0 0 10.0.0.221:22 61.177.173.19:48352 SYN_RECV -
tcp 0 0 10.0.0.221:22 61.177.173.19:32683 SYN_RECV -
tcp 0 0 10.0.0.221:22 61.177.173.19:11002 SYN_RECV -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN 141002/pbs_server.b
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 5289/master
tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 128481/pbs_mom
tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 128481/pbs_mom
tcp 0 0 0.0.0.0:15007 0.0.0.0:* LISTEN 140912/postgres
tcp 0 0 0.0.0.0:17001 0.0.0.0:* LISTEN 128471/pbs_comm
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 4497/rpcbind
tcp 0 0 127.0.0.1:25 127.0.0.1:46808 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:47132 TIME_WAIT -
tcp 0 0 10.0.0.221:60216 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:609 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:821 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46562 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46632 TIME_WAIT -
tcp 0 0 10.0.0.221:813 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46758 TIME_WAIT -
tcp 0 0 10.0.0.221:59586 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:59580 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:45830 TIME_WAIT -
tcp 0 0 10.0.0.221:60006 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:45464 TIME_WAIT -
[root@openpbs opc]# netstat -pant | grep TIME_WAIT | wc -l
1024
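The total above counts every TIME_WAIT socket, including the SMTP ones on port 25. To count only the entries relevant to pbs_iff, here is a rough Python sketch that filters netstat-style text for TIME_WAIT connections to the server port whose local port is below 1024. The function name `count_reserved_time_wait` is made up for this example, and the sample lines are copied from the output above.

```python
def count_reserved_time_wait(netstat_text, server_port=15001, reserved_below=1024):
    """Count TIME_WAIT connections to server_port that originate from a reserved local port."""
    count = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expect: Proto Recv-Q Send-Q Local Foreign State [PID/Program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state != "TIME_WAIT":
            continue
        local_port = int(local.rsplit(":", 1)[1])
        foreign_port = foreign.rsplit(":", 1)[1]
        if foreign_port == str(server_port) and local_port < reserved_below:
            count += 1
    return count


# A few lines taken from the netstat output in this report.
SAMPLE = """\
tcp 0 0 10.0.0.221:60216 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:609 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:821 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46562 TIME_WAIT -
tcp 0 0 10.0.0.221:813 10.0.0.221:15001 TIME_WAIT -
"""

if __name__ == "__main__":
    print(count_reserved_time_wait(SAMPLE))
```

On a live system the same function could be fed the captured output of netstat -ant, for example via subprocess.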