
After submitting 1k+ jobs with -V argument to qsub, it is failing with error "'pbs_iff: cannot connect to host', 'pbs_iff: all reserved ports in use" #2576

Open
Mahalty opened this issue Mar 20, 2023 · 7 comments

Comments

@Mahalty
Contributor

Mahalty commented Mar 20, 2023

Description:
After submitting 1k+ jobs with the -V argument to qsub, it is failing with the error "'pbs_iff: cannot connect to host', 'pbs_iff: all reserved ports in use".

When I ran netstat -pant, I saw that most of the ports were occupied in the TIME_WAIT state:

[root@openpbs opc]# netstat -pant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 5628/sshd
tcp 0 0 10.0.0.221:22 61.177.173.19:48352 SYN_RECV -
tcp 0 0 10.0.0.221:22 61.177.173.19:32683 SYN_RECV -
tcp 0 0 10.0.0.221:22 61.177.173.19:11002 SYN_RECV -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN 141002/pbs_server.b
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 5289/master
tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 128481/pbs_mom
tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 128481/pbs_mom
tcp 0 0 0.0.0.0:15007 0.0.0.0:* LISTEN 140912/postgres
tcp 0 0 0.0.0.0:17001 0.0.0.0:* LISTEN 128471/pbs_comm
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 4497/rpcbind
tcp 0 0 127.0.0.1:25 127.0.0.1:46808 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:47132 TIME_WAIT -
tcp 0 0 10.0.0.221:60216 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:609 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:821 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46562 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46632 TIME_WAIT -
tcp 0 0 10.0.0.221:813 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46758 TIME_WAIT -
tcp 0 0 10.0.0.221:59586 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:59580 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:45830 TIME_WAIT -
tcp 0 0 10.0.0.221:60006 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:45464 TIME_WAIT -

[root@openpbs opc]# netstat -pant | grep TIME_WAIT | wc -l
1024
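
A quick way to confirm that the stuck sockets are pbs_iff's reserved source ports (local port below 1024, foreign port 15001) rather than ordinary ephemeral ones - this is only an illustrative one-liner, adjust the server port for your setup:

netstat -ant | awk '$6 == "TIME_WAIT" && $5 ~ /:15001$/ { n = split($4, a, ":"); if (a[n] + 0 < 1024) c++ } END { print c + 0 }'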

@hpcpanther

Do you have any news on that? I think I am experiencing the same on my new cluster. I wanted to move from PBS Pro 19.2.4 to OpenPBS 22.05.11.
On the OpenPBS cluster everything was fine at the beginning, but when more nodes were added and more users started to use the cluster, I saw the same errors as you do. After stumbling over your issue I ran this netstat command to count all the connections in the TIME_WAIT state which are related to the PBS server process (port 15001):
netstat -pant | grep 15001 | grep TIME_WAIT | wc -l
What I found is that the highest number I can get there is 1024. When that number is reached, pbs_iff throws that error. The number of connections in TIME_WAIT goes up and down, but never goes over 1024. When I look at my old PBS Pro head node I see numbers bigger than 10000.
I checked ulimit and sysctl but I can't find any difference or limitations in ports or max open files. The maximum number of usable ports is far from being reached. It looks to me like it is a limitation of PBS, but I could not find anything in the manuals. I really don't have any idea anymore. I posted this problem in the OpenPBS community as well, but got no answer there.
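
For reference, checks along those lines can be done with something like this (illustrative commands):

ulimit -n                             # max open file descriptors for the shell
sysctl net.ipv4.ip_local_port_range   # ephemeral port range used by ordinary clients
ss -s                                 # socket summary, including the timewait count

Note that sockets in TIME_WAIT are no longer owned by any process (hence the "-" in the PID column of the netstat output above), so the open-files limit does not apply to them.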

@subhasisb
Collaborator

PBS uses reserved ports by default for authentication. Those are ports < 1024, hence you cannot extend that limit.
If you want to run more users from the same submit host, then please switch to munge authentication. This is documented in the PBS Pro Admin Guide.
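
For reference, a minimal sketch of that switch, going from memory of the Admin Guide - please verify the exact parameter names and values against the guide for your version (PBS_AUTH_METHOD and PBS_SUPPORTED_AUTH_METHODS are the pbs.conf knobs I believe are involved):

# /etc/pbs.conf on the server host
PBS_SUPPORTED_AUTH_METHODS=munge
PBS_AUTH_METHOD=munge

# /etc/pbs.conf on every submit host
PBS_AUTH_METHOD=munge

# munged must be running on all hosts with an identical /etc/munge/munge.key
systemctl enable --now munge

Restart the PBS daemons afterwards.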

@hpcpanther

hpcpanther commented Jun 27, 2023

Ok, that would make sense, but since when does PBS behave like that? On my old cluster (PBS Pro 19.2.4) I have no problems like that and I am not using munge. Also, why would anybody use reserved ports if it renders PBS almost unusable with a cluster of four nodes and 30 users? That is something I don't understand.

I just saw a difference between my two head nodes. On the PBS Pro 19.2.4 head node, the port of the local address is always 15001 and the port of the foreign address is never 15001.

tcp        0      0 0.0.0.0:15001           0.0.0.0:*               LISTEN      12873/pbs_server.bi
tcp        0      0 10.77.0.41:15001        10.77.0.41:40846        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:38656        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:51294        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:47340        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:50152        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:36218        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:35936        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:52654        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:49358        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:759          TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:48722        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:47120        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:41110        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:48124        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:42946        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:41820        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:42468        TIME_WAIT   -

And on the OpenPBS 22.05.11 head node, the port of the local address is either 15001 or a port lower than 1024. And when it is lower than 1024, the port of the foreign address is 15001:

tcp        0      0 0.0.0.0:15001           0.0.0.0:*               LISTEN      3131341/pbs_server.
tcp        0      0 10.77.0.29:15001        10.77.0.29:36942        TIME_WAIT   -
tcp        0      0 10.77.0.29:728          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:836          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:37146        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:37164        TIME_WAIT   -
tcp        0      0 10.77.0.29:855          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:642          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:710          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:575          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:37186        TIME_WAIT   -
tcp        0      0 10.77.0.29:771          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:615          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:798          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:777          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:36862        TIME_WAIT   -
tcp        0      0 10.77.0.29:659          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:591          10.77.0.29:15001        TIME_WAIT   -

@subhasisb
Collaborator

subhasisb commented Jun 28, 2023

but since when does PBS behave like that?

From forever (20+ years back) :)

Also why would anybody use reserved ports if it renders your PBS almost unusable with a cluster of four nodes and 30 users? That is something I don't understand.

Reserved ports are the most basic way of authentication. Ports < 1024 can only be bound by privileged processes, hence this is one way to confirm that the connecting party (qsub/qstat) is actually being run by the person it claims to be. PBS uses a setuid program called pbs_iff to work on behalf of the non-root user: it binds to a port < 1024 and connects to pbs_server to tell it that the user who is about to connect via a normal port is a "legitimate" user of the source machine.

You will see the same problem on any version of PBS - how quickly the ports get exhausted depends on the other system services running on that host plus the TCP TIME_WAIT settings (some distros do have different defaults).
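
You can see the privilege requirement directly with netcat; this is only an illustration (exact error messages vary by netcat flavor, and "pbsserver" is a placeholder hostname):

$ nc -p 1000 pbsserver 15001          # unprivileged user: binding a reserved source port is refused
nc: bind failed: Permission denied
$ sudo nc -p 1000 pbsserver 15001     # as root (what the setuid pbs_iff effectively provides), the bind succeeds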

@hpcpanther

Ok. So then CentOS 7 behaves differently from Rocky Linux 8. As written above, I can't see these reserved ports used on CentOS 7 and I never had that problem there. And on Rocky Linux it happens with even the smallest setup. But I will give Munge a try and see if it changes anything.

@Nikita-T86

I have the same problem with OpenPBS 22.05.11 on RHEL 8 - when the number of connections in TIME_WAIT is more than 1024, pbs_iff throws the error "'pbs_iff: cannot connect to host', 'pbs_iff: all reserved ports in use". Are there any other solutions besides switching to Munge?

@sourceonly

Maybe try tuning some of the net.ipv4.* sysctl settings, such as net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, or net.ipv4.tcp_fin_timeout. Try to figure out values that fit your needs.
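
A sketch of that tuning - the values are illustrative, not recommendations, and note that net.ipv4.tcp_tw_recycle was removed in Linux 4.12, so it no longer exists on RHEL 8 / Rocky Linux 8 kernels:

# allow reuse of sockets in TIME_WAIT for new outgoing connections
sysctl -w net.ipv4.tcp_tw_reuse=1
# often tuned alongside, though it governs orphaned FIN_WAIT_2 sockets, not TIME_WAIT itself
sysctl -w net.ipv4.tcp_fin_timeout=15
# persist a working combination in /etc/sysctl.d/ once found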
