
After submitting 1k+ jobs with -V argument to qsub, it is failing with error "'pbs_iff: cannot connect to host', 'pbs_iff: all reserved ports in use" #2576

Open
Mahalty opened this issue Mar 20, 2023 · 7 comments

Comments

@Mahalty
Contributor

Mahalty commented Mar 20, 2023

Description:
After submitting 1k+ jobs with the -V argument to qsub, it is failing with the error "'pbs_iff: cannot connect to host', 'pbs_iff: all reserved ports in use".

When I ran netstat -pant, I saw that most of the ports were occupied in the TIME_WAIT state:

[root@openpbs opc]# netstat -pant
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 5628/sshd
tcp 0 0 10.0.0.221:22 61.177.173.19:48352 SYN_RECV -
tcp 0 0 10.0.0.221:22 61.177.173.19:32683 SYN_RECV -
tcp 0 0 10.0.0.221:22 61.177.173.19:11002 SYN_RECV -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN 141002/pbs_server.b
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 5289/master
tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN 128481/pbs_mom
tcp 0 0 0.0.0.0:15003 0.0.0.0:* LISTEN 128481/pbs_mom
tcp 0 0 0.0.0.0:15007 0.0.0.0:* LISTEN 140912/postgres
tcp 0 0 0.0.0.0:17001 0.0.0.0:* LISTEN 128471/pbs_comm
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 4497/rpcbind
tcp 0 0 127.0.0.1:25 127.0.0.1:46808 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:47132 TIME_WAIT -
tcp 0 0 10.0.0.221:60216 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:609 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:821 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46562 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46632 TIME_WAIT -
tcp 0 0 10.0.0.221:813 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:46758 TIME_WAIT -
tcp 0 0 10.0.0.221:59586 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 10.0.0.221:59580 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:45830 TIME_WAIT -
tcp 0 0 10.0.0.221:60006 10.0.0.221:15001 TIME_WAIT -
tcp 0 0 127.0.0.1:25 127.0.0.1:45464 TIME_WAIT -

[root@openpbs opc]# netstat -pant | grep TIME_WAIT | wc -l
1024
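
A quick way to confirm that the stuck sockets are pbs_iff's reserved source ports (local port below 1024, foreign port 15001) rather than ordinary ephemeral ones - this is only an illustrative one-liner, adjust the server port for your setup:

netstat -ant | awk '$6 == "TIME_WAIT" && $5 ~ /:15001$/ { n = split($4, a, ":"); if (a[n] + 0 < 1024) c++ } END { print c + 0 }'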

@hpcpanther

Do you have any news on that? I think I am experiencing the same on my new cluster. I wanted to move from PBS Pro 19.2.4 to OpenPBS 22.05.11.
On the OpenPBS cluster everything was fine at the beginning, but when more nodes were added and more users started to use the cluster, I saw the same errors as you do. After stumbling over your issue I ran this netstat command to count all the connections in the TIME_WAIT state which are related to the PBS server process (port 15001):
netstat -pant | grep 15001 | grep TIME_WAIT | wc -l
What I found is that the highest number I can get there is 1024. When that number is reached, pbs_iff throws that error. The number of connections in TIME_WAIT goes up and down, but never goes over 1024. When I look at my old PBS Pro head node I see numbers bigger than 10000.
I checked ulimit and sysctl but I can't find any difference or limitations in ports or max open files. The maximum number of usable ports is far from being reached. It looks to me like it is a limitation of PBS, but I could not find anything in the manuals. I really don't have any idea anymore. I posted this problem in the OpenPBS community as well, but got no answer there.
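
For reference, checks along those lines can be done with something like this (illustrative commands):

ulimit -n                             # max open file descriptors for the shell
sysctl net.ipv4.ip_local_port_range   # ephemeral port range used by ordinary clients
ss -s                                 # socket summary, including the timewait count

Note that sockets in TIME_WAIT are no longer owned by any process (hence the "-" in the PID column of the netstat output above), so the open-files limit does not apply to them.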

@subhasisb
Collaborator

PBS uses reserved ports by default for authentication. Those are ports < 1024, hence you cannot extend that limit.
If you want to run more users from the same submit host, then please switch to munge authentication. This is documented in the PBS Pro Admin Guide.
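
For reference, a minimal sketch of that switch, going from memory of the Admin Guide - please verify the exact parameter names and values against the guide for your version (PBS_AUTH_METHOD and PBS_SUPPORTED_AUTH_METHODS are the pbs.conf knobs I believe are involved):

# /etc/pbs.conf on the server host
PBS_SUPPORTED_AUTH_METHODS=munge
PBS_AUTH_METHOD=munge

# /etc/pbs.conf on every submit host
PBS_AUTH_METHOD=munge

# munged must be running on all hosts with an identical /etc/munge/munge.key
systemctl enable --now munge

Restart the PBS daemons afterwards.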

@hpcpanther

hpcpanther commented Jun 27, 2023

Ok, that would make sense, but since when does PBS behave like that? On my old cluster (PBS Pro 19.2.4) I have no problems like that and I am not using munge. Also, why would anybody use reserved ports if it renders PBS almost unusable with a cluster of four nodes and 30 users? That is something I don't understand.

I just saw a difference between my two head nodes. On the PBS Pro 19.2.4 head node, the port of the local address is always 15001 and the port of the foreign address is never 15001.

tcp        0      0 0.0.0.0:15001           0.0.0.0:*               LISTEN      12873/pbs_server.bi
tcp        0      0 10.77.0.41:15001        10.77.0.41:40846        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:38656        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:51294        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:47340        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:50152        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:36218        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:35936        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:52654        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:49358        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:759          TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:48722        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:47120        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:41110        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:48124        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:42946        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:41820        TIME_WAIT   -
tcp        0      0 10.77.0.41:15001        10.77.0.41:42468        TIME_WAIT   -

And on the OpenPBS 22.05.11 head node, the port of the local address is either 15001 or a port lower than 1024. And when it is lower than 1024, the port of the foreign address is 15001:

tcp        0      0 0.0.0.0:15001           0.0.0.0:*               LISTEN      3131341/pbs_server.
tcp        0      0 10.77.0.29:15001        10.77.0.29:36942        TIME_WAIT   -
tcp        0      0 10.77.0.29:728          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:836          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:37146        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:37164        TIME_WAIT   -
tcp        0      0 10.77.0.29:855          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:642          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:710          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:575          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:37186        TIME_WAIT   -
tcp        0      0 10.77.0.29:771          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:615          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:798          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:777          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:15001        10.77.0.29:36862        TIME_WAIT   -
tcp        0      0 10.77.0.29:659          10.77.0.29:15001        TIME_WAIT   -
tcp        0      0 10.77.0.29:591          10.77.0.29:15001        TIME_WAIT   -

@subhasisb
Collaborator

subhasisb commented Jun 28, 2023

but since when does PBS behave like that?

From forever (20+ years back) :)

Also why would anybody use reserved ports if it renders your PBS almost unusable with a cluster of four nodes and 30 users? That is something I don't understand.

Reserved ports are the most basic way of authentication. Ports < 1024 can only be bound by privileged processes, hence this is one way to confirm that the connecting party (qsub/qstat) is actually being run by the person it claims to be. PBS uses a setuid program called pbs_iff to work on behalf of the non-root user: it binds to a port < 1024 and connects to pbs_server to tell it that the user who is about to connect via a normal port is a "legitimate" user of the source machine.

You will see the same problem on any version of PBS - how quickly the ports get exhausted depends on the other system services running on that host plus the TCP TIME_WAIT settings (some distros do have different defaults).
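
You can see the privilege requirement directly with netcat; this is only an illustration (exact error messages vary by netcat flavor, and "pbsserver" is a placeholder hostname):

$ nc -p 1000 pbsserver 15001          # unprivileged user: binding a reserved source port is refused
nc: bind failed: Permission denied
$ sudo nc -p 1000 pbsserver 15001     # as root (what the setuid pbs_iff effectively provides), the bind succeeds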

@hpcpanther

Ok. So then CentOS 7 behaves differently from Rocky Linux 8. As written above, I can't see these reserved ports used on CentOS 7 and I never had that problem there. And on Rocky Linux it happens with even the smallest setup. But I will give Munge a try and see if it changes anything.

@Nikita-T86

I have the same problem with OpenPBS 22.05.11 on RHEL 8 - when the number of connections in TIME_WAIT is more than 1024, pbs_iff throws the error "'pbs_iff: cannot connect to host', 'pbs_iff: all reserved ports in use". Are there any other solutions besides switching to Munge?

@sourceonly

Maybe try tuning some of the net.ipv4.* sysctl settings, such as net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, or net.ipv4.tcp_fin_timeout. Try to figure out values that fit your needs.
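
A sketch of that tuning - the values are illustrative, not recommendations, and note that net.ipv4.tcp_tw_recycle was removed in Linux 4.12, so it no longer exists on RHEL 8 / Rocky Linux 8 kernels:

# allow reuse of sockets in TIME_WAIT for new outgoing connections
sysctl -w net.ipv4.tcp_tw_reuse=1
# often tuned alongside, though it governs orphaned FIN_WAIT_2 sockets, not TIME_WAIT itself
sysctl -w net.ipv4.tcp_fin_timeout=15
# persist a working combination in /etc/sysctl.d/ once found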
