Waiting to close ssh connection until socket is done sending data #6

Merged
KaushikMalapati merged 1 commit into pcdshub:master from KaushikMalapati:sshspam
Dec 1, 2025

@KaushikMalapati
Contributor

https://jira.slac.stanford.edu/browse/ECS-6845

If you open the switchtool GUI or press refresh, you often get SSHExceptions like this:

Exception (client): Error reading SSH protocol banner
Traceback (most recent call last):
  File "/cds/group/pcds/pyps/conda/py39/envs/pcds-5.4.2/lib/python3.9/site-packages/paramiko/transport.py", line 2271, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/cds/group/pcds/pyps/conda/py39/envs/pcds-5.4.2/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
    buf += self._read_timeout(timeout)
  File "/cds/group/pcds/pyps/conda/py39/envs/pcds-5.4.2/lib/python3.9/site-packages/paramiko/packet.py", line 609, in _read_timeout
    raise EOFError()
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cds/group/pcds/pyps/conda/py39/envs/pcds-5.4.2/lib/python3.9/site-packages/paramiko/transport.py", line 2094, in run
    self._check_banner()
  File "/cds/group/pcds/pyps/conda/py39/envs/pcds-5.4.2/lib/python3.9/site-packages/paramiko/transport.py", line 2275, in _check_banner
    raise SSHException(
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner

The exception is raised when calling SSHClient.connect

self.ssh.connect(host, self.port, self.user, self.pw,

and is caught, but I believe it is still printed because paramiko raises an SSHException while handling the initial EOFError, and I don't know whether that can be silenced without changing the library.

Why do we get this error in the first place? If you don't send the exit command to a switch but still call ssh.close, you no longer get the error. Presumably the switch continues to clean up the ssh session after exit, and trying to create a new connection during that time fails. I saw this on all switch types that switchtool supports except cisco, although the only cisco switch I know of is switch-fee-far, which I can't ping. I found some forum discussion of ssh bugs on ruckus models that were fixed by newer firmware, but I'm not sure it would help in this case, and I think it would be unreasonable to update all of our switches purely for quality of life.
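One possibility for the printing, which I have not verified against switchtool: the "Exception (client): ..." line and the traceback look like messages paramiko emits through Python's logging module rather than an uncaught exception. Assuming the logger is named "paramiko.transport" (an assumption worth checking against the installed paramiko version), raising its level would hide them. The helper name quiet_logger is hypothetical.

```python
import logging


def quiet_logger(name: str = "paramiko.transport",
                 level: int = logging.CRITICAL) -> logging.Logger:
    """Raise the threshold of one named logger so ERROR records are dropped.

    The default name is an assumption about where paramiko logs transport
    errors; nothing here imports or touches paramiko itself.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Call once at startup, before any connections are opened.
quiet_logger()
```

The trade-off is that this would also hide genuine transport errors logged at ERROR level, so it treats the symptom only and says nothing about why the switch drops the connection.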

I could have simply not called exit, but that felt unsatisfactory and would also have forced the removal of recv_exit_status (it would block forever, since we would no longer be exiting). I tried polling until the socket data was empty, which worked, but I thought using the file descriptor looked a bit cleaner: the read call blocks until EOF (the socket has no more data to deliver), until we time out (so we can never block indefinitely), or until we get an OSError (if the socket closes before or during the read).

I ran this on every switch that pings, and on two switches a few dozen times, and it seems to work. I'm not 100% sure, since the error might just be much less likely because of the additional time spent before attempting a new connection. I'm also not sure how to debug this at a lower level, since it all seems to happen switch-side (the original error occurs when paramiko expects to read the ssh banner but gets a zero-byte message instead). Happy to discuss alternative ways of preventing this issue or of finding out what causes it in more detail.
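For readers who want the idea without the diff, here is a minimal sketch of "read until EOF, timeout, or OSError" on a plain stdlib socket. It is not the code merged in this PR, which goes through paramiko; the function name wait_for_eof, the 4096-byte read size, and the 2-second default are all illustrative assumptions.

```python
import socket
import time


def wait_for_eof(sock: socket.socket, timeout: float = 2.0) -> str:
    """Discard incoming data until the peer closes, a deadline passes, or the socket errors.

    Returns "eof", "timeout", or "error" so the caller can log the outcome.
    """
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return "timeout"
            # Bound each read by the time left so the total wait stays capped.
            sock.settimeout(remaining)
            chunk = sock.recv(4096)
            if not chunk:
                # A zero-length read means the peer has closed its end.
                return "eof"
    except socket.timeout:
        return "timeout"
    except OSError:
        # The socket was closed or broke before or during the read.
        return "error"
```

Using an overall deadline rather than a per-read timeout means a peer that keeps trickling data cannot extend the wait past the limit.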

@KaushikMalapati KaushikMalapati requested a review from a team as a code owner November 23, 2025 05:32
@tangkong
Contributor

I haven't worked much with the switchtool or switches in general, but some preliminary poking around suggests that paramiko might be hamstringing us by being too low level. I wonder if we should eventually consider moving to an abstraction library like netmiko.

I'll put away the showerthoughts and take a look at the actual problem/solution here

@KaushikMalapati
Contributor Author

netmiko is a good idea, but it uses paramiko under the hood and throws the same SSHExceptions when connecting in a loop:

from netmiko import ConnectHandler
for i in range(10):
    with ConnectHandler(secret='secret', device_type='device_type', host='switch', username='username', password='password') as net_connect:
        net_connect.enable()
        a = net_connect.send_command("show logging")
        print(i)
        net_connect.exit_enable_mode("exit")

although maybe it has an easier way to block until cleanup is complete?

@tangkong
Contributor

My brief reading suggested that netmiko might wrap paramiko and handle some of the nuts and bolts of ssh connections for us.

I'm not going to suggest this PR becomes an investigation and migration to another ssh communication library. From what I can see, this seems to work well.

Contributor

@tangkong tangkong left a comment

This seems to fix the problem for me, as strange as the solution is (after we exit, we read from the socket again?)

Perhaps improving this requires a more concerted effort

@KaushikMalapati KaushikMalapati merged commit a1fbe8d into pcdshub:master Dec 1, 2025
@KaushikMalapati KaushikMalapati deleted the sshspam branch December 1, 2025 20:39