Waiting to close ssh connection until socket is done sending data #6
I haven't worked much with the switchtool or with switches in general, but some preliminary poking around suggests that paramiko might be hamstringing us by being too low level. I wonder if we should eventually consider moving to some abstraction library. I'll put away the showerthoughts and take a look at the actual problem/solution here.
netmiko is a good idea, but it uses paramiko under the hood and throws the same SSHExceptions when doing looping connects, although maybe it has an easier way to block until cleanup is complete?
My brief reading suggested that I'm not going to suggest this PR becomes an investigation and migration to another ssh communication library. From what I can see, this seems to work well.
tangkong
left a comment
This seems to fix the problem for me, as strange as the solution is (after we exit, we read from the socket again?)
Perhaps improving this requires a more concerted effort
https://jira.slac.stanford.edu/browse/ECS-6845
If you open the switchtool gui or press refresh, you often get SSHExceptions like this
The exception happens when calling SSHClient.open
switchtool/psnet/survey/command.py
Line 220 in 160f22b
and is caught, but I believe it is still printed because paramiko raises an SSHException while handling the initial EOFError, which I don't think can be silenced without changing the library.
Why do we get this error in the first place? If you don't send the exit command to a switch but still call ssh.close, you no longer get the error. Presumably the switch continues to do some cleanup of the ssh session after the exit, and trying to create a new connection during that time fails. I saw this on all switch types that switchtool supports except cisco, although the only cisco switch I know of is switch-fee-far, which I can't ping.
I found some forum discussion of ruckus models about ssh bugs fixed by newer firmware, but I'm not sure it would help in this case, and I also think it would be unreasonable to update all of our switches purely for quality of life.
I could have just not called exit, but this felt unsatisfactory and would also have forced the removal of recv_exit_status (it would block forever since we would no longer be exiting). I tried polling until the socket data was empty, which worked, but I thought using the file descriptor looked a bit cleaner: the read call blocks until EOF (the socket is done reading), or we time out (so we can never block indefinitely), or we get an OSError (if the socket closes before or during the read).
I ran this on every switch that pings, and on two switches a few dozen times, and it seems to work. I'm not 100% sure, since the error might just be a lot less likely because of the additional time spent before attempting a new connection. I'm also not sure how to debug this at a lower level, since it all seems to be happening switch-side (the original error happens when paramiko expects to read the ssh banner but gets a zero-byte message instead). Happy to discuss alternative ways of preventing this issue or of finding out what causes it in more detail.
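For readers without the diff open, here is a minimal sketch of the "wait for EOF, time out, or hit OSError" idea described above. It is not the code in this PR: the function name `drain_until_eof` and its return values are hypothetical, and it only assumes an object with `settimeout()` and `recv()` methods (a plain socket here; a paramiko `Channel` exposes methods with the same names). Unlike a single per-read timeout, this sketch enforces one overall deadline.

```python
import socket
import time


def drain_until_eof(chan, timeout=5.0, chunk_size=4096):
    """Discard incoming data until the peer closes, with an overall deadline.

    Returns "eof" if the peer closed the stream, "timeout" if the deadline
    passed first, or "error" if the read failed with an OSError.
    (Hypothetical helper for illustration, not part of switchtool.)
    """
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return "timeout"
            # Bound each read by the time left so the loop cannot run
            # past the overall deadline.
            chan.settimeout(remaining)
            if not chan.recv(chunk_size):
                # An empty read means the other side is done sending.
                return "eof"
    except socket.timeout:
        # Must be caught before OSError, which it subclasses.
        return "timeout"
    except OSError:
        # The socket was closed before or during the read.
        return "error"


if __name__ == "__main__":
    # Stand-in for the switch: a local socket pair where one end
    # sends its last bytes and then closes.
    ours, theirs = socket.socketpair()
    theirs.sendall(b"logout\r\n")
    theirs.close()
    print(drain_until_eof(ours, timeout=2.0))
    ours.close()
```

With paramiko, the object passed in would presumably be the channel behind the command that sent exit (for example `stdout.channel` from `exec_command`), drained just before calling close. Whether that is enough to avoid the banner error depends on the switch finishing its cleanup by the time it closes the stream, which is the open question raised above.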