vagrant up/ssh fails to connect (on vmware) without ssh_info_public=true #10730

Closed
ladar opened this issue Mar 12, 2019 · 6 comments

@ladar
Contributor

ladar commented Mar 12, 2019

There is a problem connecting to boxes via SSH when the guest runs a sufficiently recent version of the OpenSSH server and (I presume) the host runs a sufficiently old version of the OpenSSH client. In this case the client is the default version provided by CentOS, and the server is the current version provided by Alpine 3.9, Ubuntu 19.04, RHEL 8, et al. See below for a more complete list. This bug only occurs with the VMware provider, and then only when using the default NAT'ed SSH behaviour. If ssh_info_public = true is set in the Vagrantfile, the issue disappears. I believe a newer SSH client will also avoid this problem.

I first noticed this problem with Gentoo circa August 6th, when the default version of OpenSSH for Gentoo was updated. I spent a considerable amount of time at that point trying to debug the issue. I am assuming that the issue I worked on then is also what's causing problems with other guests, as I don't have time to do a full debug session right now. I'm just hoping my write-up can point someone else at this issue.

Specifically what I found is that SSH clients would handshake with the server, but ultimately encounter a fatal SSH protocol error, both during the vagrant up process, and via the vagrant ssh command. Using vagrant ssh-config and connecting directly via the command line SSH client would yield the same result, but has the benefit of providing additional debug information when the -vvv option is supplied.

When I first found this issue, Gentoo was the only distro affected, and then only on VMware. Given that I also use an older version of VMware Workstation, I thought the impact might be limited and went with the workaround instead. Namely, I forced the Gentoo Roboxes on VMware to use OpenSSH 7.5 instead of 7.6 or 7.7. On my clients, I also started using the ssh_info_public = true option, and didn't retest till this week. But based on my current tests, it seems more distros are now impacted, so I'm opening this bug.

As for my debugging efforts, I don't have time to generate SSH client/server debug logs right this second, but as I recall, the connection would essentially fail with a protocol error. I'm assuming the issue I first found with Gentoo is affecting other guests, since it only seems to impact the most recent distro releases, which are more likely to include a very recent version of the OpenSSH server.

I don't know if the bug is with OpenSSH or Vagrant, but seeing as it hasn't been fixed yet, I wanted to open this ticket. We may need to open a bug report with OpenSSH and/or VMware, depending on what others report about it affecting other VMware Workstation/OpenSSH client versions.

Vagrant version

vagrant-2.2.4-1.x86_64
vagrant-vmware-utility-1.0.7-1.x86_64
vagrant-vmware-desktop 2.0.3

Host operating system

CentOS 7.6.1810
VMware Workstation 12.5.9

Guest operating system

Arch
Alpine 3.9
Fedora 28/29
RHEL 8
OpenBSD 6
Ubuntu 19.04

Gentoo is probably also affected, but I have a workaround in place for that box at the moment. Also note that the issue is being reported using the v1.9.6 Robox version of the above guests, which is also what produced the attached logs. It's likely I'll start bundling future Robox releases with a workaround, so be sure to use v1.9.6 when testing.

Vagrant

vagrant init roboxes/ubuntu1904 && vagrant up --provider vmware_desktop

Debug output

roboxes-alpine39-vmware.txt
roboxes-arch-vmware.txt
roboxes-fedora28-vmware.txt
roboxes-fedora29-vmware.txt
roboxes-hardenedbsd12-vmware.txt
roboxes-openbsd6-vmware.txt
roboxes-rhel8-vmware.txt
roboxes-ubuntu1904-vmware.txt

Expected behavior

Box should boot, and vagrant should configure the guest keypair, and be ready to connect without issue.

Actual behavior

Vagrant starts the handshake/connection process, only to hang and eventually time out.

References

#10499
https://github.com/lavabit/robox/blob/e5c365e08e7f80f06e64e8dd4437ffb659441ee2/scripts/gentoo/vmware.sh#L10-L21

@njtman

njtman commented Mar 18, 2019

+1
I am seeing the same issue with my macOS 10.14.3 guest OS.
Adding v.ssh_info_public = true to the vmware_desktop provider block in my Vagrantfile fixed the hanging SSH issue.
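For anyone landing here later, a minimal Vagrantfile sketch of that workaround might look like the following. The box name is just the one from the reproduction command in the original report, and the comment on what the option does reflects my reading of the thread (it sidesteps the default NAT'ed SSH path), so treat it as an illustration rather than a verified configuration.

```ruby
# Vagrantfile (sketch): applies the ssh_info_public workaround from this thread.
Vagrant.configure("2") do |config|
  # Box name taken from the reproduction command in the original report.
  config.vm.box = "roboxes/ubuntu1904"

  config.vm.provider "vmware_desktop" do |v|
    # Assumption: this makes Vagrant connect to the guest's own address
    # rather than going through the default NAT'ed SSH path.
    v.ssh_info_public = true
  end
end
```

With this in place, `vagrant up --provider vmware_desktop` is the same command as in the report; only the way the SSH connection is made should change.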

@chrisroberts
Member

@ladar I'm starting to dig into this. Thanks for all the information you've provided!

@chrisroberts
Member

This does appear to be caused by a change in OpenSSH, yet the bug resides in VMware. The diff which introduced the problem is here:

https://cvsweb.openbsd.org/src/usr.bin/ssh/readconf.c#rev1.284

I found some discussions about it here:

https://bugzilla.redhat.com/show_bug.cgi?id=1624437

with VMware specific discussions here:

https://communities.vmware.com/message/2803219
https://communities.vmware.com/thread/590825
vmware/open-vm-tools#287

Based on the Red Hat Bugzilla thread, it appears there is an internal issue filed within VMware to address this, so it should hopefully be fixed at some point. Many suggestions were to update the IPQoS settings on the client, but I found that doing so made no difference in behavior.

Instead I found that updating the /etc/ssh/sshd_config on the server and including this:

IPQoS lowdelay throughput

and restarting sshd got everything working correctly again. So that seems to be the workaround until the proper upstream fix is in place.
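If you build boxes with a script, the sshd_config change above could be applied along these lines. This is only a sketch: `ensure_ipqos` and `IPQOS_LINE` are hypothetical names of my own, and it assumes the config has no `Match` blocks (a directive appended after one would only apply inside that block). It drops any existing uncommented IPQoS directive and appends the one suggested above, so running it twice gives the same result.

```ruby
# Directive suggested in this thread as the server-side workaround.
IPQOS_LINE = "IPQoS lowdelay throughput"

# Hypothetical helper: takes the text of an sshd_config and returns it
# with exactly one active IPQoS directive, placed at the end.
# Commented-out lines (starting with '#') are left untouched.
# Assumes the file contains no Match blocks.
def ensure_ipqos(config_text)
  active = /\A\s*IPQoS\b/i  # sshd_config keywords are case-insensitive
  kept = config_text.lines.map(&:chomp).reject { |line| line.match?(active) }
  (kept + [IPQOS_LINE]).join("\n") + "\n"
end
```

To use it, read `/etc/ssh/sshd_config` on the guest, pass the contents through `ensure_ipqos`, write the result back (this needs root), and then restart sshd as described above.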

@njtman

njtman commented Mar 21, 2019

/etc/ssh/sshd_config on the server and including this:

IPQoS lowdelay throughput

I can confirm that this also fixed the problem for me.

AntonioMeireles added a commit to AntonioMeireles/ClearLinux-packer that referenced this issue Mar 24, 2019
we've been setting `vmware.ssh_info_public = true` as an workaround to
what turned to be hashicorp/vagrant#10730.

we're now getting around this issue following the upstream suggested
way - disabling `IPQoS` on guest's sshd side.

Signed-off-by: António Meireles <antonio.meireles@reformi.st>
@chrisroberts
Member

Closing this as it was an upstream issue

@ghost

ghost commented May 22, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators May 22, 2020