Vagrant times out when launching vmware boxes if the GUI isn't running. #9902

Open
ladar opened this issue Jun 6, 2018 · 11 comments
@ladar
Contributor

ladar commented Jun 6, 2018

VMware vagrant boxes fail to boot properly when the VMware Workstation GUI isn't running. The boxes themselves launch and run perfectly, but vagrant doesn't detect the boot and so never moves on to the provisioning stage. I tried a large number of vmx and provider options, but nothing resolved the issue consistently.

The issue seems to center on how vagrant looks for guest IP information. It relies either on vmrun getGuestIPAddress or on the /etc/vmware/vmnet8/dhcpd/dhcpd.leases file, the latter being empty. The vmrun method works properly when the GUI is running, but returns spurious results otherwise. A vmrun checkToolsState indicates the tools are installed, but a vmrun getGuestIPAddress says they are not running. If you look at the debug output, this could be an issue with the open-vm-tools not sending the necessary RPC calls, although I can't figure out why launching the GUI fixes the issue.

[ladar@factory vagrant]$ vmrun checkToolsState /home/ladar/Desktop/vagrant/debian9-vmware/.vagrant/machines/default/vmware_workstation/e68b5767-410c-4c74-8b59-c59fca2b8d7b/generic-debian9-vmware.vmx
installed
[ladar@factory vagrant]$ vmrun getGuestIPAddress /home/ladar/Desktop/vagrant/debian9-vmware/.vagrant/machines/default/vmware_workstation/e68b5767-410c-4c74-8b59-c59fca2b8d7b/generic-debian9-vmware.vmx
Error: The VMware Tools are not running in the virtual machine: /home/ladar/Desktop/vagrant/debian9-vmware/.vagrant/machines/default/vmware_workstation/e68b5767-410c-4c74-8b59-c59fca2b8d7b/generic-debian9-vmware.vmx
 INFO vmware_driver: Reading an accessible IP for machine...
 INFO vmware_driver: Trying vmrun getGuestIPAddress...
 INFO subprocess: Starting process: ["/usr/bin/vmrun", "getGuestIPAddress", "/home/ladar/Desktop/vagrant/debian9-vmware/.vagrant/machines/default/vmware_workstation/e68b5767-410c-4c74-8b59-c59fca2b8d7b/generic-debian9-vmware.vmx"]
 INFO subprocess: Command not in installer, restoring original environment...
DEBUG subprocess: Selecting on IO
DEBUG subprocess: stdout: Error: The VMware Tools are not running in the virtual machine: /home/ladar/Desktop/vagrant/debian9-vmware/.vagrant/machines/default/vmware_workstation/e68b5767-410c-4c74-8b59-c59fca2b8d7b/generic-debian9-vmware.vmx
DEBUG subprocess: Waiting for proc
DEBUG vmware_driver: Trying to get MAC address for ethernet0
DEBUG vmware_driver: No explicitly set MAC, looking or auto-generated one...
DEBUG vmware_driver:  -- MAC: 00:0c:29:21:02:f0
 INFO vmware_driver: Reading DHCP lease for '00:0c:29:21:02:f0' on 'vmnet8'
 INFO vmware_driver: DHCP leases file: /etc/vmware/vmnet8/dhcpd/dhcpd.leases
 INFO dhcp_lease_file: Initialized DHCP helper: /etc/vmware/vmnet8/dhcpd/dhcpd.leases
 INFO dhcp_lease_file: Looking for IP for MAC: 00:0c:29:21:02:f0
 INFO dhcp_lease_file:   - IP: 
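For readers following the log above, the lookup order it shows (vmrun getGuestIPAddress first, then the vmnet8 DHCP leases file) can be sketched roughly as follows. This is a minimal Ruby sketch, not the plugin's actual code: the helper names ip_from_leases and guest_ip are hypothetical, and it assumes the leases file uses the usual ISC dhcpd layout, where later entries supersede earlier ones.

```ruby
require "open3"

# Hypothetical helper: return the IP of the last lease block whose
# "hardware ethernet" line matches the MAC, or nil if none matches.
# Assumes ISC dhcpd-style "lease <ip> { ... }" blocks.
def ip_from_leases(leases_text, mac)
  ip = nil
  leases_text.scan(/lease\s+(\S+)\s*\{(.*?)\}/m) do |addr, body|
    # Later blocks overwrite earlier ones, so the last match wins.
    ip = addr if body.match?(/hardware ethernet\s+#{Regexp.escape(mac)};/i)
  end
  ip
end

# Hypothetical helper: ask vmrun first, then fall back to the leases file.
# Only output that looks like an IPv4 address is accepted, so an
# "Error: The VMware Tools are not running..." message falls through.
def guest_ip(vmx_path, mac, leases_path, vmrun: "vmrun")
  begin
    out, _status = Open3.capture2e(vmrun, "getGuestIPAddress", vmx_path)
  rescue SystemCallError
    out = "" # vmrun missing or not executable
  end
  candidate = out.strip
  return candidate if candidate.match?(/\A\d{1,3}(?:\.\d{1,3}){3}\z/)
  return nil unless File.readable?(leases_path)
  ip_from_leases(File.read(leases_path), mac)
end
```

With an empty leases file and a vmrun that reports the tools are not running, both steps come back empty and guest_ip returns nil, which matches the blank "- IP:" line in the log.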

I was unable to isolate why it works with the GUI running. My best guess is that the workflow depends on access to the vmblock device, which uses fuse, and ssh sessions don't get set up correctly with fuse permissions. Alternate theories include VM registration. While the boxes show up properly via vmrun list, the vmrun listRegisteredVM command does not work with my version of VMware Workstation.

/usr/lib/vmware/bin/vmware-vmblock-fuse -o subtype=vmware-vmblock,default_permissions,allow_other /var/run/vmblock-fuse

I also noticed strange behaviour in the /var/log/vmnet file. It seemed to get very active when the GUI was launched, which leads me to think the virtual network devices don't work properly without a GUI session. Strangely, the virtual network interfaces do in fact work without the GUI, but perhaps the vagrant network mappings aren't loaded properly until the GUI is launched.

Either way, I spent 12 hours trying to figure out why my VMWare roboxes weren't working ... with a large chunk of that time thinking the problem was my guest network configuration. I suspect a number of other open issues have this problem, whatever it may be, at their heart.

Vagrant version

Version: 2.1.1
VMWare Plugin: 5.0.4

Host operating system

CentOS 7.5.1804

Hypervisor

VMware Workstation 12.5.9-7535481

Guest operating system

Multiple guest operating systems are failing.

Vagrantfile

Vagrant.configure(2) do |config|
  config.vm.box = "generic/debian9"
  ["vmware_fusion", "vmware_workstation", "vmware_desktop"].each do |provider|
    config.vm.provider provider do |v, override|
      v.gui = false
      v.vmx["memsize"] = "1024"
      v.vmx["numvcpus"] = "1"
      v.vmx["cpuid.coresPerSocket"] = "1"
    end
  end
end

Debug output

The vmware.log and debug.txt files are attached, as they're too long to include inline. The default console output follows. Essentially the machine boots and configures itself properly, but vagrant stalls at "Waiting for machine to boot." If you were so inclined, you could connect directly to the machine using the listed port, i.e. ssh vagrant@localhost -p 2212, so that is working fine, and once inside the guest you can access the internet without a problem.

+ mkdir debian9-vmware
+ vagrant plugin list
vagrant-vmware-workstation (5.0.4)
+ cp --force debian9.tpl debian9-vmware/Vagrantfile
+ cd debian9-vmware
+ '[' vmware == vmware ']'
+ vagrant box add --clean --force --provider vmware_desktop generic/debian9
==> box: Loading metadata for box 'generic/debian9'
    box: URL: https://vagrantcloud.com/generic/debian9
==> box: Adding box 'generic/debian9' (v1.6.12) for provider: vmware_desktop
    box: Downloading: https://vagrantcloud.com/generic/boxes/debian9/versions/1.6.12/providers/vmware_desktop.box
==> box: Successfully added box 'generic/debian9' (v1.6.12) for 'vmware_desktop'!
+ vagrant up --provider vmware_workstation
Bringing machine 'default' up with 'vmware_workstation' provider...
==> default: Cloning VMware VM: 'generic/debian9'. This can take some time...
==> default: Checking if box 'generic/debian9' is up to date...
==> default: Verifying vmnet devices are healthy...
==> default: Preparing network adapters...
==> default: Fixed port collision for 22 => 2222. Now on port 2212.
==> default: Starting the VMware VM...
==> default: Waiting for the VM to receive an address...
==> default: Forwarding ports...
    default: -- 22 => 2212
==> default: Waiting for machine to boot. This may take a few minutes...
Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.

If you look above, you should be able to see the error(s) that
Vagrant had when attempting to connect to the machine. These errors
are usually good hints as to what may be wrong.

If you're using a custom box, make sure that networking is properly
working and you're able to connect to the machine. It is a common
problem that networking isn't setup properly in these boxes.
Verify that authentication configurations are also setup properly,
as well.

If the box appears to be booting properly, you may want to increase
the timeout ("config.vm.boot_timeout") value.
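Since the guest is reachable on the forwarded port even while vagrant up is stalled, one way to confirm that independently of vagrant is to probe the port for an SSH banner. The following is only a sketch: ssh_banner is a hypothetical helper, and the default timeout is an arbitrary choice.

```ruby
require "socket"

# Hypothetical helper: connect to host:port and return the SSH
# identification line (e.g. "SSH-2.0-...") or nil if nothing usable
# arrives within the timeout.
def ssh_banner(host, port, timeout: 5)
  Socket.tcp(host, port, connect_timeout: timeout) do |sock|
    # Wait until the server has sent something, or give up.
    return nil unless IO.select([sock], nil, nil, timeout)
    line = sock.gets
    line&.start_with?("SSH-") ? line.strip : nil
  end
rescue SystemCallError, SocketError
  nil # refused, timed out, unresolvable host, etc.
end
```

For the run above, calling ssh_banner("localhost", 2212) while vagrant is still waiting would show whether the NAT port forward and sshd are already up, which separates "guest not reachable" from "vagrant not detecting the guest".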

Expected behavior

The machines should boot, and configure themselves properly, like good little robot boxes. When the VMWare Workstation GUI is running on the console, this is precisely what happens. It doesn't appear to matter whether vagrant is run from the console, or via ssh in this scenario. All 24 of my generic vmware boxes boot properly.

Actual behavior

When the VMware Workstation GUI isn't running and I run vagrant via ssh, every box I tried failed. (I think I only tested vagrant from the console with the GUI closed a couple of times, and I don't recall the result.) The issue may be limited to open-vm-tools roboxes, since I focused on builds which used various versions of the open-vm-tools. I believe I saw the same result with the proprietary tools, but I haven't gone back and tested since I realized the GUI was the determining factor. It's important to note that opening the GUI while a box is "hung" instantly fixes the problem, and running the same operation with the GUI already running works perfectly. The enable_vmrun_ip_lookup option had no effect (nor did a large number of other things I tried).

Steps to reproduce

Run the commands above over SSH without the VMWare GUI.

References

I believe a number of other reports are related to this issue, as people probably didn't realize what the problem was.

P.S. Yes. I tried rebooting already.

P.P.S. Packer doesn't appear to be affected.

vmware.log
debug.txt

@rodenberg

I can't get my vagrant image of windows 98 to run pls halp!

@briancain
Member

@ladar - apologies for the slow response time, but do you still receive this error if you upgrade to the latest version of the vmware plugin? Thanks!

@ladar
Contributor Author

ladar commented Jul 13, 2018

@briancain I think it was mentioned above, but buried. I'm testing with 25 different distros, and while I haven't double-checked, I believe the boxes which fail are all using the open-vm-tools package instead of the proprietary VMware tools package. Which means, I think, the boxes which succeed are using the proprietary tools. Since the open-vm-tools package is pulled from the distro repository, different versions are in play. If you want, I can kick off a test run and give you a list of passing/failing boxes.

Long story short, I think it has to do with the way vagrant looks for the box IP address. Since the open-vm-tools communicate using a different RPC interface, the needed info probably isn't available given the way vagrant looks for it. What makes this weird is that having the GUI running fixes the issue. If I had to guess, it's because the GUI updates the state information somewhere, or in some way, such that vagrant can access it. As I recall, vagrant looks at the DHCP state table and uses the vmrun command. It's possible the GUI needs to be running to update the DHCP address list, as that file is owned by root.

I would have chalked this up to the open-vm-tools, except that if I manually run vagrant ssh via a different connection, vagrant does detect the IP and manages to connect, while the vagrant up command continues to wait and eventually times out. Which makes me think vagrant ssh is looking for the IP address in some way that the vagrant up command doesn't.

This is all theory, of course. After spending however many hours I did debugging this frustrating issue, only to realize that having the GUI running fixed it, I moved on without testing any of my theories. I figured I'd leave the last bit to the professionals.

@chrisroberts
Member

@ladar Do you still run into this issue when using the latest version of the vagrant-vmware-utility and vagrant-vmware-desktop plugin? I had run into this situation as well and had made some modifications to prevent that behavior when the GUI was not open. If you're still running into this behavior I'll see about getting it reproduced locally and determine the source.

Cheers!

@ladar
Contributor Author

ladar commented Oct 17, 2018

@chrisroberts I did the last time I ran a full test (~1 month ago). I will kick off another test run. To confirm, what vagrant/vagrant-vmware-desktop versions should I be using?

@chrisroberts
Member

@ladar latest versions of everything would be ideal

@chrisroberts
Member

and thanks!

@ladar
Contributor Author

ladar commented Oct 18, 2018

@chrisroberts switching from vagrant-vmware-workstation to vagrant-vmware-desktop seems to have fixed most of the issues, with two exceptions.

All of the generic boxes work correctly with v.ssh_info_public = true. However, the Arch and Fedora 28 boxes, aka generic/arch and generic/fedora28, still fail without the public IP setting. This seems to be a NAT issue, not a problem determining the IP address. The generic/gentoo box would also fail if I hadn't already implemented a workaround in the build configuration for that box.

I believe this problem is an issue with the way vagrant does the SSH port mapping, and the latest version of OpenSSH. I will post debug logs shortly.
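For anyone wanting to try the ssh_info_public setting mentioned above, a minimal Vagrantfile along these lines should be enough. It is a sketch under assumptions: the box name is just one of the two affected boxes, and the provider block is trimmed to the options relevant here.

```ruby
# Sketch of a Vagrantfile for the vagrant-vmware-desktop plugin.
Vagrant.configure(2) do |config|
  config.vm.box = "generic/fedora28"

  config.vm.provider "vmware_desktop" do |v|
    # Run headless, as in the original report.
    v.gui = false
    # Have vagrant ssh to the guest's own IP instead of the
    # NAT-forwarded port on localhost.
    v.ssh_info_public = true
  end
end
```

With this in place, vagrant up connects to the guest address directly, which sidesteps the port forward that appears to be the problem for the Arch and Fedora 28 boxes.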

@ladar
Contributor Author

ladar commented Oct 18, 2018

@chrisroberts
Member

@ladar Awesome, thanks so much for this! I'll dig through these debug logs today and see what I can find.

@ladar
Contributor Author

ladar commented Oct 18, 2018

You might also try running the boxes themselves, headless, to see if it works on your machine:

vagrant init generic/arch && vagrant up --provider vmware_workstation

and...

vagrant init generic/fedora28 && vagrant up --provider vmware_workstation

Knowing whether it works on your machine could narrow down the number of possible culprits... such as whether it's an issue with VMWare Workstation v12.5.9 and not the more current v14.1.3 or v15.0.0. Or perhaps an issue with the ssh client on CentOS 7.5...
