fresh vagrant up fails due to machine being locked #8468

Closed
samueljc opened this issue Apr 11, 2017 · 6 comments · Fixed by #8951

Comments

@samueljc

Vagrant version(s)

1.9.1
1.9.3

VirtualBox version

5.1.14r112924

Host operating system

Ubuntu 16.04

Guest operating system

Ubuntu 16.04

Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.box = "bento/ubuntu-16.04"
end

Debug output

https://gist.github.com/samueljc/74d5d10e99358831da630cf755a33299

Expected behavior

Vagrant machine comes up.

Actual behavior

Vagrant machine fails to come up and reports that it can't continue because VBoxManage failed due to the machine being locked.

Where exactly it fails varies. I've seen the up fail while clearing forwarded ports, setting forwarded ports, clearing previous network interfaces, setting network interfaces, and during startvm.

Other than startvm, everything else uses modifyvm. Looking at the driver for VirtualBox 5.0, only one of the modifyvm calls is wrapped in retryable. Would it be reasonable/safe to retry such commands when they fail, with a short sleep between attempts?
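
To make the suggestion concrete, here is a minimal sketch of the kind of retry I have in mind. This is not Vagrant's actual Vagrant::Util::Retryable helper; the method name, the tries/sleep values, and the modifyvm arguments are placeholders for illustration only.

# Minimal sketch of retry-with-sleep around a VBoxManage call; the helper
# name, defaults, UUID, and port-forward rule below are placeholders.
def with_retries(tries: 3, sleep_for: 1)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue StandardError
    raise if attempt >= tries
    sleep(sleep_for) # give the other process time to release the VM lock
    retry
  end
end

# Hypothetical usage: retry a flaky modifyvm invocation a few times before
# giving up for good.
with_retries(tries: 3, sleep_for: 2) do
  ok = system("VBoxManage", "modifyvm", "MACHINE-UUID",
              "--natpf1", "ssh,tcp,127.0.0.1,2222,,22")
  raise "modifyvm failed" unless ok
end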

Steps to reproduce

Note: this doesn't happen reliably, but it seems to happen much more frequently when the host machine is under heavy load. We've had the issue occur occasionally in our GitLab CI pipeline, which brings up multiple machines simultaneously; all of the machines in the pipeline are headless, linked clones.

I was able to reproduce it somewhat reliably (1 or 2 failed machines per run) using the provided Vagrantfile and creating 8 machines in parallel.
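
In outline, the reproduction amounts to something like the script below. This is a hedged sketch rather than the exact script I used; the directory layout, machine count, and file names are placeholders.

#!/usr/bin/env ruby
# Sketch of the reproduction: copy the Vagrantfile above into several scratch
# directories and run `vagrant up` in all of them at once, so their VBoxManage
# calls race against each other. Paths and the machine count are placeholders.
require "fileutils"

MACHINES = 8
SRC = File.expand_path("Vagrantfile", __dir__)

pids = (1..MACHINES).map do |i|
  dir = File.expand_path("repro-#{i}", __dir__)
  FileUtils.mkdir_p(dir)
  FileUtils.cp(SRC, File.join(dir, "Vagrantfile"))
  Process.spawn("vagrant", "up", chdir: dir) # launch all the ups in parallel
end

failed = pids.count { |pid| !Process.wait2(pid).last.success? }
puts "#{failed} of #{MACHINES} machines failed to come up"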

References

I've seen other instances of machines getting locked, but those posts are about rescuing them. Rescuing a locked machine doesn't help me, though; I need the machines to come up reliably, and they'll be disposed of after use.

@samueljc
Author

Not sure how much more I can add to satisfy the 'needs-repro' tag as this issue arises from flakiness, but here's the closest thing to a reliable way of reproducing the problem that I've got.

A gist of the script I used to reproduce this issue using the Vagrantfile described above: https://gist.github.com/samueljc/a6c9508e50b2899761086acccbf03984

I also tested it with the following Vagrantfile, to show that the problem exists even when using linked clones and explicitly declaring the resource footprint.

Vagrant.configure("2") do |config|
  config.vm.box = "bento/ubuntu-16.04"
  config.vm.provider('virtualbox') do |vb|
    vb.linked_clone = true
    vb.memory = 1024
    vb.cpus = 1
  end
end

@chrisroberts
Member

@samueljc Hi! The needs-repro isn't for you. It's simply a way to let me know I need to reproduce the error locally to identify the root cause and impact of a fix. The Vagrantfiles you provided are perfect. Thanks!

samueljc pushed a commit to samueljc/vagrant that referenced this issue Apr 25, 2017
Issue: hashicorp#8468

A lot of vboxmanage commands are flakey and frequently cause
bringing multiple machines up at once to fail, especially when
the host system is under heavy load. Most commands are also safe
to retry and just result in a no-op, so we can simply add
'retryable' to a lot of existing calls. For the others we need to
do a little bit of cleanup or reevaluate the parameters before
trying again.
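
As an illustration of the "cleanup or reevaluate the parameters" case mentioned in the commit message, the sketch below re-reads the machine's current state on every attempt instead of blindly re-running the same command. This is not the actual patch; the helper name and the parsing are assumptions, though the VBoxManage invocations mirror real usage.

# Hedged sketch, not the actual change: delete a VM's NAT port-forward rules,
# re-reading the current rule list on each retry so a partially-applied
# earlier attempt doesn't make the retried delete fail.
require "open3"

def delete_forward_rules(uuid, tries: 3)
  attempt = 0
  begin
    attempt += 1
    # Re-read the VM's current forwarding rules on every attempt.
    info, = Open3.capture2("VBoxManage", "showvminfo", uuid, "--machinereadable")
    names = info.scan(/^Forwarding\(\d+\)="([^,]+),/).flatten
    names.each do |name|
      ok = system("VBoxManage", "modifyvm", uuid, "--natpf1", "delete", name)
      raise "failed to delete forwarding rule #{name}" unless ok
    end
  rescue RuntimeError
    raise if attempt >= tries
    sleep(1) # brief pause before retrying, in case the VM session is still locked
    retry
  end
end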
@samueljc
Author

Tried my hand at patching this. With the changes applied, I tested it largely the same way as before, but checked the exit status instead of looking for that specific error, and things performed much better.

Admittedly, it didn't run indefinitely: after about a dozen cycles of bringing 8 boxes up at once, one of them failed to come up even after retrying a command 3 times. Still, that's much better than before, when at least one machine would usually fail on the first batch.

@wasosa

wasosa commented May 15, 2017

@chrisroberts: Hi! I'm interested in seeing this issue fixed, so I took a stab at reviewing Samuel's patch (#8525), in case that helps. Thanks @samueljc!

@marek-obuchowicz

Is there any update on this?
I also observed this issue on Vagrant 1.9.2 and, after upgrading, on 1.9.7. It's a CI system where only one job is set up and people don't have access to the machine. On random steps in our pipeline that involve vagrant up, I'm getting:

+ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
    default: Adapter 2: hostonly
==> default: Forwarding ports...
    default: 3306 (guest) => 3306 (host) (adapter 1)
    default: 10004 (guest) => 10004 (host) (adapter 1)
    default: 10005 (guest) => 10005 (host) (adapter 1)
    default: 10007 (guest) => 10007 (host) (adapter 1)
    default: 15672 (guest) => 15672 (host) (adapter 1)
    default: 58080 (guest) => 59080 (host) (adapter 1)
    default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
==> default: Machine booted and ready!
[default] GuestAdditions 5.1.26 running --- OK.
==> default: Checking for guest additions in VM...
==> default: Setting hostname...
==> default: Configuring and enabling network interfaces...
==> default: Exporting NFS shared folders...
==> default: Preparing to edit /etc/exports. Administrator privileges will be required...
● nfs-kernel-server.service - LSB: Kernel NFS server support
   Loaded: loaded (/etc/init.d/nfs-kernel-server)
   Active: active (running) since Tue 2017-08-15 19:12:21 CEST; 8h ago
  Process: 12102 ExecStop=/etc/init.d/nfs-kernel-server stop (code=exited, status=0/SUCCESS)
  Process: 13560 ExecStart=/etc/init.d/nfs-kernel-server start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/nfs-kernel-server.service
           └─13585 /usr/sbin/rpc.mountd --manage-gids
==> default: Mounting NFS shared folders...
==> default: Mounting shared folders...
    default: /vagrant => /var/lib/jenkins/jobs/siroop-vm-loki/workspace
==> default: [vagrant-hostmanager:guests] Updating hosts file on active guest virtual machines...
Vagrant can't use the requested machine because it is locked! This
means that another Vagrant process is currently reading or modifying
the machine. Please wait for that Vagrant process to end and try
again. Details about the machine are shown below:

Name: default
Provider: virtualbox

There's no one else accessing anything on this machine, and it breaks on a random basis (some builds are OK; for others the error happens at a random step involving vagrant up). Please let me know if this should be reported as a separate issue or is related to this one.

briancain pushed a commit to briancain/vagrant that referenced this issue Sep 6, 2017
Issue: hashicorp#8468

k-oguma pushed a commit to k-oguma/vagrant that referenced this issue Nov 9, 2017
Issue: hashicorp#8468

@ghost

ghost commented Mar 31, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Mar 31, 2020