
On vagrant up with an additional private network, the Ansible provisioner randomly fails on an SSH timeout #4860

Closed
x-yuri opened this issue Nov 22, 2014 · 22 comments


@x-yuri

x-yuri commented Nov 22, 2014

Not sure if it's vagrant's fault:

$ sed -i '/^192.168.88.10/d' ~/.ssh/known_hosts && vagrant destroy -f && vagrant up && TO=local mina create_keys
==> default: Forcing shutdown of VM...
==> default: Destroying VM and associated drives...
==> default: Running cleanup tasks for 'ansible' provisioner...
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'ubuntu/trusty64'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'ubuntu/trusty64' is up to date...
==> default: Setting the name of the VM: domain_default_1416649876282_50981
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
    default: Adapter 2: hostonly
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Connection timeout. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Configuring and enabling network interfaces...
==> default: Mounting shared folders...
    default: /vagrant => /srv/http/domain
==> default: Running provisioner: ansible...
ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --vault-password-file=provisioning/.vault_pass_dev --inventory-file=provisioning/hosts.dev -vvvv --limit='all' provisioning/playbook.yml

PLAY [prepare] **************************************************************** 

GATHERING FACTS *************************************************************** 
<192.168.88.10> ESTABLISH CONNECTION FOR USER: vagrant
<192.168.88.10> REMOTE_MODULE setup
<192.168.88.10> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.10', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1416649927.12-208953686388775 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1416649927.12-208953686388775 && echo $HOME/.ansible/tmp/ansible-tmp-1416649927.12-208953686388775'"]
fatal: [domain.local.encrypted] => SSH encountered an unknown error. The output was:
OpenSSH_6.7p1, OpenSSL 1.0.1j 15 Oct 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: auto-mux: Trying existing master
debug1: Control socket "/home/yuri/.ansible/cp/ansible-ssh-192.168.88.10-22-vagrant" does not exist
debug2: ssh_connect: needpriv 0
debug1: Connecting to 192.168.88.10 [192.168.88.10] port 22.
debug2: fd 3 setting O_NONBLOCK
debug1: connect to address 192.168.88.10 port 22: Connection timed out
ssh: connect to host 192.168.88.10 port 22: Connection timed out


TASK: [prepare_system | set /etc/timezone] ************************************ 
FATAL: no hosts matched or all hosts have already failed -- aborting


PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/home/yuri/playbook.retry

domain.local.encrypted  : ok=0    changed=0    unreachable=1    failed=0   

Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.

This error happens occasionally (not every time): first vagrant says the machine is up and ready, but then ansible says it can't connect. Can this be fixed in any way? Could you suggest how to investigate the issue further?

@gildegoma
Collaborator

You're using a custom inventory file (provisioning/hosts.dev), but Vagrant dynamically allocates the forwarded port for SSH connections.

Note that for simple use cases, I'd recommend relying on the auto-generated inventory by simply removing the ansible.inventory_path option from your Vagrantfile.

If you still encounter problems, could you please gist your Vagrantfile and provisioning/hosts.dev, and maybe a playbook example that can reproduce the issue?
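
For reference, a minimal sketch of what such a Vagrantfile could look like (the box, IP and playbook path are taken from your report; adjust as needed):

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.network "private_network", ip: "192.168.88.10"

  config.vm.provision :ansible do |ansible|
    ansible.playbook = "provisioning/playbook.yml"
    ansible.verbose  = "vvvv"
    # No ansible.inventory_path: Vagrant then generates
    # .vagrant/provisioners/ansible/inventory itself, pointing at the
    # SSH endpoint reported by the provider.
  end
end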

@x-yuri
Author

x-yuri commented Nov 22, 2014

I'll try to set up a testing VM. Meanwhile, could you elaborate on the "but Vagrant dynamically allocates the forwarded port for SSH connections" part? Actually, I don't really understand what's happening. As far as I can tell, vagrant always sets up port forwarding like this: 2222 (host) => 22 (guest), at least for this VM. And ansible seems to connect using port 22. What strikes me is: why does ansible manage to connect at all? Shouldn't it be using port 2222?

UPD Well, it must be because ansible connects over the host-only network. But I still don't understand how Vagrant "allocates the forwarded port for SSH connections."

@gildegoma
Collaborator

Well, it must be because ansible connects using host-only network

Yes indeed, my comment about port forwarding only referred to the forwarded localhost connection (127.0.0.1:2222), which is Vagrant's default SSH path. In your case, you've defined an extra private (or public) network (192.168.88.10), where using the default port 22 is absolutely correct.
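
To illustrate the difference (the values are only an example; the forwarded port can vary if 2222 is already taken): the auto-generated inventory points at the forwarded localhost port, whereas a custom inventory like yours points directly at the private network address:

# .vagrant/provisioners/ansible/inventory (auto-generated, roughly):
default ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222

# provisioning/hosts.dev (custom, presumably something like):
domain.local.encrypted ansible_ssh_host=192.168.88.10 ansible_ssh_user=vagrant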

With a quick Internet search, I found that there are several SSH issues when running Mac OS X Mavericks (and maybe affecting Yosemite as well). I could not check them, as I only have an OS X 10.9 box, but please have a look at:

If you still encounter problems, could you please provide the following information:

  • Host Operating System details
  • Vagrant version
  • Ansible version
  • gist of your Vagrantfile
  • gist of your ansible inventory
  • gist of your ansible playbook

Hint: for faster debugging, you can also execute the ansible-playbook command generated by Vagrant directly from the shell:

ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --vault-password-file=provisioning/.vault_pass_dev --inventory-file=provisioning/hosts.dev -vvvv --limit='all' provisioning/playbook.yml
(You'll see this command when ansible.verbose is activated.)

@gildegoma
Collaborator

@x-yuri I'm closing this until we receive more information from your side.

@x-yuri
Author

x-yuri commented Nov 30, 2014

Sorry for the delay. So, with this setup it still fails occasionally:

$ cat ./Vagrantfile
# -*- mode: ruby -*-
# vi: set ft=ruby :

VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.network "private_network", ip: "192.168.88.11"
  config.vm.provision :ansible do |ansible|
    ansible.playbook = "playbook.yml"
    ansible.inventory_path = "hosts"
    ansible.verbose = 'vvvv'
  end
end
$ cat ./hosts
[default]
h1 ansible_ssh_host=192.168.88.11 ansible_ssh_user=vagrant
$ cat ./playbook.yml
---
- hosts: default
  tasks:
    - name: test task
      shell: echo test

Hint: for faster debugging, you can also execute the ansible-playbook command generated by Vagrant directly from the shell:

ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --vault-password-file=provisioning/.vault_pass_dev --inventory-file=provisioning/hosts.dev -vvvv --limit='all' provisioning/playbook.yml
(You'll see this command when ansible.verbose is activated.)

I thought the problem had to do with ansible running too early. Was I wrong? Anyway, I tested it this way: vagrant destroy -f && vagrant up.

You're using a custom inventory file (provisioning/hosts.dev), but Vagrant dynamically allocates the forwarded port for SSH connections.

Still, what exactly do you mean by dynamically allocating ports?

The issues at those links have nothing to do with mine. The first has the ssh: connect to host 127.0.0.1 port 22: Connection refused error, and the second fatal: [127.0.0.1] => to use the 'ssh' connection type with passwords, you must install the sshpass program.

UPD I just confirmed that it must have to do with ansible running too early. I ran vagrant destroy -f && vagrant up until I got the error. Then I ran ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --limit='default' --inventory-file=hosts -vvvv playbook.yml and it succeeded:

$ vagrant destroy -f && vagrant up
==> default: Forcing shutdown of VM...
==> default: Destroying VM and associated drives...
==> default: Running cleanup tasks for 'ansible' provisioner...
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'ubuntu/trusty64'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'ubuntu/trusty64' is up to date...
==> default: Setting the name of the VM: 5_default_1417385755633_17573
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
    default: Adapter 2: hostonly
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Connection timeout. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Configuring and enabling network interfaces...
==> default: Mounting shared folders...
    default: /vagrant => /home/yuri/_/5
==> default: Running provisioner: ansible...
ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --limit='default' --inventory-file=hosts -vvvv playbook.yml

PLAY [default] **************************************************************** 

GATHERING FACTS *************************************************************** 
<192.168.88.11> ESTABLISH CONNECTION FOR USER: vagrant
<192.168.88.11> REMOTE_MODULE setup
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1417385811.38-111995198978464 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1417385811.38-111995198978464 && echo $HOME/.ansible/tmp/ansible-tmp-1417385811.38-111995198978464'"]
fatal: [h1] => SSH encountered an unknown error. The output was:
OpenSSH_6.7p1, OpenSSL 1.0.1j 15 Oct 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: auto-mux: Trying existing master
debug1: Control socket "/home/yuri/.ansible/cp/ansible-ssh-192.168.88.11-22-vagrant" does not exist
debug2: ssh_connect: needpriv 0
debug1: Connecting to 192.168.88.11 [192.168.88.11] port 22.
debug2: fd 3 setting O_NONBLOCK
debug1: connect to address 192.168.88.11 port 22: Connection timed out
ssh: connect to host 192.168.88.11 port 22: Connection timed out


TASK: [test task] ************************************************************* 
FATAL: no hosts matched or all hosts have already failed -- aborting


PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/home/yuri/playbook.retry

h1                         : ok=0    changed=0    unreachable=1    failed=0   

Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.
$ ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false PYTHONUNBUFFERED=1 ansible-playbook --private-key=/home/yuri/.vagrant.d/insecure_private_key --user=vagrant --limit='default' --inventory-file=hosts -vvvv playbook.yml

PLAY [default] **************************************************************** 

GATHERING FACTS *************************************************************** 
<192.168.88.11> ESTABLISH CONNECTION FOR USER: vagrant
<192.168.88.11> REMOTE_MODULE setup
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030 && echo $HOME/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030'"]
<192.168.88.11> PUT /tmp/tmpiwjrvt TO /home/vagrant/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030/setup
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /home/vagrant/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030/setup; rm -rf /home/vagrant/.ansible/tmp/ansible-tmp-1417386104.06-113950359141030/ >/dev/null 2>&1'"]
ok: [h1]

TASK: [test task] ************************************************************* 
<192.168.88.11> ESTABLISH CONNECTION FOR USER: vagrant
<192.168.88.11> REMOTE_MODULE command echo test #USE_SHELL
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877 && echo $HOME/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877'"]
<192.168.88.11> PUT /tmp/tmpr5GMav TO /home/vagrant/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877/command
<192.168.88.11> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/yuri/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'IdentityFile="/home/yuri/.vagrant.d/insecure_private_key"', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '192.168.88.11', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /home/vagrant/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877/command; rm -rf /home/vagrant/.ansible/tmp/ansible-tmp-1417386115.24-57373581652877/ >/dev/null 2>&1'"]
changed: [h1] => {"changed": true, "cmd": "echo test", "delta": "0:00:00.023101", "end": "2014-11-30 22:22:28.782800", "rc": 0, "start": "2014-11-30 22:22:28.759699", "stderr": "", "stdout": "test"}

PLAY RECAP ******************************************************************** 
h1                         : ok=2    changed=1    unreachable=0    failed=0

UPD I'm running Arch Linux, vagrant 1.6.5 and ansible 1.7.2.

@gildegoma
Collaborator

@x-yuri thanks for all the info. This is odd... What I see here is that with your Arch Linux you're using one of the most recent versions of OpenSSH (version 6.7 was released on October 6, 2014). Ubuntu 14.04 is using 6.6p1, Homebrew for Mac provides 6.6p1_1.

Which VM provider are you actually using? KVM? VirtualBox?

I'm reopening this issue and flagging it hosts/arch for now. I would also recommend searching and asking in the Arch community (https://wiki.archlinux.org/index.php/Vagrant, https://bbs.archlinux.org/search.php).

@gildegoma gildegoma reopened this Dec 1, 2014
@gildegoma gildegoma changed the title provisioner times out Ansible provisioner times out on Arch Linux with OpenSSH 6.7p1 Dec 1, 2014
@gildegoma
Collaborator

@x-yuri I renamed the issue. Does the vagrant ssh command work?

@x-yuri
Author

x-yuri commented Dec 1, 2014

I'm using virtualbox, and vagrant ssh works. It looks like ansible starts too early, when the ssh connection cannot yet be established.

@gildegoma
Collaborator

Please, can you execute VAGRANT_LOG=debug vagrant provision and gist the full log output?

@gildegoma
Collaborator

and maybe also for vagrant ssh (for comparison)

@x-yuri
Author

x-yuri commented Dec 1, 2014

I hope there's no sensitive information there.

@gildegoma
Collaborator

@x-yuri the last example that you gisted does not fully correspond to the original problem report:

  1. it now uses the auto-generated ansible inventory (--inventory-file=/home/yuri/_/5/.vagrant/provisioners/ansible/inventory)
  2. and it actually works fine

That said, I agree with your analysis that at least the "private network (192.168.88.11)" connectivity is not ready when the ansible provisioner is started.

Next, could you please:

  • keep your Vagrantfile configured without the ansible.inventory_path option (auto-generated inventory), and try: vagrant destroy -f && VAGRANT_LOG=debug vagrant up. I expect that everything will run fine, as the basic localhost network is also used by other vagrant internals to ssh into the VM during the bootstrap.
  • Re-configure back to ansible.inventory_path = "hosts" and again execute vagrant destroy -f && VAGRANT_LOG=debug vagrant up, to try to understand why or how the private network is slow to be enabled.

Note: I updated the issue title again (for helping internet searches).

@gildegoma gildegoma changed the title Ansible provisioner times out on Arch Linux with OpenSSH 6.7p1 vagrant up on Arch Linux Host: the private network is not available when Ansible provisioner begins Dec 2, 2014
@gildegoma
Collaborator

@x-yuri what is your exact versions for:

  • Virtualbox
  • Arch Linux distro
  • Linux kernel

@x-yuri
Author

x-yuri commented Dec 2, 2014

Well, if you expected me to gist vagrant provision while it fails, then first of all, I don't believe I could do that, because vagrant provision is run after vagrant up and I didn't see it failing. However, I haven't tested it thoroughly. Additionally, if the reason is ansible starting too early, we should inspect vagrant up, not vagrant provision.

it now uses the autogenarate ansible inventory (--inventory-file=/home/yuri/_/5/.vagrant/provisioners/ansible/inventory)

Sorry, I was experimenting with commenting the corresponding line out and forgot to uncomment it.

Re-configured back to ansible.inventory_path = "hosts" and again execute vagrant destroy -f && VAGRANT_LOG=debug vagrant up to try to understand why or how the private network is slow to be enabled.

To make things clear (I'm not sure you noticed the point), it's not just about re-executing vagrant. I have to run it until it fails. I'm going to try this command this time: for i in {1..10}; do vagrant destroy -f && VAGRANT_LOG=debug vagrant up || break; done.

As for the versions, virtualbox is 4.3.18. Arch Linux has no such thing as a version; the closest thing I can provide is the last time I ran pacman -Syu, which was 2014-11-17 01:29. And the kernel is 3.17.3.

Also, I wanted to make it clear that I didn't tweak the kernel, and I made only minimal adjustments to the configuration of the software I'm using, just the things I needed. Also, I hope it's not only about Arch Linux :) But I've probably yet to prove that.

@gildegoma
Collaborator

@x-yuri sorry if I was unclear, but no worries, we'll keep digging into it bravely 🏥. Also thank you for the software version clarifications; you're running only on the bleeding edge 🚀, and that could be a source of "yet unknown" integration troubles, that's all.

if the reason is ansible starting too early, we should inspect vagrant up, not vagrant provision

Correct, that was my mistake. I should have asked you to run VAGRANT_LOG=debug vagrant up instead, but I mixed myself up with copy-paste and so on. Sorry about that 😿.
Note that vagrant up automatically kicks off the provisioner(s) on the initial VM bootstrap, hence the value of using VAGRANT_LOG=debug vagrant up (with a new VM to be created).

So, my questions are:

  • with hosts: does vagrant destroy -f; vagrant up systematically fail (ansible timeout) with your current host setup?
  • with the auto-generated inventory: does vagrant destroy -f; vagrant up succeed in running the ansible provisioner without a timeout error?

@tkellen

tkellen commented Dec 3, 2014

Just chiming in that I am experiencing this as well. I also have a custom inventory file which contains the following: vagrant ansible_ssh_host=192.168.33.40. I'm doing this so I can set variables specific to development in host_vars/vagrant.
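
For context, a rough sketch of that layout (the variable name is just an example):

# inventory file, next to the playbook:
vagrant ansible_ssh_host=192.168.33.40

# host_vars/vagrant (YAML), picked up automatically by Ansible for the host named "vagrant":
some_dev_only_setting: true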

The problem is intermittent and running vagrant provision manually works fine after the VM finishes booting.

@x-yuri
Author

x-yuri commented Dec 3, 2014

@tkellen Which OS is it?

@gildegoma This is the output for failed and successful runs. It didn't occur to me that less would save the output with escape sequences, so I had to remove them with this script afterwards. As for making it fail with the auto-generated inventory, I couldn't reproduce it yet. I'll probably leave my computer trying to make it happen overnight. I hope I didn't miss any of your requests.

@tkellen

tkellen commented Dec 3, 2014

Multiple hosts. I've seen it on OSX and Ubuntu.

@gildegoma
Collaborator

@tkellen Thanks for your feedback, which obviously enlarges the scope of the problem (I'll update the issue title and labels again).

@x-yuri Awesome work fetching the logs, thanks a lot! I've started to check and compare the vagrant-fail and vagrant-success outputs. In both runs, the setup of the 192.168.88.x private network succeeds and looks the same (VirtualBox commands and Ubuntu network reconfiguration), but maybe there is a random latency until /sbin/ifup eth1 is really effective... and I don't think that at the moment Vagrant checks/waits until additional networks are operational before going further. This could explain why the ansible provisioner is sometimes fired too early, but I need to dig further into this (later, not today)...

But for the next investigation steps I propose two kinds of possible "workarounds":

  • Increase SSH connection timeout (e.g. ansible.raw_arguments = ['--timeout=300'] via Vagrantfile or via a local ansible.cfg file if you prefer)

I also noticed that when ansible-playbook fails, we see this message (also reported in ansible/ansible#7759):

debug1: auto-mux: Trying existing master
debug1: Control socket "/home/yuri/.ansible/cp/ansible-ssh-192.168.88.11-22-vagrant" does not exist

Usually, I don't think this should be a problem (SSH simply recreates the control socket file), but maybe it introduces some additional delay that possibly leads to the observed timeout error. So I think you could also try to:

  • Disable ControlPersist with ansible.raw_ssh_args = ['-o ControlPersist=no']

It would be of great help if you could try these two custom configurations 🆘
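
For illustration, both workarounds combined into your minimal repro Vagrantfile might look like this (a sketch only; the timeout value is an example to tune, and the ansible.cfg alternative below is optional):

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.network "private_network", ip: "192.168.88.11"

  config.vm.provision :ansible do |ansible|
    ansible.playbook       = "playbook.yml"
    ansible.inventory_path = "hosts"
    ansible.verbose        = "vvvv"
    # Workaround 1: raise the ansible-playbook SSH timeout (default: 10 seconds)
    ansible.raw_arguments  = ['--timeout=300']
    # Workaround 2: disable SSH connection multiplexing
    ansible.raw_ssh_args   = ['-o ControlPersist=no']
  end
end

The timeout can equivalently be set in a local ansible.cfg next to the Vagrantfile:

[defaults]
timeout = 300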

@gildegoma gildegoma changed the title vagrant up on Arch Linux Host: the private network is not available when Ansible provisioner begins On vagrant up with an additional private network, the Ansible provisioner randomly fails on an SSH timeout Dec 3, 2014
@x-yuri
Author

x-yuri commented Dec 5, 2014

@tkellen To make it clear, was the error exactly "Connection timed out"?

I've seen it on OSX and Ubuntu.

My coworkers had similar complaints, and those are exactly the OSes they use. It's just hard to get them to confirm it.

As for ansible.raw_arguments = ['--timeout=300'], it succeeded 352 times. UPD Something might be off here. I need to recheck this.

The way it looks, I'll soon be an expert at testing vagrant :)

Just to have it stored somewhere, this command has proved to be easier to use: for (( i = 1; 1; i++ )); do vagrant destroy -f && VAGRANT_LOG=debug vagrant up || break; echo $i > /dev/tty; done &> log

@x-yuri
Author

x-yuri commented Dec 6, 2014

The thing is, this issue is hard to reproduce. I don't know how many times I should let it recreate the VM to be sure that the issue is gone. Supposedly the computer must be under heavy load for it to manifest itself. However, so far I haven't managed to come up with a reliable way to make it happen, let alone a fast one.

And I'm not so sure now that the issue can't happen with the auto-generated inventory.

So, do you have any suggestions for how to force it to happen? One of the options would be to do it from inside the VM. Also, could you point me to where in the code vagrant waits for the network to be operational?

@gildegoma
Collaborator

The thing is this issue is hard to reproduce.

@x-yuri You did more than pretty well, and I thank you again so much for your efforts. I don't think we need any more tests on your side ❤️ (and sorry I asked for too many things; it is a bit hard to be complete in asynchronous communication).

And I'm not so sure now, that the issue can't happen with auto-generated inventory.

  • the auto-generated inventory is 100% safe because it uses the "official" ssh_info returned by the VM provider. The same SSH connectivity is used by all vagrant core components (e.g. to configure your private network) during the VM bootstrap phase.
  • when using an extra (private or public) network address in a custom Ansible inventory, it is possible to encounter some random timeout errors (the subject of this problem report). To avoid these errors, it is recommended to increase the SSH connection timeout of the ansible-playbook command with ansible.raw_arguments = ['--timeout=30'] (the default value is 10 seconds and should be adjusted empirically to your environment's performance).

could you point me to where in the code vagrant waits for the network to be operational

Here are some parts that may interest you:


I'm closing this issue with the following possible enhancement ideas to be evaluated in the future:

  • Add a new ansible.timeout option to the Ansible provisioner, with a higher default value (e.g. 30 seconds)
  • Extend the provider / communicator sanity checks, to optionally check the availability of additional networks

@mitchellh what do you think? Are you open to accepting pull requests implementing the above ideas?

@gildegoma gildegoma added bug and removed needs info labels Dec 7, 2014
gildegoma added a commit that referenced this issue Jan 12, 2015
This change helps to avoid troubles like reported in #5018 and #4860.

Note that for sake of configuration simplicity, no new `ansible.timeout`
option has been added. The users who want to set a different value can
rely on `ansible.raw_arguments`.
gildegoma added a commit that referenced this issue Jan 12, 2015
@ghost ghost locked and limited conversation to collaborators Apr 9, 2020