tests/installed: bump reboot timeout to 180s #1545

Closed. Wants to merge 1 commit.

Conversation

@jlebon (Member) commented Apr 18, 2018:

We've been seeing a lot of CI test failures due to ansible timing out
waiting for the host to come back up after a reboot.

@cgwalters (Member) commented:

I think the problem is more that the reboot playbook is still racy. It's really easy to reproduce just by lowering the timeouts locally.

I am struggling to figure out how to wrestle Ansible into doing this in a race-free way. Our existing playbooks mostly get by with having long delays, which is exactly the thing we were trying to avoid by doing VM-in-container.

The core problem with the existing code is recovering from ssh-level failures if we happen to get back into the system before the reboot actually happens. What I have so far is:

# Stop sshd (thus preventing new connections) and kill our current user's
# connection so that we can't race to get back in to the system while it's
# shutting down
- name: restart hosts
  when: (not skip_shutdown is defined) or (not skip_shutdown)
  shell: |
    systemctl stop sshd
    # Run a reboot outside of our current session to ensure that terminating
    # our session will not kill the reboot command.
    systemd-run systemctl reboot
    # Now kill our current ssh connection; run via systemd-run just for good measure
    systemd-run loginctl terminate-user $(id -u)
  async: 1
  poll: 0
  ignore_errors: true

That's followed by the wait_for, which seems pretty robust, but the problem I'm hitting next is that with ssh connection sharing in use, the shared channel gets terminated, which Ansible doesn't understand, so it errors out fatally. (The wait_for itself is fine since it makes a new TCP connection.)
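For context, the "connection sharing" here is OpenSSH ControlMaster multiplexing. A hypothetical sketch of the kind of setting involved (illustrative values only, not necessarily what this repo configures; with multiplexing on, every task reuses one ssh channel, and that shared channel is what dies when the host goes down):

# Hypothetical group_vars/inventory snippet; real setups often put the
# equivalent ssh_args in ansible.cfg instead.
ansible_ssh_common_args: >-
  -o ControlMaster=auto
  -o ControlPersist=60s
  -o ControlPath=/tmp/ansible-ssh-%h-%p-%r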

Basically it feels like we're really fighting the system; we'd need to move to a model where we generate an inventory and then break out of a playbook each time we do a reboot and resume as a distinct playbook or something.

@cgwalters (Member) commented:

Ansible does have a meta: reset_connection but...it fails if the connection was actually already torn down. Sigh.
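For reference, a minimal sketch of where that reset would sit, reusing the vars from the existing playbook (the failure mode described above is exactly that the reset errors out when ssh has already dropped the channel):

- name: restart hosts
  shell: systemd-run systemctl reboot
  async: 1
  poll: 0
  ignore_errors: true

# Ask Ansible to tear down its cached connection (and ControlMaster socket)
# before the next remote task; this is the step that blows up if the
# connection is already gone.
- meta: reset_connection

- name: wait for sshd to come back
  wait_for:
    host: "{{ real_ansible_host }}"
    port: "{{ real_ansible_port | default('22') }}"
    search_regex: OpenSSH
    delay: 5
    timeout: "{{ timeout }}"
  delegate_to: localhost
  become: false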

@jlebon (Member, Author) commented Apr 19, 2018:

Hmm, that's tricky. What if instead of wait_for, we shell out to a connection: local script that just keeps retrying until the boot ID is different? I can take a crack at that if you'd like.
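A rough sketch of what that could look like (purely hypothetical: it reuses real_ansible_host, real_ansible_port, and orig_bootid from the existing playbook, and it hard-codes the ssh options and remote user, which is part of the objection raised in the next comment):

- name: poll for a new boot id from the control host
  connection: local
  become: false
  shell: |
    new_id=$(ssh -o BatchMode=yes -o StrictHostKeyChecking=no \
                 -p {{ real_ansible_port | default('22') }} \
                 root@{{ real_ansible_host }} \
                 cat /proc/sys/kernel/random/boot_id 2>/dev/null || true)
    # Succeed only once the host is reachable *and* the boot id has changed.
    test -n "${new_id}" && test "${new_id}" != "{{ orig_bootid.stdout }}"
  register: bootid_check
  until: bootid_check.rc == 0
  retries: 60
  delay: 3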

@cgwalters (Member) commented:

we shell out to a connection: local script that just keeps retrying until the boot ID is different?

I think the problem with that is twofold:

  • Requires replicating the ansible ssh configuration
  • Doesn't solve the ControlMaster issue

That said, I've been banging on this most of the morning and am happy to context switch away 😄

@cgwalters (Member) commented:

Here's what I've been playing with:

diff --git a/tests/installed/tasks/reboot.yml b/tests/installed/tasks/reboot.yml
index fd077102..d149697e 100644
--- a/tests/installed/tasks/reboot.yml
+++ b/tests/installed/tasks/reboot.yml
@@ -33,34 +33,40 @@
   command: cat /proc/sys/kernel/random/boot_id
   register: orig_bootid
 
+# Stop sshd (thus preventing new connections) and kill our current user's
+# connection so that we can't race to get back in to the system while it's
+# shutting down
 - name: restart hosts
   when: (not skip_shutdown is defined) or (not skip_shutdown)
-  shell: sleep 3 && shutdown -r now
-  async: 1
-  poll: 0
+  shell: |
+    sleep 3
+    systemctl stop sshd
+    systemd-run systemctl reboot
   ignore_errors: true
 
-# NB: The following tasks use local actions, so we need to explicitly ensure
-# that they don't use sudo, which may require a password, and is not necessary
-# anyway.
+- meta: reset_connection
 
-- name: wait for hosts to come back up
-  local_action:
-    wait_for host={{ real_ansible_host }}
-    port={{ real_ansible_port | default('22') }}
-    state=started
-    delay=30
-    timeout={{ timeout }}
-    search_regex="OpenSSH"
-  become: false
+# NB: The wait_for is executed locally and doesn't require privs, so avoid sudo
+- debug:
+    msg: "Waiting for reboot: {{ ansible_date_time.iso8601 }}"
+- wait_for_connection:
+    delay: 3
+    timeout: 120
+    search_regex: "OpenSSH"
+- debug:
+    msg: "SSH port is up {{ ansible_date_time.iso8601 }}"
 
-# I'm not sure the retries are even necessary, but I'm keeping them in
-- name: Wait until bootid changes
+- meta: reset_connection
+
+- name: Assert that the bootid changed
   command: cat /proc/sys/kernel/random/boot_id
   register: new_bootid
   until: new_bootid.stdout != orig_bootid.stdout
-  retries: 6
-  delay: 10
+  retries: 60
+  delay: 1
+- assert:
+    that:
+      - new_bootid.stdout != orig_bootid.stdout
 
 # provide an empty iterator when a list is not provided
 # http://docs.ansible.com/ansible/playbooks_conditionals.html#loops-and-conditionals
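One caveat worth noting with the middle hunk (an aside, not something raised in the thread): in upstream Ansible, search_regex is a wait_for option rather than a wait_for_connection one, so a variant that sticks to documented parameters would look more like:

- name: wait for the host to come back
  wait_for_connection:
    delay: 3
    timeout: 120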

@cgwalters (Member) commented:

I failed to give up on this and ended up with: #1548

@jlebon (Member, Author) commented Apr 19, 2018:

Nice!

I took a few minutes to take an inventory (pun intended) of the recent flakes we've seen. Of the 9 examined, only 2 were due to rebooting, while 3 were due to itest-pull.sh taking too long (some were close to finishing their final fsck). Seems like we still want something like the second commit here, right? Will follow up on the other 4 flakes I saw.

@cgwalters (Member) commented:

Seems like we still want something like the second commit here, right?

Ah sure, actually might as well bump it even higher I guess.

It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.
@cgwalters (Member) commented:

The description said 500, but the content only did 300. Since we really want both, I slurped this into #1548.

@cgwalters closed this Apr 19, 2018
@jlebon deleted the pr/bump-timeout branch June 14, 2018