tests/installed: bump reboot timeout to 180s #1545

Closed. Wants to merge 1 commit.

Conversation

@jlebon (Member) commented Apr 18, 2018:

We've been seeing a lot of CI test failures due to ansible timing out
waiting for the host to come back up after a reboot.

@cgwalters (Member) commented:

I think the problem is more that the reboot playbook is still racy. It's really easy to reproduce just by lowering the timeouts locally.

I am struggling to figure out how to wrestle Ansible into doing this in a race-free way. Our existing playbooks mostly get by with having long delays, which is exactly the thing we were trying to avoid by doing VM-in-container.

The core problem with the existing code is recovering from ssh-level failures if we happen to get back into the system before the reboot actually happens. What I have so far is:

# Stop sshd (thus preventing new connections) and kill our current user's
# connection so that we can't race to get back in to the system while it's
# shutting down
- name: restart hosts
  when: (not skip_shutdown is defined) or (not skip_shutdown)
  shell: |
    systemctl stop sshd
    # Run a reboot outside of our current session to ensure that terminating
    # our session will not kill the reboot command.
    systemd-run systemctl reboot
    # Now kill our current ssh connection; run via systemd-run just for good measure
    systemd-run loginctl terminate-user $(id -u)
  async: 1
  poll: 0
  ignore_errors: true

That's followed by the wait_for, which seems pretty robust, but the problem I'm hitting next is that with ssh connection sharing in use, the shared channel gets terminated, which Ansible doesn't understand, so it errors out fatally. (The wait_for itself is fine since it makes a new TCP connection.)
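For context, the "connection sharing" here is OpenSSH ControlMaster multiplexing. A hypothetical sketch of the kind of setting involved (illustrative values only, not necessarily what this repo configures; with multiplexing on, every task reuses one ssh channel, and that shared channel is what dies when the host goes down):

# Hypothetical group_vars/inventory snippet; real setups often put the
# equivalent ssh_args in ansible.cfg instead.
ansible_ssh_common_args: >-
  -o ControlMaster=auto
  -o ControlPersist=60s
  -o ControlPath=/tmp/ansible-ssh-%h-%p-%r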

Basically it feels like we're really fighting the system; we'd need to move to a model where we generate an inventory and then break out of a playbook each time we do a reboot and resume as a distinct playbook or something.

@cgwalters (Member) commented:

Ansible does have a meta: reset_connection but...it fails if the connection was actually already torn down. Sigh.
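For reference, a minimal sketch of where that reset would sit, reusing the vars from the existing playbook (the failure mode described above is exactly that the reset errors out when ssh has already dropped the channel):

- name: restart hosts
  shell: systemd-run systemctl reboot
  async: 1
  poll: 0
  ignore_errors: true

# Ask Ansible to tear down its cached connection (and ControlMaster socket)
# before the next remote task; this is the step that blows up if the
# connection is already gone.
- meta: reset_connection

- name: wait for sshd to come back
  wait_for:
    host: "{{ real_ansible_host }}"
    port: "{{ real_ansible_port | default('22') }}"
    search_regex: OpenSSH
    delay: 5
    timeout: "{{ timeout }}"
  delegate_to: localhost
  become: false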

@jlebon (Member, Author) commented Apr 19, 2018:

Hmm, that's tricky. What if instead of wait_for, we shell out to a connection: local script that just keeps retrying until the boot ID is different? I can take a crack at that if you'd like.
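A rough sketch of what that could look like (purely hypothetical: it reuses real_ansible_host, real_ansible_port, and orig_bootid from the existing playbook, and it hard-codes the ssh options and remote user, which is part of the objection raised in the next comment):

- name: poll for a new boot id from the control host
  connection: local
  become: false
  shell: |
    new_id=$(ssh -o BatchMode=yes -o StrictHostKeyChecking=no \
                 -p {{ real_ansible_port | default('22') }} \
                 root@{{ real_ansible_host }} \
                 cat /proc/sys/kernel/random/boot_id 2>/dev/null || true)
    # Succeed only once the host is reachable *and* the boot id has changed.
    test -n "${new_id}" && test "${new_id}" != "{{ orig_bootid.stdout }}"
  register: bootid_check
  until: bootid_check.rc == 0
  retries: 60
  delay: 3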

@cgwalters (Member) commented:

we shell out to a connection: local script that just keeps retrying until the boot ID is different?

I think the problem with that is twofold:

  • Requires replicating the ansible ssh configuration
  • Doesn't solve the ControlMaster issue

That said, I've been banging on this most of the morning and am happy to context switch away 😄

@cgwalters (Member) commented:

Here's what I've been playing with:

diff --git a/tests/installed/tasks/reboot.yml b/tests/installed/tasks/reboot.yml
index fd077102..d149697e 100644
--- a/tests/installed/tasks/reboot.yml
+++ b/tests/installed/tasks/reboot.yml
@@ -33,34 +33,40 @@
   command: cat /proc/sys/kernel/random/boot_id
   register: orig_bootid
 
+# Stop sshd (thus preventing new connections) and kill our current user's
+# connection so that we can't race to get back in to the system while it's
+# shutting down
 - name: restart hosts
   when: (not skip_shutdown is defined) or (not skip_shutdown)
-  shell: sleep 3 && shutdown -r now
-  async: 1
-  poll: 0
+  shell: |
+    sleep 3
+    systemctl stop sshd
+    systemd-run systemctl reboot
   ignore_errors: true
 
-# NB: The following tasks use local actions, so we need to explicitly ensure
-# that they don't use sudo, which may require a password, and is not necessary
-# anyway.
+- meta: reset_connection
 
-- name: wait for hosts to come back up
-  local_action:
-    wait_for host={{ real_ansible_host }}
-    port={{ real_ansible_port | default('22') }}
-    state=started
-    delay=30
-    timeout={{ timeout }}
-    search_regex="OpenSSH"
-  become: false
+# NB: The wait_for is executed locally and doesn't require privs, so avoid sudo
+- debug:
+    msg: "Waiting for reboot: {{ ansible_date_time.iso8601 }}"
+- wait_for_connection:
+    delay: 3
+    timeout: 120
+    search_regex: "OpenSSH"
+- debug:
+    msg: "SSH port is up {{ ansible_date_time.iso8601 }}"
 
-# I'm not sure the retries are even necessary, but I'm keeping them in
-- name: Wait until bootid changes
+- meta: reset_connection
+
+- name: Assert that the bootid changed
   command: cat /proc/sys/kernel/random/boot_id
   register: new_bootid
   until: new_bootid.stdout != orig_bootid.stdout
-  retries: 6
-  delay: 10
+  retries: 60
+  delay: 1
+- assert:
+    that:
+      - new_bootid.stdout != orig_bootid.stdout
 
 # provide an empty iterator when a list is not provided
 # http://docs.ansible.com/ansible/playbooks_conditionals.html#loops-and-conditionals
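One caveat worth noting with the middle hunk (an aside, not something raised in the thread): in upstream Ansible, search_regex is a wait_for option rather than a wait_for_connection one, so a variant that sticks to documented parameters would look more like:

- name: wait for the host to come back
  wait_for_connection:
    delay: 3
    timeout: 120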

@cgwalters (Member) commented:

I failed to give up on this and ended up with: #1548

@jlebon (Member, Author) commented Apr 19, 2018:

Nice!

I took a few minutes to take an inventory (pun intended) of the recent flakes we've seen. Of the 9 examined, only 2 were due to rebooting, while 3 were due to itest-pull.sh taking too long (some were close to finishing their final fsck). Seems like we still want something like the second commit here, right? Will follow up on the other 4 flakes I saw.

@cgwalters (Member) commented:

Seems like we still want something like the second commit here, right?

Ah sure, actually might as well bump it even higher I guess.

It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.
@cgwalters (Member) commented:

The description said 500, but the content only did 300. Since we really want both, I slurped this into #1548.

@cgwalters closed this Apr 19, 2018
@jlebon deleted the pr/bump-timeout branch June 14, 2018