Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests/installed: Make reboot task less racy #1548

Closed
wants to merge 7 commits into from

Conversation

cgwalters
Copy link
Member

This took a whole lot of experimentation. I hit upon the idea
of doing a systemctl stop sshd to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling ControlMaster; see
for example: ansible/ansible#17935

@jlebon
Copy link
Member

jlebon commented Apr 19, 2018

@rh-atomic-bot r+ 4cd789f

@rh-atomic-bot
Copy link

⌛ Testing commit 4cd789f with merge 760d285...

rh-atomic-bot pushed a commit that referenced this pull request Apr 19, 2018
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935

Closes: #1548
Approved by: jlebon
@rh-atomic-bot
Copy link

💔 Test failed - status-atomicjenkins

cgwalters and others added 2 commits April 19, 2018 17:24
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935
It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.
@jlebon
Copy link
Member

jlebon commented Apr 20, 2018

Ahh OK, that's an instance of #1549.
@rh-atomic-bot r+ bafa31c

@rh-atomic-bot
Copy link

🙀 bafa31c is not a valid commit SHA. Please try again with 9414bea.

@cgwalters
Copy link
Member Author

@rh-atomic-bot r+ 9414bea

@rh-atomic-bot
Copy link

⌛ Testing commit 9414bea with merge 0013a6b...

rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.

Closes: #1548
Approved by: cgwalters
@jlebon
Copy link
Member

jlebon commented Apr 20, 2018

Whoops! I have this filter that marks all @rh-atomic-bot notifications as read unless it's "Test timed out" or "Test failed". I tweaked it to also leave the "try again" emails unread.

@rh-atomic-bot
Copy link

💔 Test failed - status-atomicjenkins

@jlebon
Copy link
Member

jlebon commented Apr 20, 2018

@rh-atomic-bot retry

@rh-atomic-bot
Copy link

⌛ Testing commit 9414bea with merge eca179b...

rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.

Closes: #1548
Approved by: cgwalters
@rh-atomic-bot
Copy link

💔 Test failed - status-atomicjenkins

@cgwalters
Copy link
Member Author

@rh-atomic-bot retry

@rh-atomic-bot
Copy link

⌛ Testing commit 9414bea with merge bb90ca8...

rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.

Closes: #1548
Approved by: cgwalters
@rh-atomic-bot
Copy link

💔 Test failed - status-atomicjenkins

@jlebon
Copy link
Member

jlebon commented Apr 20, 2018

@rh-atomic-bot retry

@rh-atomic-bot
Copy link

⌛ Testing commit 9414bea with merge f46ead0...

rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935

Closes: #1548
Approved by: cgwalters
@rh-atomic-bot
Copy link

💔 Test failed - status-atomicjenkins

@cgwalters
Copy link
Member Author

cgwalters commented Apr 20, 2018 via email

@rh-atomic-bot
Copy link

⌛ Testing commit 1463077 with merge 81705ce...

rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
Let's only print if the commit isn't already partial; this
addresses a spam of "marking commit partial" from fsck.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
No need to poll every second, there's going to be some latency
here and we want to avoid the overhead of polling.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 20, 2018
Hopefully we'll fix this soon.

Closes: #1548
Approved by: cgwalters
@jlebon
Copy link
Member

jlebon commented Apr 20, 2018

@rh-atomic-bot, stop being such a jerk!

@rh-atomic-bot
Copy link

💔 Test failed - status-atomicjenkins

@cgwalters
Copy link
Member Author

OK, I thought I had it, it seemed fairly robust locally but apparently no dice. I briefly looked at doing a custom Ansible module, but (again just from initial investigation) it seems we'd really need the engine to understand this; we basically want an atomic combination of- shell: reboot + - wait_for_connection.

@cgwalters
Copy link
Member Author

@rh-atomic-bot r+ 951052c

@rh-atomic-bot
Copy link

⌛ Testing commit 951052c with merge d168de5...

rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
This took a whole lot of experimentation.  I hit upon the idea
of doing a `systemctl stop sshd` to avoid the situation where we
might ssh back into the system while it's in the process of shutting
down.

Ultimately the other fix is disabling `ControlMaster`; see
for example: ansible/ansible#17935

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
Let's only print if the commit isn't already partial; this
addresses a spam of "marking commit partial" from fsck.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
No need to poll every second, there's going to be some latency
here and we want to avoid the overhead of polling.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
Hopefully we'll fix this soon.

Closes: #1548
Approved by: cgwalters
@cgwalters
Copy link
Member Author

Going back to longer timeouts seems to be OK in some manual testing in the CI env. We'll just live with races temporarily...I'll investigate other approaches.

@rh-atomic-bot
Copy link

💔 Test failed - status-atomicjenkins

@jlebon
Copy link
Member

jlebon commented Apr 23, 2018

Well, the PR test definitely passed, so 🍾 🎊.
Let's give it another try, it should ⚡.
@rh-atomic-bot retry

@rh-atomic-bot
Copy link

⚡ Test exempted: pull fully rebased and already tested.

rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
It seems like 240 retries is just not long enough for all the
non-destructive tests running in parallel to finish. Let's crank that up
to 500 retries.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
Let's only print if the commit isn't already partial; this
addresses a spam of "marking commit partial" from fsck.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
No need to poll every second, there's going to be some latency
here and we want to avoid the overhead of polling.

Closes: #1548
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Apr 23, 2018
Hopefully we'll fix this soon.

Closes: #1548
Approved by: cgwalters
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants