Fix wait() logic wrt timeout #67

Merged
merged 4 commits into juju:lp-1430488 on Apr 28, 2015

Conversation

Contributor

AdamIsrael commented Apr 9, 2015

This PR addresses lp:1430488.

The waiter.wait call blocks for the full duration of the timeout, and helpers.timeout creates a signal-based timeout that's triggered when the timeout expires. This led to juju_agent() being run against a unit that was still installing, which would throw an exception.

I made waiter.wait block for a smaller, arbitrary interval (15 seconds), check the unit status, and repeat until it's ready or the timeout has expired.
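For illustration only, here is a minimal sketch of that polling approach (wait_until_ready and the check_ready callable are placeholders, not actual amulet code; waiter.wait and waiter.status are assumed to behave as they do in amulet/sentry.py):

import time

def wait_until_ready(waiter, juju_env, check_ready, timeout=300, interval=15):
    # Block in short slices and re-check status each time, instead of
    # sleeping for the whole timeout in a single call.
    deadline = time.time() + timeout
    while time.time() < deadline:
        waiter.wait(timeout=interval)
        if check_ready(waiter.status(juju_env)):
            return True
    raise RuntimeError('units not ready within %s seconds' % timeout)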

amulet/sentry.py
ready = False
try:
+ now = time.time()
with helpers.timeout(timeout):
# Make sure we're in a 'started' state across the board
@mbruzek

mbruzek Apr 10, 2015

Contributor

The previous code does not actually check for "started" here. I would recommend removing this comment or changing it to reflect what the code is actually waiting for in status.

Contributor

mbruzek commented Apr 10, 2015

@AdamIsrael I am not a reviewer for the amulet code base, just curious really. I looked this over and it appears the side effect is that the code will check status more frequently, but it does seem to be better waiting logic.

Contributor

AdamIsrael commented Apr 16, 2015

Thanks for the feedback, Matt. I'm no longer convinced that this new logic fixes the underlying issue. I'll update the pull request after I've done further testing.

Fix wait behavior for new units
The previous version of wait_for_status would only check to see if a
public-address was available for a unit, but some providers (amazon)
return the public-address while the unit is still allocating, which
could cause wait_for_status to return before the unit was started.
Contributor

AdamIsrael commented Apr 17, 2015

Hey @mbruzek, thanks for the feedback! I've dug into the issue deeper and updated this pull request.

There were two issues:

wait_for_status only checked to see if a unit had a public-address before considering it to be ready, but at least one provider (amazon) returns the public-address while the unit is still being allocated. This led to a condition where calling add_unit() would return before the unit was available, and subsequent calls against the unit could fail.
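Put differently, the old readiness check amounted to something like this (a simplified sketch, not the actual diff):

def unit_is_ready_old(unit):
    # Old behaviour (simplified): a unit counted as ready as soon as it
    # reported a public-address. On some providers (e.g. amazon) the
    # address appears while the machine is still allocating, so this
    # can return True too early.
    return bool(unit.get('public-address'))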

@@ -235,7 +240,7 @@ def wait_for_status(self, juju_env, services, timeout=300):
ready = True
status = waiter.status(juju_env)
for service in services:
- if not 'units' in status['services'][service]:
+ if 'units' not in status['services'][service]:
@chuckbutler

chuckbutler Apr 20, 2015

Contributor

Yoda conditionals :D

amulet/sentry.py
- while not ready:
+ now = time.time()
+ while not ready and now + timeout < time.time():
+ waiter.wait(timeout=15)
@chuckbutler

chuckbutler Apr 20, 2015

Contributor

This is arbitrary, right? Or is there science behind 15 seconds?

@AdamIsrael

AdamIsrael Apr 20, 2015

Contributor

Totally arbitrary. I thought about calculating a value based on the timeout, but 15 seemed sane enough.
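One hypothetical way to derive the poll interval from the timeout instead of hard-coding 15 seconds (not part of this PR, just a sketch):

def poll_interval(timeout, slices=20, floor=1, ceiling=15):
    # e.g. timeout=300 -> 15, timeout=60 -> 3, timeout=10 -> 1 (seconds)
    return max(floor, min(ceiling, timeout // slices))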

amulet/sentry.py
- break
- else:
- ready = True
+ if unit['agent-state'] == 'started':
@chuckbutler

chuckbutler Apr 20, 2015

Contributor

👍 on finding the status as started

amulet/sentry.py
+ if status is None:
+ ready = False
+ break
+ if 'hook' in status and status['hook']:
@chuckbutler

chuckbutler Apr 20, 2015

Contributor

This is wrt the new extended status support that's coming, right? +1 again on this. Very nice.

@AdamIsrael

AdamIsrael Apr 20, 2015

Contributor

If you mean the 'hook' in status check, that was existing code. I'm not exactly sure what the use case for it is, so I left it as is.

Contributor

chuckbutler commented Apr 20, 2015

I have a few comments for clarity, but overall this LGTM.

amulet/sentry.py
- waiter.wait(timeout=timeout)
- while not ready:
+ now = time.time()
+ while not ready and now + timeout < time.time():
@tvansteenburgh

tvansteenburgh Apr 20, 2015

Member

I'm not sure you want the extra time checking here. Clients are expecting this method (wait()) to raise if units aren't ready within the timeout. I think just while not ready is better here.
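A rough sketch of that suggestion, assuming helpers is amulet's helpers module with the signal-based timeout context manager used in the diff above (check_ready is a placeholder):

from amulet import helpers

def wait(waiter, juju_env, check_ready, timeout=300):
    # Keep the loop condition simple and let the signal-based
    # helpers.timeout raise if the units never become ready.
    with helpers.timeout(timeout):
        ready = False
        while not ready:
            waiter.wait(timeout=15)
            ready = check_ready(waiter.status(juju_env))
    return ready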

AdamIsrael added some commits Apr 23, 2015

Switch test to precise, so it will actually run.
rsyslog-forwarder isn’t available in trusty, so this test was always
failing.
Contributor

AdamIsrael commented Apr 23, 2015

While reviewing @tvansteenburgh's comments, I discovered that I wasn't solving the underlying problem.

UnitSentry._run relied on the default juju run timeout of 5 minutes, and this method blocks if other hook contexts are executing. In the case of the ubuntu-repository-cache charm, that hook context was doing a full rsync of the Ubuntu archive, which takes longer than 5 minutes to run.

Talisman.wait_for_status now waits until all units are in a 'started' state and have a public-address assigned, so the Talisman.wait method will now honor the timeout parameter when waiting for the juju_agent method to be run against the unit.
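The combined readiness condition described here looks roughly like this (a sketch, not the exact code in amulet/sentry.py):

def unit_is_ready(unit):
    # A unit is ready only when its agent reports 'started' AND a
    # public-address has been assigned; either one on its own can be
    # true while the unit is still coming up.
    return (unit.get('agent-state') == 'started'
            and bool(unit.get('public-address')))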

PR #68 adds a test for calling add_unit against an already deployed service.

AdamIsrael added the bug label Apr 23, 2015

marcoceppi added a commit that referenced this pull request Apr 28, 2015

marcoceppi merged commit 5c7927d into juju:lp-1430488 on Apr 28, 2015

1 check failed

continuous-integration/travis-ci/pr: The Travis CI build failed