worker/lease fix the handling of multiple concurrent errors #9730

Merged on Feb 11, 2019

howbazaar

This fixes the failures we have seen in OpenStack deploys with Solutions QA.

The uniter worker would get stuck setting up the initial remote state watcher. This watcher would try to claim the leadership, and the ClaimLeadership facade call would never return.

On the apiserver, the worker/lease was trying to restart but was stuck waiting on the in-flight wait group. If two concurrent claims both hit errors, the first error would be processed, causing the worker/lease loop to exit. Exiting the loop ran a deferred wait on the wait group, which could not return until the second in-flight claim finished. However, the second claim was blocked trying to send its error down the errors channel, and nothing was selecting on that channel any longer.

The fix is to not use an errors channel, but instead just kill the catacomb.

Bug reference

https://bugs.launchpad.net/juju/+bug/1815397

@babbageclunk babbageclunk left a comment

Great!

@babbageclunk

$$merge$$

@babbageclunk

The blocked ClaimLeadership API request might explain the API server blocking when trying to restart in https://bugs.launchpad.net/juju/+bug/1813261

I'm not sure what the underlying error is; a timeout won't kill the manager. Once this fix has landed, we may be able to see the manager crashing for some other reason.

@jujubot jujubot merged commit 336e93d into juju:2.5 Feb 11, 2019
@jameinel jameinel mentioned this pull request Feb 13, 2019
jujubot added a commit that referenced this pull request Feb 13, 2019
#9748

## Description of change

This just brings develop up-to-date with all the 2.5 patches:
- Merge pull request #9745 from jameinel/2.5-lease-invalid-retries-1815719
- Merge pull request #9740 from wallyworld/cmr-multi-offer-fix
- Merge pull request #9742 from jameinel/2.5-worker-lease-1815468
- Merge pull request #9736 from jameinel/2.5-leadership-client
- Merge pull request #9737 from jameinel/2.5-worker-lease-1815468
- Merge pull request #9722 from achilleasa/fix-1812227
- Merge pull request #9734 from babbageclunk/state-worker-dep-message
- Merge pull request #9733 from babbageclunk/raftlease-stop-global-clock
- Merge pull request #9735 from howbazaar/2.5-mongo-systemd-ulimit
- Merge pull request #9724 from babbageclunk/raftlease-upgrade-blank
- Merge pull request #9731 from howbazaar/2.5-status-close-error
- Merge pull request #9730 from howbazaar/2.5-lease-race
- Merge pull request #9727 from achilleasa/fix-1814638
- Merge pull request #9728 from wallyworld/rename-delete-storage-pool
- Merge pull request #9712 from jameinel/2.5-leases-nextTick
- Merge pull request #9709 from jameinel/2.5-update-testing-clock

## QA steps

See individual patches.

## Documentation changes

See individual patches.

## Bug reference

- https://bugs.launchpad.net/juju/+bug/1815719
- https://bugs.launchpad.net/juju/+bug/1813151
- https://bugs.launchpad.net/juju/+bug/1815179
- https://bugs.launchpad.net/juju/+bug/1815471
- https://bugs.launchpad.net/juju/+bug/1815468
- https://bugs.launchpad.net/juju/+bug/1812227
- https://bugs.launchpad.net/juju/+bug/1815405
- https://bugs.launchpad.net/juju/+bug/1813996
- https://bugs.launchpad.net/juju/+bug/1813995
- https://bugs.launchpad.net/juju/+bug/1815397
- https://bugs.launchpad.net/juju/+bug/1814638
- https://bugs.launchpad.net/juju/+bug/1814556