[WiP] Refactor agent launchers, weave to use Observable #2582

Open · wants to merge 190 commits into master

Conversation

@SpComb (Contributor) commented Jul 14, 2017

Follows #2704

Fixes #2134: remaining launchers also call Docker::Container#start! and raise errors
Fixes #1639: weaveexec ps for the expose migration no longer breaks with WEAVE_DEBUG=1 (stdout vs stderr)
Fixes #1397: the weave launcher calls either launch-router or attach-router
Hopefully fixes #1395 by allowing weave actors to recover from crashes

This is a major refactoring that is likely to introduce some new bugs, and needs testing before release.

  • Simplify error handling by allowing actors to recover from crashes
  • Untangle the Kontena::NetworkAdapters::Weave into separate Kontena::Launchers::Weave and Kontena::Workers::WeaveWorker with their own Observer/Observable states
    • The Kontena::Launchers::Weave Observable replaces the network_adapter:start notification / wait_weave_running? sleep (Weave router is running)
    • The Kontena::NetworkAdapters::Weave Observable replaces the network:ready notification / wait_network_ready? sleep (IPAM is available for allocations)
  • Replace all wait_until! calls with Kontena::Observer#observe (see the sketch after this list)
  • Have Kontena::Workers::WeaveWorker observe Kontena::Launchers::Etcd instead of the dns:add notification... but this still fails across etcd/weave restarts ❗
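For illustration, a minimal before/after sketch of that wait_until! replacement, in the spirit of the sync observe API described in the commit message further down; the weave_launcher actor name and all signatures here are assumptions, not code from this PR:

```ruby
# Before (hypothetical wait_until! usage): poll-and-sleep until the weave
# router is running, timing out eventually.
wait_until!("weave running", timeout: 300) { weave_running? }

# After (sketch): suspend this task until the Kontena::Launchers::Weave
# observable publishes a value, with no polling loop.
weave_state = observe(Actor[:weave_launcher])
```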

TODO

  • Testing
  • Figure out whether the weave/IPAM error handling is robust enough without the sleeps for API readiness

Refactor the Observer/Observable pattern

Many of the actors now follow an observe => update => update_observable/reset_observable cycle, which needs better support from the Observer/Observable implementation.
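As a rough sketch of that cycle (the observe, update_observable, and reset_observable names come from this PR; the class and helper methods are illustrative assumptions):

```ruby
class ExampleLauncher
  def start
    # Re-run the update whenever the observed actor publishes a new value.
    observe(Actor[:node_info_worker]) do |node|
      update(node)
    end
  end

  def update(node)
    state = ensure_running(node) # e.g. inspect/start a Docker container
    update_observable(state)     # publish the new state to downstream observers
  rescue => exc
    log_error(exc)
    reset_observable             # drop the stale state until the next update
  end
end
```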

  • Fix observes to be exclusive

    The Observer should wait for any still-running observe block to complete before calling it again.

    Meanwhile, it can also coalesce updates that arrive faster than the observer is able to process them.

  • Retry failed observes

    Some part of the Observer update should rescue retryable errors and provide a sane retry policy (one possible shape is sketched after this list).

    The WIP update pattern rescues errors by logging them and calling reset_observable, but this only works for errors caused by out-of-date observables, since it relies on further observable updates to trigger a retry... which is not a valid assumption.

    Letting the actor crash and having the Celluloid supervisor restart it does not provide a sane retry policy: there is no backoff, and the actor crashes seem to leak memory (#2231: Agent: Actor crash-and-restart loop will eat lot of resources).

  • Optimize deep observer chains where the same node_info_worker observable arrives via multiple paths, to avoid unnecessary updates

    If the weave_launcher observes the node_info_worker, and the etcd_launcher observes both node_info_worker and weave_launcher, the etcd_launcher observe block should only run once both the node_info_worker and the weave_launcher have updated.

    This behaves correctly during the initial startup, but on subsequent node_info_worker updates the etcd_launcher observe block runs twice: once when the node_info_worker updates, and again once the weave_launcher updates in turn.

  • Optimize launchers that now update on each node info update, not just once at startup

    The various launchers previously subscribed to network_adapter:start, which the Kontena::NetworkAdapters::Weave only published once (unless it crashed). Now the same launcher observe blocks run on every node_info update, which means extra Docker container inspect calls.

  • Smarter error handling for reporting agent health?

    Future PR...
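As referenced in the retry item above, a hedged sketch of what a sane retry policy could look like, using an exponential backoff inside the update instead of crashing the actor; the helper names and the choice of retryable error class are assumptions, not code from this PR:

```ruby
# Hypothetical retry wrapper for an observe update; not part of this PR.
def update_with_retry(node)
  backoff = 1.0
  begin
    update_observable(ensure_running(node))
  rescue Docker::Error::DockerError => exc # assumed retryable error class
    log_error(exc)
    reset_observable                  # clear the stale state while retrying
    sleep backoff
    backoff = [backoff * 2, 60.0].min # exponential backoff, capped at 60s
    retry
  end
end
```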

@SpComb added the agent label Jul 14, 2017

A review comment on the following snippet:

```ruby
# @return [Array<String>] 192.168.66.0/24
def grid_trusted_subnets
  @grid['trusted_subnets']
end
```

Contributor: @ not needed, there's an attr_reader.

SpComb added a commit that referenced this pull request Sep 8, 2017
Implements new `Kontena::Observer` primitives/behaviors required for the #2582 refactoring:

* Change the `Kontena::Observable` to be a standalone class, using a mutex to synchronize observers and observable updates 
* Change the `Kontena::Observer` to be a standalone class, using the actor mailbox to wait for observable updates
* Replace the `observer.async.update_observe` update calls with custom mailbox messages, handled by the observing task using [`Celluloid.receive`](https://github.com/celluloid/celluloid/wiki/Protocol-Interaction)
* Change the async `observe(observable) do |value| ... end` to yield from the same task
* Implement sync `value = observe(observable)` that just returns instead of yielding
* The sync API also supports a timeout
* Use sync `@node = observe(Actor[:node_info_worker])` for service pod and volume managers
* Use lightweight `Observable` instances for the RPC client requests

With the mailbox-based observer protocol, the `Kontena::Observer` is now also usable outside of the top-level actor classes. This means that the `Kontena::Observer.observe` method can be used in normal object instances like the `Kontena::ServicePods::Creator`. The observer uses `Celluloid.current_actor.mailbox` to receive updates, and `Celluloid.receive` to suspend the observing task.

With the mutex-based observable synchronization, the `Kontena::Observable` is now also usable outside of the top-level actor class. It is used by the `RpcClient` actor as a kind of hybrid between a Future and a Condition.

This is an improvement over the existing code, where the behavior of the `wait_until!` `sleep` depended on what kind of class the `WaitHelper` was included into, and where the `observer.async.update_observe` implementation only allowed the top-level actor object to observe anything.

The new observe interface requires the use of [`Celluloid.receive`](https://github.com/celluloid/celluloid/wiki/Protocol-Interaction) API to suspend the observing task while waiting for observable updates. This API isn't ideal, though, because apart from the special case of a synchronous observe of a single observable, the observer must be prepared to receive multiple observable update messages at any time. However, if the observing task ever suspends outside of the call to `Celluloid.receive`, then any observable messages received by the actor will just get discarded. Hence, the new `observe` implementation must make extensive use of `Celluloid.exclusive { ... }` - this allows any concurrent observable update messages to remain queued up in the actor mailbox until the task returns to the `Celluloid.receive` call.
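To make the two observe forms concrete, here is a usage sketch based on the calls quoted above; `Actor[:node_info_worker]` appears in the commit message, while the timeout keyword shape and the block body are assumptions:

```ruby
# Sync observe: suspends the current task (via Celluloid.receive under the
# hood) until the observable has a value, then returns it.
@node = observe(Actor[:node_info_worker], timeout: 30.0)

# Async observe: the block is yielded from the same task, once per
# observable update.
observe(Actor[:node_info_worker]) do |node|
  configure(node) # illustrative
end
```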
@SpComb (Contributor, Author) commented Dec 11, 2017

Note: I don't really expect this PR to get reviewed/merged in this form. This is more about experimenting with how I think the agent could be refactored... once I have something working, I'll try to split out smaller PRs.

That may cause some extra work though, because a lot of this work is about the inter-dependencies between actors...

@jakolehm removed this from the 1.5.0 milestone Jan 8, 2018