[WiP] Refactor agent launchers, weave to use Observable #2582
```ruby
# @return [Array<String>] 192.168.66.0/24
def grid_trusted_subnets
  @grid['trusted_subnets']
```
`@` not needed, there's an `attr_reader`.
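A minimal sketch of the suggested change (the class and data here are illustrative, not the actual Kontena code): with an `attr_reader` defined, the method can call the reader instead of touching the instance variable directly.

```ruby
class GridExample
  # attr_reader defines a #grid method that returns @grid,
  # so the @ sigil is unnecessary in the methods below.
  attr_reader :grid

  def initialize(grid)
    @grid = grid
  end

  # @return [Array<String>] e.g. ["192.168.66.0/24"]
  def grid_trusted_subnets
    grid['trusted_subnets'] # reader method instead of @grid
  end
end
```

Both spellings behave the same here; the reader is just the more idiomatic choice when one is already defined.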
…ude actor spawning costs
Implements new `Kontena::Observer` primitives/behaviors required for the #2582 refactoring:

* Change the `Kontena::Observable` to be a standalone class, using a mutex to synchronize observers and observable updates
* Change the `Kontena::Observer` to be a standalone class, using the actor mailbox to wait for observable updates
* Replace the `observer.async.update_observe` update calls with custom mailbox messages, handled by the observing task using [`Celluloid.receive`](https://github.com/celluloid/celluloid/wiki/Protocol-Interaction)
* Change the async `observe(observable) do |value| ... end` to yield from the same task
* Implement a sync `value = observe(observable)` that just returns instead of yielding; the sync API also supports a timeout
* Use the sync `@node = observe(Actor[:node_info_worker])` for the service pod and volume managers
* Use lightweight `Observable` instances for the RPC client requests

With the mailbox-based observer protocol, the `Kontena::Observer` is now also usable outside of the top-level actor classes. This means that the `Kontena::Observer.observe` method can be used in normal object instances like the `Kontena::ServicePods::Creator`. The observer uses `Celluloid.current_actor.mailbox` to receive updates, and `Celluloid.receive` to suspend the observing task.

With the mutex-based observable synchronization, the `Kontena::Observable` is now also usable outside of the top-level actor class. It is used by the `RpcClient` actor as a kind of hybrid between a Future and a Condition. This contrasts with the existing code, where the behavior of the `wait_until!` `sleep` depended on what kind of class the `WaitHelper` was included into, and where the `observer.async.update_observe` implementation only allowed the top-level actor object to observe anything.

The new observe interface requires the use of the [`Celluloid.receive`](https://github.com/celluloid/celluloid/wiki/Protocol-Interaction) API to suspend the observing task while waiting for observable updates.
This API isn't ideal, though, because apart from the special case of a synchronous observe of a single observable, the observer must be prepared to receive multiple observable update messages at any time. However, if the observing task ever suspends outside of the call to `Celluloid.receive`, then any observable messages received by the actor will just get discarded. Hence, the new `observe` implementation must make extensive use of `Celluloid.exclusive { ... }` - this allows any concurrent observable update messages to remain queued up in the actor mailbox until the task returns to the `Celluloid.receive` call.
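The mutex-based, standalone observable with a synchronous, timeout-capable observe can be sketched in plain Ruby with a `Mutex` and `ConditionVariable` (an illustrative stand-in, not the actual `Kontena::Observable` API):

```ruby
require 'timeout'

# Illustrative stand-in for a standalone, mutex-synchronized observable
# acting as a hybrid between a Future and a Condition.
class MiniObservable
  def initialize
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
    @value = nil
  end

  # Publish a new value and wake all waiting observers.
  def update(value)
    @mutex.synchronize do
      @value = value
      @cond.broadcast
    end
  end

  # Synchronous observe: return the current value, blocking until one
  # is available, or raise Timeout::Error after the optional timeout.
  def observe(timeout: nil)
    deadline = timeout && Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @mutex.synchronize do
      while @value.nil?
        remaining = deadline && deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        raise Timeout::Error, 'observe timed out' if remaining && remaining <= 0
        @cond.wait(@mutex, remaining) # nil timeout waits indefinitely
      end
      @value
    end
  end
end
```

An `update(:ready)` from one thread unblocks an `observe(timeout: 1.0)` call waiting in another; the condition variable plays the role the actor mailbox plays in the real implementation.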
Note: I don't really expect this PR to get reviewed/merged in this form. This is more about experimenting with how I think the agent could get refactored... once I think I have something working, I'll try and split out smaller PRs. That may cause some extra work though, because a lot of this work is about the inter-dependencies between actors...
Follows #2704
* Fixes #2134: remaining launchers also call `Docker::Container#start!` and raise errors
* Fixes #1639: `weaveexec ps` for expose migration no longer breaks with `WEAVE_DEBUG=1` (stdout vs stderr)
* Fixes #1397: weave launcher calls either `launch-router` or `attach-router`
* Hopefully fixes #1395 by allowing weave actors to recover from crashes
This is a major refactoring that is likely to also include some new bugs, and needs testing before release.
* Split `Kontena::NetworkAdapters::Weave` into separate `Kontena::Launchers::Weave` and `Kontena::Workers::WeaveWorker` with their own Observer/Observable states
* The `Kontena::Launchers::Weave` Observable replaces the `network_adapter:start` notification / `wait_weave_running?` sleep (Weave router is running)
* The `Kontena::NetworkAdapters::Weave` Observable replaces the `network:ready` notification / `wait_network_ready?` sleep (IPAM is available for allocations)
* Replace `wait_until!` calls with `Kontena::Observer#observe`
* `Kontena::Workers::WeaveWorker` observes `Kontena::Launchers::Etcd` instead of the `dns:add` notification... but this still fails across etcd/weave restarts ❗

TODO
Refactor the `Observable`/`Observer` pattern

Many of the actors now do `observe` => `update` => `update_observable`/`reset_observable`, which needs better support from the `Observer`/`Observable` implementation.

Fix observes to be exclusive

The `Observer` should wait for any still-running `observe` block to complete before calling it again. Meanwhile, it can also coalesce updates that come in faster than the observer is able to observe them.
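The exclusive-and-coalescing behavior could look something like this plain-Ruby sketch (not the Celluloid implementation; all names are illustrative):

```ruby
# Sketch: the observe block runs exclusively, and updates that arrive
# while it is running are coalesced down to the latest pending value.
class CoalescingObserver
  def initialize(&block)
    @block = block
    @mutex = Mutex.new
    @pending = nil
    @has_pending = false
    @running = false
  end

  # Record the latest value; start draining unless a drain is already
  # in progress (which will pick the new value up itself).
  def update(value)
    run = false
    @mutex.synchronize do
      @pending = value
      @has_pending = true
      run = !@running
      @running = true if run
    end
    drain if run
  end

  private

  # Run the observe block once per pending value until none remain.
  def drain
    loop do
      value = nil
      @mutex.synchronize do
        unless @has_pending
          @running = false
          return
        end
        value = @pending
        @has_pending = false
      end
      @block.call(value)
    end
  end
end
```

Because only the latest pending value is kept, a burst of updates during a slow observe block collapses to a single extra invocation instead of queueing one call per update.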
Retry failed observes

Some part of the `Observer` update should rescue retryable errors and provide some sane retry policy. The WIP `update` pattern rescues errors by logging them and setting `reset_observable`, but this only works for errors caused by out-of-date observables, relying on observable updates to retry... not a valid assumption. Letting the actor crash and having the Celluloid supervisor restart it does not provide a sane retry policy: there is no backoff, and the actor crashes seem to leak memory: Agent: Actor crash-and-restart loop will eat lot of resources #2231
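A retry policy with backoff could be sketched as follows (illustrative only; `RetryableError`, the attempt limit, and the delays are assumptions, not Kontena code):

```ruby
# Hypothetical marker class for errors worth retrying
# (e.g. transient Docker API failures).
class RetryableError < StandardError; end

# Retry the yielded update with exponential backoff instead of
# letting the actor crash and be restarted without any backoff.
def update_with_retry(max_attempts: 5, base_delay: 0.1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue RetryableError
    raise if attempts >= max_attempts
    delay = base_delay * (2**(attempts - 1)) # 0.1s, 0.2s, 0.4s, ...
    sleep delay
    retry
  end
end
```

Usage would look like `update_with_retry { inspect_container }`: non-retryable errors still propagate immediately, while retryable ones are re-attempted with growing delays until the attempt limit is hit.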
Optimize deep observer chains with multiple `node_info_worker` observables to avoid unnecessary updates

If the `weave_launcher` observes the `node_info_worker`, and the `etcd_launcher` observes both the `node_info_worker` and the `weave_launcher`, the `etcd_launcher` observe block should only run once both the `node_info_worker` and the `weave_launcher` have updated. This behaves correctly during the initial startup, but on subsequent `node_info_worker` updates, the `etcd_launcher` observe will also run a second time after the `weave_launcher` also updates.

Optimize launchers that now update on each node info update, not just once at startup

The various launchers subscribed to `network_adapter:start`, which the `Kontena::NetworkAdapters::Weave` only published once (unless it crashed). Now the same launcher `observe` blocks run on every `node_info` update. This means extra Docker container inspect calls.

Smarter error handling for reporting agent health?

Future PR...
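One way to avoid the extra per-update work would be for each launcher to cache the previously observed value and skip no-op updates (a plain-Ruby sketch; class and method names are illustrative, not Kontena code):

```ruby
# Sketch: skip redundant work when the observed node info is unchanged,
# avoiding e.g. an extra Docker container inspect on every update.
class StubLauncher
  attr_reader :inspects

  def initialize
    @last_observed = nil
    @inspects = 0
  end

  # Observe callback: only do the expensive work on real changes.
  def on_update(node_info)
    return if node_info == @last_observed # coalesce no-op updates
    @last_observed = node_info
    inspect_container(node_info)
  end

  private

  def inspect_container(_node_info)
    @inspects += 1 # stand-in for a Docker inspect call
  end
end
```

This keeps the launcher's observe block cheap to run on every `node_info` update, at the cost of having to define which fields of the observed value actually count as a change.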