[WiP] Refactor agent launchers, weave to use Observable #2582

Open · wants to merge 190 commits into master

Conversation

@SpComb (Contributor) commented Jul 14, 2017

Follows #2704

Fixes #2134: remaining launchers also call Docker::Container#start! and raise errors
Fixes #1639: weaveexec ps for the expose migration no longer breaks with WEAVE_DEBUG=1 (stdout vs stderr)
Fixes #1397: the weave launcher calls either launch-router or attach-router
Hopefully fixes #1395 by allowing weave actors to recover from crashes

This is a major refactoring that is likely to introduce some new bugs, and needs testing before release.

  • Simplify error handling by allowing actors to recover from crashes
  • Untangle the Kontena::NetworkAdapters::Weave into separate Kontena::Launchers::Weave and Kontena::Workers::WeaveWorker with their own Observer/Observable states
    • The Kontena::Launchers::Weave Observable replaces the network_adapter:start notification / wait_weave_running? sleep (Weave router is running)
    • The Kontena::NetworkAdapters::Weave Observable replaces the network:ready notification / wait_network_ready? sleep (IPAM is available for allocations)
  • Replace all wait_until! calls with Kontena::Observer#observe (see the sketch after this list)
  • Have Kontena::Workers::WeaveWorker observe Kontena::Launchers::Etcd instead of the dns:add notification... but this still fails across etcd/weave restarts ❗
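For illustration, a minimal before/after sketch of that wait_until! replacement, in the spirit of the sync observe API described in the commit message further down; the weave_launcher actor name and all signatures here are assumptions, not code from this PR:

```ruby
# Before (hypothetical wait_until! usage): poll-and-sleep until the weave
# router is running, timing out eventually.
wait_until!("weave running", timeout: 300) { weave_running? }

# After (sketch): suspend this task until the Kontena::Launchers::Weave
# observable publishes a value, with no polling loop.
weave_state = observe(Actor[:weave_launcher])
```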

TODO

  • Testing
  • Figure out whether the weave/IPAM error handling is robust enough without the sleeps for API readiness

Refactor the Observer/Observable pattern

Many of the actors now follow an observe => update => update_observable/reset_observable cycle, which needs better support from the Observer/Observable implementation.
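As a rough sketch of that cycle (the observe, update_observable, and reset_observable names come from this PR; the class and helper methods are illustrative assumptions):

```ruby
class ExampleLauncher
  def start
    # Re-run the update whenever the observed actor publishes a new value.
    observe(Actor[:node_info_worker]) do |node|
      update(node)
    end
  end

  def update(node)
    state = ensure_running(node) # e.g. inspect/start a Docker container
    update_observable(state)     # publish the new state to downstream observers
  rescue => exc
    log_error(exc)
    reset_observable             # drop the stale state until the next update
  end
end
```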

  • Fix observes to be exclusive

    The Observer should wait for any still-running observe block to complete before calling it again.

    Meanwhile, it can also coalesce updates that arrive faster than the observer is able to process them.

  • Retry failed observes

    Some part of the Observer update should rescue retryable errors and provide a sane retry policy (one possible shape is sketched after this list).

    The WIP update pattern rescues errors by logging them and calling reset_observable, but this only works for errors caused by out-of-date observables, since it relies on further observable updates to trigger a retry... which is not a valid assumption.

    Letting the actor crash and having the Celluloid supervisor restart it does not provide a sane retry policy: there is no backoff, and the actor crashes seem to leak memory (#2231: Agent: Actor crash-and-restart loop will eat lot of resources).

  • Optimize deep observer chains where the same node_info_worker observable arrives via multiple paths, to avoid unnecessary updates

    If the weave_launcher observes the node_info_worker, and the etcd_launcher observes both node_info_worker and weave_launcher, the etcd_launcher observe block should only run once both the node_info_worker and the weave_launcher have updated.

    This behaves correctly during the initial startup, but on subsequent node_info_worker updates the etcd_launcher observe block runs twice: once when the node_info_worker updates, and again once the weave_launcher updates in turn.

  • Optimize launchers that now update on each node info update, not just once at startup

    The various launchers previously subscribed to network_adapter:start, which the Kontena::NetworkAdapters::Weave only published once (unless it crashed). Now the same launcher observe blocks run on every node_info update, which means extra Docker container inspect calls.

  • Smarter error handling for reporting agent health?

    Future PR...
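As referenced in the retry item above, a hedged sketch of what a sane retry policy could look like, using an exponential backoff inside the update instead of crashing the actor; the helper names and the choice of retryable error class are assumptions, not code from this PR:

```ruby
# Hypothetical retry wrapper for an observe update; not part of this PR.
def update_with_retry(node)
  backoff = 1.0
  begin
    update_observable(ensure_running(node))
  rescue Docker::Error::DockerError => exc # assumed retryable error class
    log_error(exc)
    reset_observable                  # clear the stale state while retrying
    sleep backoff
    backoff = [backoff * 2, 60.0].min # exponential backoff, capped at 60s
    retry
  end
end
```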

@SpComb added the agent label Jul 14, 2017

A review comment on the following snippet:

```ruby
# @return [Array<String>] 192.168.66.0/24
def grid_trusted_subnets
  @grid['trusted_subnets']
end
```

Contributor: @ not needed, there's an attr_reader.

SpComb added a commit that referenced this pull request Sep 8, 2017
Implements new `Kontena::Observer` primitives/behaviors required for the #2582 refactoring:

* Change the `Kontena::Observable` to be a standalone class, using a mutex to synchronize observers and observable updates 
* Change the `Kontena::Observer` to be a standalone class, using the actor mailbox to wait for observable updates
* Replace the `observer.async.update_observe` update calls with custom mailbox messages, handled by the observing task using [`Celluloid.receive`](https://github.com/celluloid/celluloid/wiki/Protocol-Interaction)
* Change the async `observe(observable) do |value| ... end` to yield from the same task
* Implement sync `value = observe(observable)` that just returns instead of yielding
* The sync API also supports a timeout
* Use sync `@node = observe(Actor[:node_info_worker])` for service pod and volume managers
* Use lightweight `Observable` instances for the RPC client requests

With the mailbox-based observer protocol, the `Kontena::Observer` is now also usable outside of the top-level actor classes. This means that the `Kontena::Observer.observe` method can be used in normal object instances like the `Kontena::ServicePods::Creator`. The observer uses `Celluloid.current_actor.mailbox` to receive updates, and `Celluloid.receive` to suspend the observing task.

With the mutex-based observable synchronization, the `Kontena::Observable` is now also usable outside of the top-level actor class. It is used by the `RpcClient` actor as a kind of hybrid between a Future and a Condition.

This is an improvement over the existing code, where the behavior of the `wait_until!` `sleep` depended on what kind of class the `WaitHelper` was included into, and where the `observer.async.update_observe` implementation only allowed the top-level actor object to observe anything.

The new observe interface requires the use of [`Celluloid.receive`](https://github.com/celluloid/celluloid/wiki/Protocol-Interaction) API to suspend the observing task while waiting for observable updates. This API isn't ideal, though, because apart from the special case of a synchronous observe of a single observable, the observer must be prepared to receive multiple observable update messages at any time. However, if the observing task ever suspends outside of the call to `Celluloid.receive`, then any observable messages received by the actor will just get discarded. Hence, the new `observe` implementation must make extensive use of `Celluloid.exclusive { ... }` - this allows any concurrent observable update messages to remain queued up in the actor mailbox until the task returns to the `Celluloid.receive` call.
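To make the two observe forms concrete, here is a usage sketch based on the calls quoted above; `Actor[:node_info_worker]` appears in the commit message, while the timeout keyword shape and the block body are assumptions:

```ruby
# Sync observe: suspends the current task (via Celluloid.receive under the
# hood) until the observable has a value, then returns it.
@node = observe(Actor[:node_info_worker], timeout: 30.0)

# Async observe: the block is yielded from the same task, once per
# observable update.
observe(Actor[:node_info_worker]) do |node|
  configure(node) # illustrative
end
```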
@SpComb (Contributor, Author) commented Dec 11, 2017

Note: I don't really expect this PR to get reviewed/merged in this form. This is more about experimenting with how I think the agent could be refactored... once I have something working, I'll try to split out smaller PRs.

That may cause some extra work though, because a lot of this work is about the inter-dependencies between actors...

@jakolehm removed this from the 1.5.0 milestone Jan 8, 2018