From d55bbbd6cc1a9556976abd7bc7b7529c6c65b081 Mon Sep 17 00:00:00 2001 From: Konrad `ktoso` Malawski Date: Thu, 5 Nov 2020 13:52:45 +0900 Subject: [PATCH] wip reentrant actors discussion wip wip --- proposals/nnnn-actors.md | 547 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 534 insertions(+), 13 deletions(-) diff --git a/proposals/nnnn-actors.md b/proposals/nnnn-actors.md index a1c3d82ea2f..81d467547e2 100644 --- a/proposals/nnnn-actors.md +++ b/proposals/nnnn-actors.md @@ -140,29 +140,550 @@ extension BankAccount { } ``` -#### Closures and local functions +### Actor reentrancy (discussion) -The restrictions on only allowing access to (non-`async`) actor-isolated declarations on `self` only work so long as we can ensure that the code in which `self` is valid is executing non-concurrently on the actor. For methods on the actor class, this is established by the rules described above: `async` function calls are serialized via the actor's queue, and non-`async` calls are only allowed when we know that we are already executing (non-concurrently) on the actor. +One critical point that needs to be discussed and fleshed out is whether actors are [reentrant](https://en.wikipedia.org/wiki/Reentrancy_(computing)) by default or not. -However, `self` can also be captured by closures and local functions. Should those closures and local functions have access to actor-isolated state on the captured `self`? Consider an example where we want to close out a bank account and distribute the balance amongst a set of accounts: +Currently, this proposal takes the adventerous approach of assuming all actors are reentrant by default. This allows for the runtime to claim the complete elimination of deadlocks, offers opportunity for scheduling optimization techniques where a "high priority task" _must_ be executed as soon as possible, however at the cost of _interleaving_ with any other actor-isolated function's execution any other asynchronous function declared on this actor. This poses a mental burden on end-users of the actor runtime. + +The decision between reentrant and non-reentrant actor behaviors is a difficult balancing act between various tradeoffs. In this section we would like to provide some examples of the challenges and tradeoffs to be taken, to finally consider what the right set of tradeoffs for Swift Actors would be. + +#### Reentrant actors: "Interleaving" execution + +With reentrant actors the actor-isolated functions may "interleave." + +Interleaving executions still respect the actor's single-threaded illusion–i.e. no two functions will ever execute *concurrently* on any given actor–however they may _interleave_ at suspension points. In broad terms this means that actors are "thread-safe" but are not "data race safe." + +> The default being interleaving is a quite daring departure from all other known actor runtimes, thus we need to be really sure we want to and need to make it IMHO. + +Interleving execution weakens the threading model offered by swift actors in the sense that actor isolated-functions on reentrant actors are not "high-level atomic" in the same sense a synchronous function is. A synchronous function always runs from it's invocation to it's return without any suspension point, and thus _cannot_ interleave with any other execution. + + +> Empirically: we know that both an "non-reentrant" and "reentrant" await are useful; though the non-reentrant one was used _more frequently_ in some real applications we worked with. + +To further clarify the implications of this, let us consider the following snippet: ```swift -extension BankAccount { - func close(distributingTo accounts: [BankAccount]) async { - let transferAmount = balance / accounts.count - - accounts.forEach { account in - balance = balance - transferAmount // is this safe? - Task.runDetached { - await account.deposit(amount: transferAmount) - } +enum Opinion { + case noIdea + case goodIdea + case badIdea +} + +// reentrant actor +actor class DecisionMaker { + var opinion: Judgement = .noIdea + + func thinkOfGoodIdea() async -> Decision { + opinion = .goodIdea // <1> + _ = await consultWith(firstFriend, myOpinion: opinion) // <2> + return opinion // 🤨 // <3> + } + + func thinkOfBadIdea() async { + opinion = .badIdea // <4> + _ = await consultWith(firstFriend, myOpinion: opinion) // <5> + return opinion // 🤨 // <6> + } +} +``` + +> Reentrant code is notoriously difficult to program with–one is protected from low level data-races, however the higher level semantic races may still happen. In the examples shown, the states are small enough that they are simple to fit in one's hear so they do not appear as tricky, however in the real world races can be dauntingly hard to debug. + +In the example above the `DecisionMaker` is invoked think of a good or bad idea, shares that opinion with a friend, then _ignores_ their friends advice, and returns their _own opinion_. Since the actor is reentrant this code is wrong and will return an _arbitrary opinion_ rather than the opinion the actor initially stored at the beginning of its thinking process. + +This is examplified by the following piece of code, exercising the `decisionMaker` actor: + +```swift +async let shouldBeGood = decisionMaker.thinkOfGoodIdea() // runs async +async let shouldBeBad = decisionMaker.thinkOfBadIdea() // runs async + +await firstDecision // could be .goodIdea or .badIdea (!) +await secondDecision +``` + +This snippet _may_ result (depending on timing of the resumptions) in the following execution: + +```swift +opinion = .goodIdea // <1> +// suspend: await consultWith(...) // <2> +opinion = .badIdea // | <4> (!) +// suspend: await consultWith(...) // | <5> +// resume: await consultWith(...) // <2> +return opinion // <3> +// resume: await consultWith(...) // <5> +return opinion // <6> +``` + +But it _may_ also result in the "naively expected" execution, i.e. without interleaving. + +With this example we have showcased that reentrant actors, by design, do not prevent "high-level" race conditions. They only prevent "low-level" data-races, as in: concurrent access to some variable they protect etc. + +> Existing actor implementations err on the side of non-reentrant actors by default, allowing for reentrancy to be optionally opted-into. +> +> Erlang/Elixir ([gen_server](https://medium.com/@eduardbme/erlang-gen-server-never-call-your-public-interface-functions-internally-c17c8f28a1ee)), Orleans ([grains](https://dotnet.github.io/orleans/docs/grains/reentrancy.html)), Akka ([persist/persistAsync](https://doc.akka.io/docs/akka/current/persistence.html#relaxed-local-consistency-requirements-and-high-throughput-use-cases), though it does not have async/await), _all opt for non-reentrant behavior by default_, and allow programmers to _opt into_ reentrant whenever it would be needed. +> +> Most successful actor system implementations are primarily focused on server side systems, while Swift needs to serve both applications and server side systems well. +> +> Could it be that the cases for when we want to optimize for interleaving + +Therefore, a reentrant actor _does not_ prevent high level logical races from happening! + +So what are the advantage of such semantics? Because it prevents deadlocks, and allows for "high priority" work to interleave with less urgent work happening on the same actor (which may be important for UI programming, though is much less commonly necessary on the server). + +#### Non-reentrant actors: Deadlocks + +A non-reentrant actor is an actor which will _not_ execute any other actor-isolated asynchronous function on itself while it is awaiting a task to complete. + +If we bring back our example from the previous section, it is _trivially correct_ with non-reentrant actors: + +```swift +// *non-reentrant* actor +actor class DecisionMaker { + var opinion: Judgement = .noIdea + + func thinkOfGoodIdea() async -> Decision { + opinion = .goodIdea + _ = await consultWith(firstFriend, myOpinion: opinion) + return opinion // ✅ always .goodIdea + } + + func thinkOfBadIdea() async { + opinion = .badIdea + _ = await consultWith(firstFriend, myOpinion: opinion) + return opinion // ✅ always .badIdea + } +} +``` + +This allows programmers to really embrace the "as if single threaded, normal function" programming mindset when using (non-reentrant) actors. This benefit comes at a cost however: the potential for deadlocks. + +> Deadlocks in this discussion refer to actors asynchronously waiting on "each other," or on "future work of "self". +> +> No thread blocking is necessary to manifest this issue. + +Deadlocks can, in the non-reentrant model, can appear in the following situations: + +- **dependency loops** + - **description:** actor `A` requesting (and awaiting) on a call to `B`, and `B` then calling (and awaiting) on something from `A` directly (or indirectly) + - **solution:** such loops are possible to detect and crash on with tracing systems or debugging systems in general, and are usually easy to resolve once the call-chain is diagnosed +- **awaiting on future work** (performed by self) + - **description:** is a form of the loop case, however can happen in some more suprising cases, say spawning a detached task that calls into self + - **note:** The naive form of `await self.asyncWork()` _is fine_ under Swift rules, because this task would immediately start executing the `asyncWork` rather than "enqueue asyncWork to be worked on later on" like other actor model implementations would – we are less prone to self deadlocks than other implementations (!) + - **solution:** this again is possible to diagnose with tracing and debugging tools + +To illustrate the issue a bit more, let us consider this example: + +```swift +// non-reentrant +actor class Kitchen { + func order(order: MealOrder, from waiter: Waiter) async -> Confirmation { + await waiter.areYouSure() // deadlock (!) + } +} + +// non-reentrant +actor class Waiter { + let kitchen: Kitchen + func order(order: MealOrder) async -> Confirmation { + await kitchen.order(order: order, waiter: self) + } +} +``` + +In this example the deadlock is relatively simple to spot and diagnose. Perhaps such simple cases we could even diagnose satically someday. It may happen however that deadlocks are not easy to diagnose, in which case tracing and other diagnostic systems could help. + +Deadlocked actors would be sitting around as inactive zombies forever, because normal swift async calls do not include timeouts. + +Other runtimes solve this by making *every single actor call have a timeout*. This would mean that each await could potentially throw, and that either timeouts or deadlock detection would have to always be enabled. + +It is easy to point out a small mistake in actors spanning a few lines of code, however programming complex actors with reentrancy can be quite a challange. In this specific example, the solution–in hindsight–is simple, we should store the opinion in a function local variable, or in other words, any state the actor needs to to complete an execution "atomically" it must copy into local function scope. This can be hard to remember and manage consistently. + +> Disclaimer: Depending on one's background one can actually claim that deadlocks are better (!), than interleaving because they can be detected more easily and fixed more consistently. + +#### Non-reentrant actors: Deadlocks with async lets + +Swift also offers the `async let` declaration style, allowing for expressing structured bounded number of asynchronous child tasks being performed concurrently. If tasks depend on the actor itself completing them, and the actor were non-reentrant, it would be trivial to cause deadlocks by (semantically) blocking the actor from ever processing the future work the actor is waiting on. + +If actors were non-reentrant, the `async let` in the following example would cause a deadlock: + + +```swift +actor class Wallet { + func checkDebt() async -> Debt { ... } +func checkCash() async -> Balance { ... } + +func checkAvailableAmount() async -> Balance { + async let debt = checkDebt() + async let cash = checkCash() + return await cash - debt.amount // if non-reentrant: deadlock (!) // TODO: I guess? +} +``` + +This might mean that if we allow marking some actors non-reentrant they might need to be forbidden from using `async let` declarations...? + +#### Reentrancy Summary + +So, from a consistency point of view, one might want to prefer non-reentrant actors, but from an high-priority work scheduling in the style of "run this now, at any cost" reentrant actors offer an useful model, preventing data-races, while allowing this interleaved execution which–if one is *very careful*–can be utilized to some benefit. + +--- + +The tradeoff between reentrant and non-reentrant actors is not new, and has been tackled in various ways by existing implementations. + +The closest other actor runtime to Swift's proposed model is Microsoft's [Orleans](https://dotnet.github.io/orleans/docs/grains/reentrancy.html), due to it's inherent use of async/await for all actor interaction. Microsoft's Orleans is primarily tuned to server-side and even transactional (due to some extra features) workloads. Orleans actors are called "grains," and we'll use that term to refer to them for consistency. + +> Grain activations are single-threaded and, by default, process each request from beginning to completion before the next request can begin processing. In some circumstances, it may be desirable for an activation to process other requests while one request is waiting for an asynchronous operation to complete. + > + > For this and other reasons, Orleans gives the developer some control over the request interleaving behavior. Multiple requests may be interleaved in the following cases: +> +> - The grain class is marked as `[Reentrant]` + > - The interface method is marked as `[AlwaysInterleave]` + > - The requests within the same call chain + > - The grain's MayInterleave predicate returns `true` + +Orleans default mode is to be non-reentrant as it simplifies day to day programming, however at the risk of deadlocks. Deadlocks there fore _must_ always be accounted for, which is done by timeouts timeouts -- i.e. every grain call must have a timeout. + +This makes sense for Orleans (and Akka) where their focus is on distributed systems and calls usually imply communication over the network. Swift actors however are a core piece of the language and concurrency story, burdening every single actor call with setting timeouts and scheduling timeout events for when calls do not complete within given deadline. Therefore this model may be tricky to use as default mode for Swift actors. + +Such mode however is tremendously useful to be able to provide. Some actors may serve as "transactional" process manager (also known as "[sagas](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/saga/saga)") which are carefully implemented to avoid deadlocks. + + + +**Alternative: Keep ONLY reentrant actors and aggresively go for "per linear task (worker) actor"** + +One way to deal with ensuring "linear execution" is to ensure that an actor _cannot_ be interrupted by any other async function calls other than the ones it issues itself. This may be tricky in some situations, and makes it hard to mutate state back in such worker's parent actor but generally looks like this: + +```swift +actor class Parent { + actor class Worker { + func clean(room: Room) async { + cleanALittleBit() + } + func stopCleaning() async { + fatalError("oh no!") + } + } + + var cleaningRooms: [Room: Task.Handle] + + func askChildToCleanRoom(room: Room) async throws -> { + if let roomHandle = cleaningRooms[room] else { + _ = await try cleaningRooms.get() + return + } else { + let worker = Worker() + let handle = Task.runDetached { + // TODO lost task context; how to restore? + await worker.clean(room: room) + } + cleaningRooms[room] = handle + } +} +``` + +--- + +**Alternative: Optional non-reentrant actors** + +We could consider the reentrant actors by default however allow opt-ing into the more consistent but deadlock prone `NonReentrant` mode by annotating an actor or providing some configuration when spawning it. + +> Sakkana or Akka style would be that each actor when spawned can be given some "props" (like used by actors in theater) which can configure their behavior. +> +> This is not trivial to implement but we have it solid in Sakkana; It is why the "mailbox" is so crucial; in our world we'd need an extra status bit in the Task I suppose? + +It could take the form of: + +```swift +protocol Actor { + static var reentrancy: Actor Reentrancy { get } // defaults to `.reentrant` +} +``` + +and + +```swift +protocol NonReentrantActor { +} + +extension NonReentrantActor { + public static var reentrancy = .nonReentrant +} +``` + +This would allow users knowingly opt into the more deadlock prone but more consistent mode of operation. This needs cooperation from the `Executor`. + +A reentrant actor's `enqueue` implementation is trivial: + +```swift +// pseudo impl +func enqueue(partialTask: PartialAsyncTask) { + queue.async { partialTask.run() } +} +``` + +while a non-reentrant actor has to maintain a `.awaiting` state in addition to the usual states that an actor (or rather task, can be in): + + +```swift +// pseudo code! + +var stashed: [PartialAsyncTask] = [] +func enqueue(partialTask: PartialAsyncTask) { + if self.awaiting { + stashed.append(partialTask) // however we'd spell "don't execute yet" + } else { + assert(stashed.isEmpty) + queue.async { partialTask.run() } + } +} + +// pseudo code! +func wakeUp() { + assert(self.awaiting) + self.awaiting = false + +} + +``` + +**Reentrant actors and asynchronous initialization** + +It is possible to emulate non-reentrant behavior in reentrant actors by protecting their entry points with precondition checks. It is somewhat sub-optimal to implement manually, and somewhat labour intensive, hwoever this is a pattern which may be expected to appear whenever an actor has to lazily initialize some thing: + +```swift +actor class UserRepository { + let connection: DBConnection? + + func findBy(name: String) async throws -> ... { + guard let connection = self.connection else { + await try establishConnection() + return findBy(name: name) } - thief.deposit(amount: balance) + // do things with `connection`... + } + + func establishConnection() async throws -> DBConnection { + if let connection = connection { + return connection + } + + self.connection = await DBConnection.make(...) + return connection + } +} +``` + + +**Asynchronous initializers (are not-reentrant)** + +It may be beneficial to allow initializers be asynchronous, to allow them to call asynchronous functions. Thanks to this we would be able to remove the `?` from the `connection` property from the previous example, as well as get rid of the `guard let connection` boilerplate which was plagueing the previous example in every `find...` implementation one could write: + +```swift +actor class UserRepository { + let connection: DBConnection? + + init() async { + connection = await DBConnection.make(...) + } + + fund send(message: Message) async { + log.info("sending: \(message)") + await connection.send(message) + } +} +``` + +Now the repository actor is much nicer to read, we got rid of a lot of boilerplate. Is this safe in face of reenterace though? One might initially be worried that it might not be since async functions can interleave, however `init(...) async` is special because of it not returning the constructed reference to the actor until the initializer completes, as such, there is _no way_ to invoke any actor-isolated functions on it until it has finished initializing: + +```swift +let users = await UserRepository() +users.find... +``` + +An asynchronous initializer therefore is asynchronous, may invoke and await on asynchronous functions, however will _never_ interleave with any other invocations. This is by construction: no other code can invoke functions on an not yet constructed instance after all. + +> Examples, in the wild, of this pattern are not hard to find, however one we can quickly mention here is from Swift AWS Lambda Runtime, which looks like this: `typealias HandlerFactory = (InitializationContext) -> EventLoopFuture`. If you have any code which is similar to this, consider refactoring it into an asynchronous initializer. + + +#### Other notes: + +- Orleans: +- By default grains (actors) are **non-reentrant** [Orleans: Reentrancy](https://dotnet.github.io/orleans/docs/grains/reentrancy.html) +- ([Orleans Best Practices.pdf](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Orleans20Best20Practices.pdf)) +- Deadlock in case of call cycles, e.g. "call itself" (through some other actors or just directly) +- Deadlocks are automatically broken with timeouts + - `[Reentrant]` attribute to make grain class reentrant +- Reentrant code may "interlave" + - Dealing with interleaving is error prone + - Used to allow _partial_ interleaving with `[Readonly]` but lost ability to semantically enforce this, so now it is hidden [https://github.com/dotnet/orleans/issues/1468](https://github.com/dotnet/orleans/issues/1468) + + +- Akka: + - Akka does not suffer _as much_ from the same issue because it has no way to "await" in the core actors + - There exists `persist() { ... }` which semantically prevents the actor from receiving any other messages than the "persist-success" however it is actually implemented by _receiving, and stashing_ messages until the special mesage arrives. + + +Reentrant Akka persistence actor example ([see here](https://doc.akka.io/docs/akka/current/persistence.html#relaxed-local-consistency-requirements-and-high-throughput-use-cases)): + +```scala + override def receiveCommand: Receive = { + case c: String => { + sender() ! c + persistAsync(s"evt-$c-1") { e => + sender() ! e + } + persistAsync(s"evt-$c-2") { e => + sender() ! e + } + } + } +} +``` + +Between the `sender() ! e` callbacks other messages could have been processed. `persistAsync` does not prevent the actor from handling other messages between issuing the persist calls and triggering their callbacks. + +Non-reentrant Akka persistence actor example: + +```scala + override def receiveCommand: Receive = { + case c: String => { + sender() ! c + persist(s"evt-$c-1") { e => + sender() ! e + } + persist(s"evt-$c-2") { e => + sender() ! e + } + defer(s"evt-$c-3") { e => + sender() ! e + } + } + } +``` + +This code _does_ prevent any new messages from being received until the persist() calls all return and run their callbacks. The execution is guaranteed to be: + +```scala +// ONE +// evt-ONE-1 +// evt-ONE-2 +// evt-ONE-3 +// callback evt-ONE-1 +// callback evt-ONE-2 +// callback evt-ONE-3 +// TWO +// evt-TWO-1 +// ... +``` + +> Note: We did have long discussions about this in the Akka team way back then; the default used to be the reentrant one, because we thought about the performance benefit -- realistically, people preferred the understandability of the programming model; and we allow opting into the "async" version which is reentrant. + +--- + +```swift +actor class LinearRegister { + let a: Keystore + let b: Keystore + + // reentrant + func compareAndSwap(expected: T, set value: T) async throws -> T { +guard !self.inProgress else { +throw RegisterError.otherOperationAlreadyInProgress +} + +self.inProgress = true +defer { self.inProgress = false } // however we complete, we're done; other CAS may start + +guard expected == await a.prepare(expected: expected, ...) else { throw ... } +guard expected == await b.prepare(expected: expected, ...) else { throw ... } + +await try a.accept(value, ...) else { throw ... } +await try b.accept(value, ...) else { throw ... } + + +} +} + +let keystore: Keystore = ... +async let first = keystore.compareAndSwap(expected: "initial", set: "first") +async let second = keystore.compareAndSwap(expected: "first", set: "second") +``` + +```swift +// NOT reentrant +actor class Keystore { + let a: Keystore + let b: Keystore + + func compareAndSwap(expected: T, set value: T) async throws -> T { +guard expected == await a.prepare(expected: expected) else { throw ... } +guard expected == await b.prepare(expected: expected) else { throw ... } + + +} +} +``` + + + --- + + Reentrant actors also make it hard to fulfil FIFO on async function calls, if they themselfes have to suspend. For example attempting to implement such `Enqueue` protocol such that we always first return to the _first_ caller, and then to thr next caller is difficult with reentrancy: + +```swift +protocol Enqueue: Actor { + func enqueue(message: Any) async -> Done +} +``` + +```swift +actor class EnqueueActor: Enqueue { + func enqueue(message: Any) async -> Done { + await self._enqueue(message: message) + return Done() } } ``` +It is difficult to fulfil the contract that in the bellow example `a` will first get done and only then actor `b` will get done: + + +```swift +actor class Enqueuer { + let queue: Enqueue = ... + + func a() async { + await queue.enqueue("a") +} + +func b() async { + await queue.enqueue("b") +} +} + +let enqueuer = Enqueuer() +async let a = enqueuer.a() +async let b = enqueuer.b() + +// in non-reenterant actors +// we would expect a be completed before b, because the pair-wise ordering is +// between the same actors: `main -> enqueuer -> queue` +// however this is not necessarily true under reneterant actors (!) +// the pairwise ordering guarantee is somewhat weaker, we are guaranteed that: +// - enqueuer sends "a" before "b" +// - queue gets "a" before "b" +// everything else can happen in any order; +// `a` may complete before `b` +``` + +--- + +#### Closures and local functions + +The restrictions on only allowing access to (non-`async`) actor-isolated declarations on `self` only work so long as we can ensure that the code in which `self` is valid is executing non-concurrently on the actor. For methods on the actor class, this is established by the rules described above: `async` function calls are serialized via the actor's queue, and non-`async` calls are only allowed when we know that we are already executing (non-concurrently) on the actor. + +However, `self` can also be captured by closures and local functions. Should those closures and local functions have access to actor-isolated state on the captured `self`? Consider an example where we want to close out a bank account and distribute the balance amongst a set of accounts: + The closure is accessing (and modifying) `balance`, which is part of the `self` actor's isolated state. Once the closure is formed and passed off to a function (in this case, `Sequence.forEach`), we no longer have control over when and how the closure is executed. On the other hand, we "know" that `forEach` is a synchronous function that invokes the closure on successive elements in the sequence. It is not concurrent, and the code above would be safe. If, on the other hand, we used a hypothetical parallel for-each, we would have a data race when the closure executes concurrently on different elements: