New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Cancellable Future
#1684
feat: Cancellable Future
#1684
Conversation
This patch updates Kotlin codegen to use `suspendCancellableCoroutine` and `CancellableContinuation` to be able to intercept cancellation of a `Job`. This patch alone makes the code to fail as expected, even if the cancellation isn't handled yet.
Should not be necessary once https://youtrack.jetbrains.com/issue/KT-47902 is fixed.
cc @mhammond since Ben is in holidays. |
I tried running the tests mentioned in #1677 (repeating those tests 100, 1000 times) which were getting stuck before at the 2nd-3rd iteration and with these changes I got until >50 iterations of repeating the same tests before I canceled them, so it seems to solve the issue! |
65bf9bf
to
2fbf296
Compare
2fbf296
to
9ea66ca
Compare
`RustFuture` has a `future` field. It was: `UnsafeCell<F>`, now it becomes `RwLock<Option<Pin<Box<dyn Future<Output = T>>>>>`. First, `RwLock` is probably safer to use than `UnsafeCell` here. Second, `Option` is required for the next commits (being able to cancel a `Future`). Third, `Pin` is required to keep the multi-thread safety, and to simplify the next commits. Fourth, `Box<dyn Future<Output = T>>` erases the need for the `F` generic parameter The primary goal of this patch is to remove the `F` generic parameter.
This patch adds a new `FfiType` variant: `Future`, which represents a pointer to a `RustFuture`.
…ncelled. An async function in the foreign language side can be cancelled. For example, in Kotlin: ``` var job = launch { someAsyncFunction() } job.cancel() ``` The problem is: The `Future` running inside `someAsyncFunction` was not dropped/cancelled. It can easily create dead-lock if some locks are used inside `someAsyncFunction` because they are never released. This patch focuses on Kotlin only for the codegen part. However, let's start by explaining the modifications on the Rust side. The procedural macros generate a function with a C ABI, that's not new. What is new is that now this function returns a pointer to the `RustFuture`. That's why `RustFuture::into_raw` is now public. Previously, it was just calling `future.wake()` and nothing was returned from this C function. But to be able to cancel/drop the `RustFuture`, we need a pointer to it, otherwise the implementation is a complete black- box from the foreign language point of view: the `RustFuture` is consumed by `RustFuture::wake` (then leaked, then fetched again by the waker etc.) and that's it. OK, so now, it returns a pointer to the `RustFuture`. The procedural macros generates a _new_ function with a C ABI to drop a `RustFuture`. What it does is: ```rust ::uniffi::rust_call(uniffi_call_status, || { let uniffi_rust_future = ::uniffi::RustFuture::<#return_ty, crate::UniFfiTag>::from_raw( uniffi_future_ptr, ); Ok(uniffi_rust_future.cancel_or_drop_inner_future()) }); ``` First off, `RustFuture::from_raw` is now also public. Second, now it becomes obvious why the `F` generic parameter of `RustFuture` was creating a problem! How can we build a `RustFuture` if we don't know the type of the `Future` itself? This generic parameter has been removed in a previous patch (686f827). What does `RustFuture::cancel_or_drop_inner_future`? It increments the `wake_counter` so that it ensures the `RustFuture` _cannot_ be awaken anymore. It also takes the value inside `RustFuture.future`, leaving a `None`, which also means that the `RustFuture` cannot be awaken too. Double guards. By taking the `Future`, we can drop it, thus releasing all the code inside the `Future`. Our problem is now solved from the Rust point of view! Now, let's move on the foreign language land. This patch focuses on Kotlin, another patch may come for Swift and Python soon. A previous patch (0df55b2) has switched `Continuation` for `CancellableContinuation`. We can now install a `CompletionHandler` that will be called by `CancellableContinuation.invokeOnCancellation`, but also when the `CancellableContinuation.resume` or `.resumeWithException` have been called. In all cases, we are sure the `CompletionHandler` is called. What this `CompletionHandler` does is the following: * Call the `…_uniffi_drop` FFI function, * “Free” `callbackHolder` by freeing its previous value (the `callback`), hopefully to indicate the GC that it can be collected. It may solve the problem mentioned in mozilla#1677 (comment) as a bonus. The `completionHandler` is also passed to the function callback handler, so that it can be called after `continuation.resume` or `continuation.resumeWithException`. Note that `completionHandler` receives `null` if it's a completed or resumed termination, or a `CancellationException` if the continuation has been cancelled. That's why it's valid to see `continuation.resumeWithException(exception); completionHandler(null)`; we must not mix those exceptions. The rest of the patch is extra code to `FfiFunction` on the Rust side, to add “companion” functions, it's a simple way to add extra functions related to one `FfiFunction`.
This patch creates a new `use_shared_resource` async function. It uses a `Mutex` to lock a fake resource, the lock is released after `release_after` milliseconds. This patch adds 2 new tests: 1. The first test does the following. It calls `useSharedResource(100U)` within a `Job` but the `Job` is cancelled after 50ms, which doesn't let the time for `useSharedResource` to free the lock. Hopefully, with our previous patches, the `RustFuture` is cancelled properly, and the lock is released. That's new! After the `delay`, `useSharedResource` is called again. Before the previous patches, because `job.cancel()` was doing nothing, we had to wait on the first `useSharedResource` to finish, to acquire a new lock and continue. 2. The second test does almost like the first, but it doesn't call `job.cancel()`. It basically tests that the second call to `useSharedResource` waits on the first one to finish.
Since `RustFuture::future` holds a `Pin` to the `Future`, the entire `RustFuture` no longer needs to be wrapped inside a `Pin`. It simplifies the code a little bit.
9ea66ca
to
db62d75
Compare
Now, `RustFuture` constraints the `Future` with the `'static` lifetime bound. This patch ensures that `uniffi_self` lives long enough, because `rustc` was complaining about it. Nonetheless, a new problem arised: `RustBuffer` cannot be sent inside a `RustFuture` because it doesn't implement `Send`. And so far, it seems safe to implement `Send` for `RustBuffer`, which is what this patch does. And now, everything seems to work.
Can you explain this one more? I don't see where the extra reference gets stored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like a lot about this PR: the doc updates, the fix for #1677. But I wonder if we should take a different approach to fixing the cancellation issue.
It seems to me that there's currently a fundamental problem with the ForeignExecutor scheduling: some scheduled calls never happen. The main cause seems to be that once the executor (AKA the CoroutineScope in Kotlin) is canceled then future scheduled calls will be ignored. Another cause that I'm not 100% sure about is that a call is successfully scheduled, but the executor is cancelled before that call fires. Because of this, we're leaking Arc pointers because we schedule a wake callback, but the callback never happens.
If that analysis is correct, then I'd rather update the ForeignExecutor API to handle this. I think there's 2 changes that need to be made:
ForeignExecutorCallback
should return a value indicating if the callback was scheduled successfully. We can then check this value inschedule_do_wake()
and if we see a failure then unleak the Arc pointer.- The callbacks themselves should get an additional arg specifying if the executor was cancelled.
wake_callback()
should check that and if there was a cancellation, it should still callArc::from_raw()
to unleak the pointer, but not calldo_wake()
.
I think this will require a lot less changes on the Rust side. I also think this should be do-able from the Kotlin side. I think the first bullet is just a matter of checking isCancelled
. The second bullet might be a bit more tricky, but it seems to me that there has to be a way to do that.
The main reason I'm suggesting this is that I want to start using the ForeignExecutor for more features (#1675). I think we're going to run into the same issue there if we don't fix things at the ForeignExecutor level here. Also, in general, it feels like this PR is working around a broken primitive operation rather than fixing the primitive itself.
I'm not totally sure that analysis is correct, feel free to push back if I missed something. If you do think it's correct, I'd be happy to work on the Rust ForeignExecutor side of things if that would help since I'm already working with that code for #1675. |
It exists a very rare case where it's possible to cancel a `Job`, and thus a `CoroutineScope` before the FFI binding building and running the `RustFuture` can be called, and thus, no `rustFuturePtr` can be computed. In this rare case, the `completionHandler` will run `rustFuturePtr!!` and boom. This patch catches this case by checking that the `rustFuturePtr` isn't `null` before calling the `…_drop` function to drop the `RustFuture`.
I'm not sure. I was surprised to see that this patch fixes the problem, but it wasn't my first intent. I've zero experience with Kotlin :-p. |
To clarify, I did open the PR because we are hitting this problem hard, and it's blocking us. I know it's not the best fix possible, and I'm OK if you close this and start something from scratch :-). I'm aware this is a dressing on a deeper problem. |
@bendk Your suggestion would work well with https://developer.apple.com/documentation/swift/withtaskcancellationhandler(operation:oncancel:) for Swift too. I've a branch to work with this, and it fits pretty well with the current design of the PR. It would fit pretty well with your suggestion too. |
The biggest problem with Kotlin is that you don't know when a |
Actually, there is: on one hand, if you already have a reference to the Also, if you change the call from |
Just a heads-up, this PR (in its current state) seems to introduce a regression for the following situation: // error: future cannot be sent between threads safely
// the trait `Send` is not implemented for `*const c_void`
// note: captured value is not `Send`
// required by a bound in `RustFuture::<T, UT>::new`
#[derive(uniffi::Object)]
pub struct MyFirstObj;
#[uniffi::export]
impl MyFirstObj {
pub async fn do_thing(&self, second_obj: Arc<MySecondObj>) {}
// ^^^^^^^^^^ has type `*const c_void` which is not `Send`
}
#[derive(uniffi::Object)]
pub struct MySecondObj; Such code used to compile without issues on 0.24.3. |
Yeah, because the |
@OtaK as per the discussion above, this PR is unlikely to be merged anyways. But it's interesting that we don't have a test case like that, it should probably be added to the next PR that aims to fix the cancellation problem. |
d0bc131
to
9e2ffac
Compare
When we execute a future on Kotlin, we create a `jni.Callback` instance that Rust calls when the Future is ready. We need to hold a reference to that object or else the JVM will finalize it and when Rust tries to invoke the callback bad things will happen. In the unit tests this causes them to stall because the Kotlin coroutine continutation is never completed. The `callbackHolder` variable was intended to hold a reference, but it doesn't seem to be working (see [arg0d's analysis](mozilla#1677 (comment))). I belive the issue is that variable isn't actually used anywhere so Kotlin frees it before the coroutine has completed and the reason mozilla#1684 fixes the issue is that it uses the variable in the completion handler (https://github.com/mozilla/uniffi-rs/pull/1684/files#diff-f564b605a6ba32d19bcb47128dc9c5c2f01169e2a41fa60ffaaf0b6eea13845cR34) This commit changes things so instead of using a local variable, there's a global set that stores all active callbacks. It seems to fix the issue for me when using arg0d's `repeat(1000) { ... }` test.
When we execute a future on Kotlin, we create a `jni.Callback` instance that Rust calls when the Future is ready. We need to hold a reference to that object or else the JVM will finalize it and when Rust tries to invoke the callback bad things will happen. In the unit tests this causes them to stall because the Kotlin coroutine continutation is never completed. The `callbackHolder` variable was intended to hold a reference, but it doesn't seem to be working (see [arg0d's analysis](mozilla#1677 (comment))). I belive the issue is that variable isn't actually used anywhere so Kotlin frees it before the coroutine has completed and the reason mozilla#1684 fixes the issue is that it uses the variable in the completion handler (https://github.com/mozilla/uniffi-rs/pull/1684/files#diff-f564b605a6ba32d19bcb47128dc9c5c2f01169e2a41fa60ffaaf0b6eea13845cR34) This commit changes things so instead of using a local variable, there's a global set that stores all active callbacks. It seems to fix the issue for me when using arg0d's `repeat(1000) { ... }` test.
That makes sense. I've started some work on this and hoepfully will have a PR soon. |
@Hywan are you able to reproduce this issue on swift? It seems to work for me when I try to write a similar unit test. |
Nevermind about this, I just needed to adjust my timeouts. |
Copied from mozilla#1684 (thanks Hywan!)
I just made #1697 to address the deeper problem. How about closing this one? |
This patch switches UniFFI from mozilla/uniffi-rs#1684 to mozilla/uniffi-rs#1697.
This patch should ideally be reviewed commit-by-commit for an easier reviewer experience. All commits contain a descriptive message, so that I don't have to repeat everything here.
It fixes #1669 and closes #1667.
The most important patches are probably 9cf9f11 and e488828.
In a nutshell: this patch ensures that it's possible to cancel or drop a
RustFuture
along with its innerFuture
properly. A new FFI function is generated per async function,<name>_uniffi_drop
. I'm fully aware that this patch “breaks” the black-box model of theRustFuture
from the foreign language point of view. It's however absolutely necessary, otherwise it's impossible to trigger a cancellation or a proper drop of aRustFuture
.I also have hope that this patch will fix #1677 (comment), as now there is a double reference to the
callbackHolder
, which must ensure that the garbage collector won't collect it prematurely.Finally, this patch focuses on Kotlin, and I'm starting to take a look at the Swift part now. I may update this PR to update Swift if that's necessary. Python will follow.
Follow up feature: We may simplify
Pin<Arc<RustFuture>>
to simplyArc<RustFuture>
once this PR has been merged, because nowRustFuture::future
contains thePin
. Edit: that's actually done in db62d75.