Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Cancellable Future #1684

Closed
wants to merge 21 commits into from
Closed

Conversation

Hywan
Copy link
Contributor

@Hywan Hywan commented Aug 2, 2023

This patch should ideally be reviewed commit-by-commit for an easier reviewer experience. All commits contain a descriptive message, so that I don't have to repeat everything here.

It fixes #1669 and closes #1667.

The most important patches are probably 9cf9f11 and e488828.

In a nutshell: this patch ensures that it's possible to cancel or drop a RustFuture along with its inner Future properly. A new FFI function is generated per async function, <name>_uniffi_drop. I'm fully aware that this patch “breaks” the black-box model of the RustFuture from the foreign language point of view. It's however absolutely necessary, otherwise it's impossible to trigger a cancellation or a proper drop of a RustFuture.

I also have hope that this patch will fix #1677 (comment), as now there is a double reference to the callbackHolder, which must ensure that the garbage collector won't collect it prematurely.

Finally, this patch focuses on Kotlin, and I'm starting to take a look at the Swift part now. I may update this PR to update Swift if that's necessary. Python will follow.

Follow up feature: We may simplify Pin<Arc<RustFuture>> to simply Arc<RustFuture> once this PR has been merged, because now RustFuture::future contains the Pin. Edit: that's actually done in db62d75.

This patch updates Kotlin codegen to use `suspendCancellableCoroutine`
and `CancellableContinuation` to be able to intercept cancellation of
a `Job`.

This patch alone makes the code to fail as expected, even if the
cancellation isn't handled yet.
@Hywan Hywan requested a review from a team as a code owner August 2, 2023 09:50
@Hywan Hywan requested review from bendk and removed request for a team August 2, 2023 09:50
@Hywan Hywan changed the title feat: Cancellation Future feat: Cancellable Future Aug 2, 2023
@Hywan
Copy link
Contributor Author

Hywan commented Aug 2, 2023

cc @mhammond since Ben is in holidays.

@jmartinesp
Copy link
Contributor

jmartinesp commented Aug 2, 2023

I tried running the tests mentioned in #1677 (repeating those tests 100, 1000 times) which were getting stuck before at the 2nd-3rd iteration and with these changes I got until >50 iterations of repeating the same tests before I canceled them, so it seems to solve the issue!

`RustFuture` has a `future` field. It was: `UnsafeCell<F>`, now it
becomes `RwLock<Option<Pin<Box<dyn Future<Output = T>>>>>`.

First, `RwLock` is probably safer to use than `UnsafeCell` here.

Second, `Option` is required for the next commits (being able to cancel
a `Future`).

Third, `Pin` is required to keep the multi-thread safety, and to
simplify the next commits.

Fourth, `Box<dyn Future<Output = T>>` erases the need for the `F`
generic parameter

The primary goal of this patch is to remove the `F` generic parameter.
This patch adds a new `FfiType` variant: `Future`, which represents a
pointer to a `RustFuture`.
…ncelled.

An async function in the foreign language side can be cancelled. For
example, in Kotlin:

```
var job = launch { someAsyncFunction() }
job.cancel()
```

The problem is: The `Future` running inside `someAsyncFunction` was not
dropped/cancelled. It can easily create dead-lock if some locks are used
inside `someAsyncFunction` because they are never released.

This patch focuses on Kotlin only for the codegen part. However, let's
start by explaining the modifications on the Rust side.

The procedural macros generate a function with a C ABI, that's not
new. What is new is that now this function returns a pointer to
the `RustFuture`. That's why `RustFuture::into_raw` is now public.
Previously, it was just calling `future.wake()` and nothing was returned
from this C function. But to be able to cancel/drop the `RustFuture`, we
need a pointer to it, otherwise the implementation is a complete black-
box from the foreign language point of view: the `RustFuture` is
consumed by `RustFuture::wake` (then leaked, then fetched again by the
waker etc.) and that's it. OK, so now, it returns a pointer to the
`RustFuture`.

The procedural macros generates a _new_ function with a C ABI to drop
a `RustFuture`. What it does is:

```rust
::uniffi::rust_call(uniffi_call_status, || {
    let uniffi_rust_future = ::uniffi::RustFuture::<#return_ty, crate::UniFfiTag>::from_raw(
        uniffi_future_ptr,
    );
    Ok(uniffi_rust_future.cancel_or_drop_inner_future())
});
```

First off, `RustFuture::from_raw` is now also public. Second, now
it becomes obvious why the `F` generic parameter of `RustFuture` was
creating a problem! How can we build a `RustFuture` if we don't know the
type of the `Future` itself? This generic parameter has been removed in
a previous patch (686f827).

What does `RustFuture::cancel_or_drop_inner_future`? It increments
the `wake_counter` so that it ensures the `RustFuture` _cannot_ be
awaken anymore. It also takes the value inside `RustFuture.future`,
leaving a `None`, which also means that the `RustFuture` cannot be
awaken too. Double guards. By taking the `Future`, we can drop it, thus
releasing all the code inside the `Future`. Our problem is now solved
from the Rust point of view!

Now, let's move on the foreign language land. This patch focuses on
Kotlin, another patch may come for Swift and Python soon. A
previous patch (0df55b2) has switched `Continuation` for
`CancellableContinuation`. We can now install a `CompletionHandler` that
will be called by `CancellableContinuation.invokeOnCancellation`, but
also when the `CancellableContinuation.resume` or `.resumeWithException`
have been called. In all cases, we are sure the `CompletionHandler` is called.

What this `CompletionHandler` does is the following:

* Call the `…_uniffi_drop` FFI function,
* “Free” `callbackHolder` by freeing its previous value (the
  `callback`), hopefully to indicate the GC that it can be collected. It
  may solve the problem mentioned in
  mozilla#1677 (comment)
  as a bonus.

The `completionHandler` is also passed to the function callback
handler, so that it can be called after `continuation.resume` or
`continuation.resumeWithException`.

Note that `completionHandler` receives `null` if it's a
completed or resumed termination, or a `CancellationException` if the
continuation has been cancelled. That's why it's valid to see
`continuation.resumeWithException(exception); completionHandler(null)`;
we must not mix those exceptions.

The rest of the patch is extra code to `FfiFunction` on the Rust side,
to add “companion” functions, it's a simple way to add extra functions
related to one `FfiFunction`.
This patch creates a new `use_shared_resource` async function. It
uses a `Mutex` to lock a fake resource, the lock is released after
`release_after` milliseconds.

This patch adds 2 new tests:

1. The first test does the following. It calls `useSharedResource(100U)`
   within a `Job` but the `Job` is cancelled after 50ms, which
   doesn't let the time for `useSharedResource` to free the lock.
   Hopefully, with our previous patches, the `RustFuture` is cancelled
   properly, and the lock is released. That's new! After the `delay`,
   `useSharedResource` is called again. Before the previous patches,
   because `job.cancel()` was doing nothing, we had to wait on the first
   `useSharedResource` to finish, to acquire a new lock and continue.

2. The second test does almost like the first, but it doesn't
   call `job.cancel()`. It basically tests that the second call to
   `useSharedResource` waits on the first one to finish.
Since `RustFuture::future` holds a `Pin` to the `Future`, the entire
`RustFuture` no longer needs to be wrapped inside a `Pin`. It simplifies
the code a little bit.
@Hywan Hywan force-pushed the feat-cancellable-async-func branch from 9ea66ca to db62d75 Compare August 2, 2023 12:08
Now, `RustFuture` constraints the `Future` with the `'static` lifetime
bound. This patch ensures that `uniffi_self` lives long enough, because
`rustc` was complaining about it. Nonetheless, a new problem arised:
`RustBuffer` cannot be sent inside a `RustFuture` because it doesn't
implement `Send`. And so far, it seems safe to implement `Send` for
`RustBuffer`, which is what this patch does.

And now, everything seems to work.
@bendk
Copy link
Contributor

bendk commented Aug 5, 2023

I also have hope that this patch will fix #1677 (comment), as now there is a double reference to the callbackHolder, which must ensure that the garbage collector won't collect it prematurely.

Can you explain this one more? I don't see where the extra reference gets stored.

Copy link
Contributor

@bendk bendk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like a lot about this PR: the doc updates, the fix for #1677. But I wonder if we should take a different approach to fixing the cancellation issue.

It seems to me that there's currently a fundamental problem with the ForeignExecutor scheduling: some scheduled calls never happen. The main cause seems to be that once the executor (AKA the CoroutineScope in Kotlin) is canceled then future scheduled calls will be ignored. Another cause that I'm not 100% sure about is that a call is successfully scheduled, but the executor is cancelled before that call fires. Because of this, we're leaking Arc pointers because we schedule a wake callback, but the callback never happens.

If that analysis is correct, then I'd rather update the ForeignExecutor API to handle this. I think there's 2 changes that need to be made:

  • ForeignExecutorCallback should return a value indicating if the callback was scheduled successfully. We can then check this value in schedule_do_wake() and if we see a failure then unleak the Arc pointer.
  • The callbacks themselves should get an additional arg specifying if the executor was cancelled. wake_callback() should check that and if there was a cancellation, it should still call Arc::from_raw() to unleak the pointer, but not call do_wake().

I think this will require a lot less changes on the Rust side. I also think this should be do-able from the Kotlin side. I think the first bullet is just a matter of checking isCancelled. The second bullet might be a bit more tricky, but it seems to me that there has to be a way to do that.

The main reason I'm suggesting this is that I want to start using the ForeignExecutor for more features (#1675). I think we're going to run into the same issue there if we don't fix things at the ForeignExecutor level here. Also, in general, it feels like this PR is working around a broken primitive operation rather than fixing the primitive itself.

@bendk
Copy link
Contributor

bendk commented Aug 5, 2023

I'm not totally sure that analysis is correct, feel free to push back if I missed something.

If you do think it's correct, I'd be happy to work on the Rust ForeignExecutor side of things if that would help since I'm already working with that code for #1675.

It exists a very rare case where it's possible to cancel a `Job`, and
thus a `CoroutineScope` before the FFI binding building and running
the `RustFuture` can be called, and thus, no `rustFuturePtr` can be
computed.

In this rare case, the `completionHandler` will run `rustFuturePtr!!`
and boom.

This patch catches this case by checking that the `rustFuturePtr` isn't
`null` before calling the `…_drop` function to drop the `RustFuture`.
@Hywan
Copy link
Contributor Author

Hywan commented Aug 7, 2023

Can you explain this one more? I don't see where the extra reference gets stored.

I'm not sure. I was surprised to see that this patch fixes the problem, but it wasn't my first intent. I've zero experience with Kotlin :-p.

@Hywan
Copy link
Contributor Author

Hywan commented Aug 7, 2023

@bendk

I like a lot about this PR: the doc updates, the fix for #1677. But I wonder if we should take a different approach to fixing the cancellation issue.

It seems to me that there's currently a fundamental problem with the ForeignExecutor scheduling: some scheduled calls never happen. The main cause seems to be that once the executor (AKA the CoroutineScope in Kotlin) is canceled then future scheduled calls will be ignored. Another cause that I'm not 100% sure about is that a call is successfully scheduled, but the executor is cancelled before that call fires. Because of this, we're leaking Arc pointers because we schedule a wake callback, but the callback never happens.

If that analysis is correct, then I'd rather update the ForeignExecutor API to handle this. I think there's 2 changes that need to be made:

  • ForeignExecutorCallback should return a value indicating if the callback was scheduled successfully. We can then check this value in schedule_do_wake() and if we see a failure then unleak the Arc pointer.

  • The callbacks themselves should get an additional arg specifying if the executor was cancelled. wake_callback() should check that and if there was a cancellation, it should still call Arc::from_raw() to unleak the pointer, but not call do_wake().

I think this will require a lot less changes on the Rust side. I also think this should be do-able from the Kotlin side. I think the first bullet is just a matter of checking isCancelled. The second bullet might be a bit more tricky, but it seems to me that there has to be a way to do that.

The main reason I'm suggesting this is that I want to start using the ForeignExecutor for more features (#1675). I think we're going to run into the same issue there if we don't fix things at the ForeignExecutor level here. Also, in general, it feels like this PR is working around a broken primitive operation rather than fixing the primitive itself.

To clarify, I did open the PR because we are hitting this problem hard, and it's blocking us. I know it's not the best fix possible, and I'm OK if you close this and start something from scratch :-). I'm aware this is a dressing on a deeper problem.

@Hywan
Copy link
Contributor Author

Hywan commented Aug 7, 2023

@bendk Your suggestion would work well with https://developer.apple.com/documentation/swift/withtaskcancellationhandler(operation:oncancel:) for Swift too. I've a branch to work with this, and it fits pretty well with the current design of the PR. It would fit pretty well with your suggestion too.

@Hywan
Copy link
Contributor Author

Hywan commented Aug 7, 2023

@bendk

The callbacks themselves should get an additional arg specifying if the executor was cancelled. wake_callback() should check that and if there was a cancellation, it should still call Arc::from_raw() to unleak the pointer, but not call do_wake().

The biggest problem with Kotlin is that you don't know when a Job has been cancelled, except if your Continuation has configured a invokeOnCancellation completion handler (CompletionHandler). There is no isCancelled property, like you have in Swift.

@jmartinesp
Copy link
Contributor

jmartinesp commented Aug 7, 2023

@bendk

The callbacks themselves should get an additional arg specifying if the executor was cancelled. wake_callback() should check that and if there was a cancellation, it should still call Arc::from_raw() to unleak the pointer, but not call do_wake().

The biggest problem with Kotlin is that you don't know when a Job has been cancelled, except if your Continuation has configured a invokeOnCancellation completion handler (CompletionHandler). There is no isCancelled property, like you have in Swift.

Actually, there is: on one hand, if you already have a reference to the Job you can just check Job.isCancelled. If you're inside a coroutine (or a suspend function) you can just check the underlying coroutineContext.isActive property, it will be false if it's either completed, cancelling or cancelled it I'm not mistaken.

Also, if you change the call from suspendCoroutine to suspendCancellableCoroutine, its Continuation has those same isActive/isCancelled properties.

@OtaK
Copy link

OtaK commented Aug 7, 2023

Just a heads-up, this PR (in its current state) seems to introduce a regression for the following situation:

// error: future cannot be sent between threads safely
// the trait `Send` is not implemented for `*const c_void`
// note: captured value is not `Send`
// required by a bound in `RustFuture::<T, UT>::new`

#[derive(uniffi::Object)]
pub struct MyFirstObj;

#[uniffi::export]
impl MyFirstObj {
    pub async fn do_thing(&self, second_obj: Arc<MySecondObj>) {}
    //                           ^^^^^^^^^^ has type `*const c_void` which is not `Send`
}

#[derive(uniffi::Object)]
pub struct MySecondObj;

Such code used to compile without issues on 0.24.3.

@Hywan
Copy link
Contributor Author

Hywan commented Aug 7, 2023

Yeah, because the Future has a Send + 'static bound. It was already the case before, but things are moved inside async move { … } and I think the compiler is complaining about some stuff now.

@jplatte
Copy link
Collaborator

jplatte commented Aug 7, 2023

@OtaK as per the discussion above, this PR is unlikely to be merged anyways. But it's interesting that we don't have a test case like that, it should probably be added to the next PR that aims to fix the cancellation problem.

@Hywan Hywan force-pushed the feat-cancellable-async-func branch from d0bc131 to 9e2ffac Compare August 7, 2023 15:47
bendk added a commit to bendk/uniffi-rs that referenced this pull request Aug 7, 2023
When we execute a future on Kotlin, we create a `jni.Callback` instance
that Rust calls when the Future is ready.  We need to hold a reference
to that object or else the JVM will finalize it and when Rust tries to
invoke the callback bad things will happen. In the unit tests this
causes them to stall because the Kotlin coroutine continutation is never
completed.

The `callbackHolder` variable was intended to hold a reference, but it
doesn't seem to be working (see [arg0d's analysis](mozilla#1677 (comment))).
I belive the issue is that variable isn't actually used anywhere so
Kotlin frees it before the coroutine has completed and the reason mozilla#1684
fixes the issue is that it uses the variable in the completion handler
(https://github.com/mozilla/uniffi-rs/pull/1684/files#diff-f564b605a6ba32d19bcb47128dc9c5c2f01169e2a41fa60ffaaf0b6eea13845cR34)

This commit changes things so instead of using a local variable, there's
a global set that stores all active callbacks.  It seems to fix the
issue for me when using arg0d's `repeat(1000) { ... }` test.
bendk added a commit to bendk/uniffi-rs that referenced this pull request Aug 7, 2023
When we execute a future on Kotlin, we create a `jni.Callback` instance
that Rust calls when the Future is ready.  We need to hold a reference
to that object or else the JVM will finalize it and when Rust tries to
invoke the callback bad things will happen. In the unit tests this
causes them to stall because the Kotlin coroutine continutation is never
completed.

The `callbackHolder` variable was intended to hold a reference, but it
doesn't seem to be working (see [arg0d's analysis](mozilla#1677 (comment))).
I belive the issue is that variable isn't actually used anywhere so
Kotlin frees it before the coroutine has completed and the reason mozilla#1684
fixes the issue is that it uses the variable in the completion handler
(https://github.com/mozilla/uniffi-rs/pull/1684/files#diff-f564b605a6ba32d19bcb47128dc9c5c2f01169e2a41fa60ffaaf0b6eea13845cR34)

This commit changes things so instead of using a local variable, there's
a global set that stores all active callbacks.  It seems to fix the
issue for me when using arg0d's `repeat(1000) { ... }` test.
@bendk
Copy link
Contributor

bendk commented Aug 7, 2023

@bendk

I like a lot about this PR: the doc updates, the fix for #1677. But I wonder if we should take a different approach to fixing the cancellation issue.
It seems to me that there's currently a fundamental problem with the ForeignExecutor scheduling: some scheduled calls never happen. The main cause seems to be that once the executor (AKA the CoroutineScope in Kotlin) is canceled then future scheduled calls will be ignored. Another cause that I'm not 100% sure about is that a call is successfully scheduled, but the executor is cancelled before that call fires. Because of this, we're leaking Arc pointers because we schedule a wake callback, but the callback never happens.
If that analysis is correct, then I'd rather update the ForeignExecutor API to handle this. I think there's 2 changes that need to be made:

  • ForeignExecutorCallback should return a value indicating if the callback was scheduled successfully. We can then check this value in schedule_do_wake() and if we see a failure then unleak the Arc pointer.
  • The callbacks themselves should get an additional arg specifying if the executor was cancelled. wake_callback() should check that and if there was a cancellation, it should still call Arc::from_raw() to unleak the pointer, but not call do_wake().

I think this will require a lot less changes on the Rust side. I also think this should be do-able from the Kotlin side. I think the first bullet is just a matter of checking isCancelled. The second bullet might be a bit more tricky, but it seems to me that there has to be a way to do that.
The main reason I'm suggesting this is that I want to start using the ForeignExecutor for more features (#1675). I think we're going to run into the same issue there if we don't fix things at the ForeignExecutor level here. Also, in general, it feels like this PR is working around a broken primitive operation rather than fixing the primitive itself.

To clarify, I did open the PR because we are hitting this problem hard, and it's blocking us. I know it's not the best fix possible, and I'm OK if you close this and start something from scratch :-). I'm aware this is a dressing on a deeper problem.

That makes sense. I've started some work on this and hoepfully will have a PR soon.

@bendk
Copy link
Contributor

bendk commented Aug 7, 2023

@Hywan are you able to reproduce this issue on swift? It seems to work for me when I try to write a similar unit test.

@bendk
Copy link
Contributor

bendk commented Aug 7, 2023

@Hywan are you able to reproduce this issue on swift? It seems to work for me when I try to write a similar unit test.

Nevermind about this, I just needed to adjust my timeouts.

bendk added a commit to bendk/uniffi-rs that referenced this pull request Aug 8, 2023
@bendk
Copy link
Contributor

bendk commented Aug 9, 2023

To clarify, I did open the PR because we are hitting this problem hard, and it's blocking us. I know it's not the best fix possible, and I'm OK if you close this and start something from scratch :-). I'm aware this is a dressing on a deeper problem.

I just made #1697 to address the deeper problem. How about closing this one?

@Hywan Hywan closed this Aug 10, 2023
Hywan added a commit to matrix-org/matrix-rust-sdk that referenced this pull request Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rust future seems to be leaked when cancelling Kotlin coroutine
7 participants