Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make JObject and JNIEnv !Copy to prevent use-after-free #398

Merged
merged 2 commits into from
Jan 17, 2023

Conversation

argv-minus-one
Copy link
Contributor

@argv-minus-one argv-minus-one commented Dec 29, 2022

Overview

The over-arching change here is that local reference types (like JObject) are !Copy and any APIs that may allocate new local references must take a mutable reference on the JNIEnv.

These rules then mean you can no longer easily access a copy of a local reference after it has been deleted and also mean you can't easily copy local references outside of a local reference frame created using with_local_frame (where they would be invalid)

An inter-related change also addresses the leak described in #109 as the Desc trait was updated to account for non-Copy reference types.

This PR has the changes discussed in #392.

This PR is the result of squashing a branch into a single commit. Here is the original, unsquashed branch.

Closes #392
Closes #384
Closes #381
Closes #109

Definition of Done

  • There are no TODOs left in the code
  • Change is covered by automated tests
  • The coding guidelines are followed
  • Public API has documentation
  • User-visible changes are mentioned in the Changelog
  • The continuous integration build passes

@rib
Copy link
Contributor

rib commented Dec 31, 2022

Cool, thanks for creating the PR for this work.

I think I'll try and port the Android backend for Bluey (https://github.com/rib/bluey) to this branch to help get a feel for this and I have a work-in-progress Winit branch I may update too (which I was getting frustrated with due to the terrible ergonomics of AutoLocal).

If you have any real-world code using jni-rs that you can port over to this too it would be good if you're able to try an initial port to see if it helps us catch something we've overlooked.

A few minor things initially:

  • If you can reformat how you list the issues that are closed in the commit message (list each Closes: #1234 on a separate line) then GitHub will automatically track this and automatically close those issues when we merge this work. (we can also do the same by editing the PR message but it might not be a bad idea to tweak this in the original commit message)
  • If you wouldn't mind, could you maybe add a Co-authored-by: Robert Bragg <robert@sixbynine.org> attribution to the end of the commit message, for hopefully helping figure this work out, e.g. with the prototyping too.

Considering the foot gun with the Desc trait I wonder if we should be more explicitly discouraging any use of the trait for lookups - and explain that lookups are only really intended to be done internally. My general assumption is that users of jni-rs are unlikely to be doing lookups themselves. Unless I'm overlooking some good use case for user code to be doing lookups I wonder about taking this further later on and seeing if we can move the lookup API to a private trait that the public Desc trait depends on. (though not as part of this PR/issue)

@rib
Copy link
Contributor

rib commented Jan 2, 2023

So far in updating the Android backend for Bluey it's been a fair number of changes which have been fairly straightforward.

One place were I did scratch my head for a while was with updating some native JNI functions that had a nested closure for catching Result errors, something like:

impl BleSessionNative {

    extern "C" fn on_companion_device_select(
        mut env: JNIEnv,
        session: JObject,
        session_handle: JHandle<AndroidSession>, // (jlong wrapper)
        device: JObject, // BluetoothDevice
        address: JString,
        name: JString,
    ) {
        let result = (|| -> Result<()> { // Like a try{} block since the JNI func doesn't return Result
            let address = env
                .get_string(&address)?
                .to_str()
                .map_err(|err| {
                    Error::Other(anyhow!("JNI: invalid utf8 for returned String: {:?}", err))
                })?
                .to_string();
            Ok(())
        })();
        // ...
    }
}

and getting a compiler error like:

error: lifetime may not live long enough
   --> bluey\src\android\session.rs:622:27
    |
613 |           mut env: JNIEnv,
    |           ------- has type `jni::JNIEnv<'2>`
...
617 |           address: JString,
    |           ------- has type `jni::objects::JString<'1>`
...
622 |               let address = env
    |  ___________________________^
623 | |                 .get_string(&address)?
    | |_____________________________________^ argument requires that `'1` must outlive `'2`

and it took me a while to realize that I can no longer rely on the default behaviour for lifetime elision which will bind all unnamed lifetimes to a unique lifetime. I had to update the signature to look like this instead:

    extern "C" fn on_companion_device_select<'local>(
        mut env: JNIEnv<'local>,
        session: JObject<'local>,
        session_handle: JHandle<AndroidSession>, // (jlong wrapper)
        device: JObject<'local>, // BluetoothDevice
        address: JString<'local>,
        name: JString<'local>,
    ) {
}

to make sure all the object references were bound to the same 'local lifetime

I think I saw that you updated the docs to highlight that JNI functions will need to declare the JNIEnv argument as being mutable now but we maybe also need to highlight the need to connect all object reference arguments to the same lifetime too.

rib added a commit to rib/bluey that referenced this pull request Jan 2, 2023
Updates JNI code for non-Copy object references + plumbing mutable
JNIEnv references for all APIs that may create new local references.

Ref: jni-rs/jni-rs#398
Ref: jni-rs/jni-rs#392
@argv-minus-one
Copy link
Contributor Author

argv-minus-one commented Jan 3, 2023

If you have any real-world code using jni-rs that you can port over to this too it would be good if you're able to try an initial port to see if it helps us catch something we've overlooked.

Will do, but I have some other work on my project that I need to finish first. I'm hoping to have it done this week, and then I'll migrate it over. I'll let you know when I'm done or if anything comes up.

  • If you can reformat how you list the issues that are closed in the commit message (list each Closes: #1234 on a separate line) then GitHub will automatically track this and automatically close those issues when we merge this work. (we can also do the same by editing the PR message but it might not be a bad idea to tweak this in the original commit message)

  • If you wouldn't mind, could you maybe add a Co-authored-by: Robert Bragg <robert@sixbynine.org> attribution to the end of the commit message, for hopefully helping figure this work out, e.g. with the prototyping too.

Done and done.

Considering the foot gun with the Desc trait I wonder if we should be more explicitly discouraging any use of the trait for lookups - and explain that lookups are only really intended to be done internally. My general assumption is that users of jni-rs are unlikely to be doing lookups themselves. Unless I'm overlooking some good use case for user code to be doing lookups I wonder about taking this further later on and seeing if we can move the lookup API to a private trait that the public Desc trait depends on. (though not as part of this PR/issue)

Sounds like a good idea. I never used Desc outside of the jni crate other than as a trait bound, so we could and probably should hide it.

One place were I did scratch my head for a while was with updating some native JNI functions that had a nested closure for catching Result errors, something like:

impl BleSessionNative {

    extern "C" fn on_companion_device_select(
        mut env: JNIEnv,
        session: JObject,
        session_handle: JHandle<AndroidSession>, // (jlong wrapper)
        device: JObject, // BluetoothDevice
        address: JString,
        name: JString,
    ) {
        let result = (|| -> Result<()> { // Like a try{} block since the JNI func doesn't return Result
            let address = env
                .get_string(&address)?
                .to_str()
                .map_err(|err| {
                    Error::Other(anyhow!("JNI: invalid utf8 for returned String: {:?}", err))
                })?
                .to_string();
            Ok(())
        })();
        // ...
    }
}

and getting a compiler error like:

error: lifetime may not live long enough
   --> bluey\src\android\session.rs:622:27
    |
613 |           mut env: JNIEnv,
    |           ------- has type `jni::JNIEnv<'2>`
...
617 |           address: JString,
    |           ------- has type `jni::objects::JString<'1>`
...
622 |               let address = env
    |  ___________________________^
623 | |                 .get_string(&address)?
    | |_____________________________________^ argument requires that `'1` must outlive `'2`

and it took me a while to realize that I can no longer rely on the default behaviour for lifetime elision which will bind all unnamed lifetimes to a unique lifetime. I had to update the signature to look like this instead:

    extern "C" fn on_companion_device_select<'local>(
        mut env: JNIEnv<'local>,
        session: JObject<'local>,
        session_handle: JHandle<AndroidSession>, // (jlong wrapper)
        device: JObject<'local>, // BluetoothDevice
        address: JString<'local>,
        name: JString<'local>,
    ) {
}

to make sure all the object references were bound to the same 'local lifetime

I think this was caused by the lifetimes on JNIEnv::get_string (and get_{map,list,string_unchecked}) being more restrictive than they actually need to be. I've loosened them (see unsquashed commit), re-squashed, and pushed d82b1f1. Give that a try.

@argv-minus-one
Copy link
Contributor Author

I've ported over my project to use this new jni crate. It was fairly uneventful; for the most part, everything works as expected.

Nested mut calls again

The nested mut calls problem that I mentioned earlier has also made this pattern unusable:

env.new_object(
	"com/example/SomeClass",
	"(Ljava/lang/String;)V",
	&[env.new_string("some string")?.into()],
)

Instead I have to do this:

let s: JString = env.new_string("some string")?;

env.new_object(
	"com/example/SomeClass",
	"(Ljava/lang/String;)V",
	&[(&s).into()],
)

Mildly annoying, but not the end of the world.

get_primitive_array_critical

This change has revealed a case of undefined behavior in my project: I was using JNIEnv::get_primitive_array_critical and then performing another JNI call without first dropping the AutoPrimitiveArray. This is no longer allowed because AutoPrimitiveArray borrows the JNIEnv.

I've added commit bc8389b that makes it exclusively borrow the JNIEnv, since the JNI spec says there must not be any other JNI calls inside the critical region.

@rib
Copy link
Contributor

rib commented Jan 5, 2023

I think this was caused by the lifetimes on JNIEnv::get_string (and get_{map,list,string_unchecked}) being more restrictive than they actually need to be. I've loosened them (see unsquashed commit), re-squashed, and pushed d82b1f1. Give that a try.

okey, yeah this makes sense.

Conceptually it feels a bit more correct to tie the lifetimes together explicitly in the signature of the native function but at the same time it also makes sense that get_string would work with disjoint local lifetimes here so it's convenient to not need to tie the lifetimes in this case.

@rib
Copy link
Contributor

rib commented Jan 5, 2023

I've added commit bc8389b that makes it exclusively borrow the JNIEnv, since the JNI spec says there must not be any other JNI calls inside the critical region.

Right, this looks good too.

@rib
Copy link
Contributor

rib commented Jan 5, 2023

While looking at updating the Winit branch I have then I'm reminded that the ergonomics for with_local_frame are pretty bad currently. It's not really a problem caused by this branch but just wanted to note it.

In this case I have code that's running in a thread that doesn't just return to Java (so there's no implicit local frame) and instead of creating lots of AutoLocals (error prone) I'd much rather be using with_local_frame - which I'm now more inclined towards with the changes made here.

The poor ergonomics (imho) come from not having an elegant way to generically return whatever you'd like from the with_local_frame closure. It's a bit of a fiddly problem too since it makes sense that there would be a Result that can report JNI errors from the PushLocalFrame and there should be a way to optionally return a local reference from the closure too.

I'm wondering if the Ok side of the outer Result should be generic (potentially a nested Result) and maybe instead of requiring the closure to return a JObject it could let you return whatever you like and there would be an out param of some form that can optionally be used to pass back a local reference. (My guess is that returning a local reference isn't actually the most likely thing you'd want to do even though it's a feature supported by JNI that we should support too)

Just thinking out loud at the moment, this probably doesn't need to be addressed as part of this PR.

@argv-minus-one
Copy link
Contributor Author

Is this what you have in mind?

pub fn with_local_frame_2<T, F>(&mut self, capacity: i32, f: F) -> Result<(T, JObject<'local>)>
where
    F: for<'new_local> FnOnce(&mut JNIEnv<'new_local>) -> Result<(T, JObject<'new_local>)>,
{
    unsafe {
        self.push_local_frame(capacity)?;
        let res = f(self);
        match res {
            Ok((t, obj)) => Ok((t, self.pop_local_frame(&obj)?)),
            Err(e) => {
                self.pop_local_frame(&JObject::null())?;
                Err(e)
            }
        }
    }
}

This would have been a big footgun before this PR, because you could easily return an invalid local reference through T, but now it's safe:

fn example(env: &mut JNIEnv) -> Result<()> {
    let (obj, _): (JString, JObject) = env.with_local_frame_2(1, |env| {
        let s = env.new_string("hi")?;
        Ok((s, JObject::null()))
    })?;

    Ok(())
}
error: lifetime may not live long enough
 --> tests/frame_example.rs:6:4
  |
4 |     let (obj, _): (JString, JObject) = env.with_local_frame_2(1, |env| {
  |                                                                   ---- return type of closure is Result<(jni::objects::JString<'2>, jni::objects::JObject<'_>), jni::errors::Error>
  |                                                                   |
  |                                                                   has type `&mut JNIEnv<'1>`
5 |             let s = env.new_string("hi")?;
6 |             Ok((s, JObject::null()))
  |             ^^^^^^^^^^^^^^^^^^^^^^^^ returning this value requires that `'1` must outlive `'2`

@rib
Copy link
Contributor

rib commented Jan 6, 2023

Is this what you have in mind?

At least something along these lines.

Instead of returning a tuple from the closure I was wondering about a second argument that would act as an out parameter for optionally returning a local reference.

At least so far in the places I've wanted to use with_local_frame I've never actually wanted to return a local reference and it's not clear to me how often that feature is really needed so I feel like it could be good to focus on the ergonomics of returning other values and not always have to pass back a redundant JObject::null().

Another consideration is that even without that feature it's still possible to send global references back (and since those release on drop I probably wouldn't be worried about leaking global references) - so the feature is mainly just an optimization (though at least for me it's an optimization for something I haven't needed yet)

Maybe there could even be a separate version that would support passing a local reference back.

@rib
Copy link
Contributor

rib commented Jan 6, 2023

Maybe it wouldn't really be unreasonable to just not expose that feature of being able to pass back a local reference - and instead just suggest that a global reference can be used instead.

This could simplify the API for common cases perhaps.

Another API could be added later if there really are some use cases where it's critical to optimize the efficiency of passing back a single local reference without going via a global reference.

@rib
Copy link
Contributor

rib commented Jan 6, 2023

So thinking of simplifying it to something like:

    /// Executes the given function in a new local reference frame, in which at least a given number
    /// of references can be created. Once this method returns, all references allocated
    /// in the frame are freed.
    ///
    /// If a frame can't be allocated with the requested capacity for local
    /// references, returns `Err` with a pending `OutOfMemoryError`.
    ///
    /// Since local references created within this frame won't be accessible to the calling
    /// frame then if you need to pass an object back to the caller then you can do that via a
    /// [`GlobalRef`] / [`Self::make_global`].
    pub fn with_local_frame<F, R>(&mut self, capacity: i32, f: F) -> Result<R>
    where
        F: FnOnce(&mut JNIEnv) -> R,
    {
        unsafe {
            self.push_local_frame(capacity)?;
            let ret = f(self);
            self.pop_local_frame(&JObject::null())?;
            Ok(ret)
        }
    }

@rib
Copy link
Contributor

rib commented Jan 6, 2023

So then code like:

        let mut result = None

        jenv.with_local_frame(10, |env| {
            result = Some(Self::handle_io_request_with_jni_local_frame(env, state, session, request, peripheral_handle, ble_device));
            Ok(JObject::null())
        })?;

        result.unwrap()

can become:

    jenv.with_local_frame(10, |env| {
        Self::handle_io_request_with_jni_local_frame(env, state, session, request, peripheral_handle, ble_device)
    })?

@argv-minus-one
Copy link
Contributor Author

I can do that, sure. Seems like a shame to give up the performance gain of returning a local reference, though. Remember that global references are about an order of magnitude slower to create and delete.

The jni crate itself returns a local reference from a frame in one place. My project used to do so in a lot of places as well, where it would start a local frame, prepare parameters for a Java constructor, call the constructor, and return a reference to the constructed object out of the local frame. (I say “used to” because I later decided it would be simpler and safer to serialize objects and pass around byte arrays instead.)

We could support both:

pub fn with_local_frame<F, R>(&mut self, capacity: i32, f: F) -> Result<R>
where
    F: for<'new_local> FnOnce(&mut JNIEnv<'new_local>) -> Result<R>,
{
    let (r, _) = self.with_local_frame_returning_local(capacity, |env| {
        let r = f(env)?;
        Ok((r, JObject::null()))
    })?;
    Ok(r)
}

pub fn with_local_frame_returning_local<F, R>(&mut self, capacity: i32, f: F) -> Result<(R, JObject<'local>)>
where
    F: for<'new_local> FnOnce(&mut JNIEnv<'new_local>) -> Result<(R, JObject<'new_local>)>,
{
    unsafe {
        self.push_local_frame(capacity)?;
        let res = f(self);
        match res {
            Ok((r, obj)) => Ok((r, self.pop_local_frame(&obj)?)),
            Err(e) => {
                self.pop_local_frame(&JObject::null())?;
                Err(e)
            }
        }
    }
}

What do you think?

@rib
Copy link
Contributor

rib commented Jan 9, 2023

Seems like a shame to give up the performance gain of returning a local reference, though. Remember that global references are about an order of magnitude slower to create and delete.

even though there's a performance difference between global/local refs in normal usage there are other factors here too:

  1. e.g. how often are local references passed back from a local frame? (I'd guess that's going to be a tiny proportion of all local references surely, so I'd expect it to be a drop in the ocean either way)
  2. how does JNI implement local reference pass back - I wouldn't be surprised if implementations would even implement the pass back transparently via a temporary global reference. A local reference can't magically be passed back to an outer frame without additional bookkeeping so at the very least there is some unquantified extra work involved within the JNI implementation.

Those two things together make me tend to think you'd be hard pushed to find a use case that would ever notice the performance difference but maybe if there are enough times where it's exactly what you want it might still be good to have an alternative API for that case.

Instead of stacking the APIs though I think it could be better to just keep them separate and simplify them both.

I also think the closures shouldn't really be returning a jni::error::Result since the closures are for user code that is likely to have its own Result types and it doesn't seem appropriate that user code should be expected to map all its errors back to a jni-rs Error.

If we have a version for returning a local reference that one probably doesn't need to return a tuple, but it could be good it returned a Result with a generic E error parameter instead of forcing user code to return a jni::error::Result.

(I think these same issues with the Result are also applicable to the Executor API wrappers for with_local_frame too)

Here's an initial patch I made for this the other day (just the with_local_frame change I was imagining which I also tested in Bluey afterwards) - sorry didn't get a chance to post them at the time: rib@daaf7ec

Here's an initial patch with the with_local_frame change I was imagining and a with_local_frame_returning_local API too: https://github.com/rib/jni-rs/tree/general-with-local-frame.

@rib
Copy link
Contributor

rib commented Jan 9, 2023

Making a micro benchmark to roughly quantify the difference between returning a global reference that is then converted into a local vs directly returning a local I get numbers like:

test tests::jni_noop_with_local_frame                         ... bench:          59 ns/iter (+/- 5)
test tests::jni_with_local_frame_returning_global_to_local    ... bench:       2,104 ns/iter (+/- 181)
test tests::jni_with_local_frame_returning_local              ... bench:       1,669 ns/iter (+/- 99)

so apparently about 26% slower to return a global and then create a local from that vs returning a local.

at least for me that tends to make me think it's super unlikely for that kind of difference to matter in practice but it doesn't really do any harm to have two versions.

@rib
Copy link
Contributor

rib commented Jan 10, 2023

Just to avoid getting too side-tracked with the with_local_frame discussion lets perhaps clarify that the scope of this PR is generally limited it to making local reference types non-Copy and so it'd probably make sense to finish this discussion in a follow up PR.

Any initial thoughts on https://github.com/rib/jni-rs/tree/general-with-local-frame would be great to hear (esp if we think we should change something related to making references non-Copy) but hopefully we can aim to land this soon without making this discussion a dependency for that.

In terms of porting code over to this branch for smoke testing I'm currently also using the above with_local_frame changes in Bluey / Winit since that's practically what I found I wanted the API to look like, but I feel like that's not an issue that's caused by this PR.

I'll probably try to make another review pass over everything - mainly trying to think about the pov of people porting code from 0.20 - but I think atm that this is probably good to land soon.

@rib rib changed the title Fix use-after-free (#392) and local reference leaks (#109) Make JObject and JNIEnv !Copy to prevent use-after-free Jan 10, 2023
@argv-minus-one
Copy link
Contributor Author

Ok. I've opened #399 to continue the discussion about with_local_frame.

In the mean time, is there anything else you'd like me to do with this PR?

…#392). Fix local reference leaks (jni-rs#109). Extensive breaking API changes (see changelog).

See jni-rs#392 for the discussion that led to this commit.

Closes jni-rs#392
Closes jni-rs#384
Closes jni-rs#381
Closes jni-rs#109

Co-authored-by: Robert Bragg <robert@sixbynine.org>
@argv-minus-one
Copy link
Contributor Author

argv-minus-one commented Jan 10, 2023

I've changed the commit message to mention that this also fixes #384. (I didn't notice that issue until just now.) GlobalRef and WeakRef still do erase the type, but AutoLocal doesn't.

@rib
Copy link
Contributor

rib commented Jan 10, 2023

In the mean time, is there anything else you'd like me to do with this PR?

Not atm, and hopefully it'll be good to land as is. I was just aiming to get time to take another pass over it. I'm a little busy this week but hopefully we can land this pretty soon and then tie up a few loose ends before hopefully looking at a 0.21 release too.


use log::{debug, warn};

use crate::{errors::Result, objects::JObject, sys, JNIEnv, JavaVM};

// Note: `GlobalRef` must not implement `Into<JObject>`! If it did, then it would be possible to
// wrap it in `AutoLocal`, which would cause undefined behavior upon drop as a result of calling
// the wrong JNI function to delete the reference.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this documented special-case consideration that we need to be careful to not forget I wonder if we should look at adding an IsLocalRef trait (or similar) that can be used like a label/tag for all the types that can be safely wrapped via AutoLocal.

We can consider this in a follow up issue / PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's possible. It's unsafe to wrap JObject<'static> in AutoLocal, but it's not unsafe to wrap JObject with any other lifetime in AutoLocal, and I don't think there's any way to implement a trait for all lifetimes except 'static.

We could add another wrapper type, Local<'local, T>, where T is JObject or something that wraps it, and 'local is the local reference frame it was created in. Then, JNIEnv::auto_local accepts Local<'local, T> and returns AutoLocal<'local, T>. JObject and the other object wrapper types would no longer have a lifetime parameter and would become borrowed-only types like str, never appearing in owned form.

This would also make it possible to generify GlobalRef and WeakRef without using a GAT to change the lifetime.

This would be quite the footgun, though, since every extern fn that takes a parameter of type JObject/JClass/etc has to be changed to take Local<JObject>/Local<JClass>/etc, and if you forget any, you get undefined behavior! Rust has a way to solve this problem, extern type, but it's unstable and has been for a long time.

@rib rib merged commit f0afa3e into jni-rs:master Jan 17, 2023
@rib
Copy link
Contributor

rib commented Jan 17, 2023

Okey, I had another read through the PR and also gave one of my projects (Bluey-UI) a smoke test on Android after updating it to use the latest branch based on this work (from #399)

There will be a few things to follow up on related to this before making the next 0.21 release, and it'll probably be good to think about having some guidance / hints to add to the release notes about some of the churn that downstream users should expect when updating.

Big thanks @argv-minus-one for working through all of the knock-on details that arose while making this change!

This definitely feels like a step in the right direction for making jni-rs safer to use.

rib added a commit to rib/jni-rs that referenced this pull request Feb 1, 2023
This was recently marked as deprecated since landing jni-rs#398
meant we could also get a reference to a JObject via an implementation
of `AsRef<JObject>`.

Since there are multiple `AsRef<T>` implementations for `GlobalRef`
there are times when the compiler isn't able to infer which
implementation of `.as_ref()` you are trying to call, and so it's
still convenient to have a less ambiguous `.as_obj()` API that can
be used instead.

Closes: jni-rs#408
@rib rib added this to the 0.21 milestone Feb 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment