feat: implement java bindings #1928

Merged
merged 19 commits into from Feb 27, 2024

Conversation

Collaborator

@beinan beinan commented Feb 8, 2024

This pull request is still under development. It connects Java and Rust through JNI for both the write and read paths. Here's a summary of the progress:

Write Path:

  • The code currently sends data from Java to Rust through JNI.
  • arrow_c in Java and Arrow FFI in Rust are used for efficient data serialization and transfer.
  • Resource management ensures proper release on both sides, preventing memory leaks.

Read Path:

  • Development for the read path (data flow from Rust to Java) is ongoing within this PR.
  • The same approach with Arrow and JNI will be used for consistency and performance.

Next Steps:

  • Continuous development and testing on both write and read paths.
  • Open to feedback and suggestions on design and implementation for all aspects.

Highlights:

  • This PR establishes a single source for building a complete Java-Rust communication channel using JNI, which should integrate more easily with Java projects still running on Java 8, such as Presto, Spark, and Hive/Hadoop.
  • Arrow integration ensures optimized data handling and minimizes data copies.
  • Careful resource management prevents memory leaks and improves overall stability.

}

impl BlockingDataset {
pub fn write(reader: ArrowArrayStreamReader, uri: &str) -> Result<Self> {
Contributor

Are async calls not encouraged across JNI?

Collaborator Author

I'm just thinking that async calls in JNI might introduce complexity. I'd also prefer to rely on the thread management of the Java compute engine (I'm still aiming to implement a Lance connector for Trino/Presto). That said, an async API would be more efficient in many places. I think we could implement async JNI later, can we?
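
The trade-off in this thread, a blocking facade over an async core, can be sketched with the standard library alone. This is only an illustration: the PR itself drives a Tokio runtime via `rt.block_on(...)`, whereas the `block_on` below is a hand-rolled, minimal executor, and `AsyncDataset` with its `rows` field is a hypothetical stand-in for the real dataset type.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the thread that is blocked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal stand-in for a runtime's `block_on` (NOT Tokio): poll the future
/// on the calling thread and park the thread until the future is woken.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async dataset; stands in for the real async API.
pub struct AsyncDataset {
    rows: usize,
}

impl AsyncDataset {
    /// Returns a future, as an async method would.
    pub fn count_rows(&self) -> impl Future<Output = Result<usize, String>> {
        std::future::ready(Ok(self.rows))
    }
}

/// Synchronous facade that a JNI entry point could call directly,
/// leaving thread management to the Java side.
pub struct BlockingDataset {
    inner: AsyncDataset,
}

impl BlockingDataset {
    pub fn new(rows: usize) -> Self {
        Self { inner: AsyncDataset { rows } }
    }

    pub fn count_rows(&self) -> Result<usize, String> {
        block_on(self.inner.count_rows())
    }
}
```

With this shape, each JNI call occupies the calling Java thread until the future resolves, which is the simplicity-for-throughput trade described in the comment above.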

Hey @beinan, any plans to add a Trino connector for this? If so, any sense of ETA? Thanks for building this :)

@@ -0,0 +1,36 @@
use arrow::ffi_stream::ArrowArrayStreamReader;
Contributor

Let's add the Apache license to the headers :)

cargo clippy will pick this rule up. Let's make the GitHub Action run on this PR?

Collaborator Author

Good call, I will add GitHub Actions for both the Java and rust-jni projects.

)?;
Ok(BlockingDataset { inner, rt })
}
pub fn open(uri: &str) -> Result<Self> {
Contributor

It'd be great if we could pass Read / Write options from JNI.

Collaborator Author

Yeah, I'm also thinking about that. I will add the options very soon.

pub fn count_rows(&self) -> Result<usize> {
self.rt.block_on(self.inner.count_rows())
}
pub fn close(&self) {}
Contributor

@eddyxu eddyxu Feb 10, 2024

So input and output are FFI_ArrowArrayStream, iiuc?

Collaborator Author

I think the input of the write path can be either an FFI_ArrowArrayStream or an FFI_ArrowArray (I have implemented the stream API for write, and I'm thinking of implementing an append API for both structures). For the output of the read path, I think it will be something very similar, but I haven't had a chance to implement it yet.

@beinan beinan force-pushed the java_jni branch 2 times, most recently from 378ed98 to f809e86 Compare February 13, 2024 07:15
Collaborator Author

beinan commented Feb 13, 2024

Hi @eddyxu,

I integrated a mvn plugin to build the Rust project in Maven. You can run mvn clean package to generate the jar with the native library embedded.

Then you can find the jar at lance/java/target/lance-0.1-SNAPSHOT.jar

The structure of the jar is simple

jar tf target/lance-0.1-SNAPSHOT.jar
META-INF/MANIFEST.MF
META-INF/
nativelib/
nativelib/darwin-aarch64/
com/
com/lancedb/
com/lancedb/lance/
META-INF/maven/
META-INF/maven/com.lancedb/
META-INF/maven/com.lancedb/lance/
nativelib/darwin-aarch64/liblance_jni.dylib
com/lancedb/lance/Dataset.class
META-INF/maven/com.lancedb/lance/pom.xml
META-INF/maven/com.lancedb/lance/pom.properties

I only added one target platform, "darwin-aarch64", for macOS on ARM. (I will figure out how to add more target platforms later.)

It will also run the tests in both the Rust and Java projects.

By the way, I also removed quite a bit of redundant code.

@beinan beinan force-pushed the java_jni branch 4 times, most recently from b735c42 to 1dfe53a Compare February 16, 2024 23:18
matrix:
include:
- os: ubuntu-22.04
java-version: 8
Contributor

@eddyxu eddyxu Feb 16, 2024

I vote to not support java8 :) Even ok to not support java11 if it is up to me :)

@beinan beinan force-pushed the java_jni branch 7 times, most recently from 3c4badc to 824181c Compare February 17, 2024 05:51
@LuQQiu LuQQiu force-pushed the java_jni branch 3 times, most recently from 29bbe7c to 3447b43 Compare February 19, 2024 05:53
distribution: temurin
java-version: 17
cache: 'maven'
- run: echo "JAVA_17=$JAVA_HOME" >> $GITHUB_ENV
Contributor

Should we use a matrix to split it into two jobs?

Contributor

If you look at the CI job, testing another Java version only took 6s. If we split to another job, it would have to recompile the Rust (and also do the installation steps). I think we should keep as-is so that our CI usage is kept minimal.

Collaborator Author

Thank you for the comments, @wjones127! Yes, that's why I merged these two together, as recompiling the Rust part really takes a lot of time.

pull_request:
paths:
- java/**
- rust/**
Contributor

If we do expect to run this every time the Rust code changes, could we make this package a workspace package alongside the rest of the Rust code, sharing the same workspace Cargo.toml?

Collaborator Author

It sounds great to me!

Contributor

@wjones127 wjones127 left a comment

Hi @beinan. Thanks for working on this.

The things I would like to see:

  1. Make the Tokio runtime a lazy static, so we aren't recreating runtimes.
  2. Make this crate part of the workspace, if possible.
  3. Minimize scope of your unsafe calls.

I think additional features like supporting ReadParams can be left for a future PR. It's more important that we get the foundation of FFI and the async runtime working.

Comment on lines 51 to 61
unsafe {
match ArrowArrayStreamReader::from_raw(arrow_array_stream_addr as *mut FFI_ArrowArrayStream)
{
Contributor

Let's make sure when we use unsafe blocks, we only put the minimal unsafe parts there and leave a comment why the calls are safe.

let stream_ptr = arrow_array_stream_addr as *mut FFI_ArrowArrayStream;
// SAFETY: the pointer is received directly from Java's
// ArrowArrayStream.memory_address(), which guarantees to return a non-null
// valid pointer.
let stream = unsafe { ArrowArrayStreamReader::from_raw(stream_ptr) };
match stream { .... }
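
The same pattern can be tried without the Arrow or JNI crates. The sketch below is an assumption-laden stand-in: `FfiStream`, `export_stream`, and `import_stream` are hypothetical names, and `Box::from_raw` plays the role that `ArrowArrayStreamReader::from_raw` plays in the PR. It shows the address travelling as a 64-bit integer (as it would through a JNI `long`), with the cast and null check in safe code and only the ownership-taking call inside `unsafe`.

```rust
/// Hypothetical stand-in for an FFI struct such as `FFI_ArrowArrayStream`.
#[repr(C)]
pub struct FfiStream {
    pub batches: u32,
}

/// Simulates the exporting side: leak a heap allocation and hand its address
/// over as a 64-bit integer, the way a pointer travels through a JNI `long`.
pub fn export_stream(batches: u32) -> i64 {
    Box::into_raw(Box::new(FfiStream { batches })) as i64
}

/// Importing side: validation and the pointer cast are safe code; only the
/// call that takes ownership of the allocation sits inside `unsafe`.
pub fn import_stream(addr: i64) -> Result<Box<FfiStream>, String> {
    if addr == 0 {
        return Err("null stream address".to_string());
    }
    let ptr = addr as *mut FfiStream;
    // SAFETY: `addr` is non-zero and, by contract with the caller, was
    // produced by `export_stream` and has not been imported before, so `ptr`
    // points to a valid, uniquely owned allocation.
    let stream = unsafe { Box::from_raw(ptr) };
    Ok(stream)
}
```

Keeping the `unsafe` block to a single expression makes the SAFETY comment easy to audit: it only has to justify one call.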

Contributor

+1

Contributor

@eddyxu This is still outstanding

Contributor

Fixed

Comment on lines 32 to 34
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()?;
Contributor

Could you move this to a lazy static variable? We don't want to create a new runtime every time we write to a dataset.

Note, however, in the future we should probably port the executor we did in Python #1172. This allows things like writing to a Lance dataset from a scan of another Lance dataset, and support for KeyboardInterrupt (#1438).
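
One way to read the "lazy static" request is a process-wide value initialized on first use. The sketch below uses `std::sync::OnceLock` from the standard library; the `Runtime` struct is a hypothetical placeholder for a real async runtime (such as the Tokio runtime built in the diff above), and `BUILD_COUNT` exists only to make the single initialization observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the runtime has been constructed (for illustration).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical placeholder for an async runtime handle.
pub struct Runtime {
    pub worker_threads: usize,
}

impl Runtime {
    fn build() -> Self {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        Runtime { worker_threads: 1 }
    }
}

/// Process-wide slot that is filled at most once.
static RUNTIME: OnceLock<Runtime> = OnceLock::new();

/// Returns the shared runtime, building it on the first call only.
pub fn runtime() -> &'static Runtime {
    RUNTIME.get_or_init(Runtime::build)
}

/// Number of times `Runtime::build` has run so far.
pub fn build_count() -> usize {
    BUILD_COUNT.load(Ordering::SeqCst)
}
```

Every write or open call would then borrow `runtime()` instead of constructing its own runtime, so the construction cost is paid once per process.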

@beinan beinan force-pushed the java_jni branch 3 times, most recently from 392454e to 43d6959 Compare February 21, 2024 01:19
Contributor

eddyxu commented Feb 26, 2024

@beinan @LuQQiu thanks for working on this. Please let us know what else we can do to help move this forward. We are excited to have a Java SDK!

uses: actions/setup-java@v4
with:
distribution: temurin
java-version: 11
Contributor

this should be 17 instead?

Contributor

fixed

@@ -35,9 +35,6 @@ jobs:
toolchain:
- stable
- nightly
defaults:
Contributor

Contributor

fixed.

@@ -0,0 +1,26 @@
[package]
name = "lance-jni"
version = "0.1.0"
Contributor

Should this version stay consistent with the Rust crate?

Contributor

Sure, I can use the Rust version. Still need to find a way to bump the Maven version, I guess.

use jni::JNIEnv;

pub fn throw_java_exception(env: &mut JNIEnv, err_msg: &str) {
env.throw_new("java/lang/RuntimeException", err_msg)
Contributor

nit: should we make a com.lancedb.lance.LanceException class? a bare RuntimeException might make catch statements unnecessarily wide

Contributor

ok, will do as follow up.
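
A possible shape for that follow-up on the Rust side is sketched below. It is speculative: `BindingError` and its variants are hypothetical, and `com/lancedb/lance/LanceException` is the class proposed in the comment above, not one that exists in this PR. The idea is to pick the Java exception class per error kind and pass the resulting class name and message to `env.throw_new`, as `throw_java_exception` already does with `java/lang/RuntimeException`.

```rust
/// Hypothetical error categories for the JNI layer.
#[derive(Debug)]
pub enum BindingError {
    /// Failure reported by the dataset library itself.
    Lance(String),
    /// Bad input coming from the Java caller.
    InvalidArgument(String),
}

impl BindingError {
    /// JNI-style (slash-separated) name of the Java class to throw.
    /// `LanceException` is the class proposed in review, assumed here.
    pub fn java_class(&self) -> &'static str {
        match self {
            BindingError::Lance(_) => "com/lancedb/lance/LanceException",
            BindingError::InvalidArgument(_) => "java/lang/IllegalArgumentException",
        }
    }

    /// Message to attach to the thrown exception.
    pub fn message(&self) -> &str {
        match self {
            BindingError::Lance(msg) | BindingError::InvalidArgument(msg) => msg,
        }
    }
}
```

Java callers could then catch the narrower class instead of every `RuntimeException`.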

eddyxu and others added 2 commits February 27, 2024 09:23
Co-authored-by: Will Jones <willjones127@gmail.com>
@eddyxu eddyxu merged commit e79db4f into lancedb:main Feb 27, 2024
17 checks passed
chebbyChefNEQ added a commit that referenced this pull request Feb 28, 2024
#1928 moved the workspace `Cargo.toml` to the root directory
6 participants