feat(java): add spark catalog basic batch write #2133
Conversation
ACTION NEEDED Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
 * @param params write params
 * @return Dataset
 */
public static Dataset createEmptyDataSet(String path, Schema schema,
Nit: rename to `createEmptyDataset`.
try (RootAllocator allocator = new RootAllocator();
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
  ByteArrayOutputStream schemaOnlyOutStream = new ByteArrayOutputStream();
  try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null,
Writing data is really tedious this way. Is there any way we can improve the Java / Rust API?
GetSchema is added; next will be adding the Spark write support.
I'll work on designing/improving the write APIs after we have a clear idea of how write is called.
(I do feel it's very ugly now.)
  return new ArrowType.Utf8();
} else if (dataType instanceof DoubleType) {
  return new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE);
} else if (dataType instanceof FloatType) {
Curious how the FixedSizeListArray is to be mapped in Spark.
Hmm, Spark's ArrowUtils only deals with ArrowType.List and has no conversion for FixedSizeListArray.
Can we add some metadata / hints to help here?
No problem, I'll see how to deal with FixedSizeListArray and also check other commonly used types.
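One possible direction for the FixedSizeList gap combined with the metadata/hints idea above: surface a fixed-size list to Spark as a plain array type and stash the fixed width in field metadata so it can be restored on write. The sketch below is a pure-JDK illustration of that decision logic only; the `lance:fixed_size` metadata key and the string stand-ins for real Arrow/Spark type objects are hypothetical, not the project's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: since Spark's ArrowUtils cannot represent
// FixedSizeList<elem, n>, map it to a plain ArrayType(elem) and record
// the fixed width n in field metadata so a writer could restore it.
public class FixedSizeListFallback {
  static String toSparkType(String arrowElemType, int fixedWidth,
                            Map<String, String> metadataOut) {
    // "lance:fixed_size" is an assumed metadata key, not a real one.
    metadataOut.put("lance:fixed_size", Integer.toString(fixedWidth));
    return "ArrayType(" + arrowElemType + ")";
  }

  public static void main(String[] args) {
    Map<String, String> md = new HashMap<>();
    String sparkType = toSparkType("Float32", 128, md);
    System.out.println(sparkType + " " + md.get("lance:fixed_size"));
    // prints "ArrayType(Float32) 128"
  }
}
```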
String tableName = "dev.db.lance_df_table";
// Same as create + insert
data.writeTo(tableName).using("lance").create();
spark.table(tableName).show();
Could you also check that manifest files have been created under the directory?
Will add that to the test, thanks!
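For the manifest check requested above, a minimal JDK-only sketch of such a test assertion, assuming the Lance dataset layout keeps manifests as `_versions/*.manifest` under the dataset root (the `hasManifest` helper and the directory layout here are assumptions, not the project's actual test code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ManifestCheck {
  // Hypothetical helper: true if any *.manifest file exists under the
  // dataset's _versions directory (assumed Lance layout).
  static boolean hasManifest(Path datasetRoot) throws IOException {
    Path versions = datasetRoot.resolve("_versions");
    if (!Files.isDirectory(versions)) {
      return false;
    }
    try (Stream<Path> files = Files.list(versions)) {
      return files.anyMatch(p -> p.getFileName().toString().endsWith(".manifest"));
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate a freshly written table directory for the sketch.
    Path root = Files.createTempDirectory("lance_df_table");
    Files.createDirectories(root.resolve("_versions"));
    Files.createFile(root.resolve("_versions").resolve("1.manifest"));
    System.out.println(hasManifest(root)); // prints "true"
  }
}
```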
Force-pushed da7ac2c to 867e23a
Some minor issues
java/core/lance-jni/src/traits.rs
Outdated
@@ -120,3 +124,32 @@ impl JMapExt for JMap<'_, '_, '_> {
        get_map_value(env, self, key)
    }
}

pub struct SingleRecordBatchReader {
What is this for?
Can we reuse arrow's RecordBatchIterator?
  }
}

private static DataType convert(org.apache.arrow.vector.types.pojo.FieldType fieldType) {
Can we move the fully qualified type to an import?
Pending CI fix; will address all comments.
Force-pushed b01f908 to 68b8593
private final ArrowWriter writer;

private UnpartitionedDataWriter(String datasetUri, StructType sparkSchema) {
  // TODO(lu) add Lance Spark configuration of maxRowsPerFragment?
@eddyxu the Java approach is to write to a VectorSchemaRoot, unload from the VectorSchemaRoot to a c.ArrowArray, and pass that to Lance to write a Lance Fragment.
If I do small batch writes, it could result in a large number of small Lance Fragment files; if I do large batch writes, it could result in high memory consumption. I haven't found a good existing Arrow-supported approach where Spark can keep writing batches to a VectorSchemaRoot while the Lance Java API keeps unloading batches, turning them into c.ArrowArray, and appending to the same fragment.
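The small-fragment vs. memory trade-off described above comes down to a flush policy. Below is a minimal, JDK-only sketch of a row-count-based policy, assuming a hypothetical `maxRowsPerBatch` knob (e.g. the `maxRowsPerFragment` configuration mentioned in the TODO); the real writer would unload the VectorSchemaRoot to a c.ArrowArray on each flush rather than record batch sizes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a row-count flush policy (hypothetical; flush() stands in for
// the real unload-to-ArrowArray-and-append step).
public class BatchingPolicy {
  private final int maxRowsPerBatch; // assumed knob, e.g. maxRowsPerFragment
  private int bufferedRows = 0;
  private final List<Integer> flushedBatchSizes = new ArrayList<>();

  BatchingPolicy(int maxRowsPerBatch) {
    this.maxRowsPerBatch = maxRowsPerBatch;
  }

  void writeRow() {
    bufferedRows++;
    if (bufferedRows >= maxRowsPerBatch) {
      flush();
    }
  }

  void close() {
    if (bufferedRows > 0) {
      flush(); // flush the final partial batch
    }
  }

  private void flush() {
    flushedBatchSizes.add(bufferedRows); // stand-in for the real write
    bufferedRows = 0;
  }

  public static void main(String[] args) {
    BatchingPolicy w = new BatchingPolicy(1024);
    for (int i = 0; i < 2500; i++) {
      w.writeRow();
    }
    w.close();
    System.out.println(w.flushedBatchSizes); // prints "[1024, 1024, 452]"
  }
}
```

A larger `maxRowsPerBatch` means fewer, bigger fragments at the cost of more buffered memory, which is exactly the tension described above.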
Will improve this in a following PR.
Force-pushed e906846 to d5610e6
@eddyxu @QianZhu @chebbyChefNEQ PTAL, thanks!
Closing; Spark write will be another new PR.
Add Spark catalog basic structure
Implemented the CREATE TABLE statement