feat(java): add spark catalog basic batch write #2133
Conversation
ACTION NEEDED Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
 * @param params write params
 * @return Dataset
 */
public static Dataset createEmptyDataSet(String path, Schema schema,
Nit: rename to `createEmptyDataset`.
try (RootAllocator allocator = new RootAllocator();
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
  ByteArrayOutputStream schemaOnlyOutStream = new ByteArrayOutputStream();
  try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null,
Writing data is really tedious this way. Is there any way we can improve the Java / Rust API?
GetSchema is added; next will be adding the Spark write support.
I'll work on designing/improving the write APIs after we have a clear idea of how write is called.
(I do feel it's very ugly now.)
  return new ArrowType.Utf8();
} else if (dataType instanceof DoubleType) {
  return new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE);
} else if (dataType instanceof FloatType) {
Curious how the FixedSizeListArray is to be mapped in Spark.
Hmm, Spark's ArrowUtils only deals with ArrowType.List and has no conversion for FixedSizeListArray.
Can we add some metadata / hints to help here?
No problem, I'll see how to deal with FixedSizeListArray and also check other commonly used types.
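One possible direction for the FixedSizeList gap combined with the metadata/hints idea above: surface a fixed-size list to Spark as a plain array type and stash the fixed width in field metadata so it can be restored on write. The sketch below is a pure-JDK illustration of that decision logic only; the `lance:fixed_size` metadata key and the string stand-ins for real Arrow/Spark type objects are hypothetical, not the project's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: since Spark's ArrowUtils cannot represent
// FixedSizeList<elem, n>, map it to a plain ArrayType(elem) and record
// the fixed width n in field metadata so a writer could restore it.
public class FixedSizeListFallback {
  static String toSparkType(String arrowElemType, int fixedWidth,
                            Map<String, String> metadataOut) {
    // "lance:fixed_size" is an assumed metadata key, not a real one.
    metadataOut.put("lance:fixed_size", Integer.toString(fixedWidth));
    return "ArrayType(" + arrowElemType + ")";
  }

  public static void main(String[] args) {
    Map<String, String> md = new HashMap<>();
    String sparkType = toSparkType("Float32", 128, md);
    System.out.println(sparkType + " " + md.get("lance:fixed_size"));
    // prints "ArrayType(Float32) 128"
  }
}
```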
String tableName = "dev.db.lance_df_table";
// Same as create + insert
data.writeTo(tableName).using("lance").create();
spark.table(tableName).show();
Could you also check that manifest files have been created under the directory?
Will add that to the test, thanks!
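For the manifest check requested above, a minimal JDK-only sketch of such a test assertion, assuming the Lance dataset layout keeps manifests as `_versions/*.manifest` under the dataset root (the `hasManifest` helper and the directory layout here are assumptions, not the project's actual test code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ManifestCheck {
  // Hypothetical helper: true if any *.manifest file exists under the
  // dataset's _versions directory (assumed Lance layout).
  static boolean hasManifest(Path datasetRoot) throws IOException {
    Path versions = datasetRoot.resolve("_versions");
    if (!Files.isDirectory(versions)) {
      return false;
    }
    try (Stream<Path> files = Files.list(versions)) {
      return files.anyMatch(p -> p.getFileName().toString().endsWith(".manifest"));
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate a freshly written table directory for the sketch.
    Path root = Files.createTempDirectory("lance_df_table");
    Files.createDirectories(root.resolve("_versions"));
    Files.createFile(root.resolve("_versions").resolve("1.manifest"));
    System.out.println(hasManifest(root)); // prints "true"
  }
}
```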
Force-pushed da7ac2c to 867e23a
Some minor issues
java/core/lance-jni/src/traits.rs
Outdated
@@ -120,3 +124,32 @@ impl JMapExt for JMap<'_, '_, '_> {
        get_map_value(env, self, key)
    }
}

pub struct SingleRecordBatchReader {
What is this for?
Can we reuse arrow's RecordBatchIterator?
  }
}

private static DataType convert(org.apache.arrow.vector.types.pojo.FieldType fieldType) {
Can we move the fully qualified type to an import?
Pending CI fix; will address all comments.
Force-pushed b01f908 to 68b8593
private final ArrowWriter writer;

private UnpartitionedDataWriter(String datasetUri, StructType sparkSchema) {
  // TODO(lu) add Lance Spark configuration of maxRowsPerFragment?
@eddyxu the Java approach is to write to a VectorSchemaRoot, unload from the VectorSchemaRoot to a c.ArrowArray, and pass that to Lance to write a Lance Fragment.
If I do small batch writes, it could result in a large number of small Lance Fragment files; if I do large batch writes, it could result in high memory consumption. I haven't found a good existing Arrow-supported approach where Spark can keep writing batches to a VectorSchemaRoot while the Lance Java API keeps unloading batches, turning them into c.ArrowArray, and appending to the same fragment.
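The small-fragment vs. memory trade-off described above comes down to a flush policy. Below is a minimal, JDK-only sketch of a row-count-based policy, assuming a hypothetical `maxRowsPerBatch` knob (e.g. the `maxRowsPerFragment` configuration mentioned in the TODO); the real writer would unload the VectorSchemaRoot to a c.ArrowArray on each flush rather than record batch sizes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a row-count flush policy (hypothetical; flush() stands in for
// the real unload-to-ArrowArray-and-append step).
public class BatchingPolicy {
  private final int maxRowsPerBatch; // assumed knob, e.g. maxRowsPerFragment
  private int bufferedRows = 0;
  private final List<Integer> flushedBatchSizes = new ArrayList<>();

  BatchingPolicy(int maxRowsPerBatch) {
    this.maxRowsPerBatch = maxRowsPerBatch;
  }

  void writeRow() {
    bufferedRows++;
    if (bufferedRows >= maxRowsPerBatch) {
      flush();
    }
  }

  void close() {
    if (bufferedRows > 0) {
      flush(); // flush the final partial batch
    }
  }

  private void flush() {
    flushedBatchSizes.add(bufferedRows); // stand-in for the real write
    bufferedRows = 0;
  }

  public static void main(String[] args) {
    BatchingPolicy w = new BatchingPolicy(1024);
    for (int i = 0; i < 2500; i++) {
      w.writeRow();
    }
    w.close();
    System.out.println(w.flushedBatchSizes); // prints "[1024, 1024, 452]"
  }
}
```

A larger `maxRowsPerBatch` means fewer, bigger fragments at the cost of more buffered memory, which is exactly the tension described above.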
Will improve this in a following PR.
Force-pushed e906846 to d5610e6
@eddyxu @QianZhu @chebbyChefNEQ PTAL, thanks!
Closing; Spark write will be another new PR.
Add Spark catalog basic structure
Implemented the CREATE TABLE statement