I've added the function. But what's the proper way to refactor the function in Lines 34 to 62 in a52211d?
Those are good ideas! Even though they are represented equally, imo there are two types of matches: matches of physical types and matches of logical types. IMO the logical matches should error on extension types. E.g. `Display` is a logical construct (`Date32 != Int32`). For now, I would map the representation to something like this; there is a case here to change the representation.
I agree with that. So:
I would go for the latter: registering introduces state, potentially a singleton, etc. I would try first a stateless solution with a trait containing what is needed for the IPC to deserialize it. Something like

```rust
pub trait Extension {
    fn name() -> &str;
    fn data_type() -> &DataType;
    fn metadata() -> &Option<HashMap<String, String>>;
    // optional, fall back to the standard `get_display_value`
    fn get_display_value() -> Box<dyn Fn ...>;
}
```

is enough for now? We probably need to constrain it somehow.
Ok, btw what's the error about object-safe traits? I looked at the Rust docs but did not get it. The error: `error: the trait`
It cannot be. I think that we will need to implement something like

```rust
impl PartialEq for dyn Extension + '_ {
    fn eq(&self, other: &Self) -> bool {
        self.name() == other.name()
            && self.data_type() == other.data_type()
            && self.metadata() == other.metadata()
    }
}
```
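For reference, `dyn Extension` is rejected because the trait's methods have no `self` receiver, which object safety requires. Below is a minimal, self-contained sketch of an object-safe variant, using a stand-in `DataType` (the real arrow2 enum is much richer) and a hypothetical `Uuid` extension, that also makes the `PartialEq` impl above compile:

```rust
use std::collections::HashMap;

// Stand-in for arrow2's DataType, just to keep the sketch self-contained.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    FixedSizeBinary(usize),
}

// Object-safe variant: every method takes `&self`, none returns `Self` or
// uses generics, so `dyn Extension` is allowed.
pub trait Extension {
    fn name(&self) -> &str;
    fn data_type(&self) -> &DataType;
    fn metadata(&self) -> Option<&HashMap<String, String>>;
}

// Equality of trait objects, as suggested in the thread.
impl PartialEq for dyn Extension + '_ {
    fn eq(&self, other: &Self) -> bool {
        self.name() == other.name()
            && self.data_type() == other.data_type()
            && self.metadata() == other.metadata()
    }
}

// Hypothetical example extension: a uuid stored as 16 fixed-size bytes.
pub struct Uuid;

const UUID_STORAGE: DataType = DataType::FixedSizeBinary(16);

impl Extension for Uuid {
    fn name(&self) -> &str {
        "uuid"
    }
    fn data_type(&self) -> &DataType {
        &UUID_STORAGE
    }
    fn metadata(&self) -> Option<&HashMap<String, String>> {
        None
    }
}
```

With `&self` receivers, trait objects can be boxed and compared through `dyn Extension`.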
Codecov Report

```
@@            Coverage Diff            @@
##             main     #338     +/-  ##
=========================================
- Coverage   80.97%   80.69%   -0.29%
=========================================
  Files         326      327       +1
  Lines       21167    21378     +211
=========================================
+ Hits        17141    17250     +109
- Misses       4026     4128     +102
```

Continue to review full report at Codecov.
Trait
Looks great so far!

Couple of points:

- remove `Ord` from `Field` and `DataType` (why on earth should this be ordered?); only the json reader uses it, and it is brittle to say the least (I will PR separately)
- declare a separate `enum` for physical types so that it is easier to work with them (and to separate them from the logical types)
- the implementation of `to_bytes` should be made by us, based on the trait's information
- `to_format` is done by us and does not need to be part of the trait
- `hash` must be done by us and must be compatible with `PartialEq` and `Eq`
- [ ] The non-triviality here is that every array must now be able to convert itself to an extension type of its own compatible logical type, so that the extension type data is carried over with the array itself and e.g. consumers can annotate the arrays with it and use `array.data_type()` to match an extension type.
- The IPC changes should be reverted: they do not follow the Arrow spec; the spec is way easier when it comes to extension types :) They are just a piece of metadata written to the fields' schema.
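The second bullet (a separate `enum` for physical types) can be illustrated with a small sketch; the names below are hypothetical, not arrow2's actual API. The point is that physical matching sees through logical distinctions such as `Date32` vs `Int32`, while an extension defers to the type it wraps:

```rust
// Hypothetical sketch of splitting logical types from their physical
// representation; names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PhysicalType {
    Primitive(usize), // byte width of the primitive
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Int32,
    Date32,
    Utf8,
    // an extension carries a name, optional metadata, and the wrapped type
    Extension(String, Option<String>, Box<LogicalType>),
}

impl LogicalType {
    // Physical matches ignore logical distinctions: Date32 and Int32 share
    // the same representation, and an extension matches whatever its inner
    // type matches.
    pub fn to_physical(&self) -> PhysicalType {
        match self {
            LogicalType::Int32 | LogicalType::Date32 => PhysicalType::Primitive(4),
            LogicalType::Utf8 => PhysicalType::Utf8,
            LogicalType::Extension(_, _, inner) => inner.to_physical(),
        }
    }
}
```

Logical equality (`PartialEq` on `LogicalType`) still distinguishes `Date32` from `Int32`, which is exactly the separation argued for above.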
src/ffi/schema.rs (Outdated)

```diff
@@ -356,6 +356,8 @@ fn to_format(data_type: &DataType) -> String {
         r
     }
     DataType::Dictionary(index, _) => to_format(index.as_ref()),
+    //TODO: get format from metadata
+    DataType::Extension(ty) => ty.to_format().to_string(),
```
The extension type does not have a format for FFI: that information is passed on the fields' metadata in the keys `ARROW:extension:name` and `ARROW:extension:metadata`.
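A sketch of what that means in practice, assuming only the spec-defined key names; the helper function and its shape are hypothetical:

```rust
use std::collections::HashMap;

// Per the Arrow spec, an extension type is not a new format string: the field
// keeps its storage type's format, and the extension is carried as two
// metadata keys on the field.
pub fn extension_metadata(name: &str, metadata: Option<&str>) -> HashMap<String, String> {
    let mut map = HashMap::new();
    map.insert("ARROW:extension:name".to_string(), name.to_string());
    if let Some(m) = metadata {
        map.insert("ARROW:extension:metadata".to_string(), m.to_string());
    }
    map
}
```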
src/io/ipc/convert.rs (Outdated)

```diff
@@ -657,6 +658,9 @@ pub(crate) fn get_fb_field_type<'a>(
         children: Some(fbb.create_vector(&children)),
     }
 }
+Extension(ex) => {
+    todo!("extension");
```
here is where we map the extension to the IPC field. To see how this is done by the rest of the arrow implementations, write

```rust
#[test]
fn read_generated_extension() -> Result<()> {
    test_file("1.0.0-littleendian", "generated_extension")
}
```

in tests/it/io/ipc/file.rs and print the schema in the json format (i.e. `arrow_json`) inside the function `read_gzip_json`. You will find a small `ARROW:extension:...` key and value, which is how extensions are written in the metadata.
src/io/ipc/gen/Schema.rs (Outdated)

```diff
@@ -782,9 +782,10 @@ impl Type {
     pub const LargeBinary: Self = Self(19);
     pub const LargeUtf8: Self = Self(20);
     pub const LargeList: Self = Self(21);
+    pub const Extension: Self = Self(22);
```
we can't change this generated code as other implementations will not match it. extension types are shared in a different way, see comment above.
Now BooleanArray supports logical datatypes. Since lots of code/APIs were changed (really non-trivial), I'd like to refactor the other arrays last. The generated_extension json-style print result is:
I still have some questions about this:

Currently: `Array` --> IPC serialize --> IPC deserialize --> `Array<FixedSizeBinary(16)>`. It seems we must introduce a registry as arrow-go did: https://github.com/apache/arrow/blob/master/go/arrow/datatype_extension.go#L55-L86
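For comparison, a registry along the lines of the linked arrow-go code might look like the sketch below in Rust (all names hypothetical; the registered "data type" is a plain string to keep it self-contained). Note that it requires global state behind a lock, which is exactly the cost weighed in this thread:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Hypothetical global registry mapping an extension name to its storage
// type, loosely mirroring arrow-go's RegisterExtensionType/GetExtensionType.
fn registry() -> &'static Mutex<HashMap<String, String>> {
    static REGISTRY: OnceLock<Mutex<HashMap<String, String>>> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

pub fn register_extension(name: &str, storage: &str) {
    registry()
        .lock()
        .unwrap()
        .insert(name.to_string(), storage.to_string());
}

// Looked up during IPC deserialization when an ARROW:extension:name key
// is found in a field's metadata.
pub fn get_extension(name: &str) -> Option<String> {
    registry().lock().unwrap().get(name).cloned()
}
```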
Ok, I think that we need to re-think this through, as we should not have to significantly change the crate to enable this use-case, introducing API changes that imo are too impactful. I would prefer to avoid the registry and static state with locks in such a low-level library. Maybe the solution here is for users to declare an extension array type that must return the relevant information. Sorry that we are circling a bit, but it is important to try the different angles. Let me know if you do not have the time, and I can pitch in and try it out!
Thanks, I do agree that impactful API changes are not a good thing. Since Datafuse depends on this, I would like to try another way, outside arrow2, to implement this: provide conversion functions between the Datafuse schema and the Arrow2 schema.

Seems this could work out and does not break anything in arrow2.

At the same time, you can try it out inside arrow2 in another way.
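The conversion-function idea above might be sketched as follows, with every type and name a hypothetical stand-in: the extension-ness lives only in the converted field's metadata, so arrow2 itself stays unchanged:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: a Datafuse-side logical type, and an arrow2-side
// field reduced to a storage type (as a string) plus metadata.
#[derive(Debug, Clone, PartialEq)]
enum DatafuseType {
    Int32,
    Uuid, // logically a uuid, physically FixedSizeBinary(16)
}

#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    data_type: String,
    metadata: HashMap<String, String>,
}

// Datafuse -> Arrow2: extension info goes into field metadata only.
fn to_arrow(t: &DatafuseType) -> ArrowField {
    match t {
        DatafuseType::Int32 => ArrowField {
            data_type: "Int32".into(),
            metadata: HashMap::new(),
        },
        DatafuseType::Uuid => {
            let mut metadata = HashMap::new();
            metadata.insert("ARROW:extension:name".into(), "uuid".into());
            ArrowField {
                data_type: "FixedSizeBinary(16)".into(),
                metadata,
            }
        }
    }
}

// Arrow2 -> Datafuse: recover the logical type from the metadata key.
fn from_arrow(f: &ArrowField) -> DatafuseType {
    match f.metadata.get("ARROW:extension:name").map(|s| s.as_str()) {
        Some("uuid") => DatafuseType::Uuid,
        // no extension key: plain storage type (the sketch only models Int32)
        _ => DatafuseType::Int32,
    }
}
```

The round trip preserves the logical type without arrow2 ever knowing about `DatafuseType`.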
Related to #326