This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Try investigate extension types #338

Closed
wants to merge 16 commits into from


sundy-li
Collaborator

@sundy-li sundy-li commented Aug 25, 2021

Related to #326

@sundy-li
Collaborator Author

sundy-li commented Aug 25, 2021

I've added a function named to_physical_type to simplify the DataType dispatch; with it we can reduce the amount of match code that has to be refactored.
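As a rough illustration of the idea, here is a minimal sketch of such a mapping. The trimmed-down `DataType` and `PhysicalType` enums and the `byte_width` helper are assumptions made for this example, not the types or signatures used in the PR.

```rust
/// Hypothetical, trimmed-down logical types (not arrow2's full `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    Date32,
    Date64,
    /// Extension type: a name plus the logical type it wraps.
    Extension(String, Box<DataType>),
}

/// Hypothetical physical layouts: how the values are stored in memory.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Int32,
    Int64,
}

impl DataType {
    /// Maps a logical type to the layout that backs it.
    pub fn to_physical_type(&self) -> PhysicalType {
        match self {
            DataType::Int32 | DataType::Date32 => PhysicalType::Int32,
            DataType::Int64 | DataType::Date64 => PhysicalType::Int64,
            // An extension type shares the layout of the type it wraps.
            DataType::Extension(_, inner) => inner.to_physical_type(),
        }
    }
}

/// A physical-level operation only needs one arm per layout,
/// however many logical types share that layout.
pub fn byte_width(data_type: &DataType) -> usize {
    match data_type.to_physical_type() {
        PhysicalType::Int32 => 4,
        PhysicalType::Int64 => 8,
    }
}
```

With this shape, adding a new logical type or extension touches only `to_physical_type`, not every physical-level match.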

But what's the proper way to refactor the function get_value_display? cc @jorgecarleitao

pub fn get_value_display<'a>(array: &'a dyn Array) -> Box<dyn Fn(usize) -> String + 'a> {
    use DataType::*;
    match array.data_type() {
        Null => Box::new(|_: usize| "".to_string()),
        Boolean => {
            let a = array.as_any().downcast_ref::<BooleanArray>().unwrap();
            Box::new(move |row: usize| format!("{}", a.value(row)))
        }
        Int8 => dyn_primitive!(array, i8, |x| x),
        Int16 => dyn_primitive!(array, i16, |x| x),
        Int32 => dyn_primitive!(array, i32, |x| x),
        Int64 => dyn_primitive!(array, i64, |x| x),
        UInt8 => dyn_primitive!(array, u8, |x| x),
        UInt16 => dyn_primitive!(array, u16, |x| x),
        UInt32 => dyn_primitive!(array, u32, |x| x),
        UInt64 => dyn_primitive!(array, u64, |x| x),
        Float16 => unreachable!(),
        Float32 => dyn_primitive!(array, f32, |x| x),
        Float64 => dyn_primitive!(array, f64, |x| x),
        Date32 => dyn_primitive!(array, i32, temporal_conversions::date32_to_date),
        Date64 => dyn_primitive!(array, i64, temporal_conversions::date64_to_date),
        Time32(TimeUnit::Second) => {
            dyn_primitive!(array, i32, temporal_conversions::time32s_to_time)
        }
        Time32(TimeUnit::Millisecond) => {
            dyn_primitive!(array, i32, temporal_conversions::time32ms_to_time)
        }
        Time32(_) => unreachable!(), // remaining are not valid
        Time64(TimeUnit::Microsecond) => {

@jorgecarleitao
Owner

Those are good ideas!

Even though they are represented equally, imo there are two types of matches: matches of physical types and matches of logical types.

IMO the logical matches should error on extension types (e.g. Extension + Int16 should error) except at IO / FFI boundaries. Matches of physical types should be mapped accordingly, exactly like you are doing.

Display is a logical construct (Date32 != Int32). For now, I would map the representation to something like ExtensionArray[representation equal to the base logical type].
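To make the distinction concrete, here is a small sketch of two logical matches. Everything in it (the reduced `DataType`, `can_negate`, `display_value`, and the `name[...]` output format) is an assumption for illustration only: a compute-style match rejects extension types, while display wraps the representation of the base logical type.

```rust
/// Hypothetical, trimmed-down logical types for this example.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Date32,
    /// Extension type: a name plus the logical type it wraps.
    Extension(String, Box<DataType>),
}

/// Logical match inside a compute kernel: extension types are an error,
/// because the kernel cannot know what the extension means.
pub fn can_negate(data_type: &DataType) -> Result<(), String> {
    match data_type {
        DataType::Int32 => Ok(()),
        DataType::Date32 => Err("negation is not defined for dates".to_string()),
        DataType::Extension(name, _) => {
            Err(format!("extension type {} is not supported here", name))
        }
    }
}

/// Display is logical too: Date32 and Int32 share storage but print differently.
/// An extension value is shown as `name[<base logical representation>]`.
pub fn display_value(data_type: &DataType, value: i32) -> String {
    match data_type {
        DataType::Int32 => value.to_string(),
        DataType::Date32 => format!("{} days since epoch", value),
        DataType::Extension(name, inner) => {
            format!("{}[{}]", name, display_value(inner, value))
        }
    }
}
```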

There is a case here to change the DataType::Extension(_,_,_) to DataType::Extension(Box<dyn Extension>) where Extension is a trait describing the extension including how individual items should be displayed.

@sundy-li
Collaborator Author

sundy-li commented Aug 25, 2021

DataType::Extension(Box<dyn Extension>) where Extension is a trait

I agree with that.

So:

  1. Should we have a factory where users register custom Extension types, so that an Extension type is created from the metadata in the field? Or can IPC serialization/deserialization handle dynamic extension types on its own?

@jorgecarleitao
Owner

jorgecarleitao commented Aug 25, 2021

I would go for the latter: registering introduces state, potentially a singleton, etc. I would try first a stateless solution with a trait containing what is needed for the IPC to deserialize it. Something like

pub trait Extension {
     fn name(&self) -> &str;
     fn data_type(&self) -> &DataType;
     fn metadata(&self) -> &Option<HashMap<String,String>>;

     // optional, fall back to the standard `get_display_value`
     fn get_display_value() -> Box<dyn Fn ...>;
}

is enough for now?

We probably need to constrain it somehow (Debug + Send + Sync)?
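A compilable sketch of this stateless design might look as follows. The reduced `DataType`, the `Uuid` struct, and the `display_value` hook taking raw bytes are assumptions for the example; the real display hook would presumably operate on an array, as in get_value_display.

```rust
use std::collections::HashMap;
use std::fmt::Debug;

/// Hypothetical, trimmed-down `DataType` that stores the extension as a trait object.
#[derive(Debug)]
pub enum DataType {
    FixedSizeBinary(usize),
    Extension(Box<dyn Extension>),
}

/// The extension describes itself; no registry or global state is involved.
pub trait Extension: Debug + Send + Sync {
    fn name(&self) -> &str;
    /// The type used to store the values.
    fn data_type(&self) -> &DataType;
    fn metadata(&self) -> &Option<HashMap<String, String>>;

    /// Optional display hook with a default fallback (assumed signature).
    fn display_value(&self, bytes: &[u8]) -> String {
        format!("{:?}", bytes)
    }
}

/// Example user-defined extension: 16-byte UUIDs.
#[derive(Debug)]
pub struct Uuid {
    storage: DataType,
    metadata: Option<HashMap<String, String>>,
}

impl Uuid {
    pub fn new() -> Self {
        Self {
            storage: DataType::FixedSizeBinary(16),
            metadata: None,
        }
    }
}

impl Extension for Uuid {
    fn name(&self) -> &str {
        "uuid"
    }
    fn data_type(&self) -> &DataType {
        &self.storage
    }
    fn metadata(&self) -> &Option<HashMap<String, String>> {
        &self.metadata
    }
    /// Overrides the default: print the bytes as lowercase hex.
    fn display_value(&self, bytes: &[u8]) -> String {
        bytes.iter().map(|b| format!("{:02x}", b)).collect()
    }
}
```

Every method takes `&self` and none is generic, which keeps the trait usable as `Box<dyn Extension>`.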

@sundy-li
Collaborator Author

sundy-li commented Aug 25, 2021

OK. By the way, what does this object-safety error mean? I looked at the Rust docs but did not understand it.

error: the trait extension::Extension cannot be made into an object
label: extension::Extension cannot be made into an object

pub trait Extension: std::fmt::Debug + Send + Sync + Hash + Ord {
    fn name(&self) -> &str;
    /// Returns physical_type
    fn data_type(&self) -> &DataType;
    fn metadata(&self) -> &Option<HashMap<String, String>>;

    // fn get_display_value<'a>(&self, array: &'a dyn Array) -> Box<dyn Fn(usize) -> String + 'a>;
}

@jorgecarleitao
Owner

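It cannot require Hash, Ord, Eq or PartialEq: a trait with those supertraits is not object safe (PartialEq and PartialOrd take Self as a type parameter, and Hash::hash is generic), so it cannot be used as a trait object.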

I think we will need to implement PartialEq for DataType ourselves: compare all variants as usual, except Extension, where we use the trait's methods to define equality. Something like:

impl PartialEq for dyn Extension + '_ {
    fn eq(&self, other: &Self) -> bool {
        self.name() == other.name() && self.data_type() == other.data_type() && self.metadata() == other.metadata()
    }
}
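Putting the two pieces together, a self-contained sketch could look like this. The reduced `DataType` and the `Named` helper struct are assumptions for the example; only the shape of the `PartialEq` impl for the trait object follows the snippet above.

```rust
use std::collections::HashMap;
use std::fmt::Debug;

/// Hypothetical, trimmed-down `DataType`; PartialEq is written by hand below.
#[derive(Debug)]
pub enum DataType {
    Int32,
    FixedSizeBinary(usize),
    Extension(Box<dyn Extension>),
}

/// Object safe: no Hash/Eq/Ord supertraits, only `&self` methods.
pub trait Extension: Debug + Send + Sync {
    fn name(&self) -> &str;
    fn data_type(&self) -> &DataType;
    fn metadata(&self) -> &Option<HashMap<String, String>>;
}

/// Equality of two extensions is defined through the trait's methods.
impl PartialEq for dyn Extension + '_ {
    fn eq(&self, other: &Self) -> bool {
        self.name() == other.name()
            && self.data_type() == other.data_type()
            && self.metadata() == other.metadata()
    }
}

/// Manual PartialEq: the usual comparison everywhere except `Extension`,
/// which delegates to the trait-object impl above.
impl PartialEq for DataType {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (DataType::Int32, DataType::Int32) => true,
            (DataType::FixedSizeBinary(a), DataType::FixedSizeBinary(b)) => a == b,
            // Dereference the boxes and compare the trait objects themselves.
            (DataType::Extension(a), DataType::Extension(b)) => **a == **b,
            _ => false,
        }
    }
}

/// Hypothetical extension used only to exercise the comparison.
#[derive(Debug)]
pub struct Named {
    name: String,
    storage: DataType,
    metadata: Option<HashMap<String, String>>,
}

impl Named {
    pub fn new(name: &str, storage: DataType) -> Self {
        Self { name: name.to_string(), storage, metadata: None }
    }
}

impl Extension for Named {
    fn name(&self) -> &str {
        &self.name
    }
    fn data_type(&self) -> &DataType {
        &self.storage
    }
    fn metadata(&self) -> &Option<HashMap<String, String>> {
        &self.metadata
    }
}
```

Two extensions with the same name, storage type and metadata compare equal even when they are distinct allocations; an extension never equals its bare storage type.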

@codecov

codecov bot commented Aug 25, 2021

Codecov Report

Merging #338 (fed1963) into main (f79ae3e) will decrease coverage by 0.28%.
The diff coverage is 57.54%.


@@            Coverage Diff             @@
##             main     #338      +/-   ##
==========================================
- Coverage   80.97%   80.69%   -0.29%     
==========================================
  Files         326      327       +1     
  Lines       21167    21378     +211     
==========================================
+ Hits        17141    17250     +109     
- Misses       4026     4128     +102     
Impacted Files Coverage Δ
src/array/boolean/ffi.rs 0.00% <0.00%> (ø)
src/compute/like.rs 0.00% <0.00%> (ø)
src/datatypes/extension.rs 0.00% <0.00%> (ø)
src/ffi/schema.rs 57.91% <0.00%> (-0.27%) ⬇️
src/io/json_integration/schema.rs 43.75% <0.00%> (-0.28%) ⬇️
src/compute/aggregate/memory.rs 29.23% <15.38%> (+1.53%) ⬆️
src/io/ipc/convert.rs 93.88% <20.00%> (-0.82%) ⬇️
src/array/growable/mod.rs 41.93% <23.07%> (+1.61%) ⬆️
src/datatypes/field.rs 20.00% <28.57%> (+1.35%) ⬆️
src/array/ffi.rs 51.61% <42.30%> (+3.22%) ⬆️
... and 44 more


@sundy-li
Collaborator Author

sundy-li commented Aug 25, 2021

The Extension trait is done. Two more questions:

  1. Should Field::new() copy the metadata from Extension types?
  2. How should extension types be serialized/deserialized in IPC?
    It seems we need a factory to register creators, plus a lexer that parses the type from the String in the metadata.

Owner

@jorgecarleitao jorgecarleitao left a comment


Looks great so far!

Couple of points:

  • remove Ord from Field and DataType (why on earth should these be ordered?); only the JSON reader uses it, and that is brittle to say the least (I will PR separately)
  • declare a separate enum for physical types so that it is easier to work with them (and separate them from the logical types)?
  • the implementation of to_bytes should be made by us, based on the trait's information
  • to_format is done by us and does not need to be part of the trait
  • hash must be done by us and must be compatible with PartialEq and Eq
  • The non-trivial part here is that every array must now be able to convert itself to an extension type of its own compatible logical type, so that the extension type data is carried with the array itself and consumers can, e.g., annotate arrays with it and use array.data_type() to match an extension type.
  • The IPC changes should be reverted: they do not follow the Arrow spec; the spec is way easier when it comes to extension types :) They are just a piece of metadata written to the fields' schema.
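For the point about hashing, here is one possible way to keep a hand-written Hash consistent with equality. It is only a sketch: `ExtensionInfo`, its `storage` string (a stand-in for the wrapped DataType) and `hash_of` are hypothetical names. The subtle part is the metadata map, which has no Hash impl and no stable iteration order, so its entries are sorted before hashing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical description of an extension: the three pieces compared for equality.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExtensionInfo {
    pub name: String,
    /// Stand-in for the wrapped `DataType`, kept as a string for brevity.
    pub storage: String,
    pub metadata: Option<HashMap<String, String>>,
}

impl Hash for ExtensionInfo {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.name.hash(state);
        self.storage.hash(state);
        match &self.metadata {
            None => 0u8.hash(state),
            Some(map) => {
                1u8.hash(state);
                // Sort the entries so that equal maps always hash identically,
                // whatever order they happen to iterate in.
                let mut entries: Vec<(&String, &String)> = map.iter().collect();
                entries.sort();
                entries.hash(state);
            }
        }
    }
}

/// Convenience helper: hash any value with the standard library's default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

The invariant to preserve is that two values that compare equal must produce the same hash; the reverse does not need to hold.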

@@ -356,6 +356,8 @@ fn to_format(data_type: &DataType) -> String {
r
}
DataType::Dictionary(index, _) => to_format(index.as_ref()),
//TODO: get format from metadata
DataType::Extension(ty) => ty.to_format().to_string(),
Owner


The extension type does not have a format for FFI: that information is passed in the fields' metadata under the keys ARROW:extension:name and ARROW:extension:metadata.
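As a sketch of what writing those two entries could look like, assuming a plain string map for the field metadata (the `extension_field_metadata` helper and the constant names are hypothetical, and the keys use the single-colon spelling from the Arrow specification):

```rust
use std::collections::BTreeMap;

/// Reserved field-metadata keys for extension types in the Arrow format.
pub const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
pub const EXTENSION_METADATA_KEY: &str = "ARROW:extension:metadata";

/// Builds the metadata entries that mark a field as an extension type.
/// The field keeps its storage type's normal format string; only the
/// metadata identifies the extension.
pub fn extension_field_metadata(name: &str, serialized: &str) -> BTreeMap<String, String> {
    let mut metadata = BTreeMap::new();
    metadata.insert(EXTENSION_NAME_KEY.to_string(), name.to_string());
    metadata.insert(EXTENSION_METADATA_KEY.to_string(), serialized.to_string());
    metadata
}
```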

@@ -657,6 +658,9 @@ pub(crate) fn get_fb_field_type<'a>(
children: Some(fbb.create_vector(&children)),
}
}
Extension(ex) => {
todo!("extension");
Owner


Here is where we map the extension to the IPC field. To see how this is done by the rest of the arrow implementations, write

    #[test]
    fn read_generated_extension() -> Result<()> {
        test_file("1.0.0-littleendian", "generated_extension")
    }

in tests/it/io/ipc/file.rs and print the schema in the JSON format (i.e. arrow_json inside the function read_gzip_json). You will find a small ARROW:extension:... key and value, which is how extensions are written in the metadata.

@@ -782,9 +782,10 @@ impl Type {
pub const LargeBinary: Self = Self(19);
pub const LargeUtf8: Self = Self(20);
pub const LargeList: Self = Self(21);
pub const Extension: Self = Self(22);
Owner


We can't change this generated code, as other implementations will not match it. Extension types are shared in a different way; see the comment above.

@sundy-li
Collaborator Author

sundy-li commented Aug 26, 2021

BooleanArray now supports logical data types. Since a lot of code and APIs changed (really non-trivial), I'd like to refactor the other arrays last.

The JSON-style print of generated_extension is:

Schema { 
	fields: [
		Field { 
			name: "uuids", data_type: FixedSizeBinary(16), nullable: true, dict_id: 0, dict_is_ordered: false,
			metadata: Some({"ARROW:extension:metadata": "uuid-serialized", "ARROW:extension:name": "uuid"}) 
		},
		Field { 
			name: "dict_exts", data_type: Dictionary(Int8, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false,
			metadata: Some({"ARROW:extension:metadata": "dict-extension-serialized", "ARROW:extension:name": "dict-extension"}) 
		}
	],
	metadata: {}
}

I still have some questions about this.
Considering that uuid is the extension type, extend.data_type() returns FixedSizeBinary(16).

  • Must the data_type in Field be the result of extend.data_type()?
  • How do we deserialize into logical types once we have the schema from the IPC read, i.e. get uuid rather than data_type: FixedSizeBinary(16)?

Currently:

Array ----> ipc serialize ----> ipc deserialize ---> Array<FixedSizeBinary(16)>

It seems we must introduce a registry, as arrow-go did:

https://github.com/apache/arrow/blob/master/go/arrow/datatype_extension.go#L55-L86
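For comparison, a registry-free alternative can be sketched under the assumption that the logical type is described purely by data, using a three-field `DataType::Extension(name, storage, metadata)` variant. The reduced `DataType`, the `Field` struct and `logical_type` below are hypothetical and not what this PR implements; the reader simply rebuilds the extension from the two metadata keys shown in the printed schema.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down logical types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    FixedSizeBinary(usize),
    Utf8,
    /// Extension described by data only: name, storage type, serialized metadata.
    Extension(String, Box<DataType>, Option<String>),
}

/// Hypothetical field as it comes out of the IPC schema.
#[derive(Debug, Clone)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub metadata: Option<BTreeMap<String, String>>,
}

/// Rebuilds the logical type of a field without any registry:
/// if the extension-name key is present, wrap the storage type.
pub fn logical_type(field: &Field) -> DataType {
    let metadata = field.metadata.as_ref();
    match metadata.and_then(|m| m.get("ARROW:extension:name")) {
        Some(name) => {
            let serialized = metadata
                .and_then(|m| m.get("ARROW:extension:metadata"))
                .cloned();
            DataType::Extension(name.clone(), Box::new(field.data_type.clone()), serialized)
        }
        // No extension metadata: the storage type is the logical type.
        None => field.data_type.clone(),
    }
}
```

In this shape the library only carries the name and serialized metadata through; interpreting them is left to the consumer, so no global state is needed.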

@jorgecarleitao
Owner

OK, I think we need to rethink this: we should not have to significantly change the crate to enable this use case, and these API changes are, imo, too impactful.

I would prefer to avoid a registry and static state with locks in such a low-level library. Maybe the solution here is for users to declare an extension array type that must return DataType::Extension. I.e. what if we add a new trait, ArrayExtension, that extends Array and that users implement to define the extension?
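A minimal sketch of that idea, with all names assumed for illustration (the two-method `Array` trait is a stand-in for arrow2's real trait, and `extension_name` and `UuidArray` are hypothetical):

```rust
use std::fmt::Debug;

/// Hypothetical, trimmed-down logical types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    FixedSizeBinary(usize),
    /// name, storage type, serialized metadata
    Extension(String, Box<DataType>, Option<String>),
}

/// Hypothetical, minimal stand-in for arrow2's `Array` trait.
pub trait Array: Debug {
    fn data_type(&self) -> &DataType;
    fn len(&self) -> usize;
}

/// Hypothetical trait for user-defined extension arrays. By convention,
/// `data_type()` of an implementor returns `DataType::Extension`.
pub trait ArrayExtension: Array {
    fn extension_name(&self) -> &str;
}

/// Example user-defined array of 16-byte UUIDs.
#[derive(Debug)]
pub struct UuidArray {
    data_type: DataType,
    values: Vec<[u8; 16]>,
}

impl UuidArray {
    pub fn new(values: Vec<[u8; 16]>) -> Self {
        let data_type = DataType::Extension(
            "uuid".to_string(),
            Box::new(DataType::FixedSizeBinary(16)),
            None,
        );
        Self { data_type, values }
    }
}

impl Array for UuidArray {
    fn data_type(&self) -> &DataType {
        &self.data_type
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

impl ArrayExtension for UuidArray {
    fn extension_name(&self) -> &str {
        "uuid"
    }
}
```

Consumers that only know `&dyn Array` still see the extension through `data_type()`, while the state lives in the user's own type rather than in a global registry.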

Sorry that we are circling a bit, but it is important to try the different angles. Let me know if you do not have the time, that I can pitch in and try it out!

@sundy-li
Collaborator Author

sundy-li commented Aug 26, 2021

Thanks, I agree that such impactful API changes are not a good thing. Since Datafuse depends on this, I would like to try another way, outside arrow2, to implement it:

Provide conversion functions between the Datafuse schema and the Arrow2 schema:

Datafuse schema               ----> Arrow2 schema
UUID                          ----> { FixedSizeBinary, metadata }
Arrow2 schema                 ----> Datafuse schema
{ FixedSizeBinary, metadata } ----> UUID

It seems this could work without breaking anything in arrow2.
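A rough sketch of such a pair of conversion functions follows. The types are placeholders (`EngineType` is not Datafuse's real schema type, and `ArrowType`/`ArrowField` only mimic the relevant parts of an Arrow2 field); the point is that the round trip relies on nothing but the storage type plus the extension-name metadata key.

```rust
use std::collections::BTreeMap;

const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical engine-side logical types (placeholders for a Datafuse schema).
#[derive(Debug, Clone, PartialEq)]
pub enum EngineType {
    Uuid,
    Binary16,
}

/// Hypothetical stand-in for the Arrow2 storage types involved.
#[derive(Debug, Clone, PartialEq)]
pub enum ArrowType {
    FixedSizeBinary(usize),
}

/// Hypothetical stand-in for an Arrow2 field: storage type plus metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrowField {
    pub data_type: ArrowType,
    pub metadata: BTreeMap<String, String>,
}

/// Engine schema ----> Arrow schema: UUID becomes FixedSizeBinary(16) plus metadata.
pub fn to_arrow(engine_type: &EngineType) -> ArrowField {
    let mut metadata = BTreeMap::new();
    if *engine_type == EngineType::Uuid {
        metadata.insert(EXTENSION_NAME_KEY.to_string(), "uuid".to_string());
    }
    ArrowField {
        data_type: ArrowType::FixedSizeBinary(16),
        metadata,
    }
}

/// Arrow schema ----> engine schema: the metadata decides the logical type.
pub fn from_arrow(field: &ArrowField) -> Option<EngineType> {
    let name = field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str);
    match (&field.data_type, name) {
        (ArrowType::FixedSizeBinary(16), Some("uuid")) => Some(EngineType::Uuid),
        (ArrowType::FixedSizeBinary(16), None) => Some(EngineType::Binary16),
        // Unknown extension name or unsupported width.
        _ => None,
    }
}
```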

Let me know if you do not have the time, that I can pitch in and try it out!

At the same time, you can try it out inside arrow2 in another way.
