
Added support for Extension #350

Closed
jorgecarleitao wants to merge 5 commits into main from extension


Conversation

jorgecarleitao
Owner

This PR adds support for Arrow's "Extension type".

The type and its array are represented by DataType::Extension and ExtensionArray, respectively.

For now, extension arrays can only be shared via IPC (not via FFI, since FFI does not yet support metadata).

This PR adds an example demonstrating how to use it.

As usual, all of this is pending passing integration tests, as well as more tests covering the feature.
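
To make the shape of the new variant concrete, here is a minimal sketch of declaring such a type on a field (a sketch only; the "date16" name is illustrative, and Field::new is assumed to take a name, a DataType and a nullability flag):

use arrow2::datatypes::{DataType, Field};

fn main() {
    // An extension type is a (name, physical type, optional opaque metadata) triple.
    let extension_type = DataType::Extension(
        "date16".to_string(),
        Box::new(DataType::UInt16),
        None,
    );
    let _field = Field::new("d", extension_type, false);
}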

@jorgecarleitao added the "feature" (A new feature) label on Aug 27, 2021
@jorgecarleitao
Owner Author

jorgecarleitao commented Aug 27, 2021

@sundy-li here is a proposal. Would it work for you? You would need to match like this:

match array.data_type() {
    // the name and the inner type sit behind a String and a Box, so use a guard
    DataType::Extension(name, inner, _) if name == "date16" && **inner == DataType::UInt16 => {
        let array = array.as_any().downcast_ref::<ExtensionArray>().unwrap();
        let array: &UInt16Array = array.inner().as_any().downcast_ref().unwrap();
        // `array` is now a plain &UInt16Array
    }
    _ => {}
}

@codecov

codecov bot commented Aug 27, 2021

Codecov Report

Merging #350 (e911fd8) into main (87aa617) will decrease coverage by 0.84%.
The diff coverage is 55.86%.


@@            Coverage Diff             @@
##             main     #350      +/-   ##
==========================================
- Coverage   81.01%   80.16%   -0.85%     
==========================================
  Files         326      329       +3     
  Lines       21172    21507     +335     
==========================================
+ Hits        17152    17242      +90     
- Misses       4020     4265     +245     
Impacted Files Coverage Δ
src/array/display.rs 57.03% <0.00%> (-0.86%) ⬇️
src/array/extension/ffi.rs 0.00% <0.00%> (ø)
src/array/growable/mod.rs 40.32% <ø> (ø)
src/array/mod.rs 54.28% <0.00%> (-0.95%) ⬇️
src/compute/aggregate/memory.rs 26.86% <0.00%> (-0.83%) ⬇️
src/datatypes/field.rs 20.16% <0.00%> (+1.52%) ⬆️
src/datatypes/mod.rs 11.76% <ø> (ø)
src/ffi/schema.rs 57.91% <0.00%> (-0.27%) ⬇️
src/array/ffi.rs 34.65% <31.57%> (-13.74%) ⬇️
src/array/extension/mod.rs 42.30% <42.30%> (ø)
... and 24 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sundy-li
Collaborator

sundy-li commented Aug 28, 2021

> @sundy-li here is a proposal. Would it work for you? You would need to match like this:

Yes, this would work in my case, LGTM. So for Extension(String, Box<DataType>, Option<String>), is the Option<String> an argument of this type? It could contain a timezone, precision, etc.

@jorgecarleitao marked this pull request as ready for review on August 28, 2021 05:28
@jorgecarleitao
Owner Author

Option<String> is an optional opaque string; e.g. we can base64-encode anything into it.
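
For illustration, a minimal sketch of carrying arbitrary parameters in that slot (the base64 crate and the parameter string are assumptions for the example, not part of this PR):

fn main() {
    use arrow2::datatypes::DataType;

    // Hypothetical: serialize extension-specific parameters and carry them
    // as the opaque, base64-encoded metadata string.
    let params = b"precision=16;epoch=1970-01-01";
    let metadata = base64::encode(params); // base64 crate, 0.13-style API
    let _data_type = DataType::Extension(
        "date16".to_string(),
        Box::new(DataType::UInt16),
        Some(metadata),
    );
}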

@sundy-li
Collaborator

sundy-li commented Aug 29, 2021

This PR looks good except for one thing: how does compute work for the extension array?

If Date16 and UInt16 share the same physical array (UInt16Array), the compute code will work for it automatically.

I just looked deep into ClickHouse's array system; there is no logical type involved inside it.

https://github.com/ClickHouse/ClickHouse/blob/master/src/Columns/ColumnVector.h#L232

@jorgecarleitao
Owner Author

AFAIK compute cannot make many assumptions about an extension type, because an extension type may have different semantics.

With that said, there are still a lot of things we can assume and forward the call to the corresponding function (e.g. we should still support filter, take, concat, etc).
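
As an illustration of that kind of forwarding, a minimal sketch (assuming arrow2's compute::filter::filter kernel with an (array, mask) signature; re-wrapping the result into an ExtensionArray is left out):

use arrow2::array::{Array, BooleanArray};
use arrow2::compute::filter::filter;
use arrow2::error::Result;

// Forward `filter` to the extension's physical (inner) array; the result
// could then be re-wrapped with the original extension DataType.
fn filter_extension(inner: &dyn Array, mask: &BooleanArray) -> Result<Box<dyn Array>> {
    filter(inner, mask)
}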

Are those the type of compute operations that you are thinking about?

@jorgecarleitao
Owner Author

I have been thinking a lot about this and I am not sure this is the right way of doing it; I am starting to agree with you @sundy-li that we need the physical-type idea here. I.e. do not use a registry, but still use logical_type -> physical_type -> downcast, with all array types carrying a DataType, instead of creating an ExtensionArray.
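
A purely conceptual sketch of that alternative (all names below are hypothetical, not arrow2 API): the logical type lives next to the array, and compute dispatches on the physical type it maps to.

// Hypothetical types, only to illustrate the logical -> physical mapping.
#[derive(Debug, PartialEq)]
enum PhysicalType {
    UInt16,
    Int32,
}

#[derive(Debug)]
enum LogicalType {
    UInt16,
    Date16, // extension type backed by u16
    Date32, // backed by i32
}

fn to_physical(logical: &LogicalType) -> PhysicalType {
    match logical {
        LogicalType::UInt16 | LogicalType::Date16 => PhysicalType::UInt16,
        LogicalType::Date32 => PhysicalType::Int32,
    }
}

// Compute kernels would match on `to_physical(array.data_type())` and then
// downcast to the corresponding physical array.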

@sundy-li
Collaborator

sundy-li commented Aug 29, 2021

Currently, this PR meets our demand, so I decided to introduce custom logical types in Datafuse based on this PR.
But I'm not going to have ExtensionArray inside Datafuse; the ExtensionArray will only be seen in arrow's RecordBatch converted from Datafuse's DataBlock.

This may help in sharing some compute operations, like Date16 sub Date16, Date16 sub/add UInt16, Date16 cmp UInt16 ...

So I think this PR can be merged now because it does not break any old behaviors; let's take it as an experimental feature.

I do agree that the logical_type should not be carried in Array; the data_type() of Array should always return the physical type. Letting the schema/record batches take control of the logical_type may be better.

@sundy-li
Collaborator

Let computations follow the concrete physical types, not the logical types.

@jorgecarleitao
Owner Author

@ritchie46, does polars use Array::data_type for anything other than downcasting the array to the correct implementation, or are logical data types carried somewhere else (e.g. RecordBatch or Series)? Asking because I think @sundy-li's proposal is quite interesting and may result in a lot of simplifications around data types and array construction.

@ritchie46
Collaborator

> @ritchie46, does polars use Array::data_type for anything other than downcasting the array to the correct implementation, or are logical data types carried somewhere else

Polars keeps track of the data type itself on the ChunkedArray<T>. This data type is mostly aligned with arrow's data type, but does not have to be. For instance, a polars CategoricalType is backed by the arrow array UInt32Array, and for comparison it is downcast to UInt32Type.

Logical-type Series are downcast to their physical type, compute is applied, and they are upcast again. So a Date32 is cast to Int32, some compute is applied, and it is cast back to Date32. This saves a lot of compiler bloat.
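
A minimal sketch of that round trip with arrow2's cast kernel (assuming compute::cast::cast takes (array, to_type) and returns a boxed array; error handling and the actual kernel are elided):

use arrow2::array::Array;
use arrow2::compute::cast::cast;
use arrow2::datatypes::DataType;
use arrow2::error::Result;

// Date32 -> Int32, run the physical compute, then cast back to Date32.
fn compute_on_physical(logical: &dyn Array) -> Result<Box<dyn Array>> {
    let physical = cast(logical, &DataType::Int32)?;
    // ... apply any Int32 kernel to `physical` here ...
    cast(physical.as_ref(), &DataType::Date32)
}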

@sundy-li
Collaborator

sundy-li commented Aug 30, 2021

Datafuse adopted polars's Series idea; the SeriesTrait is good for type dispatch (it may be fatter, but worth it). We can handle lots of type conversions between logical arrays and physical arrays, so thanks for this great crate.

But ChunkedArray<T> may not be a good idea since T is already a logical type; moreover, ChunkedArray<T> holds an ArrayRef. I would rather not use a generic ChunkedArray<T> and use ChunkedArrayPrimitive, ChunkedArrayBoolean, etc. instead.

So the dispatch chain is:
Series op Series
----> ChunkedArray<T> op Series (SeriesTrait)
----> ChunkedArray<E> op ChunkedArray<E> (normalize types, logical cast to physical)
----> UInt32Array op UInt32Array (downcast from ArrayRef)
----> ChunkedArray<E> into ChunkedArray<T> (physical to logical)
----> ChunkedArray<T> into Series (array into series)

And polars does not use Flight and IPC, so currently it is easy to have custom logical data types inside it.
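
A toy model of that chain (every name below is hypothetical and only mirrors the flow above, not Datafuse's or polars's actual API):

use std::any::Any;

// "Physical array": a plain vector standing in for UInt32Array.
struct UInt32Column(Vec<u32>);

// "Logical series": a logical type name plus a type-erased column (ArrayRef-like).
struct Series {
    logical: &'static str, // e.g. "Date16"
    column: Box<dyn Any>,
}

fn add(lhs: &Series, rhs: &Series) -> Series {
    // normalize/downcast to the shared physical representation
    let l = lhs.column.downcast_ref::<UInt32Column>().unwrap();
    let r = rhs.column.downcast_ref::<UInt32Column>().unwrap();
    // physical compute ...
    let sum: Vec<u32> = l.0.iter().zip(&r.0).map(|(a, b)| a + b).collect();
    // ... then wrap back into the logical series
    Series { logical: lhs.logical, column: Box::new(UInt32Column(sum)) }
}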

@jorgecarleitao
Owner Author

Closing in favor of #359

@jorgecarleitao deleted the extension branch on August 31, 2021 06:02