diff --git a/guide/src/high_level.md b/guide/src/high_level.md index 9cc1c86f0e6..e055b0a15db 100644 --- a/guide/src/high_level.md +++ b/guide/src/high_level.md @@ -9,7 +9,7 @@ from a slice as follows: ```rust # use arrow2::array::{Array, PrimitiveArray}; # fn main() { -let array = PrimitiveArray::from([Some(1), None, Some(123)]); +let array = PrimitiveArray::::from([Some(1), None, Some(123)]); assert_eq!(array.len(), 3) # } ``` @@ -19,7 +19,7 @@ from a slice of values, ```rust # use arrow2::array::{Array, PrimitiveArray}; # fn main() { -let array = PrimitiveArray::from_slice([1, 0, 123]); +let array = PrimitiveArray::::from_slice([1.0, 0.0, 123.0]); assert_eq!(array.len(), 3) # } ``` @@ -29,7 +29,7 @@ or from an iterator ```rust # use arrow2::array::{Array, PrimitiveArray}; # fn main() { -let array: PrimitiveArray = [Some(1), None, Some(123)].iter().collect(); +let array: PrimitiveArray = [Some(1), None, Some(123)].iter().collect(); assert_eq!(array.len(), 3) # } ``` @@ -68,7 +68,7 @@ which is assigned when allocating arrays from iterators, slices, etc. # use arrow2::array::{Array, Int32Array, PrimitiveArray}; # use arrow2::datatypes::DataType; # fn main() { -let array = PrimitiveArray::from_slice([1, 0, 123]); +let array = PrimitiveArray::::from_slice([1, 0, 123]); assert_eq!(array.data_type(), &DataType::Int32); # } ``` @@ -105,7 +105,7 @@ let a: &dyn Array = &a; ### Downcast and `as_any` Given a trait object `array: &dyn Array`, we know its physical type via -`array.data_type().to_physical_type()`, which we use to downcast the array +`PhysicalType: array.data_type().to_physical_type()`, which we use to downcast the array to its concrete type: ```rust @@ -114,53 +114,56 @@ to its concrete type: # fn main() { let a = PrimitiveArray::::from(&[Some(1), None]); let array = &array as &dyn Array; +// ... +let physical_type: PhysicalType = array.data_type().to_physical_type(); +# } +``` + +There is a one to one relationship between each variant of `PhysicalType` (an enum) and +an each implementation of `Array` (a struct): + +| `PhysicalType` | `Array` | +|-------------------|------------------------| +| `Primitive(_)` | `PrimitiveArray<_>` | +| `Binary` | `BinaryArray` | +| `LargeBinary` | `BinaryArray` | +| `Utf8` | `Utf8Array` | +| `LargeUtf8` | `Utf8Array` | +| `List` | `ListArray` | +| `LargeList` | `ListArray` | +| `FixedSizeBinary` | `FixedSizeBinaryArray` | +| `FixedSizeList` | `FixedSizeListArray` | +| `Struct` | `StructArray` | +| `Union` | `UnionArray` | +| `Dictionary(_)` | `DictionaryArray<_>` | + +where `_` represents each of the variants (e.g. `PrimitiveType::Int32 <-> i32`). -match array.data_type().to_physical_type() { - PhysicalType::Int32 => { - let array = array.as_any().downcast_ref::>().unwrap(); - let values: &[i32] = array.values(); - assert_eq!(values, &[1, 0]); +In this context, a common idiom in using `Array` as a trait object is as follows: + +```rust +use arrow2::datatypes::{PhysicalType, PrimitiveType}; +use arrow2::array::{Array, PrimitiveArray}; + +fn float_operator(array: &dyn Array) -> Result, String> { + match array.data_type().to_physical_type() { + PhysicalType::Primitive(PrimitiveType::Float32) => { + let array = array.as_any().downcast_ref::>().unwrap(); + // let array = f32-specific operator + let array = array.clone(); + Ok(Box::new(array)) + } + PhysicalType::Primitive(PrimitiveType::Float64) => { + let array = array.as_any().downcast_ref::>().unwrap(); + // let array = f64-specific operator + let array = array.clone(); + Ok(Box::new(array)) + } + _ => Err("This operator is only valid for float point arrays".to_string()), } - _ => todo!() } -# } ``` -There is a many-to-one relationship between `DataType` and an Array (i.e. a physical representation). The relationship is the following: - -| `PhysicalType` | `PhysicalType` | -|----------------------|---------------------------| -| `UInt8` | `PrimitiveArray` | -| `UInt16` | `PrimitiveArray` | -| `UInt32` | `PrimitiveArray` | -| `UInt64` | `PrimitiveArray` | -| `Int8` | `PrimitiveArray` | -| `Int16` | `PrimitiveArray` | -| `Int32` | `PrimitiveArray` | -| `Int64` | `PrimitiveArray` | -| `Int128` | `PrimitiveArray` | -| `Float32` | `PrimitiveArray` | -| `Float64` | `PrimitiveArray` | -| `DaysMs` | `PrimitiveArray` | -| `Binary` | `BinaryArray` | -| `LargeBinary` | `BinaryArray` | -| `Utf8` | `Utf8Array` | -| `LargeUtf8` | `Utf8Array` | -| `List` | `ListArray` | -| `LargeList` | `ListArray` | -| `FixedSizeBinary` | `FixedSizeBinaryArray` | -| `FixedSizeList` | `FixedSizeListArray` | -| `Struct` | `StructArray` | -| `Union` | `UnionArray` | -| `Dictionary(UInt8)` | `DictionaryArray` | -| `Dictionary(UInt16)` | `DictionaryArray` | -| `Dictionary(UInt32)` | `DictionaryArray` | -| `Dictionary(UInt64)` | `DictionaryArray` | -| `Dictionary(Int8)` | `DictionaryArray` | -| `Dictionary(Int16)` | `DictionaryArray` | -| `Dictionary(Int32)` | `DictionaryArray` | -| `Dictionary(Int64)` | `DictionaryArray` | - ## From Iterator In the examples above, we've introduced how to create an array from an iterator. @@ -218,7 +221,8 @@ bitwise operations, it is often more performant to operate on chunks of bits ins ## Vectorized operations One of the main advantages of the arrow format and its memory layout is that -it often enables SIMD. For example, an unary operation `op` on a `PrimitiveArray` is likely auto-vectorized on the following code: +it often enables SIMD. For example, an unary operation `op` on a `PrimitiveArray` +likely emits SIMD instructions on the following code: ```rust # use arrow2::buffer::Buffer; diff --git a/guide/src/metadata.md b/guide/src/metadata.md index c9009c67673..9ced2cdfeb1 100644 --- a/guide/src/metadata.md +++ b/guide/src/metadata.md @@ -12,16 +12,18 @@ semantical types defined in Arrow. In Arrow2, logical types are declared as variants of the `enum` `arrow2::datatypes::DataType`. For example, `DataType::Int32` represents a signed integer of 32 bits. -Each logical type has an associated in-memory physical representation and is associated to specific -semantics. For example, `Date32` has the same in-memory representation as `Int32`, but the value -represents the number of days since UNIX epoch. +Each `DataType` has an associated `enum PhysicalType` (many-to-one) representing the +particular in-memory representation, and is associated to specific semantics. +For example, both `DataType::Date32` and `DataType::Int32` have the same `PhysicalType` +(`PhysicalType::Primitive(PrimitiveType::Int32)`) but `Date32` represents the number of +days since UNIX epoch. -Logical types are metadata: they annotate arrays with extra information about in-memory data. +Logical types are metadata: they annotate physical types with extra information about data. ## `Field` (column metadata) -Besides logical types, the arrow format supports other relevant metadata to the format. All this -information is stored in `arrow2::datatypes::Field`. +Besides logical types, the arrow format supports other relevant metadata to the format. +All this information is stored in `arrow2::datatypes::Field`. A `Field` is arrow's metadata associated to a column in the context of a columnar format. It has a name, a logical type `DataType`, whether the column is nullable, etc.