Skip to content

Commit

Permalink
apacheGH-41691: [Doc] Remove notion of "logical type"
Browse files Browse the repository at this point in the history
Also address apacheGH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.
  • Loading branch information
pitrou committed Jun 4, 2024
1 parent 524a463 commit d31bc8c
Show file tree
Hide file tree
Showing 3 changed files with 117 additions and 32 deletions.
133 changes: 109 additions & 24 deletions docs/source/format/Columnar.rst
Original file line number Diff line number Diff line change
Expand Up @@ -70,21 +70,117 @@ concepts, here is a small glossary to help disambiguate.
without taking into account any value semantics. For example, a
32-bit signed integer array and 32-bit floating point array have the
same layout.
* **Parent** and **child arrays**: names to express relationships
between physical value arrays in a nested type structure. For
example, a ``List<T>``-type parent array has a T-type array as its
child (see more on lists below).
* **Data type**: An application-facing semantic value type that is
implemented using some physical layout. For example, Decimal128
values are stored as 16 bytes in a fixed-size binary
layout. A timestamp may be stored as 64-bit fixed-size layout.
* **Primitive type**: a data type having no child types. This includes
such types as fixed bit-width, variable-size binary, and null types.
* **Nested type**: a data type whose full structure depends on one or
more other child types. Two fully-specified nested types are equal
if and only if their child types are equal. For example, ``List<U>``
is distinct from ``List<V>`` iff U and V are different types.
* **Logical type**: An application-facing semantic value type that is
implemented using some physical layout. For example, Decimal
values are stored as 16 bytes in a fixed-size binary
layout. Similarly, strings can be stored as ``List<1-byte>``. A
timestamp may be stored as 64-bit fixed-size layout.
* **Parent** and **child arrays**: names to express relationships
between physical value arrays in a nested type structure. For
example, a ``List<T>``-type parent array has a T-type array as its
child (see more on lists below).
* **Parametric type**: a type which requires additional parameters
for full determination of its semantics. For example, all nested types
are parametric by construction. A timestamp is also parametric as it needs
a unit (such as microseconds) and a timezone.

Data Types
==========

The `Schema.fbs`_ defines built-in data types supported by the
Arrow columnar format. Each data type uses a well-defined physical layout.

`Schema.fbs`_ is the authoritative source for the description of the
standard Arrow data types. However, we also provide the below table for
convenience:

+--------------------+------------------------------+------------------------------------------------------------+
| Type | Type Parameters | Physical Memory Layout |
+====================+==============================+============================================================+
| Null | | Null |
+--------------------+------------------------------+------------------------------------------------------------+
| Boolean | | Fixed-size Primitive |
+--------------------+------------------------------+------------------------------------------------------------+
| Int | * bit width | *" (same as above)* |
| | * signedness | |
+--------------------+------------------------------+------------------------------------------------------------+
| Floating Point | * precision | *"* |
+--------------------+------------------------------+------------------------------------------------------------+
| Decimal | * bit width | *"* |
| | * scale | |
| | * precision | |
+--------------------+------------------------------+------------------------------------------------------------+
| Date | * unit | *"* |
+--------------------+------------------------------+------------------------------------------------------------+
| Time | * bit width | *"* |
| | * unit | |
+--------------------+------------------------------+------------------------------------------------------------+
| Timestamp | * unit | *"* |
| | * timezone | |
+--------------------+------------------------------+------------------------------------------------------------+
| Interval | * unit | *"* |
+--------------------+------------------------------+------------------------------------------------------------+
| Duration | * unit | *"* |
+--------------------+------------------------------+------------------------------------------------------------+
| Fixed-Size Binary | * byte width | Fixed-size Binary |
+--------------------+------------------------------+------------------------------------------------------------+
| Binary | | Variable-size Binary with 32-bit offsets |
+--------------------+------------------------------+------------------------------------------------------------+
| Utf8 | | *"* |
+--------------------+------------------------------+------------------------------------------------------------+
| Large Binary | | Variable-size Binary with 64-bit offsets |
+--------------------+------------------------------+------------------------------------------------------------+
| Large Utf8 | | *"* |
+--------------------+------------------------------+------------------------------------------------------------+
| Binary View | | Variable-size Binary View |
+--------------------+------------------------------+------------------------------------------------------------+
| Utf8 View | | *"* |
+--------------------+------------------------------+------------------------------------------------------------+
| Fixed-Size List | * *value type* | Fixed-size List |
| | * list size | |
+--------------------+------------------------------+------------------------------------------------------------+
| List | * *value type* | Variable-size List with 32-bit offsets |
+--------------------+------------------------------+------------------------------------------------------------+
| Large List | * *value type* | Variable-size List with 64-bit offsets |
+--------------------+------------------------------+------------------------------------------------------------+
| List View | * *value type* | Variable-size List View with 32-bit offsets and sizes |
+--------------------+------------------------------+------------------------------------------------------------+
| Large List View | * *value type* | Variable-size List View with 64-bit offsets and sizes |
+--------------------+------------------------------+------------------------------------------------------------+
| Struct | * *children* | Struct |
+--------------------+------------------------------+------------------------------------------------------------+
| Map | * *children* | List of Structs |
| | * keys sortedness | |
+--------------------+------------------------------+------------------------------------------------------------+
| Union | * *children* | Dense or Sparse Union |
| | * mode | |
| | * type ids | |
+--------------------+------------------------------+------------------------------------------------------------+
| Dictionary | * *index type* | Dictionary Encoded |
| | * *value type* | |
| | * orderedness | |
+--------------------+------------------------------+------------------------------------------------------------+
| Run-End Encoded | * *run end type* | Run-End Encoded |
| | * *value type* | |
+--------------------+------------------------------+------------------------------------------------------------+

.. note::
Sometimes the term "logical type" is used to denote the Arrow data types
and distinguish them from their respective physical layouts. However,
unlike other type systems such as `Apache Parquet <https://parquet.apache.org/>`__'s,
the Arrow type system doesn't have separate notions of physical types and
logical types.

The Arrow type system separately allows for
:ref:`exception types <format_metadata_extension_types>`, which allows
adorning standard Arrow data types with richer application-facing semantics
(for example defining a "JSON" type laid upon the standard String data type).


.. _format_layout:

Expand All @@ -93,7 +189,7 @@ Physical Memory Layout

Arrays are defined by a few pieces of metadata and data:

* A logical data type.
* A data type.
* A sequence of buffers.
* A length as a 64-bit signed integer. Implementations are permitted
to be limited to 32-bit lengths, see more on this below.
Expand Down Expand Up @@ -138,7 +234,7 @@ the different physical layouts defined by Arrow:
* **Run-End Encoded (REE)**: a nested layout consisting of two child arrays,
one representing values, and one representing the logical index where
the run of a corresponding value ends.
* **Null**: a sequence of all null values, having null logical type
* **Null**: a sequence of all null values.

The Arrow columnar memory layout only applies to *data* and not
*metadata*. Implementations are free to represent metadata in-memory
Expand Down Expand Up @@ -313,7 +409,7 @@ arrays have a single values buffer, variable-size binary have an
**offsets** buffer and **data** buffer.

The offsets buffer contains ``length + 1`` signed integers (either
32-bit or 64-bit, depending on the logical type), which encode the
32-bit or 64-bit, depending on the data type), which encode the
start position of each slot in the data buffer. The length of the
value in each slot is computed using the difference between the offset
at that slot's index and the subsequent offset. For example, the
Expand Down Expand Up @@ -1070,17 +1166,6 @@ of memory buffers for each layout.
"Dictionary-encoded",validity,data (indices),,
"Run-end encoded",,,,

Logical Types
=============

The `Schema.fbs`_ defines built-in logical types supported by the
Arrow columnar format. Each logical type uses one of the above
physical layouts. Nested logical types may have different physical
layouts depending on the particular realization of the type.

We do not go into detail about the logical types definitions in this
document as we consider `Schema.fbs`_ to be authoritative.

.. _format-ipc:

Serialization and Interprocess Communication (IPC)
Expand Down Expand Up @@ -1170,7 +1255,7 @@ The ``Field`` Flatbuffers type contains the metadata for a single
array. This includes:

* The field's name
* The field's logical type
* The field's data type
* Whether the field is semantically nullable. While this has no
bearing on the array's physical layout, many systems distinguish
nullable and non-nullable fields and we want to allow them to
Expand Down
14 changes: 7 additions & 7 deletions docs/source/python/data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,8 @@ with memory buffers, like the ones explained in the documentation on
:ref:`Memory and IO <io>`. These data structures are exposed in Python through
a series of interrelated classes:

* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a logical
array type
* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe the
type of an array and govern how its values are interpreted
* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
collection of types. These can be thought of as the column types in a
table-like object.
Expand Down Expand Up @@ -55,8 +55,8 @@ array data. These include:
* **Nested types**: list, map, struct, and union
* **Dictionary type**: An encoded categorical type (more on this later)

Each logical data type in Arrow has a corresponding factory function for
creating an instance of that type object in Python:
Each data type in Arrow has a corresponding factory function for creating
an instance of that type object in Python:

.. ipython:: python
Expand All @@ -72,9 +72,9 @@ creating an instance of that type object in Python:
print(t4)
print(t5)
We use the name **logical type** because the **physical** storage may be the
same for one or more types. For example, ``int64``, ``float64``, and
``timestamp[ms]`` all occupy 64 bits per value.
.. note::
Different data types might use a given physical storage. For example,
``int64``, ``float64``, and ``timestamp[ms]`` all occupy 64 bits per value.

These objects are ``metadata``; they are used for describing the data in arrays,
schemas, and record batches. In Python, they can be used in functions where the
Expand Down
2 changes: 1 addition & 1 deletion docs/source/python/extending_types.rst
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ Defining extension types ("user-defined types")

Arrow has the notion of extension types in the metadata specification as a
possibility to extend the built-in types. This is done by annotating any of the
built-in Arrow logical types (the "storage type") with a custom type name and
built-in Arrow data types (the "storage type") with a custom type name and
optional serialized representation ("ARROW:extension:name" and
"ARROW:extension:metadata" keys in the Field’s custom_metadata of an IPC
message).
Expand Down

0 comments on commit d31bc8c

Please sign in to comment.