# ArcticDB Experimental Arrow Writing Demo: Dense Numeric Data

### This notebook demonstrates the first cut of writing pyarrow tables directly into ArcticDB. This is still an experimental feature: both the API and its behaviour are subject to change in minor or patch releases, and it should under no circumstances be deployed to production environments. There are still many rough edges, several of which are highlighted below, that will be addressed in future releases. In particular, string columns are not yet supported.

### Performance-wise, writing numeric data will generally be identical to writing Pandas data when the pyarrow tables are created via "standard" methods such as `pa.table`, `from_arrays`, and `from_pandas`, as everything is zero-copy in both cases. The main exception is bool columns, which are zero-copy in Pandas but have a different in-memory layout in Arrow that requires a transformation. This transformation is not yet parallelised, so it can be the bottleneck for bool-heavy data. Arrow tables comprising multiple record batches can also induce a memcpy, as we require contiguous buffers to pass to the encoder.

## Preamble

In [1]:
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
from arcticdb import Arctic, OutputFormat, QueryBuilder, WritePayload

In [2]:
arctic = Arctic("lmdb://arrow-writes-demo")

In [3]:
lib = arctic.get_library("test_lib", output_format=OutputFormat.EXPERIMENTAL_ARROW, create_if_missing=True)

In [4]:
# Start the demo from an empty library (private API, used here only to reset state)
lib._nvs.version_store.clear()

In [5]:
sym = "test"

## Helper function for pretty-printing pyarrow tables, as the default repr isn't very human-friendly. Note that converting a pyarrow table to a polars dataframe is largely zero-copy!

In [6]:
def print_table(table):
    print(pl.from_arrow(table))

## Libraries must be explicitly configured to allow Arrow table writes

In [7]:
try:
    lib.write(sym, pa.table({"col": pa.array([0, 1])}))
except Exception as e:
    print(e)

data is of a type that cannot be normalized. Consider using write_pickle instead. type(data)=[<class 'pyarrow.lib.Table'>]


In [8]:
lib._nvs._set_allow_arrow_input()

This is a hidden method, as it will be removed once writing Arrow tables is fully supported.

## Write some numeric data (including bools and timestamps) of all supported types

numpy and pandas are not required; they just provide convenient methods for generating the data.

In [9]:
table_0 = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=50), type=pa.timestamp("ns")),
        "bool": pa.array([True, False] * 25, pa.bool_()),
        "uint8": pa.array(np.arange(50), pa.uint8()),
        "uint16": pa.array(np.arange(50), pa.uint16()),
        "uint32": pa.array(np.arange(50), pa.uint32()),
        "uint64": pa.array(np.arange(50), pa.uint64()),
        "int8": pa.array(np.arange(50), pa.int8()),
        "int16": pa.array(np.arange(50), pa.int16()),
        "int32": pa.array(np.arange(50), pa.int32()),
        "int64": pa.array(np.arange(50), pa.int64()),
        "float32": pa.array(np.arange(50), pa.float32()),
        "float64": pa.array(np.arange(50), pa.float64()),
    }
)
print_table(table_0)

shape: (50, 12)
┌─────────────────────┬───────┬───────┬────────┬───┬───────┬───────┬─────────┬─────────┐
│ timestamp           ┆ bool  ┆ uint8 ┆ uint16 ┆ … ┆ int32 ┆ int64 ┆ float32 ┆ float64 │
│ ---                 ┆ ---   ┆ ---   ┆ ---    ┆   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[ns]        ┆ bool  ┆ u8    ┆ u16    ┆   ┆ i32   ┆ i64   ┆ f32     ┆ f64     │
╞═════════════════════╪═══════╪═══════╪════════╪═══╪═══════╪═══════╪═════════╪═════════╡
│ 2025-01-01 00:00:00 ┆ true  ┆ 0     ┆ 0      ┆ … ┆ 0     ┆ 0     ┆ 0.0     ┆ 0.0     │
│ 2025-01-02 00:00:00 ┆ false ┆ 1     ┆ 1      ┆ … ┆ 1     ┆ 1     ┆ 1.0     ┆ 1.0     │
│ 2025-01-03 00:00:00 ┆ true  ┆ 2     ┆ 2      ┆ … ┆ 2     ┆ 2     ┆ 2.0     ┆ 2.0     │
│ 2025-01-04 00:00:00 ┆ false ┆ 3     ┆ 3      ┆ … ┆ 3     ┆ 3     ┆ 3.0     ┆ 3.0     │
│ 2025-01-05 00:00:00 ┆ true  ┆ 4     ┆ 4      ┆ … ┆ 4     ┆ 4     ┆ 4.0     ┆ 4.0     │
│ …                   ┆ …     ┆ …     ┆ …      ┆ … ┆ …     ┆ …     ┆ …       ┆ …       │
│ 202

In [10]:
lib.write(sym, table_0)

VersionedItem(symbol='test', library='test_lib', data=n/a, version=0, metadata=None, host='LMDB(path=/data/team/data/arctic_native/examples/arrow-writes-demo)', timestamp=1760691320986733451)

In [11]:
print_table(lib.read(sym).data)

shape: (50, 12)
┌─────────────────────┬───────┬───────┬────────┬───┬───────┬───────┬─────────┬─────────┐
│ timestamp           ┆ bool  ┆ uint8 ┆ uint16 ┆ … ┆ int32 ┆ int64 ┆ float32 ┆ float64 │
│ ---                 ┆ ---   ┆ ---   ┆ ---    ┆   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[ns]        ┆ bool  ┆ u8    ┆ u16    ┆   ┆ i32   ┆ i64   ┆ f32     ┆ f64     │
╞═════════════════════╪═══════╪═══════╪════════╪═══╪═══════╪═══════╪═════════╪═════════╡
│ 2025-01-01 00:00:00 ┆ true  ┆ 0     ┆ 0      ┆ … ┆ 0     ┆ 0     ┆ 0.0     ┆ 0.0     │
│ 2025-01-02 00:00:00 ┆ false ┆ 1     ┆ 1      ┆ … ┆ 1     ┆ 1     ┆ 1.0     ┆ 1.0     │
│ 2025-01-03 00:00:00 ┆ true  ┆ 2     ┆ 2      ┆ … ┆ 2     ┆ 2     ┆ 2.0     ┆ 2.0     │
│ 2025-01-04 00:00:00 ┆ false ┆ 3     ┆ 3      ┆ … ┆ 3     ┆ 3     ┆ 3.0     ┆ 3.0     │
│ 2025-01-05 00:00:00 ┆ true  ┆ 4     ┆ 4      ┆ … ┆ 4     ┆ 4     ┆ 4.0     ┆ 4.0     │
│ …                   ┆ …     ┆ …     ┆ …      ┆ … ┆ …     ┆ …     ┆ …       ┆ …       │
│ 202

## Append some data to this

In [12]:
table_1 = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-02-20", periods=50), type=pa.timestamp("ns")),
        "bool": pa.array([True, False] * 25, pa.bool_()),
        "uint8": pa.array(np.arange(50, 100), pa.uint8()),
        "uint16": pa.array(np.arange(50, 100), pa.uint16()),
        "uint32": pa.array(np.arange(50, 100), pa.uint32()),
        "uint64": pa.array(np.arange(50, 100), pa.uint64()),
        "int8": pa.array(np.arange(50, 100), pa.int8()),
        "int16": pa.array(np.arange(50, 100), pa.int16()),
        "int32": pa.array(np.arange(50, 100), pa.int32()),
        "int64": pa.array(np.arange(50, 100), pa.int64()),
        "float32": pa.array(np.arange(50, 100), pa.float32()),
        "float64": pa.array(np.arange(50, 100), pa.float64()),
    }
)
lib.append(sym, table_1)
print_table(lib.read(sym).data)

shape: (100, 12)
┌─────────────────────┬───────┬───────┬────────┬───┬───────┬───────┬─────────┬─────────┐
│ timestamp           ┆ bool  ┆ uint8 ┆ uint16 ┆ … ┆ int32 ┆ int64 ┆ float32 ┆ float64 │
│ ---                 ┆ ---   ┆ ---   ┆ ---    ┆   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[ns]        ┆ bool  ┆ u8    ┆ u16    ┆   ┆ i32   ┆ i64   ┆ f32     ┆ f64     │
╞═════════════════════╪═══════╪═══════╪════════╪═══╪═══════╪═══════╪═════════╪═════════╡
│ 2025-01-01 00:00:00 ┆ true  ┆ 0     ┆ 0      ┆ … ┆ 0     ┆ 0     ┆ 0.0     ┆ 0.0     │
│ 2025-01-02 00:00:00 ┆ false ┆ 1     ┆ 1      ┆ … ┆ 1     ┆ 1     ┆ 1.0     ┆ 1.0     │
│ 2025-01-03 00:00:00 ┆ true  ┆ 2     ┆ 2      ┆ … ┆ 2     ┆ 2     ┆ 2.0     ┆ 2.0     │
│ 2025-01-04 00:00:00 ┆ false ┆ 3     ┆ 3      ┆ … ┆ 3     ┆ 3     ┆ 3.0     ┆ 3.0     │
│ 2025-01-05 00:00:00 ┆ true  ┆ 4     ┆ 4      ┆ … ┆ 4     ┆ 4     ┆ 4.0     ┆ 4.0     │
│ …                   ┆ …     ┆ …     ┆ …      ┆ … ┆ …     ┆ …     ┆ …       ┆ …       │
│ 20

## Arrow has no concept of indexes, so if a timestamp column should be treated as an index this must be specified via the new `index_column` argument. This is required if `update` will be used, or if read calls will use the `date_range` argument!

### This is an area where we are definitely looking for feedback. This API is very much not set in stone

### Also note that, as Arrow has no index concept, it has no concept of multi-indexes either

In [13]:
try:
    lib.read(sym, date_range=(pd.Timestamp("2025-02-01"), pd.Timestamp("2025-03-01")))
except Exception as e:
    print(e)

E_ASSERTION_FAILURE Cannot apply date range filter to symbol with non-timestamp index


20251017 09:55:21.069139 5692 E arcticdb | E_ASSERTION_FAILURE Cannot apply date range filter to symbol with non-timestamp index


In [14]:
lib.write(sym, table_0, index_column="timestamp")
lib.append(sym, table_1, index_column="timestamp")
print_table(lib.read(sym, date_range=(pd.Timestamp("2025-02-01"), pd.Timestamp("2025-03-01"))).data)

shape: (29, 12)
┌─────────────────────┬───────┬───────┬────────┬───┬───────┬───────┬─────────┬─────────┐
│ timestamp           ┆ bool  ┆ uint8 ┆ uint16 ┆ … ┆ int32 ┆ int64 ┆ float32 ┆ float64 │
│ ---                 ┆ ---   ┆ ---   ┆ ---    ┆   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[ns]        ┆ bool  ┆ u8    ┆ u16    ┆   ┆ i32   ┆ i64   ┆ f32     ┆ f64     │
╞═════════════════════╪═══════╪═══════╪════════╪═══╪═══════╪═══════╪═════════╪═════════╡
│ 2025-02-01 00:00:00 ┆ false ┆ 31    ┆ 31     ┆ … ┆ 31    ┆ 31    ┆ 31.0    ┆ 31.0    │
│ 2025-02-02 00:00:00 ┆ true  ┆ 32    ┆ 32     ┆ … ┆ 32    ┆ 32    ┆ 32.0    ┆ 32.0    │
│ 2025-02-03 00:00:00 ┆ false ┆ 33    ┆ 33     ┆ … ┆ 33    ┆ 33    ┆ 33.0    ┆ 33.0    │
│ 2025-02-04 00:00:00 ┆ true  ┆ 34    ┆ 34     ┆ … ┆ 34    ┆ 34    ┆ 34.0    ┆ 34.0    │
│ 2025-02-05 00:00:00 ┆ false ┆ 35    ┆ 35     ┆ … ┆ 35    ┆ 35    ┆ 35.0    ┆ 35.0    │
│ …                   ┆ …     ┆ …     ┆ …      ┆ … ┆ …     ┆ …     ┆ …       ┆ …       │
│ 202
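
The row count of 29 above can be checked by hand. The sketch below rebuilds the symbol's index with pandas and counts the timestamps inside the requested range, once treating both ends as inclusive and once excluding the end. That `date_range` is inclusive at both ends is inferred here from the outputs in this notebook rather than from the ArcticDB source.

```python
import pandas as pd

# Index of the symbol after the write and append above.
index = pd.date_range("2025-01-01", periods=50).append(pd.date_range("2025-02-20", periods=50))

start, end = pd.Timestamp("2025-02-01"), pd.Timestamp("2025-03-01")
inclusive_rows = int(((index >= start) & (index <= end)).sum())
exclusive_end_rows = int(((index >= start) & (index < end)).sum())
print(inclusive_rows, exclusive_end_rows)
```

Only the inclusive count matches the `(29, 12)` shape returned by the read.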

## `update` works in the same way

In [15]:
table_2 = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-04-08", periods=2), type=pa.timestamp("ns")),
        "bool": pa.array([True] * 2, pa.bool_()),
        "uint8": pa.array([0] * 2, pa.uint8()),
        "uint16": pa.array([0] * 2, pa.uint16()),
        "uint32": pa.array([0] * 2, pa.uint32()),
        "uint64": pa.array([0] * 2, pa.uint64()),
        "int8": pa.array([0] * 2, pa.int8()),
        "int16": pa.array([0] * 2, pa.int16()),
        "int32": pa.array([0] * 2, pa.int32()),
        "int64": pa.array([0] * 2, pa.int64()),
        "float32": pa.array([0] * 2, pa.float32()),
        "float64": pa.array([0] * 2, pa.float64()),
    }
)
print_table(table_2)

shape: (2, 12)
┌─────────────────────┬──────┬───────┬────────┬───┬───────┬───────┬─────────┬─────────┐
│ timestamp           ┆ bool ┆ uint8 ┆ uint16 ┆ … ┆ int32 ┆ int64 ┆ float32 ┆ float64 │
│ ---                 ┆ ---  ┆ ---   ┆ ---    ┆   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[ns]        ┆ bool ┆ u8    ┆ u16    ┆   ┆ i32   ┆ i64   ┆ f32     ┆ f64     │
╞═════════════════════╪══════╪═══════╪════════╪═══╪═══════╪═══════╪═════════╪═════════╡
│ 2025-04-08 00:00:00 ┆ true ┆ 0     ┆ 0      ┆ … ┆ 0     ┆ 0     ┆ 0.0     ┆ 0.0     │
│ 2025-04-09 00:00:00 ┆ true ┆ 0     ┆ 0      ┆ … ┆ 0     ┆ 0     ┆ 0.0     ┆ 0.0     │
└─────────────────────┴──────┴───────┴────────┴───┴───────┴───────┴─────────┴─────────┘


In [16]:
lib.update(sym, table_2, index_column="timestamp")
print_table(lib.read(sym).data)

shape: (100, 12)
┌─────────────────────┬───────┬───────┬────────┬───┬───────┬───────┬─────────┬─────────┐
│ timestamp           ┆ bool  ┆ uint8 ┆ uint16 ┆ … ┆ int32 ┆ int64 ┆ float32 ┆ float64 │
│ ---                 ┆ ---   ┆ ---   ┆ ---    ┆   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[ns]        ┆ bool  ┆ u8    ┆ u16    ┆   ┆ i32   ┆ i64   ┆ f32     ┆ f64     │
╞═════════════════════╪═══════╪═══════╪════════╪═══╪═══════╪═══════╪═════════╪═════════╡
│ 2025-01-01 00:00:00 ┆ true  ┆ 0     ┆ 0      ┆ … ┆ 0     ┆ 0     ┆ 0.0     ┆ 0.0     │
│ 2025-01-02 00:00:00 ┆ false ┆ 1     ┆ 1      ┆ … ┆ 1     ┆ 1     ┆ 1.0     ┆ 1.0     │
│ 2025-01-03 00:00:00 ┆ true  ┆ 2     ┆ 2      ┆ … ┆ 2     ┆ 2     ┆ 2.0     ┆ 2.0     │
│ 2025-01-04 00:00:00 ┆ false ┆ 3     ┆ 3      ┆ … ┆ 3     ┆ 3     ┆ 3.0     ┆ 3.0     │
│ 2025-01-05 00:00:00 ┆ true  ┆ 4     ┆ 4      ┆ … ┆ 4     ┆ 4     ┆ 4.0     ┆ 4.0     │
│ …                   ┆ …     ┆ …     ┆ …      ┆ … ┆ …     ┆ …     ┆ …       ┆ …       │
│ 20
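
The result still has 100 rows because the two new rows replace two existing ones. The following pandas-only sketch mimics that behaviour under the assumption that `update` replaces everything between the first and last timestamp of the new data; it is an emulation for illustration, not ArcticDB code, and the single integer column stands in for the twelve real ones.

```python
import pandas as pd

existing = pd.Series(
    range(100),
    index=pd.date_range("2025-01-01", periods=50).append(pd.date_range("2025-02-20", periods=50)),
)
new_rows = pd.Series([0, 0], index=pd.date_range("2025-04-08", periods=2))

# Drop every existing row between the first and last timestamp of the new data,
# then splice the new rows in.
lo, hi = new_rows.index[0], new_rows.index[-1]
kept = existing[(existing.index < lo) | (existing.index > hi)]
updated = pd.concat([kept, new_rows]).sort_index()
print(len(updated), updated.loc["2025-04-08":"2025-04-10"].tolist())
```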

## In this first release, the `index_column` must be specified in the `append` and `update` calls even if the existing data has an index column. There will be an option in a future release to infer the index column from the existing data in these cases.
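
Until the index column can be inferred, repeating `index_column` on every call is easy to forget. One possible way to reduce the repetition is a small wrapper that pre-fills the argument. `bind_index_column` below is a hypothetical helper, not part of ArcticDB; it only assumes that the library object exposes `write`, `append`, and `update` methods accepting an `index_column` keyword, as the cells above do.

```python
from functools import partial
from types import SimpleNamespace


def bind_index_column(lib, index_column):
    """Return write/append/update callables with index_column pre-filled.

    Hypothetical convenience helper; `lib` is any object with write, append,
    and update methods that accept an `index_column` keyword argument.
    """
    return SimpleNamespace(
        write=partial(lib.write, index_column=index_column),
        append=partial(lib.append, index_column=index_column),
        update=partial(lib.update, index_column=index_column),
    )
```

With the objects from this notebook it would be used as `ts = bind_index_column(lib, "timestamp")`, followed by `ts.write(sym, table_0)` and `ts.append(sym, table_1)`.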

## An exception will be thrown if a non-timestamp column is specified as the index

In [17]:
try:
    lib.write(sym, table_0, index_column="int64")
except Exception as e:
    print(e)

E_INVALID_USER_ARGUMENT Specified Arrow index column has non-time type INT64


## or if the specified column does not exist

In [18]:
try:
    lib.write(sym, table_0, index_column="blah")
except Exception as e:
    print(e)

E_COLUMN_DOESNT_EXIST Specified index column named 'blah' not present in data


## or if the specified index column does not match the existing data

In [19]:
table = pa.table(
    {
        "timestamp_0": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=2), type=pa.timestamp("ns")),
        "timestamp_1": pa.Array.from_pandas(pd.date_range("2025-01-03", periods=2), type=pa.timestamp("ns")),
    }
)
lib.write(sym, table, index_column="timestamp_0")
try:
    lib.update(sym, table, index_column="timestamp_1")
except Exception as e:
    print(e)

The index names in the argument are not identical to that of the existing version: UPDATE
stream_id="test"
(Showing only the mismatch. Full col list saved in the `last_mismatch_msg` attribute of the lib instance.
'-' marks columns missing from the argument, '+' for unexpected.)
-"FD<name=timestamp_0, type=TD<type=NANOSECONDS_UTC64, dim=0>, idx=0>
-FD<name=timestamp_1, type=TD<type=NANOSECONDS_UTC64, dim=0>, idx=1>"
+"FD<name=timestamp_1, type=TD<type=NANOSECONDS_UTC64, dim=0>, idx=0>
+FD<name=timestamp_0, type=TD<type=NANOSECONDS_UTC64, dim=0>, idx=1>"


## The index column does not need to be the first column in the table

In [20]:
table = pa.table(
    {
        "bool": pa.array([True, False] * 25, pa.bool_()),
        "uint8": pa.array(np.arange(50), pa.uint8()),
        "uint16": pa.array(np.arange(50), pa.uint16()),
        "uint32": pa.array(np.arange(50), pa.uint32()),
        "uint64": pa.array(np.arange(50), pa.uint64()),
        "int8": pa.array(np.arange(50), pa.int8()),
        "int16": pa.array(np.arange(50), pa.int16()),
        "int32": pa.array(np.arange(50), pa.int32()),
        "int64": pa.array(np.arange(50), pa.int64()),
        "float32": pa.array(np.arange(50), pa.float32()),
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=50), type=pa.timestamp("ns")),
        "float64": pa.array(np.arange(50), pa.float64()),
    }
)
print_table(table)

shape: (50, 12)
┌───────┬───────┬────────┬────────┬───┬───────┬─────────┬─────────────────────┬─────────┐
│ bool  ┆ uint8 ┆ uint16 ┆ uint32 ┆ … ┆ int64 ┆ float32 ┆ timestamp           ┆ float64 │
│ ---   ┆ ---   ┆ ---    ┆ ---    ┆   ┆ ---   ┆ ---     ┆ ---                 ┆ ---     │
│ bool  ┆ u8    ┆ u16    ┆ u32    ┆   ┆ i64   ┆ f32     ┆ datetime[ns]        ┆ f64     │
╞═══════╪═══════╪════════╪════════╪═══╪═══════╪═════════╪═════════════════════╪═════════╡
│ true  ┆ 0     ┆ 0      ┆ 0      ┆ … ┆ 0     ┆ 0.0     ┆ 2025-01-01 00:00:00 ┆ 0.0     │
│ false ┆ 1     ┆ 1      ┆ 1      ┆ … ┆ 1     ┆ 1.0     ┆ 2025-01-02 00:00:00 ┆ 1.0     │
│ true  ┆ 2     ┆ 2      ┆ 2      ┆ … ┆ 2     ┆ 2.0     ┆ 2025-01-03 00:00:00 ┆ 2.0     │
│ false ┆ 3     ┆ 3      ┆ 3      ┆ … ┆ 3     ┆ 3.0     ┆ 2025-01-04 00:00:00 ┆ 3.0     │
│ true  ┆ 4     ┆ 4      ┆ 4      ┆ … ┆ 4     ┆ 4.0     ┆ 2025-01-05 00:00:00 ┆ 4.0     │
│ …     ┆ …     ┆ …      ┆ …      ┆ … ┆ …     ┆ …       ┆ …                   ┆ …   

In [21]:
lib.write(sym, table, index_column="timestamp")
print_table(lib.read(sym, date_range=(pd.Timestamp("2025-01-10"), pd.Timestamp("2025-01-15"))).data)

shape: (6, 12)
┌───────┬───────┬────────┬────────┬───┬───────┬─────────┬─────────────────────┬─────────┐
│ bool  ┆ uint8 ┆ uint16 ┆ uint32 ┆ … ┆ int64 ┆ float32 ┆ timestamp           ┆ float64 │
│ ---   ┆ ---   ┆ ---    ┆ ---    ┆   ┆ ---   ┆ ---     ┆ ---                 ┆ ---     │
│ bool  ┆ u8    ┆ u16    ┆ u32    ┆   ┆ i64   ┆ f32     ┆ datetime[ns]        ┆ f64     │
╞═══════╪═══════╪════════╪════════╪═══╪═══════╪═════════╪═════════════════════╪═════════╡
│ false ┆ 9     ┆ 9      ┆ 9      ┆ … ┆ 9     ┆ 9.0     ┆ 2025-01-10 00:00:00 ┆ 9.0     │
│ true  ┆ 10    ┆ 10     ┆ 10     ┆ … ┆ 10    ┆ 10.0    ┆ 2025-01-11 00:00:00 ┆ 10.0    │
│ false ┆ 11    ┆ 11     ┆ 11     ┆ … ┆ 11    ┆ 11.0    ┆ 2025-01-12 00:00:00 ┆ 11.0    │
│ true  ┆ 12    ┆ 12     ┆ 12     ┆ … ┆ 12    ┆ 12.0    ┆ 2025-01-13 00:00:00 ┆ 12.0    │
│ false ┆ 13    ┆ 13     ┆ 13     ┆ … ┆ 13    ┆ 13.0    ┆ 2025-01-14 00:00:00 ┆ 13.0    │
│ true  ┆ 14    ┆ 14     ┆ 14     ┆ … ┆ 14    ┆ 14.0    ┆ 2025-01-15 00:00:00 ┆ 14.0 

## But it may not appear in the same relative position if only a subset of the columns is read (this will be fixed in a future release)
### As with Pandas, the index column is always read, even if it is not specified in the `columns` argument

In [22]:
print_table(lib.read(sym, columns=["uint8", "int8", "float64"]).data)

shape: (50, 4)
┌─────────────────────┬───────┬──────┬─────────┐
│ timestamp           ┆ uint8 ┆ int8 ┆ float64 │
│ ---                 ┆ ---   ┆ ---  ┆ ---     │
│ datetime[ns]        ┆ u8    ┆ i8   ┆ f64     │
╞═════════════════════╪═══════╪══════╪═════════╡
│ 2025-01-01 00:00:00 ┆ 0     ┆ 0    ┆ 0.0     │
│ 2025-01-02 00:00:00 ┆ 1     ┆ 1    ┆ 1.0     │
│ 2025-01-03 00:00:00 ┆ 2     ┆ 2    ┆ 2.0     │
│ 2025-01-04 00:00:00 ┆ 3     ┆ 3    ┆ 3.0     │
│ 2025-01-05 00:00:00 ┆ 4     ┆ 4    ┆ 4.0     │
│ …                   ┆ …     ┆ …    ┆ …       │
│ 2025-02-15 00:00:00 ┆ 45    ┆ 45   ┆ 45.0    │
│ 2025-02-16 00:00:00 ┆ 46    ┆ 46   ┆ 46.0    │
│ 2025-02-17 00:00:00 ┆ 47    ┆ 47   ┆ 47.0    │
│ 2025-02-18 00:00:00 ┆ 48    ┆ 48   ┆ 48.0    │
│ 2025-02-19 00:00:00 ┆ 49    ┆ 49   ┆ 49.0    │
└─────────────────────┴───────┴──────┴─────────┘


## Writing views of data works as you would expect

In [23]:
table_view = table.slice(20, 10)
print_table(table_view)

shape: (10, 12)
┌───────┬───────┬────────┬────────┬───┬───────┬─────────┬─────────────────────┬─────────┐
│ bool  ┆ uint8 ┆ uint16 ┆ uint32 ┆ … ┆ int64 ┆ float32 ┆ timestamp           ┆ float64 │
│ ---   ┆ ---   ┆ ---    ┆ ---    ┆   ┆ ---   ┆ ---     ┆ ---                 ┆ ---     │
│ bool  ┆ u8    ┆ u16    ┆ u32    ┆   ┆ i64   ┆ f32     ┆ datetime[ns]        ┆ f64     │
╞═══════╪═══════╪════════╪════════╪═══╪═══════╪═════════╪═════════════════════╪═════════╡
│ true  ┆ 20    ┆ 20     ┆ 20     ┆ … ┆ 20    ┆ 20.0    ┆ 2025-01-21 00:00:00 ┆ 20.0    │
│ false ┆ 21    ┆ 21     ┆ 21     ┆ … ┆ 21    ┆ 21.0    ┆ 2025-01-22 00:00:00 ┆ 21.0    │
│ true  ┆ 22    ┆ 22     ┆ 22     ┆ … ┆ 22    ┆ 22.0    ┆ 2025-01-23 00:00:00 ┆ 22.0    │
│ false ┆ 23    ┆ 23     ┆ 23     ┆ … ┆ 23    ┆ 23.0    ┆ 2025-01-24 00:00:00 ┆ 23.0    │
│ true  ┆ 24    ┆ 24     ┆ 24     ┆ … ┆ 24    ┆ 24.0    ┆ 2025-01-25 00:00:00 ┆ 24.0    │
│ false ┆ 25    ┆ 25     ┆ 25     ┆ … ┆ 25    ┆ 25.0    ┆ 2025-01-26 00:00:00 ┆ 25.0

In [24]:
lib.write(sym, table_view)
print_table(lib.read(sym).data)

shape: (10, 12)
┌───────┬───────┬────────┬────────┬───┬───────┬─────────┬─────────────────────┬─────────┐
│ bool  ┆ uint8 ┆ uint16 ┆ uint32 ┆ … ┆ int64 ┆ float32 ┆ timestamp           ┆ float64 │
│ ---   ┆ ---   ┆ ---    ┆ ---    ┆   ┆ ---   ┆ ---     ┆ ---                 ┆ ---     │
│ bool  ┆ u8    ┆ u16    ┆ u32    ┆   ┆ i64   ┆ f32     ┆ datetime[ns]        ┆ f64     │
╞═══════╪═══════╪════════╪════════╪═══╪═══════╪═════════╪═════════════════════╪═════════╡
│ true  ┆ 20    ┆ 20     ┆ 20     ┆ … ┆ 20    ┆ 20.0    ┆ 2025-01-21 00:00:00 ┆ 20.0    │
│ false ┆ 21    ┆ 21     ┆ 21     ┆ … ┆ 21    ┆ 21.0    ┆ 2025-01-22 00:00:00 ┆ 21.0    │
│ true  ┆ 22    ┆ 22     ┆ 22     ┆ … ┆ 22    ┆ 22.0    ┆ 2025-01-23 00:00:00 ┆ 22.0    │
│ false ┆ 23    ┆ 23     ┆ 23     ┆ … ┆ 23    ┆ 23.0    ┆ 2025-01-24 00:00:00 ┆ 23.0    │
│ true  ┆ 24    ┆ 24     ┆ 24     ┆ … ┆ 24    ┆ 24.0    ┆ 2025-01-25 00:00:00 ┆ 24.0    │
│ false ┆ 25    ┆ 25     ┆ 25     ┆ … ┆ 25    ┆ 25.0    ┆ 2025-01-26 00:00:00 ┆ 25.0

## Staging APIs also work as expected

In [25]:
lib.stage(sym, table_0, index_column="timestamp")
lib.stage(sym, table_1, index_column="timestamp")
lib.finalize_staged_data(sym)
print_table(lib.read(sym).data)

shape: (100, 12)
┌─────────────────────┬───────┬───────┬────────┬───┬───────┬───────┬─────────┬─────────┐
│ timestamp           ┆ bool  ┆ uint8 ┆ uint16 ┆ … ┆ int32 ┆ int64 ┆ float32 ┆ float64 │
│ ---                 ┆ ---   ┆ ---   ┆ ---    ┆   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[ns]        ┆ bool  ┆ u8    ┆ u16    ┆   ┆ i32   ┆ i64   ┆ f32     ┆ f64     │
╞═════════════════════╪═══════╪═══════╪════════╪═══╪═══════╪═══════╪═════════╪═════════╡
│ 2025-01-01 00:00:00 ┆ true  ┆ 0     ┆ 0      ┆ … ┆ 0     ┆ 0     ┆ 0.0     ┆ 0.0     │
│ 2025-01-02 00:00:00 ┆ false ┆ 1     ┆ 1      ┆ … ┆ 1     ┆ 1     ┆ 1.0     ┆ 1.0     │
│ 2025-01-03 00:00:00 ┆ true  ┆ 2     ┆ 2      ┆ … ┆ 2     ┆ 2     ┆ 2.0     ┆ 2.0     │
│ 2025-01-04 00:00:00 ┆ false ┆ 3     ┆ 3      ┆ … ┆ 3     ┆ 3     ┆ 3.0     ┆ 3.0     │
│ 2025-01-05 00:00:00 ┆ true  ┆ 4     ┆ 4      ┆ … ┆ 4     ┆ 4     ┆ 4.0     ┆ 4.0     │
│ …                   ┆ …     ┆ …     ┆ …      ┆ … ┆ …     ┆ …     ┆ …       ┆ …       │
│ 20

# Pandas interoperability

## Data written as Arrow can be read as Pandas

If no index column is specified in the write, the resulting dataframe will have a `RangeIndex`.

In [26]:
lib.write(sym, table)
lib.head(sym, output_format=OutputFormat.PANDAS).data

,bool,uint8,uint16,uint32,uint64,int8,int16,int32,int64,float32,timestamp,float64
0,True,0,0,0,0,0,0,0,0,0.0,2025-01-01,0.0
1,False,1,1,1,1,1,1,1,1,1.0,2025-01-02,1.0
2,True,2,2,2,2,2,2,2,2,2.0,2025-01-03,2.0
3,False,3,3,3,3,3,3,3,3,3.0,2025-01-04,3.0
4,True,4,4,4,4,4,4,4,4,4.0,2025-01-05,4.0


Index columns are correctly mapped to the Pandas index.

In [27]:
lib.write(sym, table, index_column="timestamp")
df = lib.head(sym, output_format=OutputFormat.PANDAS).data
print(df.index)
print(df.index.is_monotonic_increasing)
df

DatetimeIndex(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04',
               '2025-01-05'],
              dtype='datetime64[ns]', name='timestamp', freq=None)
True


timestamp,bool,uint8,uint16,uint32,uint64,int8,int16,int32,int64,float32,float64
2025-01-01,True,0,0,0,0,0,0,0,0,0.0,0.0
2025-01-02,False,1,1,1,1,1,1,1,1,1.0,1.0
2025-01-03,True,2,2,2,2,2,2,2,2,2.0,2.0
2025-01-04,False,3,3,3,3,3,3,3,3,3.0,3.0
2025-01-05,True,4,4,4,4,4,4,4,4,4.0,4.0


## Arrow tables cannot yet be appended to, or used to update, data written as Pandas dataframes, or vice versa. This will be fixed in a future release.

In [28]:
lib.write(sym, pd.DataFrame({"col": np.arange(10, dtype=np.int64)}))
try:
    lib.append(sym, pa.table({"col": pa.array(np.arange(10, 20, dtype=np.int64), pa.int64())}))
except Exception as e:
    print(e)

E_INCOMPATIBLE_OBJECTS Append can be performed only on objects of the same type. Existing type is DataFrame new type is Arrow Table.


## Specifying an index column with Pandas data has no effect

In [29]:
lib.write(sym, pd.DataFrame({"col": np.arange(10, dtype=np.int64), "timestamp": pd.date_range("2025-01-01", periods=10)}), index_column="timestamp")
lib.head(sym, output_format=OutputFormat.PANDAS).data

,col,timestamp
0,0,2025-01-01
1,1,2025-01-02
2,2,2025-01-03
3,3,2025-01-04
4,4,2025-01-05


## For batch methods, the `WritePayload` and `UpdatePayload` classes have additional `index_column` fields

In [30]:
payload_0 = WritePayload("sym0", table_0, index_column="timestamp")
payload_1 = WritePayload("sym1", table_1, index_column="timestamp")
lib.write_batch([payload_0, payload_1])

[VersionedItem(symbol='sym0', library='test_lib', data=n/a, version=0, metadata=None, host='LMDB(path=/data/team/data/arctic_native/examples/arrow-writes-demo)', timestamp=1760691321671797922),
 VersionedItem(symbol='sym1', library='test_lib', data=n/a, version=0, metadata=None, host='LMDB(path=/data/team/data/arctic_native/examples/arrow-writes-demo)', timestamp=1760691321667000359)]

# Limitations

### String columns not yet supported

In [31]:
try:
    lib.write(sym, pa.table({"col": pa.array(["hello", "bonjour"], pa.string())}))
except Exception as e:
    print(e)

E_UNSUPPORTED_COLUMN_TYPE Unsupported Arrow data type provided `u`


### Sparse columns not yet supported

In [32]:
try:
    lib.write(sym, pa.table({"col": pa.array([1, None], pa.int64())}))
except Exception as e:
    print(e)

E_UNSUPPORTED_COLUMN_TYPE Column 'col' contains null values, which are not currently supported


### Non-nanosecond timestamp columns not yet supported

In [33]:
try:
    lib.write(sym, pa.table({"col": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=10), type=pa.timestamp("ms"))}))
except Exception as e:
    print(e)

E_UNSUPPORTED_COLUMN_TYPE Unsupported Arrow data type provided `tsm:`


### Non-UTC nanosecond timestamp columns are returned as UTC, losing the timezone information. This will be fixed in a future release.

In [34]:
lib.write(
    sym,
    pa.table(
        {
            "col": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=5, tz="Europe/Amsterdam"), type=pa.timestamp("ns", tz="Europe/Amsterdam"))
        }
    )
)
print_table(lib.read(sym).data)

shape: (5, 1)
┌─────────────────────┐
│ col                 │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2024-12-31 23:00:00 │
│ 2025-01-01 23:00:00 │
│ 2025-01-02 23:00:00 │
│ 2025-01-03 23:00:00 │
│ 2025-01-04 23:00:00 │
└─────────────────────┘
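
The values above are the same instants expressed in UTC, so the local times can be recovered as long as the caller remembers the original timezone (for example alongside the data, which is an assumption of this sketch rather than something ArcticDB does today). The pandas-only example below reproduces the shift and then reverses it.

```python
import pandas as pd

local = pd.date_range("2025-01-01", periods=5, tz="Europe/Amsterdam")

# What the read above displays: the same instants in UTC, without a timezone.
as_read = local.tz_convert("UTC").tz_localize(None)
print(as_read[0])

# Reattach UTC, then convert back to the timezone remembered by the caller.
restored = as_read.tz_localize("UTC").tz_convert("Europe/Amsterdam")
print(restored[0])
```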


### Lower-level `pyarrow` primitives such as `Array`, `RecordBatch`, and `ChunkedArray` cannot yet be normalized

In [35]:
try:
    lib.write(sym, pa.array([0, 1], pa.int64()))
except Exception as e:
    print(e)

data is of a type that cannot be normalized. Consider using write_pickle instead. type(data)=[<class 'pyarrow.lib.Int64Array'>]


### Calling `update` with a `date_range` that overlaps the index of the provided table is not yet supported (non-overlapping is fine)

In [36]:
table = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=5), type=pa.timestamp("ns")),
        "col": pa.array([0, 1, 2, 3, 4], pa.int64()),
    }
)
try:
    lib.update(sym, table, date_range=(pd.Timestamp("2025-01-02"), pd.Timestamp("2025-01-04")), index_column="timestamp")
except Exception as e:
    print(e)

update with date_range and pyarrow Table not yet supported with date_range overlapping the data


### Processing operations (`QueryBuilder`) on data written with Arrow are not yet tested and may have unexpected behaviour

### We also do not support writing dictionary-encoded data (similar to Pandas categoricals) right now