# ArcticDB Experimental Arrow Writing Dense String Data Demo

### This notebook demonstrates the next cut of writing PyArrow tables directly into ArcticDB. This is still an experimental feature: both the API and its behaviour are subject to change in minor or patch releases, and it must not be deployed to production environments under any circumstances. Many rough edges remain, several of which are highlighted below, and these will be addressed in future releases. This cut adds support for writing string columns.

### Performance-wise, writing string data from Arrow tables is never slower than writing it from Pandas, and can be up to 2x faster depending on the table size, the number of unique strings in the data, and the number of cores involved.

### Please read the notebook for numeric data before this one. Functionality covered there that behaves in the same way for string columns is not repeated here.

## Preamble

In [1]:
import pandas as pd
import polars as pl
import pyarrow as pa
from arcticdb import Arctic, OutputFormat, QueryBuilder, WritePayload
from arcticdb.util.arrow import stringify_dictionary_encoded_columns

In [2]:
arctic = Arctic("lmdb://arrow-writes-demo")

In [3]:
lib = arctic.get_library("test_lib", output_format=OutputFormat.EXPERIMENTAL_ARROW, create_if_missing=True)
lib._nvs._set_allow_arrow_input(True)

In [4]:
lib._nvs.version_store.clear()

In [5]:
sym = "test"

In [6]:
def print_table(table):
    print(pl.from_arrow(table))

## Write some string data of both supported string types

PyArrow uses the `string` type by default. This is sufficient when the total UTF-8 byte length of all of the strings in a column (without deduplication) is less than 2GB. Beyond that, the `large_string` type is required, which supports total string lengths up to 8 exabytes, which is probably enough.
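
The 2GB figure follows from `string` storing its offsets as signed 32-bit integers, so the concatenated bytes of one array cannot exceed 2^31 - 1. The pure-Python sketch below estimates whether a set of values would fit. `utf8_size` and `needs_large_string` are hypothetical helpers written for illustration and are not part of PyArrow or ArcticDB.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit `string` array can hold


def utf8_size(values):
    """Total UTF-8 byte length of the non-null strings, without deduplication."""
    return sum(len(v.encode("utf-8")) for v in values if v is not None)


def needs_large_string(values, limit=INT32_MAX):
    """Hypothetical helper: True if the data would overflow 32-bit offsets."""
    return utf8_size(values) > limit


column = ["these", "are", "some", "strings"]
print(utf8_size(column))           # 19
print(needs_large_string(column))  # False
```

The `limit` parameter exists only so the check can be exercised with small inputs.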

In [7]:
table_0 = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=4), type=pa.timestamp("ns")),
        "strings 1": pa.array(["these", "are", "some", "strings"], pa.string()),
        "strings 2": pa.array(["here", "are", "some", "more"], pa.large_string()),
    }
)
print_table(table_0)

shape: (4, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ str       ┆ str       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-01 00:00:00 ┆ these     ┆ here      │
│ 2025-01-02 00:00:00 ┆ are       ┆ are       │
│ 2025-01-03 00:00:00 ┆ some      ┆ some      │
│ 2025-01-04 00:00:00 ┆ strings   ┆ more      │
└─────────────────────┴───────────┴───────────┘


In [8]:
lib.write(sym, table_0, index_column="timestamp")

VersionedItem(symbol='test', library='test_lib', data=n/a, version=0, metadata=None, host='LMDB(path=/data/team/data/arctic_native/examples/arrow-writes-demo)', timestamp=1761570730696143527)

In [9]:
received_table = lib.read(sym).data
print_table(received_table)

shape: (4, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ cat       ┆ cat       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-01 00:00:00 ┆ these     ┆ here      │
│ 2025-01-02 00:00:00 ┆ are       ┆ are       │
│ 2025-01-03 00:00:00 ┆ some      ┆ some      │
│ 2025-01-04 00:00:00 ┆ strings   ┆ more      │
└─────────────────────┴───────────┴───────────┘


## Note that in this release string columns are still dictionary-encoded in the output (`cat`, short for categorical in Polars). This will be fixed in a future release, but for now it means that string data read out of ArcticDB cannot be immediately written back without a conversion.

In [10]:
try:
    lib.write(sym, received_table)
except Exception as e:
    print(e)

E_UNSUPPORTED_COLUMN_TYPE Dictionary-encoded Arrow data unsupported


In [11]:
converted_table = stringify_dictionary_encoded_columns(received_table)
lib.write(sym, converted_table)

VersionedItem(symbol='test', library='test_lib', data=n/a, version=1, metadata=None, host='LMDB(path=/data/team/data/arctic_native/examples/arrow-writes-demo)', timestamp=1761570746629560088)

## Strings in Arrow are internally represented as UTF-8, so the full Unicode character set is available

In [12]:
table_0 = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-01-01", periods=4), type=pa.timestamp("ns")),
        "strings 1": pa.array(["🔄", "🙈", "🙉", "🙊"], pa.string()),
        "strings 2": pa.array(["€", "©", "Ö", "â"], pa.large_string()),
    }
)
print_table(table_0)

shape: (4, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ str       ┆ str       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-01 00:00:00 ┆ 🔄        ┆ €         │
│ 2025-01-02 00:00:00 ┆ 🙈        ┆ ©         │
│ 2025-01-03 00:00:00 ┆ 🙉        ┆ Ö         │
│ 2025-01-04 00:00:00 ┆ 🙊        ┆ â         │
└─────────────────────┴───────────┴───────────┘


In [13]:
lib.write(sym, table_0, index_column="timestamp")
print_table(lib.read(sym).data)

shape: (4, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ cat       ┆ cat       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-01 00:00:00 ┆ 🔄        ┆ €         │
│ 2025-01-02 00:00:00 ┆ 🙈        ┆ ©         │
│ 2025-01-03 00:00:00 ┆ 🙉        ┆ Ö         │
│ 2025-01-04 00:00:00 ┆ 🙊        ┆ â         │
└─────────────────────┴───────────┴───────────┘
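
One practical consequence of UTF-8 storage is that the size of a string is measured in bytes, not characters. The short standard-library sketch below shows how the two differ for a few sample characters chosen for illustration.

```python
samples = ["a", "©", "€", "🔄"]

for text in samples:
    encoded = text.encode("utf-8")
    # Each sample is one character but occupies a different number of bytes.
    print(repr(text), len(text), len(encoded))

byte_lengths = [len(s.encode("utf-8")) for s in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```

This matters when estimating whether a column fits within the `string` type's size limit, since emoji-heavy data uses up to four bytes per character.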


## Append some data to this

Note that the `strings 1` column now has type `large_string`, and the `strings 2` column now has type `string`, the reverse of the types used in the `write` call. Both types are stored the same way internally, so this is allowed.

In [14]:
table_1 = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-01-05", periods=3), type=pa.timestamp("ns")),
        "strings 1": pa.array(["even", "more", "strings"], pa.large_string()),
        "strings 2": pa.array(["hello", "bonjour", "gutentag"], pa.string()),
    }
)
lib.append(sym, table_1, index_column="timestamp")
print_table(lib.read(sym).data)

shape: (7, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ cat       ┆ cat       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-01 00:00:00 ┆ 🔄        ┆ €         │
│ 2025-01-02 00:00:00 ┆ 🙈        ┆ ©         │
│ 2025-01-03 00:00:00 ┆ 🙉        ┆ Ö         │
│ 2025-01-04 00:00:00 ┆ 🙊        ┆ â         │
│ 2025-01-05 00:00:00 ┆ even      ┆ hello     │
│ 2025-01-06 00:00:00 ┆ more      ┆ bonjour   │
│ 2025-01-07 00:00:00 ┆ strings   ┆ gutentag  │
└─────────────────────┴───────────┴───────────┘


## `update` works in the same way

In [15]:
table_2 = pa.table(
    {
        "timestamp": pa.Array.from_pandas(pd.date_range("2025-01-04", periods=2), type=pa.timestamp("ns")),
        "strings 1": pa.array(["replacement", "strings"]),
        "strings 2": pa.array(["goodbye", "au revoir"]),
    }
)
print_table(table_2)

shape: (2, 3)
┌─────────────────────┬─────────────┬───────────┐
│ timestamp           ┆ strings 1   ┆ strings 2 │
│ ---                 ┆ ---         ┆ ---       │
│ datetime[ns]        ┆ str         ┆ str       │
╞═════════════════════╪═════════════╪═══════════╡
│ 2025-01-04 00:00:00 ┆ replacement ┆ goodbye   │
│ 2025-01-05 00:00:00 ┆ strings     ┆ au revoir │
└─────────────────────┴─────────────┴───────────┘


In [16]:
lib.update(sym, table_2, index_column="timestamp")
print_table(lib.read(sym).data)

shape: (7, 3)
┌─────────────────────┬─────────────┬───────────┐
│ timestamp           ┆ strings 1   ┆ strings 2 │
│ ---                 ┆ ---         ┆ ---       │
│ datetime[ns]        ┆ cat         ┆ cat       │
╞═════════════════════╪═════════════╪═══════════╡
│ 2025-01-01 00:00:00 ┆ 🔄          ┆ €         │
│ 2025-01-02 00:00:00 ┆ 🙈          ┆ ©         │
│ 2025-01-03 00:00:00 ┆ 🙉          ┆ Ö         │
│ 2025-01-04 00:00:00 ┆ replacement ┆ goodbye   │
│ 2025-01-05 00:00:00 ┆ strings     ┆ au revoir │
│ 2025-01-06 00:00:00 ┆ more        ┆ bonjour   │
│ 2025-01-07 00:00:00 ┆ strings     ┆ gutentag  │
└─────────────────────┴─────────────┴───────────┘
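
The result above is consistent with `update` replacing every existing row whose index falls between the first and last index values of the new data. The toy model below reproduces that behaviour on plain dictionaries. `update_rows` is a hypothetical helper for illustration only and says nothing about how ArcticDB implements the operation.

```python
from datetime import date


def update_rows(existing, new):
    """Toy model of a date-range update on {index: row} mappings."""
    start, end = min(new), max(new)
    # Drop every existing row inside the range covered by the new data.
    kept = {ts: row for ts, row in existing.items() if not (start <= ts <= end)}
    return dict(sorted({**kept, **new}.items()))


existing = {date(2025, 1, d): f"old-{d}" for d in range(1, 8)}
new = {date(2025, 1, 4): "replacement", date(2025, 1, 5): "strings"}
result = update_rows(existing, new)
print(list(result.values()))
# ['old-1', 'old-2', 'old-3', 'replacement', 'strings', 'old-6', 'old-7']
```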


## Writing views of data works as you would expect

In [17]:
table_view = table_0.slice(1, 2)
print_table(table_view)

shape: (2, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ str       ┆ str       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-02 00:00:00 ┆ 🙈        ┆ ©         │
│ 2025-01-03 00:00:00 ┆ 🙉        ┆ Ö         │
└─────────────────────┴───────────┴───────────┘


In [18]:
lib.write(sym, table_view)
print_table(lib.read(sym).data)

shape: (2, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ cat       ┆ cat       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-02 00:00:00 ┆ 🙈        ┆ ©         │
│ 2025-01-03 00:00:00 ┆ 🙉        ┆ Ö         │
└─────────────────────┴───────────┴───────────┘


## Staging (AKA incomplete) APIs also work as expected

In [19]:
# Equivalent to calling write with parallel=True, write with incomplete=True, or append with incomplete=True
lib.stage(sym, table_0, index_column="timestamp")
lib.stage(sym, table_1, index_column="timestamp")
lib.finalize_staged_data(sym)
print_table(lib.read(sym).data)

shape: (7, 3)
┌─────────────────────┬───────────┬───────────┐
│ timestamp           ┆ strings 1 ┆ strings 2 │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ cat       ┆ cat       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2025-01-01 00:00:00 ┆ 🔄        ┆ €         │
│ 2025-01-02 00:00:00 ┆ 🙈        ┆ ©         │
│ 2025-01-03 00:00:00 ┆ 🙉        ┆ Ö         │
│ 2025-01-04 00:00:00 ┆ 🙊        ┆ â         │
│ 2025-01-05 00:00:00 ┆ even      ┆ hello     │
│ 2025-01-06 00:00:00 ┆ more      ┆ bonjour   │
│ 2025-01-07 00:00:00 ┆ strings   ┆ gutentag  │
└─────────────────────┴───────────┴───────────┘


# Pandas interoperability

## Data written as Arrow can be read as Pandas

In [20]:
lib.read(sym, output_format=OutputFormat.PANDAS).data

timestamp,strings 1,strings 2
2025-01-01,🔄,€
2025-01-02,🙈,©
2025-01-03,🙉,Ö
2025-01-04,🙊,â
2025-01-05,even,hello
2025-01-06,more,bonjour
2025-01-07,strings,gutentag
