# Table Utilities

This notebook demonstrates the table utility functions from `specparser.amt.table`.

## Table of Contents

1. [Table Orientations](#1.-Table-Orientations) - Understanding the 4 table types
2. [Column Operations](#2.-Column-Operations) - `table_column`, `table_select_columns`, `table_add_column`, `table_drop_columns`
3. [Row Operations](#3.-Row-Operations) - `table_bind_rows`, `table_unique_rows`, `table_head`, `table_sample`
4. [Value Operations](#4.-Value-Operations) - `table_replace_value`
5. [Multi-Table Operations](#5.-Multi-Table-Operations) - `table_stack_cols`
   - [5b. Key-Based Joins](#5b.-Key-Based-Joins) - `table_left_join`, `table_inner_join`
6. [Reshape Operations](#6.-Reshape-Operations) - `table_unchop`, `table_chop`
7. [Display/Output](#7.-Display/Output) - `format_table`, `print_table`, `show_table`
8. [Arrow Tables](#8.-Arrow-Tables-(PyArrow-backed)) - PyArrow-backed tables for large data
9. [Join Benchmarks](#9.-Join-Benchmarks) - Performance comparison: row vs arrow joins
10. [Arrow Compute Functions](#10.-Arrow-Compute-Functions) - Vectorized operations
11. [u8m Tables](#11.-u8m-Tables-(uint8-Matrix-Columns)) - High-performance string tables

---

## Setup

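The following cell imports the table functions used in sections 1 through 7. The Arrow helpers (`table_to_arrow`, `table_orientation`, and others) are imported separately in section 8.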

In [None]:
# Setup: Imports and helper function
import pandas as pd
from specparser.amt import (
    table_to_columns,
    table_to_rows,
    table_column,
    table_select_columns,
    table_add_column,
    table_drop_columns,
    table_replace_value,
    table_bind_rows,
    table_unique_rows,
    table_head,
    table_sample,
    table_stack_cols,
    table_left_join,
    table_inner_join,
    table_unchop,
    table_chop,
    format_table,
    print_table,
    show_table,
)

In [2]:
# Sample data: a small stocks table
stocks = {
    "orientation": "row",
    "columns": ["ticker", "name", "sector", "price"],
    "rows": [
        ["AAPL", "Apple", "Tech", 150.0],
        ["GOOGL", "Alphabet", "Tech", 140.0],
        ["JPM", "JPMorgan", "Finance", 180.0],
        ["XOM", "Exxon", "Energy", 100.0],
        ["MSFT", "Microsoft", "Tech", 380.0],
    ]
}
show_table(stocks)

Unnamed: 0,ticker,name,sector,price
0,AAPL,Apple,Tech,150.0
1,GOOGL,Alphabet,Tech,140.0
2,JPM,JPMorgan,Finance,180.0
3,XOM,Exxon,Energy,100.0
4,MSFT,Microsoft,Tech,380.0


---
## 1. Table Orientations

Tables can be stored in four orientations:

| Orientation | Storage Format | Best For |
|-------------|----------------|----------|
| **row** | `rows` is list of row lists | Small tables, row-wise operations |
| **column** | `rows` is list of column lists | Column-wise operations, aggregations |
| **arrow** | `rows` is list of PyArrow arrays | Large data, vectorized compute, joins |
| **u8m** | `rows` is list of uint8 matrices | High-performance string data with Numba |

### Orientation Examples

```python
# Row-oriented: rows = [[row0], [row1], ...]
{"orientation": "row", "columns": ["a", "b"], "rows": [[1, 2], [3, 4]]}

# Column-oriented: rows = [[col0_values], [col1_values], ...]
{"orientation": "column", "columns": ["a", "b"], "rows": [[1, 3], [2, 4]]}

# Arrow-oriented: rows = [pa.array, pa.array, ...]
{"orientation": "arrow", "columns": ["a", "b"], "rows": [pa.array([1, 3]), pa.array([2, 4])]}

# u8m-oriented: rows = [uint8_matrix, uint8_matrix, ...]
{"orientation": "u8m", "columns": ["asset", "date"], "rows": [np.array([[65,65], [66,66]], dtype=np.uint8), ...]}
```

### Conversion Rules

The table below shows what conversions are available:

| From \ To | row | column | arrow | u8m |
|------------|-----|--------|-------|-----|
| **row** | ✓ | ✓ | ✓ | ✗ |
| **column** | ✓ | ✓ | ✓ | ✗ |
| **arrow** | ✓ | ✓ | ✓ | ✗ |
| **u8m** | ✓* | ✓* | ✗ | ✓ |

*u8m converts to column via `table_to_columns()`, which decodes the uint8 matrices into Python strings using `u8m2s()`. `table_to_rows()` does not accept u8m directly, so convert to column first and then to row.
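
The row and column orientations are transposes of each other, so converting between them amounts to a `zip`. The following is a minimal pure-Python sketch of that idea. The helpers `rows_to_columns` and `columns_to_rows` are hypothetical names used for illustration only; they are not part of `specparser.amt`, whose implementation may differ.

```python
def rows_to_columns(table):
    """Transpose a row-oriented table dict into a column-oriented one."""
    ncols = len(table["columns"])
    if table["rows"]:
        cols = [list(col) for col in zip(*table["rows"])]
    else:
        cols = [[] for _ in range(ncols)]  # keep one empty list per column
    return {"orientation": "column", "columns": list(table["columns"]), "rows": cols}


def columns_to_rows(table):
    """Transpose a column-oriented table dict back into a list of rows."""
    rows = [list(row) for row in zip(*table["rows"])]
    return {"orientation": "row", "columns": list(table["columns"]), "rows": rows}


demo = {"orientation": "row", "columns": ["a", "b"], "rows": [[1, 2], [3, 4]]}
as_columns = rows_to_columns(demo)
print(as_columns["rows"])                   # [[1, 3], [2, 4]]
print(columns_to_rows(as_columns)["rows"])  # [[1, 2], [3, 4]]
```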

### Function Orientation Support

Most functions accept multiple orientations and either preserve the input orientation or convert to a specific output:

| Pattern | Behavior |
|---------|----------|
| **Preserves orientation** | Output has same orientation as input |
| **Converts to X** | Always outputs orientation X |
| **Arrow-optimized** | Uses PyArrow compute, converts to arrow |

### `table_to_columns(table)`

Convert any table to column-oriented.

**Orientation Support:**
- ✓ row → column
- ✓ column → column (no-op)
- ✓ arrow → column
- ✓ u8m → column (converts uint8 matrices to Python strings)

In [3]:
# Convert to column-oriented
stocks_col = table_to_columns(stocks)

print(f"Orientation: {stocks_col['orientation']}")
print(f"Columns: {stocks_col['columns']}")
print(f"\nData (each list is a column):")
for name, col in zip(stocks_col['columns'], stocks_col['rows']):
    print(f"  {name}: {col}")

Orientation: column
Columns: ['ticker', 'name', 'sector', 'price']

Data (each list is a column):
  ticker: ['AAPL', 'GOOGL', 'JPM', 'XOM', 'MSFT']
  name: ['Apple', 'Alphabet', 'JPMorgan', 'Exxon', 'Microsoft']
  sector: ['Tech', 'Tech', 'Finance', 'Energy', 'Tech']
  price: [150.0, 140.0, 180.0, 100.0, 380.0]


### `table_to_rows(table)`

Convert a column-oriented or arrow-oriented table to row-oriented.

**Orientation Support:**
- ✓ row → row (no-op)
- ✓ column → row
- ✓ arrow → row
- ✗ u8m (use `table_to_columns` first)

In [4]:
# Convert back to row-oriented
stocks_row = table_to_rows(stocks_col)

print(f"Orientation: {stocks_row['orientation']}")
print(f"Columns: {stocks_row['columns']}")
print(f"\nData (each list is a row):")
for row in stocks_row['rows']:
    print(f"  {row}")

Orientation: row
Columns: ['ticker', 'name', 'sector', 'price']

Data (each list is a row):
  ['AAPL', 'Apple', 'Tech', 150.0]
  ['GOOGL', 'Alphabet', 'Tech', 140.0]
  ['JPM', 'JPMorgan', 'Finance', 180.0]
  ['XOM', 'Exxon', 'Energy', 100.0]
  ['MSFT', 'Microsoft', 'Tech', 380.0]


In [5]:
# Roundtrip preserves data
assert table_to_rows(table_to_columns(stocks))["rows"] == stocks["rows"]
print("Roundtrip successful!")

Roundtrip successful!


---
## 2. Column Operations

The following functions work with columns. Each function shows its orientation support:
- ✓ = supported
- ✗ = not supported
- → = output orientation

### `table_column(table, colname)`

Extract a single column as a list (or PyArrow array for arrow tables, or uint8 matrix for u8m tables).

**Orientation Support:**
- ✓ row → list
- ✓ column → list
- ✓ arrow → pa.Array
- ✓ u8m → uint8 matrix

In [6]:
# Extract the 'sector' column
sectors = table_column(stocks, "sector")
print(f"Sectors: {sectors}")

Sectors: ['Tech', 'Tech', 'Finance', 'Energy', 'Tech']


In [7]:
# Error case: column not found
try:
    table_column(stocks, "missing_column")
except ValueError as e:
    print(f"Error: {e}")

Error: Column 'missing_column' not found in table columns: ['ticker', 'name', 'sector', 'price']


### `table_select_columns(table, columns)`

Select and reorder columns. Preserves input orientation.

**Orientation Support:**
- ✓ row → row
- ✓ column → column
- ✓ arrow → arrow
- ✗ u8m

In [8]:
# Select only name, ticker, and price (reordered)
selected = table_select_columns(stocks, ["name", "ticker", "price"])
show_table(selected)

Unnamed: 0,name,ticker,price
0,Apple,AAPL,150.0
1,Alphabet,GOOGL,140.0
2,JPMorgan,JPM,180.0
3,Exxon,XOM,100.0
4,Microsoft,MSFT,380.0


### `table_add_column(table, colname, value, position)`

Add a new column with a constant or list value. Preserves input orientation.

**Orientation Support:**
- ✓ row → row
- ✓ column → column
- ✓ arrow → arrow
- ✗ u8m

In [9]:
# Add 'currency' column at the end
with_currency = table_add_column(stocks, "currency", value="USD")
show_table(with_currency)

Unnamed: 0,ticker,name,sector,price,currency
0,AAPL,Apple,Tech,150.0,USD
1,GOOGL,Alphabet,Tech,140.0,USD
2,JPM,JPMorgan,Finance,180.0,USD
3,XOM,Exxon,Energy,100.0,USD
4,MSFT,Microsoft,Tech,380.0,USD


In [10]:
# Add 'rank' column at position 0
with_rank = table_add_column(stocks, "rank", value=0, position=0)
show_table(with_rank)

Unnamed: 0,rank,ticker,name,sector,price
0,0,AAPL,Apple,Tech,150.0
1,0,GOOGL,Alphabet,Tech,140.0
2,0,JPM,JPMorgan,Finance,180.0
3,0,XOM,Exxon,Energy,100.0
4,0,MSFT,Microsoft,Tech,380.0


### `table_drop_columns(table, columns)`

Remove specified columns. Preserves input orientation.

**Orientation Support:**
- ✓ row → row
- ✓ column → column
- ✓ arrow → arrow
- ✗ u8m

In [11]:
# Drop 'sector' column
without_sector = table_drop_columns(stocks, ["sector"])
show_table(without_sector)

Unnamed: 0,ticker,name,price
0,AAPL,Apple,150.0
1,GOOGL,Alphabet,140.0
2,JPM,JPMorgan,180.0
3,XOM,Exxon,100.0
4,MSFT,Microsoft,380.0


In [12]:
# Drop multiple columns
minimal = table_drop_columns(stocks, ["name", "sector"])
show_table(minimal)

Unnamed: 0,ticker,price
0,AAPL,150.0
1,GOOGL,140.0
2,JPM,180.0
3,XOM,100.0
4,MSFT,380.0


---
## 3. Row Operations

Row operations work on table rows. Most preserve the input orientation; `table_unique_rows` always returns a row-oriented table, and none of them accept u8m tables.

### `table_bind_rows(*tables)`

Concatenate rows from multiple tables (must have same columns). Uses first table's orientation.

**Orientation Support:**
- ✓ row + row → row
- ✓ column + column → column
- ✓ arrow + arrow → arrow
- ✓ mixed → first table's orientation
- ✗ u8m

In [13]:
# Create a second table with more stocks
more_stocks = {
    "orientation": "row",
    "columns": ["ticker", "name", "sector", "price"],
    "rows": [
        ["AMZN", "Amazon", "Tech", 180.0],
        ["WMT", "Walmart", "Retail", 160.0],
    ]
}

# Bind rows together
combined = table_bind_rows(stocks, more_stocks)
show_table(combined)

Unnamed: 0,ticker,name,sector,price
0,AAPL,Apple,Tech,150.0
1,GOOGL,Alphabet,Tech,140.0
2,JPM,JPMorgan,Finance,180.0
3,XOM,Exxon,Energy,100.0
4,MSFT,Microsoft,Tech,380.0
5,AMZN,Amazon,Tech,180.0
6,WMT,Walmart,Retail,160.0


In [14]:
# Error case: column mismatch
bad_table = {
    "orientation": "row",
    "columns": ["ticker", "different_column"],
    "rows": [["TEST", 100]]
}

try:
    table_bind_rows(stocks, bad_table)
except ValueError as e:
    print(f"Error: {e}")

Error: Table 2 columns ['ticker', 'different_column'] != first table columns ['ticker', 'name', 'sector', 'price']
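
The column check and concatenation can be sketched in pure Python for the row-oriented case. `bind_rows_sketch` is a hypothetical helper written for illustration, not the library's implementation; it only mirrors the behavior shown above (matching columns are required, and a mismatch raises `ValueError`).

```python
def bind_rows_sketch(*tables):
    """Concatenate row-oriented tables after checking that columns match."""
    first = tables[0]
    out_rows = []
    for i, t in enumerate(tables, start=1):
        if t["columns"] != first["columns"]:
            raise ValueError(
                f"Table {i} columns {t['columns']} != first table columns {first['columns']}"
            )
        out_rows.extend(list(r) for r in t["rows"])  # copy rows, leave inputs untouched
    return {"orientation": "row", "columns": list(first["columns"]), "rows": out_rows}


tech = {"orientation": "row", "columns": ["ticker", "price"], "rows": [["AAPL", 150.0]]}
retail = {"orientation": "row", "columns": ["ticker", "price"], "rows": [["WMT", 160.0]]}
combined_sketch = bind_rows_sketch(tech, retail)
print(combined_sketch["rows"])  # [['AAPL', 150.0], ['WMT', 160.0]]
```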


### `table_unique_rows(table)`

Remove duplicate rows. Converts to row orientation.

**Orientation Support:**
- ✓ row → row
- ✓ column → row (converts)
- ✓ arrow → row (converts)
- ✗ u8m

In [15]:
# Table with duplicates
with_dupes = {
    "orientation": "row",
    "columns": ["ticker", "price"],
    "rows": [
        ["AAPL", 150],
        ["GOOGL", 140],
        ["AAPL", 150],  # duplicate
        ["MSFT", 380],
        ["GOOGL", 140],  # duplicate
    ]
}

unique = table_unique_rows(with_dupes)
show_table(unique)

Unnamed: 0,ticker,price
0,AAPL,150
1,GOOGL,140
2,MSFT,380
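
The output above keeps the first occurrence of each row in its original order. One common way to get that behavior is a "seen" set keyed on row tuples. The sketch below assumes all cell values are hashable; `unique_rows_sketch` is a hypothetical name, and the library may deduplicate differently.

```python
def unique_rows_sketch(table):
    """Drop duplicate rows, keeping the first occurrence of each."""
    seen = set()
    out = []
    for row in table["rows"]:
        key = tuple(row)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            out.append(list(row))
    return {"orientation": "row", "columns": list(table["columns"]), "rows": out}


dupes = {
    "orientation": "row",
    "columns": ["ticker", "price"],
    "rows": [["AAPL", 150], ["GOOGL", 140], ["AAPL", 150], ["MSFT", 380], ["GOOGL", 140]],
}
deduped = unique_rows_sketch(dupes)
print(deduped["rows"])  # [['AAPL', 150], ['GOOGL', 140], ['MSFT', 380]]
```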


### `table_head(table, n)`

Return the first n rows. Preserves input orientation.

**Orientation Support:**
- ✓ row → row
- ✓ column → column
- ✓ arrow → arrow
- ✗ u8m

In [16]:
# First 3 rows
first_three = table_head(stocks, 3)
show_table(first_three)

Unnamed: 0,ticker,name,sector,price
0,AAPL,Apple,Tech,150.0
1,GOOGL,Alphabet,Tech,140.0
2,JPM,JPMorgan,Finance,180.0


### `table_sample(table, n)`

Return a random sample of n rows. Preserves input orientation.

**Orientation Support:**
- ✓ row → row
- ✓ column → column
- ✓ arrow → arrow
- ✗ u8m

In [17]:
# Random sample of 2 rows (run multiple times to see different results)
sampled = table_sample(stocks, 2)
show_table(sampled)

Unnamed: 0,ticker,name,sector,price
0,XOM,Exxon,Energy,100.0
1,AAPL,Apple,Tech,150.0
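
Because the sample is random, the output above changes from run to run. This notebook does not show whether `table_sample` accepts a seed, so the sketch below illustrates reproducible sampling with a hypothetical helper, `sample_rows_sketch`, built on the standard library's `random.Random`.

```python
import random


def sample_rows_sketch(table, n, seed=None):
    """Return n rows sampled without replacement from a row-oriented table."""
    rng = random.Random(seed)         # local RNG, leaves the global random state alone
    n = min(n, len(table["rows"]))    # never ask for more rows than exist
    picked = rng.sample(table["rows"], n)
    return {
        "orientation": "row",
        "columns": list(table["columns"]),
        "rows": [list(r) for r in picked],
    }


sample_demo = {
    "orientation": "row",
    "columns": ["ticker", "price"],
    "rows": [["AAPL", 150.0], ["GOOGL", 140.0], ["JPM", 180.0], ["XOM", 100.0], ["MSFT", 380.0]],
}
first_draw = sample_rows_sketch(sample_demo, 2, seed=42)
second_draw = sample_rows_sketch(sample_demo, 2, seed=42)
print(first_draw["rows"] == second_draw["rows"])  # True: same seed, same sample
```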


---
## 4. Value Operations

### `table_replace_value(table, colname, old_value, new_value)`

Replace occurrences of a value in a column. Uses PyArrow compute, converts to arrow.

**Orientation Support:**
- ✓ row → arrow (converts)
- ✓ column → arrow (converts)
- ✓ arrow → arrow
- ✗ u8m

In [18]:
# Replace "Tech" with "Technology" in the sector column
replaced = table_replace_value(stocks, "sector", "Tech", "Technology")
show_table(replaced)

Unnamed: 0,ticker,name,sector,price
0,AAPL,Apple,Technology,150.0
1,GOOGL,Alphabet,Technology,140.0
2,JPM,JPMorgan,Finance,180.0
3,XOM,Exxon,Energy,100.0
4,MSFT,Microsoft,Technology,380.0


---
## 5. Multi-Table Operations

### `table_stack_cols(*tables, key_col, copy_data)`

Stack columns from multiple tables side-by-side (tables must be row-aligned). Uses first table's orientation.

**Orientation Support:**
- ✓ row + row → row
- ✓ column + column → column
- ✓ arrow + arrow → arrow
- ✓ mixed → first table's orientation
- ✗ u8m

In [19]:
# Create a metrics table (same tickers, same order)
metrics = {
    "orientation": "row",
    "columns": ["ticker", "pe_ratio", "dividend"],
    "rows": [
        ["AAPL", 28.5, 0.92],
        ["GOOGL", 25.0, 0.0],
        ["JPM", 10.2, 4.0],
        ["XOM", 12.1, 3.5],
        ["MSFT", 35.0, 2.72],
    ]
}

print("Stocks table:")
display(show_table(stocks))

print("\nMetrics table:")
display(show_table(metrics))

Stocks table:


Unnamed: 0,ticker,name,sector,price
0,AAPL,Apple,Tech,150.0
1,GOOGL,Alphabet,Tech,140.0
2,JPM,JPMorgan,Finance,180.0
3,XOM,Exxon,Energy,100.0
4,MSFT,Microsoft,Tech,380.0



Metrics table:


Unnamed: 0,ticker,pe_ratio,dividend
0,AAPL,28.5,0.92
1,GOOGL,25.0,0.0
2,JPM,10.2,4.0
3,XOM,12.1,3.5
4,MSFT,35.0,2.72


In [20]:
# Stack columns side-by-side (keeps key column from first table)
stacked = table_stack_cols(stocks, metrics, key_col=0)

print(f"Result orientation: {stacked['orientation']}")
show_table(stacked)

Result orientation: column


Unnamed: 0,ticker,name,sector,price,pe_ratio,dividend
0,AAPL,Apple,Tech,150.0,28.5,0.92
1,GOOGL,Alphabet,Tech,140.0,25.0,0.0
2,JPM,JPMorgan,Finance,180.0,10.2,4.0
3,XOM,Exxon,Energy,100.0,12.1,3.5
4,MSFT,Microsoft,Tech,380.0,35.0,2.72


### Removed: `table_join` (use `table_stack_cols` instead)

In [21]:
# table_join has been removed - use table_stack_cols instead
# stacked = table_stack_cols(stocks, metrics, key_col=0)

In [22]:
# Error case: row count mismatch
short_table = {
    "orientation": "row",
    "columns": ["ticker", "rating"],
    "rows": [
        ["AAPL", "Buy"],
        ["GOOGL", "Hold"],
    ]
}

try:
    table_stack_cols(stocks, short_table)  # 5 rows vs 2 rows
except ValueError as e:
    print(f"Error: {e}")

Error: Table 2 has 2 rows; expected 5
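
The side-by-side stacking and the row-count check can be sketched for row-oriented tables as follows. This is a hypothetical illustration, not the library's code: it assumes `key_col` is a positional index whose column is kept from the first table and dropped from the later ones, and it does not verify that the key values actually line up.

```python
def stack_cols_sketch(*tables, key_col=None):
    """Append the columns of later tables to the first; rows must be aligned."""
    first = tables[0]
    nrows = len(first["rows"])
    columns = list(first["columns"])
    rows = [list(r) for r in first["rows"]]
    for i, t in enumerate(tables[1:], start=2):
        if len(t["rows"]) != nrows:
            raise ValueError(f"Table {i} has {len(t['rows'])} rows; expected {nrows}")
        # Keep every column except the shared key column (assumed semantics).
        keep = [j for j in range(len(t["columns"])) if j != key_col]
        columns.extend(t["columns"][j] for j in keep)
        for out_row, src_row in zip(rows, t["rows"]):
            out_row.extend(src_row[j] for j in keep)
    return {"orientation": "row", "columns": columns, "rows": rows}


left_part = {"orientation": "row", "columns": ["ticker", "price"],
             "rows": [["AAPL", 150.0], ["JPM", 180.0]]}
right_part = {"orientation": "row", "columns": ["ticker", "pe_ratio"],
              "rows": [["AAPL", 28.5], ["JPM", 10.2]]}
stacked_sketch = stack_cols_sketch(left_part, right_part, key_col=0)
print(stacked_sketch["columns"])  # ['ticker', 'price', 'pe_ratio']
print(stacked_sketch["rows"])     # [['AAPL', 150.0, 28.5], ['JPM', 180.0, 10.2]]
```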


---
## 5b. Key-Based Joins

Unlike `table_stack_cols`, which requires row-aligned tables, the join functions match rows by key values.

### `table_left_join(left, right, left_on, right_on, suffixes)`

Left join two tables on key columns. All rows from the left table are kept; left rows without a match get `None` in the columns that come from the right table.

**Orientation Support:**
- ✓ row + row → row
- ✓ column + column → row (converts)
- ✓ arrow + arrow → arrow (uses PyArrow hash join)
- ✓ mixed → arrow if any arrow, else row
- ✗ u8m

In [23]:
# Create tables that are NOT row-aligned (different number of rows, different order)
prices = {
    "orientation": "row",
    "columns": ["ticker", "price"],
    "rows": [
        ["AAPL", 150.0],
        ["GOOGL", 140.0],
        ["AMZN", 180.0],  # not in ratings table
    ]
}

ratings = {
    "orientation": "row",
    "columns": ["ticker", "rating", "target"],
    "rows": [
        ["MSFT", "Buy", 400],
        ["AAPL", "Hold", 160],
        ["XOM", "Sell", 90],
    ]
}

print("Prices table:")
display(show_table(prices))

print("\nRatings table:")
display(show_table(ratings))

Prices table:


Unnamed: 0,ticker,price
0,AAPL,150.0
1,GOOGL,140.0
2,AMZN,180.0



Ratings table:


Unnamed: 0,ticker,rating,target
0,MSFT,Buy,400
1,AAPL,Hold,160
2,XOM,Sell,90


In [24]:
# Left join: all prices rows, matching ratings
result = table_left_join(prices, ratings, "ticker")
print("Left join (prices <- ratings):")
print("- All 3 price rows kept")
print("- AAPL gets rating data, GOOGL and AMZN get None")
show_table(result)

Left join (prices <- ratings):
- All 3 price rows kept
- AAPL gets rating data, GOOGL and AMZN get None


Unnamed: 0,ticker,price,rating,target
0,AAPL,150.0,Hold,160.0
1,GOOGL,140.0,,
2,AMZN,180.0,,


In [25]:
# Join with different key column names
assets = {
    "orientation": "row",
    "columns": ["asset_id", "name"],
    "rows": [["A1", "Asset One"], ["A2", "Asset Two"]]
}

values = {
    "orientation": "row",
    "columns": ["id", "value"],
    "rows": [["A1", 100], ["A3", 300]]
}

result = table_left_join(assets, values, left_on="asset_id", right_on="id")
print("Join with different key column names (asset_id <- id):")
show_table(result)

Join with different key column names (asset_id <- id):


Unnamed: 0,asset_id,name,value
0,A1,Asset One,100.0
1,A2,Asset Two,


In [26]:
# Handle column name conflicts with suffixes
table1 = {
    "orientation": "row",
    "columns": ["id", "value"],
    "rows": [[1, "left_val"]]
}

table2 = {
    "orientation": "row",
    "columns": ["id", "value"],
    "rows": [[1, "right_val"]]
}

result = table_left_join(table1, table2, "id", suffixes=("_left", "_right"))
print("Column conflict resolved with suffixes:")
show_table(result)

Column conflict resolved with suffixes:


Unnamed: 0,id,value_left,value_right
0,1,left_val,right_val


In [27]:
# Duplicate keys in right table: produces multiple output rows
left = {
    "orientation": "row",
    "columns": ["id", "name"],
    "rows": [[1, "Item"]]
}

right = {
    "orientation": "row",
    "columns": ["id", "tag"],
    "rows": [[1, "tag_a"], [1, "tag_b"], [1, "tag_c"]]
}

result = table_left_join(left, right, "id")
print("Duplicate keys in right: 1 left row -> 3 output rows")
show_table(result)

Duplicate keys in right: 1 left row -> 3 output rows


Unnamed: 0,id,name,tag
0,1,Item,tag_a
1,1,Item,tag_b
2,1,Item,tag_c
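
The left-join behaviors shown above (unmatched rows filled with `None`, duplicate right keys fanning out into several rows) follow naturally from a hash join. The sketch below is a simplified pure-Python version for row-oriented tables. `left_join_sketch` is a hypothetical helper; it omits suffix handling and is not the library's implementation.

```python
from collections import defaultdict


def left_join_sketch(left, right, left_on, right_on=None):
    """Hash-based left join of two row-oriented tables on a single key column."""
    right_on = right_on or left_on
    li = left["columns"].index(left_on)
    ri = right["columns"].index(right_on)
    right_keep = [j for j in range(len(right["columns"])) if j != ri]

    # Build phase: map each right key to the list of its non-key values.
    index = defaultdict(list)
    for row in right["rows"]:
        index[row[ri]].append([row[j] for j in right_keep])

    # Probe phase: every left row is emitted at least once.
    out = []
    for row in left["rows"]:
        matches = index.get(row[li])
        if matches:
            out.extend(list(row) + m for m in matches)
        else:
            out.append(list(row) + [None] * len(right_keep))

    columns = list(left["columns"]) + [right["columns"][j] for j in right_keep]
    return {"orientation": "row", "columns": columns, "rows": out}


prices_demo = {"orientation": "row", "columns": ["ticker", "price"],
               "rows": [["AAPL", 150.0], ["GOOGL", 140.0]]}
ratings_demo = {"orientation": "row", "columns": ["ticker", "rating"],
                "rows": [["MSFT", "Buy"], ["AAPL", "Hold"], ["AAPL", "Sell"]]}
joined_demo = left_join_sketch(prices_demo, ratings_demo, "ticker")
print(joined_demo["rows"])
# [['AAPL', 150.0, 'Hold'], ['AAPL', 150.0, 'Sell'], ['GOOGL', 140.0, None]]
```

Dropping the `else` branch turns the same loop into an inner join, since unmatched left rows are then skipped.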


### `table_inner_join(left, right, left_on, right_on, suffixes)`

Inner join two tables on key columns. Only matching rows are kept.

**Orientation Support:**
- ✓ row + row → row
- ✓ column + column → row (converts)
- ✓ arrow + arrow → arrow (uses PyArrow hash join)
- ✓ mixed → arrow if any arrow, else row
- ✗ u8m

In [28]:
# Inner join: only matching rows
result = table_inner_join(prices, ratings, "ticker")
print("Inner join (prices & ratings):")
print("- Only AAPL exists in both tables")
show_table(result)

Inner join (prices & ratings):
- Only AAPL exists in both tables


Unnamed: 0,ticker,price,rating,target
0,AAPL,150.0,Hold,160


In [29]:
# Inner join with no matches returns empty table
no_match_left = {
    "orientation": "row",
    "columns": ["id", "a"],
    "rows": [[1, "x"], [2, "y"]]
}

no_match_right = {
    "orientation": "row",
    "columns": ["id", "b"],
    "rows": [[3, "p"], [4, "q"]]
}

result = table_inner_join(no_match_left, no_match_right, "id")
print("Inner join with no matching keys:")
print(f"Result has {len(result['rows'])} rows")
show_table(result)

Inner join with no matching keys:
Result has 0 rows


Unnamed: 0,id,a,b


---
## 6. Reshape Operations

Reshape operations change the structure of tables.

### `table_unchop(table, column)`

Expand rows by unrolling a list-valued column. Row-oriented only.

**Orientation Support:**
- ✓ row → row
- ✗ column (convert to row first)
- ✗ arrow (convert to row first)
- ✗ u8m

In [30]:
# Table with list-valued column
tags = {
    "orientation": "row",
    "columns": ["ticker", "tags"],
    "rows": [
        ["AAPL", ["tech", "consumer", "hardware"]],
        ["GOOGL", ["tech", "advertising"]],
        ["JPM", ["finance", "banking"]],
    ]
}

print("Original table (with list values):")
for row in tags["rows"]:
    print(f"  {row}")

Original table (with list values):
  ['AAPL', ['tech', 'consumer', 'hardware']]
  ['GOOGL', ['tech', 'advertising']]
  ['JPM', ['finance', 'banking']]


In [31]:
# Unchop: expand lists into rows
unchopped = table_unchop(tags, "tags")

print(f"Result orientation: {unchopped['orientation']}")
show_table(unchopped)

Result orientation: column


Unnamed: 0,ticker,tags
0,AAPL,tech
1,AAPL,consumer
2,AAPL,hardware
3,GOOGL,tech
4,GOOGL,advertising
5,JPM,finance
6,JPM,banking


### `table_chop(table, column)`

Collapse rows that share the same values in every other column, gathering the target column's values into a list. Row-oriented only.

**Orientation Support:**
- ✓ row → row
- ✗ column (convert to row first)
- ✗ arrow (convert to row first)
- ✗ u8m

In [32]:
# Chop: collapse back into lists
chopped = table_chop(unchopped, "tags")

print(f"Result orientation: {chopped['orientation']}")
print("\nCollapsed table:")
for row in chopped["rows"]:
    print(f"  {row}")

Result orientation: row

Collapsed table:
  ['AAPL', ['tech', 'consumer', 'hardware']]
  ['GOOGL', ['tech', 'advertising']]
  ['JPM', ['finance', 'banking']]


In [33]:
# Another example: group prices by sector
prices_by_sector = table_chop(
    table_select_columns(stocks, ["sector", "price"]),
    "price"
)

print("Prices grouped by sector:")
for row in prices_by_sector["rows"]:
    print(f"  {row[0]}: {row[1]}")

Prices grouped by sector:
  Tech: [150.0, 140.0, 380.0]
  Finance: [180.0]
  Energy: [100.0]
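
Unchop and chop are inverses of each other when the grouping columns identify each original row. A pure-Python sketch for row-oriented tables is shown below. Both helpers are hypothetical illustrations; in this sketch, a row whose list is empty produces no output rows, which may or may not match the library.

```python
def unchop_sketch(table, column):
    """Emit one output row per element of the list stored in `column`."""
    ci = table["columns"].index(column)
    out = []
    for row in table["rows"]:
        for item in row[ci]:
            new_row = list(row)
            new_row[ci] = item
            out.append(new_row)
    return {"orientation": "row", "columns": list(table["columns"]), "rows": out}


def chop_sketch(table, column):
    """Group rows on all other columns and collect `column` into lists."""
    ci = table["columns"].index(column)
    groups = {}  # dicts keep insertion order, so groups appear in first-seen order
    for row in table["rows"]:
        key = tuple(v for j, v in enumerate(row) if j != ci)
        groups.setdefault(key, []).append(row[ci])
    out = []
    for key, values in groups.items():
        row = list(key)
        row.insert(ci, values)  # put the list back at the target column's position
        out.append(row)
    return {"orientation": "row", "columns": list(table["columns"]), "rows": out}


tags_demo = {"orientation": "row", "columns": ["ticker", "tags"],
             "rows": [["AAPL", ["tech", "hardware"]], ["JPM", ["finance"]]]}
long_form = unchop_sketch(tags_demo, "tags")
print(long_form["rows"])  # [['AAPL', 'tech'], ['AAPL', 'hardware'], ['JPM', 'finance']]
print(chop_sketch(long_form, "tags")["rows"] == tags_demo["rows"])  # True
```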


---
## 7. Display/Output

### `format_table(table)`

Format a table as a tab-separated string with header.

**Orientation Support:**
- ✓ row
- ✓ column (converts to row)
- ✓ arrow (converts to row)
- ✓ u8m (converts via column)

In [34]:
# Format as string
formatted = format_table(table_head(stocks, 3))
print(formatted)

ticker	name	sector	price
AAPL	Apple	Tech	150
GOOGL	Alphabet	Tech	140
JPM	JPMorgan	Finance	180


### `print_table(table)`

Print a table directly to stdout.

**Orientation Support:**
- ✓ row
- ✓ column (converts to row)
- ✓ arrow (converts to row)
- ✓ u8m (converts via column)

In [35]:
# Print directly
print_table(table_head(stocks, 3))

ticker	name	sector	price
AAPL	Apple	Tech	150
GOOGL	Alphabet	Tech	140
JPM	JPMorgan	Finance	180


**Note:** Both `format_table` and `print_table` have a safety limit of 100,000 rows to prevent memory issues.
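
A tab-separated formatter with a row limit can be sketched in a few lines. This is a hypothetical illustration only: the notebook does not say how the library reacts when the limit is exceeded, so the sketch simply raises `ValueError`. It also uses plain `str()`, so `150.0` prints as `150.0` rather than `150` as in the output above.

```python
def format_table_sketch(table, max_rows=100_000):
    """Render a row-oriented table as tab-separated text with a header line."""
    if len(table["rows"]) > max_rows:
        # Assumed behavior: refuse rather than build a huge string.
        raise ValueError(f"table has {len(table['rows'])} rows; limit is {max_rows}")
    lines = ["\t".join(str(c) for c in table["columns"])]
    for row in table["rows"]:
        lines.append("\t".join(str(v) for v in row))
    return "\n".join(lines)


small_table = {"orientation": "row", "columns": ["ticker", "price"],
               "rows": [["AAPL", 150.0], ["JPM", 180.0]]}
formatted_text = format_table_sketch(small_table)
print(formatted_text)
```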

---
## Summary (All Orientations)

### Table Orientations

| Orientation | Description | Use Case |
|-------------|-------------|----------|
| `row` | List of row lists | Small tables, iteration |
| `column` | List of column lists | Column operations |
| `arrow` | List of PyArrow arrays | Large data, fast compute |
| `u8m` | List of uint8 matrices | Numba-accelerated strings |

### Conversion Functions

| Function | Input | Output |
|----------|-------|--------|
| `table_to_columns` | row, column, arrow, u8m | column |
| `table_to_rows` | row, column, arrow | row |
| `table_to_arrow` | row, column, arrow | arrow |

### Function Orientation Support

| Function | row | column | arrow | u8m | Output |
|----------|-----|--------|-------|-----|--------|
| `table_column` | ✓ | ✓ | ✓ | ✓ | list/pa.Array/u8m |
| `table_select_columns` | ✓ | ✓ | ✓ | ✗ | preserves |
| `table_add_column` | ✓ | ✓ | ✓ | ✗ | preserves |
| `table_drop_columns` | ✓ | ✓ | ✓ | ✗ | preserves |
| `table_replace_value` | ✓ | ✓ | ✓ | ✗ | arrow |
| `table_bind_rows` | ✓ | ✓ | ✓ | ✗ | first table's |
| `table_unique_rows` | ✓ | ✓ | ✓ | ✗ | row |
| `table_head` | ✓ | ✓ | ✓ | ✗ | preserves |
| `table_sample` | ✓ | ✓ | ✓ | ✗ | preserves |
| `table_stack_cols` | ✓ | ✓ | ✓ | ✗ | first table's |
| `table_left_join` | ✓ | ✓ | ✓ | ✗ | arrow if any arrow |
| `table_inner_join` | ✓ | ✓ | ✓ | ✗ | arrow if any arrow |
| `table_unchop` | ✓ | ✗ | ✗ | ✗ | row |
| `table_chop` | ✓ | ✗ | ✗ | ✗ | row |
| `format_table` | ✓ | ✓ | ✓ | ✓ | string |
| `print_table` | ✓ | ✓ | ✓ | ✓ | stdout |
| `show_table` | ✓ | ✓ | ✓ | ✓ | DataFrame |

### u8m-Specific Functions

| Function | Description |
|----------|-------------|
| `u8m_from_matrix` | Split pipe-delimited uint8 matrix into u8m table |
| `u8m_column` | Extract column as uint8 matrix |

---
## 8. Arrow Tables (PyArrow-backed)

Arrow-oriented tables use PyArrow arrays for efficient storage and vectorized operations. This orientation is best for:
- Large datasets (millions of rows)
- Vectorized compute operations
- Fast joins using PyArrow's hash join

**Orientation Support:** All `table_*_arrow` functions accept any orientation and convert to arrow internally.

In [36]:
# Additional imports for Arrow support
import pyarrow as pa
from specparser.amt import (
    table_to_arrow,
    table_to_jsonable,
    table_orientation,
    table_nrows,
    table_validate,
)

### `table_to_arrow(table)`

Convert any table to arrow-oriented (PyArrow arrays).

In [37]:
# Convert row-oriented stocks table to arrow
stocks_arrow = table_to_arrow(stocks)

print(f"Orientation: {stocks_arrow['orientation']}")
print(f"Columns: {stocks_arrow['columns']}")
print(f"\nColumn types:")
for name, arr in zip(stocks_arrow['columns'], stocks_arrow['rows']):
    print(f"  {name}: {type(arr).__name__} ({arr.type})")

print(f"\nFirst column (ticker) values: {stocks_arrow['rows'][0].to_pylist()}")

Orientation: arrow
Columns: ['ticker', 'name', 'sector', 'price']

Column types:
  ticker: StringArray (string)
  name: StringArray (string)
  sector: StringArray (string)
  price: DoubleArray (double)

First column (ticker) values: ['AAPL', 'GOOGL', 'JPM', 'XOM', 'MSFT']


### Helper Functions for All Orientations

In [38]:
# table_orientation: get the orientation
print(f"Row table: {table_orientation(stocks)}")
print(f"Column table: {table_orientation(stocks_col)}")
print(f"Arrow table: {table_orientation(stocks_arrow)}")

Row table: row
Column table: column
Arrow table: arrow


In [39]:
# table_nrows: get row count for any orientation
print(f"Row table: {table_nrows(stocks)} rows")
print(f"Column table: {table_nrows(stocks_col)} rows")
print(f"Arrow table: {table_nrows(stocks_arrow)} rows")

Row table: 5 rows
Column table: 5 rows
Arrow table: 5 rows


In [40]:
# table_validate: check table structure
table_validate(stocks_arrow)
print("Arrow table is valid!")

# Validation catches errors
bad_arrow = {"orientation": "arrow", "columns": ["a"], "rows": [[1, 2, 3]]}  # not a pa.Array
try:
    table_validate(bad_arrow)
except ValueError as e:
    print(f"Validation error: {e}")

Arrow table is valid!
Validation error: Column 0 (a) is not a PyArrow Array, got list
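
For the plain-Python orientations, structural validation mostly means checking that shapes agree with the `columns` list. The sketch below shows one way to do that for row and column tables. `validate_sketch` is a hypothetical helper, and the checks that `table_validate` actually performs may be different or more extensive.

```python
def validate_sketch(table):
    """Raise ValueError if a row- or column-oriented table is malformed."""
    for key in ("orientation", "columns", "rows"):
        if key not in table:
            raise ValueError(f"missing key: {key}")
    ncols = len(table["columns"])
    if table["orientation"] == "row":
        for i, row in enumerate(table["rows"]):
            if len(row) != ncols:
                raise ValueError(f"row {i} has {len(row)} values; expected {ncols}")
    elif table["orientation"] == "column":
        if len(table["rows"]) != ncols:
            raise ValueError(f"got {len(table['rows'])} column lists; expected {ncols}")
        if len({len(col) for col in table["rows"]}) > 1:
            raise ValueError("columns have unequal lengths")
    else:
        raise ValueError(f"orientation not handled by this sketch: {table['orientation']}")


good_table = {"orientation": "column", "columns": ["a", "b"], "rows": [[1, 2], [3, 4]]}
validate_sketch(good_table)  # passes silently

ragged_table = {"orientation": "row", "columns": ["a", "b"], "rows": [[1, 2], [3]]}
try:
    validate_sketch(ragged_table)
except ValueError as e:
    print(f"Validation error: {e}")  # Validation error: row 1 has 1 values; expected 2
```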


### Arrow Operations Preserve Orientation

Most table operations preserve the arrow orientation, enabling chained operations without conversion overhead.

In [41]:
# table_head preserves arrow orientation
head_arrow = table_head(stocks_arrow, 3)
print(f"table_head result: {table_orientation(head_arrow)}")

# table_select_columns preserves arrow
selected_arrow = table_select_columns(stocks_arrow, ["ticker", "price"])
print(f"table_select_columns result: {table_orientation(selected_arrow)}")

# table_drop_columns preserves arrow
dropped_arrow = table_drop_columns(stocks_arrow, ["sector"])
print(f"table_drop_columns result: {table_orientation(dropped_arrow)}")

# table_add_column preserves arrow
added_arrow = table_add_column(stocks_arrow, "currency", "USD")
print(f"table_add_column result: {table_orientation(added_arrow)}")

# table_sample preserves arrow
sampled_arrow = table_sample(stocks_arrow, 2)
print(f"table_sample result: {table_orientation(sampled_arrow)}")

table_head result: arrow
table_select_columns result: arrow
table_drop_columns result: arrow
table_add_column result: arrow
table_sample result: arrow


### Zero-Copy Column Extraction

For arrow tables, `table_column` returns a PyArrow Array (zero-copy reference), not a Python list.

In [42]:
# From row/column table: returns Python list
prices_list = table_column(stocks, "price")
print(f"From row table: {type(prices_list).__name__} = {prices_list}")

# From arrow table: returns PyArrow Array (zero-copy!)
prices_arrow = table_column(stocks_arrow, "price")
print(f"From arrow table: {type(prices_arrow).__name__} ({prices_arrow.type})")
print(f"  Values: {prices_arrow.to_pylist()}")

From row table: list = [150.0, 140.0, 180.0, 100.0, 380.0]
From arrow table: DoubleArray (double)
  Values: [150.0, 140.0, 180.0, 100.0, 380.0]


### Vectorized Value Replacement

`table_replace_value` uses PyArrow compute kernels for vectorized replacement.

In [43]:
# Vectorized replace on arrow table
replaced_arrow = table_replace_value(stocks_arrow, "sector", "Tech", "Technology")
print(f"Result orientation: {table_orientation(replaced_arrow)}")
print(f"Sector values: {table_column(replaced_arrow, 'sector').to_pylist()}")

Result orientation: arrow
Sector values: ['Technology', 'Technology', 'Finance', 'Energy', 'Technology']


### Mixed Orientation Operations

`table_bind_rows` and `table_stack_cols` use the first table's orientation, so passing the arrow table first yields an arrow result. The join functions return arrow if either input is arrow.

In [44]:
# table_bind_rows: arrow + row -> arrow
bound = table_bind_rows(stocks_arrow, more_stocks)  # arrow + row
print(f"bind_rows(arrow, row) -> {table_orientation(bound)}, {table_nrows(bound)} rows")

# table_stack_cols: arrow + row -> arrow
metrics_arrow = table_to_arrow(metrics)
stacked = table_stack_cols(stocks_arrow, metrics)  # arrow + row
print(f"stack_cols(arrow, row) -> {table_orientation(stacked)}")

bind_rows(arrow, row) -> arrow, 7 rows
stack_cols(arrow, row) -> arrow


### Fast Joins with Arrow

Joins use PyArrow's optimized `pa.Table.join()` when either input is arrow.

In [45]:
# Arrow left join - uses PyArrow's optimized hash join
prices_arr = table_to_arrow(prices)
ratings_arr = table_to_arrow(ratings)

result = table_left_join(prices_arr, ratings_arr, "ticker")

print(f"Result: {table_orientation(result)}, {table_nrows(result)} rows")
print(f"Columns: {result['columns']}")
show_table(table_to_rows(result))

Result: arrow, 3 rows
Columns: ['ticker', 'price', 'rating', 'target']


Unnamed: 0,ticker,price,rating,target
0,AAPL,150.0,Hold,160.0
1,GOOGL,140.0,,
2,AMZN,180.0,,


### Converting to JSON

`table_to_jsonable` converts any table (including arrow) to a JSON-serializable format, handling special types like datetime and Decimal. In the output below, datetimes become ISO 8601 strings and Decimals become plain numbers.

In [46]:
from datetime import datetime
from decimal import Decimal

# Table with special types
special_table = {
    "orientation": "row",
    "columns": ["timestamp", "value", "amount"],
    "rows": [
        [datetime(2024, 1, 15, 10, 30), Decimal("123.456"), 100],
        [datetime(2024, 6, 1, 14, 0), Decimal("789.012"), 200],
    ]
}

# Convert to JSON-serializable (works from any orientation)
json_table = table_to_jsonable(special_table)
print("JSON-serializable output:")
for row in json_table["rows"]:
    print(f"  {row}")

JSON-serializable output:
  ['2024-01-15T10:30:00', 123.456, 100]
  ['2024-06-01T14:00:00', 789.012, 200]
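
The per-value conversion behind output like this can be sketched with the standard library alone. `to_jsonable_value` is a hypothetical helper, not the library's code; it assumes that datetimes map to ISO 8601 strings and Decimals to floats, which matches the output above but loses precision for high-precision decimals.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_jsonable_value(value):
    """Map values that json.dumps rejects onto JSON-friendly equivalents."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return float(value)  # lossy for very precise decimals
    return value


special_rows = [[datetime(2024, 1, 15, 10, 30), Decimal("123.456"), 100]]
converted_rows = [[to_jsonable_value(v) for v in row] for row in special_rows]
print(json.dumps(converted_rows))  # [["2024-01-15T10:30:00", 123.456, 100]]
```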


### Roundtrip Conversions

Convert freely among the row, column, and arrow orientations without losing data.

In [47]:
# row -> arrow -> row (roundtrip)
original = stocks
arrow_version = table_to_arrow(original)
back_to_row = table_to_rows(arrow_version)

assert back_to_row["rows"] == original["rows"]
print("row -> arrow -> row: Data preserved!")

# row -> arrow -> column (chain conversions)
column_from_arrow = table_to_columns(arrow_version)
print(f"row -> arrow -> column: orientation = {table_orientation(column_from_arrow)}")

row -> arrow -> row: Data preserved!
row -> arrow -> column: orientation = column


---
## Arrow Functions Summary

| Function | Description |
|----------|-------------|
| `table_to_arrow` | Convert any table to arrow-oriented |
| `table_orientation` | Get orientation ("row", "column", or "arrow") |
| `table_nrows` | Get row count (works for all orientations) |
| `table_validate` | Validate table structure |
| `table_to_jsonable` | Convert to JSON-serializable dict |

**Operations that preserve arrow orientation:**
- `table_head`, `table_sample`
- `table_select_columns`, `table_drop_columns`, `table_add_column`
- `table_replace_value`
- `table_bind_rows`, `table_stack_cols` (when the first table is arrow)
- `table_left_join`, `table_inner_join`

**Operations that do not:**
- `table_unique_rows` converts its result to row orientation
- `table_unchop`, `table_chop` accept row-oriented input only

**Zero-copy operations:**
- `table_column` returns `pa.Array` for arrow tables
- Column selection/dropping references existing arrays

---
## 9. Join Benchmarks

Compare performance of row-oriented vs arrow-oriented left joins.

**Scenario:** Join an items table (unique items with categories) to a values table (multiple values per item, like time-series data or transaction logs).

### Small Scale Demo

First, let's see the join behavior with small, visible tables.

In [48]:
# Items table: 3 unique items with categories
items_small = {
    "orientation": "row",
    "columns": ["item", "category"],
    "rows": [
        ["apple", "fruit"],
        ["carrot", "vegetable"],
        ["bread", "grain"],
    ]
}

# Values table: 3 values per item (like daily prices)
values_small = {
    "orientation": "row",
    "columns": ["item", "value"],
    "rows": [
        ["apple", 1.20],
        ["apple", 1.15],
        ["apple", 1.25],
        ["carrot", 0.80],
        ["carrot", 0.75],
        ["carrot", 0.85],
        ["bread", 2.50],
        ["bread", 2.40],
        ["bread", 2.60],
    ]
}

print("Items table (3 rows):")
display(show_table(items_small))

print("\nValues table (9 rows, 3 per item):")
display(show_table(values_small))

Items table (3 rows):


Unnamed: 0,item,category
0,apple,fruit
1,carrot,vegetable
2,bread,grain



Values table (9 rows, 3 per item):


Unnamed: 0,item,value
0,apple,1.2
1,apple,1.15
2,apple,1.25
3,carrot,0.8
4,carrot,0.75
5,carrot,0.85
6,bread,2.5
7,bread,2.4
8,bread,2.6


In [49]:
# Left join: each item gets all its values
# 3 items × 3 values each = 9 result rows
result_small = table_left_join(items_small, values_small, "item")

print(f"Result: {table_nrows(result_small)} rows (3 items × 3 values)")
show_table(result_small)

Result: 9 rows (3 items × 3 values)


Unnamed: 0,item,category,value
0,apple,fruit,1.2
1,apple,fruit,1.15
2,apple,fruit,1.25
3,carrot,vegetable,0.8
4,carrot,vegetable,0.75
5,carrot,vegetable,0.85
6,bread,grain,2.5
7,bread,grain,2.4
8,bread,grain,2.6


### Large Scale Benchmark

Now let's generate larger tables and compare row vs arrow join performance.

In [50]:
import time

# Configuration
n_items = 1000
n_values_per_item = 100

# Items table: 1000 unique items with categories
items_large = {
    "orientation": "row",
    "columns": ["item", "category"],
    "rows": [
        [f"item_{i}", f"cat_{i % 10}"] 
        for i in range(n_items)
    ]
}

# Values table: 100,000 rows (100 values per item)
values_large = {
    "orientation": "row",
    "columns": ["item", "value"],
    "rows": [
        [f"item_{i}", float(j)]
        for i in range(n_items)
        for j in range(n_values_per_item)
    ]
}

print(f"Items table: {table_nrows(items_large):,} rows")
print(f"Values table: {table_nrows(values_large):,} rows")
print(f"Expected result: {n_items * n_values_per_item:,} rows")

Items table: 1,000 rows
Values table: 100,000 rows
Expected result: 100,000 rows


In [51]:
# Benchmark 1: Row-oriented join (Python dict-based)
t0 = time.perf_counter()
result_row = table_left_join(items_large, values_large, "item")
t_row = time.perf_counter() - t0

print(f"Row join: {t_row:.3f}s")
print(f"  Result: {table_nrows(result_row):,} rows")
print(f"  Orientation: {table_orientation(result_row)}")

Row join: 0.030s
  Result: 100,000 rows
  Orientation: row


In [52]:
# Benchmark 2: Arrow-oriented join (PyArrow hash join)
# First convert to arrow (one-time cost)
t0 = time.perf_counter()
items_arrow_large = table_to_arrow(items_large)
values_arrow_large = table_to_arrow(values_large)
t_convert = time.perf_counter() - t0

print(f"Conversion to arrow: {t_convert:.3f}s")

# Now the join
t0 = time.perf_counter()
result_arrow = table_left_join(items_arrow_large, values_arrow_large, "item")
t_arrow = time.perf_counter() - t0

print(f"Arrow join: {t_arrow:.3f}s")
print(f"  Result: {table_nrows(result_arrow):,} rows")
print(f"  Orientation: {table_orientation(result_arrow)}")

Conversion to arrow: 0.013s
Arrow join: 0.003s
  Result: 100,000 rows
  Orientation: arrow


In [53]:
# Summary comparison
print("=" * 50)
print("BENCHMARK SUMMARY")
print("=" * 50)
print(f"Data size: {n_items:,} items × {n_values_per_item} values")
print(f"Result size: {n_items * n_values_per_item:,} rows")
print()
print(f"Row join:   {t_row:.3f}s")
print(f"Arrow join: {t_arrow:.3f}s (+ {t_convert:.3f}s conversion)")
print()
if t_arrow > 0:
    speedup = t_row / t_arrow
    print(f"Speedup: {speedup:.1f}x faster with Arrow")
    print()
    print("Note: If data is already in arrow format,")
    print("      the conversion cost is amortized over multiple operations.")

BENCHMARK SUMMARY
Data size: 1,000 items × 100 values
Result size: 100,000 rows

Row join:   0.030s
Arrow join: 0.003s (+ 0.013s conversion)

Speedup: 8.8x faster with Arrow

Note: If data is already in arrow format,
      the conversion cost is amortized over multiple operations.
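
The printed speedup compares join time only. The arithmetic below folds in the one-time conversion cost, using the rounded timings printed in this run (0.030s row join, 0.003s arrow join, 0.013s conversion). The figures are machine-specific and rounded, so treat the results as rough estimates. The helper names are hypothetical.

```python
# Rounded timings copied from the benchmark output in this notebook run
t_row = 0.030      # row-oriented join
t_arrow = 0.003    # arrow-oriented join
t_convert = 0.013  # one-time row -> arrow conversion of both inputs


def row_total(n_joins):
    """Total time for n joins on row-oriented tables."""
    return n_joins * t_row


def arrow_total(n_joins):
    """Total time for n joins on arrow tables, paying the conversion once."""
    return t_convert + n_joins * t_arrow


single_join_speedup = row_total(1) / arrow_total(1)
ten_join_speedup = row_total(10) / arrow_total(10)
print(f"1 join:   {single_join_speedup:.1f}x")   # roughly 1.9x
print(f"10 joins: {ten_join_speedup:.1f}x")      # roughly 7.0x
```

With these numbers a single join including conversion is still faster with Arrow, and the advantage grows as more operations reuse the converted tables.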


---
## 10. Arrow Compute Functions

The `table_*_arrow` functions expose PyArrow's vectorized compute operations. They:
- Accept any table orientation (row, column, or arrow)
- Convert to arrow internally if needed
- Return arrow-oriented tables (individual aggregates such as `sum` and `mean` return scalars)
- Provide fast, vectorized operations without Python loops

**Categories:**
- **Arithmetic:** `add`, `subtract`, `multiply`, `divide`, `negate`, `abs`, `sign`, `power`, `sqrt`, `exp`, `ln`, `log10`, `log2`, `round`, `ceil`, `floor`, `trunc`
- **Trigonometric:** `sin`, `cos`, `tan`, `asin`, `acos`, `atan`, `atan2`
- **Comparison:** `equal`, `not_equal`, `less`, `less_equal`, `greater`, `greater_equal`
- **Null checks:** `is_null`, `is_valid`, `is_nan`, `is_finite`, `is_in`
- **Logical:** `and`, `or`, `xor`, `invert`
- **String:** `upper`, `lower`, `capitalize`, `title`, `strip`, `lstrip`, `rstrip`, `length`, `starts_with`, `ends_with`, `contains`, `replace_substr`, `split`
- **Aggregates:** `summarize`, `sum`, `mean`, `min`, `max`, `count`, `count_distinct`, `stddev`, `variance`, `first`, `last`, `any`, `all`
- **Cumulative:** `cumsum`, `cumprod`, `cummin`, `cummax`, `cummean`, `diff`
- **Selection:** `if_else`, `coalesce`, `fill_null`, `fill_null_forward`, `fill_null_backward`
- **Filter:** `filter`

In [54]:
# Import arrow compute functions
from specparser.amt import (
    # arithmetic
    table_add_arrow, table_subtract_arrow, table_multiply_arrow, 
    table_divide_arrow, table_abs_arrow, table_sqrt_arrow,
    table_round_arrow, table_cumsum_arrow,
    # comparison & filter
    table_greater_arrow, table_filter_arrow,
    # string
    table_upper_arrow, table_lower_arrow, table_length_arrow,
    # aggregates
    table_summarize_arrow, table_sum_arrow, table_mean_arrow,
    table_min_arrow, table_max_arrow,
    # selection
    table_fill_null_arrow,
)

### Arithmetic Operations

Column-wise arithmetic with scalars or other columns.
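
The calling convention used in this section can be pictured with a plain-Python sketch: the second operand is either a scalar or the name of another column, and `result_col` decides whether the result replaces the input column or is appended. This behavior is inferred from how the notebook calls the functions. `multiply_column` is a hypothetical helper, not part of `specparser.amt`.

```python
def multiply_column(columns, data, col, other, result_col=None):
    """Multiply column `col` by a scalar or by another column given by name.

    `columns` is a list of names and `data` a list of value lists, one per column.
    Returns new (columns, data); the inputs are left unmodified.
    """
    left = data[columns.index(col)]
    if isinstance(other, str):
        # Assumption: a string operand names another column
        right = data[columns.index(other)]
    else:
        # Scalar operand: repeat it for every row
        right = [other] * len(left)
    product = [a * b for a, b in zip(left, right)]

    new_columns, new_data = list(columns), list(data)
    if result_col is None:
        new_data[columns.index(col)] = product   # replace the input column
    else:
        new_columns.append(result_col)           # append as a new column
        new_data.append(product)
    return new_columns, new_data


cols = ["product", "price", "qty"]
data = [["Widget A", "Widget B"], [10.5, 25.0], [5, 3]]

cols_total, data_total = multiply_column(cols, data, "price", "qty", result_col="total")
cols_scaled, data_scaled = multiply_column(cols, data, "price", 2)
print(cols_total, data_total[-1])
print(cols_scaled, data_scaled[1])
```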

In [55]:
# Create a sales table
sales = {
    "orientation": "row",
    "columns": ["product", "price", "qty"],
    "rows": [
        ["Widget A", 10.50, 5],
        ["Widget B", 25.00, 3],
        ["Gadget X", 15.75, 8],
        ["Gadget Y", 8.99, 12],
    ]
}
show_table(sales)

Unnamed: 0,product,price,qty
0,Widget A,10.5,5
1,Widget B,25.0,3
2,Gadget X,15.75,8
3,Gadget Y,8.99,12


In [56]:
# Multiply price × qty to get total (result_col adds new column)
with_total = table_multiply_arrow(sales, "price", "qty", result_col="total")
show_table(with_total)

Unnamed: 0,product,price,qty,total
0,Widget A,10.5,5,52.5
1,Widget B,25.0,3,75.0
2,Gadget X,15.75,8,126.0
3,Gadget Y,8.99,12,107.88


In [57]:
# Add 10% tax to total (using scalar)
with_tax = table_multiply_arrow(with_total, "total", 1.10, result_col="with_tax")
with_tax = table_round_arrow(with_tax, "with_tax", decimals=2)
show_table(with_tax)

Unnamed: 0,product,price,qty,total,with_tax
0,Widget A,10.5,5,52.5,57.75
1,Widget B,25.0,3,75.0,82.5
2,Gadget X,15.75,8,126.0,138.6
3,Gadget Y,8.99,12,107.88,118.67


In [58]:
# Cumulative sum of totals
with_running = table_cumsum_arrow(with_total, "total", result_col="running_total")
show_table(with_running)

Unnamed: 0,product,price,qty,total,running_total
0,Widget A,10.5,5,52.5,52.5
1,Widget B,25.0,3,75.0,127.5
2,Gadget X,15.75,8,126.0,253.5
3,Gadget Y,8.99,12,107.88,361.38


### String Operations

Transform and analyze string columns.

In [59]:
# Uppercase product names
upper_products = table_upper_arrow(sales, "product")
print("Uppercased products:")
print(table_column(upper_products, "product").to_pylist())

# String length
with_len = table_length_arrow(sales, "product", result_col="name_len")
show_table(with_len)

Uppercased products:
['WIDGET A', 'WIDGET B', 'GADGET X', 'GADGET Y']


Unnamed: 0,product,price,qty,name_len
0,Widget A,10.5,5,8
1,Widget B,25.0,3,8
2,Gadget X,15.75,8,8
3,Gadget Y,8.99,12,8


### Filtering

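Filter rows in two steps: build a boolean mask column with a comparison function, then pass that column to `table_filter_arrow`. In the sample data every total exceeds 50, so all four rows pass the filter.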
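
The mask-then-filter idea can be seen in isolation with plain Python. The sketch below uses the same totals as the sales table but a threshold of 100, chosen here so that some rows are actually dropped. It does not use the library; `itertools.compress` stands in for the filter step.

```python
from itertools import compress

totals_table = {
    "orientation": "row",
    "columns": ["product", "total"],
    "rows": [
        ["Widget A", 52.5],
        ["Widget B", 75.0],
        ["Gadget X", 126.0],
        ["Gadget Y", 107.88],
    ],
}

# Step 1: build a boolean mask, one entry per row
idx = totals_table["columns"].index("total")
mask = [row[idx] > 100 for row in totals_table["rows"]]

# Step 2: keep only the rows whose mask entry is True
kept = list(compress(totals_table["rows"], mask))
print(mask)
print(kept)
```

Keeping the mask as a separate step makes it easy to combine several conditions before filtering.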

In [60]:
# Filter: keep only high-value sales (total > 50)
# Step 1: Add boolean mask column
with_mask = table_greater_arrow(with_total, "total", 50, result_col="_mask")

# Step 2: Filter by mask
filtered = table_filter_arrow(with_mask, "_mask")

# Step 3: Drop the temporary mask column
high_value = table_drop_columns(filtered, ["_mask"])

print("Sales with total > 50:")
show_table(high_value)

Sales with total > 50:


Unnamed: 0,product,price,qty,total
0,Widget A,10.5,5,52.5
1,Widget B,25.0,3,75.0
2,Gadget X,15.75,8,126.0
3,Gadget Y,8.99,12,107.88


### Aggregates

Compute summary statistics across columns.
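
A summarize call takes a spec that maps each column to a list of aggregate names. The plain-Python sketch below shows one way such a spec could be expanded, assuming the `<column>_<aggregate>` output naming that `table_summarize_arrow` produces in this notebook. The `summarize` helper and the `AGGREGATES` lookup are hypothetical.

```python
from statistics import fmean

# Hypothetical lookup from aggregate name to a Python callable
AGGREGATES = {"sum": sum, "mean": fmean, "min": min, "max": max}


def summarize(columns, data, spec):
    """Expand {column: [aggregate, ...]} into parallel lists of names and values."""
    out_names, out_values = [], []
    for col, aggs in spec.items():
        values = data[columns.index(col)]
        for agg in aggs:
            out_names.append(f"{col}_{agg}")
            out_values.append(AGGREGATES[agg](values))
    return out_names, out_values


names, stats = summarize(
    ["total"],
    [[52.5, 75.0, 126.0, 107.88]],
    {"total": ["sum", "mean", "min", "max"]},
)
for name, value in zip(names, stats):
    print(f"{name}: {value}")
```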

In [61]:
# table_summarize_arrow: compute multiple aggregates at once
summary = table_summarize_arrow(with_total, {"total": ["sum", "mean", "min", "max"]})
print("Sales total summary:")
show_table(summary)

Sales total summary:


Unnamed: 0,total_sum,total_mean,total_min,total_max
0,361.38,90.345,52.5,126.0


In [62]:
# Individual aggregate functions return scalars
total_sum = table_sum_arrow(with_total, "total")
total_mean = table_mean_arrow(with_total, "total")
total_min = table_min_arrow(with_total, "total")
total_max = table_max_arrow(with_total, "total")

print(f"Sum:  {total_sum}")
print(f"Mean: {total_mean:.2f}")
print(f"Min:  {total_min}")
print(f"Max:  {total_max}")

Sum:  361.38
Mean: 90.34
Min:  52.5
Max:  126.0


### Null Handling

Fill missing values using selection functions.
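
Two common fill strategies are sketched below on a plain Python list: replacing nulls with a constant, and carrying the last valid value forward. The forward-fill semantics (leading nulls stay null) are an assumption about what a function like `fill_null_forward` would do. Both helpers are hypothetical stand-ins, not library code.

```python
def fill_null(values, default):
    """Replace every None with a fixed default."""
    return [default if v is None else v for v in values]


def fill_null_forward(values):
    """Replace each None with the most recent non-None value, if any."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


column = [10, None, 30, None]
print(fill_null(column, 0))
print(fill_null_forward(column))
print(fill_null_forward([None, 5, None]))
```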

In [63]:
# Table with missing values
with_nulls = {
    "orientation": "row",
    "columns": ["item", "value"],
    "rows": [
        ["A", 10],
        ["B", None],
        ["C", 30],
        ["D", None],
    ]
}

# Fill nulls with a default value
filled = table_fill_null_arrow(with_nulls, "value", 0)
print("Original vs filled:")
print(f"Original: {table_column(with_nulls, 'value')}")
print(f"Filled:   {table_column(filled, 'value').to_pylist()}")

Original vs filled:
Original: [10, None, 30, None]
Filled:   [10, 0, 30, 0]


### Arrow Compute Benchmarks

Compare performance of Python loops vs Arrow compute functions on large tables.
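
When reading the timings in this section, note that `run_benchmark` times each function once by default, and that `benchmark_data` is row-oriented, so every Arrow call also pays for the internal conversion to Arrow. As a hedged alternative, the sketch below uses the standard library's `timeit.repeat` and keeps the fastest of several runs, which tends to be less sensitive to background noise. `best_time` is a hypothetical helper; the workloads are plain-Python placeholders.

```python
import timeit


def best_time(fn, repeat=5, number=1):
    """Return the fastest per-call time in seconds over `repeat` measurements."""
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number


# Placeholder workloads standing in for a Python loop and an aggregate
sample = list(range(100_000))
t_loop = best_time(lambda: [v * 2.5 for v in sample])
t_sum = best_time(lambda: sum(sample))

print(f"list comprehension: {t_loop:.4f}s")
print(f"built-in sum:       {t_sum:.4f}s")
```

To measure compute time alone, one option is to convert the input once with `table_to_arrow` outside the timed function and pass the converted table to the Arrow call.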

In [64]:
# Generate a large table for benchmarking
import random

n_rows = 1_000_000
benchmark_data = {
    "orientation": "row",
    "columns": ["id", "value", "text"],
    "rows": [
        [i, random.uniform(1, 100), f"item_{i}"]
        for i in range(n_rows)
    ]
}

print(f"Benchmark table: {n_rows:,} rows")

Benchmark table: 1,000,000 rows


In [65]:
# Benchmark harness - stores results and displays summary table
import time

benchmark_results = []

def run_benchmark(name, python_fn, arrow_fn, iterations=1):
    """Run a benchmark and store results."""
    # Python timing
    t0 = time.perf_counter()
    for _ in range(iterations):
        python_fn()
    t_python = (time.perf_counter() - t0) / iterations
    
    # Arrow timing
    t0 = time.perf_counter()
    for _ in range(iterations):
        arrow_fn()
    t_arrow = (time.perf_counter() - t0) / iterations
    
    benchmark_results.append((name, t_python, t_arrow))

def show_benchmark_results():
    """Display benchmark results as a formatted table."""
    print("=" * 70)
    print("ARROW COMPUTE BENCHMARK RESULTS")
    print("=" * 70)
    print(f"Data size: {n_rows:,} rows")
    print()
    print(f"{'Operation':<30} {'Python':<12} {'Arrow':<12} {'Speedup':<10}")
    print("-" * 70)
    for name, t_python, t_arrow in benchmark_results:
        speedup = t_python / t_arrow if t_arrow > 0 else float('inf')
        print(f"{name:<30} {t_python:.3f}s{'':<6} {t_arrow:.3f}s{'':<6} {speedup:.1f}x")
    print("-" * 70)

In [66]:
# Additional imports for comprehensive benchmarks
from specparser.amt import (
    table_power_arrow, table_add_arrow, table_lower_arrow,
    table_strip_arrow, table_starts_with_arrow, table_contains_arrow,
    table_replace_substr_arrow, table_split_arrow,
    table_count_distinct_arrow, table_cummax_arrow, table_diff_arrow,
    table_if_else_arrow, table_less_arrow,
)

# === ARITHMETIC BENCHMARKS ===

# Benchmark: Multiply (scalar)
def python_multiply(table, col, scalar):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[v * scalar if i == col_idx else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("multiply (scalar)",
              lambda: python_multiply(benchmark_data, "value", 2.5),
              lambda: table_multiply_arrow(benchmark_data, "value", 2.5))

# Benchmark: Add columns
def python_add_cols(table, col1, col2, result_col):
    idx1 = table["columns"].index(col1)
    idx2 = table["columns"].index(col2)
    return {"orientation": "row", "columns": table["columns"] + [result_col],
            "rows": [row + [row[idx1] + row[idx2]] for row in table["rows"]]}

run_benchmark("add (columns)",
              lambda: python_add_cols(benchmark_data, "id", "value", "sum"),
              lambda: table_add_arrow(benchmark_data, "id", "value", result_col="sum"))

# Benchmark: Power (use float to avoid column index confusion)
def python_power(table, col, exponent):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[v ** exponent if i == col_idx else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("power",
              lambda: python_power(benchmark_data, "value", 2.0),
              lambda: table_power_arrow(benchmark_data, "value", 2.0))

# Benchmark: Round
def python_round(table, col, decimals):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[round(v, decimals) if i == col_idx else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("round",
              lambda: python_round(benchmark_data, "value", 2),
              lambda: table_round_arrow(benchmark_data, "value", decimals=2))

print("Arithmetic benchmarks complete.")

Arithmetic benchmarks complete.


In [67]:
# === STRING BENCHMARKS ===

# Benchmark: Uppercase
def python_upper(table, col):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[v.upper() if i == col_idx else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("upper",
              lambda: python_upper(benchmark_data, "text"),
              lambda: table_upper_arrow(benchmark_data, "text"))

# Benchmark: Lowercase
def python_lower(table, col):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[v.lower() if i == col_idx else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("lower",
              lambda: python_lower(benchmark_data, "text"),
              lambda: table_lower_arrow(benchmark_data, "text"))

# Benchmark: Length
def python_length(table, col, result_col):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"] + [result_col],
            "rows": [row + [len(row[col_idx])] for row in table["rows"]]}

run_benchmark("length",
              lambda: python_length(benchmark_data, "text", "len"),
              lambda: table_length_arrow(benchmark_data, "text", result_col="len"))

# Benchmark: Strip (the benchmark text has no surrounding whitespace, so nothing is actually removed)
def python_strip(table, col):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[v.strip() if i == col_idx else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("strip",
              lambda: python_strip(benchmark_data, "text"),
              lambda: table_strip_arrow(benchmark_data, "text"))

# Benchmark: Starts with
def python_starts_with(table, col, pattern, result_col):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"] + [result_col],
            "rows": [row + [row[col_idx].startswith(pattern)] 
                     for row in table["rows"]]}

run_benchmark("starts_with",
              lambda: python_starts_with(benchmark_data, "text", "item_1", "sw"),
              lambda: table_starts_with_arrow(benchmark_data, "text", "item_1", 
                                              result_col="sw"))

# Benchmark: Contains
def python_contains(table, col, pattern, result_col):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"] + [result_col],
            "rows": [row + [pattern in row[col_idx]] for row in table["rows"]]}

run_benchmark("contains",
              lambda: python_contains(benchmark_data, "text", "999", "has"),
              lambda: table_contains_arrow(benchmark_data, "text", "999", 
                                           result_col="has"))

# Benchmark: Replace substring
def python_replace(table, col, old, new):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[v.replace(old, new) if i == col_idx else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("replace",
              lambda: python_replace(benchmark_data, "text", "item", "product"),
              lambda: table_replace_substr_arrow(benchmark_data, "text", 
                                                 "item", "product"))

# Benchmark: Split
def python_split(table, col, sep, result_col):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"] + [result_col],
            "rows": [row + [row[col_idx].split(sep)] for row in table["rows"]]}

run_benchmark("split",
              lambda: python_split(benchmark_data, "text", "_", "parts"),
              lambda: table_split_arrow(benchmark_data, "text", "_", 
                                        result_col="parts"))

print("String benchmarks complete.")

String benchmarks complete.


In [68]:
# === AGGREGATION BENCHMARKS ===

# Benchmark: Sum
def python_sum(table, col):
    col_idx = table["columns"].index(col)
    return sum(row[col_idx] for row in table["rows"])

run_benchmark("sum",
              lambda: python_sum(benchmark_data, "value"),
              lambda: table_sum_arrow(benchmark_data, "value"))

# Benchmark: Mean
def python_mean(table, col):
    col_idx = table["columns"].index(col)
    values = [row[col_idx] for row in table["rows"]]
    return sum(values) / len(values)

run_benchmark("mean",
              lambda: python_mean(benchmark_data, "value"),
              lambda: table_mean_arrow(benchmark_data, "value"))

# Benchmark: Min/Max
def python_min_max(table, col):
    col_idx = table["columns"].index(col)
    values = [row[col_idx] for row in table["rows"]]
    return (min(values), max(values))

def arrow_min_max(table, col):
    return (table_min_arrow(table, col), table_max_arrow(table, col))

run_benchmark("min_max",
              lambda: python_min_max(benchmark_data, "value"),
              lambda: arrow_min_max(benchmark_data, "value"))

# Benchmark: Count distinct
def python_count_distinct(table, col):
    col_idx = table["columns"].index(col)
    return len(set(row[col_idx] for row in table["rows"]))

run_benchmark("count_distinct",
              lambda: python_count_distinct(benchmark_data, "text"),
              lambda: table_count_distinct_arrow(benchmark_data, "text"))

print("Aggregation benchmarks complete.")

Aggregation benchmarks complete.


In [69]:
# === CUMULATIVE BENCHMARKS ===

# Benchmark: Cumulative sum
def python_cumsum(table, col, result_col):
    col_idx = table["columns"].index(col)
    total = 0
    new_rows = []
    for row in table["rows"]:
        total += row[col_idx]
        new_rows.append(list(row) + [total])
    return {"orientation": "row", "columns": table["columns"] + [result_col], 
            "rows": new_rows}

run_benchmark("cumsum",
              lambda: python_cumsum(benchmark_data, "value", "cs"),
              lambda: table_cumsum_arrow(benchmark_data, "value", result_col="cs"))

# Benchmark: Cumulative max
def python_cummax(table, col, result_col):
    col_idx = table["columns"].index(col)
    running_max = float('-inf')
    new_rows = []
    for row in table["rows"]:
        running_max = max(running_max, row[col_idx])
        new_rows.append(list(row) + [running_max])
    return {"orientation": "row", "columns": table["columns"] + [result_col], 
            "rows": new_rows}

run_benchmark("cummax",
              lambda: python_cummax(benchmark_data, "value", "cm"),
              lambda: table_cummax_arrow(benchmark_data, "value", result_col="cm"))

# Benchmark: Diff (pairwise differences)
def python_diff(table, col, result_col):
    col_idx = table["columns"].index(col)
    new_rows = []
    prev = None
    for row in table["rows"]:
        val = row[col_idx]
        diff = None if prev is None else val - prev
        new_rows.append(list(row) + [diff])
        prev = val
    return {"orientation": "row", "columns": table["columns"] + [result_col], 
            "rows": new_rows}

run_benchmark("diff",
              lambda: python_diff(benchmark_data, "value", "d"),
              lambda: table_diff_arrow(benchmark_data, "value", result_col="d"))

print("Cumulative benchmarks complete.")

Cumulative benchmarks complete.


In [70]:
# === FILTERING & SELECTION BENCHMARKS ===

# Benchmark: Filter by condition
def python_filter(table, col, threshold):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [row for row in table["rows"] if row[col_idx] > threshold]}

def arrow_filter(table, col, threshold):
    with_mask = table_greater_arrow(table, col, threshold, result_col="_mask")
    filtered = table_filter_arrow(with_mask, "_mask")
    return table_drop_columns(filtered, ["_mask"])

run_benchmark("filter",
              lambda: python_filter(benchmark_data, "value", 50),
              lambda: arrow_filter(benchmark_data, "value", 50))

# Benchmark: Fill null (create data with nulls first)
benchmark_with_nulls = {
    "orientation": "row",
    "columns": ["id", "value", "text"],
    "rows": [[i, None if i % 10 == 0 else random.uniform(1, 100), f"item_{i}"]
             for i in range(n_rows)]
}

def python_fill_null(table, col, fill_value):
    col_idx = table["columns"].index(col)
    return {"orientation": "row", "columns": table["columns"],
            "rows": [[fill_value if i == col_idx and v is None else v 
                     for i, v in enumerate(row)] for row in table["rows"]]}

run_benchmark("fill_null",
              lambda: python_fill_null(benchmark_with_nulls, "value", 0),
              lambda: table_fill_null_arrow(benchmark_with_nulls, "value", 0))

# Benchmark: If-else (conditional selection)
def python_if_else(table, cond_col, threshold, true_val, false_val, result_col):
    col_idx = table["columns"].index(cond_col)
    return {"orientation": "row", "columns": table["columns"] + [result_col],
            "rows": [row + [true_val if row[col_idx] > threshold else false_val] 
                     for row in table["rows"]]}

def arrow_if_else(table, col, threshold, true_val, false_val, result_col):
    with_mask = table_greater_arrow(table, col, threshold, result_col="_cond")
    result = table_if_else_arrow(with_mask, "_cond", true_val, false_val, 
                                  result_col=result_col)
    return table_drop_columns(result, ["_cond"])

run_benchmark("if_else",
              lambda: python_if_else(benchmark_data, "value", 50, "high", "low", "cat"),
              lambda: arrow_if_else(benchmark_data, "value", 50, "high", "low", "cat"))

print("Filtering & selection benchmarks complete.")

Filtering & selection benchmarks complete.


In [71]:
# === BENCHMARK RESULTS ===
show_benchmark_results()

ARROW COMPUTE BENCHMARK RESULTS
Data size: 1,000,000 rows

Operation                      Python       Arrow        Speedup   
----------------------------------------------------------------------
multiply (scalar)              0.443s       0.182s       2.4x
add (columns)                  0.422s       0.177s       2.4x
power                          0.446s       0.182s       2.4x
round                          0.477s       0.183s       2.6x
upper                          0.527s       0.194s       2.7x
lower                          0.445s       0.188s       2.4x
length                         0.201s       0.174s       1.2x
strip                          0.480s       0.184s       2.6x
starts_with                    0.309s       0.184s       1.7x
contains                       0.301s       0.191s       1.6x
replace                        0.448s       0.206s       2.2x
split                          0.740s       0.200s       3.7x
sum                            0.024s       0.182s       0

### Arrow Compute Functions Summary

| Category | Functions |
|----------|-----------|
| **Arithmetic** | `add`, `subtract`, `multiply`, `divide`, `negate`, `abs`, `sign`, `power`, `sqrt`, `exp`, `ln`, `log10`, `log2`, `round`, `ceil`, `floor`, `trunc` |
| **Trigonometric** | `sin`, `cos`, `tan`, `asin`, `acos`, `atan`, `atan2` |
| **Comparison** | `equal`, `not_equal`, `less`, `less_equal`, `greater`, `greater_equal` |
| **Null checks** | `is_null`, `is_valid`, `is_nan`, `is_finite`, `is_in` |
| **Logical** | `and`, `or`, `xor`, `invert` |
| **String** | `upper`, `lower`, `capitalize`, `title`, `strip`, `lstrip`, `rstrip`, `length`, `starts_with`, `ends_with`, `contains`, `replace_substr`, `split` |
| **Aggregates** | `summarize`, `sum`, `mean`, `min`, `max`, `count`, `count_distinct`, `stddev`, `variance`, `first`, `last`, `any`, `all` |
| **Cumulative** | `cumsum`, `cumprod`, `cummin`, `cummax`, `cummean`, `diff` |
| **Selection** | `if_else`, `coalesce`, `fill_null`, `fill_null_forward`, `fill_null_backward` |
| **Filter** | `filter` |

All functions follow the naming pattern `table_{operation}_arrow` and:
- Accept any table orientation (row, column, or arrow)
- Convert to arrow internally if needed
- Return arrow-oriented tables for chaining (individual aggregates such as `sum` and `mean` return scalars)
- Support `result_col` parameter to add result as new column (where applicable)

---

## 11. u8m Tables (uint8 Matrix Columns)

u8m-oriented tables store each column as a uint8 matrix: a fixed-width byte array that enables high-performance operations with Numba. This is particularly useful for string data with predictable formats, such as dates, tickers, and codes.

### Key Concepts

- **uint8 matrix**: A 2D numpy array of bytes (dtype=uint8) where each row is a fixed-width string
- **Pipe-delimited format**: Input matrices use `|field1|field2|field3|` format
- **`field()` extraction**: The `field(mat, n)` function extracts the nth field (1-indexed) as a slice
- **Zero-copy views**: Field extraction returns views, not copies, for efficiency

In [72]:
# Imports for u8m tables
from specparser.amt import (
    strs2u8mat,      # Convert Python strings to uint8 matrix
    u8m2s,           # Convert uint8 matrix to Python strings
    u8m_from_matrix, # Split pipe-delimited matrix into u8m table
    u8m_column,      # Extract single column from u8m table
    table_to_columns,
    table_nrows,
    table_orientation,
    table_validate,
    table_column,
    field,
)

### Creating u8m Tables

u8m tables are created from pipe-delimited uint8 matrices using `u8m_from_matrix()`.

In [73]:
# Create a pipe-delimited uint8 matrix
# Format: |field1|field2|field3|... (fields are 1-indexed)
mat = strs2u8mat([
    "|CL|2024-01|2024-03|",
    "|GC|2024-02|2024-04|",
    "|NG|2024-03|2024-06|",
])

print("Raw matrix shape:", mat.shape)
print("Raw matrix dtype:", mat.dtype)
print()

# Show the raw bytes as strings
print("Raw data:")
for row in u8m2s(mat):
    print(f"  {row!r}")

Raw matrix shape: (3, 20)
Raw matrix dtype: uint8

Raw data:
  '|CL|2024-01|2024-03|'
  '|GC|2024-02|2024-04|'
  '|NG|2024-03|2024-06|'


In [None]:
# Split into a u8m table
# Each column name corresponds to one field (in order, 1-indexed)
tbl = u8m_from_matrix(mat, ["asset", "entry", "expiry"])

print(f"Orientation: {table_orientation(tbl)}")
print(f"Columns: {tbl['columns']}")
print(f"Number of rows: {table_nrows(tbl)}")
# Single quotes inside the f-strings keep this cell valid on Python versions before 3.12
print(f"Rows type: {type(tbl['rows'][0].dtype)}")
print(f"Row  value: {u8m2s(tbl['rows'][0][0:1,:])}")
print(f"Rows shape: {tbl['rows'][0].shape}")
print()
print(f"column 0: {u8m2s(tbl['rows'][0]).tolist()}")
print()
      
# Each column is a uint8 matrix with different widths
for i, (name, col) in enumerate(zip(tbl["columns"], tbl["rows"])):
    print(f"{name}: shape={col.shape}, dtype={col.dtype}")

show_table(tbl)

Orientation: u8m
Columns: ['asset', 'entry', 'expiry']
Number of rows: 3
Rows type: <class 'numpy.dtypes.UInt8DType'>
Row  value: ['CL']
Rows shape: (3, 2)

column 0: ['CL', 'GC', 'NG']

asset: shape=(3, 2), dtype=uint8
entry: shape=(3, 7), dtype=uint8
expiry: shape=(3, 7), dtype=uint8


### Accessing Columns

Use `u8m_column()` to extract a single column as a uint8 matrix. Use `u8m2s()` to convert to Python strings for display.

In [75]:
# Extract asset column as uint8 matrix
asset_u8m = u8m_column(tbl, "asset")
print(f"asset column shape: {asset_u8m.shape}")
print(f"asset column dtype: {asset_u8m.dtype}")
print()

# Convert to Python strings for display
print("Assets:", u8m2s(asset_u8m).tolist())
print()

# Extract entry dates
entry_u8m = u8m_column(tbl, "entry")
print("Entry dates:", u8m2s(entry_u8m).tolist())

asset column shape: (3, 2)
asset column dtype: uint8

Assets: ['CL', 'GC', 'NG']

Entry dates: ['2024-01', '2024-02', '2024-03']


In [76]:
# table_column() also works and returns uint8 matrix
expiry_u8m = table_column(tbl, "expiry")
print(f"Expiry via table_column: {u8m2s(expiry_u8m).tolist()}")

Expiry via table_column: ['2024-03', '2024-04', '2024-06']


### How `field()` Works

The `field()` function extracts fields from pipe-delimited matrices. Fields are 1-indexed (field 1 is between the 1st and 2nd pipe).
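
The 1-indexed rule can be checked with a few lines of plain Python on a single string. The sketch below locates the pipes and returns the inclusive start and end positions of a field. `field_span` is a hypothetical helper for illustration; the real `field()` operates on uint8 matrices and presumably relies on every row sharing the same pipe positions.

```python
def field_span(row, n):
    """Return inclusive (start, end) positions of the n-th field (1-indexed)."""
    pipes = [i for i, ch in enumerate(row) if ch == "|"]
    if not 1 <= n < len(pipes):
        raise IndexError(f"row has {len(pipes) - 1} fields, got n={n}")
    # Field n sits between pipe n and pipe n+1 (counting pipes from 1)
    return pipes[n - 1] + 1, pipes[n] - 1


row = "|CL|2024-01|2024-03|"
spans = [field_span(row, n) for n in (1, 2, 3)]
fields = [row[start:end + 1] for start, end in spans]
print(spans)
print(fields)
```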

In [77]:
# Direct field extraction (what u8m_from_matrix does internally)
print("Field positions in: |CL|2024-01|2024-03|")
print("  field(mat, 1) = positions 1-2   -> CL")
print("  field(mat, 2) = positions 4-10  -> 2024-01")
print("  field(mat, 3) = positions 12-18 -> 2024-03")
print()

# Extract fields directly
f1 = field(mat, 1)
f2 = field(mat, 2)
f3 = field(mat, 3)

print(f"Field 1: {u8m2s(f1).tolist()} (width={f1.shape[1]})")
print(f"Field 2: {u8m2s(f2).tolist()} (width={f2.shape[1]})")
print(f"Field 3: {u8m2s(f3).tolist()} (width={f3.shape[1]})")

Field positions in: |CL|2024-01|2024-03|
  field(mat, 1) = positions 1-2   -> CL
  field(mat, 2) = positions 4-10  -> 2024-01
  field(mat, 3) = positions 12-18 -> 2024-03

Field 1: ['CL', 'GC', 'NG'] (width=2)
Field 2: ['2024-01', '2024-02', '2024-03'] (width=7)
Field 3: ['2024-03', '2024-04', '2024-06'] (width=7)


### Converting u8m to Other Formats

u8m tables can be converted to column-oriented tables with Python strings using `table_to_columns()`.

In [78]:
# Convert u8m table to column-oriented with strings
str_tbl = table_to_columns(tbl)

print(f"Orientation: {table_orientation(str_tbl)}")
print(f"Columns: {str_tbl['columns']}")
print()

# Now rows contains Python string lists
for name, values in zip(str_tbl["columns"], str_tbl["rows"]):
    print(f"{name}: {values}")

Orientation: column
Columns: ['asset', 'entry', 'expiry']

asset: ['CL', 'GC', 'NG']
entry: ['2024-01', '2024-02', '2024-03']
expiry: ['2024-03', '2024-04', '2024-06']


### Validation

u8m tables are validated like other table types. Key checks:
- All columns must have the same number of rows
- All columns must have dtype uint8
- Number of columns must match number of column names

In [79]:
# Validate our u8m table
table_validate(tbl)
print("u8m table is valid!")

u8m table is valid!


In [80]:
# Error case: wrong dtype
import numpy as np

bad_tbl = {
    "orientation": "u8m",
    "columns": ["col1"],
    "rows": [np.array([[1, 2], [3, 4]], dtype=np.int32)]  # Wrong dtype!
}

try:
    table_validate(bad_tbl)
except ValueError as e:
    print(f"Validation error: {e}")

Validation error: Column 0 (col1) has dtype int32, expected uint8
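
The three checks listed above can be sketched as a small standalone function. `check_u8m` is a hypothetical re-implementation for illustration only; the dtype message mirrors the output shown above, while the wording of the other messages is an assumption.

```python
import numpy as np

def check_u8m(table: dict) -> None:
    """Hypothetical re-implementation of the three u8m checks."""
    names, cols = table["columns"], table["rows"]
    if len(names) != len(cols):
        raise ValueError(f"{len(names)} column names but {len(cols)} columns")
    nrows = None
    for i, (name, col) in enumerate(zip(names, cols)):
        if col.dtype != np.uint8:
            raise ValueError(f"Column {i} ({name}) has dtype {col.dtype}, expected uint8")
        if nrows is None:
            nrows = col.shape[0]          # the first column fixes the row count
        elif col.shape[0] != nrows:
            raise ValueError(f"Column {i} ({name}) has {col.shape[0]} rows, expected {nrows}")

good = {"orientation": "u8m", "columns": ["a", "b"],
        "rows": [np.zeros((3, 2), dtype=np.uint8), np.zeros((3, 4), dtype=np.uint8)]}
check_u8m(good)  # passes silently

ragged = {"orientation": "u8m", "columns": ["a", "b"],
          "rows": [np.zeros((3, 2), dtype=np.uint8), np.zeros((2, 4), dtype=np.uint8)]}
try:
    check_u8m(ragged)
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```

Column widths may differ between columns (2 and 4 bytes here); only the row counts have to agree.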


### Practical Example: Futures Contract Specs

u8m tables are particularly useful for financial data with fixed-width string formats: every value in a column has the same byte width, so each column fits in a single uint8 matrix.

In [81]:
# Futures straddle specs: |asset|entry_ym|expiry_ym|type|
specs = strs2u8mat([
    "|CL Comdty      |2024-01|2024-03|N|",
    "|CL Comdty      |2024-01|2024-03|F|",
    "|GC Comdty      |2024-02|2024-05|N|",
    "|GC Comdty      |2024-02|2024-05|F|",
    "|NG Comdty      |2024-03|2024-06|N|",
])

# Split the matrix into a u8m table, one column per pipe-delimited field
specs_tbl = u8m_from_matrix(specs, ["asset", "entry", "expiry", "type"])

print("Futures specs table:")
print(f"  Rows: {table_nrows(specs_tbl)}")
print(f"  Columns: {specs_tbl['columns']}")
print()

# Show column details
for name in specs_tbl["columns"]:
    col = u8m_column(specs_tbl, name)
    print(f"{name:8} width={col.shape[1]:2}  values={u8m2s(col).tolist()}")

Futures specs table:
  Rows: 5
  Columns: ['asset', 'entry', 'expiry', 'type']

asset    width=15  values=['CL Comdty      ', 'CL Comdty      ', 'GC Comdty      ', 'GC Comdty      ', 'NG Comdty      ']
entry    width= 7  values=['2024-01', '2024-01', '2024-02', '2024-02', '2024-03']
expiry   width= 7  values=['2024-03', '2024-03', '2024-05', '2024-05', '2024-06']
type     width= 1  values=['N', 'F', 'N', 'F', 'N']
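
One reason this layout suits fixed-width data is that filters become vectorized byte comparisons. The sketch below rebuilds stand-ins for the `type` and `entry` columns with plain NumPy (it does not call the library) and keeps only the rows whose type is `N`.

```python
import numpy as np

# Stand-ins for the columns above, built directly from ASCII bytes.
type_col = np.frombuffer(b"NFNFN", dtype=np.uint8).reshape(-1, 1)        # shape (5, 1)
entry_col = np.frombuffer(
    b"2024-012024-012024-022024-022024-03", dtype=np.uint8
).reshape(-1, 7)                                                          # shape (5, 7)

# Compare the single byte of each type value against the ASCII code of 'N'.
is_n = type_col[:, 0] == ord("N")

# The boolean mask selects matching rows without decoding any strings first.
n_entries = [row.tobytes().decode("ascii") for row in entry_col[is_n]]
print(is_n.tolist(), n_entries)
```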


### u8m Functions Summary

**u8m-specific functions:**

| Function | Description | Orientation |
|----------|-------------|-------------|
| `u8m_from_matrix(mat, columns)` | Split pipe-delimited uint8 matrix into u8m table | → u8m |
| `u8m_column(table, colname)` | Extract column as uint8 matrix | u8m only |
| `table_u8m2arrow(table)` | Convert u8m table to arrow (for joins) | u8m → arrow |

**Functions that support u8m:**

| Function | u8m Support | Notes |
|----------|-------------|-------|
| `table_column` | ✓ | Returns uint8 matrix |
| `table_to_columns` | ✓ | Converts to Python strings (stripped) |
| `table_u8m2arrow` | ✓ | Converts to PyArrow string arrays (stripped) |
| `table_nrows` | ✓ | Returns row count |
| `table_orientation` | ✓ | Returns "u8m" |
| `table_validate` | ✓ | Checks dtype=uint8 |
| `show_table` | ✓ | Converts to column orientation first |
| `format_table` | ✓ | Converts to column orientation first |
| `print_table` | ✓ | Converts to column orientation first |

**Related string functions (from `strings.py`):**

| Function | Description |
|----------|-------------|
| `strs2u8mat(strings)` | Convert Python strings to uint8 matrix |
| `u8m2s(mat)` | Convert uint8 matrix to numpy string array |
| `field(mat, n)` | Extract nth field (1-indexed) from pipe-delimited matrix |

**Note:** u8m tables support one-way conversion only (u8m → column/arrow). There is no conversion *to* u8m from other orientations since the pipe-delimited format is specific to the input data. Use `table_u8m2arrow()` when you need to join u8m tables with other tables.