# ArcticDB Experimental Arrow Reading Demo

### This notebook demonstrates the first cut of reading ArcticDB data directly into pyarrow tables. This is still an experimental feature: both the API and its behaviour are subject to change in minor or patch releases, and it should under no circumstances be deployed to production environments. There are still many rough edges, several of which are highlighted below, that will be addressed in future releases. In addition, ArcticDB does not yet support writing pyarrow tables directly, so Pandas is still needed in the demo.

### Performance-wise, reading numeric data as Arrow will generally be identical to reading it as Pandas. String data is faster to read as Arrow, as the GIL is not needed to create string columns in Arrow.

## Preamble

In [1]:
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
from arcticdb import Arctic, LibraryOptions, QueryBuilder

In [2]:
ac = Arctic("lmdb://tmp/arrow_reads_demo")

In [3]:
ac.delete_library("arrow_static")
ac.delete_library("arrow_dynamic")
lib = ac.create_library("arrow_static")
lib_dyn = ac.create_library("arrow_dynamic", LibraryOptions(dynamic_schema=True))

In [4]:
sym = "test"

## Helper function for pretty-printing pyarrow tables, as the default repr isn't very human-friendly. Note that converting a pyarrow table to a polars dataframe is mostly zero-copy!

In [5]:
def print_table(table):
    print(pl.from_arrow(table))

## Write some numeric data

In [6]:
df = pd.DataFrame({"col1": np.arange(10, dtype=np.int64), "col2": np.arange(10, 20, dtype=np.float64)})
lib.write(sym, df)
df

Unnamed: 0,col1,col2
0,0,10.0
1,1,11.0
2,2,12.0
3,3,13.0
4,4,14.0
5,5,15.0
6,6,16.0
7,7,17.0
8,8,18.0
9,9,19.0


## Read back as a pyarrow Table

In [7]:
from arcticdb import OutputFormat
table = lib.read(sym, output_format=OutputFormat.EXPERIMENTAL_ARROW).data
print(type(table))
table

<class 'pyarrow.lib.Table'>


pyarrow.Table
col1: int64
col2: double
----
col1: [[0,1,2,3,4,5,6,7,8,9]]
col2: [[10,11,12,13,14,15,16,17,18,19]]

In [8]:
print_table(table)

shape: (10, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 0    ┆ 10.0 │
│ 1    ┆ 11.0 │
│ 2    ┆ 12.0 │
│ 3    ┆ 13.0 │
│ 4    ┆ 14.0 │
│ 5    ┆ 15.0 │
│ 6    ┆ 16.0 │
│ 7    ┆ 17.0 │
│ 8    ┆ 18.0 │
│ 9    ┆ 19.0 │
└──────┴──────┘


### Note that Arrow has no concept of indexes, so the RangeIndex has been dropped

## The `output_format` argument is supported by all read-like methods (read, head, tail, batch_read, batch_read_and_join). A (case-insensitive) string can be passed instead of importing the enum

In [9]:
table = lib.head(sym, output_format=OutputFormat.EXPERIMENTAL_ARROW).data
print_table(table)
table = lib.tail(sym, output_format="experimental_arrow").data
print_table(table)

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 0    ┆ 10.0 │
│ 1    ┆ 11.0 │
│ 2    ┆ 12.0 │
│ 3    ┆ 13.0 │
│ 4    ┆ 14.0 │
└──────┴──────┘
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 5    ┆ 15.0 │
│ 6    ┆ 16.0 │
│ 7    ┆ 17.0 │
│ 8    ┆ 18.0 │
│ 9    ┆ 19.0 │
└──────┴──────┘


## You can also set the output format on the library object, so you do not need to pass it every time

In [10]:
lib = ac.get_library("arrow_static", output_format=OutputFormat.EXPERIMENTAL_ARROW)
print_table(lib.head(sym).data)

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 0    ┆ 10.0 │
│ 1    ┆ 11.0 │
│ 2    ┆ 12.0 │
│ 3    ┆ 13.0 │
│ 4    ┆ 14.0 │
└──────┴──────┘


### Or on the entire `Arctic` instance, so all libraries fetched from this instance use Arrow as the default return type

`ac = Arctic("lmdb://tmp/arrow_reads_demo", output_format=OutputFormat.EXPERIMENTAL_ARROW)`

## In this case you can override back to Pandas output format for individual read-like calls

In [11]:
lib.read(sym, output_format=OutputFormat.PANDAS).data

Unnamed: 0,col1,col2
0,0,10.0
1,1,11.0
2,2,12.0
3,3,13.0
4,4,14.0
5,5,15.0
6,6,16.0
7,7,17.0
8,8,18.0
9,9,19.0


## Timeseries indexes appear as the first column

In [12]:
df = pd.DataFrame({"col1": np.arange(10, dtype=np.int64), "col2": np.arange(10, 20, dtype=np.float64)}, index=pd.date_range("2025-01-01", periods=10))
df.index.name = "ts"
df

Unnamed: 0_level_0,col1,col2
ts,Unnamed: 1_level_1,Unnamed: 2_level_1
2025-01-01,0,10.0
2025-01-02,1,11.0
2025-01-03,2,12.0
2025-01-04,3,13.0
2025-01-05,4,14.0
2025-01-06,5,15.0
2025-01-07,6,16.0
2025-01-08,7,17.0
2025-01-09,8,18.0
2025-01-10,9,19.0


In [13]:
lib.write(sym, df)
table = lib.head(sym).data
print_table(table)

shape: (5, 3)
┌─────────────────────┬──────┬──────┐
│ ts                  ┆ col1 ┆ col2 │
│ ---                 ┆ ---  ┆ ---  │
│ datetime[ns]        ┆ i64  ┆ f64  │
╞═════════════════════╪══════╪══════╡
│ 2025-01-01 00:00:00 ┆ 0    ┆ 10.0 │
│ 2025-01-02 00:00:00 ┆ 1    ┆ 11.0 │
│ 2025-01-03 00:00:00 ┆ 2    ┆ 12.0 │
│ 2025-01-04 00:00:00 ┆ 3    ┆ 13.0 │
│ 2025-01-05 00:00:00 ┆ 4    ┆ 14.0 │
└─────────────────────┴──────┴──────┘


## Timezones are also maintained. If the index does not have a name, it will default to "index"

In [14]:
df = pd.DataFrame({"col1": np.arange(10, dtype=np.int64), "col2": np.arange(10, 20, dtype=np.float64)}, index=pd.date_range("2025-01-01", periods=10, tz="America/New_York"))
df

Unnamed: 0,col1,col2
2025-01-01 00:00:00-05:00,0,10.0
2025-01-02 00:00:00-05:00,1,11.0
2025-01-03 00:00:00-05:00,2,12.0
2025-01-04 00:00:00-05:00,3,13.0
2025-01-05 00:00:00-05:00,4,14.0
2025-01-06 00:00:00-05:00,5,15.0
2025-01-07 00:00:00-05:00,6,16.0
2025-01-08 00:00:00-05:00,7,17.0
2025-01-09 00:00:00-05:00,8,18.0
2025-01-10 00:00:00-05:00,9,19.0


In [15]:
lib.write(sym, df)
table = lib.head(sym).data
print_table(table)

shape: (5, 3)
┌────────────────────────────────┬──────┬──────┐
│ index                          ┆ col1 ┆ col2 │
│ ---                            ┆ ---  ┆ ---  │
│ datetime[ns, America/New_York] ┆ i64  ┆ f64  │
╞════════════════════════════════╪══════╪══════╡
│ 2025-01-01 00:00:00 EST        ┆ 0    ┆ 10.0 │
│ 2025-01-02 00:00:00 EST        ┆ 1    ┆ 11.0 │
│ 2025-01-03 00:00:00 EST        ┆ 2    ┆ 12.0 │
│ 2025-01-04 00:00:00 EST        ┆ 3    ┆ 13.0 │
│ 2025-01-05 00:00:00 EST        ┆ 4    ┆ 14.0 │
└────────────────────────────────┴──────┴──────┘


## Column selection and date_range filtering work as expected

In [16]:
table = lib.read(sym, date_range=(pd.Timestamp("2025-01-03"), pd.Timestamp("2025-01-06")), columns=["col1"], output_format="EXPERIMENTAL_ARROW").data
print_table(table)

shape: (3, 2)
┌────────────────────────────────┬──────┐
│ index                          ┆ col1 │
│ ---                            ┆ ---  │
│ datetime[ns, America/New_York] ┆ i64  │
╞════════════════════════════════╪══════╡
│ 2025-01-03 00:00:00 EST        ┆ 2    │
│ 2025-01-04 00:00:00 EST        ┆ 3    │
│ 2025-01-05 00:00:00 EST        ┆ 4    │
└────────────────────────────────┴──────┘


## Appends, updates, and writes large enough to be row-sliced produce a returned table made up of multiple record batches
Each record batch corresponds to one row-slice as stored on disk in ArcticDB, and can be thought of as a row-slice within the table as well.

In practice, all this means is that each column is backed by one buffer per record batch, rather than one large contiguous buffer. All subsequent operations will work regardless of the underlying structure.

We will add an option in the future to read directly into a single record batch.

In [17]:
df_0 = pd.DataFrame({"col1": np.arange(2, dtype=np.int64)})
df_1 = pd.DataFrame({"col1": np.arange(2, 4, dtype=np.int64)})
lib.write(sym, df_0)
lib.append(sym, df_1)
table = lib.read(sym).data
table

pyarrow.Table
col1: int64
----
col1: [[0,1],[2,3]]

In [18]:
print_table(table)

shape: (4, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 1    │
│ 2    │
│ 3    │
└──────┘


## These buffers can be combined back into one
This involves one allocation per column and num_record_batches*num_columns memcpys, so it is not free, but it may improve the performance of subsequent processing operations, particularly if the data was highly fragmented on disk.

In [19]:
combined_table = table.combine_chunks()
combined_table

pyarrow.Table
col1: int64
----
col1: [[0,1,2,3]]

In [20]:
combined_table.equals(table)

True

## Write some strings

In [21]:
df = pd.DataFrame({"col1": ["hello", None, "bonjour", "gutentag", np.nan, "nihao"]})
df

Unnamed: 0,col1
0,hello
1,
2,bonjour
3,gutentag
4,
5,nihao


In [22]:
lib.write(sym, df)

VersionedItem(symbol='test', library='arrow_static', data=n/a, version=5, metadata=None, host='LMDB(path=/data/team/data/arctic_native/examples/tmp/arrow_reads_demo)', timestamp=1755856407542304768)

## String columns are returned as the dictionary-encoded type
This is similar to a Pandas categorical, and also to our in-memory representation of string data. It is very space efficient if the same strings appear multiple times in a column.

We will add an option in the future to get the column back as a `pa.string()` or `pa.large_string()` type.

In [23]:
table = lib.read(sym).data
table

pyarrow.Table
col1: dictionary<values=large_string, indices=int32, ordered=0>
----
col1: [  -- dictionary:
["hello","bonjour","gutentag","nihao"]  -- indices:
[0,null,1,2,null,3]]

### Note that `None` and `NaN` values in the input dataframe will be returned as a pyarrow `null`, representing a missing value

In [24]:
# Polars uses `cat` to mean a dictionary-encoded column
print_table(table)

shape: (6, 1)
┌──────────┐
│ col1     │
│ ---      │
│ cat      │
╞══════════╡
│ hello    │
│ null     │
│ bonjour  │
│ gutentag │
│ null     │
│ nihao    │
└──────────┘


## We have provided a utility function to get back to the "default" pyarrow string column representation

In [25]:
from arcticdb.util.arrow import stringify_dictionary_encoded_columns
table = stringify_dictionary_encoded_columns(table)
table

pyarrow.Table
col1: large_string
----
col1: [["hello",null,"bonjour","gutentag",null,"nihao"]]

## When called with only the table, this returns the `large_string` type, as this is less limiting. If the total size of all strings in a column is below 2GB, you can pass `pa.string()` to get the smaller string type

In [26]:
table = lib.read(sym).data
table = stringify_dictionary_encoded_columns(table, pa.string())
table

pyarrow.Table
col1: string
----
col1: [["hello",null,"bonjour","gutentag",null,"nihao"]]

## Dynamic schema follows the usual type promotion rules

In [27]:
lib_dyn = ac.get_library("arrow_dynamic", output_format=OutputFormat.EXPERIMENTAL_ARROW)
df_0 = pd.DataFrame({"col1": np.arange(2, dtype=np.uint8)})
df_1 = pd.DataFrame({"col1": np.arange(2, 4, dtype=np.int8)})
lib_dyn.write(sym, df_0)
lib_dyn.append(sym, df_1)
table = lib_dyn.read(sym).data
print_table(table)

shape: (4, 1)
┌──────┐
│ col1 │
│ ---  │
│ i16  │
╞══════╡
│ 0    │
│ 1    │
│ 2    │
│ 3    │
└──────┘

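For this particular pair of types, NumPy's own promotion rules give the same answer: neither `uint8` nor `int8` can hold every value of the other, so the smallest common type is `int16`. The sketch below uses only NumPy and illustrates the promotion idea. It is not a statement about how ArcticDB implements promotion internally.

```python
import numpy as np

# Smallest type that can represent all uint8 and all int8 values.
promoted = np.promote_types(np.uint8, np.int8)

# Concatenating arrays of the two types applies the same promotion.
combined = np.concatenate([
    np.arange(2, dtype=np.uint8),
    np.arange(2, 4, dtype=np.int8),
])

print(promoted, combined.dtype, combined.tolist())
```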

## Columns that are missing from some row slices with dynamic schema are now backfilled with pyarrow nulls, rather than type-specific default values (e.g. NaN for floats)

In [28]:
df_0 = pd.DataFrame({"col1": np.arange(2, dtype=np.uint8)})
df_1 = pd.DataFrame({"col2": ["hello", "bonjour"]})
lib_dyn.write(sym, df_0)
lib_dyn.append(sym, df_1)
table = lib_dyn.read(sym).data
print_table(table)

shape: (4, 2)
┌──────┬─────────┐
│ col1 ┆ col2    │
│ ---  ┆ ---     │
│ u8   ┆ cat     │
╞══════╪═════════╡
│ 0    ┆ null    │
│ 1    ┆ null    │
│ null ┆ hello   │
│ null ┆ bonjour │
└──────┴─────────┘


## Multiindexes come back as regular columns as Arrow has no concept of indexes

In [29]:
df = pd.DataFrame({"col1": np.arange(10)}, index=pd.MultiIndex.from_product([pd.date_range("2025-01-01", periods=5), ["GOOG", "APPL"]], names=["ts", "ticker"]))
lib.write(sym, df)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,col1
ts,ticker,Unnamed: 2_level_1
2025-01-01,GOOG,0
2025-01-01,APPL,1
2025-01-02,GOOG,2
2025-01-02,APPL,3
2025-01-03,GOOG,4
2025-01-03,APPL,5
2025-01-04,GOOG,6
2025-01-04,APPL,7
2025-01-05,GOOG,8
2025-01-05,APPL,9


In [30]:
table = lib.read(sym).data
table

pyarrow.Table
ts: timestamp[ns]
ticker: dictionary<values=large_string, indices=int32, ordered=0>
col1: int64
----
ts: [[2025-01-01 00:00:00.000000000,2025-01-01 00:00:00.000000000,2025-01-02 00:00:00.000000000,2025-01-02 00:00:00.000000000,2025-01-03 00:00:00.000000000,2025-01-03 00:00:00.000000000,2025-01-04 00:00:00.000000000,2025-01-04 00:00:00.000000000,2025-01-05 00:00:00.000000000,2025-01-05 00:00:00.000000000]]
ticker: [  -- dictionary:
["GOOG","APPL"]  -- indices:
[0,1,0,1,0,1,0,1,0,1]]
col1: [[0,1,2,3,4,5,6,7,8,9]]

In [31]:
print_table(table)

shape: (10, 3)
┌─────────────────────┬────────┬──────┐
│ ts                  ┆ ticker ┆ col1 │
│ ---                 ┆ ---    ┆ ---  │
│ datetime[ns]        ┆ cat    ┆ i64  │
╞═════════════════════╪════════╪══════╡
│ 2025-01-01 00:00:00 ┆ GOOG   ┆ 0    │
│ 2025-01-01 00:00:00 ┆ APPL   ┆ 1    │
│ 2025-01-02 00:00:00 ┆ GOOG   ┆ 2    │
│ 2025-01-02 00:00:00 ┆ APPL   ┆ 3    │
│ 2025-01-03 00:00:00 ┆ GOOG   ┆ 4    │
│ 2025-01-03 00:00:00 ┆ APPL   ┆ 5    │
│ 2025-01-04 00:00:00 ┆ GOOG   ┆ 6    │
│ 2025-01-04 00:00:00 ┆ APPL   ┆ 7    │
│ 2025-01-05 00:00:00 ┆ GOOG   ┆ 8    │
│ 2025-01-05 00:00:00 ┆ APPL   ┆ 9    │
└─────────────────────┴────────┴──────┘


## Series come back as a single-column pyarrow Table rather than a pyarrow Array, even though an Array is the closer match when indexes aren't present

In [32]:
s = pd.Series(np.arange(10, 20), name="my series")
lib.write(sym, s)
s

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
Name: my series, dtype: int64

In [33]:
table = lib.read(sym).data
print_table(table)

shape: (10, 1)
┌───────────┐
│ my series │
│ ---       │
│ i64       │
╞═══════════╡
│ 10        │
│ 11        │
│ 12        │
│ 13        │
│ 14        │
│ 15        │
│ 16        │
│ 17        │
│ 18        │
│ 19        │
└───────────┘


## Processing operations with lazy dataframes or the `QueryBuilder` can also return Arrow tables, but this only works with static schema in this release

In [34]:
rng = np.random.default_rng()
df = pd.DataFrame({"col": rng.random(1_000_000)})
lib.write(sym, df)
lazy_df = lib.read(sym, lazy=True)
lazy_df["2 * col"] = 2 * lazy_df["col"]
table = lazy_df.collect().data
print_table(table)

shape: (1_000_000, 2)
┌──────────┬──────────┐
│ col      ┆ 2 * col  │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.871273 ┆ 1.742546 │
│ 0.206551 ┆ 0.413102 │
│ 0.407184 ┆ 0.814368 │
│ 0.394772 ┆ 0.789544 │
│ 0.535344 ┆ 1.070689 │
│ …        ┆ …        │
│ 0.635224 ┆ 1.270447 │
│ 0.20707  ┆ 0.41414  │
│ 0.111366 ┆ 0.222733 │
│ 0.610101 ┆ 1.220202 │
│ 0.112928 ┆ 0.225856 │
└──────────┴──────────┘


## Pandas interoperability - we attach our normalization metadata to the pyarrow tables, so in general `to_pandas()` should get back to the original dataframe that was written

In [35]:
# Default RangeIndex
df = pd.DataFrame({"col1": np.arange(10)})
lib.write(sym, df)
table = lib.read(sym).data
table.to_pandas().index

RangeIndex(start=0, stop=10, step=1)

In [36]:
# MultiIndex
df = pd.DataFrame({"col1": np.arange(10)}, index=pd.MultiIndex.from_product([pd.date_range("2025-01-01", periods=5), ["GOOG", "APPL"]], names=["ts", "ticker"]))
lib.write(sym, df)
table = lib.read(sym).data
table.to_pandas().index

MultiIndex([('2025-01-01', 'GOOG'),
            ('2025-01-01', 'APPL'),
            ('2025-01-02', 'GOOG'),
            ('2025-01-02', 'APPL'),
            ('2025-01-03', 'GOOG'),
            ('2025-01-03', 'APPL'),
            ('2025-01-04', 'GOOG'),
            ('2025-01-04', 'APPL'),
            ('2025-01-05', 'GOOG'),
            ('2025-01-05', 'APPL')],
           names=['ts', 'ticker'])

## Pandas quirks - Pandas is very permissive in ways that Arrow is not, particularly around column names

In [37]:
# Unnamed multiindex exposes our internal name
df = pd.DataFrame({"col1": np.arange(10)}, index=pd.MultiIndex.from_product([pd.date_range("2025-01-01", periods=5), ["GOOG", "APPL"]]))
lib.write(sym, df)
table = lib.read(sym).data
print_table(table)
# And restores the unnamed indexes on calling to_pandas
table.to_pandas()

shape: (10, 3)
┌─────────────────────┬────────────┬──────┐
│ index               ┆ __fkidx__1 ┆ col1 │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[ns]        ┆ cat        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2025-01-01 00:00:00 ┆ GOOG       ┆ 0    │
│ 2025-01-01 00:00:00 ┆ APPL       ┆ 1    │
│ 2025-01-02 00:00:00 ┆ GOOG       ┆ 2    │
│ 2025-01-02 00:00:00 ┆ APPL       ┆ 3    │
│ 2025-01-03 00:00:00 ┆ GOOG       ┆ 4    │
│ 2025-01-03 00:00:00 ┆ APPL       ┆ 5    │
│ 2025-01-04 00:00:00 ┆ GOOG       ┆ 6    │
│ 2025-01-04 00:00:00 ┆ APPL       ┆ 7    │
│ 2025-01-05 00:00:00 ┆ GOOG       ┆ 8    │
│ 2025-01-05 00:00:00 ┆ APPL       ┆ 9    │
└─────────────────────┴────────────┴──────┘


Unnamed: 0,Unnamed: 1,col1
2025-01-01,GOOG,0
2025-01-01,APPL,1
2025-01-02,GOOG,2
2025-01-02,APPL,3
2025-01-03,GOOG,4
2025-01-03,APPL,5
2025-01-04,GOOG,6
2025-01-04,APPL,7
2025-01-05,GOOG,8
2025-01-05,APPL,9


In [38]:
# pyarrow does something similar, although without putting index columns first
print_table(pa.Table.from_pandas(df))

shape: (10, 3)
┌──────┬─────────────────────┬───────────────────┐
│ col1 ┆ __index_level_0__   ┆ __index_level_1__ │
│ ---  ┆ ---                 ┆ ---               │
│ i64  ┆ datetime[ns]        ┆ str               │
╞══════╪═════════════════════╪═══════════════════╡
│ 0    ┆ 2025-01-01 00:00:00 ┆ GOOG              │
│ 1    ┆ 2025-01-01 00:00:00 ┆ APPL              │
│ 2    ┆ 2025-01-02 00:00:00 ┆ GOOG              │
│ 3    ┆ 2025-01-02 00:00:00 ┆ APPL              │
│ 4    ┆ 2025-01-03 00:00:00 ┆ GOOG              │
│ 5    ┆ 2025-01-03 00:00:00 ┆ APPL              │
│ 6    ┆ 2025-01-04 00:00:00 ┆ GOOG              │
│ 7    ┆ 2025-01-04 00:00:00 ┆ APPL              │
│ 8    ┆ 2025-01-05 00:00:00 ┆ GOOG              │
│ 9    ┆ 2025-01-05 00:00:00 ┆ APPL              │
└──────┴─────────────────────┴───────────────────┘


In [39]:
# An unnamed Series just has its column called "0" in the output table
s = pd.Series(np.arange(10, 20))
lib.write(sym, s)
table = lib.read(sym).data
print_table(table)
# This is maintained when converting back to Pandas
table.to_pandas()

shape: (10, 1)
┌─────┐
│ 0   │
│ --- │
│ i64 │
╞═════╡
│ 10  │
│ 11  │
│ 12  │
│ 13  │
│ 14  │
│ 15  │
│ 16  │
│ 17  │
│ 18  │
│ 19  │
└─────┘


Unnamed: 0,0
0,10
1,11
2,12
3,13
4,14
5,15
6,16
7,17
8,18
9,19


In [40]:
# Pandas allows duplicate column names, ArcticDB does not internally, and nor does Arrow. Our internal name is exposed in this case
df = pd.DataFrame(np.ones((10, 2)), columns=["col1", "col1"])
lib.write(sym, df)
table = lib.read(sym).data
print_table(table)
# Original names are recovered on conversion back to Pandas
table.to_pandas()

shape: (10, 2)
┌───────────────┬───────────────┐
│ __col_col1__0 ┆ __col_col1__1 │
│ ---           ┆ ---           │
│ f64           ┆ f64           │
╞═══════════════╪═══════════════╡
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
│ 1.0           ┆ 1.0           │
└───────────────┴───────────────┘


Unnamed: 0,col1,col1
0,1.0,1.0
1,1.0,1.0
2,1.0,1.0
3,1.0,1.0
4,1.0,1.0
5,1.0,1.0
6,1.0,1.0
7,1.0,1.0
8,1.0,1.0
9,1.0,1.0


In [41]:
# pyarrow does not allow duplicate column names when converting from Pandas
try:
    pa.Table.from_pandas(df)
except Exception as e:
    print(e)

Duplicate column names found: ['col1', 'col1']
