# Arrow and Parquet

* [Comparing Data Storage: Parquet vs. Arrow](https://medium.com/@diehardankush/comparing-data-storage-parquet-vs-arrow-aa2231e51c8a)

Differences between Arrow and Parquet:

* Parquet:
    * columnar file format
    * on-disk storage
    * originated in the Hadoop ecosystem
    * efficient compression (e.g. Snappy, Gzip, LZO)
* Arrow:
    * columnar in-memory format for data processing
    * multi-language bindings, e.g. Python, R, Java, JavaScript, ...

Example for reading and writing:

In [2]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time

In [3]:
# Generate a sample DataFrame
df = pd.DataFrame({
    'one': pd.Series(range(1000000)),
    'two': pd.Series(range(1000000, 2000000)),
    'three': pd.Series(range(2000000, 3000000))
})

In [4]:
df.head()

| | one | two | three |
|---|---|---|---|
| 0 | 0 | 1000000 | 2000000 |
| 1 | 1 | 1000001 | 2000001 |
| 2 | 2 | 1000002 | 2000002 |
| 3 | 3 | 1000003 | 2000003 |
| 4 | 4 | 1000004 | 2000004 |


Write to and read from Parquet using pandas (pandas uses pyarrow as its default Parquet engine when it is installed):

In [13]:
# Writing and reading Parquet with pandas
start = time.time()
df.to_parquet('data.pandas')
read_parquet = pd.read_parquet('data.pandas')
end = time.time()
print(f"Parquet Write + Read time: {round(end - start, ndigits = 2)}s")

Parquet Write + Read time: 0.12s


Write to and read from Parquet using PyArrow

In [14]:
# Convert the DataFrame to Arrow Table
table = pa.Table.from_pandas(df)
table

pyarrow.Table
one: int64
two: int64
three: int64
----
one: [[0,1,2,3,4,...,999995,999996,999997,999998,999999]]
two: [[1000000,1000001,1000002,1000003,1000004,...,1999995,1999996,1999997,1999998,1999999]]
three: [[2000000,2000001,2000002,2000003,2000004,...,2999995,2999996,2999997,2999998,2999999]]

In [15]:
# Writing and reading Parquet with PyArrow
start = time.time()
pq.write_table(table, 'data.parquet')
read_arrow = pq.read_table('data.parquet').to_pandas()
end = time.time()
print(f"PyArrow Write + Read time: {round(end - start, ndigits=2)}s")

PyArrow Write + Read time: 0.12s


In [16]:
!rm data.parquet data.pandas