# Arrow (Feather and IPC)

## What is Arrow?

Apache Arrow is an Apache project that provides:
- Data format: Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware such as CPUs and GPUs. Feather (on disk) and IPC (in memory and between processes) share this design. The **Arrow memory format (i.e. IPC)** also supports zero-copy reads, which gives fast data access without serialization overhead. The **Feather** format is the on-disk representation of `IPC`.
- Libraries: Arrow's libraries implement the format and provide building blocks for a range of use cases, such as reading and writing columnar storage formats, sharing memory locally, moving data over the network, and in-memory data structures for analytics. APIs are available in Python, R, Java, and other languages.

## Feather

Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. There are two file format versions for Feather:

- Version 2 (V2), the default version, is represented on disk exactly as the Arrow IPC file format. V2 files support storing all Arrow data types, as well as compression with `LZ4` or `ZSTD`. V2 was first made available in Apache Arrow 0.17.0.
- Version 1 (V1) is the legacy version and is deprecated. **Don't use it.**

In [1]:
import os
import pandas as pd
import pyarrow.parquet as pq
import pyarrow.feather as pf
import sys
import time
import s3fs

In [2]:
root_path="/home/pengfei/data_set/kaggle/data_format"
parquet_file=f"{root_path}/netflix.parquet"
feather_file=f"{root_path}/netflix.feather"

In [22]:
def get_size(file: str) -> None:
    """Print the size of a file in MB (1 MB = 1024 * 1024 bytes)."""
    file_size = os.path.getsize(file) / (1024*1024)
    print(f"The file size of {file} is {file_size} MB")

In [27]:
def pandas_read_parquet_perf(parquet_file_path: str) -> str:
    """Time how long pandas takes to read a Parquet file into a DataFrame."""
    framework = "pandas"
    action = "read"
    dformat = "parquet"

    # Time only the read itself, not the printing
    start = time.perf_counter()
    df = pd.read_parquet(parquet_file_path)
    stop = time.perf_counter()
    print(df.shape)
    print(df.head(5))
    elapsed = stop - start
    print(f"framework: {framework} ,action: {action}, format: {dformat}, time: {elapsed}")
    return f"{framework},{action},{dformat},{elapsed}"


In [25]:
def arrow_read_parquet_perf(parquet_file_path: str) -> str:
    """Time how long pyarrow takes to read a Parquet file into an Arrow table."""
    framework = "pyarrow"
    action = "read"
    dformat = "parquet"
    # Time only the read itself, not the printing
    start = time.perf_counter()
    arrow_table = pq.read_table(parquet_file_path)
    stop = time.perf_counter()
    print(arrow_table.shape)
    elapsed = stop - start
    print(f"framework: {framework} ,action: {action}, format: {dformat}, time: {elapsed}")
    return f"{framework},{action},{dformat},{elapsed}"

In [17]:
def pandas_parquet_to_feather_perf(parquet_file_path: str, feather_out_path: str) -> str:
    """Read a Parquet file with pandas, then time writing it out as Feather."""
    framework = "pandas"
    action = "write"
    dformat = "feather"
    # The Parquet read is not part of the measurement
    df = pd.read_parquet(parquet_file_path)
    print(df.head(5))
    start = time.perf_counter()
    df.to_feather(feather_out_path)
    stop = time.perf_counter()
    elapsed = stop - start
    print(f"framework: {framework} ,action: {action}, format: {dformat}, time: {elapsed}")
    return f"{framework},{action},{dformat},{elapsed}"

In [30]:
def pandas_read_feather_perf(feather_file_path: str) -> str:
    """Time how long pandas takes to read a Feather file into a DataFrame."""
    framework = "pandas"
    action = "read"
    dformat = "feather"
    # Time only the read itself, not the printing
    start = time.perf_counter()
    df = pd.read_feather(feather_file_path)
    stop = time.perf_counter()
    print(df.shape)
    print(df.head(5))
    elapsed = stop - start
    print(f"framework: {framework} ,action: {action}, format: {dformat}, time: {elapsed}")
    return f"{framework},{action},{dformat},{elapsed}"

In [33]:
def arrow_read_feather_perf(feather_file_path: str) -> str:
    """Time how long pyarrow takes to read a Feather file into an Arrow table."""
    framework = "pyarrow"
    action = "read"
    dformat = "feather"
    # Time only the read itself, not the printing
    start = time.perf_counter()
    arrow_table = pf.read_table(feather_file_path)
    stop = time.perf_counter()
    print(arrow_table.shape)
    elapsed = stop - start
    print(f"framework: {framework} ,action: {action}, format: {dformat}, time: {elapsed}")
    return f"{framework},{action},{dformat},{elapsed}"
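
The benchmark functions above all repeat the same timing pattern. One possible refactoring is sketched below using only the standard library. The helper name `timed_call` is hypothetical and is not used elsewhere in this notebook; it returns both the result of the call and the same comma-separated record the functions above return.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(framework: str, action: str, dformat: str,
               fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, str]:
    """Run fn(*args, **kwargs) and measure only the duration of that call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"framework: {framework}, action: {action}, format: {dformat}, time: {elapsed}")
    return result, f"{framework},{action},{dformat},{elapsed}"


# Stand-in workload so the sketch runs without any data file
total, record = timed_call("python", "sum", "range", sum, range(1_000_000))
print(total)
```

With this helper, a read benchmark would look like `timed_call("pyarrow", "read", "feather", pf.read_table, feather_file)`, assuming the imports and paths defined earlier in the notebook.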

In [28]:
# Read parquet with pandas
pandas_read_parquet_perf(parquet_file)

(24058262, 3)
   user_id rating        date
0  1488844      3  2005-09-06
1   822109      5  2005-05-13
2   885013      4  2005-10-19
3    30878      4  2005-12-26
4   823519      3  2004-05-03
framework: pandas ,action: read, format: parquet, time: 12.230357868014835


'pandas,read,parquet,12.230357868014835'

In [29]:
# read parquet with arrow
arrow_read_parquet_perf(parquet_file)

(24058262, 3)
framework: pyarrow ,action: read, format: parquet, time: 2.3181647080054972


'pyarrow,read,parquet,2.3181647080054972'

In [18]:
# generate feather from parquet
pandas_parquet_to_feather_perf(parquet_file,feather_file)

   user_id rating        date
0  1488844      3  2005-09-06
1   822109      5  2005-05-13
2   885013      4  2005-10-19
3    30878      4  2005-12-26
4   823519      3  2004-05-03
framework: pandas ,action: write, format: feather, time: 6.963145193003584


'pandas,write,feather,6.963145193003584'

In [31]:
# pandas read feather
pandas_read_feather_perf(feather_file)

(24058262, 3)
   user_id rating        date
0  1488844      3  2005-09-06
1   822109      5  2005-05-13
2   885013      4  2005-10-19
3    30878      4  2005-12-26
4   823519      3  2004-05-03
framework: pandas ,action: read, format: feather, time: 12.863280831981683


'pandas,read,feather,12.863280831981683'

In [34]:
# arrow read feather
arrow_read_feather_perf(feather_file)

(24058262, 3)
framework: pyarrow ,action: read, format: feather, time: 1.0301093540037982


'pyarrow,read,feather,1.0301093540037982'
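
To make the four read timings easier to compare, the sketch below expresses each one relative to the fastest case (pyarrow reading Feather). The numbers are copied from the outputs above and come from a single run on one machine, so they are indicative only.

```python
# Read timings in seconds, copied from the cell outputs above
timings = {
    ("pandas", "parquet"): 12.230357868014835,
    ("pyarrow", "parquet"): 2.3181647080054972,
    ("pandas", "feather"): 12.863280831981683,
    ("pyarrow", "feather"): 1.0301093540037982,
}

# Use the fastest measurement as the baseline
baseline = min(timings.values())
ratios = {key: round(value / baseline, 2) for key, value in timings.items()}

for (framework, dformat), ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{framework:8s} {dformat:8s} {ratio:6.2f}x")
```

In this run, pyarrow read the Feather file about 2.25 times faster than the Parquet file, while pandas took roughly 12 seconds for either format.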

In [23]:
# compare the size diff
get_size(parquet_file)
get_size(feather_file)

The file size of /home/pengfei/data_set/kaggle/data_format/netflix.parquet is 196.41603565216064 MB
The file size of /home/pengfei/data_set/kaggle/data_format/netflix.feather is 523.9301090240479 MB
