# Requirements

In [1]:
import numpy as np
import pandas as pd
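
The Parquet and Feather cells further down need an optional engine on top of pandas: `to_parquet` relies on either `pyarrow` or `fastparquet`, and `to_feather` relies on `pyarrow`. A minimal sketch of a check that can be run up front follows; the variable names are only illustrative.

```python
import importlib.util

# find_spec returns None when a package cannot be imported,
# so this checks availability without actually importing anything.
has_pyarrow = importlib.util.find_spec('pyarrow') is not None
has_fastparquet = importlib.util.find_spec('fastparquet') is not None

print(f'pyarrow: {has_pyarrow}, fastparquet: {has_fastparquet}')
```

If both are reported as `False`, the Parquet and Feather cells will fail with an import error.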

# Data set

For benchmarking, we create a dataframe with a size of a few hundred MB.

In [5]:
nr_rows = 10_000_000
df = pd.DataFrame({
    'A': np.random.normal(size=(nr_rows, )),
    'B': np.random.randint(1, high=5, size=(nr_rows, )),
    'C': np.random.normal(size=(nr_rows, )),
})

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       int64  
 2   C       float64
dtypes: float64(2), int64(1)
memory usage: 228.9 MB
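
The reported memory usage can be checked by hand: three columns of 8-byte values times ten million rows. The sketch below redoes that arithmetic and cross-checks it on a much smaller dataframe. It assumes that pandas reports sizes in units of 1024 bytes, which is why 240 million bytes show up as 228.9 MB; `small_df` and the other names are illustrative.

```python
import numpy as np
import pandas as pd

nr_rows = 10_000_000
nr_columns = 3
bytes_per_value = 8  # float64 and int64 both take 8 bytes

# Expected size of the column data, converted with 2**20 bytes per unit.
expected_bytes = nr_rows * nr_columns * bytes_per_value
expected_mib = expected_bytes / 2**20
print(f'{expected_mib:.1f}')  # 228.9

# Cross-check the per-value size on a small dataframe with the same dtypes.
small_rows = 1_000
small_df = pd.DataFrame({
    'A': np.zeros(small_rows, dtype=np.float64),
    'B': np.ones(small_rows, dtype=np.int64),
    'C': np.zeros(small_rows, dtype=np.float64),
})
data_bytes = small_df.memory_usage(index=False).sum()
print(data_bytes == small_rows * nr_columns * bytes_per_value)
```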


# Formats

## CSV

CSV has the advantage of being human-readable, but it is neither fast nor compact.

In [14]:
%timeit df.to_csv('data.csv')

27.3 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
%timeit pd.read_csv('data.csv')

2.71 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
!ls -hl data.csv

-rw-r--r-- 1 gjb gjb 469M Sep 14 14:44 data.csv
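
One thing to keep in mind with the CSV round trip: `to_csv` writes the index as an unnamed first column by default, so the `to_csv` call above stores it as well, and `read_csv` returns it as an extra `Unnamed: 0` column. The sketch below shows this on a tiny, made-up dataframe using an in-memory buffer instead of a file. Passing `index=False` avoids the extra column.

```python
import io

import pandas as pd

# Tiny stand-in dataframe with the same column names as the benchmark data.
small_df = pd.DataFrame({'A': [0.5, -1.25], 'B': [1, 4], 'C': [2.0, 3.5]})

# Default behaviour: the index is written as an unnamed first column.
buffer = io.StringIO()
small_df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(list(with_index.columns))  # ['Unnamed: 0', 'A', 'B', 'C']

# With index=False only the data columns are written.
buffer = io.StringIO()
small_df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)
print(list(without_index.columns))  # ['A', 'B', 'C']
```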


## Parquet

Parquet is a binary column-store format that performs significantly better than CSV: in the timings below, writing is roughly 57 times faster and reading roughly 17 times faster.

In [9]:
%timeit df.to_parquet('data.parquet')

477 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [13]:
%timeit pd.read_parquet('data.parquet')

162 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
!ls -hl data.parquet

-rw-r--r-- 1 gjb gjb 156M Sep 14 14:37 data.parquet


Parquet files are also more compact than their CSV counterparts: here, 156 MB versus 469 MB, about a third of the size.
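
Because Parquet stores data column by column, a reader can load only the columns it needs. The sketch below illustrates this with the `columns` argument of `pd.read_parquet` on a small stand-in dataframe. It assumes `pyarrow` is the installed engine and skips the example otherwise; the file name and variable names are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-in for the benchmark dataframe (same columns, far fewer rows).
rng = np.random.default_rng(1234)
nr_rows = 1_000
small_df = pd.DataFrame({
    'A': rng.normal(size=nr_rows),
    'B': rng.integers(1, 5, size=nr_rows),
    'C': rng.normal(size=nr_rows),
})

columns_read = None
if importlib.util.find_spec('pyarrow') is not None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / 'small.parquet'
        small_df.to_parquet(path)
        # Only columns A and B are loaded; column C is not returned.
        subset = pd.read_parquet(path, columns=['A', 'B'])
    columns_read = list(subset.columns)
    print(columns_read)
else:
    print('pyarrow not installed, skipping the Parquet example')
```

For wide tables where only a few columns are needed, this can reduce both read time and memory use.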

## Feather

Feather is a binary columnar format built on Apache Arrow. In the timings below, it performs on par with Parquet for writing and is slightly slower for reading, with a slightly larger file.

In [17]:
%timeit df.to_feather('data.feather')

473 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%timeit pd.read_feather('data.feather')

201 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
!ls -hl data.feather

-rw-r--r-- 1 gjb gjb 175M Sep 14 14:56 data.feather
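
To put the three formats side by side, the timings and file sizes reported above can be turned into ratios relative to CSV. The numbers in the sketch below are copied from the cell outputs in this notebook; the dictionary layout and the names are only for illustration.

```python
# Measurements copied from the cell outputs above:
# write and read times in seconds, file size in MB as reported by ls -h.
results = {
    'csv':     {'write': 27.3,  'read': 2.71,  'size': 469},
    'parquet': {'write': 0.477, 'read': 0.162, 'size': 156},
    'feather': {'write': 0.473, 'read': 0.201, 'size': 175},
}

# Ratio of the CSV value to each format's value: larger means
# faster (for write and read) or smaller on disk (for size) than CSV.
baseline = results['csv']
ratios = {
    fmt: {key: baseline[key] / value for key, value in measured.items()}
    for fmt, measured in results.items()
}

for fmt, ratio in ratios.items():
    print(f"{fmt:8s} write x{ratio['write']:.1f}  "
          f"read x{ratio['read']:.1f}  size x{ratio['size']:.1f}")
```

For this data set, both binary formats write more than 50 times faster than CSV and read more than 10 times faster, at well under half the file size.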
