# TL;DR

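- R's `data.table` is faster than Python's `pandas` `DataFrame` on every operation benchmarked here: between roughly 1.1 and 12 times in the summary table at the end of this notebook.
- Note in particular the gap for conditional (boolean) subsetting, which is close to an order of magnitude (about 8x).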
- In addition, although this notebook doesn't show it, `data.table` is also more memory efficient.
- Notes:
    - The benchmarking below is not rigorous, but it is good enough to demonstrate the large differences between the two implementations.
    - All benchmarks were run on an average laptop under Windows 10, but the differences probably hold across operating systems and hardware.
    - The CSV write gap is striking: `pandas` takes roughly ten times as long as `data.table`. The test was run on a Samsung MZVLB512HBJQ SSD; other hardware may have better caching capabilities.
    - I also tried `modin` (`modin==0.10.2` at the time of writing) with both `ray` and `dask`, under both Linux and Windows. It does not seem to pay off out of the box: only `merge` and `csv_write` are faster than `pandas`, and everything else is slower, sometimes dramatically.
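
The notes above concede that the timing is rough. One common way to tighten it, sketched here and not what this notebook actually does, is to repeat each measurement several times and keep the fastest run, which filters out noise from other processes. `time_best` is a hypothetical helper name, and the toy workload stands in for a real call such as reading a CSV file.

```python
import timeit


def time_best(func, number=3, repeat=5):
    """Run `func` `number` times per measurement, take `repeat` measurements,
    and return the smallest total time in seconds."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings)


# Toy workload standing in for e.g. `lambda: pd.read_csv(...)`
best = time_best(lambda: sum(range(10_000)))
print(f"best of 5 measurements (3 calls each): {best:.6f} s")
```

Taking the minimum rather than the mean is a design choice: slow outliers usually reflect interference from the rest of the system, not the code being measured.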

# Setup

We use `rpy2` to run a dual Python/R notebook. Using R, we create a dataframe and save it to disk as a `CSV` file, so that both implementations can retrieve and work on the same data.

In [1]:
import pandas as pd
import modin.pandas as mpd
import ray
import numpy as np
import timeit
import os

%load_ext rpy2.ipython

In [2]:
path = "~/tmp"

In [3]:
%%R
library(data.table)
library(rbenchmark)

R[write to console]: data.table 1.14.0 using 1 threads (see ?getDTthreads).  Latest news: r-datatable.com



In [4]:
%%R
path <- "~/tmp"

N = 1e7  # the total number of rows

a_h <- paste0("a", runif(2))
b_h <- paste0("b", runif(4))
c_h <- paste0("c", runif(1e2))
d_h <- paste0("d", runif(1e5))

dt <- data.table(a=rep(a_h, length.out=N),
                 b=rep(b_h, length.out=N),
                 c=rep(c_h, length.out=N),
                 d=rep(d_h, length.out=N),
                 f=runif(N),
                 g=runif(N))
                 
fwrite(dt, file.path(path, "temp.csv"), quote=TRUE)

In [5]:
%%R
dt_sample_d <- dt[!duplicated(dt[, .(a,b,c,d)])]
setnames(dt_sample_d, c("f","g"), c("ff", "gg"))
fwrite(dt_sample_d, file.path(path, "temp_sample_d.csv"), quote=TRUE)

In [6]:
%%R
print(nrow(dt_sample_d))

[1] 100000
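
Why exactly 100000 unique key combinations? Each key column is built with `rep(..., length.out=N)`, so it cycles with a period equal to the length of its value vector (2, 4, 100 and 100000). The tuple `(a, b, c, d)` therefore repeats with a period equal to the least common multiple of those lengths. The sketch below checks this arithmetic, assuming the `runif` draws within each vector are all distinct (a collision is extremely unlikely). The scaled-down periods in the brute-force check are made up for illustration.

```python
from functools import reduce
from math import gcd


def lcm(x, y):
    """Least common multiple of two positive integers."""
    return x * y // gcd(x, y)


# Lengths of a_h, b_h, c_h and d_h in the R cell above
periods = [2, 4, 100, 100_000]
n_unique = reduce(lcm, periods)
print(n_unique)

# Brute-force check on a scaled-down, hypothetical version of the same construction:
# row i gets value index (i mod period) in each column, exactly like rep(..., length.out=N)
small_periods = [2, 4, 10, 20]
small_rows = {tuple(i % p for p in small_periods) for i in range(1_000)}
print(len(small_rows), reduce(lcm, small_periods))
```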


# CSV read/write

## Read

In [7]:
%%R -o csv_read_r

s <- Sys.time()
for (i in 1:3) 
    fread(file.path(path, "temp.csv"))
e <- Sys.time()
csv_read_r <- difftime(e, s, units="secs")


In [8]:
csv_read_r.item()

15.635901689529419

In [9]:
s = timeit.default_timer()
for i in range(0, 3):
    pd.read_csv(os.path.join(path, "temp.csv"))   
e = timeit.default_timer()
csv_read_py = e-s

In [10]:
csv_read_py

23.68994781697984

In [11]:
s = timeit.default_timer()
for i in range(0, 3):
    mpd.read_csv(os.path.join(path, "temp.csv"))   
e = timeit.default_timer()
csv_read_mpy = e-s



In [12]:
csv_read_mpy

35.80145585400169

## Write

In [13]:
%%R
print(dim(dt))

[1] 10000000        6


In [14]:
%%R -o csv_write_r

s <- Sys.time()
for (i in 1:3) 
    fwrite(dt, file.path(path, "temp.csv"))
e <- Sys.time()
csv_write_r <- difftime(e, s, units="secs")


In [15]:
csv_write_r.item()

12.786469459533691

In [16]:
df = pd.read_csv(os.path.join(path, "temp.csv"))
df.shape

(10000000, 6)

In [17]:
s = timeit.default_timer()
for i in range(0, 3):
    df.to_csv(os.path.join(path, "temp_a.csv"))   
e = timeit.default_timer()
csv_write_py = e-s

In [18]:
csv_write_py

126.7582435059885
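
One caveat on the comparison above: `DataFrame.to_csv` writes the row index as an extra, unnamed first column by default, whereas `fwrite` does not write row names. The pandas loop therefore writes one more column than the R loop. The small sketch below (with a made-up two-row frame) shows the difference; whether passing `index=False` changes the timing materially was not measured in this notebook.

```python
import io

import pandas as pd

# Tiny, made-up frame just to show the output layout
df_small = pd.DataFrame({"a": ["x", "y"], "f": [0.1, 0.2]})

with_index = df_small.to_csv()                # default: index written as first column
without_index = df_small.to_csv(index=False)  # closer to what fwrite produces

print(with_index)
print(without_index)

# Reading the default output back yields the extra index column
round_trip = pd.read_csv(io.StringIO(with_index))
print(round_trip.columns.tolist())
```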

In [19]:
mdf = mpd.read_csv(os.path.join(path, "temp.csv"))

In [20]:
s = timeit.default_timer()
for i in range(0, 3):
    mdf.to_csv(os.path.join(path, "temp_a.csv"))   
e = timeit.default_timer()
csv_write_mpy = e-s

In [21]:
csv_write_mpy

31.90112459100783

# Subsetting

## Boolean condition

In [22]:
%%R -o subset_r
afind = dt$a[1]
bfind = dt$b[1]

s <- Sys.time()
for (i in 1:10)
    dt[a==afind & b==bfind]
e <- Sys.time()
subset_r <- difftime(e, s, units="secs")

In [23]:
subset_r.item()

1.4641764163970947

In [24]:
afind = df.a[0]
bfind = df.b[0]

s = timeit.default_timer()
for i in range(0, 10):
    df[(df.a==afind) & (df.b==bfind)]
e = timeit.default_timer()
subset_py = e-s

In [25]:
subset_py

12.235067892004736

In [26]:
afind = mdf.a[0]
bfind = mdf.b[0]

s = timeit.default_timer()
for i in range(0, 10):
    mdf[(mdf.a==afind) & (mdf.b==bfind)]
e = timeit.default_timer()
subset_mpy = e-s

In [27]:
subset_mpy

19.219890233012848

## Integer index

In [28]:
%%R -o subset_int_r
idx <- seq(1, nrow(dt), by=2)

s <- Sys.time()
for (i in 1:10) 
    dt[idx]
e <- Sys.time()
subset_int_r <- difftime(e, s, units="secs")

In [29]:
subset_int_r

array([2.38951945])

In [30]:
idx = np.arange(0, df.shape[0], step=2)
s = timeit.default_timer()
for i in range(0,10):
    df.iloc[idx]
e = timeit.default_timer()
subset_int_py = e-s

In [31]:
subset_int_py

3.26325211499352

In [32]:
idx = np.arange(0, mdf.shape[0], step=2)
s = timeit.default_timer()
for i in range(0,10):
    mdf.iloc[idx]
e = timeit.default_timer()
subset_int_mpy = e-s

In [33]:
subset_int_mpy

26.182289513002615

# Merge


In [34]:
%%R -o merge_r

s <- Sys.time()
for (i in 1:10)
    merge(dt, dt_sample_d, by=c("a","b","c","d"))
e <- Sys.time()
merge_r <- difftime(e, s, units="secs")

In [35]:
merge_r.item()

36.40645885467529

In [36]:
df_sample_d = pd.read_csv(os.path.join(path, "temp_sample_d.csv"))

s = timeit.default_timer()
for i in range(0,10):
    df.merge(df_sample_d, on=["a","b","c","d"])
e = timeit.default_timer()
merge_py = e-s

In [37]:
merge_py

38.71329345900449

In [38]:
mdf_sample_d = mpd.read_csv(os.path.join(path, "temp_sample_d.csv"))

s = timeit.default_timer()
for i in range(0,10):
    mdf.merge(mdf_sample_d, on=["a","b","c","d"])
e = timeit.default_timer()
merge_mpy = e-s

In [39]:
merge_mpy

29.119130103994394

# Groupby

## Built-in `mean`

In [40]:
%%R -o groupby_builtin_r

s <- Sys.time()
for (i in 1:10) 
    dt[, list(mean(f), mean(g)), by=.(a,b,c,d)]
e <- Sys.time()
groupby_builtin_r <- difftime(e, s, units="secs")

In [41]:
groupby_builtin_r.item()

11.856996059417725

In [42]:
s = timeit.default_timer()
for i in range(0, 10):
    df.groupby(['a', 'b', 'c', 'd']).mean()
e = timeit.default_timer()
groupby_builtin_py = e-s

In [43]:
groupby_builtin_py

39.23730262098252

In [44]:
s = timeit.default_timer()
for i in range(0, 10):
    mdf.groupby(['a', 'b', 'c', 'd']).mean()
e = timeit.default_timer()
groupby_builtin_mpy = e-s

In [45]:
groupby_builtin_mpy

99.87527405799483

## Custom function

In [46]:
%%R -o groupby_custom_r

get_nrows <- function(x) nrow(x)

s <- Sys.time()
for (i in 1:10)
    dt[, list(N=get_nrows(.SD)), by=.(a,b,c,d)]
e <- Sys.time()
groupby_custom_r <- difftime(e, s, units="secs")

In [47]:
groupby_custom_r.item()

11.829243183135986

In [48]:
def get_nrows(x):
    return x.shape[0]

s = timeit.default_timer()
for i in range(0, 10):
    df.groupby(['a', 'b', 'c', 'd']).apply(get_nrows)
e = timeit.default_timer()
groupby_custom_py = e-s

In [49]:
groupby_custom_py

96.0158728910028
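
The custom function is used here on purpose, to measure the cost of calling user code once per group. For reference, counting rows per group also has a built-in in pandas, `groupby(...).size()`, which avoids the per-group Python call. The sketch below only checks on a made-up five-row frame that both give the same counts; it was not timed as part of this benchmark.

```python
import pandas as pd

# Made-up data with two key columns instead of four
df_small = pd.DataFrame({
    "a": ["x", "x", "y", "y", "y"],
    "b": ["p", "q", "p", "p", "q"],
    "g": [1.0, 2.0, 3.0, 4.0, 5.0],
})


def get_nrows(x):
    return x.shape[0]


via_apply = df_small.groupby(["a", "b"]).apply(get_nrows)  # Python call per group
via_size = df_small.groupby(["a", "b"]).size()             # built-in group sizes

print(via_apply.tolist())
print(via_size.tolist())
```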

In [50]:
def get_nrows(x):
    return x.shape[0]

s = timeit.default_timer()
for i in range(0, 10):
    mdf.groupby(['a', 'b', 'c', 'd']).apply(get_nrows)
e = timeit.default_timer()
groupby_custom_mpy = e-s

In [51]:
groupby_custom_mpy

158.1017251239973

## Custom function multiple variables

In [52]:
%%R -o groupby_custom_r_2

get_nrows <- function(x) nrow(x)

s <- Sys.time()
for (i in 1:10)
    dt[, list(mean=mean(g), N=get_nrows(.SD)), by=.(a,b,c,d)]
e <- Sys.time()
groupby_custom_r_2 <- difftime(e, s, units="secs")

In [53]:
groupby_custom_r_2.item()

14.819897174835205

In [54]:
def get_nrows(x):
    return {"mean": np.mean(x.g), "N": x.shape[0]}

s = timeit.default_timer()
for i in range(0, 10):
    df.groupby(['a', 'b', 'c', 'd']).apply(get_nrows)
e = timeit.default_timer()
groupby_custom_py_2 = e-s

In [55]:
groupby_custom_py_2

180.62255078699673
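
Note that returning a dict from `apply` gives one dict object per group, not a two-column table like the `data.table` result. A hedged alternative, not benchmarked here, is pandas' named aggregation, which produces one column per statistic using built-in reducers. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up data with two key columns instead of four
df_small = pd.DataFrame({
    "a": ["x", "x", "y", "y", "y"],
    "b": ["p", "q", "p", "p", "q"],
    "g": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Named aggregation: output column name = (input column, reducer)
out = df_small.groupby(["a", "b"]).agg(
    mean=("g", "mean"),  # mean of g per group
    N=("g", "size"),     # number of rows per group
)
print(out)
```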

In [56]:
def get_nrows(x):
    return {"mean": np.mean(x.g), "N": x.shape[0]}

s = timeit.default_timer()
for i in range(0, 10):
    mdf.groupby(['a', 'b', 'c', 'd']).apply(get_nrows)
e = timeit.default_timer()
groupby_custom_mpy_2 = e-s

In [57]:
groupby_custom_mpy_2

245.4137103749963

# Unique

## Elements of a column

In [58]:
%%R -o unique_r

s <- Sys.time()
for (i in 1:10)
    unique(dt$c)
e <- Sys.time()
unique_r <- difftime(e, s, units="secs")

In [59]:
unique_r.item()

1.7795279026031494

In [60]:
s = timeit.default_timer()
for i in range(0, 10):
    df.c.unique()
e = timeit.default_timer()
unique_py = e-s

In [61]:
unique_py

5.20910850900691

In [62]:
s = timeit.default_timer()
for i in range(0, 10):
    mdf.c.unique()
e = timeit.default_timer()
unique_mpy = e-s

In [63]:
unique_mpy

11.413826890988275

## Rows

In [64]:
%%R -o unique_row_r

s <- Sys.time()
for (i in 1:10)
    unique(dt, by=c("a","b","c","d"))
e <- Sys.time()
unique_row_r <- difftime(e, s, units="secs")

In [65]:
unique_row_r.item()

8.304207563400269

In [66]:
s = timeit.default_timer()
for i in range(0, 10):
    df.drop_duplicates(subset=["a","b","c","d"])
e = timeit.default_timer()
unique_row_py = e-s

In [67]:
unique_row_py

32.04370676301187

In [68]:
s = timeit.default_timer()
for i in range(0, 10):
    mdf.drop_duplicates(subset=["a","b","c","d"])
e = timeit.default_timer()
unique_row_mpy = e-s

In [69]:
unique_row_mpy

444.421146587003

# Summary

In [70]:
summary = pd.DataFrame({"Operation": ["csv read", "csv write", 
                                      "bool subset", "int subset", 
                                      "merge",
                                      "groupby builtin", "groupby custom", "groupby custom multiple",
                                      "unique within col", "unique rows"], 
                 "pandas": [csv_read_py, csv_write_py, 
                            subset_py, subset_int_py,
                            merge_py,
                            groupby_builtin_py, groupby_custom_py, groupby_custom_py_2,
                            unique_py, unique_row_py], 
                 "modin": [csv_read_mpy, csv_write_mpy, 
                            subset_mpy, subset_int_mpy,
                            merge_mpy,
                            groupby_builtin_mpy, groupby_custom_mpy, groupby_custom_mpy_2,
                            unique_mpy, unique_row_mpy], 
                 "data.table": [csv_read_r, csv_write_r, 
                                subset_r, subset_int_r,
                                merge_r,
                                groupby_builtin_r, groupby_custom_r, groupby_custom_r_2, 
                                unique_r, unique_row_r]})

In [71]:
summary["data.table"] = summary["data.table"].apply(lambda x: x.item())

In [72]:
summary["modin speedup"] = summary.pandas / summary.modin
summary["data.table speedup"] = summary.pandas / summary["data.table"]

In [73]:
# total times in seconds (3 iterations for CSV read/write, 10 for everything else):
summary

| Operation | pandas (s) | modin (s) | data.table (s) | modin speedup | data.table speedup |
|---|---|---|---|---|---|
| csv read | 23.689948 | 35.801456 | 15.635902 | 0.661703 | 1.5151 |
| csv write | 126.758244 | 31.901125 | 12.786469 | 3.973473 | 9.913467 |
| bool subset | 12.235068 | 19.21989 | 1.464176 | 0.636584 | 8.35628 |
| int subset | 3.263252 | 26.18229 | 2.389519 | 0.124636 | 1.365652 |
| merge | 38.713293 | 29.11913 | 36.406459 | 1.32948 | 1.063363 |
| groupby builtin | 39.237303 | 99.875274 | 11.856996 | 0.392863 | 3.309211 |
| groupby custom | 96.015873 | 158.101725 | 11.829243 | 0.607304 | 8.116823 |
| groupby custom multiple | 180.622551 | 245.41371 | 14.819897 | 0.735992 | 12.187841 |
| unique within col | 5.209109 | 11.413827 | 1.779528 | 0.456386 | 2.927242 |
| unique rows | 32.043707 | 444.421147 | 8.304208 | 0.072102 | 3.858731 |
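
The times above are totals over each benchmark loop, so they are not directly comparable across rows: the CSV loops run 3 times and all other loops run 10 times. The sketch below, using the pandas and `data.table` totals copied from the summary, derives per-call times and recomputes the speedup as pandas time divided by `data.table` time.

```python
# (operation, loop iterations, pandas total s, data.table total s), copied from the summary
rows = [
    ("csv read", 3, 23.689948, 15.635902),
    ("csv write", 3, 126.758244, 12.786469),
    ("bool subset", 10, 12.235068, 1.464176),
    ("int subset", 10, 3.263252, 2.389519),
    ("merge", 10, 38.713293, 36.406459),
    ("groupby builtin", 10, 39.237303, 11.856996),
    ("groupby custom", 10, 96.015873, 11.829243),
    ("groupby custom multiple", 10, 180.622551, 14.819897),
    ("unique within col", 10, 5.209109, 1.779528),
    ("unique rows", 10, 32.043707, 8.304208),
]

# Speedup > 1 means data.table is faster than pandas
speedups = {op: pandas_s / dt_s for op, _, pandas_s, dt_s in rows}

for op, reps, pandas_s, dt_s in rows:
    print(f"{op:<24} pandas {pandas_s / reps:7.3f} s/call   "
          f"data.table {dt_s / reps:7.3f} s/call   speedup {speedups[op]:5.2f}x")

print("smallest gap:", min(speedups, key=speedups.get))
print("largest gap: ", max(speedups, key=speedups.get))
```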
