# 01-polars-vs-pandas

This notebook evaluates [polars](https://github.com/pola-rs/polars) against the more established pandas by timing common DataFrame operations on a 10-million-row dataset.

# 1. Env

In [1]:
import pandas as pd
import polars as pl
import numpy as np
import os

# 2. DataFrame Creation

In [2]:
length = 10000000
a = np.random.random(length)
b = np.random.randint(0, 5, size=length)
c = np.random.randint(0, 10, size=length)

In [3]:
fruit_list = ["banana", "apple", "cherry", "pineapple", "blueberry"]
fruit_weights = 5 + 4 * np.random.random(len(fruit_list))
fruits = np.random.choice(fruit_list, length)

In [4]:
%%time
pd_df = pd.DataFrame(
    {
        "a": a,
        "b": b,
        "c": c,
        "fruit": fruits
    }
)

pd_fruit_weights = pd.DataFrame(
    {
        "fruit": fruit_list,
        "fruit_weight": fruit_weights
    }
)

CPU times: user 1.26 s, sys: 194 ms, total: 1.45 s
Wall time: 1.45 s


In [5]:
%%time
pl_df = pl.DataFrame(
    {
        "a": a,
        "b": b,
        "c": c,
        "fruit": fruits
    }
)

pl_fruit_weights = pl.DataFrame(
    {
        "fruit": fruit_list,
        "fruit_weight": fruit_weights
    }
)

CPU times: user 3.64 s, sys: 678 ms, total: 4.32 s
Wall time: 4.33 s
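
Each `%%time` figure in this notebook is a single measurement, so one slow run can skew a comparison. As a minimal sketch of a more robust approach, the standard-library `timeit` module can repeat a call and keep the fastest run. The `build` function below is a hypothetical stand-in (a plain dict of lists), not a real DataFrame constructor.

```python
import random
import timeit


def build(n: int = 10_000) -> dict:
    # Hypothetical stand-in for a DataFrame constructor call
    return {"a": [random.random() for _ in range(n)]}


# Call build() once per measurement, take five measurements, keep the fastest
runs = timeit.repeat(build, number=1, repeat=5)
best = min(runs)
print(f"best of {len(runs)}: {best * 1e3:.2f} ms")
```

The minimum is used rather than the mean because slower runs usually reflect interference from other processes rather than the code being measured.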


# 3. Simple Aggregates

#### `sum`

In [6]:
%%time
pd_df[["a", "b", "c"]].sum()

CPU times: user 100 ms, sys: 41.9 ms, total: 142 ms
Wall time: 141 ms


a    5.000375e+06
b    1.999943e+07
c    4.499293e+07
dtype: float64

In [7]:
%%time
pl_df[["a", "b", "c"]].sum()

CPU times: user 37.3 ms, sys: 1.02 ms, total: 38.4 ms
Wall time: 13.1 ms


a,b,c
f64,i64,i64
5000374.983470603,19999429,44992931


#### `mean`

In [8]:
%%time
pd_df[["a", "b", "c"]].mean()

CPU times: user 86.1 ms, sys: 12 ms, total: 98.1 ms
Wall time: 96.7 ms


a    0.500037
b    1.999943
c    4.499293
dtype: float64

In [9]:
%%time
pl_df[["a", "b", "c"]].mean()

CPU times: user 17.2 ms, sys: 332 µs, total: 17.5 ms
Wall time: 6.09 ms


a,b,c
f64,f64,f64
0.5000374983470602,1.9999429,4.4992931


# 4. Groupby

#### one aggregate

In [10]:
%%time
(pd_df
 .groupby("fruit")
 ["a"]
 .mean())

CPU times: user 494 ms, sys: 33.8 ms, total: 528 ms
Wall time: 528 ms


fruit
apple        0.500161
banana       0.500117
blueberry    0.499939
cherry       0.500026
pineapple    0.499945
Name: a, dtype: float64

In [11]:
%%time
(pl_df
 .groupby("fruit")  # renamed to group_by in later Polars releases
 .agg(pl.mean("a"))
)

CPU times: user 536 ms, sys: 143 ms, total: 679 ms
Wall time: 107 ms


fruit,a_mean
str,f64
"""cherry""",0.5000256109200087
"""blueberry""",0.4999387548617152
"""banana""",0.5001171123584991
"""pineapple""",0.4999450227412526
"""apple""",0.5001610779004338


#### multiple aggregates

In [12]:
%%time
(pd_df
 .groupby("fruit")
 .mean())

CPU times: user 630 ms, sys: 97.5 ms, total: 727 ms
Wall time: 737 ms


fruit,a,b,c
apple,0.500161,1.999102,4.496541
banana,0.500117,1.999839,4.501173
blueberry,0.499939,2.001447,4.49898
cherry,0.500026,2.001221,4.497767
pineapple,0.499945,1.998104,4.502006


In [13]:
%%time
(pl_df
 .groupby("fruit")  # renamed to group_by in later Polars releases
 .agg(pl.mean("*"))
)

CPU times: user 580 ms, sys: 43.7 ms, total: 624 ms
Wall time: 95.4 ms


fruit,a_mean,b_mean,c_mean
str,f64,f64,f64
"""banana""",0.5001171123584991,1.9998389246167203,4.501173048986926
"""pineapple""",0.4999450227412526,1.9981042838883636,4.502005728653066
"""apple""",0.5001610779004338,1.9991017714008212,4.496540869651327
"""blueberry""",0.4999387548617152,2.0014466846227523,4.498980222311536
"""cherry""",0.5000256109200087,2.0012212446601776,4.497767131029458


# 5. Join

In [14]:
%%time
(pd_df
 .merge(pd_fruit_weights, how="left", on="fruit")
)

CPU times: user 1.22 s, sys: 98 ms, total: 1.32 s
Wall time: 1.32 s


,a,b,c,fruit,fruit_weight
0,0.739783,1,1,pineapple,6.627421
1,0.419434,2,9,apple,8.327433
2,0.666899,1,6,pineapple,6.627421
3,0.851892,3,1,pineapple,6.627421
4,0.640520,2,3,cherry,6.192603
...,...,...,...,...,...
9999995,0.835908,1,8,blueberry,5.738635
9999996,0.436029,1,3,apple,8.327433
9999997,0.026649,4,4,cherry,6.192603
9999998,0.595930,1,4,pineapple,6.627421


In [15]:
%%time
(pl_df
 .join(pl_fruit_weights, how="left", on="fruit")  # match the pandas left join; Polars defaults to an inner join
)

CPU times: user 1.02 s, sys: 199 ms, total: 1.22 s
Wall time: 320 ms


a,b,c,fruit,fruit_weight
f64,i64,i64,str,f64
0.7397829757695302,1,1,"""pineapple""",6.627420597487596
0.41943435393901574,2,9,"""apple""",8.327432764116928
0.6668992640680729,1,6,"""pineapple""",6.627420597487596
0.8518916839688154,3,1,"""pineapple""",6.627420597487596
0.6405196843983809,2,3,"""cherry""",6.192602973553914
0.4567631253979375,2,4,"""cherry""",6.192602973553914
0.056020224996554346,4,2,"""blueberry""",5.738634609368556
0.5252347182772842,1,4,"""blueberry""",5.738634609368556
0.08609717187320609,1,1,"""apple""",8.327432764116928
0.20010226314697177,2,3,"""cherry""",6.192602973553914


# 6. I/O

#### `to_csv`

In [16]:
%%time
pd_df.to_csv('./pd_df.csv')  # also writes the index as an extra column; pass index=False to skip it

CPU times: user 21.4 s, sys: 752 ms, total: 22.2 s
Wall time: 22.3 s


In [17]:
%%time
pl_df.to_csv('./pl_df.csv')  # renamed to write_csv in later Polars releases

CPU times: user 1.38 s, sys: 328 ms, total: 1.71 s
Wall time: 1.71 s


#### `read_csv`

In [18]:
%%time
_ = pd.read_csv('./pd_df.csv')

CPU times: user 2.47 s, sys: 348 ms, total: 2.82 s
Wall time: 2.84 s


In [19]:
%%time
_ = pl.read_csv('./pl_df.csv')

CPU times: user 2.91 s, sys: 484 ms, total: 3.39 s
Wall time: 714 ms


#### `to_parquet`

In [20]:
%%time
pd_df.to_parquet('./pd_df.parquet')

CPU times: user 1.16 s, sys: 96.9 ms, total: 1.25 s
Wall time: 1.24 s


In [21]:
%%time
pl_df.to_parquet('./pl_df.parquet')  # renamed to write_parquet in later Polars releases

CPU times: user 769 ms, sys: 27.4 ms, total: 797 ms
Wall time: 800 ms


#### `read_parquet`

In [22]:
%%time
_ = pd.read_parquet('./pd_df.parquet')

CPU times: user 745 ms, sys: 250 ms, total: 995 ms
Wall time: 674 ms


In [23]:
%%time
_ = pl.read_parquet('./pl_df.parquet')

CPU times: user 640 ms, sys: 165 ms, total: 805 ms
Wall time: 645 ms


#### clean up directory of I/O files

In [24]:
for file in ["./pd_df.csv", "./pl_df.csv", "./pd_df.parquet", "./pl_df.parquet"]:
    os.remove(file)
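
`os.remove` raises `FileNotFoundError` if one of the write cells was skipped. A minimal sketch of a more forgiving cleanup, assuming Python 3.8+ for `Path.unlink(missing_ok=True)`. The `workdir` scratch directory is hypothetical and stands in for the notebook's working directory.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch directory standing in for the notebook's working directory
workdir = Path(tempfile.mkdtemp())
(workdir / "pd_df.csv").write_text("a,b\n1,2\n")  # only one of the four files exists

for name in ["pd_df.csv", "pl_df.csv", "pd_df.parquet", "pl_df.parquet"]:
    # missing_ok=True skips files that were never written instead of raising
    (workdir / name).unlink(missing_ok=True)

remaining = list(workdir.iterdir())  # files left behind after the cleanup loop
workdir.rmdir()                      # remove the now-empty scratch directory
```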

# 7. String Methods

#### slice

In [25]:
%%time
pd_df["fruit"].str[1:5]  # the slice bounds are the start and stop indices

CPU times: user 2.06 s, sys: 165 ms, total: 2.23 s
Wall time: 2.24 s


0          inea
1          pple
2          inea
3          inea
4          herr
           ... 
9999995    lueb
9999996    pple
9999997    herr
9999998    inea
9999999    herr
Name: fruit, Length: 10000000, dtype: object

In [26]:
%%time
pl_df["fruit"].str.slice(1, 4)  # the arguments are the start index and the length

CPU times: user 155 ms, sys: 29.4 ms, total: 185 ms
Wall time: 186 ms


shape: (10000000,)
Series: 'fruit' [str]
[
	"inea"
	"pple"
	"inea"
	"inea"
	"herr"
	"herr"
	"lueb"
	"lueb"
	"pple"
	"herr"
	"herr"
	"anan"
	...
	"lueb"
	"herr"
	"lueb"
	"pple"
	"pple"
	"anan"
	"pple"
	"lueb"
	"pple"
	"herr"
	"inea"
	"herr"
]
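
The two slice calls take different arguments: pandas uses start and stop indices, while Polars uses a start index and a length. The sketch below shows the conversion in plain Python, without either library. `stop_to_length` is a hypothetical helper introduced only for illustration, and the offset-and-length form is emulated with ordinary string slicing.

```python
def stop_to_length(start: int, stop: int) -> int:
    """Convert a [start:stop] slice into the length used by an (offset, length) API."""
    if start < 0 or stop < start:
        raise ValueError("this sketch only handles 0 <= start <= stop")
    return stop - start


fruit = "pineapple"
start, stop = 1, 5

# pandas-style: start and stop indices
python_slice = fruit[start:stop]

# Polars-style: start index and length, emulated with plain Python slicing
length = stop_to_length(start, stop)
offset_length_slice = fruit[start:start + length]

print(length)               # 4
print(python_slice)         # inea
print(offset_length_slice)  # inea
```

This is why `str[1:5]` in pandas and `str.slice(1, 4)` in Polars return the same four characters.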

# 8. Conclusion

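In these benchmarks, Polars supported the same basic operations as pandas (aggregates, groupby, joins, CSV and Parquet I/O, and string methods) and ran roughly 4 to 16 times faster, by wall time, on aggregates, groupby, joins, CSV I/O, and string slicing. Two exceptions stand out: Parquet I/O was comparable (about 1 to 1.5 times faster in Polars), and DataFrame creation from NumPy arrays was about 3 times slower in Polars (4.33 s vs. 1.45 s). Polars also offers easy interaction with cloud providers, although that is not exercised here. The two libraries are commensurately expressive, with some differences; for example, Polars's DataFrame output displays the data type of each column explicitly.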
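
The wall times recorded throughout this notebook can be turned into speedup ratios directly. The sketch below copies the `Wall time` values from the `%%time` outputs above; the dictionary layout and variable names are illustrative only. A ratio above 1 means Polars was faster.

```python
# Wall times in seconds, copied from the %%time outputs: (pandas, polars)
wall_times = {
    "DataFrame creation": (1.45, 4.33),
    "sum": (0.141, 0.0131),
    "mean": (0.0967, 0.00609),
    "groupby, one aggregate": (0.528, 0.107),
    "groupby, multiple aggregates": (0.737, 0.0954),
    "join": (1.32, 0.320),
    "to_csv": (22.3, 1.71),
    "read_csv": (2.84, 0.714),
    "to_parquet": (1.24, 0.800),
    "read_parquet": (0.674, 0.645),
    "string slice": (2.24, 0.186),
}

# Speedup = pandas wall time divided by Polars wall time
speedups = {
    name: pandas_s / polars_s
    for name, (pandas_s, polars_s) in wall_times.items()
}

for name, ratio in speedups.items():
    print(f"{name:<30} {ratio:5.1f}x")
```

Since each timing is a single run, these ratios are indicative rather than precise.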