# Polars vs Pandas

## What is Polars

[Polars](https://pola.rs/) is a library written in Rust designed to do efficient data manipulation on a single machine.

How is it different from Pandas?

One of the key differences is that Pandas is built on top of Python libraries, NumPy in particular. While the NumPy core is written in C, it is optimized mostly for numeric types, not for string-like data such as categorical data.
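To make the point about strings concrete, here is a small sketch that is not part of the benchmark (the sample values are arbitrary). It compares how NumPy holds floats with how it holds Python strings when they are stored as generic objects, which is the representation pandas has traditionally used for text.

```python
import numpy as np

# Numeric data: one contiguous buffer of fixed-width values.
floats = np.array([1.0, 2.5, 3.0])
print(floats.dtype, floats.itemsize)  # dtype and bytes per element

# Strings stored as generic objects: the buffer holds references to
# separate Python str objects, not the characters themselves.
strings = np.array(["wes", "", "polars"], dtype=object)
print(strings.dtype, type(strings[0]))
```

Vectorized numeric kernels can work directly on the first buffer, while operations on the second have to follow each reference back to a Python object.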

In [None]:
# imports

import time
import numpy as np
import pandas as pd
import polars as pl

from dataclasses import dataclass

In [None]:
# Create random dataframe: 1M rows, 2 columns are string, 3 are float / int

@dataclass
class Column:
    name: str
    type: type
    length: int

def random_data_set(columns: list[Column]):
    rng = np.random.default_rng()
    letters = list("abcdefghijklmnopqrstuvwxyz")

    data = {}
    for column in columns:
        col_len = column.length
        match column.type.__name__:
            case str.__name__:
                data[column.name] = ["".join(rng.choice(letters, 10)) for _ in range(col_len)]
            case float.__name__:
                data[column.name] = list(map(float, rng.standard_normal(size=col_len)))
            case int.__name__:
                data[column.name] = list(map(int, rng.integers(10_000, size=col_len)))
            case _:
                raise ValueError(f"Unsupported column type: {column.type!r}")

    return data

n = 1_000_000
columns = [
    Column("col_1", str, n),
    Column("col_2", float, n),
    Column("col_3", str, n),
    Column("col_4", float, n),
    Column("col_5", int, n),
]
# data = random_data_set(columns=columns)

In [256]:
# time_it

@dataclass
class Time:
    wall_time: float
    cpu_time: float

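def time_it(n_times: int):
    """Run the decorated function n_times; report the total (not per-call) time."""
    def _time_it(func):
        def wrapped(*args, **kwargs):
            # perf_counter is monotonic, so it is safer than time.time() for intervals
            start, p_start = time.perf_counter(), time.process_time()
            for _ in range(n_times):
                res = func(*args, **kwargs)
            end, p_end = time.perf_counter(), time.process_time()

            return res, Time(end-start, p_end-p_start)

        return wrapped
    return _time_it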

In [253]:
# timing pretty-print
def print_timing(res: tuple):
    *ret, execution_time = res
    print(f"CPU Execution time: {execution_time} seconds")
    return *ret,

In [None]:
# define the dataset to load
file_name = "data.csv"

In [268]:
# read a large csv in Polars, 10 times
@time_it(10)
def pl_read_csv(path):
    return pl.read_csv(path)

pl_df = print_timing(pl_read_csv(file_name))[0]

CPU Execution time: Time(wall_time=0.5726299285888672, cpu_time=5.2285115000000815) seconds


In [286]:
# read a large csv in "naive" pandas, 10 times
@time_it(10)
def pd_read_csv(path):
    return pd.read_csv(path)

pd_df_numpy_backend = print_timing(pd_read_csv(file_name))[0]

CPU Execution time: Time(wall_time=15.525216579437256, cpu_time=15.494220399999904) seconds


Is that the whole story? Nope! Since version 2.0, pandas supports (Py)Arrow as a dtype backend as well. The "pyarrow" CSV engine also supports multithreading, which the other engines do not.

In [287]:
# read a large csv in pandas - try again..
@time_it(10)
def pd_read_csv(path):
    return pd.read_csv(path, engine="pyarrow", dtype_backend='pyarrow')

pd_df = print_timing(pd_read_csv(file_name))[0]

CPU Execution time: Time(wall_time=0.684666633605957, cpu_time=6.8203726000000415) seconds
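The readings above contain both a wall-clock time and a CPU time, and the two can differ a lot. The sketch below, which uses only the standard library and a hypothetical `measure` helper, shows the two clocks diverging in the simplest case: sleeping advances the wall clock but consumes almost no CPU.

```python
import time

def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of func."""
    w0, c0 = time.perf_counter(), time.process_time()
    func()
    return time.perf_counter() - w0, time.process_time() - c0

# Sleeping takes wall-clock time but almost no CPU time.
sleep_wall, sleep_cpu = measure(lambda: time.sleep(0.2))

# A single-threaded busy loop keeps one core working the whole time.
busy_wall, busy_cpu = measure(lambda: sum(i * i for i in range(2_000_000)))

print(f"sleep: wall={sleep_wall:.3f}s cpu={sleep_cpu:.3f}s")
print(f"busy:  wall={busy_wall:.3f}s cpu={busy_cpu:.3f}s")
```

`time.process_time()` adds up CPU time across all threads of the process, so a CPU time several times larger than the wall time, as in the Polars and pyarrow-engine readings, is what one would expect when the work is spread over several cores. The default pandas reader shows the two values nearly equal, which is consistent with single-threaded parsing.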


PyArrow is the set of Python bindings for Apache Arrow, built on its C++ implementation.

Apache Arrow is a specification plus implementations in multiple languages (C++, Rust, Julia, ...).
The main characteristics are:
1. designed for in-memory analytics: it is optimized for reading and writing data in memory and for moving data around
2. designed for tabular data: it stores the values of a column (which share a type) next to each other in memory. This enables optimizations such as SIMD and fewer cache misses.
3. the specification is (by definition) language-agnostic: the multiple implementations make it easy to exchange data between code written in different languages.
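To give a feel for the columnar layout described in point 2, here is a hand-rolled imitation of a variable-length string column built with NumPy. It is a simplified sketch and not the pyarrow API: the actual Arrow format also carries a validity bitmap for nulls and has alignment rules, and the names `data`, `offsets` and `get` are made up for this illustration.

```python
import numpy as np

values = ["wes", "", "arrow"]

# One contiguous buffer holding the bytes of every string back to back...
data = np.frombuffer("".join(values).encode("utf-8"), dtype=np.uint8)

# ...plus an offsets buffer: string i occupies data[offsets[i]:offsets[i + 1]].
lengths = [len(v.encode("utf-8")) for v in values]
offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int32)

def get(i: int) -> str:
    """Decode the i-th string from the two buffers."""
    return data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")

print(offsets.tolist())
print([get(i) for i in range(len(values))])
```

The whole column lives in two flat buffers, so scanning it touches memory sequentially and there is no per-string Python object to chase.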

One of the most interesting examples of the advantage of the Arrow memory layout is how strings are treated in Pandas (see this great article [Apache Arrow and the “10 Things I Hate About pandas”](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)). By default, an array of strings is an array of PyObject pointers. These pointers are contiguous in memory. What is not contiguous, though, is the actual string data, which lives inside a structure (PyBytes or PyUnicode) allocated on the heap.

The article notes that, in Python, the simple string 'wes' occupies 52 bytes of memory and '' occupies 49 bytes (the exact figures depend on the Python version). For a great discussion of issues around this, see Jake Vanderplas’s epic exposé on [Why Python is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/).
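The per-string overhead can be checked on any CPython interpreter with `sys.getsizeof`. The sketch below prints the object size next to the number of bytes of actual character data. The figures vary between Python versions, so no expected output is shown.

```python
import sys

for s in ["", "wes", "abcdefghij"]:
    total = sys.getsizeof(s)          # whole Python object, header included
    payload = len(s.encode("utf-8"))  # bytes of actual character data
    print(f"{s!r:>14}: object={total} bytes, payload={payload} bytes")

overhead = sys.getsizeof("")  # fixed per-object cost of an empty string
```

Whatever the exact numbers, the fixed header dominates for short strings, which is why a column of short strings stored as Python objects is so much larger than its character data alone.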

In [None]:
# Pandas memory usage (MiB)
mb = 1024 * 1024
pd_df_numpy_backend.memory_usage(deep=True)/mb

Index     0.000126
col_1    56.266785
col_2     7.629395
col_3    56.266785
col_4     7.629395
col_5     7.629395
dtype: float64

In [None]:
# Pandas memory usage with PyArrow (MiB)
pd_df.memory_usage(deep=True) / mb

Index     0.000126
col_1    13.351440
col_2     7.629395
col_3    13.351440
col_4     7.629395
col_5     7.629395
dtype: float64

In [None]:
# Polars (estimated) memory usage (MiB)
pl_df.estimated_size() / mb

41.961669921875
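The figures above can be turned back into a cost per row, which makes them easier to compare. The short calculation below uses a hypothetical `mib` helper and assumes the 1,000,000 rows of the dataset. The per-row byte counts were picked because they reproduce the reported numbers. Reading 14 bytes as 10 characters plus a 4-byte offset, and 59 bytes as an 8-byte pointer plus a Python string object, is an interpretation rather than something the notebook measures.

```python
MIB = 1024 * 1024
n_rows = 1_000_000

def mib(bytes_per_row: int) -> float:
    """Size in MiB of n_rows rows that each cost bytes_per_row bytes."""
    return round(n_rows * bytes_per_row / MIB, 6)

print(mib(8))   # fixed-width 8-byte numeric column
print(mib(59))  # per-row cost matching the object-dtype string columns
print(mib(14))  # per-row cost matching the Arrow-backed string columns
print(mib(44))  # per-row cost matching the Polars estimate for the whole frame
```

On these numbers, the Arrow-backed string columns take roughly a quarter of the memory of the object-dtype ones, while the numeric columns are identical in both pandas backends.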

## Mean

In [298]:
# mean with Polars

@time_it(100)
def calc_pl_mean_by_index(df: pl.DataFrame):
    return df.select(pl.all().exclude(["col_1", "col_3"])).mean()

pl_mean = print_timing(calc_pl_mean_by_index(pl_df))[0]

CPU Execution time: Time(wall_time=0.26548242568969727, cpu_time=0.5598384999999553) seconds


In [None]:
# mean with Pandas

@time_it(100)
def calc_pd_mean_by_index(df: pd.DataFrame):
    return df.drop(["col_1", "col_3"], axis=1).mean()

pd_mean = print_timing(calc_pd_mean_by_index(pd_df))[0]

CPU Execution time: Time(wall_time=0.5080535411834717, cpu_time=0.5108791999999767) seconds


## Filter

In [None]:
# Filter with Polars: keep rows where col_2 exceeds the mean of col_4

@time_it(100)
def calc_pl_mean_close_by_index(df: pl.DataFrame):
    return df.filter(pl.col('col_2') > pl.col('col_4').mean())

pl_mean_close = print_timing(calc_pl_mean_close_by_index(pl_df))[0]

CPU Execution time: Time(wall_time=1.1078405380249023, cpu_time=2.929636999999957) seconds


In [None]:
# Filter with Pandas: keep rows where col_2 exceeds the mean of col_4

@time_it(100)
def calc_pd_mean_close_by_index(df: pd.DataFrame):
    return df[df['col_2'] > df['col_4'].mean()]

pd_mean_close = print_timing(calc_pd_mean_close_by_index(pd_df))[0]

CPU Execution time: Time(wall_time=3.943446636199951, cpu_time=3.928661899999952) seconds


## Non-trivial op

In [None]:
# Rolling mean with Polars

@time_it(100)
def calc_pl_ma(df: pl.DataFrame):
    return df["col_2"].rolling_mean(window_size=10)

pl_ma = print_timing(calc_pl_ma(pl_df))[0]

CPU Execution time: Time(wall_time=1.6172640323638916, cpu_time=1.6198280999999497) seconds


In [None]:
# Rolling mean with Pandas

@time_it(100)
def calc_pd_ma(df: pd.DataFrame):
    return df["col_2"].rolling(window=10).mean()

pd_ma = print_timing(calc_pd_ma(pd_df))[0]

CPU Execution time: Time(wall_time=2.4736294746398926, cpu_time=2.47524599999997) seconds


In [None]:
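# Rolling mean with NumPy

@time_it(100)
def calc_np_ma(df: pd.DataFrame):
    ts = df["col_2"].to_numpy()
    window = 10
    # running sum; the difference of two entries `window` apart is a window sum
    ret = np.cumsum(ts, dtype=float)
    ret[window:] = ret[window:] - ret[:-window]
    # drops the first window - 1 incomplete windows, so the result is shorter than the input
    return ret[window - 1:] / window

np_ma = print_timing(calc_np_ma(pd_df))[0]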

CPU Execution time: Time(wall_time=1.1847584247589111, cpu_time=1.1877812000000176) seconds
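The NumPy version relies on a running-sum trick that is easy to get wrong, so it is worth checking against a slow but obvious reference. The sketch below does that on a tiny made-up series. The function names and the sample data are illustrative and not part of the benchmark.

```python
import numpy as np

def rolling_mean_cumsum(ts: np.ndarray, window: int) -> np.ndarray:
    """Rolling mean from differences of the running sum."""
    ret = np.cumsum(ts, dtype=float)
    ret[window:] = ret[window:] - ret[:-window]
    return ret[window - 1:] / window

def rolling_mean_naive(ts: np.ndarray, window: int) -> np.ndarray:
    """Reference implementation: average each window explicitly."""
    return np.array([ts[i:i + window].mean() for i in range(len(ts) - window + 1)])

ts = np.arange(1.0, 13.0)  # 1, 2, ..., 12
fast = rolling_mean_cumsum(ts, window=3)
slow = rolling_mean_naive(ts, window=3)
print(np.allclose(fast, slow), len(fast))
```

Note that this variant returns `len(ts) - window + 1` values, whereas the pandas and Polars rolling means keep the input length and fill the first `window - 1` positions with missing values, so the three results are not directly interchangeable. On long series the running sum can also accumulate floating-point error that a per-window computation would not.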
