In [2]:
import polars as pl
import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random

# Motivation

1. Small memory footprint
  - Native dtypes: missing, strings.
  - Arrow format.
1. Query Planning
1. Parallelism:
    - Speed
    - Debugging




## Memory Footprint


### Memory Footprint of Storage

Polars vs. Pandas:

In [3]:
letters = pl.Series(list(string.ascii_letters))

n = int(10e6)
letter1 = letters.sample(n,with_replacement=True)
letter1.estimated_size(unit='gb')

0.08381903916597366

In [4]:
letter1_pandas = letter1.to_pandas() 
letter1_pandas.memory_usage(deep=True, index=False) / 1e9

0.58

The memory footprint of the polars Series is about 1/7 that of the pandas Series(!).
But I did cheat: I used string data to emphasize the difference. The difference would have been smaller if I had used integers or floats. 
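
Where do these two numbers come from? Here is a back-of-the-envelope sketch. It assumes (my assumption, not something the cells above verify) that polars keeps the strings in an Arrow layout with one 64-bit offset per value, and that the pandas object column keeps an 8-byte pointer plus a full Python `str` object per value.

```python
import sys

n = 10_000_000  # number of one-character strings, as in the cells above

# Arrow-style layout (assumed): 1 byte of UTF-8 data per value,
# plus n + 1 offsets of 8 bytes each.
arrow_bytes = n * 1 + (n + 1) * 8
arrow_gib = arrow_bytes / 1024**3

# Object-dtype layout (assumed): one 8-byte pointer per value (64-bit machine),
# plus the Python str object itself.
str_object_bytes = sys.getsizeof("a")  # depends on the CPython version
object_bytes = n * (8 + str_object_bytes)

print(f"arrow estimate : {arrow_gib:.6f} GiB")
print(f"object estimate: {object_bytes / 1e9:.2f} GB")
```

Under these assumptions the first estimate reproduces the `estimated_size` printed above. The second lands near the pandas figure when a one-character `str` takes about 50 bytes, which varies between CPython versions.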




### Memory Footprint of Compute

You are probably storing your data to compute with it.
Let's compare the memory footprint of computations. 


In [5]:
%load_ext memory_profiler

In [6]:
%memit letter1.sort()

peak memory: 543.34 MiB, increment: 214.46 MiB


In [7]:
%memit letter1_pandas.sort_values()

peak memory: 669.23 MiB, increment: 339.91 MiB


In [8]:
%memit letter1[10]='a'

peak memory: 501.45 MiB, increment: 85.84 MiB


In [9]:
%memit letter1_pandas[10]='a'

peak memory: 416.68 MiB, increment: 0.00 MiB


Things to notice:

- Operating on existing data consumes less memory in polars than in pandas.
- Changing the data consumes more memory in polars than in pandas. Why is that?


### Operating From Disk to Disk

What if my data does not fit into RAM?
Turns out you can read from disk, process in RAM, and write to disk. This allows you to process data larger than your memory. 

TODO: demonstrate sink_parquet from [here](https://www.rhosignal.com/posts/sink-parquet-files/).





## Query Planning

Consider a query that sorts and then filters. 
Ideally, the filter would precede the sort, so that less data gets sorted, but we did not ensure this...
We now demonstrate that polars' query planner will do it for you. 
En passant, we see that polars is also more efficient without the query planner. 


Polars' Eager evaluation, without query planning. 
Sort then filter. 

In [10]:
%timeit -n 2 -r 2 letter1.sort().filter(letter1.is_in(['a','b','c']))

734 ms ± 57.4 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)


Polars' Eager evaluation, without query planning. 
Filter then sort. 

In [11]:
%timeit -n 2 -r 2 letter1.filter(letter1.is_in(['a','b','c'])).sort()

220 ms ± 4.37 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Polars' Lazy evaluation with query planning. 
Receives sort then filter; executes filter then sort. 

In [12]:
%timeit -n 2 -r 2 letter1.alias('letters').to_frame().lazy().sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect()

206 ms ± 1.73 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Pandas' eager evaluation in the wrong order: Sort then filter. 

%timeit -n 2 -r 2 letter1_pandas.sort_values().loc[lambda x: x.isin(['a','b','c'])]


Pandas eager evaluation in the right order: Filter then sort. 

In [13]:
%timeit -n 2 letter1_pandas.loc[lambda x: x.isin(['a','b','c'])].sort_values()

712 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)


Pandas alternative syntax, just as slow. 

In [14]:
%timeit -n 2 -r 2 letter1_pandas.loc[letter1_pandas.isin(['a','b','c'])].sort_values()

908 ms ± 13.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Things to note:

1. Query planning works!
1. Polars is faster than pandas even in eager evaluation (without query planning).



## Parallelism

Polars seamlessly parallelizes over columns (also within, when possible).
As the number of columns in the data grows, we would expect fixed runtime until all cores are used, and then linear scaling.
The following code demonstrates this idea, using a simple sum-within-column.


In [15]:
import time

def scaling_of_sums(n_rows, n_cols):
  # n_cols = 2
  # n_rows = int(1e6)
  A = {}
  A_numpy = np.random.randn(n_rows,n_cols)
  A['numpy'] = A_numpy.copy()
  A['polars'] = pl.DataFrame(A_numpy)
  A['pandas'] = pd.DataFrame(A_numpy)

  times = {}
  for key,value in A.items():
    start = time.time()
    value.sum()
    end = time.time()
    times[key] = end-start

  return(times)

In [16]:
scaling_of_time = {
  p:scaling_of_sums(n_rows= int(1e6),n_cols = p) for p in np.arange(1,16)}

In [17]:
data = pd.DataFrame(scaling_of_time).T
px.line(
  data, 
  labels=dict(
    index="Number of Columns", 
    value="Runtime")
)

Things to note:

- Pandas is slow. 
- Numpy is quite efficient.
- My machine has 8 cores. I would thus expect a fixed timing until 8 columns, and then linear scaling. This is not the case. I wonder why?
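
One thing that may blur the picture is that each configuration is timed once, with `time.time()`. A steadier measurement repeats the call and keeps the fastest run. The helper below is a sketch; `best_of` is a name I made up, and I am not claiming that noise explains the curve.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of `func()` over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution, monotonic clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Usage with a stand-in workload; in the benchmark above this would be value.sum.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```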


## Speed Of Import

Polars' `read_x` functions are considerably faster than pandas'. 
This is due to better type "guessing" heuristics, and to native support of the parquet file format. 

We now make synthetic data, save it as csv or parquet, and reimport it with polars and pandas.

Starting with CSV:

In [18]:
n_rows = int(1e5)
n_cols = 10
data = np.random.randn(n_rows,n_cols)
# Note: tofile() writes the flattened array as one comma-separated line (no row breaks).
data.tofile('data/data.csv', sep = ',')
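
A caveat about this file: `ndarray.tofile` with a `sep` writes the array flattened, so the CSV holds a single very wide row rather than `n_rows` rows of `n_cols` values. The small check below shows the difference from `np.savetxt`, which writes one line per row. File names are made up and live in a temporary directory.

```python
import os
import tempfile

import numpy as np

small = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    flat_path = os.path.join(tmp, "flat.csv")  # hypothetical file names
    rows_path = os.path.join(tmp, "rows.csv")

    small.tofile(flat_path, sep=",")             # flattened: all values on one line
    np.savetxt(rows_path, small, delimiter=",")  # one line per array row

    with open(flat_path) as fh:
        flat_lines = fh.read().splitlines()
    with open(rows_path) as fh:
        row_lines = fh.read().splitlines()

print(len(flat_lines), len(row_lines))
```

The timings that follow were measured on the single-row file, so they say more about parsing one extremely wide row than about a typical tall CSV.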

Import with pandas. 

In [19]:
%timeit -n2 -r2 data_pandas = pd.read_csv('data/data.csv', header = None)

21.1 s ± 395 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Import with polars. 

In [20]:
%timeit -n2 -r2 data_polars = pl.read_csv('data/data.csv', has_header = False)

4.52 s ± 184 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Moving to parquet:


In [21]:
data_pandas = pd.DataFrame(data)
data_pandas.columns = data_pandas.columns.astype(str)
data_pandas.to_parquet('data/data.parquet', index = False)

In [22]:
%timeit -n2 -r2 data_pandas = pd.read_parquet('data/data.parquet')

21 ms ± 6.88 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


In [23]:
%timeit -n2 -r2 data_polars = pl.read_parquet('data/data.parquet')

9.86 ms ± 2.01 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Things to note:

- The difference in speed is quite large.
- I dare argue that polars' type guessing is better, but I am not demonstrating it here. 
- Bonus fact: parquet is much faster than csv, and also saves the frame's schema.



## Speed Of Join

Because pandas is built on numpy, people see it as both an in-memory database, and a matrix/array library.
With polars, it is quite clear it is an in-memory database, and not an array processing library.
As such, you cannot multiply two polars dataframes, but you can certainly join them efficiently.

Make some data:

In [24]:
def make_data(n_rows, n_cols):
  data = np.concatenate(
  (
    np.arange(n_rows)[:,np.newaxis], # index
    np.random.randn(n_rows,n_cols), # values
    ),
    axis=1)
    
  return data


n_rows = int(1e6)
n_cols = 10
data_left = make_data(n_rows, n_cols)
data_right = make_data(n_rows, n_cols)

Polars join:

In [25]:
data_left_polars = pl.DataFrame(data_left)
data_right_polars = pl.DataFrame(data_right)

%timeit -n2 -r2 polars_joined = data_left_polars.join(data_right_polars, on = 'column_0', how = 'inner')

223 ms ± 27.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Pandas join:

In [26]:
data_left_pandas = pd.DataFrame(data_left)
data_right_pandas = pd.DataFrame(data_right)

%timeit -n2 -r2 pandas_joined = data_left_pandas.merge(data_right_pandas, on = 0, how = 'inner')

1.18 s ± 69.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


## Moving Forward...

If this motivational section has convinced you to try polars instead of pandas, here is a more structured intro. 






# Polars Series

Much like pandas, polars' fundamental building block is the series. 
A series is a column of data, with a name, and a dtype.
In the following we:

1. Create a series and demonstrate basic operations on it.
1. Demonstrate the various dtypes. 
1. Discuss missing values.
1. Filter a series.

## Object Housekeeping
Construct a series

In [27]:
s = pl.Series("a", [1, 2, 3])
s

a
i64
1
2
3


Make pandas series for comparison:

In [28]:
s_pandas = pd.Series([1, 2, 3], name = "a")

In [29]:
type(s)

polars.internals.series.series.Series

In [30]:
type(s_pandas)

pandas.core.series.Series

In [31]:
s.dtype

Int64

In [32]:
s_pandas.dtype

dtype('int64')

In [33]:
f = pl.Series("a", [1., 2., 3.])
f

a
f64
1.0
2.0
3.0


In [34]:
f.dtype

Float64

In [35]:
s.cleared() # creates an empty series, with same dtype

a
i64


Object size in memory. Super useful for profiling:

In [36]:
s.estimated_size(unit="gb")

2.2351741790771484e-08

In [37]:
s.chunk_lengths() # what is the length of each memory chunk?

[3]

## Filtering and Subsetting


In [38]:
s[0]

1

## Aggregations

In [39]:
s.sum()

6

In [40]:
s.min()

1

In [41]:
s.arg_min()

0

In [42]:
s.mean()

2.0

In [43]:
s.median()

2.0

In [44]:
s.entropy()

-4.68213122712422
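
A negative entropy may look odd. As an arithmetic check (not a description of polars' implementation), the number above matches `-sum(x * ln(x))` applied to the raw values, without first scaling them to sum to one. Scaling first gives the usual non-negative Shannon entropy.

```python
import math

values = [1, 2, 3]

# Applied to the raw values, -sum(x * ln(x)) reproduces the number above.
raw = -sum(x * math.log(x) for x in values)

# Shannon entropy of the same values after scaling them to sum to one.
total = sum(values)
probs = [x / total for x in values]
shannon = -sum(p * math.log(p) for p in probs)

print(raw)
print(shannon)
```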

In [45]:
s.describe()

statistic,value
str,f64
"""min""",1.0
"""max""",3.0
"""null_count""",0.0
"""mean""",2.0
"""std""",1.0
"""count""",3.0


In [46]:
s.value_counts()

a,counts
i64,u32
1,1
2,1
3,1


## Transformations

In [47]:
s.abs()

a
i64
1
2
3


In [48]:
s.sin()

a
f64
0.841471
0.909297
0.14112


In [49]:
s.exp()

a
f64
2.718282
7.389056
20.085537


In [50]:
s.hash()

a
u64
13321499719149775801
8196255364589999986
3071011010030224171


In [51]:
s.log()

a
f64
0.0
0.693147
1.098612


In [52]:
s.peak_max()

False
False
True


In [53]:
s.sqrt()

a
f64
1.0
1.414214
1.732051


In [54]:
s.clip_max(2)

a
i64
1
2
2


In [55]:
s.clip_min(1)

a
i64
1
2
3


You cannot round integers, but you can round floats.


In [56]:
f.round(2)

a
f64
1.0
2.0
3.0


In [57]:
f.ceil()

a
f64
1.0
2.0
3.0


In [58]:
f.floor()

a
f64
1.0
2.0
3.0


In [59]:
s.is_in(pl.Series([1, 10]))

a
bool
True
False
False


__Caution__: `is_in()` in polars has an underscore, unlike `isin()` in pandas.

## Cumulative Operations


In [60]:
s.cummax()

a
i64
1
2
3


In [61]:
s.cumsum()

a
i64
1
3
6


In [62]:
s.cumprod()

a
i64
1
2
6


In [63]:
s.ewm_mean(com=0.5)

a
f64
1.0
1.75
2.615385
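
The `com` argument is the "center of mass"; the smoothing factor is `alpha = 1 / (1 + com)`. The output above is consistent with the "adjusted" weighting, where the most recent value gets weight 1, the previous one `(1 - alpha)`, and so on. A plain-Python sketch of that formula (`ewm_mean_adjusted` is my own helper, not a polars function):

```python
def ewm_mean_adjusted(values, com):
    """Exponentially weighted mean with weights (1 - alpha)**k, alpha = 1 / (1 + com)."""
    alpha = 1.0 / (1.0 + com)
    out = []
    for t in range(len(values)):
        # The most recent value gets weight 1, the one before (1 - alpha), and so on.
        weights = [(1.0 - alpha) ** (t - i) for i in range(t + 1)]
        weighted_sum = sum(w * x for w, x in zip(weights, values))
        out.append(weighted_sum / sum(weights))
    return out


result = ewm_mean_adjusted([1, 2, 3], com=0.5)
print([round(v, 6) for v in result])
```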


## Sequential Operations


In [64]:
s.diff()

a
i64
""
1.0
1.0


In [65]:
s.pct_change()

a
f64
""
1.0
0.5


## Windowed Operations


In [66]:
s.rolling_apply(
  pl.sum, 
  window_size=2)

a
i64
""
3.0
5.0


Not all functions will work within a `rolling_apply`! Only polars' functions will.

In [67]:
#| eval: false

s.rolling_apply(np.sum, window_size=2) # will not work

In [68]:
s.rolling_max(window_size=2)

a
i64
""
2.0
3.0


In [69]:
s.clip(1, 2)

a
i64
1
2
2


In [70]:
s.clone()

a
i64
1
2
3


In [71]:
# check equality with clone
s == s.clone()

a
bool
True
True
True


## Binary Operations

Despite my introduction above, you still can think of polars series as 1D arrays...

In [72]:
s.dot(pl.Series("b", [1, 2, 3]))

14.0

## Uniques and Duplicates


In [73]:
s.is_duplicated()

a
bool
False
False
False


In [74]:
s.is_unique()

a
bool
True
True
True


In [75]:
s.n_unique()

3

In [76]:
pl.Series([1,2,3,4,1]).unique_counts()

2
1
1
1


The first appearance of a value in a series:

In [77]:
pl.Series([1,2,3,4,1]).is_first()

True
True
True
True
False


## Handling dtypes

__Note__. Unlike pandas, polars' test functions have an underscore: `is_numeric()` instead of `isnumeric()`.


In [78]:
s.is_numeric()

True

In [79]:
s.is_float()

False

In [80]:
s.is_utf8()

False

In [81]:
s.is_boolean()

False

In [82]:
s.is_datelike()

False

### Compare with Pandas Type Checkers

In [83]:
pd.api.types.is_string_dtype(s_pandas)

False

In [84]:
pd.api.types.is_string_dtype(s)

False

### Optimizing dtypes

Find the most efficient dtype for a series:

In [85]:
s.shrink_dtype()

a
i8
1
2
3


Also see [here](http://braaannigan.github.io/software/2022/10/31/polars-dtype-diet.html).
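
The idea behind shrinking an integer dtype can be sketched in a few lines: find the narrowest type whose range covers the observed minimum and maximum. This is only an illustration of the idea for signed integers, with a helper name I made up; it is not polars' implementation.

```python
# Signed integer ranges, narrowest type first.
INT_RANGES = [
    ("i8", -2**7, 2**7 - 1),
    ("i16", -2**15, 2**15 - 1),
    ("i32", -2**31, 2**31 - 1),
    ("i64", -2**63, 2**63 - 1),
]


def smallest_signed_int(values):
    """Name of the narrowest signed integer type that can hold every value."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")


print(smallest_signed_int([1, 2, 3]))
print(smallest_signed_int([1, 2, 300]))
```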

Shrink the memory allocation to the size of the actual data (in place).

In [86]:
s.shrink_to_fit() 

a
i64
1
2
3


## Missing

Pandas users will be excited to know that polars has built-in missing value support (!) for all dtypes.
This has been a long-awaited feature in the Python data science ecosystem, with implications for performance and syntax.


In [87]:
m = pl.Series("a", [1, 2, None, np.nan])
m.is_null()

a
bool
False
False
True
False


In [88]:
m.is_nan()

a
bool
False
False
""
True


In [89]:
m1 = pl.Series("a", [1, None, 2, ]) # python native None
m2 = pl.Series("a", [1, np.nan, 2, ]) # numpy's nan
m3 = pl.Series("a", [1, float('nan'), 2, ]) # python's nan
m4 = pd.Series([1, None, 2 ])
m5 = pd.Series([1, np.nan, 2, ])
m6 = pd.Series([1, float('nan'), 2, ])

In [90]:
[m1.sum(), m2.sum(), m3.sum(), m4.sum(), m5.sum(), m6.sum()]

[3, nan, nan, 3.0, 3.0, 3.0]

Things to note:

- The use of `is_null()` instead of pandas `isna()`.
- Polars supports `np.nan`, but it is a floating-point value and distinct from `None`, which polars stores as a null. A null is not considered a NaN: `is_nan()` returns null for it, not `True`.
- Aggregations of pandas and polars series behave differently w.r.t. missing values:
  - Both will ignore `None`; which is unsafe.
  - Polars will not ignore `np.nan`; which is safe. Pandas is unsafe w.r.t. `np.nan`, and will ignore it. 


Filling missing values; `None` and `np.nan` are treated differently:

In [91]:
m1.fill_null(0)

a
i64
1
0
2


In [92]:
m1.interpolate()

a
i64
1
1
2


In [93]:
m2.fill_null(0)

a
f64
1.0
""
2.0


In [94]:
m2.fill_nan(0)

a
f64
1.0
0.0
2.0


# DataFrames

## Object Descriptives

## Statistical Aggregations

## Filtering, Selection, and Other Manipulations

## Joins

## Reshaping

## Groupby

## Query Planning and Optimization

- describe_plan
- show_graph
- describe_optimized_plan



# I/O

## Import

- From csv
- From parquet
- From multiple parquets
- From Arrow DataSet



Warnings:

1. String caching



## Export


# Plotting

# Polars and ML

# Strings

# Datetimes


# Config


In [95]:
list(dir(pl.Config))

['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'load',
 'restore_defaults',
 'save',
 'set_ascii_tables',
 'set_fmt_str_lengths',
 'set_tbl_cell_alignment',
 'set_tbl_cols',
 'set_tbl_column_data_type_inline',
 'set_tbl_dataframe_shape_below',
 'set_tbl_formatting',
 'set_tbl_hide_column_data_types',
 'set_tbl_hide_column_names',
 'set_tbl_hide_dataframe_shape',
 'set_tbl_hide_dtype_separator',
 'set_tbl_rows',
 'set_tbl_width_chars',
 'set_verbose',
 'state',
 'with_columns_kwargs']