# Polars: Blazingly fast dataframes!! 🚀🚀

_An in-memory query engine + DataFrame library written from scratch in Rust_

<div>
<img src="../img/polars_logo_black_circle.png" style="display:block;margin-left:auto;margin-right:auto;width:20%" />
</div>
<div>
<img src="../img/xomnia_logo.png" style="display:block;margin-left:auto;margin-right:auto;width:20%" />
</div>


## Agenda

* Polars in a nutshell
* Blazingly fast claim
* Coming from pandas
* Example: Expression language


## Why Polars?

1. Rust did not have a DataFrame library

2. Moore's law is ending
    - The hardware response: more cores.
    - NumPy and pandas are single-threaded and don't benefit from this hardware trend.
    - Decades of database development (e.g. query optimization) go unused in the Python dataframe stack.
    - We got our ass kicked by R's `data.table`.


<img src="../img/moores_law_dead.png" width=400/>


## Polars in a nutshell: Rust


- Rust: a memory-safe programming language with performance on par with C/C++
- Full control over memory / close to the metal
    
<div>
<img src="../presentation_xomnia/img/rustlang.png" style="display:block;margin-left:auto;margin-right:auto;width:20%" />
</div>




## Polars in a nutshell: Arrow

* Apache Arrow as memory model
    - The future of big data communication and columnar storage
    - Less serialization/deserialization, thanks to a standardized and optimized columnar memory layout
    - Within the same process, data access is free (pointer sharing)
    
<div>
<img src="../img/arrow_graph.png" style="display:block;margin-left:auto;margin-right:auto;width:70%" />
</div>
    

## Polars in a nutshell: Expressions + Query optimizations

* Lazy evaluation
    - Declarative
        - i.e. describe what you want, not how to do it
    - More context for optimizations

**declarative**
> ```
> get me a cup of coffee.
> ```

**procedural**
> ```
> take 5 steps of 75cm
> turn 90 degrees
> take 3 steps of 75cm
> take 1 step of 23cm
> turn 45 degrees
> lift your hand
> ...
> ...
> lower the coffee cup on the table
> ```

The latter has less room for optimization

## Polars in a nutshell: Memory

* Copy-on-write semantics
    - Pure functions are free (a pure function does not change the state of its input)
    - Cloning data is free
    - Mutating single elements is expensive
    - No `SettingWithCopyWarning`
    
**pandas:**
> ```
> array:      [..............]        5MB
> 
> copy:
>             [..............]        5MB
> copy:
>             [..............]        5MB
> 
> total: 15 MB
> ```


**polars:**
> ```
> RC = Reference counter abstraction
>     
> array:      RC<[..............]>   5MB
> 
> copy:
>             *ptr
>             increment refcount
> copy:
>             *ptr
>             increment refcount
>             
> total: 5 MB
> ```
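
The diagram above can be sketched in pure Python. `CowArray` is a hypothetical toy class written for this slide, not how Polars is implemented internally; it only shows why a clone is cheap (it shares the buffer and bumps a counter) and why mutating a single element can be expensive (a shared buffer must be copied first).

```python
class CowArray:
    """Toy copy-on-write array: clones share one buffer until a write occurs."""

    def __init__(self, values):
        # The box holds the data and the number of CowArrays pointing at it.
        self._box = {"data": list(values), "refs": 1}

    def clone(self):
        # Cheap: share the box and increment the reference count.
        other = CowArray.__new__(CowArray)
        other._box = self._box
        self._box["refs"] += 1
        return other

    def __getitem__(self, i):
        return self._box["data"][i]

    def __setitem__(self, i, value):
        if self._box["refs"] > 1:
            # Expensive path: detach from the shared buffer by copying it.
            self._box["refs"] -= 1
            self._box = {"data": list(self._box["data"]), "refs": 1}
        self._box["data"][i] = value

    def shares_buffer_with(self, other):
        return self._box is other._box


a = CowArray([1, 2, 3])
b = a.clone()                      # no data copied
shared_before = a.shares_buffer_with(b)
b[0] = 99                          # first write triggers the copy
shared_after = a.shares_buffer_with(b)
print(shared_before, shared_after, a[0], b[0])
```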

## Polars in a nutshell: Memory

* Tightly packed
    - less indirection / fewer cache misses
    - less fragmentation
* Bit-packed validity

**pandas**
<div>
<img src="../presentation_xomnia/img/pandas-string.svg" />
</div>

**polars/arrow**
<div>
<img src="../img/arrow-string.svg" />
</div>
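
A toy pure-Python version of the layout pictured above, assuming the Arrow variable-size string layout: one contiguous data buffer, an offsets buffer, and a validity bitmap with one bit per value (least-significant bit first, 1 = valid). `pack_strings` and `get_string` are hypothetical helpers written for this sketch, not Arrow or Polars APIs.

```python
def pack_strings(values):
    """Pack a list of str/None into (data, offsets, validity) buffers."""
    data = bytearray()
    offsets = [0]
    validity = bytearray((len(values) + 7) // 8)  # one bit per value
    for i, value in enumerate(values):
        if value is not None:
            data += value.encode("utf-8")
            validity[i // 8] |= 1 << (i % 8)      # mark slot i as valid
        offsets.append(len(data))                 # nulls take zero bytes
    return bytes(data), offsets, bytes(validity)


def get_string(data, offsets, validity, i):
    """Read value i back, returning None for a null slot."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets, validity = pack_strings(["foo", None, "barbaz"])
print(data, offsets, validity)
print(get_string(data, offsets, validity, 2))
```

All string bytes sit next to each other, so a scan touches one buffer instead of chasing a pointer per value.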

## Polars in a nutshell: Blazingly fast 🚀🚀

<div>
<img src="../presentation_xomnia/img/db-benchmark.png" width=700>
</div>


## Coming from pandas: Performance / Memory usage

**performance**
- 5-20x runtime improvement

**memory requirements**
- pandas: ~10x dataset size
- Polars: ~2-4x dataset size




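## Coming from pandas (design): first-class dtypes

### Missing data != `NaN`

* In pandas, adding a missing value can change the dtype: the integer column below is upcast to `float64` so it can hold `NaN`, and the assertion fails.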

```python
assert pd.DataFrame({"integers": [1, None, 2]}).dtypes[0] == np.int64
```

```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-13-88b9defd12a1> in <module>
----> 1 assert pd.DataFrame({"integers": [1, None, 2]}).dtypes[0] == np.int64

AssertionError: 
```


## Coming from pandas: no indexes

_my subjective claims_

* Code is read more often than written.
* Indexes as used in the wild cost more compute than they save.

* I never understood them much.

## No indexes gives us predictable indexing

```
df.iloc # not needed
df.loc # not needed
```

https://pola-rs.github.io/polars-book/user-guide/indexing.html#comparison-with-pandas

# Expressions

* `Fn(Series) -> Series`
* Very composable expression language
* Reduces the need for custom (slow) Python functions
* Can be combined indefinitely


```python
import polars as pl
from polars import col, lit
import numpy as np
import pandas as pd

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)
df
```

| A | fruits | B | cars | optional |
|---|---|---|---|---|
| i64 | str | i64 | str | i64 |
| 1 | "banana" | 5 | "beetle" | 28 |
| 2 | "banana" | 4 | "audi" | 300 |
| 3 | "apple" | 3 | "beetle" | null |
| 4 | "apple" | 2 | "beetle" | 2 |
| 5 | "banana" | 1 | "beetle" | -30 |


## Selection context

* You can select any value from a DataFrame.
* The data flowing through the expressions are columns in the `DataFrame`
* Every expression is evaluated in parallel.



```python
# We select everything in normal order
# Then we select everything in reversed order
(df.select([
    pl.all(),
    pl.all().reverse().suffix("_reverse")
]))
```

| A | fruits | B | cars | optional | A_reverse | fruits_reverse | B_reverse | cars_reverse | optional_reverse |
|---|---|---|---|---|---|---|---|---|---|
| i64 | str | i64 | str | i64 | i64 | str | i64 | str | i64 |
| 1 | "banana" | 5 | "beetle" | 28 | 5 | "banana" | 1 | "beetle" | -30 |
| 2 | "banana" | 4 | "audi" | 300 | 4 | "apple" | 2 | "beetle" | 2 |
| 3 | "apple" | 3 | "beetle" | null | 3 | "apple" | 3 | "beetle" | null |
| 4 | "apple" | 2 | "beetle" | 2 | 2 | "banana" | 4 | "audi" | 300 |
| 5 | "banana" | 1 | "beetle" | -30 | 1 | "banana" | 5 | "beetle" | 28 |


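```python
# We can combine aggregations, filters, and projections
df.select([
    "A",
    "B",
    "fruits",
    col("A").mean().alias("A_mean"),
    # sum of B over the rows where A > 2
    col("B").filter(col("A") > 2).sum().alias("sum_B_where_A_gt_2"),
    # sum of B over the rows where fruits == "banana"
    col("B").filter(col("fruits") == "banana").sum().alias("sum_B_where_banana")
])
```

| A | B | fruits | A_mean | sum_B_where_A_gt_2 | sum_B_where_banana |
|---|---|---|---|---|---|
| i64 | i64 | str | f64 | i64 | i64 |
| 1 | 5 | "banana" | 3 | 6 | 10 |
| 2 | 4 | "banana" | 3 | 6 | 10 |
| 3 | 3 | "apple" | 3 | 6 | 10 |
| 4 | 2 | "apple" | 3 | 6 | 10 |
| 5 | 1 | "banana" | 3 | 6 | 10 |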


```python
# This fails: every selected column must have the same length, or length 1.
# Without an aggregation, the filtered columns are shorter than "A".

df.select([
    "A",
    "B",
    "fruits",
    col("B").filter(col("A") > 2).alias("B_where_A_gt_2"),
    col("B").filter(col("fruits") == "banana").alias("B_where_banana")
])
```

```
thread '<unnamed>' panicked at 'The columns lengths in the DataFrame are not equal.', /github/workspace/polars/polars-core/src/fmt.rs:387:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

PanicException: The columns lengths in the DataFrame are not equal.
```

```python
# We can combine columns by a predicate

# pandas
df_pd = df.to_pandas()
df_pd["new"] = df_pd["B"]
# Use .loc with a column label so only "new" is overwritten, not whole rows
df_pd.loc[df_pd["fruits"] != "banana", "new"] = -1
print(df_pd)

# polars
df.with_column(
    pl.when(col("fruits") == "banana").then(col("B")).otherwise(-1).alias("new")
)
```

```
   A  fruits  B    cars  optional  new
0  1  banana  5  beetle      28.0    5
1  2  banana  4    audi     300.0    4
2  3   apple  3  beetle       NaN   -1
3  4   apple  2  beetle       2.0   -1
4  5  banana  1  beetle     -30.0    1
```

| A | fruits | B | cars | optional | new |
|---|---|---|---|---|---|
| i64 | str | i64 | str | i64 | i64 |
| 1 | "banana" | 5 | "beetle" | 28 | 5 |
| 2 | "banana" | 4 | "audi" | 300 | 4 |
| 3 | "apple" | 3 | "beetle" | null | -1 |
| 4 | "apple" | 2 | "beetle" | 2 | -1 |
| 5 | "banana" | 1 | "beetle" | -30 | 1 |


```python
# or we could keep combining predicates
df.with_column(
    pl.when(col("fruits") == "banana")
    .then(col("B"))
    .when(col("fruits") == "apple")
    .then(12)
    .otherwise(-1).alias("new")
)
```

| A | fruits | B | cars | optional | new |
|---|---|---|---|---|---|
| i64 | str | i64 | str | i64 | i64 |
| 1 | "banana" | 5 | "beetle" | 28 | 5 |
| 2 | "banana" | 4 | "audi" | 300 | 4 |
| 3 | "apple" | 3 | "beetle" | null | 12 |
| 4 | "apple" | 2 | "beetle" | 2 | 12 |
| 5 | "banana" | 1 | "beetle" | -30 | 1 |


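```python
def pandas_some_function(df: pd.DataFrame) -> pd.DataFrame:
    # Copying is often seen, and needed to keep the function pure,
    # but it is very expensive.
    df = df.copy()
    df["a"] = df["b"] + df["c"]
    return df

# Everything runs sequentially.
# (`pandas_df` and `pandas_some_other_function` are placeholders.)
(pandas_df
    .pipe(pandas_some_function)
    .pipe(pandas_some_other_function)
)

def polars_some_function() -> pl.Expr:
    # Parenthesize so the alias applies to the sum, not only to col("c").
    return (col("b") + col("c")).alias("a")

# Sequential steps, with the expressions inside each step run in parallel.
# (`polars_df`, `foo_fn`, `bar_fn`, `ham_fn` and the other functions are placeholders.)
(polars_df
     .with_column(foo_fn)
     .with_columns([bar_fn, ham_fn])  # parallel
     .select([
         polars_some_function(),      # parallel
         polars_some_other_function(),
         # ...
         compute_bar(),
     ])
)
```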

## Groupby context

* syntax: `df.groupby(..).agg([exprs..])`
* The data flowing through the expressions are the groups of the groupby operation
* Every expression is evaluated in parallel.

```python
(df.groupby("fruits")
 .agg([
     col("cars")
 ])
)
```

| fruits | cars |
|---|---|
| str | list |
| "banana" | [beetle, audi, beetle] |
| "apple" | [beetle, beetle] |


```python
(df.groupby("fruits")
 .agg([
     col("cars").filter(col("cars") == "beetle"),
     col("cars").filter(col("cars") == "beetle").count().alias("beetle_count"),
 ])
)
```

| fruits | cars | beetle_count |
|---|---|---|
| str | list | u32 |
| "apple" | [beetle, beetle] | 2 |
| "banana" | [beetle, beetle] | 2 |


```python
# Example of multiple aggregations in pandas.
# Chaining operations inside a single aggregation does not seem possible there, though.

# df_pd = df.to_pandas()
# df_pd.groupby("fruits").agg({
#     "A": ["shift", "sum"]
# })

(df.groupby("fruits")
    .agg([
        (col("A").reverse().rolling_min(window_size=2) ** 2),
        (col("A").reverse().rolling_min(window_size=2) ** 2).sum(),
    ])
)
```

| fruits | A | A_sum |
|---|---|---|
| str | list | f64 |
| "apple" | [null, 9] | 9 |
| "banana" | [null, 4, 1] | 5 |


# Window functions!

* Expressions with superpowers.
* Aggregation in selection context
* Groupby over different columns


```python
col("foo").aggregation_expression(..).over("column_used_to_group")
```

```python
# Aliases name what is computed: a rank per fruit and a mean per car.
(df.sort("fruits")
 .select([
    "fruits",
    "cars",
    col("A").rank().over("fruits").alias("rank_by_fruits"),
    col("A").mean().over("cars").alias("mean_by_cars"),
    col("A").rank().over("fruits").flatten().alias("A_ranked_by_fruits"),
]))
```

| fruits | cars | rank_by_fruits | mean_by_cars | A_ranked_by_fruits |
|---|---|---|---|---|
| str | str | list | f64 | f32 |
| "apple" | "beetle" | [1, 2] | 3.25 | 1 |
| "apple" | "beetle" | [1, 2] | 3.25 | 2 |
| "banana" | "beetle" | [1, 2, 3] | 3.25 | 1 |
| "banana" | "audi" | [1, 2, 3] | 2.0 | 2 |
| "banana" | "beetle" | [1, 2, 3] | 3.25 | 3 |


# Questions?