<img src="images/polars-logo.svg" width="25%" align="right">

# Blazingly-fast DataFrames with Polars

In this notebook, we'll get a high-level understanding of Polars by exploring about one year of airline on-time performance data.

---

[Polars](https://docs.pola.rs) is a relatively new but increasingly popular library for manipulating structured data. With a Rust-based core query engine, it is designed for efficient, out-of-core, and parallel operations.

## Read a Parquet file

Polars has a strict grammar and composable API, as we will see in the following section.

But to begin with, there are some similarities with the pandas API.
For instance, you can use the familiar `read_parquet` and `read_csv` functions for I/O, and `head`, `shape`, and `columns` to inspect the DataFrame.

In [None]:
import polars as pl

In [None]:
source = "gs://quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/part.200.parquet"
df = pl.read_parquet(source)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns[:10]

## Expression system

The expression system is one of the most powerful concepts in Polars.

Before we look into expressions, let's walk through the building blocks: `select`, `filter`, `group_by`, and `with_columns`.

### Building blocks

#### Let's `select` a subset of columns

In [None]:
columns = [
    'YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_CARRIER', 
    'TAIL_NUM', 'OP_CARRIER_FL_NUM', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 
    'DEP_TIME', 'DEP_DELAY', 'ARR_TIME', 'ARR_DELAY', 'CANCELLED', 
    'CANCELLATION_CODE', 'DIVERTED', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
]

In [None]:
df = df.select(pl.col(columns))

In [None]:
df.columns

#### Let's `filter` for the "DL" carrier

In [None]:
df.filter((pl.col("OP_CARRIER") == "DL"))

#### Find the number of entries for each day of the week (`group_by`)

In [None]:
df.group_by("DAY_OF_WEEK").len()

#### Create a new column with "DISTANCE" values in kilometers instead of miles (`with_columns`)

In [None]:
df.with_columns((pl.col("DISTANCE")*1.609344).alias("DISTANCE_KM"))

### Expressions & Context

> An expression is a tree of operations that describe how to construct one or more Series. As the outputs are Series, it is straightforward to apply a sequence of expressions (similar to method chaining in pandas) each of which transforms the output from the previous step.
>
> ~ [Expressions - Polars documentation](https://docs.pola.rs/user-guide/concepts/expressions/)

Expressions allow you to decouple the logic from execution, and Polars can optimize and parallelize the expressions.

A related concept is **Context**:

> A context, as implied by the name, refers to the context in which an expression needs to be evaluated. There are three main contexts:
>
> * Selection: df.select(...), df.with_columns(...)
> * Filtering: df.filter()
> * Group by / Aggregation: df.group_by(...).agg(...)
>
> ~ [Context - Polars documentation](https://docs.pola.rs/user-guide/concepts/contexts/)

Let's understand these with some example computations similar to the initial pandas and Dask computations:

#### What are the mean and median arrival and departure delays?

The expression would be:

In [None]:
(
    pl.col("DEP_DELAY", "ARR_DELAY").mean()
)

Now, you can apply this expression to any DataFrame with `select`, putting this in the "selection" context:

In [None]:
df.select(pl.col("DEP_DELAY", "ARR_DELAY").mean())

You can provide several expressions, and they will be executed in parallel:

In [None]:
# executed in parallel
df.select(
    pl.col("DEP_DELAY").mean().alias("MEAN_DEP_DELAY"),
    pl.col("DEP_DELAY").median().alias("MEDIAN_DEP_DELAY"),
    pl.col("ARR_DELAY").mean().alias("MEAN_ARR_DELAY"),
    pl.col("ARR_DELAY").median().alias("MEDIAN_ARR_DELAY"),
)

#### 💻 Your turn: Find the mean arrival and departure delays for each airline

Hint: Use `df.group_by(...).agg(...)`

In [None]:
# Your code here. When ready, click on the three dots for the solutions.

In [None]:
df.group_by("OP_CARRIER").agg(
    pl.col("DEP_DELAY").mean(),
    pl.col("ARR_DELAY").mean(),
)

## Lazy evaluation

Similar to Dask, Polars also supports lazy evaluation. A lazily evaluated DataFrame is called a `LazyFrame` in Polars.

Our expressions operate the same in lazy and eager mode.

In [None]:
import polars as pl
import numpy as np
from datetime import date

# '1m' is a one-minute interval, so this builds roughly 105 million timestamps
times = pl.datetime_range(date(1900, 1, 1), date(2100, 1, 1), '1m', eager=True)

df = pl.DataFrame({
    'time': times,
    'value': np.random.randn(len(times)),
    'category': np.random.randint(0, 3, size=len(times)),
    'id': np.random.randint(0, 2, size=(len(times))),
})

In [None]:
import gc

df.write_parquet('big_dataset.parquet')

# Free the in-memory copy; later cells read the data back from disk
del df
gc.collect()

Next, let's:
- read in the parquet file `big_dataset.parquet`
- select column `'time'`, but sorted descending, and then take only the first 5 elements

If you did this using `read_parquet`, try changing it to `scan_parquet` and adding `.collect()` at the end.

You probably got a 2x speed-up by doing practically nothing!

Let's learn more about laziness!

To create a `LazyFrame`, we use `scan_parquet` (instead of `read_parquet`):

In [None]:
source = "gs://quansight-datasets/airline-ontime-performance/parquet_by_year/YEAR=2022/*.parquet"
df_lazy = pl.scan_parquet(source)

In [None]:
df_lazy

`LazyFrame`s have an `explain` method that returns the query plan as a string. In that plan, "PROJECT" reports how many columns Polars will operate on.

In [None]:
print(df_lazy.explain())

### Re-compute mean and median arrival and departure delays

In [None]:
df_lazy.select(
    pl.col("DEP_DELAY").mean().alias("MEAN_DEP_DELAY"),
    pl.col("DEP_DELAY").median().alias("MEDIAN_DEP_DELAY"),
    pl.col("ARR_DELAY").mean().alias("MEAN_ARR_DELAY"),
    pl.col("ARR_DELAY").median().alias("MEDIAN_ARR_DELAY"),
)

Now, calling `.explain()` on this query shows PROJECT 2/109 COLUMNS:

In [None]:
# print() renders the plan's line breaks instead of showing an escaped string
print(
    df_lazy.select(
        pl.col("DEP_DELAY").mean().alias("MEAN_DEP_DELAY"),
        pl.col("DEP_DELAY").median().alias("MEDIAN_DEP_DELAY"),
        pl.col("ARR_DELAY").mean().alias("MEAN_ARR_DELAY"),
        pl.col("ARR_DELAY").median().alias("MEDIAN_ARR_DELAY"),
    ).explain()
)

`.collect()` performs the computation and returns the result as an eager `DataFrame`:

In [None]:
df_lazy.select(
    pl.col("DEP_DELAY").mean().alias("MEAN_DEP_DELAY"),
    pl.col("DEP_DELAY").median().alias("MEDIAN_DEP_DELAY"),
    pl.col("ARR_DELAY").mean().alias("MEAN_ARR_DELAY"),
    pl.col("ARR_DELAY").median().alias("MEDIAN_ARR_DELAY"),
).collect()

Finally, you can also `.profile()` the computation to find areas for optimization:

In [None]:
df_lazy.select(
    pl.col("DEP_DELAY").mean().alias("MEAN_DEP_DELAY"),
    pl.col("DEP_DELAY").median().alias("MEDIAN_DEP_DELAY"),
    pl.col("ARR_DELAY").mean().alias("MEAN_ARR_DELAY"),
    pl.col("ARR_DELAY").median().alias("MEDIAN_ARR_DELAY"),
).profile()

## Limitations

### Polars is designed for efficient columnar operations

Hence, `axis=1` (row-wise) operations have limited support, and may be less efficient than in array libraries.

If you have only numerical data, convert to NumPy (`.to_numpy()`).

### Incomplete support for the full pandas API

You may find some functionality which is only available in pandas, and not (yet) in Polars.

In that case, you can easily convert to pandas:

- `.to_pandas()`: probably copies data
- `.to_pandas(use_pyarrow_extension_array=True)`: zero-copy, but pandas support for pyarrow is still flaky

---

## Next →

Let's learn about [DuckDB](11-duckdb.ipynb)!