# Python Polars: The Definitive Crash Course

<div style="width: 80%; margin: 0 auto;">

![](img/title.png)

</div>

<div style="width: 50%; margin: 0 auto;">

![](img/jeroen.png)

</div>

## Outline

- Reading and Writing Data
- Expressions
- Eager and Lazy APIs
- Selecting and Creating Columns
- Filtering and Sorting Rows
- Working with Textual, Temporal, and Nested Data Types
- Joining and Concatenating
- Reshaping 

<div style="width: 50%; margin: 0 auto;">

![](img/yoda.png)

</div>

In [None]:
import polars as pl

pl.__version__  # The book is built with Polars version 1.20.0

# Reading and Writing Data

## Format Overview

<div style="width: 50%; margin: 0 auto;">

![](img/formats-table.png)

</div>

## Reading CSV Files

When you’re handed a file with the extension .csv,
there’s no knowing what’s inside:

- Is the delimiter a comma, a tab, a semicolon, or something else?
- Is the character encoding UTF-8, ASCII, or something else?
- Is there a header with column names? How many lines is it?
- How are missing values represented?
- Are values properly quoted?

In [None]:
! cat data/penguins.csv

In [None]:
penguins = pl.read_csv("data/penguins.csv")
penguins

## Parsing Missing Values Correctly

Unfortunately, for plain-text
formats such as CSV, there’s no standard way to represent missing values. Representations that
we’ve seen in the wild include NULL, Nil, None, NA, N/A, NaN, 999999, and the
empty String.
By default, Polars only interprets empty Strings as missing values.

In [None]:
penguins = pl.read_csv("data/penguins.csv", null_values="NA")
penguins

In [None]:
penguins.null_count().transpose(  
    include_header=True, column_names=["null_count"]
)

## Working with Multiple Files

Globbing patterns can contain special characters that act as wildcards, such as:

- Asterisks (*), which match zero or more characters in a String. For example, the pattern *.csv will match any filename that ends in .csv.
- Question marks (?), which match exactly one character. For example, the pattern file?.csv will match files like file1.csv or fileA.csv but not file12.csv.
- Square brackets ([]), which match any one of the characters listed inside them. For example, the pattern file[12].csv will match file1.csv and file2.csv but not file3.csv.
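To get a feel for these wildcards without touching any files, you can try patterns with Python's standard `fnmatch` module, which follows the same conventions. Polars does its own globbing, so treat this as an illustration of the pattern syntax rather than of Polars internals. The filenames are hypothetical.

```python
from fnmatch import fnmatch

# Asterisk: zero or more characters.
star_match = fnmatch("2015.csv", "*.csv")

# Question mark: exactly one character.
one_char_match = fnmatch("file1.csv", "file?.csv")
two_char_match = fnmatch("file12.csv", "file?.csv")

# Square brackets: one character out of a set or range.
bracket_match = fnmatch("2017.csv", "201[5-9].csv")
bracket_miss = fnmatch("2013.csv", "201[5-9].csv")

print(star_match, one_char_match, two_char_match, bracket_match, bracket_miss)
```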

In [None]:
pl.read_csv("data/stock/nvda/201?.csv")

In [None]:
all_stocks = pl.read_csv("data/stock/**/*.csv")
all_stocks

In [None]:
import calendar

filenames = [
    f"data/stock/asml/{year}.csv"
    for year in range(1999, 2024)
    if calendar.isleap(year)
]

filenames

In [None]:
pl.concat(pl.read_csv(f) for f in filenames)

## Reading Parquet

Parquet is a columnar storage format for big data frameworks that's more efficient than CSV/Excel. Key benefits: faster column-specific queries, supports nested data structures, includes schema information to prevent errors, and works seamlessly with in-memory formats like Apache Arrow.

In [None]:
%%time
trips = pl.read_parquet("data/taxi/yellow_tripdata_*.parquet")
trips

## Reading JSON and NDJSON

### JSON

In [None]:
! head -n 33 data/pokedex.json

In [None]:
pokedex = pl.read_json("data/pokedex.json")
pokedex

Notice how everything is read as a single value? That’s because the JSON object has
only one key, called pokemon, whose value is a list of objects. 

Polars reads nested JSON as-is without auto-flattening. To manually flatten data, use `df.explode()` to convert list items into rows, and `df.unnest()` to convert object keys into columns.

In [None]:
(
    pokedex.explode("pokemon")
    .unnest("pokemon")
    .select("id", "name", "type", "height", "weight")
)

### NDJSON

In [None]:
! head data/wikimedia.ndjson

In [None]:
from json import loads
from pprint import pprint

with open("data/wikimedia.ndjson") as f:
    pprint(loads(f.readline()))

In [None]:
wikimedia = pl.read_ndjson("data/wikimedia.ndjson")
wikimedia

In [None]:
(
    wikimedia.rename({"id": "edit_id"})
    .unnest("meta")
    .select("timestamp", "title", "user", "comment")
)

## Other File Formats

If you have a file that’s not supported by Polars, then perhaps pandas can lend a
hand. pandas has been around for over 14 years, so it’s not surprising that it supports
more formats. 

In [None]:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Latin_abbreviations"
pl.from_pandas(pd.read_html(url, storage_options={"User-Agent": "Mozilla/5.0"})[0])

## Querying Databases

Polars also supports retrieving data from
various relational databases, including Postgres, MS SQL, MySQL, Oracle, SQLite,
and BigQuery.

## Writing Data

In [None]:
all_stocks.write_csv("data/all_stocks.csv")

In [None]:
all_stocks.write_excel("data/all_stocks.xlsx")

In [None]:
all_stocks.write_parquet("data/all_stocks.parquet")

#  Data Structures and Data Types

A Series is a one-dimensional structure holding values of the same type (integers, floats, strings). It can stand alone but is typically used as a column in a DataFrame.




## Series, DataFrames, and LazyFrames

In [None]:
sales_series = pl.Series("sales", [150.00, 300.00, 250.00])

sales_series

A DataFrame is a two-dimensional table with rows and columns, internally represented as multiple Series of equal length.

In [None]:
sales_df = pl.DataFrame(
    {
        "sales": sales_series,
        "customer_id": [24, 25, 26],
    }
)

sales_df

A LazyFrame holds no data, only instructions for reading and processing. Operations aren't executed immediately (lazy evaluation). Instead, they build a query graph that gets optimized before execution. It's essentially a blueprint for generating a DataFrame efficiently.

In [None]:
lazy_df = pl.scan_csv("data/fruit.csv").with_columns(
    is_heavy=pl.col("weight") > 200
)

lazy_df.show_graph()

## Data Types

Polars uses the Apache Arrow memory specification, a columnar format optimized for efficient analytic operations on modern hardware (CPUs/GPUs). Polars' data types are mostly based on Arrow's specification, with multiple bit sizes available to minimize memory usage while fitting your data range.


<div style="width: 50%; margin: 0 auto;">

![](img/datatypes.png)

</div>

### Nested Data Types

Polars has three nested data types: Array, List, and Struct. An Array is a collection of elements that are of the same data type. Within a Series,
each Array must have the same shape. 

In [None]:
coordinates = pl.DataFrame(
    [
        pl.Series("point_2d", [[1, 3], [2, 5]]),
        pl.Series("point_3d", [[1, 7, 3], [8, 1, 0]]),
    ],
    schema={
        "point_2d": pl.Array(shape=2, inner=pl.Int64),
        "point_3d": pl.Array(shape=3, inner=pl.Int64),
    },
)

coordinates

In contrast to the Array, a List does not have to have the same
length on every row. 

In [None]:
weather_readings = pl.DataFrame(
    {
        "temperature": [[72.5, 75.0, 77.3], [68.0, 70.2]],
        "wind_speed": [[15, 20], [10, 12, 14, 16]],
    }
)

weather_readings

A Struct is often used to work with multiple Series
at once.

In [None]:
rating_series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)
rating_series

# Beginning Expressions

## Expressions by Example

In [None]:
fruit = pl.read_csv("data/fruit.csv")
fruit

### Selecting Columns with Expressions

In [None]:
fruit.select(
    pl.col("name"),  
    pl.col("^.*or.*$"),  
    pl.col("weight") / 1000,  
    "is_round", 
)

### Creating New Columns with Expressions

In [None]:
fruit.with_columns(
    pl.lit(True).alias("is_fruit"),  
    is_berry=pl.col("name").str.ends_with("berry"),  
)

### Filtering Rows with Expressions

In [None]:
fruit.filter(
    (pl.col("weight") > 1000)  
    & pl.col("is_round")  
)

### Aggregating with Expressions

In [None]:
fruit.group_by(pl.col("origin").str.split(" ").list.last()).agg(  
    pl.len(),  
    average_weight=pl.col("weight").mean()  
)

### Sorting Rows with Expressions

In [None]:
fruit.sort(
    pl.col("name").str.len_bytes(),
    descending=True,  
)

## The Definition of an Expression

<div style="width: 50%; margin: 0 auto;">

![](img/def.png)

</div>

- Series
- Tree of operations



- Describe
- Construct
- One or more

In [None]:
(pl.lit(3).add(5) / pl.lit(1).add(5)).meta.tree_format()

In [None]:
(
    pl.DataFrame({"a": [1, 2, 3], "b": [0.4, 0.5, 0.6]}).with_columns(
        pl.all().mul(10).name.suffix("_times_10")
    )
)

In [None]:
pl.all().mul(10).name.suffix("_times_10").meta.has_multiple_outputs()

### Properties of Expressions

- Lazy
- Function and data dependent (context)

In [None]:
is_orange = (pl.col("color") == "orange").alias("is_orange")

fruit.with_columns(is_orange)

In [None]:
fruit.filter(is_orange)

In [None]:
fruit.group_by(is_orange).len()

- Reusable

In [None]:
flowers = pl.DataFrame(
    {
        "name": ["Tiger lily", "Blue flag", "African marigold"],
        "latin": ["Lilium columbianum", "Iris versicolor", "Tagetes erecta"],
        "color": ["orange", "purple", "orange"],
    }
)

flowers.filter(is_orange)

## Creating Expressions

### From Existing Columns

In [None]:
fruit.select(pl.col("color")).columns

In [None]:
# This raises a ColumnNotFoundError:
# fruit.select(pl.col("is_smelly")).columns

In [None]:
fruit.select(pl.col("^.*or.*$")).columns

In [None]:
fruit.select(pl.all()).columns

In [None]:
fruit.select(pl.col(pl.String)).columns

In [None]:
fruit.select(pl.col(pl.Boolean, pl.Int64)).columns

In [None]:
fruit.select(pl.col(["name", "color"])).columns

### From Literal Values

In [None]:
pl.select(pl.lit(42))

In [None]:
pl.select(pl.lit(42).alias("answer"))

In [None]:
pl.select(answer=pl.lit(42))

When you apply a literal expression to a
nonempty DataFrame, the value is broadcast, so the length of the resulting Series equals the number of rows.

In [None]:
fruit.with_columns(planet=pl.lit("Earth"))

In [None]:
# This raises a ShapeError:
fruit.with_columns(pl.lit(pl.Series([False, True])).alias("row_is_even"))

In [None]:
fruit.with_columns(row_is_even=pl.lit([False, True]))

In [None]:
pl.select(pl.repeat("Ella", 3).alias("umbrella"), pl.zeros(3), pl.ones(3))

### From Ranges

In [None]:
pl.select(
    start=pl.int_range(0, 5), end=pl.arange(0, 10, 2).pow(2)
).with_columns(int_range=pl.int_ranges("start", "end")).with_columns(
    range_length=pl.col("int_range").list.len()
)

In [None]:
pl.select(
    start=pl.date_range(pl.date(1985, 10, 21), pl.date(1985, 10, 26)),
    end=pl.repeat(pl.date(2021, 10, 21), 6),
).with_columns(range=pl.datetime_ranges("start", "end", interval="1h"))

## Renaming Expressions

In [None]:
df = pl.DataFrame({"text": "value", "An integer": 5040, "BOOLEAN": True})
df

In [None]:
df.select(
    pl.col("text").name.to_uppercase(),
    pl.col("An integer").alias("int"),
    pl.col("BOOLEAN").name.to_lowercase(),
)

In [None]:
# This raises an InvalidOperationError:
# df.select(
#     pl.all()
#     .name.to_lowercase()
#     .name.map(lambda s: s.replace(" ", "_"))
# )

In [None]:
df.select(
    pl.all()
    .name.to_lowercase()
    .name.map(lambda s: s.replace(" ", "_"))
)

In [None]:
df.select(
    pl.all().name.map(lambda s: s.lower().replace(" ", "_"))
)

## Expressions Are Idiomatic

In [None]:
fruit.filter((fruit["weight"] > 1000) & fruit["is_round"])

In [None]:
(
    fruit.lazy()
    .filter((pl.col("weight") > 1000) & pl.col("is_round"))
    .with_columns(is_berry=pl.col("name").str.ends_with("berry"))
    .collect()
)

In [None]:
# This raises a ShapeError:
(
    fruit
    .lazy()
    .filter((fruit["weight"] > 1000) & fruit["is_round"])
    .with_columns(is_berry=fruit["name"].str.ends_with("berry"))
    .collect()
)

# Continuing Expressions

In [None]:
import math
import numpy as np

rng = np.random.default_rng(1729)

## Types of Operations

<div style="width: 50%; margin: 0 auto;">

![](img/ppdg_0801.png)

</div>

### Example A: Element-Wise Operations

In [None]:
penguins = pl.read_csv("data/penguins.csv", null_values="NA").select(
    "species",
    "island",
    "sex",
    "year",
    mass=pl.col("body_mass_g") / 1000,
)
penguins.with_columns(
    mass_sqrt=pl.col("mass").sqrt(),  
    mass_exp=pl.col("mass").exp(),
)

### Example B: Operations That Summarize to One

In [None]:
penguins.select(pl.col("mass").mean(), pl.col("island").first())

### Example C: Operations That Summarize to One or More

In [None]:
penguins.select(pl.col("island").unique())

### Example D: Operations That Extend

In [None]:
penguins.select(
    pl.col("species")
    .unique()  
    .repeat_by(3000)  
    .explode()  
    .extend_constant("Saiyan", n=1)  
)

## Element-Wise Operations

Each element
is computed independently, and the order in which they appear doesn’t matter.

### Operations That Perform Mathematical Transformations

In [None]:
(
    pl.DataFrame({"x": [-2.0, 0.0, 0.5, 1.0, math.e, 1000.0]}).with_columns(
        abs=pl.col("x").abs(),
        exp=pl.col("x").exp(),
        log2=pl.col("x").log(2),  
        log10=pl.col("x").log10(),
        log1p=pl.col("x").log1p(),
        sign=pl.col("x").sign(),
        sqrt=pl.col("x").sqrt(),
    )
)

### Operations Related to Trigonometry

In [None]:
(
    pl.DataFrame(
        {"x": [-math.pi, 0.0, 1.0, math.pi, 2 * math.pi, 90.0, 180.0, 360.0]}
    ).with_columns(
        arccos=pl.col("x").arccos(),  
        cos=pl.col("x").cos(),
        degrees=pl.col("x").degrees(),
        radians=pl.col("x").radians(),
        sin=pl.col("x").sin(),
    )
)

### Operations That Round and Categorize

In [None]:
(
    pl.DataFrame(
        {"x": [-6.0, -0.5, 0.0, 0.5, math.pi, 9.9, 9.99, 9.999]}
    ).with_columns(
        ceil=pl.col("x").ceil(),
        clip=pl.col("x").clip(-1, 1),
        cut=pl.col("x").cut([-1, 1], labels=["bad", "neutral", "good"]),  
        floor=pl.col("x").floor(),
        qcut=pl.col("x").qcut([0.5], labels=["below median", "above median"]),
        round2=pl.col("x").round(2),
        round0=pl.col("x").round(0),  
    )
)

### Operations for Missing or Infinite Values

In [None]:
x = [42.0, math.nan, None, math.inf, -math.inf]
(
    pl.DataFrame({"x": x}).with_columns(
        fill_nan=pl.col("x").fill_nan(999),
        fill_null=pl.col("x").fill_null(0),  
        is_finite=pl.col("x").is_finite(),
        is_infinite=pl.col("x").is_infinite(),
        is_nan=pl.col("x").is_nan(),
        is_null=pl.col("x").is_null(),
    )
)

## Nonreducing Series-Wise Operations

### Operations That Accumulate

In [None]:
(
    pl.DataFrame(
        {"x": [0.0, 1.0, 2.0, None, 2.0, np.nan, -1.0, 2.0]}
    ).with_columns(
        cum_count=pl.col("x").cum_count(),  
        cum_max=pl.col("x").cum_max(),
        cum_min=pl.col("x").cum_min(),
        cum_prod=pl.col("x").cum_prod(reverse=True),  
        cum_sum=pl.col("x").cum_sum(),
        diff=pl.col("x").diff(),
        pct_change=pl.col("x").pct_change(),
    )
)

### Operations That Fill and Shift

In [None]:
(
    pl.DataFrame(
        {"x": [-1.0, 0.0, 1.0, None, None, 3.0, 4.0, math.nan, 6.0]}
    ).with_columns(
        backward_fill=pl.col("x").backward_fill(),  
        forward_fill=pl.col("x").forward_fill(limit=1),
        interp1=pl.col("x").interpolate(method="linear"),  
        interp2=pl.col("x").interpolate(method="nearest"),
        shift1=pl.col("x").shift(1),
        shift2=pl.col("x").shift(-2),
    )
)

### Operations Related to Duplicate Values

In [None]:
(
    pl.DataFrame({"x": ["A", "C", "D", "C"]}).with_columns(  
        is_duplicated=pl.col("x").is_duplicated(),
        is_first_distinct=pl.col("x").is_first_distinct(),
        is_last_distinct=pl.col("x").is_last_distinct(),
        is_unique=pl.col("x").is_unique(),
    )
)

### Operations That Compute Rolling Statistics

In [None]:
stock = (
    pl.read_csv("data/stock/nvda/2023.csv", try_parse_dates=True)
    .select("date", "close")
    .with_columns(
        ewm_mean=pl.col("close").ewm_mean(com=7, ignore_nulls=True),  
        rolling_mean=pl.col("close").rolling_mean(window_size=7),
        rolling_min=pl.col("close").rolling_min(window_size=7),
    )
)

stock

In [None]:
from plotnine import *

(
    ggplot(stock.unpivot(index="date"), aes("date", "value", color="variable"))
    + geom_line(size=1)
    + labs(x="Date", y="Value", color="Method")
    + theme_tufte(base_size=14)
    + theme(figure_size=(8, 5), dpi=200)
)

### Operations That Sort

In real-world datasets, a row often represents an observation or
event. For that reason, you’ll most likely want to sort entire rows so
that the measurements of each observation or event stay together.

In [None]:
(
    pl.DataFrame(
        {
            "x": [1, 3, None, 3, 7],
            "y": ["D", "I", "S", "C", "O"],
        }
    ).with_columns(
        arg_sort=pl.col("x").arg_sort(),
        shuffle=pl.col("x").shuffle(seed=7),
        sort=pl.col("x").sort(nulls_last=True),
        sort_by=pl.col("x").sort_by("y"),
        reverse=pl.col("x").reverse(),
        rank=pl.col("x").rank(),
    )
)

## Series-Wise Operations That Summarize to One

### Operations That Are Quantifiers

Using quantifiers allows you to summarize multiple Boolean values into one.

In [None]:
df = pl.DataFrame(
    {
        "x": [True, False, False],
        "y": [True, True, True],
        "z": [False, False, False],
    }
)
print(df)
print(
    df.select(
        pl.all().all().name.suffix("_all"),
        pl.all().any().name.suffix("_any"),
    ),
)

### Operations That Compute Statistics

In [None]:
samples = rng.normal(loc=5, scale=3, size=1_000_000)

(
    pl.DataFrame({"x": samples}).select(
        max=pl.col("x").max(),
        mean=pl.col("x").mean(),
        quantile=pl.col("x").quantile(quantile=0.95),
        skew=pl.col("x").skew(),
        std=pl.col("x").std(),
        sum=pl.col("x").sum(),
        var=pl.col("x").var(),
    )
)

### Operations That Count

In [None]:


samples = pl.Series(rng.integers(low=0, high=10_000, size=1_729))
samples[403] = None  
df_ints = pl.DataFrame({"x": samples}).with_row_index()  
df_ints.slice(400, 6)  

In [None]:
df_ints.select(
    approx_n_unique=pl.col("x").approx_n_unique(),
    count=pl.col("x").count(),
    len=pl.col("x").len(),
    n_unique=pl.col("x").n_unique(),
    null_count=pl.col("x").null_count(),
)

In [None]:
large_df_ints = pl.DataFrame(
    {"x": rng.integers(low=0, high=10_000, size=10_000_000)}
)

In [None]:
%%time
large_df_ints.select(pl.col("x").n_unique())

In [None]:
%%time
large_df_ints.select(pl.col("x").approx_n_unique())

### Other Operations

In [None]:
df_ints.select(
    arg_min=pl.col("x").arg_min(),
    first=pl.col("x").first(),
    get=pl.col("x").get(403),  
    implode=pl.col("x").implode(),
    last=pl.col("x").last(),
    upper_bound=pl.col("x").upper_bound(),
)

## Series-Wise Operations That Summarize to One or More

### Operations Related to Unique Values

In [None]:
(
    pl.DataFrame({"x": ["A", "C", "D", "C"]}).select(
        arg_unique=pl.col("x").arg_unique(),
        unique=pl.col("x").unique(maintain_order=True),  
        unique_counts=pl.col("x").unique_counts(),
        value_counts=pl.col("x").value_counts(sort=True),  
    )
)

### Operations That Select

In [None]:
df_ints.select(
    bottom_k=pl.col("x").bottom_k(7),  
    head=pl.col("x").head(7),
    sample=pl.col("x").sample(7),
    slice=pl.col("x").slice(400, 7),
    gather=pl.col("x").gather([1, 1, 2, 3, 5, 8, 13]),
    gather_every=pl.col("x").gather_every(247),  
    top_k=pl.col("x").top_k(7),
)

### Operations That Drop Missing Values

In [None]:
x = [None, 1.0, 2.0, 3.0, np.nan]
(
    pl.DataFrame({"x": x}).select(
        drop_nans=pl.col("x").drop_nans(), drop_nulls=pl.col("x").drop_nulls()
    )
)

## Series-Wise Operations That Extend

In [None]:
(
    pl.DataFrame(
        {
            "x": [["a", "b"], ["c", "d"]],
        }
    ).select(explode=pl.col("x").explode())
)

# Combining Expressions

- Through arithmetic, such as adding and multiplying
- By comparing, such as greater than and equals
- With Boolean algebra, such as conjunction and negation
- Via bitwise operations, such as AND and XOR
- Using a variety of module-level functions

In [None]:
fruit = pl.read_csv("data/fruit.csv")
fruit.filter(pl.col("is_round") & (pl.col("weight") > 1000))

## Inline Operators Versus Methods

<div style="width: 50%; margin: 0 auto;">

![](img/ppdg_0901.png)

</div>

In [None]:
(
    pl.DataFrame({"i": [6.0, 0, 2, 2.5], "j": [7.0, 1, 2, 3]}).with_columns(
        (pl.col("i") * pl.col("j")).alias("*"),
        pl.col("i").mul(pl.col("j")).alias("Expr.mul()"),
    )
)

## Arithmetic Operations

In [None]:
fruit.select(pl.col("name"), (pl.col("weight") / 1000))

In [None]:
(
    pl.DataFrame({"i": [0.0, 2, 2, -2, -2], "j": [1, 2, 3, 4, -5]}).with_columns(
        (pl.col("i") + pl.col("j")).alias("i + j"),
        (pl.col("i") - pl.col("j")).alias("i - j"),
        (pl.col("i") * pl.col("j")).alias("i * j"),
        (pl.col("i") / pl.col("j")).alias("i / j"),
        (pl.col("i") // pl.col("j")).alias("i // j"),
        (pl.col("i") ** pl.col("j")).alias("i ** j"),
        (pl.col("j") % 2).alias("j % 2"),  
        pl.col("i").dot(pl.col("j")).alias("i ⋅ j"),  
    )
)

## Comparison Operations

- "Which of these experiments produced a significant result?"
- "Which movies released in the ’90s have an IMDB score of 8.7 or higher?"
- "Are these voltages within the allowed range?"


In [None]:
pl.select(pl.lit("a") > pl.lit("b"))

In [None]:
(
    fruit.select(
        pl.col("name"),
        pl.col("weight"),
    ).filter(pl.col("weight") >= 1000)
)

In Python itself, you can chain inline operators:

In [None]:
x = 4
3 < x < 5

With Polars, however, if you do this, you get an error:

In [None]:
# This raises a TypeError:
pl.select(pl.lit(3) < pl.lit(x) < pl.lit(5))

In [None]:
pl.select((pl.lit(3) < pl.lit(x)) & (pl.lit(x) < pl.lit(5))).item()

In [None]:
pl.select(pl.lit(x).is_between(3, 5)).item()

In [None]:
(
    pl.DataFrame(
        {"a": [-273.15, 0, 42, 100], "b": [1.4142, 2.7183, 42, 3.1415]}
    ).with_columns(
        (pl.col("a") == pl.col("b")).alias("a == b"),
        (pl.col("a") <= pl.col("b")).alias("a <= b"),
        (pl.all() > 0).name.suffix(" > 0"),
        ((pl.col("b") - pl.lit(2).sqrt()).abs() < 1e-3).alias("b ≈ √2"),  
        ((1 < pl.col("b")) & (pl.col("b") < 3)).alias("1 < b < 3"),
    )
)

In [None]:
pl.select(
    bool_num=pl.lit(True) > 0,
    time_time=pl.time(23, 58) > pl.time(0, 0),
    datetime_date=pl.datetime(1969, 7, 21, 2, 56) < pl.date(1976, 7, 20),
    str_num=pl.lit("5") < pl.lit(3).cast(pl.String),  
    datetime_time=pl.datetime(1999, 1, 1).dt.time() != pl.time(0, 0),  
).transpose(  
    include_header=True, header_name="comparison", column_names=["allowed"]
)

## Boolean Algebra Operations

In [None]:
x = 7
p = pl.lit(3) < pl.lit(x)  # True
q = pl.lit(x) < pl.lit(5)  # False
pl.select(p & q).item()

In [None]:
(
    pl.DataFrame(
        {"p": [True, True, False, False], "q": [True, False, True, False]}
    ).with_columns(
        (pl.col("p") & pl.col("q")).alias("p & q"),
        (pl.col("p") | pl.col("q")).alias("p | q"),
        (~pl.col("p")).alias("~p"),
        (pl.col("p") ^ pl.col("q")).alias("p ^ q"),
        (~(pl.col("p") & pl.col("q"))).alias("p ↑ q"),  
        ((pl.col("p").or_(pl.col("q"))).not_()).alias("p ↓ q"),  
    )
)

## Bitwise Operations

In [None]:
pl.select(pl.lit(10) | pl.lit(34)).item()

In [None]:
bits = pl.DataFrame(
    {"x": [1, 1, 0, 0, 7, 10], "y": [1, 0, 1, 0, 2, 34]},
    schema={"x": pl.UInt8, "y": pl.UInt8},
).with_columns(  
    (pl.col("x") & pl.col("y")).alias("x & y"),
    (pl.col("x") | pl.col("y")).alias("x | y"),
    (~pl.col("x")).alias("~x"),
    (pl.col("x") ^ pl.col("y")).alias("x ^ y"),
)
bits

# Eager and Lazy APIs

## Eager API: DataFrame

In [None]:
%%time
trips = pl.read_parquet("data/taxi/yellow_tripdata_*.parquet")  
sum_per_vendor = trips.group_by("VendorID").sum()  

income_per_distance_per_vendor = sum_per_vendor.select(
    "VendorID",
    income_per_distance=pl.col("total_amount") / pl.col("trip_distance"),
)

top_three = income_per_distance_per_vendor.sort(  
    by="income_per_distance", descending=True
).head(3)

top_three

## Lazy API: LazyFrame

The lazy API defers executing all selection, filtering, and manipulation until the moment it is actually needed. 

It reduces the amount of data that has to be read and processed by:

- Only reading columns that are needed
- Filtering out rows that are not needed
- Only reading parts of the column that are needed for the query



## Performance Differences

In [None]:
%%time
trips = pl.scan_parquet("data/taxi/yellow_tripdata_*.parquet")
sum_per_vendor = trips.group_by("VendorID").sum()

income_per_distance_per_vendor = sum_per_vendor.select(
    "VendorID",
    income_per_distance=pl.col("total_amount") / pl.col("trip_distance"),
)

top_three = income_per_distance_per_vendor.sort(
    by="income_per_distance", descending=True
).head(3)

top_three.collect()

The lazy API can catch data type errors before processing the data. 

In [None]:
# This raises a SchemaError:
names_lf = pl.LazyFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})

erroneous_query = names_lf.with_columns(
    sliced_age=pl.col("age").str.slice(1, 3)
)

result_df = erroneous_query.collect()

# Selecting and Creating Columns

In [None]:
starwars = pl.read_parquet("data/starwars.parquet")
rebels = starwars.drop("films").filter(
    pl.col("name").is_in(["Luke Skywalker", "Leia Organa", "Han Solo"])
)

rebels

## Selecting Columns

<div style="width: 50%; margin: 0 auto;">

![](img/ppdg_1001.png)

</div>

In [None]:
rebels.select(
    "name",
    pl.col("homeworld"),
    pl.col("^.*_color$"),
    (pl.col("height") / 100).alias("height_m"),
)

### Introducing Selectors

In [None]:
import polars.selectors as cs

In [None]:
rebels.select(
    "name",
    cs.by_name("homeworld"),
    cs.by_name("^.*_color$"),
    (cs.by_name("height") / 100).alias("height_m"),
)

### Selecting Based on Name

In [None]:
rebels.select(cs.starts_with("birth_"))

In [None]:
rebels.select(cs.ends_with("_color"))

In [None]:
rebels.select(cs.contains("_"))

In [None]:
rebels.select(cs.matches("^[a-z]{4}$"))

### Selecting Based on Data Type

In [None]:
rebels.group_by("hair_color").agg(cs.numeric().mean())

In [None]:
rebels.select(cs.string())

In [None]:
rebels.select(cs.temporal())

In [None]:
rebels.select(cs.by_dtype(pl.List(pl.String)))

### Selecting Based on Position

In [None]:
rebels.select(cs.by_index(range(0, 999, 3), require_all=False))

In [None]:
rebels.select("name", cs.by_index(range(-2, 0)))

### Combining Selectors

In [None]:
rebels.select(cs.by_name("hair_color") | cs.numeric())

### Bring Forth the Columns

In [None]:
df = pl.DataFrame({"d": 1, "i": True, "s": True, "c": True, "o": 1.0})


In [None]:
df.select(first := cs.by_name("c", "i"), ~first)

In [None]:
df.select(first := cs.last(), ~first)

## Creating Columns

<div style="width: 50%; margin: 0 auto;">

![](img/ppdg_1002.png)

</div>

If you create a new column with the same name as an existing column, the existing
column will be overwritten. This is a common pitfall, so be careful when creating new
columns.

In [None]:
rebels.with_columns(bmi=pl.col("mass") / ((pl.col("height") / 100) ** 2))

In [None]:
df = pl.DataFrame({"a": [1, 2, 3]})
df.with_columns(pl.col("a") * 2)

In [None]:
df.with_columns(a2=pl.col("a") * 2)

In [None]:
rebels.with_columns(
    bmi=pl.col("mass") / ((pl.col("height") / 100) ** 2),
    age_destroy=(
        (pl.date(1983, 5, 25) - pl.col("birth_date")).dt.total_days() / 365
    ).cast(pl.UInt8),
)

Expressions within a single `with_columns()` call cannot refer to columns created by each other, because they are executed in parallel.

In [None]:
# This raises a ColumnNotFoundError:
# rebels.with_columns(
#     bmi=pl.col("mass") / ((pl.col("height") / 100) ** 2),
#     bmi_cat=pl.col("bmi").cut(
#         [18.5, 25], labels=["Underweight", "Normal", "Overweight"]
#     ),
# )

In [None]:
(
    rebels.with_columns(
        bmi=pl.col("mass") / ((pl.col("height") / 100) ** 2)
    ).with_columns(
        bmi_cat=pl.col("bmi").cut(
            [18.5, 25], labels=["Underweight", "Normal", "Overweight"]
        )
    )
)

In [None]:
# This raises a SyntaxError:
# starwars.select(
#    "name",
#    bmi=(pl.col("mass") / ((pl.col("height") / 100) ** 2)),
#    "species",
#)

In [None]:
(
    starwars.select(
        "name",
        (pl.col("mass") / ((pl.col("height") / 100) ** 2)).alias("bmi"),  
        "species",
    )
    .drop_nulls()
    .top_k(5, by="bmi")  
)

## Related Column Operations

### Dropping

In [None]:
rebels.drop("name", "screen_time", strict=False)  

In [None]:
rebels.select(~cs.by_name("name", "screen_time"))

In [None]:
rebels.select(cs.exclude("name", "screen_time"))

### Renaming

In [None]:
(
    rebels.rename({"homeworld": "planet", "mass": "weight"})
    .rename(lambda s: s.removesuffix("_color"))
    .select("name", "planet", "weight", "hair", "skin", "eye")  
)

### Stacking

If you have a second DataFrame or one or more Series that have the same length as
the first DataFrame, then you can combine them by horizontally stacking them.

In [None]:
rebel_names = rebels.select("name")
rebel_colors = rebels.select(cs.ends_with("_color"))
rebel_quotes = pl.Series(
    "quote",
    [
        "You know, sometimes I amaze myself.",
        "That doesn't sound too hard.",
        "I have a bad feeling about this.",
    ],
)

(rebel_names.hstack(rebel_colors).hstack([rebel_quotes]))  

### Adding Row Indices

In [None]:
rebels.with_row_index(name="rebel_id", offset=1)

# Filtering and Sorting Rows

In [None]:
tools = pl.read_csv("data/tools.csv")
tools

## Filtering Rows

<div style="width: 50%; margin: 0 auto;">

![](img/ppdg_1101.png)

</div>

### Filtering Based on Expressions

In [None]:
tools.filter(pl.col("cordless") & (pl.col("brand") == "Makita"))  

In [None]:
tools.filter(pl.col("cordless"), pl.col("brand") == "Makita")

### Filtering Based on Column Names

In [None]:
tools.filter("cordless")

## Sorting Rows

<div style="width: 50%; margin: 0 auto;">

![](img/ppdg_1102.png)

</div>

### Sorting Based on a Single Column

In [None]:
tools.sort("price")

### Sorting in Reverse

In [None]:
tools.sort("price", descending=True)

In [None]:
# This raises a TypeError:
# tools.sort("price", ascending=False)

### Sorting Based on Multiple Columns

In [None]:
tools.sort("brand", "price")

In [None]:
tools.sort("brand", "price", descending=[False, True])

### Sorting Based on Expressions

In [None]:
tools.sort(pl.col("rpm") / pl.col("price"))

## Related Row Operations

### Filtering Missing Values

In [None]:
tools.drop_nulls("rpm").height

In [None]:
tools.filter(pl.all_horizontal(pl.all().is_not_null())).height

### Slicing

In [None]:
tools.with_row_index().gather_every(2).head(3)

### Top and Bottom

In [None]:
tools.top_k(3, by="price")

### Sampling

In [None]:
tools.sample(fraction=0.2)

### Semi-Joins

In [None]:
saws = pl.DataFrame(
    {
        "tool": [
            "Table Saw",
            "Plunge Cut Saw",
            "Miter Saw",
            "Jigsaw",
            "Bandsaw",
            "Chainsaw",
            "Seesaw",
        ]
    }
)
tools.join(saws, how="semi", on="tool")

# Working with Textual, Temporal, and Nested Data Types

## String

### String Examples

In [None]:
corpus = pl.DataFrame(
    {
        "raw_text": [
            "  Data Science is amazing ",
            "Data_analysis > Data entry",
            " Python&Polars; Fast",
        ]
    }
)

corpus

In [None]:
corpus = corpus.with_columns(
    processed_text=pl.col("raw_text")  
    .str.strip_chars()  
    .str.to_lowercase()  
    .str.replace_all("_", " ")  
)
corpus

In [None]:
corpus.with_columns(
    first_5_chars=pl.col("processed_text").str.slice(0, 5),  
    first_word=pl.col("processed_text")
    .str.split(" ")  
    .list.get(0),  
    second_word=pl.col("processed_text").str.split(" ").list.get(1),  
)

In [None]:
corpus.with_columns(
    len_chars=pl.col("processed_text").str.len_chars(),  
    len_bytes=pl.col("processed_text").str.len_bytes(),  
    count_a=pl.col("processed_text").str.count_matches("a"),  
)

In [None]:
posts = pl.DataFrame(
    {"post": ["Loving #python and #polars!", "A boomer post without a hashtag"]}
)

hashtag_regex = r"#(\w+)"  # A "#" followed by one or more word characters

posts.with_columns(
    # Collect every match per post into a list
    hashtags=pl.col("post").str.extract_all(hashtag_regex)
)

## Categorical

In [None]:
cats = pl.DataFrame(
    {"name": ["Persian cat", "Siamese Cat", "Lynx", "Lynx"]},
    schema={"name": pl.Categorical},
)

cats.with_columns(name_physical=pl.col("name").to_physical())

### Categorical Examples

In [None]:
more_cats = pl.DataFrame(
    {"name": ["Maine Coon Cat", "Lynx", "Lynx", "Siamese Cat"]},
    schema={"name": pl.Categorical},
)

more_cats.with_columns(name_physical=pl.col("name").to_physical())

In [None]:
cats.join(more_cats, on="name")

In [None]:
with pl.StringCache():
    left = pl.DataFrame(
        {
            "categorical_column": ["value3", "value2", "value1"],
            "other": ["a", "b", "c"],
        },
        schema={"categorical_column": pl.Categorical, "other": pl.String},
    )
    right = pl.DataFrame(
        {
            "categorical_column": ["value2", "value3", "value4"],
            "other": ["d", "e", "f"],
        },
        schema={"categorical_column": pl.Categorical, "other": pl.String},
    )

In [None]:
left.join(right, on="categorical_column")

In [None]:
pl.enable_string_cache()

In [None]:
right.select(pl.col("categorical_column").cat.get_categories())

In [None]:
sorting_comparison_df = cats.select(cat_lexical=pl.col("name")).with_columns(
    cat_physical=pl.col("cat_lexical").to_physical()
)

sorting_comparison_df

In [None]:
# Categorical with physical ordering has since been deprecated;
# sorting is now always lexical, so this cell is left commented out.
# sorting_comparison_df.with_columns(
#     pl.col("cat_lexical").cast(pl.Categorical("physical"))
# ).sort(by="cat_lexical")

In [None]:
sorting_comparison_df.with_columns(
    pl.col("cat_lexical").cast(pl.Categorical("lexical"))
).sort(by="cat_lexical")

## Enum

In [None]:
bear_enum_dtype = pl.Enum(["Polar", "Panda", "Brown"])

bear_enum_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=bear_enum_dtype
)

bear_cat_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)

## Temporal

### Temporal Examples

#### Loading from a CSV file

In [None]:
pl.read_csv("data/all_stocks.csv", try_parse_dates=True)

#### Converting to and from a String

In [None]:
dates = pl.DataFrame({"date_str": ["2023-12-31", "2024-02-29"]}).with_columns(
    date=pl.col("date_str").str.to_date("%Y-%m-%d")
)

dates

In [None]:
dates.with_columns(formatted_date=pl.col("date").dt.to_string("%d-%m-%Y"))

#### Generating date ranges

In [None]:
pl.DataFrame(
    {
        "monday": pl.date_range(
            start=pl.date(2024, 10, 28),
            end=pl.date(2024, 12, 1),
            interval="1w",  # One-week steps, so every date is a Monday
            eager=True,  # Return a Series instead of an expression
        ),
    }
)

#### Time zones

In [None]:
pl.DataFrame(
    {
        "utc_mixed_offset": [
            "2021-03-27T00:00:00+0100",
            "2021-03-28T00:00:00+0100",
            "2021-03-29T00:00:00+0200",
            "2021-03-30T00:00:00+0200",
        ]
    }
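).with_columns(
    # %z parses the offset; the result is stored as a UTC datetime
    parsed=pl.col("utc_mixed_offset").str.to_datetime(
        "%Y-%m-%dT%H:%M:%S%z"
    )
).with_columns(
    # Show the same instants in local Amsterdam time
    converted=pl.col("parsed").dt.convert_time_zone("Europe/Amsterdam")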
)

## List

### List Examples

In [None]:
bools = pl.DataFrame({"values": [[True, True], [False, False, True], [False]]})

bools.with_columns(
    all_true=pl.col("values").list.all(),
    any_true=pl.col("values").list.any(),
)

In [None]:
groups = pl.DataFrame({"ages": [[18, 21], [30, 40, 50], [42, 69]]})

groups.with_columns(
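    over_forty=pl.col("ages").list.eval(
        pl.element() > 40,  # pl.element() refers to each value in the list
        parallel=True,  # Evaluate the lists in parallel
    )
).with_columns(
    all_over_forty=pl.col("over_forty").list.all()  # True if every value is True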
)

In [None]:
groups.with_columns(
    ages_sorted_descending=pl.col("ages").list.sort(descending=True)
)

In [None]:
groups.explode("ages")

In [None]:
groups.select(ages=pl.col("ages").list.explode())

## Array

### Array Examples

In [None]:
events = pl.DataFrame(
    [
        pl.Series(
            "location", ["Paris", "Amsterdam", "Barcelona"], dtype=pl.String
        ),
        pl.Series(
            "temperatures",
            [
                [23, 27, 21, 22, 24, 23, 22],
                [17, 19, 15, 22, 18, 20, 21],
                [30, 32, 28, 29, 34, 33, 31],
            ],
            dtype=pl.Array(pl.Int64, shape=7),
        ),
    ]
)

events

In [None]:
events.with_columns(
    median=pl.col("temperatures").arr.median(),
    max=pl.col("temperatures").arr.max(),
    warmest_dow=pl.col("temperatures").arr.arg_max(),
)

## Struct

### Struct Examples

In [None]:
from datetime import date

orders = pl.DataFrame(
    {
        "customer_id": [2781, 6139, 5392],
        "order_details": [
            {"amount": 250.00, "date": date(2024, 1, 3), "items": 5},
            {"amount": 150.00, "date": date(2024, 1, 5), "items": 1},
            {"amount": 100.00, "date": date(2024, 1, 2), "items": 3},
        ],
    },
)

orders

In [None]:
orders.select(pl.col("order_details").struct.field("amount"))

In [None]:
order_details_df = orders.unnest("order_details")

order_details_df

In [None]:
order_details_df.select(
    "amount",
    "date",
    "items",
    order_details=pl.struct(pl.col("amount"), pl.col("date"), pl.col("items")),
)

In [None]:
basket = pl.DataFrame(
    {
        "fruit": ["cherry", "apple", "banana", "banana", "apple", "banana"],
    }
)

basket

In [None]:
basket.select(pl.col("fruit").value_counts(sort=True))

In [None]:
basket.select(pl.col("fruit").value_counts(sort=True).struct.unnest())

# Joining and Concatenating

## Joining

### Join Strategies

In [None]:
df_left = pl.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})

df_right = pl.DataFrame({"key": ["B", "C", "D", "E"], "value": [5, 6, 7, 8]})

#### Inner

In [None]:
df_left.join(df_right, on="key", how="inner")

#### Full

In [None]:
df_left.join(df_right, on="key", how="full", suffix="_other")

#### Left

In [None]:
df_left.join(df_right, on="key", how="left")

#### Right

In [None]:
df_left.join(df_right, on="key", how="right")

#### Cross

In [None]:
df_left.join(df_right, how="cross")

#### Semi

In [None]:
df_left.join(df_right, on="key", how="semi")

#### Anti

In [None]:
df_left.join(df_right, on="key", how="anti")

### Joining on Multiple Columns

In [None]:
residences_left = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Dave"],
        "city": ["NY", "LA", "NY", "SF"],
        "age": [25, 30, 35, 40],
    }
)

departments_right = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Dave"],
        "city": ["NY", "LA", "NY", "Chicago"],
        "department": ["Finance", "Marketing", "Engineering", "Operations"],
    }
)

residences_left.join(departments_right, on=["name", "city"], how="inner")

## Vertical and Horizontal Concatenation

### Vertical

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "value": ["a", "b", "c"],
    }
)
df2 = pl.DataFrame(
    {
        "id": [4, 5],
        "value": ["d", "e"],
    }
)
pl.concat([df1, df2], how="vertical")

### Horizontal

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "value": ["a", "b", "c"],
    }
)
df2 = pl.DataFrame(
    {
        "value2": ["x", "y"],
    }
)
pl.concat([df1, df2], how="horizontal")

### Diagonal

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "value": ["a", "b", "c"],
    }
)
df2 = pl.DataFrame(
    {
        "value": ["d", "e"],
        "value2": ["x", "y"],
    }
)
pl.concat([df1, df2], how="diagonal")

### Align

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "value": ["a", "b", "c"],
    }
)
df2 = pl.DataFrame(
    {
        "value": ["a", "c", "d"],
        "value2": ["x", "y", "z"],
    }
)
pl.concat([df1, df2], how="align")

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2, 2],
        "value": ["a", "c", "b"],
    }
)
df2 = pl.DataFrame(
    {
        "id": [2, 2],
        "value": ["x", "y"],
    }
)
pl.align_frames(df1, df2, on="id")

### Relaxed

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "value": ["a", "b", "c"],
    }
)
df2 = pl.DataFrame(
    {
        "id": [4.0, 5.0],
        "value": [1, 2],
    }
)
# This raises a SchemaError because the column data types differ:
# pl.concat([df1, df2], how="vertical")

In [None]:
pl.concat([df1, df2], how="vertical_relaxed")

### Stacking

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2],
        "value": ["a", "b"],
    }
)
df2 = pl.DataFrame(
    {
        "id": [3, 4],
        "value": ["c", "d"],
    }
)
df1.vstack(df2)

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2],
        "value": ["a", "b"],
    }
)
df2 = pl.DataFrame(
    {
        "value2": ["x", "y"],
    }
)
df1.hstack(df2)

### Appending

In [None]:
series_a = pl.Series("a", [1, 2])
series_b = pl.Series("b", [3, 4])
series_a.append(series_b)

### Extending

In [None]:
df1 = pl.DataFrame(
    {
        "id": [1, 2],
        "value": ["a", "b"],
    }
)
df2 = pl.DataFrame(
    {
        "id": [3, 4],
        "value": ["c", "d"],
    }
)
df1.extend(df2)

# Reshaping

## Wide Versus Long DataFrames

In [None]:
grades_wide = pl.DataFrame(
    {
        "student": ["Jeroen", "Thijs", "Ritchie"],
        "math": [85, 78, 92],
        "science": [90, 82, 85],
        "history": [88, 80, 87],
    }
)

grades_wide

In [None]:
grades_long = pl.DataFrame(
    {
        "student": [
            "Jeroen",
            "Jeroen",
            "Jeroen",
            "Thijs",
            "Thijs",
            "Thijs",
            "Ritchie",
            "Ritchie",
            "Ritchie",
        ],
        "subject": [
            "Math",
            "Science",
            "History",
            "Math",
            "Science",
            "History",
            "Math",
            "Science",
            "History",
        ],
        "grade": [85, 90, 88, 78, 82, 80, 92, 85, 87],
    }
)

grades_long

## Pivot to a Wider DataFrame

In [None]:
grades = pl.DataFrame(
    {
        "student": [
            "Jeroen",
            "Jeroen",
            "Jeroen",
            "Thijs",
            "Thijs",
            "Thijs",
            "Ritchie",
            "Ritchie",
            "Ritchie",
        ],
        "subject": [
            "Math",
            "Science",
            "History",
            "Math",
            "Science",
            "History",
            "Math",
            "Science",
            "History",
        ],
        "grade": [85, 90, 88, 78, 82, 80, 92, 85, 87],
    }
)

grades

In [None]:
grades.pivot(index="student", on="subject", values="grade")

In [None]:
multiple_grades = pl.DataFrame(
    {
        "student": [
            "Jeroen",
            "Jeroen",
            "Jeroen",
            "Jeroen",
            "Jeroen",
            "Jeroen",
            "Thijs",
            "Thijs",
            "Thijs",
            "Thijs",
            "Thijs",
            "Thijs",
        ],
        "subject": [
            "Math",
            "Math",
            "Math",
            "Science",
            "Science",
            "Science",
            "Math",
            "Math",
            "Math",
            "Science",
            "Science",
            "Science",
        ],
        "grade": [85, 88, 85, 60, 66, 63, 51, 79, 62, 82, 85, 82],
    }
)

multiple_grades

In [None]:
multiple_grades.pivot(
    index="student", on="subject", values="grade", aggregate_function="mean"
)

In [None]:
multiple_grades.pivot(
    index="student",
    on="subject",
    values="grade",
    aggregate_function=pl.element().max() - pl.element().min(),
)

In [None]:
lf = pl.LazyFrame(
    {
        "col1": ["a", "a", "a", "b", "b", "b"],
        "col2": ["x", "x", "x", "x", "y", "y"],
        "col3": [6, 7, 3, 2, 5, 7],
    }
)

index = pl.col("col1")
on = pl.col("col2")
values = pl.col("col3")
unique_column_values = ["x", "y"]
aggregate_function = lambda col: col.tanh().mean()

lf.group_by(index).agg(
    aggregate_function(values.filter(on == value)).alias(value)
    for value in unique_column_values
).collect()

## Unpivot to a Longer DataFrame

In [None]:
grades_wide = pl.DataFrame(
    {
        "student": ["Jeroen", "Thijs", "Ritchie"],
        "math": [85, 78, 92],
        "science": [90, 82, 85],
        "history": [88, 80, 87],
    }
)

grades_wide

In [None]:
grades_wide.unpivot(
    index=["student"],
    on=["math", "science", "history"],
    variable_name="subject",
    value_name="grade",
)

In [None]:
df = pl.DataFrame(
    {
        "student": ["Jeroen", "Thijs", "Ritchie", "Jeroen", "Thijs", "Ritchie"],
        "class": [
            "Math101",
            "Math101",
            "Math101",
            "Math102",
            "Math102",
            "Math102",
        ],
        "age": [20, 21, 22, 20, 21, 22],
        "semester": ["Fall", "Fall", "Fall", "Spring", "Spring", "Spring"],
        "math": [85, 78, 92, 88, 79, 95],
        "science": [90, 82, 85, 92, 81, 87],
        "history": [88, 80, 87, 85, 82, 89],
    }
)
df

In [None]:
df.unpivot(
    index=["student", "class", "age", "semester"],
    on=["math", "science", "history"],
    variable_name="subject",
    value_name="grade",
)

## Transposing

In [None]:
grades_wide = pl.DataFrame(
    {
        "student": ["Jeroen", "Thijs", "Ritchie"],
        "math": [85, 78, 92],
        "science": [90, 82, 85],
        "history": [88, 80, 87],
    }
)

grades_wide

In [None]:
# A generator: transpose() takes only as many names as it needs
report_columns = (f"report_{i + 1}" for i, _ in enumerate(grades_wide.columns))

grades_wide.transpose(
    include_header=True,
    header_name="original_headers",
    column_names=report_columns,
)

## Exploding

In [None]:
grades_nested = pl.DataFrame(
    {
        "student": ["Jeroen", "Thijs", "Ritchie"],
        "math": [[85, 90, 88], [78, 82, 80], [92, 85, 87]],
    }
)

grades_nested

In [None]:
grades_nested.explode("math")

In [None]:
grades_nested = pl.DataFrame(
    {
        "student": ["Jeroen", "Thijs", "Ritchie"],
        "math": [[85, 90, 88], [78, 82, 80], [92, 85, 87]],
        "science": [[85, 90, 88], [78, 82], [92, 85, 87]],
        "history": [[85, 90, 88], [78, 82], [92, 85, 87]],
    }
)

grades_nested

In [None]:
# This raises a ShapeError:
# grades_nested.explode("math", "science", "history")

In [None]:
grades_nested_long = grades_nested.unpivot(
    index="student", variable_name="subject", value_name="grade"
)

grades_nested_long

In [None]:
grades_nested_long.explode("grade")

In [None]:
nested_lists = pl.DataFrame(
    {
        "id": [1, 2],
        "nested_value": [[["a", "b"]], [["c"], ["d", "e"]]],
    },
    strict=False,
)
nested_lists

In [None]:
nested_lists.explode("nested_value")

In [None]:
nested_lists.explode("nested_value").explode("nested_value")

## Partition into Multiple DataFrames

In [None]:
sales = pl.DataFrame(
    {
        "OrderID": [1, 2, 3, 4, 5, 6],
        "Product": ["A", "B", "A", "C", "B", "A"],
        "Quantity": [10, 5, 8, 7, 3, 12],
        "Region": ["North", "South", "North", "West", "South", "West"],
    }
)

In [None]:
sales.partition_by("Region")

In [None]:
sales_dict = sales.partition_by(["Region"], as_dict=True)

sales_dict

In [None]:
sales_dict[("North",)]