# <font color="#418FDE" size="6.5" uppercase>**Grouping and Aggregation**</font>

>Last update: 20251227.
    
By the end of this lecture, you will be able to:
- Perform groupby operations in Polars that mirror common Pandas groupby use cases. 
- Define multiple aggregations and custom expressions within a single Polars groupby call. 
- Compare performance and readability of Polars groupby pipelines to their Pandas counterparts on sample datasets. 


## **1. Polars Groupby Essentials**

### **1.1. Multi Key Grouping**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_01_01.jpg?v=1766895377" width="250">



>* Multi key grouping uses combinations of columns
>* Each unique value combination forms a detailed group

>* Group by multiple columns to reveal patterns
>* Polars handles multi key groups naturally and efficiently

>* Multi key grouping slices data into detailed segments
>* Polars efficiently summarizes statistics across these segments



In [None]:
#@title Python Code - Multi Key Grouping

# Demonstrate multi key grouping using Polars DataFrame operations.
# Show grouping by store and product category together.
# Compare single key and multi key aggregation outputs clearly.

import polars as pl

# Create a small retail style dataset with multiple grouping keys.
data = {
    "store": ["North", "North", "South", "South", "West", "West"],
    "category": ["Tools", "Garden", "Tools", "Garden", "Tools", "Garden"],
    "day": ["Mon", "Mon", "Mon", "Tue", "Tue", "Tue"],
    "revenue_dollars": [120, 80, 150, 60, 200, 90],
}


# Build a Polars DataFrame from the dictionary data structure.
df = pl.DataFrame(data)

# Show the original data to understand available grouping keys.
print("Original data with store, category, day, revenue:")
print(df)


# Group by a single key, here only by store column.
by_store = df.group_by("store").agg(pl.col("revenue_dollars").sum().alias("total_revenue"))

# Display total revenue per store using single key grouping.
print("\nTotal revenue grouped only by store:")
print(by_store)


# Group by two keys, store and category together as composite key.
by_store_category = df.group_by(["store", "category"]).agg(
    pl.col("revenue_dollars").sum().alias("total_revenue"))

# Display revenue per store and category pair using multi key grouping.
print("\nTotal revenue grouped by store and category:")
print(by_store_category)



### **1.2. Basic Group Aggregations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_01_02.jpg?v=1766895393" width="250">



>* Use groupby to compute basic summary statistics
>* Summaries collapse many rows into per-group totals

>* Grouped aggregations answer many real analysis questions
>* They turn raw rows into comparable summary metrics

>* Shift mindset from rows to aggregated views
>* Use grouped summaries to spot patterns and outliers



In [None]:
#@title Python Code - Basic Group Aggregations

# Demonstrate basic group aggregations using Polars on a tiny sales dataset.
# Show how to group by region and compute simple summary statistics clearly.
# Compare raw data and aggregated results to reinforce grouped aggregation concepts.

import polars as pl

# Create a small sales DataFrame with region, units, and revenue columns.
data = {
    "region": ["North", "North", "South", "South", "West", "West"],
    "units_sold": [10, 5, 8, 12, 7, 3],
    "revenue_usd": [200, 120, 160, 300, 140, 60],
}

# Build the Polars DataFrame from the dictionary data structure.
df = pl.DataFrame(data)

# Show the original row level data for context and understanding.
print("Original sales data by order row:")
print(df)

# Group by region and compute count, sum, and average aggregations.
agg_df = (
    df.group_by("region")
    .agg([
        pl.len().alias("order_count"),
        pl.col("units_sold").sum().alias("total_units"),
        pl.col("revenue_usd").sum().alias("total_revenue"),
        pl.col("revenue_usd").mean().alias("average_revenue"),
    ])
)

# Display the aggregated results, one summary row per region group.
print("\nAggregated sales metrics per region:")
print(agg_df)



### **1.3. Missing Groups and Categories**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_01_03.jpg?v=1766895426" width="250">



>* Groupby usually returns only observed categories
>* Know difference between missing groups and zero values

>* Groupby only returns categories present in data
>* Add missing categories later using reference tables

>* Distinguish missing groups from true zero values
>* Build full group grid to reveal data gaps



In [None]:
#@title Python Code - Missing Groups and Categories

# Show how missing categories disappear after grouping results.
# Demonstrate reintroducing all categories using a reference table.
# Compare grouped output before and after filling missing categories.

import polars as pl

# Create simple sales data with missing product categories.
data = pl.DataFrame({"month": ["Jan", "Jan", "Feb", "Feb"], "category": ["A", "B", "A", "A"], "sales_dollars": [100, 150, 80, 40]})

# Define all possible categories, including one never appearing in data.
all_categories = pl.DataFrame({"category": ["A", "B", "C"]})

# Group by month and category, summing sales dollars for each combination.
grouped = data.group_by(["month", "category"]).agg(pl.col("sales_dollars").sum().alias("total_sales"))

# Print grouped result, note category C never appears anywhere.
print("Grouped result without missing categories filled:")
print(grouped)

# Build full grid of months and all categories using cross join operation.
months = data.select("month").unique().sort("month")
full_grid = months.join(all_categories, how="cross")

# Join grouped data onto full grid, filling missing totals with zero values.
completed = full_grid.join(grouped, on=["month", "category"], how="left").with_columns(pl.col("total_sales").fill_null(0))

# Print completed result, now every month shows every category explicitly.
print("\nCompleted result with explicit zero sales categories:")
print(completed)



## **2. Multiple Polars Aggregations**

### **2.1. Multi Column Aggregations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_02_01.jpg?v=1766895442" width="250">



>* Compute many summaries across multiple columns together
>* Get a wide table of metrics per group

>* Combine many group metrics in one step
>* Reduces mental overhead and improves engine optimization

>* Group many stakeholder metrics in one summary
>* Single table improves consistency and repeatable analysis



In [None]:
#@title Python Code - Multi Column Aggregations

# Demonstrate grouped multi column aggregations using Polars DataFrame operations.
# Show several summary metrics computed together for multiple numeric columns.
# Compare results for grouped stores and months in a compact summary table.

import polars as pl

# Create a small example dataset with store, month, and transaction details.
data = {
    "store": ["A", "A", "A", "B", "B", "B"],
    "month": ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
    "revenue_usd": [120.0, 80.0, 150.0, 200.0, 90.0, 110.0],
    "discount_percent": [5.0, 10.0, 0.0, 15.0, 5.0, 0.0],
    "checkout_seconds": [45, 60, 50, 55, 65, 70],
}

# Build the Polars DataFrame from the dictionary data structure.
df = pl.DataFrame(data)

# Show the original data to understand the grouped transaction records.
print("Original transactions DataFrame:")
print(df)

# Group by store and month, then compute multiple column aggregations together.
summary = (
    df.group_by(["store", "month"]).agg(
        [
            pl.col("revenue_usd").sum().alias("total_revenue_usd"),
            pl.col("revenue_usd").max().alias("max_transaction_usd"),
            pl.col("discount_percent").mean().alias("avg_discount_percent"),
            pl.col("checkout_seconds").mean().alias("avg_checkout_seconds"),
        ]
    )
)

# Display the compact summary table with multiple metrics per group.
print("\nGrouped multi column aggregation summary:")
print(summary)



### **2.2. Readable Aggregation Aliases**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_02_02.jpg?v=1766895456" width="250">



>* Use clear aliases when aggregating many metrics
>* Descriptive names make grouped results readable and shareable

>* Use specific aliases to distinguish similar metrics
>* Consistent names reduce confusion and aid collaboration

>* Name custom metrics for meaning and transparency
>* Good aliases aid debugging, extension, and collaboration



In [None]:
#@title Python Code - Readable Aggregation Aliases

# Demonstrate Polars groupby aggregations with readable aliases for clarity.
# Show difference between automatic names and explicit alias names clearly.
# Help beginners understand why descriptive aggregation aliases improve readability.

import polars as pl

# Create a small sales DataFrame with store, orders, and revenue columns.
data = {"store": ["North", "North", "South", "South"], "orders": [10, 15, 8, 12], "revenue_usd": [200, 350, 160, 300]}

# Build the Polars DataFrame from the dictionary data structure.
df = pl.DataFrame(data)

# Group by store and aggregate without aliases; each output keeps its source column name.
no_alias_result = df.group_by("store").agg([pl.col("orders").sum(), pl.col("revenue_usd").mean()])

# Group by store and aggregate with clear aliases for each computed metric.
with_alias_result = df.group_by("store").agg([pl.col("orders").sum().alias("total_orders"), pl.col("revenue_usd").mean().alias("avg_revenue_usd")])

# Print both results to compare column names and overall readability.
print("Without readable aliases result:")
print(no_alias_result)

print("\nWith readable aliases result:")
print(with_alias_result)



### **2.3. Custom Aggregation Expressions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_02_03.jpg?v=1766895469" width="250">



>* Use custom groupby expressions for rich logic
>* Combine filters, sorting, and math per group

>* Chain filters and transformations inside each group
>* Compute tailored metrics like medians, ranges, extremes

>* Inline group logic encodes complex business rules
>* Reduces intermediate steps and errors in summaries



In [None]:
#@title Python Code - Custom Aggregation Expressions

# Demonstrate custom aggregation expressions within Polars groupby operations.
# Show filtering, conditional logic, and arithmetic inside grouped aggregations.
# Compare multiple tailored metrics computed per customer group.

import polars as pl

# Create a small transactions DataFrame with simple customer purchases.
data = {
    "customer_id": [1, 1, 1, 2, 2, 3],
    "month": ["Jan", "Feb", "Feb", "Jan", "Mar", "Jan"],
    "amount_usd": [50, 80, 40, 100, 60, 30],
    "used_coupon": [True, False, True, False, True, False],
}


df = pl.DataFrame(data)

# Show the original data for quick reference.
print("Original transactions DataFrame:")
print(df)

# Group by customer and define several custom aggregation expressions.
result = (
    df.group_by("customer_id")
    .agg(
        [
            pl.col("amount_usd").sum().alias("total_spend_usd"),
            pl.col("amount_usd").filter(pl.col("used_coupon")).sum().alias("coupon_spend_usd"),
            pl.col("amount_usd").max().alias("max_single_purchase_usd"),
            (pl.col("amount_usd").max() - pl.col("amount_usd").min()).alias("spend_range_usd"),
            pl.when(pl.col("used_coupon").any())
            .then(True)
            .otherwise(False)
            .alias("ever_used_coupon"),
        ]
    )
)


print("\nCustom aggregated metrics per customer:")
print(result)



## **3. Polars Groupby Performance**

### **3.1. Runtime Comparison**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_03_01.jpg?v=1766895490" width="250">



>* Runtime measures speed of grouping large datasets
>* Polars often groups faster than Pandas, especially on large data

>* Polars scales better as datasets grow larger
>* Parallelism and less data movement control runtimes

>* Real analyses chain many groupby transformations together
>* Polars optimizes whole pipelines for faster runtimes



In [None]:
#@title Python Code - Runtime Comparison

# Compare runtime of Pandas and Polars groupby operations simply.
# Generate a synthetic dataset with many rows for timing tests.
# Show which library finishes the groupby aggregation faster overall.

import time
import numpy as np
import pandas as pd

!pip install polars --quiet
import polars as pl

np.random.seed(42)
num_rows = 2_000_000

stores = np.random.randint(1, 101, size=num_rows)


sales = np.random.uniform(5.0, 500.0, size=num_rows)

pandas_df = pd.DataFrame({"store_id": stores, "sales_amount": sales})

polars_df = pl.DataFrame({"store_id": stores, "sales_amount": sales})

# Time the Pandas groupby with a monotonic, high-resolution clock.
start_pandas = time.perf_counter()

pandas_result = pandas_df.groupby("store_id")["sales_amount"].agg(["sum", "mean"])

end_pandas = time.perf_counter()

pandas_time = end_pandas - start_pandas

# Time the equivalent Polars group_by on the same data.
start_polars = time.perf_counter()

polars_result = polars_df.group_by("store_id").agg([
    pl.col("sales_amount").sum().alias("sum"),
    pl.col("sales_amount").mean().alias("mean"),
])

end_polars = time.perf_counter()

polars_time = end_polars - start_polars

print("Pandas groupby runtime seconds:", round(pandas_time, 4))

print("Polars groupby runtime seconds:", round(polars_time, 4))

print("Faster library for this run:", "Polars" if polars_time < pandas_time else "Pandas")



### **3.2. Memory Efficient Grouping**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_03_02.jpg?v=1766895516" width="250">



>* Grouping can use lots of memory quickly
>* Polars reuses columnar data to reduce memory

>* Naive groupby pipelines create many copies, wasting memory
>* Polars can combine operations in lazy mode, limiting intermediate copies

>* Lower memory use improves speed and stability
>* Enables larger, clearer groupby analyses on one machine



In [None]:
#@title Python Code - Memory Efficient Grouping

# Demonstrate memory efficient grouping using Polars and Pandas side by side.
# Create synthetic sales data and perform similar groupby aggregations.
# Compare approximate resident memory (RSS) growth and runtime for both libraries.

import time
import numpy as np
import pandas as pd

import psutil
import polars as pl

process = psutil.Process()
start_memory_mb = process.memory_info().rss / (1024 * 1024)

n_rows = 1_000_000
n_customers = 5_000

np.random.seed(42)
customer_ids = np.random.randint(0, n_customers, size=n_rows)

categories = np.random.choice(["tools", "sports", "clothes", "toys"], size=n_rows)

amounts = np.random.exponential(scale=50.0, size=n_rows)

pandas_start_time = time.time()

pdf = pd.DataFrame({"customer_id": customer_ids, "category": categories, "amount": amounts})

pandas_grouped = pdf.groupby(["customer_id", "category"], as_index=False).agg({"amount": ["sum", "mean"]})

pandas_runtime = time.time() - pandas_start_time

pandas_memory_mb = process.memory_info().rss / (1024 * 1024)

polars_start_time = time.time()

pldf = pl.DataFrame({"customer_id": customer_ids, "category": categories, "amount": amounts})

polars_grouped = pldf.group_by(["customer_id", "category"]).agg([pl.col("amount").sum().alias("total_amount"), pl.col("amount").mean().alias("avg_amount")])

polars_runtime = time.time() - polars_start_time

end_memory_mb = process.memory_info().rss / (1024 * 1024)

pandas_memory_increase = pandas_memory_mb - start_memory_mb

polars_memory_increase = end_memory_mb - pandas_memory_mb

print("Pandas groupby runtime seconds:", round(pandas_runtime, 4))

print("Polars groupby runtime seconds:", round(polars_runtime, 4))

print("Pandas approximate memory increase MB:", round(pandas_memory_increase, 2))

print("Polars approximate memory increase MB:", round(polars_memory_increase, 2))

print("Pandas grouped rows example head:")

print(pandas_grouped.head(3))

print("Polars grouped rows example head:")

print(polars_grouped.head(3))



### **3.3. Lazy Groupby Optimization**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_03_03.jpg?v=1766895537" width="250">



>* Describe full pipelines; Polars optimizes before running
>* Engine reorders work, reducing data and computations

>* Lazy queries filter and select columns early
>* Grouping runs on smaller data, saving work

>* Polars reuses work across many groupby aggregations
>* Whole pipeline optimized as one clear lazy query



In [None]:
#@title Python Code - Lazy Groupby Optimization

# Demonstrate lazy groupby optimization using Polars expressions.
# Compare eager and lazy pipelines on a simple sales dataset.
# Show how filtering before grouping reduces the rows being grouped.

import polars as pl

# Create a small example sales DataFrame with simple columns.
data = {
    "store": ["A", "A", "B", "B", "B", "C"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Mar", "Jan"],
    "revenue_usd": [100, 120, 90, 130, 80, 70],
}

sales_eager = pl.DataFrame(data)

# Eager style groups all rows first, then filters the aggregated totals.
eager_group = (
    sales_eager.group_by("store")
    .agg(pl.col("revenue_usd").sum().alias("total_revenue"))
)

recent_eager = eager_group.filter(pl.col("total_revenue") > 150)

print("Eager result after grouping everything:")
print(recent_eager)

# Lazy style filters months before grouping, reducing grouped rows.
sales_lazy = sales_eager.lazy()

lazy_pipeline = (
    sales_lazy
    .filter(pl.col("month").is_in(["Feb", "Mar"]))
    .group_by("store")
    .agg(pl.col("revenue_usd").sum().alias("total_revenue"))
    .filter(pl.col("total_revenue") > 100)
)

lazy_result = lazy_pipeline.collect()

print("\nLazy result with months filtered before grouping:")
print(lazy_result)



# <font color="#418FDE" size="6.5" uppercase>**Grouping and Aggregation**</font>


In this lecture, you learned to:
- Perform groupby operations in Polars that mirror common Pandas groupby use cases. 
- Define multiple aggregations and custom expressions within a single Polars groupby call. 
- Compare performance and readability of Polars groupby pipelines to their Pandas counterparts on sample datasets. 

In the next lecture (Lecture B), we will go over 'Joins and Reshaping'.