# <font color="#418FDE" size="6.5" uppercase>**Groupby and Aggregation**</font>

>Last update: 20260101.
    
By the end of this lecture, you will be able to:
- Translate typical pandas groupby and aggregation code into equivalent Polars groupby expressions. 
- Use Polars window functions to replace more complex pandas groupby and rolling patterns where appropriate. 
- Validate that migrated Polars aggregations match pandas results across multiple groups and metrics. 


## **1. Core Groupby Patterns**

### **1.1. Single Key Groupby Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_01_01.jpg?v=1767314626" width="250">



>* Single column groupby splits data into groups
>* Same concept, new syntax when switching libraries

>* Pick a group column and target columns
>* Apply chosen aggregations; syntax differs across tools

>* Recognize recurring single-key groupby usage patterns
>* Separate analytical intent from tool-specific syntax



In [None]:
#@title Python Code - Single Key Groupby Basics

# Demonstrate single key groupby using pandas and polars side by side.
# Show how to group by region and summarize sales amounts clearly.
# Help beginners see conceptual similarity despite different library syntax.

# !pip install pandas polars pyarrow

# Import required libraries for data handling and grouping.
import pandas as pd
import polars as pl

# Create a small sales dataset with regions and revenue values.
sales_data = {
    "region": ["North", "South", "North", "West", "South", "West"],
    "revenue_dollars": [1200, 800, 600, 1500, 700, 900],
}

# Build a pandas DataFrame from the sales dictionary.
pdf = pd.DataFrame(sales_data)

# Build a polars DataFrame from the same sales dictionary.
pldf = pl.DataFrame(sales_data)

# Perform single key groupby in pandas using region column.
pandas_grouped = pdf.groupby("region", as_index=False)["revenue_dollars"].sum()

# Perform single key groupby in polars using region column.
# Polars does not guarantee group order, so sort to match the pandas output.
polars_grouped = (
    pldf.group_by("region")
    .agg(pl.col("revenue_dollars").sum())
    .sort("region")
)

# Print original pandas DataFrame to show raw rows.
print("Original pandas sales data:")
print(pdf)

# Print pandas groupby result showing total revenue per region.
print("\nPandas revenue by region:")
print(pandas_grouped)

# Print polars groupby result showing total revenue per region.
print("\nPolars revenue by region:")
print(polars_grouped)



### **1.2. Multi Column Groupby**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_01_02.jpg?v=1767314692" width="250">



>* Multi column groupby creates groups from key combinations
>* Polars uses expressions evaluated within these groups

>* Group by multiple columns, then define metrics
>* Plan keys and metrics, write aggregation expressions

>* Handle complex multi key groupby with many metrics
>* Use one Polars groupby plus clear aggregations



In [None]:
#@title Python Code - Multi Column Groupby

# Demonstrate multi column groupby in pandas and Polars side by side.
# Show how to group by region and sales channel together.
# Compare total and average sales metrics across both libraries.

# !pip install polars pandas pyarrow  (pyarrow is needed by pl.from_pandas)

# Import required libraries for data handling and analysis.
import pandas as pd
import polars as pl

# Create a small pandas DataFrame with sales example data.
data = {
    "region": ["North", "North", "South", "South", "West", "West"],
    "channel": ["Online", "Store", "Online", "Store", "Online", "Store"],
    "revenue_usd": [1200, 800, 600, 400, 500, 700],
    "discount_percent": [10, 5, 0, 15, 5, 10],
}

# Build the pandas DataFrame using the example dictionary.
df_pd = pd.DataFrame(data)

# Perform pandas groupby using two keys region and channel.
result_pd = (
    df_pd.groupby(["region", "channel"], as_index=False)
    .agg({"revenue_usd": "sum", "discount_percent": "mean"})
)

# Convert the pandas DataFrame into a Polars DataFrame.
df_pl = pl.from_pandas(df_pd)

# Perform Polars groupby using two keys with expression aggregations.
result_pl = (
    df_pl.group_by(["region", "channel"])
    .agg([
        pl.col("revenue_usd").sum().alias("revenue_usd_sum"),
        pl.col("discount_percent").mean().alias("discount_percent_mean"),
    ])
)

# Print pandas multi column groupby aggregation result clearly.
print("Pandas multi column groupby result:\n", result_pd)

# Print Polars multi column groupby aggregation result clearly.
print("\nPolars multi column groupby result:\n", result_pl)



### **1.3. Aggregation Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_01_03.jpg?v=1767314716" width="250">



>* Grouping stays the same; aggregations change style
>* Define clear expressions for each group metric

>* Use explicit aggregation expressions instead of dictionaries
>* Name metrics clearly and adjust them easily

>* Explicitly define each metric and involved columns
>* Clear expressions prevent mistakes and aid maintenance



In [None]:
#@title Python Code - Aggregation Basics

# Demonstrate basic groupby aggregations in pandas and equivalent expressions in polars.
# Show multiple metrics per group with clear output column names.
# Help beginners see explicit aggregation expressions replacing pandas aggregation dictionaries.

# !pip install pandas polars

# Import required libraries for pandas and polars examples.
import pandas as pd
import polars as pl

# Create simple sales data with store, transactions, and revenue columns.
data = {
    "store": ["North", "North", "South", "South", "West", "West"],
    "transactions": [10, 15, 8, 12, 9, 11],
    "revenue_usd": [200, 330, 160, 250, 180, 220],
}

# Build pandas DataFrame from the sales data dictionary.
df_pd = pd.DataFrame(data)

# Perform pandas groupby with dictionary based aggregations.
summary_pd = df_pd.groupby("store").agg({
    "transactions": ["sum", "mean"],
    "revenue_usd": ["sum", "mean"],
})

# Print pandas grouped summary to compare with polars results.
print("Pandas groupby summary:\n", summary_pd)

# Build polars DataFrame from the same sales data dictionary.
df_pl = pl.DataFrame(data)

# Perform polars groupby using explicit aggregation expressions list.
summary_pl = (
    df_pl
    .group_by("store")
    .agg([
        pl.col("transactions").sum().alias("total_transactions"),
        pl.col("transactions").mean().alias("avg_transactions"),
        pl.col("revenue_usd").sum().alias("total_revenue_usd"),
        pl.col("revenue_usd").mean().alias("avg_revenue_usd"),
    ])
)

# Print polars grouped summary showing clear metric names per store.
print("\nPolars groupby summary:\n", summary_pl)
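
The pandas dictionary form above returns a two-level column index such as `("transactions", "sum")`, whereas Polars returns flat, aliased names. The sketch below shows one possible way to flatten the pandas columns before comparing. The `column_function` naming scheme is an assumption chosen for illustration, not a pandas convention.

```python
import pandas as pd

data = {
    "store": ["North", "North", "South", "South", "West", "West"],
    "transactions": [10, 15, 8, 12, 9, 11],
    "revenue_usd": [200, 330, 160, 250, 180, 220],
}

summary_pd = pd.DataFrame(data).groupby("store").agg({
    "transactions": ["sum", "mean"],
    "revenue_usd": ["sum", "mean"],
})

# Join each (column, function) pair into a single flat name.
summary_pd.columns = [f"{col}_{func}" for col, func in summary_pd.columns]

# Move the group key out of the index so it becomes a regular column.
summary_flat = summary_pd.reset_index()

print(summary_flat.columns.tolist())
# ['store', 'transactions_sum', 'transactions_mean', 'revenue_usd_sum', 'revenue_usd_mean']
```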



## **2. Advanced Polars Aggregations**

### **2.1. Custom Aggregation Expressions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_02_01.jpg?v=1767314743" width="250">



>* Define custom aggregations using declarative Polars expressions
>* Chain filters, conditionals, arithmetic for optimized metrics

>* Combine custom aggregations with window functions efficiently
>* Express filters and rolling logic in one plan

>* Compose vectorized pieces instead of custom functions
>* Combine filters, weights, windows for domain metrics



In [None]:
#@title Python Code - Custom Aggregation Expressions

# Demonstrate custom aggregation expressions using Polars groupby and window functions.
# Show conditional filtering and arithmetic inside a single aggregation expression.
# Compare simple average with custom conditional average per customer group.

# !pip install polars pyarrow

# Import required Polars library for DataFrame operations.
import polars as pl

# Create a small example dataset with customers and purchases.
data = {
    "customer_id": ["A", "A", "A", "B", "B", "C"],
    "order_number": [1, 2, 3, 1, 2, 1],
    "amount_usd": [20.0, 55.0, 80.0, 15.0, 120.0, 40.0],
}

# Build a Polars DataFrame from the example dictionary.
df = pl.DataFrame(data)

# Define a threshold for high value purchases in US dollars.
threshold = 50.0

# Compute simple and custom aggregations per customer using expressions.
agg_df = df.group_by("customer_id").agg(
    [
        pl.col("amount_usd").mean().alias("avg_amount"),
        pl.col("amount_usd")
        .filter(pl.col("amount_usd") > threshold)
        .mean()
        .alias("avg_high_value"),
        (pl.col("amount_usd").max() - pl.col("amount_usd").min()).alias("range_amount"),
    ]
)

# Add a window expression showing rolling high value average per customer.
window_df = df.with_columns(
    [
        pl.when(pl.col("amount_usd") > threshold)
        .then(pl.col("amount_usd"))
        .otherwise(None)
        .rolling_mean(window_size=2)
        .over("customer_id")
        .alias("rolling_high_avg"),
    ]
)

# Print the original data, aggregated results, and windowed custom metric.
print("Original purchases DataFrame:\n", df)
print("\nCustom aggregations per customer:\n", agg_df)
print("\nWindowed custom rolling high value average:\n", window_df)



### **2.2. Multi Column Aggregations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_02_02.jpg?v=1767314772" width="250">



>* Window functions add multi-column metrics per row
>* Shared grouping keys enable compact, efficient complex logic

>* Window functions capture relationships between columns
>* Add many rolling metrics in one window

>* Align individual metrics with group-level benchmarks
>* Layer multi column windows into one coherent pipeline



In [None]:
#@title Python Code - Multi Column Aggregations

# Demonstrate multi column window aggregations using Polars expressions.
# Show customer level metrics computed per transaction row context.
# Compare multiple windowed aggregations sharing identical grouping keys.

# !pip install polars pyarrow

# Import required libraries for dataframe creation and manipulation.
import polars as pl

# Create a small retail style dataset with multiple numeric columns.
data = {
    "customer_id": ["A", "A", "A", "B", "B", "C"],
    "transaction_id": [1, 2, 3, 1, 2, 1],
    "amount_usd": [40.0, 25.0, 35.0, 60.0, 20.0, 15.0],
    "discount_percent": [5.0, 10.0, 0.0, 15.0, 5.0, 0.0],
    "category": ["Books", "Games", "Books", "Electronics", "Books", "Games"],
}

# Build a Polars DataFrame from the dictionary data structure.
df = pl.DataFrame(data)

# Store the shared grouping key once so every window expression uses the same partition.
customer_key = "customer_id"

# Compute multi column aggregations within the same customer window.
result = df.with_columns([
    pl.col("amount_usd").sum().over(customer_key).alias("customer_total_spend"),
    pl.col("category").n_unique().over(customer_key).alias("customer_unique_categories"),
    pl.col("discount_percent").mean().over(customer_key).alias("customer_avg_discount"),
    # Row-level metric: no window is needed for the net amount.
    (pl.col("amount_usd") * (1 - pl.col("discount_percent") / 100)).alias("net_amount_usd"),
])

# Select and print a compact view showing original and derived metrics.
print(result.select([
    "customer_id",
    "transaction_id",
    "amount_usd",
    "discount_percent",
    "category",
    "customer_total_spend",
    "customer_unique_categories",
    "customer_avg_discount",
    "net_amount_usd",
]))



### **2.3. Missing Data Strategies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_02_03.jpg?v=1767314792" width="250">



>* Window functions must handle unavoidable data gaps
>* Different gap treatments change meaning of aggregations

>* Separate missing-data handling from aggregation logic
>* Control window construction to match analytical intent

>* Use windows to handle informative missing values
>* Compare parallel metrics to detect gaps versus change



In [None]:
#@title Python Code - Missing Data Strategies

# Demonstrate Polars window functions handling missing values in rolling aggregations.
# Compare filling missing values with zeros versus ignoring missing values in windows.
# Show how missing data strategies change rolling average temperature interpretations.

# !pip install polars pyarrow matplotlib

# Import required libraries for data handling and plotting.
import polars as pl
import matplotlib.pyplot as plt

# Create simple temperature data with intentional missing values.
data = pl.DataFrame({"day":[1,2,3,4,5,6,7],"temp_f":[70,None,75,80,None,85,90]})

# Define a rolling window size for three day averages.
window_size = 3

# Compute rolling average treating missing values as zeros explicitly.
filled_avg = data.with_columns([
    pl.col("temp_f").fill_null(0).rolling_mean(window_size).alias("avg_fill_zero")
])

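# Compute rolling average ignoring missing values within each window.
# min_samples=1 lets a window average whatever non-null values it contains;
# with the default, any window containing a null returns null.
# (Older Polars releases name this parameter min_periods.)
ignore_avg = data.with_columns([
    pl.col("temp_f").rolling_mean(window_size, min_samples=1).alias("avg_ignore_null")
])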

# Join both strategies into one comparison DataFrame.
comparison = (
    data
    .join(filled_avg.select(["day", "avg_fill_zero"]), on="day")
    .join(ignore_avg.select(["day", "avg_ignore_null"]), on="day")
)

# Print concise comparison to observe different missing data strategies.
print(comparison)

# Plot both rolling averages to visualize strategy differences clearly.
plt.plot(comparison["day"],comparison["avg_fill_zero"],marker="o",label="fill_missing_zero")

# Add second line for ignoring missing values in rolling averages.
plt.plot(comparison["day"],comparison["avg_ignore_null"],marker="x",label="ignore_missing")

# Label axes and add legend for clarity.
plt.xlabel("Day index")
plt.ylabel("Temperature Fahrenheit")
plt.legend()
plt.tight_layout()
plt.show()



## **3. Validating Window Aggregations**

### **3.1. Group Based Window Setup**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_03_01.jpg?v=1767314817" width="250">



>* Match groups, sort columns, and window ranges
>* Keep filters and preprocessing identical between systems

>* Plan how windows handle irregular group edges
>* Create varied test groups and compare window outputs

>* Ensure stable, deterministic sorting within each group
>* Align grouping, ordering, and tie-breaking across systems



In [None]:
#@title Python Code - Group Based Window Setup

# Demonstrate group based window setup using pandas and polars side by side.
# Show grouping keys, ordering columns, and stable tie breaking within groups.
# Compare cumulative sums to validate identical group window definitions.

# !pip install -q pandas polars pyarrow

# Import required libraries for pandas and polars usage.
import pandas as pd
import polars as pl

# Create small sales dataset with store, date, and sales columns.
data = {
    "store_id": ["A", "A", "A", "B", "B", "B"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02", "2024-01-02"],
    "event_id": [1, 2, 3, 1, 2, 3],
    "sales_dollars": [100, 50, 80, 60, 40, 30],
}

# Build pandas DataFrame and sort by grouping and ordering columns.
df_pd = pd.DataFrame(data).sort_values(["store_id", "date", "event_id"])

# Compute pandas cumulative sales within each store ordered by date and event.
df_pd["cum_sales_store"] = df_pd.groupby("store_id")["sales_dollars"].cumsum()

# Convert pandas DataFrame into polars DataFrame for comparison.
df_pl = pl.from_pandas(df_pd)

# Define polars expression for cumulative sum over group based window.
expr_cum = pl.col("sales_dollars").cum_sum().over("store_id")

# Apply expression and select relevant columns for clear comparison.
df_pl_result = df_pl.select([
    pl.col("store_id"),
    pl.col("date"),
    pl.col("event_id"),
    pl.col("sales_dollars"),
    expr_cum.alias("cum_sales_store_polars"),
])

# Merge pandas and polars results on keys to validate identical windows.
merged = df_pd.merge(
    df_pl_result.to_pandas(),
    on=["store_id", "date", "event_id", "sales_dollars"],
)

# Print merged comparison showing both cumulative columns side by side.
print(merged[["store_id", "date", "event_id", "cum_sales_store", "cum_sales_store_polars"]])



### **3.2. Cumulative Metrics Validation**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_03_02.jpg?v=1767314843" width="250">



>* Treat cumulative metrics as full time-based sequences
>* Match every step; ordering and grouping must align

>* Design small, focused test cases for cumulatives
>* Compare rowwise across systems, stressing tricky edge cases

>* Derived cumulative metrics magnify small counting mismatches
>* Validate components and final rates under messy data



In [None]:
#@title Python Code - Cumulative Metrics Validation

# Demonstrate validating cumulative metrics between pandas and Polars stepwise.
# Show identical cumulative sales trajectories for each store and date ordering.
# Highlight how mismatches appear when ordering or grouping definitions differ.

# !pip install pandas polars pyarrow numpy

import pandas as pd
import polars as pl
import numpy as np

# Create simple transaction data for two example stores.
data = {
    "store_id": ["A", "A", "A", "B", "B", "B"],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02", "2024-01-03"],
    "sales_dollars": [100, 50, 150, 80, 120, 200],
}

# Build pandas DataFrame with parsed dates and sorted rows.
df_pd = pd.DataFrame(data)
df_pd["date"] = pd.to_datetime(df_pd["date"])
df_pd = df_pd.sort_values(["store_id", "date"])

# Compute pandas cumulative sales per store ordered by date.
df_pd["cum_sales_pd"] = df_pd.groupby("store_id")["sales_dollars"].cumsum()

# Build Polars DataFrame with parsed dates and sorted rows.
df_pl = pl.DataFrame(data).with_columns(
    pl.col("date").str.strptime(pl.Date, strict=False).alias("date")
).sort(["store_id", "date"])

# Compute Polars cumulative sales using a cumulative sum over a store_id window.
df_pl = df_pl.with_columns(
    pl.col("sales_dollars").cum_sum().over("store_id").alias("cum_sales_pl")
)

# Convert Polars result to pandas for aligned comparison.
df_pl_pd = df_pl.to_pandas()

# Merge pandas and Polars cumulative sequences for rowwise validation.
merged = df_pd.merge(
    df_pl_pd[["store_id", "date", "cum_sales_pl"]],
    on=["store_id", "date"],
    how="inner",
)

# Add boolean flag showing whether cumulative values match within floating-point tolerance.
merged["match_flag"] = np.isclose(merged["cum_sales_pd"], merged["cum_sales_pl"])

# Select and print key columns to inspect cumulative trajectories.
print(merged[["store_id", "date", "sales_dollars", "cum_sales_pd", "cum_sales_pl", "match_flag"]])



### **3.3. Replacing pandas Rolling Patterns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_A/image_03_03.jpg?v=1767314867" width="250">



>* Carefully replicate original rolling window definitions
>* Validate new windows across groups and edge cases

>* Compare pandas and Polars rolling results rowwise
>* Use edge-case users to reveal subtle mismatches

>* Scale validation to full data and complexity
>* Use stats and deep dives to confirm equivalence



# <font color="#418FDE" size="6.5" uppercase>**Groupby and Aggregation**</font>


In this lecture, you learned to:
- Translate typical pandas groupby and aggregation code into equivalent Polars groupby expressions. 
- Use Polars window functions to replace more complex pandas groupby and rolling patterns where appropriate. 
- Validate that migrated Polars aggregations match pandas results across multiple groups and metrics. 

<font color='yellow'>Congratulations on completing this course!</font>