# <font color="#418FDE" size="6.5" uppercase>**Joins and Reshaping**</font>

>Last update: 20260101.
    
By the end of this lecture, you will be able to:
- Implement Polars joins that replicate common pandas merge and join patterns across multiple tables. 
- Apply Polars reshaping operations such as pivot and melt to reproduce pandas pivot_table and stack or unstack behavior. 
- Design an end-to-end Polars pipeline that combines joins and reshaping to match the output of an existing pandas workflow. 


## **1. Core Join Techniques**

### **1.1. Core Join Types**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_01_01.jpg?v=1767316373" width="250">



>* Inner, left, right joins mirror familiar pandas patterns
>* They differ in which rows each join keeps

>* Full outer joins keep all rows from both tables
>* Useful for complete views, with nulls where no match exists

>* Semi joins keep matching rows without extra columns
>* Anti joins keep only rows with no matches



In [None]:
#@title Python Code - Core Join Types

# Demonstrate core Polars join types with tiny example tables.
# Show inner, left, outer, semi, and anti joins clearly.
# Keep output small, readable, and beginner friendly.

# pip install polars if running outside Colab environment.

# Import polars library for DataFrame operations.
import polars as pl

# Create left customers table with simple customer identifiers.
customers = pl.DataFrame({"customer_id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})

# Create right orders table with matching and non matching customers.
orders = pl.DataFrame({"customer_id": [2, 3, 4], "order_total_usd": [50, 75, 20]})

# Show original customers table for reference context.
print("Customers table:\n", customers)

# Show original orders table for reference context.
print("\nOrders table:\n", orders)

# Perform inner join keeping only matching customer identifiers.
inner_join = customers.join(orders, on="customer_id", how="inner")

# Perform left join keeping all customers from left table.
left_join = customers.join(orders, on="customer_id", how="left")

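# Perform full (outer) join keeping all customers and orders rows.
# Recent Polars releases spell this "full"; "outer" is the older, deprecated name.
# coalesce=True merges both key columns into one, like a pandas outer merge.
outer_join = customers.join(orders, on="customer_id", how="full", coalesce=True)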

# Perform semi join filtering customers that have at least one order.
semi_join = customers.join(orders, on="customer_id", how="semi")

# Perform anti join finding customers without any matching orders.
anti_join = customers.join(orders, on="customer_id", how="anti")

# Print all join results in compact labeled form.
print("\nInner join:\n", inner_join, "\n\nLeft join:\n", left_join, "\n\nOuter join:\n", outer_join, "\n\nSemi join:\n", semi_join, "\n\nAnti join:\n", anti_join)



### **1.2. Key Based Joins**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_01_02.jpg?v=1767316394" width="250">



>* Carefully choose columns that uniquely identify records
>* Join keys must clearly map related rows across tables

>* Handle joins with renamed or multi-column keys
>* Preserve original key meaning when migrating workflows

>* Join keys must reflect data quality and meaning
>* Clear key semantics prevent subtle, hard-to-find errors



In [None]:
#@title Python Code - Key Based Joins

# Demonstrate Polars key based joins where key columns are named differently.
# Show joining sales facts to products and calendar using clear join keys.
# Compare pandas style thinking with Polars left_on and right_on syntax.

# pip install polars.

# Import Polars for DataFrame creation and joins.
import polars as pl

# Create a sales fact table with product and date identifiers.
sales_df = pl.DataFrame({"store_id": [1, 1, 2], "product_id": [101, 102, 101], "sale_date": ["2024-01-01", "2024-01-01", "2024-01-02"], "units_sold": [5, 3, 7]})

# Create a product dimension table with a differently named key column.
products_df = pl.DataFrame({"client_key": [101, 102], "product_name": ["Widget", "Gadget"], "category": ["Tools", "Electronics"]})

# Create a calendar table keyed by date for additional attributes.
calendar_df = pl.DataFrame({"calendar_date": ["2024-01-01", "2024-01-02"], "day_name": ["Monday", "Tuesday"]})

# Join sales to products using differently named key columns.
sales_products = sales_df.join(products_df, left_on="product_id", right_on="client_key", how="left")

# Join the result to calendar on the date key, which is named differently in each table.
final_joined = sales_products.join(calendar_df, left_on=["sale_date"], right_on=["calendar_date"], how="left")

# Select and print a few columns to show joined result.
print(final_joined.select(["store_id", "product_id", "product_name", "sale_date", "day_name", "units_sold"]))



### **1.3. Handling duplicate and missing keys**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_01_03.jpg?v=1767316416" width="250">



>* Real data has duplicate and missing join keys
>* These keys change join results and need preprocessing

>* Duplicate keys cause many-to-many join expansion
>* Polars preserves all combinations; aggregate or validate keys

>* Join type controls dropping or keeping missing keys
>* Polars treats missing keys as non-matching values



In [None]:
#@title Python Code - Handling duplicate and missing keys

# Demonstrate Polars joins with duplicate and missing keys clearly.
# Show how duplicate keys repeat matching rows and expand row counts.
# Show how missing keys behave differently across join types.

# !pip install polars pyarrow --quiet.

# Import required Polars library for DataFrame operations.
import polars as pl

# Create customers DataFrame with unique customer identifiers.
customers = pl.DataFrame({"customer_id": [1, 2, 3], "state": ["NY", "CA", "TX"]})

# Create orders DataFrame with duplicate and missing customer identifiers.
orders = pl.DataFrame({"order_id": [101, 102, 103, 104], "customer_id": [1, 1, 2, None]})

# Perform inner join to show duplicate key expansion clearly.
inner_join = customers.join(orders, on="customer_id", how="inner")

# Perform left join to keep all customers including unmatched ones.
left_join = customers.join(orders, on="customer_id", how="left")

# Perform right join to keep all orders including missing customer identifiers.
right_join = customers.join(orders, on="customer_id", how="right")

# Print inner join result showing duplicated customer one rows.
print("Inner join result with duplicate customer one rows:")
print(inner_join)

# Print left join result showing all customers preserved.
print("\nLeft join result preserving all customers including unmatched:")
print(left_join)

# Print right join result showing order with missing customer identifier.
print("\nRight join result preserving all orders including missing customer:")
print(right_join)



## **2. Polars Data Reshaping**

### **2.1. Wide To Long Pivoting**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_02_01.jpg?v=1767316436" width="250">



>* Convert many measurement columns into stacked rows
>* Use variable and value columns for analysis

>* Unpivoting creates one value column plus an indicator column
>* Long format simplifies comparisons, analysis, and visualization

>* Choose identifier columns and stack measurement columns
>* Long format enables flexible, reusable Polars pipelines



In [None]:
#@title Python Code - Wide To Long Pivoting

# Demonstrate wide to long pivoting using Polars DataFrame example.
# Show simple store revenue data reshaped from wide to long format.
# Compare original wide table and long tidy table for clarity.

# !pip install polars pyarrow --quiet.

# Import polars library for DataFrame operations.
import polars as pl

# Create a simple wide DataFrame with quarterly revenue columns.
stores_wide = pl.DataFrame({
    "store": ["North", "South"],
    "revenue_q1_usd": [1200, 900],
    "revenue_q2_usd": [1500, 1100],
})

# Print the original wide DataFrame for reference.
print("Wide format DataFrame:\n", stores_wide)

# Use unpivot (named melt before Polars 1.0) to turn quarterly columns into long format rows.
# On older releases the equivalent call is melt(id_vars=..., value_vars=...).
stores_long = stores_wide.unpivot(
    on=["revenue_q1_usd", "revenue_q2_usd"],
    index=["store"],
    variable_name="quarter",
    value_name="revenue_usd",
)

# Print the resulting long DataFrame showing one row per store quarter.
print("\nLong format DataFrame:\n", stores_long)



### **2.2. Column Melting Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_02_02.jpg?v=1767316458" width="250">



>* Melting converts wide tables into long format
>* Stacks measurement columns, repeats identifiers per row

>* Identifier columns stay fixed and repeat per row
>* Value columns melt into variable and value pairs

>* Melting converts wide, human-friendly tables to long
>* Long format supports trends, comparisons, and reshaping



In [None]:
#@title Python Code - Column Melting Basics

# Demonstrate basic column melting from wide to long format in Polars.
# Show difference between identifier columns and value columns clearly.
# Compare original wide table and melted long table visually.

# !pip install polars pyarrow --quiet.

# Import Polars library for DataFrame operations.
import polars as pl

# Create a simple wide DataFrame with student scores.
students_df = pl.DataFrame({
    "student_id": [1, 2],
    "school": ["North High", "South High"],
    "math_score": [88, 92],
    "science_score": [91, 85],
    "history_score": [79, 87],
})

# Print the original wide DataFrame for reference.
print("Original wide DataFrame with separate subject columns:")
print(students_df)

# Unpivot (named melt before Polars 1.0) subject score columns into long tidy format.
# On older releases the equivalent call is melt(id_vars=..., value_vars=...).
long_df = students_df.unpivot(
    on=["math_score", "science_score", "history_score"],
    index=["student_id", "school"],
    variable_name="subject",
    value_name="score",
)

# Print the melted long DataFrame showing subject and score columns.
print("\nMelted long DataFrame with subject and score columns:")
print(long_df)



### **2.3. Pivot Table Equivalents**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_02_03.jpg?v=1767316483" width="250">



>* Pivot tables summarize long data into matrices
>* Group, aggregate, then reshape summaries into rows-columns

>* Group long-form data and aggregate key values
>* Reshape grouped results into matrix-style comparison table

>* Support multiple measures, custom aggregations, hierarchies
>* Group, summarize, then reshape categories into columns



In [None]:
#@title Python Code - Pivot Table Equivalents

# Demonstrate Polars pivot table style reshaping from long transactional data.
# Show grouping, aggregation, and pivoting to create a summary matrix.
# Compare original long data with pivoted wide summary output.

# !pip install polars pyarrow.

# Import required Polars library for DataFrame operations.
import polars as pl

# Create simple long format sales data with regions and product categories.
data = {
    "region": ["North", "North", "South", "South", "West", "West"],
    "product": ["Gadget", "Widget", "Gadget", "Widget", "Gadget", "Widget"],
    "revenue_dollars": [1200, 800, 1500, 600, 900, 700],
}

# Build Polars DataFrame from the dictionary data structure.
df_sales = pl.DataFrame(data)

# Print original long format sales data for reference understanding.
print("Original long sales data:")
print(df_sales)

# Group by region and product then aggregate total revenue dollars.
df_grouped = df_sales.group_by(["region", "product"]).agg(
    pl.col("revenue_dollars").sum().alias("total_revenue"),
)

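# Pivot grouped data so products become columns like spreadsheet pivot tables.
# Polars 1.0+ names the spread column `on` (older releases used `columns`).
# Sort by region because group_by does not guarantee row order.
df_pivot = df_grouped.pivot(
    on="product",
    index="region",
    values="total_revenue",
).sort("region")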

# Print pivoted wide summary showing revenue by region and product.
print("\nPivot style revenue summary:")
print(df_pivot)



## **3. Polars Pipeline Design**

### **3.1. Combining Joins And Transforms**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_03_01.jpg?v=1767316506" width="250">



>* Plan joins and reshaping as one pipeline
>* Anticipate final shape to avoid awkward intermediates

>* Plan from final table and work backward
>* Use joins for context, reshaping for layout

>* Join type and keys shape later reshaping
>* Coordinate joins and pivots for robust pipelines



In [None]:
#@title Python Code - Combining Joins And Transforms

# Demonstrate combining joins and reshaping using Polars pipeline style.
# Show how multiple tables join then pivot into final summary.
# Keep example small, clear, and beginner friendly.

# !pip install polars.

# Import polars for DataFrame operations.
import polars as pl

# Create small customers table with simple attributes.
customers = pl.DataFrame({"customer_id": [1, 2], "state": ["CA", "NY"]})

# Create products table with categories and prices.
products = pl.DataFrame({"product_id": [10, 20, 30], "category": ["Tools", "Toys", "Tools"], "price_usd": [15.0, 8.0, 22.0]})

# Create transactions table linking customers and products.
transactions = pl.DataFrame({"customer_id": [1, 1, 2, 2], "product_id": [10, 20, 20, 30], "quantity": [2, 1, 3, 1]})

# Join transactions with products to attach categories and prices.
trx_with_products = transactions.join(products, on="product_id", how="left")

# Compute revenue dollars per transaction line.
trx_with_revenue = trx_with_products.with_columns((pl.col("quantity") * pl.col("price_usd")).alias("revenue_usd"))

# Join enriched transactions with customers for state information.
full_trx = trx_with_revenue.join(customers, on="customer_id", how="left")

# Aggregate revenue by customer and category before reshaping.
agg = full_trx.group_by(["customer_id", "category"]).agg(pl.col("revenue_usd").sum().alias("total_revenue_usd"))

# Pivot tall aggregated table into wide customer by category matrix.
# Polars 1.0+ names the spread column `on` (older releases used `columns`).
customer_category_matrix = agg.pivot(on="category", index="customer_id", values="total_revenue_usd").sort("customer_id")

# Print final matrix showing combined joins and transforms.
print(customer_category_matrix)



### **3.2. LazyFrame Pipeline Patterns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_03_02.jpg?v=1767316524" width="250">



>* Treat LazyFrames as blueprints for data pipelines
>* Deferred execution lets Polars optimize all transformations

>* Break pipelines into ingestion, joins, reshaping stages
>* Chain LazyFrame steps for clarity and optimization

>* Keep all intermediate steps inside one LazyFrame
>* This improves performance, memory use, and reproducibility



### **3.3. Validating Pipeline Equivalence**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_03_03.jpg?v=1767316550" width="250">



>* Define what “same result” means clearly
>* Set precise criteria to guide validation strategy

>* Test equivalence in layers using small datasets
>* Compare pandas and Polars outputs on key metrics

>* Validate with real data using summary metrics
>* Monitor and recheck pipelines continuously as things change



In [None]:
#@title Python Code - Validating Pipeline Equivalence

# Demonstrate validating pandas and Polars pipeline equivalence with simple metrics.
# Show how to define equivalence criteria and compare aggregated outputs.
# Use small customer order data to validate joins and reshaping equivalence.

# !pip install polars pandas pyarrow (pl.from_pandas and to_pandas rely on pyarrow).

# Import required libraries for pandas and Polars usage.
import pandas as pd
import polars as pl

# Create small pandas DataFrames representing customers and their orders.
customers_pd = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["East", "West", "East"]})
orders_pd = pd.DataFrame({"order_id": [10, 11, 12, 13], "customer_id": [1, 1, 2, 3], "revenue_usd": [100.0, 50.0, 80.0, 120.0]})

# Build original pandas pipeline joining tables and aggregating revenue by region.
pandas_result = (
    orders_pd.merge(customers_pd, on="customer_id", how="left")
    .groupby("region", as_index=False)["revenue_usd"].sum()
    .sort_values("region")
)

# Convert pandas DataFrames into Polars DataFrames for equivalent processing.
customers_pl = pl.from_pandas(customers_pd)
orders_pl = pl.from_pandas(orders_pd)

# Build Polars pipeline using join and group_by to mirror pandas behavior.
polars_result = (
    orders_pl.join(customers_pl, on="customer_id", how="left")
    .group_by("region")
    .agg(pl.col("revenue_usd").sum())
    .sort("region")
)

# Equivalence criterion: identical revenue totals per region, with both results sorted by region.
# Print the pandas result for reference.
print("Pandas region revenue totals:")
print(pandas_result.to_string(index=False))

# Print Polars result converted to pandas for easier side by side comparison.
print("Polars region revenue totals:")
print(polars_result.to_pandas().to_string(index=False))

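# Check equivalence: same regions in the same order and revenue totals equal within a small tolerance.
# Comparing values directly avoids false mismatches caused only by dtype differences.
pandas_sorted = pandas_result.reset_index(drop=True)
polars_as_pd = polars_result.to_pandas().reset_index(drop=True)
regions_match = pandas_sorted["region"].tolist() == polars_as_pd["region"].tolist()
revenue_match = all(abs(a - b) < 1e-9 for a, b in zip(pandas_sorted["revenue_usd"], polars_as_pd["revenue_usd"]))
match = regions_match and revenue_match
print("Pipelines produce equivalent regional revenue:", match)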



# <font color="#418FDE" size="6.5" uppercase>**Joins and Reshaping**</font>


In this lecture, you learned to:
- Implement Polars joins that replicate common pandas merge and join patterns across multiple tables. 
- Apply Polars reshaping operations such as pivot and melt to reproduce pandas pivot_table and stack or unstack behavior. 
- Design an end-to-end Polars pipeline that combines joins and reshaping to match the output of an existing pandas workflow. 

<font color='yellow'>Congratulations on completing this course!</font>