# <font color="#418FDE" size="6.5" uppercase>**Joins and Reshaping**</font>

>Last update: 20251227.
    
By the end of this lecture, you will be able to:
- Execute common join types in Polars that correspond to Pandas merge operations. 
- Reshape Polars DataFrames using pivot and melt to match existing Pandas wide‑to‑long and long‑to‑wide transformations. 
- Diagnose and resolve typical join and reshape issues that arise during migration, such as key mismatches and duplicated columns. 


## **1. Core Join Operations**

### **1.1. Core Join Types**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_01_01.jpg?v=1766897130" width="250">



>* Polars joins mirror familiar database join types
>* Inner and left joins control which rows remain

>* Right joins keep all rows from right table
>* Outer joins (`how="full"` in Polars) keep every row from both tables

>* Cross joins create all row combinations across tables
>* Used rarely for full Cartesian scenarios and comparisons



In [None]:
#@title Python Code - Core Join Types

# Demonstrate core join types using simple Polars DataFrames.
# Compare inner, left, right, outer, and cross joins visually.
# Keep outputs small and readable for beginner friendly exploration.

import polars as pl

# Create a small left table with customer identifiers and states.
left_customers = pl.DataFrame({"customer_id": [1, 2, 3], "state": ["NY", "CA", "TX"]})

# Create a small right table with customer identifiers and total orders.
right_orders = pl.DataFrame({"customer_id": [2, 3, 4], "orders": [5, 2, 7]})

# Show the original tables to understand starting information clearly.
print("Left table customers with states:")
print(left_customers)
print("\nRight table customers with orders:")
print(right_orders)

# Perform an inner join keeping only matching customer identifiers.
inner_join = left_customers.join(right_orders, on="customer_id", how="inner")
print("\nInner join keeps overlapping customers only:")
print(inner_join)

# Perform a left join keeping all left customers and matching orders.
left_join = left_customers.join(right_orders, on="customer_id", how="left")
print("\nLeft join keeps all left customers:")
print(left_join)

# Perform a right join keeping all right customers and matching states.
right_join = left_customers.join(right_orders, on="customer_id", how="right")
print("\nRight join keeps all right customers:")
print(right_join)

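# Perform a full (outer) join keeping every customer from both tables.
# Recent Polars versions name this "full"; coalesce=True merges both key columns into one.
outer_join = left_customers.join(right_orders, on="customer_id", how="full", coalesce=True)
print("\nOuter join keeps all customers from both:")
print(outer_join)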

# Perform a cross join showing every possible customer and order pairing.
cross_join = left_customers.join(right_orders, how="cross")
print("\nCross join shows every possible pairing:")
print(cross_join)



### **1.2. Key Based Joins**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_01_02.jpg?v=1766897144" width="250">



>* Choose join keys based on business meaning
>* Use correct column combinations to uniquely match rows

>* Different tables may name key columns differently
>* Explicitly map mismatched key names during joins

>* Use composite keys carefully to avoid misjoins
>* Check missing keys, uniqueness, and row counts



In [None]:
#@title Python Code - Key Based Joins

# Demonstrate key based joins using Polars DataFrames in simple examples.
# Show single key joins where column names match across both tables.
# Show composite and renamed keys where column names differ between tables.

import polars as pl

# Create a customers table with a single key column customer_id.
customers = pl.DataFrame({"customer_id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})

# Create an orders table using the same key column name customer_id.
orders = pl.DataFrame({"order_id": [101, 102, 103], "customer_id": [1, 2, 2]})

# Join on the shared key column customer_id using a left join type.
single_key_join = customers.join(orders, on="customer_id", how="left")

# Print the result to show how rows align using the single key.
print("Single key join using customer_id column:")
print(single_key_join)

# Create a daily_sales table using a composite key with customer_id and order_date.
daily_sales = pl.DataFrame({"customer_id": [1, 1, 2], "order_date": ["2025-01-01", "2025-01-02", "2025-01-01"], "dollars": [50, 75, 20]})

# Create a promotions table where the key columns use different names.
promotions = pl.DataFrame({"cust_id": [1, 2], "promo_date": ["2025-01-01", "2025-01-01"], "discount": [5, 3]})

# Join using a composite key mapping different column names between both tables.
composite_join = daily_sales.join(promotions, left_on=["customer_id", "order_date"], right_on=["cust_id", "promo_date"], how="left")

# Print the result to show how composite and renamed keys control matching.
print("\nComposite key join using customer_id and order_date columns:")
print(composite_join)



### **1.3. Column Suffix Strategies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_01_03.jpg?v=1766897159" width="250">



>* Overlapping column names in joins cause confusion
>* Use clear, consistent suffixes to distinguish sources

>* Generic suffixes become confusing in complex joins
>* Use descriptive suffixes to clarify column origins

>* Consistent suffixes improve maintenance, collaboration, debugging
>* Source-based suffix patterns reveal origins and issues



In [None]:
#@title Python Code - Column Suffix Strategies

# Demonstrate Polars join column suffix behavior clearly.
# Show default suffixes for overlapping column names visually.
# Show custom descriptive suffixes for overlapping column names.

import polars as pl

# Create a simple customers DataFrame with overlapping column names.
customers = pl.DataFrame({"customer_id": [1, 2], "region": ["East", "West"], "created_at": ["2024-01-01", "2024-02-01"]})

# Create a simple orders DataFrame with overlapping column names.
orders = pl.DataFrame({"customer_id": [1, 2], "region": ["North", "South"], "created_at": ["2024-03-01", "2024-04-01"]})

# Perform a join using default Polars suffixes for overlapping columns.
joined_default = customers.join(orders, on="customer_id", how="inner")

# Perform a join using custom descriptive suffixes for overlapping columns.
joined_custom = customers.join(orders, on="customer_id", how="inner", suffix="_order")

# Print the original DataFrames to understand overlapping column names.
print("Customers DataFrame:\n", customers)

# Print the orders DataFrame to compare overlapping column names.
print("\nOrders DataFrame:\n", orders)

# Print the joined result with default suffixes for overlapping columns.
print("\nJoined with default suffixes:\n", joined_default)

# Print the joined result with custom descriptive suffixes for overlapping columns.
print("\nJoined with custom '_order' suffix:\n", joined_custom)



## **2. Pivot and Melt Basics**

### **2.1. Wide to Long Melt**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_02_01.jpg?v=1766897187" width="250">



>* Melt converts many related columns into pairs
>* Id columns stay fixed while values stack vertically

>* Melt turns visit columns into visit, value pairs
>* Each patient gets one row per visit observation

>* Many domains store repeated measures in columns
>* Melting creates tidy long data for flexible analysis



In [None]:
#@title Python Code - Wide to Long Melt

# Demonstrate wide to long melt transformation using Polars DataFrame example.
# Show patient blood pressure readings reshaped from wide format to long format.
# Help beginners see how id columns stay fixed while value columns stack vertically.

import polars as pl

# Create a simple wide DataFrame with patient blood pressure readings.
wide_df = pl.DataFrame({
    "patient_id": [1, 2],
    "age_years": [45, 60],
    "bp_visit1": [120, 135],
    "bp_visit2": [118, 140],
})

# Show the original wide DataFrame for comparison clarity.
print("Original wide DataFrame:")
print(wide_df)

# Melt the DataFrame from wide format to long format.
# Polars 1.0 deprecated melt in favor of unpivot: id_vars becomes index, value_vars becomes on.
long_df = wide_df.unpivot(
    on=["bp_visit1", "bp_visit2"],
    index=["patient_id", "age_years"],
    variable_name="visit_label",
    value_name="blood_pressure",
)

# Show the resulting long DataFrame after melting transformation.
print("\nLong DataFrame after melt:")
print(long_df)



### **2.2. Pivoting Long To Wide**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_02_02.jpg?v=1766897201" width="250">



>* Turn long tables into wide column layouts
>* Choose id, pivot, and value columns explicitly

>* Pivot sales so categories become separate columns
>* Same idea applies to many measurement domains

>* Check uniqueness of index and pivot combinations
>* Choose aggregation that matches your business meaning



In [None]:
#@title Python Code - Pivoting Long To Wide

# Demonstrate long to wide pivot using Polars DataFrame pivot operation.
# Show how identifier columns stay rows while category values become columns.
# Aggregate duplicate combinations using sum to handle repeated measurements.

import polars as pl

# Create a simple long format sales table with repeated category entries.
data = {
    "store_id": ["A", "A", "A", "B", "B", "B"],
    "month": ["Jan", "Jan", "Jan", "Jan", "Jan", "Jan"],
    "category": ["electronics", "clothing", "electronics", "clothing", "groceries", "groceries"],
    "sales_dollars": [1200, 300, 800, 500, 200, 150],
}

# Build the Polars DataFrame from the dictionary data structure.
long_df = pl.DataFrame(data)

# Show the original long layout where categories appear as multiple rows.
print("Long layout sales table:")
print(long_df)

# Pivot to wide layout keeping store and month as identifier columns.
# Polars 1.x names the pivot column argument `on` and the aggregation `aggregate_function`.
wide_df = long_df.pivot(
    on="category",
    index=["store_id", "month"],
    values="sales_dollars",
    aggregate_function="sum",
)

# Show the wide layout where categories become separate columns.
print("\nWide layout after pivot:")
print(wide_df)



### **2.3. Pivot Aggregation Strategies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_02_03.jpg?v=1766897215" width="250">



>* Pivoting requires combining overlapping data values
>* Choose sum, average, min, or max thoughtfully

>* Different contexts need different pivot aggregations
>* The chosen aggregation changes how the wide table is interpreted

>* Match pivot aggregations to legacy tool behavior
>* Validate key cells to preserve meaning and trust



In [None]:
#@title Python Code - Pivot Aggregation Strategies

# Demonstrate pivot aggregation strategies using simple Polars examples.
# Compare sum and mean aggregations for repeated store month revenue values.
# Highlight how aggregation choices change the meaning of pivoted tables.

import polars as pl

# Create a simple long format sales DataFrame with repeated store month entries.
data = {
    "store": ["North", "North", "North", "South", "South", "South"],
    "month": ["Jan", "Jan", "Feb", "Jan", "Jan", "Feb"],
    "revenue_dollars": [100.0, 150.0, 200.0, 80.0, 120.0, 160.0],
}

sales_long = pl.DataFrame(data)

# Show the original long format data to understand repeated store month combinations.
print("Long format sales data with repeated store month rows:")
print(sales_long)

# Pivot using sum aggregation to combine repeated store month revenue values.
# Polars has no pivot_table method; DataFrame.pivot accepts the aggregation directly.
sales_pivot_sum = sales_long.pivot(
    on="month",
    index="store",
    values="revenue_dollars",
    aggregate_function="sum",
)

print("\nPivoted sales data using sum aggregation strategy:")
print(sales_pivot_sum)

# Pivot using mean aggregation to compare with the sum based pivot result.
sales_pivot_mean = sales_long.pivot(
    on="month",
    index="store",
    values="revenue_dollars",
    aggregate_function="mean",
)

print("\nPivoted sales data using mean aggregation strategy:")
print(sales_pivot_mean)



## **3. Join Migration Pitfalls**

### **3.1. Key Mismatches and Nulls**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_03_01.jpg?v=1766897236" width="250">



>* Mismatched join keys and nulls cause errors
>* Standardize types, formatting, and null handling before joining

>* Different ID formats stop matching related records
>* Unstandardized nulls silently exclude or misclassify data

>* Carefully inspect join keys and null patterns
>* Standardize types and null rules to prevent errors



In [None]:
#@title Python Code - Key Mismatches and Nulls

# Demonstrate join key mismatches between Pandas and Polars joins.
# Show how types and null encodings affect join results.
# Illustrate simple fixes for mismatched keys and null values.

import pandas as pd
import polars as pl

# Create small Pandas DataFrames with subtle key differences.
left_pd = pd.DataFrame({"id": [101, 102, 103], "value_left": ["A", "B", "C"]})
right_pd = pd.DataFrame({"id": ["101 ", "102", None], "value_right": [10, 20, 30]})

# Pandas refuses to merge an integer key with a string key, so catch the error.
try:
    merged_pd = left_pd.merge(right_pd, on="id", how="inner")
    print("Pandas inner join rows:", len(merged_pd))
except ValueError as exc:
    print("Pandas merge failed:", exc)

# Convert Pandas DataFrames to Polars DataFrames for comparison.
left_pl = pl.from_pandas(left_pd)
right_pl = pl.from_pandas(right_pd)

# Polars also rejects join keys whose dtypes differ (integer versus string).
try:
    merged_pl = left_pl.join(right_pl, on="id", how="inner")
    print("Polars inner join rows before cleaning:", merged_pl.height)
except pl.exceptions.PolarsError as exc:
    print("Polars join failed before cleaning:", exc)

# Clean keys in Polars by trimming whitespace (str.strip was renamed to str.strip_chars).
right_pl_clean = right_pl.with_columns([
    pl.col("id").str.strip_chars().alias("id")
])

# Replace placeholder null like strings with real nulls if needed.
# Then cast to Int64 so the key dtype matches the left table.
right_pl_clean = right_pl_clean.with_columns([
    pl.when(pl.col("id") == "N A").then(None).otherwise(pl.col("id")).cast(pl.Int64).alias("id")
])

# Perform Polars inner join again after cleaning keys and nulls.
# Null keys never match, so the row with a missing id is excluded.
merged_pl_clean = left_pl.join(right_pl_clean, on="id", how="inner")
print("Polars inner join rows after cleaning:", merged_pl_clean.height)

# Display final cleaned Polars join result for verification.
print("Final cleaned Polars join:")
print(merged_pl_clean)



### **3.2. Managing Duplicate Join Keys**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_03_02.jpg?v=1766897316" width="250">



>* Duplicate join keys create many-to-many joins
>* They inflate row counts, distort metrics, and hurt performance

>* Check assumed unique keys for hidden duplicates
>* Compare key counts and join rows to locate issues

>* Shape, filter, or aggregate to control duplicates
>* Document choices to keep joins meaningful and efficient



In [None]:
#@title Python Code - Managing Duplicate Join Keys

# Demonstrate duplicate join keys causing row multiplication in Polars joins.
# Compare Pandas style expectations with Polars join behavior using simple tables.
# Show how grouping can enforce uniqueness before performing a safer join.

import polars as pl

# Create a left table with unique customer identifiers and simple order counts.
left_orders = pl.DataFrame({"customer_id": [1, 2, 3], "orders": [2, 1, 4]})

# Create a right table with duplicate customer identifiers representing multiple records.
right_customers = pl.DataFrame({"customer_id": [1, 1, 2], "state": ["CA", "NY", "TX"]})

# Perform a join that multiplies rows because of duplicate keys on the right side.
joined_many = left_orders.join(right_customers, on="customer_id", how="inner")

# Aggregate the right table to enforce uniqueness before joining again safely.
right_unique = right_customers.group_by("customer_id").agg(pl.col("state").first())

# Perform a second join using the deduplicated right table for controlled row counts.
joined_unique = left_orders.join(right_unique, on="customer_id", how="inner")

# Print row counts and joined tables to compare duplicate key effects clearly.
print("Rows before join left:", left_orders.height)
print("Rows before join right:", right_customers.height)
print("Rows after join many:", joined_many.height)
print("Rows after join unique:", joined_unique.height)
print("Joined with duplicate keys:\n", joined_many)



### **3.3. Index Based Join Alignment**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_03/Lecture_B/image_03_03.jpg?v=1766897331" width="250">



>* Implicit index-based row alignment no longer happens
>* Always join using explicit keys to prevent mismatches

>* Row-based joins break after sorting or filtering
>* Always join on explicit, stable identifier keys

>* Trace table history to spot broken alignments
>* Rebuild explicit keys and make alignment verifiable



In [None]:
#@title Python Code - Index Based Join Alignment

# Demonstrate index based alignment differences between Pandas and Polars joins.
# Show how implicit index alignment can silently misalign related customer rows.
# Emphasize using explicit customer_id keys for safe predictable joins.

import pandas as pd
import polars as pl

# Create a simple Pandas DataFrame with customer_id and churn_score columns.
customers_pd = pd.DataFrame({"customer_id": [101, 102, 103], "churn_score": [0.2, 0.8, 0.5]})

# Create another Pandas DataFrame with the same index but shuffled customer_id order.
responses_pd = pd.DataFrame({"customer_id": [103, 101, 102], "responded": ["yes", "no", "yes"]})

# Show both Pandas tables to highlight that index positions match but customers differ.
print("Pandas customers table with index alignment:")
print(customers_pd)
print("\nPandas responses table with shuffled customer_id order:")
print(responses_pd)

# Perform a Pandas join that aligns only on the shared index positions, not customer_id.
joined_pd = customers_pd.join(responses_pd[["responded"]])

# Display the incorrect Pandas join where churn_score pairs with wrong responded values.
print("\nPandas join using implicit index alignment only:")
print(joined_pd)

# Convert the original Pandas tables into Polars DataFrames for explicit key joins.
customers_pl = pl.from_pandas(customers_pd)
responses_pl = pl.from_pandas(responses_pd)

# Perform a Polars join using explicit customer_id keys, ignoring any hidden index ideas.
joined_pl = customers_pl.join(responses_pl, on="customer_id", how="inner")

# Display the correct Polars join where churn_score aligns with matching customer_id.
print("\nPolars join using explicit customer_id key:")
print(joined_pl)



# <font color="#418FDE" size="6.5" uppercase>**Joins and Reshaping**</font>


In this lecture, you learned to:
- Execute common join types in Polars that correspond to Pandas merge operations. 
- Reshape Polars DataFrames using pivot and melt to match existing Pandas wide‑to‑long and long‑to‑wide transformations. 
- Diagnose and resolve typical join and reshape issues that arise during migration, such as key mismatches and duplicated columns. 

In the next module (Module 4), we will cover 'Lazy Queries and Performance'.