# <font color="#418FDE" size="6.5" uppercase>**Refactoring Patterns**</font>

>Last update: 20251228.
    
By the end of this lecture, you will be able to:
- Identify common Pandas coding styles that hinder straightforward migration to Polars. 
- Refactor Pandas pipelines into clearer, expression‑oriented Polars code using established patterns. 
- Plan incremental migration steps that allow coexistence of Pandas and Polars during transition. 


## **1. Pandas Anti-Patterns**

### **1.1. Chained Indexing Pitfalls**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_01_01.jpg?v=1766902339" width="250">



>* Chained indexing stacks multiple selections and assignments
>* Creates ambiguity about views, copies, and transformations

>* Chained assignment can silently update only a temporary copy
>* This causes hidden errors and complicates later migration

>* Chained indexing hides separate steps and intent
>* Unpacking chains is slow and error‑prone, and it blocks migration



In [None]:
#@title Python Code - Chained Indexing Pitfalls

# Demonstrate chained indexing pitfalls with simple hospital style example.
# Show difference between chained assignment and direct loc assignment.
# Help beginners see why chained indexing complicates Polars migration.

import pandas as pd

# Create simple DataFrame representing hospital admissions data.
data = {"patient_id": [1, 2, 3, 4], "ward": ["A", "A", "B", "B"], "trial": [False, False, False, False]}

df = pd.DataFrame(data)

# Show original DataFrame before any chained indexing operations.
print("Original DataFrame before any updates:")
print(df)

# Assign through a chained selection; the boolean mask returns a copy, so df stays unchanged.
subset = df[df["ward"] == "A"]["trial"]
subset[:] = True

# Show DataFrame after chained indexing assignment attempt.
print("\nDataFrame after chained indexing assignment attempt:")
print(df)

# Perform correct loc based assignment that reliably updates original DataFrame.
df.loc[df["ward"] == "A", "trial"] = True

# Show DataFrame after correct loc based assignment.
print("\nDataFrame after correct loc based assignment:")
print(df)



### **1.2. Index Centric Pitfalls**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_01_02.jpg?v=1766902356" width="250">



>* Overusing the index hides key data assumptions
>* Hidden index behavior causes bugs during migration

>* Index-based thinking hides how rows are identified
>* Frameworks needing explicit columns expose tangled logic

>* Index-heavy code creates fragile, error-prone pipelines
>* Make keys and ordering explicit columns for robustness



In [None]:
#@title Python Code - Index Centric Pitfalls

# Demonstrate index centric pitfalls using simple customer purchase data.
# Show how relying on index alignment can silently misalign customer rows.
# Contrast index based operations with explicit column based joins for safety.

import pandas as pd

# Create first DataFrame with customer id as index.
customers = pd.DataFrame(
    {"customer_id": [101, 102, 103], "spend_usd": [50, 80, 40]}
).set_index("customer_id")

# Create second DataFrame with duplicate customer id index.
visits = pd.DataFrame(
    {"customer_id": [101, 101, 103], "visits": [3, 2, 5]}
).set_index("customer_id")

# Show both tables to highlight duplicate index values problem.
print("Customers table with index based identifiers:\n", customers)
print("\nVisits table with duplicate index identifiers:\n", visits)

# Multiply the two Series directly, relying on silent index alignment.
wrong_result = customers["spend_usd"] * visits["visits"]

# Show the aligned result: customer 101 appears twice and customer 102 becomes NaN.
print("\nResult using index alignment only, potentially misleading:\n", wrong_result)

# Reset index to use explicit customer id column for safe merge.
customers_reset = customers.reset_index()
visits_reset = visits.reset_index()

# Perform explicit merge on customer id column to avoid hidden assumptions.
correct_result = customers_reset.merge(visits_reset, on="customer_id", how="inner")

# Compute total revenue using explicit columns instead of hidden index semantics.
correct_result["total_revenue_usd"] = correct_result["spend_usd"] * correct_result["visits"]

print("\nExplicit merge using customer_id column, clearly correct:")
print(correct_result)



### **1.3. In-Place Mutation Pitfalls**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_01_03.jpg?v=1766902377" width="250">



>* Stepwise in-place edits feel natural but are opaque
>* Side effects hide pipeline logic and block migration

>* Shared in-place edits create hidden dependencies
>* Mutations obscure logic and block automated migration

>* In place changes hide optimization and parallelism opportunities
>* Need non-mutating expressions before engines can optimize



In [None]:
#@title Python Code - In place Mutation Pitfalls

# Demonstrate inplace mutation pitfalls with Pandas data frames.
# Show how stepwise changes hide original data meaning.
# Contrast inplace style with clear nonmutating pipeline style.

import pandas as pd

# Create simple sales data frame with three toy orders.
data = {"order_id": [1, 2, 3], "price_usd": [10.0, 20.0, 30.0]}

sales_df = pd.DataFrame(data)

print("Original sales data frame:")
print(sales_df)

# Inplace mutation style that overwrites important original information.
sales_mut = sales_df.copy()  # Copy to avoid mutating original directly.

sales_mut["price_usd"] = sales_mut["price_usd"] * 1.1  # Add ten percent tax.

sales_mut.drop(columns=["order_id"], inplace=True)  # Drop identifier column inplace.

print("\nAfter inplace style mutations:")
print(sales_mut)

# Clear nonmutating style that keeps original data frame unchanged.
result_df = (
    sales_df.assign(price_with_tax_usd=sales_df["price_usd"] * 1.1)
    [["order_id", "price_usd", "price_with_tax_usd"]]
)

print("\nNonmutating pipeline result data frame:")
print(result_df)



## **2. Polars Refactor Patterns**

### **2.1. Expression Oriented Transformations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_02_01.jpg?v=1766902431" width="250">



>* Describe columns with reusable, combined expressions
>* Single expressions clarify logic and enable optimization

>* Think of transformations as connected expression tree nodes
>* One expression shows full logic, reducing mental load

>* Separate logic definition from execution for optimization
>* Reuse composable expressions to build efficient shared workflows



In [None]:
#@title Python Code - Expression Oriented Transformations

# Demonstrate expression oriented transformations using Polars expressions together in one step.
# Compare stepwise Pandas style with declarative Polars expression based style.
# Show how one combined expression replaces several intermediate transformation steps.

import pandas as pd
import polars as pl

# Create small customer data with total spend and discount percent columns.
# This simple dataset keeps printed output short and easy to understand.
# Revenue band will depend on spend, discount, and customer segment values.
customers_pd = pd.DataFrame({"customer": ["A", "B", "C"], "spend_usd": [120.0, 45.0, 260.0], "discount_pct": [10.0, 0.0, 20.0], "segment": ["standard", "budget", "premium"]})

# Imperative style example using Pandas with multiple intermediate mutation steps.
# Each step mutates the table and hides overall transformation intent.
customers_stepwise = customers_pd.copy()
customers_stepwise["net_spend"] = customers_stepwise["spend_usd"] * (1 - customers_stepwise["discount_pct"] / 100)

# Continue stepwise logic by defining revenue band using conditional operations.
# This style spreads logic across several lines and intermediate columns.
customers_stepwise["revenue_band"] = pd.cut(customers_stepwise["net_spend"], bins=[0, 50, 150, 1000], labels=["low", "medium", "high"])

# Expression oriented style using Polars with one reusable transformation expression.
# The net spend expression is defined once and composed into both output columns.
customers_pl = pl.from_pandas(customers_pd)
net_spend = pl.col("spend_usd") * (1 - pl.col("discount_pct") / 100)
result_pl = customers_pl.with_columns([
    net_spend.alias("net_spend"),
    (
        # Wrap labels in pl.lit so Polars treats them as string values, not column names.
        pl.when(net_spend < 50)
        .then(pl.lit("low"))
        .when(net_spend < 150)
        .then(pl.lit("medium"))
        .otherwise(pl.lit("high"))
        .alias("revenue_band")
    ),
])

# Print both tables to compare stepwise Pandas and expression oriented Polars results.
# Notice Polars keeps full transformation logic visible inside expression definitions.
print("Stepwise Pandas style table:\n", customers_stepwise)

# Print Polars result showing same columns computed using expression oriented transformations.
# Output remains short and readable for quick beginner friendly comparison.
print("\nExpression oriented Polars table:\n", result_pl)



### **2.2. Expression Based Replacements**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_02_02.jpg?v=1766902453" width="250">



>* Centralize column replacement rules into one expression
>* Improves clarity, testing, and performance of transformations

>* Treat each column as a replacement contract
>* One expression encodes all rules, ensuring transparency

>* Coordinate multi-column updates with conditional expressions
>* Avoid loops by applying clear replacements once



In [None]:
#@title Python Code - Expression Based Replacements

# Show expression based replacements using Polars expressions for one column transformation.
# Compare scattered stepwise replacements with one clear expression based replacement pattern.
# Demonstrate coordinated replacements using conditions referencing multiple related columns together.

import polars as pl

# Create a small DataFrame with messy customer tenure and discount values.
raw_data = {
    "customer_id": [1, 2, 3, 4],
    "tenure_months": [-3, None, 500, 12],
    "customer_type": ["corporate", "individual", "corporate", "individual"],
    "discount_percent": [10.0, 5.0, 15.0, None],
}

# Build the Polars DataFrame from the raw dictionary data structure.
df = pl.DataFrame(raw_data)

# Show the original messy data before any replacement logic is applied.
print("Original data with messy values:")
print(df)

# Define a single expression for tenure replacements using clear ordered rules.
tenure_clean_expr = (
    pl.when(pl.col("tenure_months").is_null())
    .then(0)
    .when(pl.col("tenure_months") < 0)
    .then(0)
    .when(pl.col("tenure_months") > 240)
    .then(240)
    .otherwise(pl.col("tenure_months"))
    .cast(pl.Int32)
)

# Define a discount replacement expression depending on customer type relationships.
discount_clean_expr = (
    pl.when(pl.col("customer_type") == "corporate")
    .then(0.0)
    .when(pl.col("discount_percent").is_null())
    .then(0.0)
    .otherwise(pl.col("discount_percent"))
)

# Apply both expressions in one pass using with_columns replacement pattern.
clean_df = df.with_columns(
    tenure_clean=tenure_clean_expr,
    discount_clean=discount_clean_expr,
)

# Show the cleaned columns that result from expression based replacements.
print("\nCleaned data with expression replacements:")
print(clean_df.select(["customer_id", "tenure_clean", "discount_clean"]))



### **2.3. Column Selection Patterns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_02_03.jpg?v=1766902473" width="250">



>* Legacy scripts have messy, ad hoc column choices
>* Use clear, declarative selections for essential columns

>* Select columns using naming patterns and groups
>* Keep selection rules stable while transformations change

>* Use column selection to manage temporary derived columns
>* Stage-based selections keep pipelines tidy and debuggable



In [None]:
#@title Python Code - Column Selection Patterns

# Demonstrate column selection patterns using Pandas and Polars together.
# Show selecting column groups using prefixes and suffixes clearly.
# Show cleaning temporary columns using explicit final column selection.

import pandas as pd
import polars as pl

# Create simple sales data with naming patterns for columns.
data = {
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 20, 10, 30],
    "sales_day1": [100.0, 80.0, 60.0, 40.0],
    "sales_day2": [90.0, 70.0, 50.0, 30.0],
    "tax_rate_pct": [8.0, 8.0, 8.0, 8.0],
    "discount_pct": [10.0, 0.0, 5.0, 0.0],
}

pdf = pd.DataFrame(data)
print("Original Pandas DataFrame columns:")
print(list(pdf.columns))

# Convert Pandas DataFrame into Polars DataFrame for refactoring.
pldf = pl.from_pandas(pdf)
print("\nOriginal Polars DataFrame preview:")
print(pldf.head(3))

# Select all sales columns using a prefix based regex pattern.
# Polars treats a column name that starts with ^ and ends with $ as a regular expression.
sales_cols = pl.col("^sales_.*$")
selected_sales = pldf.select(["order_id", "customer_id", sales_cols])

print("\nSelected sales columns using prefix pattern:")
print(selected_sales)

# Add temporary derived columns then select only final useful columns.
with_temp = pldf.with_columns([
    (pl.col("sales_day1") + pl.col("sales_day2")).alias("gross_sales"),
    (pl.col("discount_pct") / 100).alias("discount_rate"),
])

final_selected = with_temp.select([
    "order_id",
    "customer_id",
    pl.col("gross_sales"),
    (pl.col("gross_sales") * (1 - pl.col("discount_rate"))).alias("net_sales"),
])

print("\nFinal selected columns after cleaning temporaries:")
print(final_selected)



## **3. Incremental Migration Strategy**

### **3.1. Stepwise Module Migration**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_03_01.jpg?v=1766902498" width="250">



>* Migrate one small, well-defined module first
>* Use clear interfaces to swap Pandas for Polars

>* Break the pilot module into smaller stages
>* Migrate and validate each stage incrementally, safely

>* Prioritize modules and migrate them in sequence
>* Test, document, and keep changes reversible throughout



### **3.2. Parallel Pandas and Polars**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_03_02.jpg?v=1766902509" width="250">



>* Introduce Polars gradually in selected pipeline stages
>* Define data boundaries as contracts between libraries

>* Define and document consistent dataframe conversion points
>* Make coexistence strategy explicit to avoid confusion

>* Run both pipelines in shadow mode, compare outputs
>* Investigate mismatches, then gradually retire Pandas



In [None]:
#@title Python Code - Parallel Pandas and Polars

# Demonstrate parallel Pandas and Polars usage with clear conversion boundaries.
# Show a simple pipeline where Pandas and Polars coexist safely together.
# Compare results to build confidence before fully migrating existing workflows.

import pandas as pd
import polars as pl

# Create a small Pandas DataFrame representing daily sales in dollars.
data_pandas = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu"], "sales_usd": [120, 150, 90, 200]})

# Convert Pandas data into Polars at a clear pipeline boundary.
data_polars = pl.from_pandas(data_pandas)

# Use Polars for a performance critical transformation or aggregation step.
summary_polars = data_polars.select([pl.col("sales_usd").sum().alias("total_sales_usd"), pl.col("sales_usd").mean().alias("average_sales_usd")])

# Convert Polars result back into Pandas for legacy modeling code usage.
summary_pandas = summary_polars.to_pandas()

# Print both results to verify correctness and build trust in Polars.
print("Original Pandas data frame:")
print(data_pandas)

print("\nPolars summary frame:")
print(summary_polars)

print("\nConverted Pandas summary frame:")
print(summary_pandas)



### **3.3. Team Communication Practices**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_A/image_03_03.jpg?v=1766902534" width="250">



>* Clarify what changes, why, and for whom
>* Define migration boundaries to protect critical code

>* Create shared terms and living migration documentation
>* Record examples, trade‑offs, and equivalent framework steps

>* Use regular check-ins to surface migration issues
>* Share decisions early and adjust plans from feedback



# <font color="#418FDE" size="6.5" uppercase>**Refactoring Patterns**</font>


In this lecture, you learned to:
- Identify common Pandas coding styles that hinder straightforward migration to Polars. 
- Refactor Pandas pipelines into clearer, expression‑oriented Polars code using established patterns. 
- Plan incremental migration steps that allow coexistence of Pandas and Polars during transition. 

<font color='yellow'>Congratulations on completing this course!</font>