# <font color="#418FDE" size="6.5" uppercase>**Pandas vs Polars**</font>

>Last update: 20251231.
    
By the end of this lecture, you will be able to:
- Compare the core data structures and execution models of Pandas and Polars. 
- Explain how Polars’ lazy and eager APIs relate to typical Pandas workflows. 
- Identify common areas where Pandas habits may not map directly to Polars. 


## **1. Pandas and Polars structures**

### **1.1. Pandas Core Structures**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_01.jpg?v=1767231027" width="250">



>* Pandas uses labeled in-memory DataFrame and Series
>* Labels enable spreadsheet-like access, alignment, and joins

>* A DataFrame is a collection of Series columns, backed by NumPy arrays by default
>* Vectorized operations are fast, but the data must fit in RAM

>* Pandas runs operations immediately, creating intermediates
>* Index-driven alignment shapes behavior, performance, and limitations



In [None]:
#@title Python Code - Pandas Core Structures

# Demonstrate Pandas DataFrame and Series core labeled structures.
# Show column oriented storage with row oriented access patterns.
# Illustrate eager execution and index based automatic alignment.

# !pip install pandas matplotlib seaborn  # Colab already includes these libraries.

# Import pandas library for DataFrame and Series structures.
import pandas as pd

# Create simple patient records as a Python dictionary.
patient_data = {"age_years": [30, 45, 60], "days_stayed": [3, 5, 2]}

# Build a DataFrame with custom index labels for patients.
patients_df = pd.DataFrame(patient_data, index=["patient_A", "patient_B", "patient_C"])

# Select a single labeled column as a Series object.
age_series = patients_df["age_years"]

# Print the full DataFrame showing labeled rows and columns.
print("Full patients DataFrame with labeled index and columns:")
print(patients_df)

# Print the age Series to highlight one dimensional labeled structure.
print("\nAge Series extracted from DataFrame structure:")
print(age_series)

# Create another Series with overlapping index labels for alignment.
extra_days = pd.Series({"patient_A": 1, "patient_C": 2})

# Add Series objects to show index based automatic alignment behavior.
adjusted_stay = patients_df["days_stayed"] + extra_days

# Print the result showing aligned addition and missing value handling.
print("\nAdjusted stay length after adding extra days by index:")
print(adjusted_stay)



### **1.2. Polars Core Structures**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_02.jpg?v=1767231050" width="250">



>* Polars DataFrame is column-based and Arrow-backed
>* Column layout enables fast, parallel analytics at scale

>* Series are typed, column-based units of computation
>* Arrow-backed Series enable fast, reliable large-scale operations

>* Expressions describe column operations before running
>* Engine optimizes whole plans for scalable performance



In [None]:
#@title Python Code - Polars Core Structures

# Demonstrate Polars DataFrame, Series, and expressions basics.
# Show columnar operations on small tabular customer transactions.
# Compare direct Series math with expression based column operations.

# !pip install polars pyarrow

# Import polars library for DataFrame and Series structures.
import polars as pl

# Create a small DataFrame with typed columns and rows.
transactions_df = pl.DataFrame({"customer_id": [1, 2, 1], "amount_usd": [20.0, 35.5, 15.0], "items": [2, 3, 1]})

# Show the DataFrame structure and column types briefly.
print("Polars DataFrame structure and data:")
print(transactions_df)

# Access a single column as a Series structure object.
amount_series = transactions_df["amount_usd"]

# Show the Series to highlight columnar typed data behavior.
print("\nAmount Series values and type:")
print(amount_series)

# Perform a vectorized Series operation using columnar contiguous memory.
amount_with_tax = amount_series * 1.07

# Print the computed Series to show efficient column wide math.
print("\nAmount Series with seven percent tax:")
print(amount_with_tax)

# Build an expression that describes revenue per item without computing it yet.
revenue_per_item_expr = (pl.col("amount_usd") / pl.col("items")).alias("revenue_per_item")

# Pass the expression to select, which evaluates it immediately on this eager DataFrame.
result_df = transactions_df.select(["customer_id", revenue_per_item_expr])

# Print final DataFrame showing expression based computation result.
print("\nRevenue per item computed using expressions:")
print(result_df)



### **1.3. Columnar Arrow Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_03.jpg?v=1767231096" width="250">



>* Columnar layout stores each column's values together in contiguous buffers
>* Arrow standardizes columnar memory, enabling zero-copy sharing

>* Columnar Arrow layout enables fast, cache-friendly scans
>* Metadata helps Polars optimize analytical operations efficiently

>* Arrow lets tools share data without copying
>* Enables declarative pipelines and Polars’ lazy execution



## **2. Polars Execution Model**

### **2.1. Rowwise Versus Columnar**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_01.jpg?v=1767231115" width="250">



>* Shift mindset from row-based to column-based thinking
>* Column operations enable faster, optimized Polars execution

>* Rowwise thinking processes one record at a time
>* Polars optimizes whole-column operations using lazy planning

>* Row-based custom loops don’t translate efficiently
>* Use column expressions to exploit Polars optimizations



In [None]:
#@title Python Code - Rowwise Versus Columnar

# Demonstrate rowwise thinking versus columnar thinking using Polars expressions.
# Show how per row loops differ from whole column operations conceptually.
# Keep the example small, clear, and beginner friendly for Colab users.

# !pip install polars pyarrow fsspec connectorx  # Only if needed in local environments.

# Import Polars for columnar DataFrame operations.
import polars as pl

# Create a tiny transactions DataFrame with three simple columns.
transactions = pl.DataFrame({"customer_id": [1, 1, 2, 2], "amount_usd": [5, 20, 50, 7], "is_online": [True, False, True, False]})

# Show the original data to understand the starting point clearly.
print("Original transactions DataFrame:\n", transactions)

# Imagine rowwise thinking as checking each record one by one conceptually.
print("\nConceptual rowwise thinking: check each transaction record individually.")

# Columnar thinking uses expressions that operate on whole columns at once.
# The comparison and the boolean column combine directly into a boolean result.
flagged = transactions.with_columns(
    ((pl.col("amount_usd") > 10) & pl.col("is_online")).alias("flag_suspicious")
)

# Show the new column created using columnar expressions instead of Python loops.
print("\nColumnar result with suspicious flag column:\n", flagged)

# Group by customer to aggregate flagged transactions using columnar operations.
# maintain_order=True keeps groups in first-seen order so the printed output is stable.
summary = flagged.group_by("customer_id", maintain_order=True).agg([
    pl.col("flag_suspicious").sum().alias("suspicious_count"),
    pl.col("amount_usd").sum().alias("total_amount_usd")
])

# Display the summary to highlight efficient column based aggregation behavior.
print("\nColumnar customer summary with counts and totals:\n", summary)



### **2.2. Lazy Query Planning**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_02.jpg?v=1767231137" width="250">



>* Polars records a plan instead of executing
>* Engine optimizes whole pipeline before running once

>* Lazy pipelines replace many intermediate DataFrames
>* Engine reorders steps to cut work and memory

>* Design full pipelines, then run them once
>* Trust lazy optimization, think declaratively for scalability



In [None]:
#@title Python Code - Lazy Query Planning

# Demonstrate Polars lazy planning versus eager Pandas style operations.
# Show that Polars builds a plan before executing transformations.
# Compare when work actually happens and what gets printed.

# !pip install polars pandas

# Import required libraries for dataframes and lazy queries.
import pandas as pd
import polars as pl

# Create a small Pandas DataFrame with simple transaction data.
pd_df = pd.DataFrame({"customer":["A","A","B","B"],"amount":[10,20,5,40]})

# Perform eager Pandas operations that run immediately and create intermediates.
pd_filtered = pd_df[pd_df["amount"] > 10]
pd_grouped = pd_filtered.groupby("customer")["amount"].sum()

# Print Pandas result showing eager execution and intermediate materialization.
print("Pandas eager result:")
print(pd_grouped)

# Create an equivalent Polars DataFrame from the same data dictionary.
pl_df = pl.DataFrame({"customer":["A","A","B","B"],"amount":[10,20,5,40]})

# Build a lazy query that only describes the transformation pipeline steps.
lazy_query = (
    pl_df.lazy()
    .filter(pl.col("amount") > 10)
    .group_by("customer")
    .agg(pl.col("amount").sum())
    .sort("customer")  # group_by does not guarantee row order, so sort for stable output.
)

# Print the lazy plan to show recorded operations without actual execution.
print("\nPolars lazy plan only:")
print(lazy_query.explain())

# Trigger execution explicitly to materialize the final Polars result.
result = lazy_query.collect()

# Print the executed Polars result after the lazy plan runs once.
print("\nPolars executed result:")
print(result)



### **2.3. Pandas Style Eager Operations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_03.jpg?v=1767231159" width="250">



>* Eager Polars runs commands immediately, like Pandas
>* Great for stepwise exploration with fast feedback

>* Everyday analysis maps easily to eager Polars
>* Prototype transformations interactively using a data scratchpad

>* Eager workflows can be converted into lazy queries
>* Helps transition from interactive exploration to scalable pipelines



In [None]:
#@title Python Code - Pandas Style Eager Operations

# Demonstrate Polars eager operations similar to familiar Pandas workflows.
# Show immediate feedback when filtering, selecting, and creating new columns.
# Contrast the eager DataFrame with an equivalent lazy query at the end.

# !pip install polars --quiet

# Import Polars library for columnar DataFrame operations.
import polars as pl

# Create small sales DataFrame directly inside the script.
sales_df = pl.DataFrame({"day": ["Mon", "Tue", "Tue", "Wed"], "country": ["US", "US", "CA", "US"], "device": ["mobile", "desktop", "mobile", "mobile"], "revenue_usd": [120.0, 80.0, 50.0, 200.0]})

# Show original eager DataFrame to understand starting point.
print("Original eager DataFrame:")
print(sales_df)

# Filter rows eagerly for United States mobile visitors only.
filtered_df = sales_df.filter((pl.col("country") == "US") & (pl.col("device") == "mobile"))

# Show filtered result immediately after applying conditions.
print("\nFiltered eager DataFrame (US mobile only):")
print(filtered_df)

# Add new eager column converting revenue from dollars to cents.
with_cents_df = filtered_df.with_columns((pl.col("revenue_usd") * 100).alias("revenue_cents"))

# Show updated DataFrame with new derived column included.
print("\nEager DataFrame with derived cents column:")
print(with_cents_df)

# Build equivalent lazy query using same transformation steps conceptually.
lazy_query = (
    sales_df.lazy()
    .filter((pl.col("country") == "US") & (pl.col("device") == "mobile"))
    .with_columns((pl.col("revenue_usd") * 100).alias("revenue_cents"))
)

# Collect lazy query to execute all steps together efficiently.
result_lazy = lazy_query.collect()

# Show lazy result matching eager pipeline output for comparison.
print("\nLazy query result matching eager pipeline:")
print(result_lazy)



## **3. Ecosystem and Integration**

### **3.1. NumPy and Arrow Bridges**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_01.jpg?v=1767231191" width="250">



>* Array conversions aren’t always cheap or simple
>* Be deliberate choosing when and how to convert

>* Older workflows treat tables as a staging step before arrays
>* Modern systems favor Arrow-style columnar data sharing

>* Tabular engines can replace many array workflows
>* Choose array or columnar formats deliberately for integration



### **3.2. Data IO Differences**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_02.jpg?v=1767231227" width="250">



>* Pandas and Polars handle data IO differently
>* Polars favors clean, columnar, high performance pipelines

>* Polars favors fast columnar formats like Parquet
>* Work with partitioned datasets and immutable outputs

>* Reading becomes a deferred, memory efficient plan
>* IO steps integrate into one optimized pipeline



In [None]:
#@title Python Code - Data IO Differences

# Demonstrate different data reading patterns between Pandas and Polars.
# Show eager in memory CSV loading versus a lazy Parquet scan plan.
# Highlight how lazy pipelines delay reading until explicit collection or writing.

# !pip install pandas polars pyarrow

# Import required libraries for data handling and comparison.
# pyarrow is not imported directly; Pandas uses it as the Parquet engine below.
import pandas as pd
import polars as pl

# Create a small Pandas DataFrame representing weekly sales data.
pd_sales = pd.DataFrame({"week":[1,2,3],"store":["A","A","B"],"dollars":[120.0,150.5,99.9]})

# Save the Pandas DataFrame as a CSV file on local storage.
pd_sales.to_csv("weekly_sales.csv",index=False)

# Save the same data as a Parquet file using Pandas convenience wrapper.
pd_sales.to_parquet("weekly_sales.parquet",engine="pyarrow",index=False)

# Read CSV eagerly with Pandas, data immediately loaded into memory.
pd_eager = pd.read_csv("weekly_sales.csv")

# Print a short message showing Pandas eager CSV reading result.
print("Pandas eager CSV rows:",len(pd_eager))

# Read Parquet lazily with Polars, building a deferred scan plan.
pl_lazy = pl.scan_parquet("weekly_sales.parquet")

# Show that Polars lazy object represents a plan, not materialized data.
print("Polars lazy plan type:",type(pl_lazy))

# Add a transformation lazily, filtering weeks with dollars above threshold.
pl_filtered = pl_lazy.filter(pl.col("dollars")>110.0)

# Collect materialized result, triggering actual Parquet reading and filtering.
pl_result = pl_filtered.collect()

# Print number of rows after lazy filter, demonstrating deferred execution.
print("Polars filtered lazy rows:",pl_result.height)

# Show that original CSV file remains unchanged, emphasizing immutable style.
print("Original CSV still rows:",len(pd.read_csv("weekly_sales.csv")))



### **3.3. Tooling Compatibility Shifts**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_03.jpg?v=1767231258" width="250">



>* Many tools assume inputs are Pandas DataFrames
>* Polars objects may need explicit conversion for compatibility

>* Polars pipelines need explicit conversion between tools
>* Plan boundaries, data types, and workflow memory carefully

>* Lazy queries hold no data until collected, which breaks some Pandas-based tools
>* Plan conversions, materialization, and Polars-aware alternatives



In [None]:
#@title Python Code - Tooling Compatibility Shifts

# Demonstrate Pandas friendly tooling expectations with simple plotting example.
# Show Polars requiring explicit conversion before using same plotting tool.
# Highlight mental shift around DataFrame types and ecosystem boundaries.

# !pip install polars pandas matplotlib

# Import required libraries for data handling and plotting.
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt

# Create simple Pandas DataFrame representing car speeds in miles per hour.
pd_df = pd.DataFrame({"car_model": ["A", "B", "C"], "speed_mph": [30, 45, 60]})

# Plot directly from Pandas DataFrame using built in plotting support.
pd_df.plot(x="car_model", y="speed_mph", kind="bar", title="Pandas direct plotting example")

# Show plot before moving to Polars conversion and compatibility shift.
plt.show()

# Create equivalent Polars DataFrame showing same car speed information.
pl_df = pl.DataFrame({"car_model": ["A", "B", "C"], "speed_mph": [30, 45, 60]})

# Convert Polars DataFrame into Pandas DataFrame for plotting compatibility.
pl_to_pd_df = pl_df.to_pandas()

# Plot converted DataFrame to highlight explicit conversion requirement.
pl_to_pd_df.plot(x="car_model", y="speed_mph", kind="bar", title="Polars converted plotting example")

# Show second plot and emphasize tooling compatibility shift visually.
plt.show()





In this lecture, you learned to:
- Compare the core data structures and execution models of Pandas and Polars. 
- Explain how Polars’ lazy and eager APIs relate to typical Pandas workflows. 
- Identify common areas where Pandas habits may not map directly to Polars. 

In the next lecture (Lecture C), we will go over 'Setting Up Polars'.