# <font color="#418FDE" size="6.5" uppercase>**Pandas vs Polars**</font>

>Last update: 20260101.
    
By the end of this Lecture, you will be able to:
- Explain the key architectural differences between pandas and Polars. 
- Map common pandas objects and operations to their closest Polars equivalents. 
- Evaluate when pandas remains sufficient and when Polars is likely to provide clear benefits. 


## **1. Pandas vs Polars Architecture**

### **1.1. Memory Models Compared**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_01.jpg?v=1767307182" width="250">



>* pandas object columns store each string as a separate Python object
>* That fragmented layout adds per-value overhead, hurting scalability

>* Polars stores each column in compact, typed Arrow buffers
>* Contiguous buffers improve cache use, speed, and memory efficiency

>* Object-heavy pandas columns risk memory bloat and out-of-memory errors
>* The columnar Arrow model stays compact and scales better



In [None]:
#@title Python Code - Memory Models Compared

# Show memory usage differences between pandas and Polars tables.
# Compare object heavy columns with compact typed columnar storage.
# Connect memory model ideas with simple printed size measurements.

# !pip install pandas polars pyarrow

# Import required libraries for data handling and measurement.
import pandas as pd
import polars as pl

# Use one million rows so the size differences are easy to see.
n_rows = 1_000_000

# Build pandas DataFrame using an explicit object-dtype string column.
pdf_object = pd.DataFrame({"city": ["New York"] * n_rows}, dtype=object)

# Build pandas DataFrame using efficient categorical encoded column.
pdf_categorical = pd.DataFrame({"city": pd.Categorical(["New York"] * n_rows)})

# Build Polars DataFrame using compact UTF8 typed column.
pldf = pl.DataFrame({"city": ["New York"] * n_rows})

# Define helper function printing memory usage in megabytes.
def show_size(label, size_bytes):
    size_mb = size_bytes / (1024 * 1024)
    print(f"{label}: {size_mb:.2f} MB")

# Measure pandas object column memory including index overhead.
size_pdf_object = pdf_object.memory_usage(deep=True).sum()

# Measure pandas categorical column memory including index overhead.
size_pdf_categorical = pdf_categorical.memory_usage(deep=True).sum()

# Measure Polars DataFrame memory using estimated size method.
size_pldf = pldf.estimated_size()

# Print clear header describing upcoming memory comparison results.
print("Memory usage for one million city rows:")

# Print memory usage for pandas object based column layout.
show_size("pandas object column", size_pdf_object)

# Print memory usage for pandas categorical encoded column layout.
show_size("pandas categorical column", size_pdf_categorical)

# Print memory usage for Polars compact columnar layout.
show_size("polars UTF8 column", size_pldf)



### **1.2. Row vs Columnar**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_02.jpg?v=1767307204" width="250">



>* pandas and Polars both keep tables column by column, not row by row
>* Row-oriented stores, typical of transactional databases, slow column-only operations across many rows

>* Polars stores each column together in typed, contiguous memory
>* This speeds up analytics and reduces memory use

>* Row-by-row habits in pandas (`iterrows`, `apply(axis=1)`) suit record-focused logic but scale poorly
>* Columnar expressions in Polars excel at large, analytical workloads



### **1.3. Parallel Execution Model**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_03.jpg?v=1767307232" width="250">



>* Pandas runs steps eagerly on one core
>* Polars plans, optimizes, and runs steps in parallel

>* Single-threaded processing underuses available CPU cores
>* Parallel model splits data, runs tasks simultaneously

>* Engine optimizes whole query plan before execution
>* Optimization removes unnecessary work; parallelism speeds up what remains



In [None]:
#@title Python Code - Parallel Execution Model

# Show simple timing difference between pandas and Polars operations.
# Illustrate single threaded versus parallel execution behavior conceptually.
# Use a medium sized dataset to keep runtime friendly.

# !pip install pandas polars pyarrow

# Import required libraries for data handling and timing.
import time as time_module
import pandas as pandas_module
import polars as polars_module

# Create a reasonably large row count for demonstration.
row_count = 2_000_000

# Build a pandas DataFrame with simple numeric columns.
pandas_df = pandas_module.DataFrame({"a": range(row_count), "b": range(row_count)})

# Define a function performing chained operations using pandas.
def run_pandas_operations(input_df):
    filtered = input_df[input_df["a"] % 2 == 0]
    enriched = filtered.assign(c=filtered["a"] * 1.5 + filtered["b"])
    grouped = enriched.groupby(enriched["a"] % 10).agg({"c": "mean"})
    return grouped

# Time the pandas operations using a high-resolution wall clock measurement.
start_pandas = time_module.perf_counter()
result_pandas = run_pandas_operations(pandas_df)
elapsed_pandas = time_module.perf_counter() - start_pandas

# Build an equivalent Polars DataFrame from the pandas DataFrame.
polars_df = polars_module.from_pandas(pandas_df)

# Define a function performing similar operations using Polars expressions.
def run_polars_operations(input_df):
    lazy_frame = input_df.lazy()
    lazy_filtered = lazy_frame.filter(polars_module.col("a") % 2 == 0)
    lazy_enriched = lazy_filtered.with_columns((polars_module.col("a") * 1.5 + polars_module.col("b")).alias("c"))
    lazy_grouped = lazy_enriched.group_by(polars_module.col("a") % 10).agg(polars_module.col("c").mean())
    return lazy_grouped.collect()

# Time the Polars operations which may use multiple cores internally.
start_polars = time_module.perf_counter()
result_polars = run_polars_operations(polars_df)
elapsed_polars = time_module.perf_counter() - start_polars

# Print concise timing comparison and small result samples.
print("Pandas elapsed seconds:", round(elapsed_pandas, 3))
print("Polars elapsed seconds:", round(elapsed_polars, 3))
print("Pandas result rows count:", len(result_pandas))
print("Polars result rows count:", result_polars.height)
print("Pandas head preview:\n", result_pandas.head())
print("Polars head preview:\n", result_polars.head())



## **2. Pandas to Polars Mapping**

### **2.1. DataFrame Concepts Compared**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_01.jpg?v=1767307257" width="250">



>* Both libraries center on a table-like DataFrame object
>* Same workflow, different internal design and optimization

>* Common table operations feel similar in both
>* Syntax differs, but high-level transformations match

>* Joins, concatenation, reshaping work similarly in both
>* Reuse existing workflow habits across both libraries



### **2.2. Series Index and Columns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_02.jpg?v=1767307267" width="250">



>* pandas relies heavily on an index, often implicitly
>* Polars has no index; ordinary columns act as row keys

>* pandas indexes become explicit date and category columns in Polars
>* You specify join keys and filter conditions explicitly

>* pandas Series carry their own index and align on it automatically
>* A Polars Series is a typed column; rows are matched through explicit key columns



In [None]:
#@title Python Code - Series Index and Columns

# Show how pandas index differs from Polars columns focus.
# Compare row labeling, selection, and alignment using both libraries.
# Emphasize explicit column keys instead of hidden index behavior.

# !pip install pandas polars

# Import pandas and polars libraries for comparison.
import pandas as pd
import polars as pl

# Create simple sales data with dates as index in pandas.
pd_sales = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "store": ["A", "A"], "sales": [100, 150]}).set_index("date")

# Show pandas DataFrame where date acts as hidden index.
print("Pandas DataFrame with date index:")
print(pd_sales)

# Select one row using index label in pandas.
print("\nPandas select by index label:")
print(pd_sales.loc["2024-01-02"])

# Create equivalent Polars DataFrame keeping date as normal column.
pl_sales = pl.DataFrame({"date": ["2024-01-01", "2024-01-02"], "store": ["A", "A"], "sales": [100, 150]})

# Show Polars DataFrame where date remains explicit column.
print("\nPolars DataFrame with date column:")
print(pl_sales)

# Select one row using explicit column filter in Polars.
row_pl = pl_sales.filter(pl.col("date") == "2024-01-02")

# Display Polars selection emphasizing explicit column based filtering.
print("\nPolars select by date column:")
print(row_pl)



### **2.3. LazyFrame Query Plans**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_03.jpg?v=1767307288" width="250">



>* Pandas runs each step immediately, creating intermediates
>* Polars LazyFrame stores and optimizes the whole plan

>* Each pandas intermediate costs memory and time
>* Polars builds one optimized lazy query, then executes it on `collect()`

>* A LazyFrame describes a DataFrame-style pipeline that Polars optimizes as a whole
>* Express whole workflows as one lazy integrated query



## **3. Choosing Pandas or Polars**

### **3.1. Quick Exploratory Workflows**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_01.jpg?v=1767307315" width="250">



>* Pandas suits quick exploration on smaller datasets
>* Familiar tools and ecosystem outweigh Polars' speed

>* Polars keeps exploration fast on very large data
>* Optimized, lazy engine supports many quick experiments

>* Match tool choice to project tempo, scale
>* Start in pandas, switch to Polars when constrained



### **3.2. Scaling Complex Pipelines**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_02.jpg?v=1767307327" width="250">



>* Large, repeatable pipelines expose tool performance limits
>* Polars optimizes complex workflows; pandas can bottleneck

>* Growing pandas pipelines hit memory and speed limits
>* Polars optimizes, parallelizes, and scales complex workflows

>* Small, infrequent pipelines can stay on pandas
>* Choose Polars when growth, speed, reliability matter



In [None]:
#@title Python Code - Scaling Complex Pipelines

# Demonstrate scaling complex pipelines using pandas and Polars side by side.
# Show eager versus lazy execution on a repeatable transformation pipeline.
# Highlight runtime differences when pipeline complexity and data volume increase.

# !pip install polars pandas

# Import required libraries for data handling and timing.
import pandas as pd
import polars as pl
import time

# Define a helper function creating synthetic daily sales data.
def create_sales_data(num_days, rows_per_day):
    dates = pd.date_range("2024-01-01", periods=num_days, freq="D")
    stores = [f"store_{i}" for i in range(10)]
    products = [f"product_{j}" for j in range(20)]
    n_rows = num_days * rows_per_day

    # Build each column in one pass instead of assigning cell by cell.
    return pd.DataFrame({
        "date": dates.repeat(rows_per_day),
        "store": [stores[i % len(stores)] for i in range(n_rows)],
        "product": [products[i % len(products)] for i in range(n_rows)],
        "units_sold": [(i % 50) + 1 for i in range(n_rows)],
        "unit_price_usd": [float((i % 30) + 5) for i in range(n_rows)],
    })

# Create a moderately sized dataset representing daily sales logs.
pandas_df = create_sales_data(num_days=60, rows_per_day=2000)

# Cast columns to simple numpy-backed dtypes so Polars conversion does not require pyarrow.
pandas_df = pandas_df.astype({
    "date": "datetime64[ns]",
    "store": "object",
    "product": "object",
    "units_sold": "int64",
    "unit_price_usd": "float64",
})

# Convert pandas DataFrame into Polars DataFrame for comparison.
polars_df = pl.from_pandas(pandas_df)

# Define a complex pandas pipeline executed eagerly step by step.
def run_pandas_pipeline(df):
    step1 = df[df["units_sold"] > 10]
    step2 = step1.assign(revenue_usd=step1["units_sold"] * step1["unit_price_usd"])
    step3 = step2.assign(revenue_usd=step2["revenue_usd"] * 1.05)
    step4 = step3.assign(revenue_usd=step3["revenue_usd"] * 0.97)
    step5 = step4.assign(revenue_usd=step4["revenue_usd"] * 1.02)

    grouped = step5.groupby(["date", "store"], as_index=False)["revenue_usd"].sum()
    result = grouped.sort_values(["date", "store"]).head(5)
    return result

# Define a comparable Polars lazy pipeline optimized before execution.
def run_polars_pipeline(df):
    lazy = df.lazy()
    lazy = lazy.filter(pl.col("units_sold") > 10)
    lazy = lazy.with_columns((pl.col("units_sold") * pl.col("unit_price_usd")).alias("revenue_usd"))
    lazy = lazy.with_columns((pl.col("revenue_usd") * 1.05).alias("revenue_usd"))
    lazy = lazy.with_columns((pl.col("revenue_usd") * 0.97).alias("revenue_usd"))

    lazy = lazy.with_columns((pl.col("revenue_usd") * 1.02).alias("revenue_usd"))
    lazy = lazy.group_by(["date", "store"]).agg(pl.col("revenue_usd").sum())
    lazy = lazy.sort(["date", "store"]).limit(5)
    result = lazy.collect()
    return result

# Time the pandas pipeline to observe eager execution cost.
start_pandas = time.perf_counter()
pandas_result = run_pandas_pipeline(pandas_df)
pandas_time = time.perf_counter() - start_pandas

# Time the Polars pipeline to observe lazy optimized execution.
start_polars = time.perf_counter()
polars_result = run_polars_pipeline(polars_df)
polars_time = time.perf_counter() - start_polars

# Print concise comparison of runtimes and sample outputs.
print("Pandas pipeline runtime seconds:", round(pandas_time, 4))
print("Polars pipeline runtime seconds:", round(polars_time, 4))
print("\nPandas aggregated sample rows:")
print(pandas_result)
print("\nPolars aggregated sample rows:")
print(polars_result)



### **3.3. Hybrid Pandas Polars Workflows**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_03.jpg?v=1767307367" width="250">



>* Combine pandas and Polars to exploit strengths
>* Offload heavy transformations to Polars, keep pandas where downstream tools expect it

>* Use Polars mid-pipeline for heavy transformations
>* Convert between pandas and Polars to balance performance

>* Use hybrid workflows when Polars removes bottlenecks
>* Keep light tasks in pandas, heavy in Polars



# <font color="#418FDE" size="6.5" uppercase>**Pandas vs Polars**</font>


In this lecture, you learned to:
- Explain the key architectural differences between pandas and Polars. 
- Map common pandas objects and operations to their closest Polars equivalents. 
- Evaluate when pandas remains sufficient and when Polars is likely to provide clear benefits. 

In the next Lecture (Lecture C), we will go over 'Setting Up Polars'.