# <font color="#418FDE" size="6.5" uppercase>**Pandas vs Polars**</font>

>Last update: 20251227.
    
By the end of this lecture, you will be able to:
- Compare the core data structures and execution models of Pandas and Polars. 
- Explain how Polars’ lazy and eager APIs relate to typical Pandas workflows. 
- Identify common areas where Pandas habits may not map directly to Polars. 


## **1. Pandas and Polars Structures**

### **1.1. Pandas Core Structures**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_01.jpg?v=1766890070" width="250">



>* Series are labeled one-dimensional data arrays
>* DataFrames are indexed, spreadsheet-like tables of Series

>* By default, DataFrame columns are NumPy-backed and carry index and dtype metadata
>* Work in memory with immediate, stepwise transformations

>* Pandas runs each DataFrame step immediately in memory
>* Easy to inspect steps, but intermediate results can cost extra memory and time
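
The NumPy backing and label metadata mentioned above can be inspected directly. The following is a minimal sketch, assuming the default NumPy-backed dtypes rather than the optional Arrow-backed ones; the `sku_*` row labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical store data with explicit row labels (the index).
sales = pd.DataFrame(
    {"item": ["Soda", "Chips", "Candy"], "units_sold": [30, 45, 25]},
    index=["sku_1", "sku_2", "sku_3"],
)

# A numeric column exposes its values as a NumPy array.
units_array = sales["units_sold"].to_numpy()
print(type(units_array))

# The index and dtypes are the metadata Pandas carries alongside the values.
print(sales.index)
print(sales.dtypes)

# Labels allow lookups by name instead of by position.
chips_units = sales.loc["sku_2", "units_sold"]
print("Units sold for sku_2:", chips_units)
```

The label-based lookup with `.loc` is one of the Pandas habits that has no direct counterpart in Polars, which does not use a row index.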



In [None]:
#@title Python Code - Pandas Core Structures

# Demonstrate Pandas Series and DataFrame core structures clearly.
# Show labeled data with indexes and column names using small examples.
# Highlight eager execution by creating and transforming DataFrames stepwise.

import pandas as pd

# Create a simple Series representing daily high temperatures in Fahrenheit.
temps_series = pd.Series([72, 75, 70], index=["Mon", "Tue", "Wed"])

# Print the Series to show values and index labels together clearly.
print("Series: daily high temperatures (Fahrenheit)")
print(temps_series)

# Create a DataFrame representing small store sales with labeled columns.
data = {"item": ["Soda", "Chips", "Candy"], "units_sold": [30, 45, 25]}

sales_df = pd.DataFrame(data)

# Print the DataFrame to show table structure with rows and columns.
print("\nDataFrame: small store daily sales table")
print(sales_df)

# Perform an eager operation adding a new revenue column immediately.
sales_df["revenue_dollars"] = sales_df["units_sold"] * 2.5

# Print updated DataFrame to show new column created by eager execution.
print("\nUpdated DataFrame with revenue_dollars column")
print(sales_df)



### **1.2. Polars Core Structures**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_02.jpg?v=1766890086" width="250">



>* Polars DataFrame is column-based using Arrow arrays
>* Strictly typed Series enable fast, vectorized operations

>* Same structures support eager and lazy execution
>* Lazy mode builds optimizable graphs from operations

>* Transformations return new DataFrames, so each state stays clear
>* Structural clarity enables parallelism and efficient data processing



In [None]:
#@title Python Code - Polars Core Structures

# Demonstrate Polars DataFrame and Series core structures simply.
# Show eager versus lazy execution using the same table structures.
# Highlight column types and immutability using small customer purchase data.

import polars as pl

# Create a small Polars DataFrame with typed columns.
customers_df = pl.DataFrame({"customer_id": [1, 2, 3], "state": ["CA", "NY", "TX"], "dollars_spent": [120.5, 89.0, 150.0]})

# Show the DataFrame and its schema to highlight columnar typed structure.
print("Eager DataFrame view and schema:")
print(customers_df)
print(customers_df.schema)

# Demonstrate that a Series is a single typed column from the DataFrame.
spent_series = customers_df["dollars_spent"]
print("\nSingle Series column and its data type:")
print(spent_series)
print(spent_series.dtype)

# Show that transformations create new DataFrames instead of mutating originals.
with_discount_df = customers_df.with_columns((pl.col("dollars_spent") * 0.9).alias("discounted_dollars"))
print("\nOriginal dollars_spent column remains unchanged:")
print(customers_df["dollars_spent"])
print("Discounted dollars column in new DataFrame:")
print(with_discount_df["discounted_dollars"])

# Build a lazy query using the same core structures and then collect results.
lazy_query = customers_df.lazy().select([pl.col("state"), pl.col("dollars_spent").mean().alias("average_dollars")])
result_df = lazy_query.collect()
print("\nLazy query result DataFrame:")
print(result_df)



### **1.3. Columnar Arrow Foundations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_03.jpg?v=1766890101" width="250">



>* Columnar layout stores each column in contiguous memory
>* Arrow standardizes columns, enabling fast, shareable analytics

>* Polars stores columns as Arrow-style arrays
>* Enables fast analytics and often zero-copy interoperability

>* Row-oriented or object-based storage adds indirection and memory overhead
>* Arrow columns encourage whole-column, vectorized thinking
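
The difference between a row-oriented and a column-oriented layout can be illustrated with plain NumPy. This is only a sketch of the general idea, not a view into Polars or Arrow internals: a structured array stands in for row storage, and a dictionary of arrays stands in for columnar storage.

```python
import numpy as np

amounts = [10.0, 25.5, 7.0, 40.0, 15.5]

# Row-oriented layout: each record stores amount and quantity side by side.
row_layout = np.zeros(5, dtype=[("amount_usd", "f8"), ("quantity", "i8")])
row_layout["amount_usd"] = amounts

# Column-oriented layout: each column is its own contiguous array.
column_layout = {
    "amount_usd": np.array(amounts),
    "quantity": np.zeros(5, dtype="i8"),
}

amount_from_rows = row_layout["amount_usd"]
amount_from_columns = column_layout["amount_usd"]

# The stride is the number of bytes between consecutive values of one column.
print("Row layout stride (bytes):", amount_from_rows.strides[0])
print("Column layout stride (bytes):", amount_from_columns.strides[0])
print("Row layout column contiguous:", amount_from_rows.flags["C_CONTIGUOUS"])
print("Column layout column contiguous:", amount_from_columns.flags["C_CONTIGUOUS"])

# Both layouts hold the same values; only the memory arrangement differs.
print("Totals:", amount_from_rows.sum(), amount_from_columns.sum())
```

In the row layout, reading one column means skipping over the other fields of every record; in the columnar layout the values sit next to each other, which is what whole-column operations rely on.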



In [None]:
#@title Python Code - Columnar Arrow Foundations

# Demonstrate columnar thinking using Polars and Arrow style arrays.
# Compare whole column operations with row style mental models.
# Show efficient aggregations on contiguous numeric transaction columns.

import polars as pl

# Create simple transaction data with amounts and categories.
amounts_dollars = [10.0, 25.5, 7.0, 40.0, 15.5]
merchant_categories = ["grocery", "gas", "grocery", "electronics", "gas"]

# Build a Polars DataFrame that stores columns contiguously.
transactions = pl.DataFrame({"amount_usd": amounts_dollars, "category": merchant_categories})

# Show the DataFrame structure, emphasizing column based layout.
print("Polars transactions DataFrame with columnar layout:")
print(transactions)

# Compute total amount per category using whole column aggregation.
category_totals = transactions.group_by("category").agg(pl.col("amount_usd").sum().alias("total_usd"))

# Display aggregated results that use contiguous numeric column processing.
print("\nTotal amount per category using columnar aggregation:")
print(category_totals)



## **2. Polars Execution Model**

### **2.1. Rowwise to Columnar Mindset**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_01.jpg?v=1766890117" width="250">



>* Shift from row-by-row thinking to columns
>* Column focus changes transformations, performance, and reasoning

>* Think in whole columns, not individual rows
>* Use vectorized filters and aggregations for optimization

>* Describe relationships between whole columns, not rows
>* Engine fuses expressions, enabling optimization and parallelism



In [None]:
#@title Python Code - Rowwise to Columnar Mindset

# Show rowwise thinking versus columnar thinking using simple customer transactions.
# Compare manual per row loops with vectorized column operations using Polars expressions.
# Highlight how columnar mindset matches Polars execution and improves clarity and performance.

import polars as pl

# Create a small transactions DataFrame with simple customer purchase information.
# Columns include customer identifier, purchase price dollars, and purchase month name.
transactions = pl.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "price_usd": [20.0, 35.5, 12.0, 50.0],
    "month": ["November", "December", "December", "October"],
})

# Simulate rowwise mindset using a Python loop over individual transaction rows.
# We manually check each row month and accumulate December revenue using imperative style.
loop_total = 0.0
for row in transactions.iter_rows():
    customer, price, month = row
    if month == "December":
        loop_total += price

# Now use columnar mindset with a single expression over the entire price column.
# We filter December rows using a boolean mask and aggregate prices in one vectorized step.
columnar_total = (
    transactions
    .filter(pl.col("month") == "December")
    .select(pl.col("price_usd").sum())
    .item()
)

# Print both results to show they match while using very different mental models.
# Columnar approach describes transformations on columns instead of iterating individual rows.
print("Loop based December revenue total:", loop_total)
print("Columnar December revenue total:", columnar_total)



### **2.2. Lazy Query Planning**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_02.jpg?v=1766890133" width="250">



>* Describe transformations first; delay actual computation
>* Deferred pipeline enables global optimization and speed

>* Pandas runs each step immediately, creating intermediates
>* Polars builds one lazy plan, then optimizes execution

>* Plan the whole query like a roadmap
>* Engine reorders, pushes filters, shares work efficiently



In [None]:
#@title Python Code - Lazy Query Planning

# Demonstrate Polars lazy query planning versus eager Pandas style execution.
# Show that Polars builds a plan and executes only when collecting results.
# Compare printed outputs to highlight when work actually happens in each library.

import pandas as pd
import polars as pl

print("Creating small sales dataset for demonstration only.")

sales_data = {
    "store": ["A", "A", "B", "B", "C", "C"],
    "day": ["Mon", "Tue", "Mon", "Tue", "Mon", "Tue"],
    "revenue_usd": [120, 130, 200, 210, 90, 95],
}

pdf = pd.DataFrame(sales_data)

print("Pandas eager style executes each step immediately.")

pdf_filtered = pdf[pdf["revenue_usd"] > 100]

pdf_grouped = pdf_filtered.groupby("store")["revenue_usd"].mean()

print("Pandas result average revenue by store:")

print(pdf_grouped)

print("Now build equivalent Polars lazy query without immediate execution.")

pldf_lazy = pl.DataFrame(sales_data).lazy()

lazy_plan = (
    pldf_lazy
    .filter(pl.col("revenue_usd") > 100)
    .group_by("store")
    .agg(pl.col("revenue_usd").mean().alias("avg_revenue_usd"))
)

print("Polars has built a lazy plan but not executed yet.")

print("Trigger execution now by collecting final lazy result.")

result = lazy_plan.collect()

print("Polars lazy result average revenue by store:")

print(result)



### **2.3. Pandas Style Eager Use**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_03.jpg?v=1766890148" width="250">



>* Eager Polars feels like interactive Pandas workflows
>* Immediate feedback builds intuition for columnar operations

>* Eager Polars uses the same fast engine
>* Each step runs immediately with efficient performance

>* Start with eager prototypes on small samples
>* Then convert to lazy pipelines for optimization



In [None]:
#@title Python Code - Pandas Style Eager Use

# Demonstrate Polars eager mode with familiar stepwise data exploration.
# Show immediate results after each transformation using a small toy dataset.
# Compare original and transformed data to highlight conversational analysis style.

import sys
import subprocess
import importlib.util

# Ensure Polars is installed in the current environment.
# This works in Google Colab and standard notebook environments.
# Installation happens only when Polars is missing.
# Users do not need to run any extra commands.
if importlib.util.find_spec("polars") is None:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "polars", "--quiet"])

import polars as pl

# Create a tiny survey style dataset using a Polars DataFrame.
# Distances are in miles, and durations are in minutes.
# This small dataset keeps printed output short and readable.
# Column names are simple and descriptive for beginners.
df = pl.DataFrame({"name": ["Ann", "Bob", "Cara", "Dan"], "distance_miles": [1.2, 3.5, 0.8, 2.0], "duration_min": [15, 40, 10, 25]})

# Show the original data to emphasize immediate eager evaluation behavior.
# This print call displays the full DataFrame in a compact table.
# Learners can see raw values before any transformations.
print("Original data frame (eager mode):")
print(df)

# Add a new derived column using whole column operations, not row loops.
# Here we compute average walking speed in miles per hour units.
# The with_columns method executes immediately in eager mode.
# The result is another Polars DataFrame object.
df_speed = df.with_columns((pl.col("distance_miles") / (pl.col("duration_min") / 60)).alias("speed_mph"))

# Filter rows interactively to keep only the fastest walkers.
# This chained operation still runs eagerly, step by step.
# The filter condition uses the new derived speed column.
# With this toy data, only Bob (5.25 mph) exceeds the 5.0 mph threshold.
fast_walkers = df_speed.filter(pl.col("speed_mph") > 5.0)

# Show the transformed data to highlight conversational analysis style.
# Each transformation produced visible results without delayed execution.
# Learners can imagine iterating with more questions and filters.
# This demonstrates Polars eager mode as notebook friendly.
print("\nFast walkers with derived speed column:")
print(fast_walkers)



## **3. Polars Ecosystem Integration**

### **3.1. NumPy and Arrow Bridges**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_01.jpg?v=1766890170" width="250">



>* Pandas-style dataframe-to-NumPy conversion no longer automatic
>* Polars uses columnar Arrow model, requiring deliberate conversions

>* Whole-table array conversion is often inefficient
>* Select needed numeric columns; preserve rich types

>* Arrow tables become the shared data language
>* Minimize array conversions to keep pipelines efficient



In [None]:
#@title Python Code - NumPy and Arrow Bridges

# Demonstrate converting Polars data to NumPy arrays carefully.
# Show Arrow as an efficient bridge between tools.
# Highlight selecting numeric columns before numerical conversion.

import polars as pl
import numpy as np
from datetime import date

# Create a small Polars DataFrame with mixed column types.
# Include numeric, string, and date columns for illustration.
# This mimics a realistic customer behavior style dataset.

df = pl.DataFrame({"customer_id": [1, 2, 3, 4], "state": ["CA", "NY", "TX", "WA"], "spend_usd": [120.5, 80.0, 150.25, 60.75], "signup_date": pl.date_range(date(2024, 1, 1), date(2024, 1, 4), interval="1d", eager=True)})

# Show the original Polars DataFrame structure and column types.
# This highlights the rich typed, columnar representation.
# We keep the printed output compact and readable.

print("Original Polars DataFrame structure:")
print(df)

# Convert the entire DataFrame directly to a NumPy array.
# This forces mixed types into a single object dtype array.
# Important information like nullability or categories can be lost.

full_numpy = df.to_numpy()
print("\nNumPy array from full DataFrame:")
print(full_numpy)

# Select only numeric columns before conversion to NumPy.
# This matches better with numerical modeling expectations.
# The resulting array has a clean floating point dtype.

numeric_df = df.select(["spend_usd"])
numeric_numpy = numeric_df.to_numpy()
print("\nNumPy array from numeric column only:")
print(numeric_numpy)

# Use Arrow as an intermediate bridge representation.
# Convert the Polars DataFrame to an Arrow table efficiently.
# Then convert selected columns to NumPy when truly necessary.

arrow_table = df.to_arrow()
arrow_spend_array = arrow_table["spend_usd"].to_numpy()
print("\nArrow to NumPy for spend_usd column:")
print(arrow_spend_array)



### **3.2. Data Formats and IO**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_02.jpg?v=1766890185" width="250">



>* Pandas-style CSV-first thinking can mislead you
>* Polars favors columnar formats for real performance

>* Polars prefers partitioned columnar datasets over single CSVs
>* This enables pushdown, less IO, and memory savings

>* On-disk columnar files participate directly in computation
>* Lazy Parquet workflows avoid materializing full intermediates



In [None]:
#@title Python Code - Data Formats and IO

# Demonstrate Polars reading CSV and Parquet formats efficiently.
# Show lazy scanning of partitioned Parquet files on local storage.
# Compare row counts and highlight minimal memory materialization.

import polars as pl
import os as operating_system

# Create small example dataframe representing daily sales records.
example_df = pl.DataFrame({"day": ["2025-01-01", "2025-01-02"], "sales_dollars": [120.0, 150.0]})

# Save dataframe as CSV text file and Parquet columnar file formats.
example_df.write_csv("sales_example.csv")
example_df.write_parquet("sales_example.parquet")

# Read entire CSV eagerly into memory and show resulting row count.
csv_df = pl.read_csv("sales_example.csv")
print("CSV eager rows:", csv_df.height)

# Read Parquet lazily using scan_parquet without immediate materialization.
parquet_lazy = pl.scan_parquet("sales_example.parquet")
filtered_lazy = parquet_lazy.filter(pl.col("sales_dollars") > 130.0)

# Collect filtered lazy result and show resulting row count.
filtered_parquet_df = filtered_lazy.collect()
print("Parquet lazy filtered rows:", filtered_parquet_df.height)

# Create directory for partitioned Parquet files representing separate days.
partition_directory = "sales_partitions"
operating_system.makedirs(partition_directory, exist_ok=True)

# Write separate Parquet files for each day partition inside directory.
for index, row in enumerate(example_df.iter_rows()):
    day_value, sales_value = row
    partition_path = operating_system.path.join(partition_directory, f"day_{index}.parquet")
    pl.DataFrame({"day": [day_value], "sales_dollars": [sales_value]}).write_parquet(partition_path)

# Lazily scan all partitioned Parquet files using wildcard pattern.
partition_lazy = pl.scan_parquet(operating_system.path.join(partition_directory, "*.parquet"))
month_lazy = partition_lazy.filter(pl.col("sales_dollars") >= 130.0).select(["day", "sales_dollars"])

# Collect final result and print concise summary of filtered partitions.
month_df = month_lazy.collect()
print("Partitioned Parquet filtered rows:", month_df.height, "rows with strong sales.")



### **3.3. Tooling and Library Integration**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_03.jpg?v=1766890202" width="250">



>* Many tools assume Pandas DataFrame inputs
>* Polars often needs translation layers or conversions

>* Notebook previews and debugging feel different in Polars
>* You must explicitly materialize and inspect intermediate results

>* Production systems often assume Pandas-like DataFrames
>* Plan where Polars fits and needs bridges



# <font color="#418FDE" size="6.5" uppercase>**Pandas vs Polars**</font>


In this lecture, you learned to:
- Compare the core data structures and execution models of Pandas and Polars. 
- Explain how Polars’ lazy and eager APIs relate to typical Pandas workflows. 
- Identify common areas where Pandas habits may not map directly to Polars. 

In the next lecture (Lecture C), we will cover 'Setting Up Polars'.