# <font color="#418FDE" size="6.5" uppercase>**Pandas vs Polars**</font>

>Last update: 20251228.
    
By the end of this Lecture, you will be able to:
- Compare the core data structures and execution models of Pandas and Polars. 
- Explain how Polars’ lazy and eager APIs relate to typical Pandas workflows. 
- Identify common areas where Pandas habits may not map directly to Polars. 


## **1. Core Data Structures**

### **1.1. Pandas Core Structures**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_01.jpg?v=1766974710" width="250">



>* Pandas centers on Series and DataFrames
>* Aligned Series make table-like analysis operations easy

>* Index labels rows with meaningful identifiers
>* Index drives alignment, joins, and reshaping behavior

>* Pandas uses NumPy arrays and vectorized operations
>* Fast in-memory and eager, but largely single-threaded, which limits scalability



In [None]:
#@title Python Code - Pandas Core Structures

# Demonstrate Pandas Series and DataFrame core structures clearly.
# Show labeled index behavior and automatic alignment during operations.
# Highlight eager execution and NumPy backed column oriented storage.

# !pip install pandas

# Import pandas library for Series and DataFrame structures.
import pandas as pd

# Create a simple Series with labeled index values.
sales_series = pd.Series([100, 150, 200], index=["Mon", "Tue", "Wed"])

# Create another Series with overlapping and missing index labels.
returns_series = pd.Series([5, 7], index=["Mon", "Wed"])

# Create a DataFrame from multiple aligned Series columns.
store_df = pd.DataFrame({"sales_dollars": sales_series, "returns_dollars": returns_series})

# Print the Series objects to show labeled one dimensional arrays.
print("Sales Series with index labels:\n", sales_series)

# Print the DataFrame to show aligned Series forming a table.
print("\nStore DataFrame with aligned columns:\n", store_df)

# Show automatic index based alignment during arithmetic operations.
net_series = sales_series - returns_series

# Print the result highlighting alignment and missing value handling.
print("\nNet sales after returns by day:\n", net_series)
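
The alignment behavior above has a side effect worth seeing directly: a label with no partner becomes `NaN`, and integer inputs are upcast to `float64` to hold it. The sketch below reuses the same illustrative Series; `Series.sub` with `fill_value=0` is one possible way to treat a missing label as zero, not the only one.

```python
import numpy as np
import pandas as pd

# Same illustrative data as the cell above.
sales_series = pd.Series([100, 150, 200], index=["Mon", "Tue", "Wed"])
returns_series = pd.Series([5, 7], index=["Mon", "Wed"])

# Plain subtraction aligns on labels; "Tue" has no partner, so it becomes NaN
# and the integer inputs are upcast to float64 to hold that NaN.
net_series = sales_series - returns_series
print("dtype after alignment:", net_series.dtype)

# Series.sub with fill_value treats the missing label as 0 instead of NaN.
net_filled = sales_series.sub(returns_series, fill_value=0)
print(net_filled)

# The values live in a NumPy array underneath the index labels.
values = sales_series.to_numpy()
print(type(values), values.dtype)
```

Polars has no index, so this kind of implicit label alignment does not carry over; the equivalent step there is an explicit join.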



### **1.2. Polars Series and DataFrames**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_02.jpg?v=1766974736" width="250">



>* Series are typed columns; DataFrames group Series
>* Column-first design enables fast, whole-column operations

>* Polars stores columns in contiguous, columnar memory
>* Columnar layout enables fast vectorized, column-wise operations

>* Series can store nested, complex column values
>* Keeps hierarchical data intact for efficient analytics



In [None]:
#@title Python Code - Polars Series and DataFrames

# Demonstrate basic Polars Series and DataFrame usage clearly.
# Show strongly typed columns and column wise operations simply.
# Compare Series and DataFrame views for beginner understanding.

# !pip install polars

# Import polars library with conventional alias pl.
import polars as pl

# Create a Polars Series representing customer ages.
ages_series = pl.Series(name="age_years", values=[25, 32, 40, 28])

# Create another Series representing purchase amounts in dollars.
amount_series = pl.Series(name="purchase_usd", values=[19.5, 45.0, 13.0, 27.5])

# Build a DataFrame from the two Series objects together.
df = pl.DataFrame({"age_years": ages_series, "purchase_usd": amount_series})

# Print the Series to show single column structure.
print("Polars Series example:\n", ages_series)

# Print the DataFrame to show multiple equal-length Series as columns (Polars has no index to align on).
print("\nPolars DataFrame example:\n", df)

# Add a new column using whole column arithmetic operations.
df_with_tax = df.with_columns((pl.col("purchase_usd") * 1.07).alias("with_tax_usd"))

# Show resulting DataFrame emphasizing column oriented transformation.
print("\nDataFrame with tax column:\n", df_with_tax)



### **1.3. Arrow Columnar Foundations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_01_03.jpg?v=1766974755" width="250">



>* Polars uses Apache Arrow’s columnar memory layout
>* Column-wise storage speeds common analytical data operations

>* Shared Arrow format enables low-copy data exchange
>* Column layout boosts CPU efficiency and vectorization

>* Think in typed columns with clear representations
>* Shared Arrow model boosts scalability, speed, interoperability



In [None]:
#@title Python Code - Arrow Columnar Foundations

# Show how columnar data speeds column operations using Polars and Arrow foundations.
# Compare column wise operations with row wise style using simple numeric example.
# Help beginners connect Arrow column layout with faster analytical style computations.

# !pip install polars

# Import required libraries for Polars DataFrame creation and timing demonstration.
import polars as pl
import time

# Create small example data representing daily temperatures in Fahrenheit degrees.
data = {"day": ["Mon", "Tue", "Wed", "Thu", "Fri"], "temp_f": [70, 72, 68, 75, 71]}

# Build Polars DataFrame which internally uses Arrow style columnar memory layout.
df = pl.DataFrame(data)

# Show DataFrame so learners see columns stored and processed together conceptually.
print("Polars DataFrame with columnar layout:")
print(df)

# Compute average temperature using a vectorized column operation.
# perf_counter is a high-resolution clock better suited to short timings than time.time.
start_column = time.perf_counter()
avg_temp_column = df["temp_f"].mean()
end_column = time.perf_counter()

# Simulate row wise style by looping through values and summing manually in Python.
start_row = time.perf_counter()
manual_sum = 0
for value in df["temp_f"]:
    manual_sum += value
avg_temp_row = manual_sum / len(df["temp_f"])
end_row = time.perf_counter()

# Print both averages to confirm identical numerical results from both computation styles.
print("\nAverage temperature using column operation:", avg_temp_column)
print("Average temperature using row style loop:", avg_temp_row)

# Print rough timings; with only five rows both are dominated by call overhead, so the gap only becomes meaningful on large columns.
print("\nColumn operation time seconds:", round(end_column - start_column, 7))
print("Row loop operation time seconds:", round(end_row - start_row, 7))

# Show that selecting a column returns Series representing contiguous Arrow backed values.
print("\nTemperature column as Polars Series:")
print(df["temp_f"])



## **2. Polars Execution Model**

### **2.1. Row-wise to Columnar Shift**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_01.jpg?v=1766974781" width="250">



>* Shift mindset from row-based to column-based thinking
>* Column storage boosts performance, parallelism, and clarity

>* Row mindset loops through individual sales transactions
>* Column mindset applies vectorized operations for performance

>* Avoid row-by-row loops; think in columns
>* Use column expressions for faster, clearer transformations



In [None]:
#@title Python Code - Row-wise to Columnar Shift

# Demonstrate shifting from rowwise thinking to columnwise thinking using Polars.
# Compare a manual loop style with a vectorized column expression style.
# Show how whole column operations replace per row calculations efficiently.

# !pip install polars

# Import required Polars library for columnar data operations.
import polars as pl

# Create small sales dataset with store, quantity, and unit price columns.
data = {"store": ["A", "A", "B", "B"], "quantity": [2, 5, 1, 3], "price_usd": [10.0, 8.0, 12.0, 7.5]}

# Build Polars DataFrame from the dictionary data structure.
df = pl.DataFrame(data)

# Show original DataFrame to understand the starting point structure.
print("Original DataFrame:\n", df)

# Simulate rowwise mindset using explicit Python loop calculations.
row_totals = []

# Loop through each row and compute revenue manually per transaction.
for row in df.iter_rows():
    store, quantity, price_usd = row
    row_totals.append(quantity * price_usd)

# Print manual rowwise revenue results for comparison purposes.
print("\nRowwise revenue list:", row_totals)

# Use columnar mindset by defining a new revenue column expression.
df_columnar = df.with_columns((pl.col("quantity") * pl.col("price_usd")).alias("revenue_usd"))

# Print DataFrame with new revenue column computed columnwise.
print("\nColumnar revenue DataFrame:\n", df_columnar)

# Show total revenue using a single column aggregation expression.
print("\nTotal revenue using columnar aggregation:", df_columnar["revenue_usd"].sum())



### **2.2. Lazy Query Planning**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_02.jpg?v=1766974801" width="250">



>* Polars records steps as a logical plan
>* Execution is deferred to optimize the whole pipeline

>* Lazy workflows replace immediate, stepwise feedback loops
>* Planner reorders steps to minimize data processing

>* Planner optimizes, parallelizes, and reuses computations
>* Describe pipeline once, execute efficiently only when needed



In [None]:
#@title Python Code - Lazy Query Planning

# Demonstrate lazy query planning with simple Polars example.
# Compare building a plan versus immediately executing steps.
# Show when work actually happens during lazy and eager operations.

# !pip install polars

# Import required libraries for data handling and timing.
import polars as pl
import time

# Create a small example dataset representing daily miles driven.
data = {
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "miles": [30, 42, 25, 60, 55],
}

# Build an eager DataFrame that executes operations immediately.
df_eager = pl.DataFrame(data)

# Build a lazy DataFrame that records a plan only.
df_lazy = df_eager.lazy()

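# Define a helper function to time any callable execution.
# perf_counter is a high-resolution clock better suited to short timings than time.time.
def time_call(label, func):
    start = time.perf_counter()
    result = func()
    duration = time.perf_counter() - start
    print(f"{label}: {duration:.6f} seconds")
    return result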

# Time an eager filter and average miles computation.
print("Eager execution starts immediately below.")
result_eager = time_call(
    "Eager filter and average",
    lambda: df_eager.filter(pl.col("miles") > 40).select(pl.col("miles").mean()),
)

# Show eager result printed directly.
print("Eager result DataFrame below.")
print(result_eager)

# Build a lazy plan with multiple chained operations.
print("Building lazy plan without execution.")
lazy_plan = df_lazy.filter(pl.col("miles") > 40).select(pl.col("miles").mean())

# Time the collection step that triggers lazy execution.
print("Lazy execution happens only on collect call.")
result_lazy = time_call("Lazy collect call", lambda: lazy_plan.collect())

# Show lazy result printed directly for comparison.
print("Lazy result DataFrame below.")
print(result_lazy)



### **2.3. Eager API Parallels**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_02_03.jpg?v=1766974821" width="250">



>* Eager mode feels familiar to Pandas users
>* Supports quick, interactive work while bridging to optimization

>* Eager mode still uses column-based expressions
>* Interactive work trains you for lazy pipelines

>* Start with eager Polars as familiar replacement
>* Later refactor eager steps into lazy pipelines



In [None]:
#@title Python Code - Eager API Parallels

# Demonstrate Polars eager API parallels with familiar Pandas style operations.
# Show immediate execution when filtering, selecting, and creating new derived columns.
# Highlight column expressions that feel familiar yet use Polars efficient engine.

# !pip install polars pandas matplotlib

# Import required libraries for data handling and plotting.
import polars as pl
import pandas as pd
import matplotlib.pyplot as plt

# Create small customer dataset using a Python dictionary.
data = {
    "customer_id": [1, 2, 3, 4],
    "visits_per_week": [1, 3, 5, 2],
    "spend_dollars": [20.0, 55.0, 120.0, 35.0],
}

# Build a Pandas DataFrame to show familiar starting point.
pd_df = pd.DataFrame(data)

# Build a Polars DataFrame using the same dictionary data.
pl_df = pl.DataFrame(data)

# Use Pandas eagerly to create a new column with weekly revenue estimate.
pd_df["weekly_revenue"] = pd_df["visits_per_week"] * pd_df["spend_dollars"]

# Use Polars eager API with column expression for same calculation.
pl_df = pl_df.with_columns(
    (pl.col("visits_per_week") * pl.col("spend_dollars")).alias("weekly_revenue")
)

# Filter frequent visitors in Pandas using boolean indexing eagerly.
pd_frequent = pd_df[pd_df["visits_per_week"] >= 3]

# Filter frequent visitors in Polars using expression based filter eagerly.
pl_frequent = pl_df.filter(pl.col("visits_per_week") >= 3)

# Print both filtered tables to compare familiar eager style outputs.
print("Pandas frequent visitors table:\n", pd_frequent)
print("\nPolars frequent visitors table:\n", pl_frequent)

# Plot weekly revenue from Polars to show immediate visual feedback.
plt.bar(pl_df["customer_id"].to_list(), pl_df["weekly_revenue"].to_list())
plt.xlabel("Customer identifier code")
plt.ylabel("Weekly revenue dollars")
plt.title("Weekly revenue per customer using Polars eager mode")
plt.show()



## **3. Polars Ecosystem Integration**

### **3.1. NumPy and Arrow Bridges**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_01.jpg?v=1766974859" width="250">



>* Polars columns are Arrow arrays, not NumPy
>* Convert only needed columns to NumPy endpoints

>* Polars works directly on Arrow-based columnar data
>* Keep Arrow format, convert small NumPy slices

>* Arrow enforces strict types, nulls, and nesting
>* Conversions to NumPy can lose structure and performance



In [None]:
#@title Python Code - NumPy and Arrow Bridges

# Demonstrate Polars Arrow backbone and NumPy conversion boundary clearly.
# Show efficient Polars operations before converting small slices to NumPy arrays.
# Highlight that conversion is explicit, not automatic, between ecosystems.

# !pip install polars pyarrow numpy

# Import required libraries for Polars, Arrow, and NumPy interoperability.
import polars as pl
import pyarrow as pa
import numpy as np

# Create a small Polars DataFrame representing hourly energy loads.
data = {
    "hour": np.arange(0, 6, 1),
    "load_kw": np.array([10.0, 12.5, 11.0, 13.5, 14.0, 15.5]),
}

# Build the Polars DataFrame using Arrow friendly columnar structures.
df = pl.DataFrame(data)

# Show the Polars DataFrame to confirm Arrow backed tabular structure.
print("Polars DataFrame (Arrow backed):")
print(df)

# Perform an efficient Polars operation before any NumPy conversion.
df_summary = df.select([
    pl.col("load_kw").mean().alias("mean_load_kw"),
    pl.col("load_kw").std().alias("std_load_kw"),
])

# Display the summary to emphasize staying in Polars for analytics.
print("\nPolars summary before NumPy conversion:")
print(df_summary)

# Extract only the load column as a NumPy array for numerical solver usage.
load_numpy = df["load_kw"].to_numpy()

# Show the NumPy array, highlighting explicit boundary crossing step.
print("\nNumPy array used for numerical solver:")
print(load_numpy)

# Convert the Polars column to a PyArrow array, the natural bridge format.
load_arrow = df["load_kw"].to_arrow()

# Display Arrow array type information, showing richer metadata and structure.
print("\nArrow array type and values:")
print(load_arrow, "| type:", load_arrow.type)



### **3.2. File Formats in Polars**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_02.jpg?v=1766974883" width="250">



>* Pandas users often default to CSV files
>* Polars favors Parquet or Arrow for performance

>* Polars prefers files with clear, fixed schemas
>* Stricter typing exposes messy, mixed-type CSV data

>* Prefer columnar formats throughout most data pipelines
>* Avoid CSV-centric habits to unlock Polars performance



In [None]:
#@title Python Code - File Formats in Polars

# Show Polars reading CSV versus Parquet performance and schema clarity.
# Highlight how columnar formats preserve data types and improve speed.
# Encourage preferring Parquet over CSV for larger analytic workflows.

# !pip install polars

# Import required libraries for data handling and timing.
import polars as pl
import time

# Create a small example DataFrame with mixed realistic columns.
df = pl.DataFrame({"city": ["Boston", "Dallas", "Denver", "Seattle"], "temp_f": [70.5, 88.2, 65.0, 59.3], "visitors": [1200, 2300, 900, 1500]})

# Define file paths for CSV and Parquet outputs.
csv_path = "example_weather.csv"
parquet_path = "example_weather.parquet"

# Save the DataFrame as a CSV text file.
df.write_csv(csv_path)

# Save the same DataFrame as a Parquet columnar file.
df.write_parquet(parquet_path)

# Time reading the CSV file using Polars read_csv function.
start_csv = time.perf_counter()
df_csv = pl.read_csv(csv_path)
csv_time = time.perf_counter() - start_csv

# Time reading the Parquet file using Polars read_parquet function.
start_parquet = time.perf_counter()
df_parquet = pl.read_parquet(parquet_path)
parquet_time = time.perf_counter() - start_parquet

# Print basic timing comparison for both file formats.
print("CSV read seconds:", round(csv_time, 6))
print("Parquet read seconds:", round(parquet_time, 6))

# Show inferred schema when reading from CSV text format.
print("CSV schema:", df_csv.schema)

# Show preserved schema when reading from Parquet columnar format.
print("Parquet schema:", df_parquet.schema)

# Display both DataFrames to confirm identical values and structures.
print("CSV data:", df_csv)
print("Parquet data:", df_parquet)



### **3.3. Tooling and Library Integration**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_01/Lecture_B/image_03_03.jpg?v=1766974994" width="250">



>* Many tools still assume inputs are Pandas DataFrames
>* With Polars, you must manage conversions explicitly

>* ML libraries expect Pandas-specific DataFrame behaviors
>* With Polars, explicitly convert and validate model inputs

>* Pandas-focused debugging and profiling tools may break
>* Polars needs explicit execution and new inspection habits



In [None]:
#@title Python Code - Tooling and Library Integration

# Demonstrate Polars integration with plotting and modeling tools.
# Show where conversions from Polars to Pandas are needed.
# Highlight explicit data contracts at tool boundaries.

# !pip install polars pandas pyarrow matplotlib scikit-learn
# (pyarrow is needed by DataFrame.to_pandas)

# Import required libraries for data, plotting, and modeling.
import polars as pl
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create a simple Polars DataFrame with miles and gallons columns.
data_pl = pl.DataFrame({"miles_driven":[10,20,30,40],"gallons_used":[1,2,2,3]})

# Show that Polars prints fine but some tools expect Pandas DataFrames.
print("Polars DataFrame preview:")
print(data_pl.head(3))

# Convert Polars DataFrame to Pandas for plotting compatibility.
data_pd = data_pl.to_pandas()

# Create a simple scatter plot with Matplotlib from the Pandas columns.
plt.scatter(data_pd["miles_driven"],data_pd["gallons_used"],color="blue")
plt.xlabel("Miles driven (miles)")
plt.ylabel("Gallons used (gallons)")
plt.title("Fuel usage scatter plot example")
plt.show()

# Prepare feature matrix and target vector as NumPy arrays for the modeling library.
X = data_pd[["miles_driven"]].to_numpy()
y = data_pd["gallons_used"].to_numpy()

# Fit a simple linear regression model using scikit learn.
model = LinearRegression()
model.fit(X,y)

# Print learned slope and intercept to show successful integration.
print("Model slope gallons per mile:",model.coef_[0])
print("Model intercept gallons baseline:",model.intercept_)



# <font color="#418FDE" size="6.5" uppercase>**Pandas vs Polars**</font>


In this lecture, you learned to:
- Compare the core data structures and execution models of Pandas and Polars. 
- Explain how Polars’ lazy and eager APIs relate to typical Pandas workflows. 
- Identify common areas where Pandas habits may not map directly to Polars. 

In the next Lecture (Lecture C), we will go over 'Setting Up Polars'.