# <font color="#418FDE" size="6.5" uppercase>**Lazy API Concepts**</font>

>Last update: 20251228.
    
By the end of this Lecture, you will be able to:
- Describe how Polars’ lazy execution model works and how it contrasts with Pandas’ eager evaluation. 
- Build lazy query plans in Polars using scan operations and expression chains. 
- Inspect and execute lazy plans to validate results and understand optimization effects. 


## **1. Lazy vs Eager Execution**

### **1.1. Deferred Execution Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_01_01.jpg?v=1766898794" width="250">



>* Lazy execution records transformations instead of running them
>* Eager execution runs each step immediately, leaving less room for optimization

>* Lazy mode builds a full query plan
>* Planner optimizes steps before doing any work

>* Lazy execution delays feedback and concentrates work
>* Global optimization improves performance but changes workflow timing



In [None]:
#@title Python Code - Deferred Execution Basics

# Show how lazy execution defers real work until explicitly requested.
# Compare eager Pandas operations with lazy Polars operations side by side.
# Print when work actually happens for both eager and lazy approaches.

import pandas as pd
import polars as pl

print("Step one: creating small example dataset.")

pdf = pd.DataFrame({"city": ["Boston", "Denver", "Boston", "Miami"], "temp_f": [70, 80, 65, 90]})

print("Pandas eager example starts work immediately.")

pdf_filtered = pdf[pdf["temp_f"] > 75]
print("Pandas filtered result appears right now:")
print(pdf_filtered)

print("Polars lazy example builds plan only.")

lazy_df = pl.DataFrame({"city": ["Boston", "Denver", "Boston", "Miami"], "temp_f": [70, 80, 65, 90]}).lazy()

lazy_plan = lazy_df.filter(pl.col("temp_f") > 75)
print("After building lazy plan, no data work has run.")

print("Now we explicitly collect lazy result, work happens here.")

result = lazy_plan.collect()
print("Polars collected result appears only after collect call:")
print(result)



### **1.2. Performance Benefits of Laziness**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_01_02.jpg?v=1766898807" width="250">



>* Lazy execution plans the whole query upfront
>* Optimizer reduces work, avoiding wasted time and memory

>* Lazy queries push filters down and read only the needed columns
>* Engine reuses repeated expressions, avoiding redundant work

>* Lazy execution avoids big intermediate data structures
>* Operation fusion boosts memory efficiency and parallel speed



In [None]:
#@title Python Code - Performance Benefits of Laziness

# Show lazy versus eager performance behavior using Polars and Pandas examples.
# Demonstrate reduced work from predicate pushdown and projection pruning concepts.
# Compare execution timing and printed plans for similar analytical style queries.

import time
import numpy as np
import pandas as pd

import polars as pl

# Create a moderately large synthetic dataset with numeric columns and categories.
num_rows = 1_000_000
np.random.seed(42)

categories = np.random.choice(['US', 'UK', 'CA', 'AU'], size=num_rows)
values_a = np.random.rand(num_rows) * 100.0
values_b = np.random.rand(num_rows) * 50.0

# Build a Pandas DataFrame eagerly, all data loaded immediately into memory.
pdf = pd.DataFrame({'country': categories, 'value_a': values_a, 'value_b': values_b})

# Build a Polars LazyFrame from the in-memory data; .lazy() defers computation (no file scan here).
pl_df = pl.from_pandas(pdf)
lazy_df = pl_df.lazy()

# Define a simple analytical style query with filter and aggregation operations.
country_filter = 'US'

# Eager Pandas pipeline executes each step immediately on full dataset memory.
start_eager = time.time()
filtered_pdf = pdf[pdf['country'] == country_filter]
result_eager = filtered_pdf['value_a'].mean()
end_eager = time.time()

# Lazy Polars pipeline records operations, optimizer can push filter and prune columns.
lazy_query = (
    lazy_df
    .filter(pl.col('country') == country_filter)
    .select(pl.col('value_a'))
    .select(pl.col('value_a').mean().alias('mean_value_a'))
)

# Print the optimized plan to show predicate pushdown and projection pruning behavior.
# This happens outside the timed region so printing does not distort the comparison.
print('Optimized lazy plan shows pushed filter and pruned columns:')
print(lazy_query.explain(optimized=True))

# Execute the lazy query, actual work happens here after optimization decisions.
# Only the collect call is timed, since that is where the data is processed.
start_lazy = time.time()
result_lazy = lazy_query.collect()
end_lazy = time.time()

# Print both results and simple timing comparison for conceptual performance illustration.
print('\nPandas eager mean value result and seconds:')
print(result_eager, round(end_eager - start_eager, 4))

print('\nPolars lazy mean value result and seconds:')
print(result_lazy['mean_value_a'][0], round(end_lazy - start_lazy, 4))



### **1.3. Choosing Eager Execution**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_01_03.jpg?v=1766898822" width="250">



>* Eager execution runs operations immediately for feedback
>* Best for exploratory work needing clear stepwise results

>* Best for small datasets where speed matters less
>* Immediate results aid teaching, prototyping, collaboration

>* Eager execution gives concrete datasets for integrations
>* Choose eager mode when clarity and immediacy matter



## **2. Constructing Lazy Plans**

### **2.1. Scanning Data Sources**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_02_01.jpg?v=1766898842" width="250">



>* Lazy scanning registers sources without loading data
>* Scan node anchors the plan and saves work

>* Lazy scans describe data, enabling global planning
>* Engine reads only needed columns, files, partitions

>* Lazy scans unify many different data sources
>* Experiment with pipelines, executing only needed data



In [None]:
#@title Python Code - Scanning Data Sources

# Demonstrate scanning CSV and Parquet sources lazily using Polars in Colab.
# Show that scan operations create lazy plans without immediate data loading.
# Compare lazy scans with eager reads and print simple plan information.

import polars as pl

# Create a small eager DataFrame representing daily temperatures in Fahrenheit.
weather_df = pl.DataFrame({"day": ["Mon", "Tue", "Wed"], "temp_f": [70, 75, 80]})

# Save the DataFrame to CSV and Parquet files for scanning demonstrations.
weather_df.write_csv("weather.csv")
weather_df.write_parquet("weather.parquet")

# Perform an eager read from the CSV file, immediately loading all data into memory.
eager_csv = pl.read_csv("weather.csv")
print("Eager CSV read type:", type(eager_csv))

# Perform a lazy scan from the CSV file, creating a logical plan without loading rows.
lazy_csv = pl.scan_csv("weather.csv")
print("Lazy CSV scan type:", type(lazy_csv))

# Perform a lazy scan from the Parquet file, again creating a logical plan only.
lazy_parquet = pl.scan_parquet("weather.parquet")
print("Lazy Parquet scan type:", type(lazy_parquet))

# Show the lazy CSV plan, which includes a scan node describing the data source.
print("\nLazy CSV logical plan:")
print(lazy_csv.explain())

# Execute the lazy CSV plan, materializing the data only when collect is called.
result_df = lazy_csv.collect()
print("\nCollected DataFrame head:")
print(result_df)



### **2.2. Lazy Expression Chaining**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_02_02.jpg?v=1766898856" width="250">



>* Describe full data transformation as one pipeline
>* Engine sees whole plan and optimizes execution

>* Each chained call returns a new lazy object; nothing runs yet
>* Engine reorders, combines steps for efficient execution

>* Plan the whole transformation workflow upfront
>* Engine optimizes pipeline without intermediate materialization



In [None]:
#@title Python Code - Lazy Expression Chaining

# Demonstrate lazy expression chaining with a simple Polars query pipeline.
# Show how multiple transformations build one coherent lazy plan description.
# Finally execute the plan and display the optimized result clearly.

import polars as pl

# Create a small in memory DataFrame representing simple retail transactions.
# Distances use miles and prices use dollars for familiar imperial style units.
# This eager frame will be converted into a lazy plan for transformations.
# The data stays tiny so printed output remains short and readable.

data = pl.DataFrame({
    "store": ["A", "A", "B", "B"],
    "miles_shipped": [10, 25, 5, 40],
    "price_dollars": [100.0, 80.0, 50.0, 120.0],
    "discount_rate": [0.10, 0.20, 0.00, 0.15],
})

# Start a lazy chain from the eager DataFrame using the lazy method.
# No work happens immediately; we only describe the transformation steps.
# Each method call returns another lazy object representing a new logical state.
# Think of this as sketching the full route before driving anywhere.

lazy_plan = (
    data.lazy()
    .filter(pl.col("miles_shipped") > 8)
    .with_columns(
        discounted_price=pl.col("price_dollars") * (1 - pl.col("discount_rate"))
    )
    .group_by("store")
    .agg(
        total_revenue=pl.col("discounted_price").sum(),
        average_miles=pl.col("miles_shipped").mean(),
    )
)

# Show the optimized logical plan to see how Polars understands the chain.
# This prints a compact description, not the actual data rows yet.
# The engine can reorder or combine steps before any execution occurs.

print("Optimized lazy plan description:")
print(lazy_plan.explain())

# Finally collect the result, which triggers execution of the entire chain.
# Only now does Polars read data, apply filters, and compute aggregations.
# The printed DataFrame is small, staying within the output line limit.

result = lazy_plan.collect()
print("\nFinal aggregated result:")
print(result)



### **2.3. Minimizing Materialization Steps**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_02_03.jpg?v=1766898881" width="250">



>* Avoid materializing intermediate results during lazy queries
>* Build one end‑to‑end pipeline for global optimization

>* Repeated collects recompute intermediates and waste memory
>* One continuous lazy plan enables powerful optimizations

>* Materialize only at clear, necessary workflow boundaries
>* Sample small subsets to inspect while staying lazy



In [None]:
#@title Python Code - Minimizing Materialization Steps

# Demonstrate minimizing materialization steps using Polars lazy queries.
# Compare repeated collects versus a single lazy pipeline collect.
# Show performance friendly lazy style with minimal intermediate materialization.

import polars as pl
from time import perf_counter

# Create a small in memory DataFrame representing click events.
# Each row stores user identifier, page name, and click count.
# We will convert this frame into a lazy scan style source.

base_df = pl.DataFrame({"user_id": [1, 1, 2, 2, 3, 3], "page": ["home", "cart", "home", "search", "home", "cart"], "clicks": [5, 2, 3, 4, 6, 1]})

# Build a lazy plan that filters and aggregates without early materialization.
# This plan keeps everything lazy until the final collect call.
# We filter home page clicks, then group by user and sum clicks.

# Note: group_by replaces the older groupby spelling, matching the rest of this lecture.
lazy_plan_single = (
    base_df.lazy()
    .filter(pl.col("page") == "home")
    .group_by("user_id")
    .agg(pl.col("clicks").sum().alias("total_clicks"))
)

# Execute the single lazy plan once and measure elapsed time.
# This simulates a pipeline with minimal materialization steps.
# The optimizer can see the entire chain of operations here.

start_single = perf_counter()
result_single = lazy_plan_single.collect()
elapsed_single = perf_counter() - start_single

# Now simulate a less efficient pattern with multiple materializations.
# First collect a filtered frame eagerly from the base DataFrame.
# Then start a new lazy plan from that intermediate frame.

start_multi = perf_counter()
intermediate_eager = base_df.filter(pl.col("page") == "home")
result_multi = (
    intermediate_eager.lazy()
    .group_by("user_id")
    .agg(pl.col("clicks").sum().alias("total_clicks"))
    .collect()
)
elapsed_multi = perf_counter() - start_multi

# Print both results to confirm they are identical logically.
# Also print timing information to highlight extra overhead.
# On small data timing differences may be tiny but concept remains.

print("Single lazy pipeline result:")
print(result_single)
print("Multiple materializations result:")
print(result_multi)
print("Single pipeline seconds:", round(elapsed_single, 6))
print("Multiple steps seconds:", round(elapsed_multi, 6))
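# group_by does not guarantee row order, so sort both frames before comparing.
print("Same results:", result_single.sort("user_id").equals(result_multi.sort("user_id")))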




## **3. Inspecting Lazy Plans**

### **3.1. Reading Lazy Explain Plans**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_03_01.jpg?v=1766898900" width="250">



>* Explain plans show a tree of operations
>* Read bottom-up: the data source sits at the bottom, the final step at the top

>* Separate logical intent from physical execution details
>* Compare plan steps with your mental query model

>* Notice where the optimizer merges or reorders steps
>* Check columns, filters, joins for efficient placement



In [None]:
#@title Python Code - Reading Lazy Explain Plans

# Show a simple Polars lazy query explain plan example.
# Demonstrate reading logical steps from bottom to top.
# Highlight how the optimizer rewrites the plan before execution.

import polars as pl

# Create a small DataFrame representing daily sales in dollars.

df = pl.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu"], "sales": [120, 80, 150, 90]})

# Build a lazy query that filters, adds tax, and computes average.

lazy_query = (
    df.lazy()
    .filter(pl.col("sales") > 90)
    .with_columns((pl.col("sales") * 1.07).alias("sales_with_tax"))
    .select(["day", "sales_with_tax"])
    .group_by("day")
    .agg(pl.col("sales_with_tax").mean().alias("avg_sales_with_tax"))
)

# Print the plan to inspect its steps (explain() shows the optimized plan by default).

print("--- Lazy query explain plan ---")
print(lazy_query.explain())

# Execute the lazy query and print final result for comparison.

print("\n--- Executed lazy query result ---")
print(lazy_query.collect())



### **3.2. Projection Pushdown Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_03_02.jpg?v=1766898913" width="250">



>* Engine only reads columns the result needs
>* This cuts I/O, memory use, and computation

>* Early scan nodes listing few columns indicate pushdown
>* Many scanned columns or patterns can block pushdown

>* Engine reads only columns needed for analysis
>* Design queries to help optimizer skip unused columns



In [None]:
#@title Python Code - Projection Pushdown Basics

# Demonstrate projection pushdown using Polars lazy queries and explain plans.
# Show how selecting fewer columns changes the lazy scan behavior.
# Help beginners see optimization effects without large data or complex code.

import polars as pl

# Create a small wide DataFrame with several unused columns.
# Imagine this as a tiny version of a wide clickstream table.
# Only some columns will be needed for the final result.

df = pl.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "user_id": [10, 20, 10, 30],
    "campaign": ["A", "B", "A", "A"],
    "browser": ["Chrome", "Safari", "Edge", "Chrome"],
    "country": ["US", "US", "CA", "GB"],
})

# Turn the DataFrame into a lazy frame for optimization.
# This does not execute any work immediately.

lazy_all = df.lazy()

# Build a lazy query that selects timestamp and campaign, then counts rows per campaign.
# The user_id, browser, and country columns are never needed.

lazy_query = (
    lazy_all
    .select(["timestamp", "campaign"])
    .group_by("campaign")
    .agg(pl.len().alias("count"))
)
)

# Show the optimized plan to inspect projection pushdown behavior.
# Look for the PROJECT annotation on the source node: it reports how many of the five columns are kept.

print("Optimized plan with projection pushdown enabled:")
print(lazy_query.explain(optimized=True))

# Execute the lazy query to confirm it still returns correct results.
# The result uses only the necessary columns for the aggregation.

result = lazy_query.collect()
print("\nQuery result using only needed columns:")
print(result)



### **3.3. Recognizing common optimizations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_A/image_03_03.jpg?v=1766898929" width="250">



>* Spot projection pushdown reading only needed columns
>* Look for early filters to drop rows

>* Polars fuses and rewrites operations for efficiency
>* Reordered nodes cut intermediates, keeping results identical

>* Polars optimizes joins, sorts, and data movement
>* Streaming and plan patterns reveal efficiency and issues



In [None]:
#@title Python Code - Recognizing common optimizations

# Show how Polars optimizes lazy queries with explain plans.
# Compare plans to recognize projection and predicate pushdown optimizations.
# Help beginners see fewer columns and rows scanned for efficiency.

import polars as pl

# Create a small DataFrame that mimics a wider fact table.
# Columns include id, timestamp, category, value, and flag.
# We will write this DataFrame to a parquet file.

df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "timestamp": [1_700_000_001, 1_700_000_002, 1_700_000_003, 1_700_000_004, 1_700_000_005],
    "category": ["A", "B", "A", "B", "A"],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
    "flag": [True, False, True, False, True],
})

file_path = "example_fact_table.parquet"

df.write_parquet(file_path)

# Build a lazy query that selects few columns without filters.
# This query should trigger projection pushdown optimization.
# We only need id and value columns from the parquet file.

lazy_projection = (
    pl.scan_parquet(file_path)
    .select([pl.col("id"), pl.col("value") * 2])
)

print("Projection pushdown plan, notice selected columns only:")
print(lazy_projection.explain())

# Build a lazy query that filters rows and selects columns.
# This query should show both predicate and projection pushdown.
# Filter keeps rows where value exceeds thirty and flag is True.

lazy_predicate_projection = (
    pl.scan_parquet(file_path)
    .filter((pl.col("value") > 30.0) & pl.col("flag"))
    .select([pl.col("id"), pl.col("value")])
)

print("\nPredicate and projection pushdown combined plan:")
print(lazy_predicate_projection.explain())





# <font color="#418FDE" size="6.5" uppercase>**Lazy API Concepts**</font>


In this lecture, you learned to:
- Describe how Polars’ lazy execution model works and how it contrasts with Pandas’ eager evaluation. 
- Build lazy query plans in Polars using scan operations and expression chains. 
- Inspect and execute lazy plans to validate results and understand optimization effects. 

In the next Lecture (Lecture B), we will go over 'Optimizing Pipelines'.