# <font color="#418FDE" size="6.5" uppercase>**Optimizing Pipelines**</font>

>Last update: 20251228.
    
By the end of this lecture, you will be able to:
- Optimize Polars IO operations by choosing appropriate file formats and scan options. 
- Reduce memory usage in Polars pipelines through column pruning, type choices, and lazy evaluation. 
- Profile and benchmark Polars pipelines against existing Pandas implementations to quantify performance improvements. 


## **1. Efficient Polars IO**

### **1.1. Choosing CSV vs Parquet**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_01_01.jpg?v=1766900451" width="250">



>* CSV is simple and shareable but slow
>* Parquet is compressed, typed, and much faster

>* CSV needs heavy parsing, slowing large scans
>* Parquet avoids text parsing, enabling faster repeated queries

>* Use CSV at system edges for interoperability
>* Convert cleaned data to Parquet for efficient analytics



In [None]:
#@title Python Code - Choosing CSV vs Parquet

# Compare reading CSV versus Parquet with Polars for simple analytics demonstration.
# Show how file formats affect read speed and file size clearly here.
# Help choose CSV or Parquet for faster repeated analytical queries overall.

import time
import os
import polars as pl

# Create a small synthetic dataset with several numeric columns for demonstration.
num_rows = 500000
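# eager=True makes pl.arange and pl.repeat return Series instead of lazy expressions.
sales_df = pl.DataFrame(
    {
        "store_id": pl.arange(0, num_rows, eager=True),
        "day_index": pl.arange(0, num_rows, eager=True),
        "units_sold": pl.arange(0, num_rows, eager=True) % 50,
        "unit_price": pl.repeat(19.99, num_rows, eager=True),
    }
)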

# Define file paths inside current working directory for CSV and Parquet outputs.
csv_path = "demo_sales.csv"
parquet_path = "demo_sales.parquet"

# Write dataset to CSV format without index column, using default comma delimiter.
start_write_csv = time.time()
sales_df.write_csv(csv_path)
end_write_csv = time.time()

# Write dataset to Parquet format using default compression settings for efficiency.
start_write_parquet = time.time()
sales_df.write_parquet(parquet_path)
end_write_parquet = time.time()

# Measure file sizes in bytes for both formats using operating system utilities.
csv_size_bytes = os.path.getsize(csv_path)
parquet_size_bytes = os.path.getsize(parquet_path)

# Time reading CSV using lazy scan, then collect to execute the query fully.
# perf_counter is a monotonic, high-resolution clock intended for timing.
start_read_csv = time.perf_counter()
result_csv = pl.scan_csv(csv_path).select([pl.col("units_sold").sum()]).collect()
end_read_csv = time.perf_counter()

# Time reading Parquet using lazy scan, then collect to execute the query fully.
start_read_parquet = time.perf_counter()
result_parquet = pl.scan_parquet(parquet_path).select([pl.col("units_sold").sum()]).collect()
end_read_parquet = time.perf_counter()

# Helper function to format bytes as kilobytes with two decimal places for readability.
def format_kilobytes(byte_count):
    return f"{byte_count / 1024:.2f} KB"

# Print summary comparing file sizes and read times for CSV versus Parquet formats.
print("CSV size:", format_kilobytes(csv_size_bytes), "Read seconds:", round(end_read_csv - start_read_csv, 4))
print("Parquet size:", format_kilobytes(parquet_size_bytes), "Read seconds:", round(end_read_parquet - start_read_parquet, 4))

# Print sums to confirm both formats contain identical numeric information after reading.
print("CSV units_sold sum:", int(result_csv[0, 0]))
print("Parquet units_sold sum:", int(result_parquet[0, 0]))

# Print simple recommendation based on observed read times and file sizes here.
print("Parquet usually wins for repeated analytical reads on large datasets overall.")



### **1.2. Streaming And Chunked Reads**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_01_02.jpg?v=1766900469" width="250">



>* Process large datasets by streaming small chunks
>* Compute while reading to keep memory usage low

>* Polars stores and processes data in chunks
>* Chunk-wise processing cuts memory use and runtime

>* Best for single-pass filters and aggregations
>* Enables scalable, memory-efficient analytics on huge datasets



In [None]:
#@title Python Code - Streaming And Chunked Reads

# Demonstrate Polars streaming and chunked reads with a simple aggregation example.
# Compare normal eager reading with lazy streaming over a synthetic CSV dataset.
# Show memory friendly processing by scanning and aggregating data in manageable chunks.

import polars as pl
import numpy as np
import os

# Create a synthetic CSV file representing large daily sales records.
num_rows = 200000
rng = np.random.default_rng(seed=42)

cities = ["New York", "Chicago", "Dallas", "Seattle"]
city_data = rng.choice(cities, size=num_rows)


amount_data = rng.integers(low=5, high=500, size=num_rows)


csv_path = "sales_data.csv"


with open(csv_path, "w") as f:
    f.write("city,amount\n")
    for city, amount in zip(city_data, amount_data):
        f.write(f"{city},{amount}\n")


file_size_mb = os.path.getsize(csv_path) / (1024 * 1024)
print(f"CSV file size megabytes approximately: {file_size_mb:.2f}")


# Eager read loads entire file into memory before computing aggregations.


sales_eager = pl.read_csv(csv_path)


result_eager = sales_eager.group_by("city").agg(pl.col("amount").sum().alias("total_amount"))


print("Eager mode totals by city (dollars):")
print(result_eager)


# Lazy streaming read processes data in chunks and avoids full materialization.


sales_lazy = pl.scan_csv(csv_path)


result_streaming = (
    sales_lazy.group_by("city")
    .agg(pl.col("amount").sum().alias("total_amount"))
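    # Recent Polars releases select the streaming engine with engine="streaming";
    # older releases used collect(streaming=True) instead.
    .collect(engine="streaming")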
)


print("Streaming mode totals by city (dollars):")
print(result_streaming)



### **1.3. Large File Scan Tuning**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_01_03.jpg?v=1766900485" width="250">



>* Configure scans carefully for huge datasets
>* Select needed columns and time ranges only

>* Balance parallel reads with available memory limits
>* Tune batch size, threads, and IO per environment

>* Align scans with data partitions and layout
>* Refine filters and layout to minimize IO



In [None]:
#@title Python Code - Large File Scan Tuning

# Demonstrate Polars large file scan tuning with column and row selection.
# Compare full scan versus tuned scan using a synthetic large dataset.
# Show timing differences when pruning columns and pushing down filters.

import time
from datetime import datetime, timedelta

import numpy as np
import polars as pl

# Create a synthetic dataset representing large log records.
# We keep row count moderate for Colab memory safety.
row_count = 2_000_000
rng = np.random.default_rng(seed=0)

# One record every 15 seconds from 2024-01-01, so the data reaches into December 2024.
start_ts = datetime(2024, 1, 1)
end_ts = start_ts + timedelta(seconds=15 * (row_count - 1))

# Build a lazy frame with several columns including timestamps and error codes.
lf = pl.DataFrame(
    {
        "timestamp": pl.datetime_range(start_ts, end_ts, interval="15s", eager=True),
        "error_code": rng.integers(100, 600, size=row_count),
        "user_id": rng.integers(1, 50_000, size=row_count),
        "payload": pl.repeat("some long text payload", row_count, eager=True),
    }
).lazy()

# Write the dataset to Parquet to simulate a large columnar log file.
parquet_path = "large_logs.parquet"
lf.collect().write_parquet(parquet_path)

# Define a helper function for timing lazy queries with clear labels.
def run_and_time(label, lazy_frame):
    start = time.time()
    result = lazy_frame.collect()
    duration = time.time() - start
    print(f"{label} took {duration:.3f} seconds, rows {result.height}.")


# Scenario one: naive scan reading all columns and all rows.
lf_naive = pl.scan_parquet(parquet_path)

# Scenario two: tuned scan selecting needed columns and recent rows only.
lf_tuned = (
    pl.scan_parquet(parquet_path)
    .select(["timestamp", "error_code"])
    .filter(pl.col("timestamp") >= pl.datetime(2024, 12, 1))
)

# Scenario three: tuned scan that also keeps only every tenth row.
# This simulates sampling; it does not change the Parquet row group size.
lf_small_groups = (
    pl.scan_parquet(parquet_path)
    .with_row_index("row_index")
    .filter(pl.col("row_index") % 10 == 0)
    .select(["timestamp", "error_code"])
)

# Run and time each scenario to observe performance differences.
run_and_time("Naive full scan", lf_naive)
run_and_time("Tuned column and date scan", lf_tuned)
run_and_time("Sampled small group scan", lf_small_groups)



## **2. Memory Efficient Pipelines**

### **2.1. Column Pruning Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_02_01.jpg?v=1766900511" width="250">



>* Select only needed columns to save memory
>* Lazy engines skip reading unused columns entirely

>* Select only columns needed for current task
>* Fewer columns reduce disk reads and memory use

>* Lazy evaluation identifies only truly needed columns
>* Pruned columns speed up joins, aggregations, scalability



In [None]:
#@title Python Code - Column Pruning Basics

# Demonstrate column pruning using Polars lazy scanning basics.
# Compare scanning all columns versus selecting only needed columns.
# Show memory friendly behavior using simple timing and column counts.

import time
from datetime import datetime, timedelta

import polars as pl

# Create a wide DataFrame with many unnecessary columns.
num_rows = 1_000_000
first_order = datetime(2020, 1, 1)
wide_df = pl.DataFrame({
    "customer_id": pl.arange(0, num_rows, eager=True),
    "order_amount": pl.arange(0, num_rows, eager=True) * 1.5,
    # One order per minute, sized so the column has exactly num_rows values.
    "order_date": pl.datetime_range(
        first_order,
        first_order + timedelta(minutes=num_rows - 1),
        interval="1m",
        eager=True,
    ),
    "channel": ["email"] * num_rows,
    "big_text": ["unused description"] * num_rows,
})

# Write the DataFrame to Parquet to simulate warehouse storage.
file_path = "wide_orders.parquet"
wide_df.write_parquet(file_path)

# Define a lazy scan that reads all columns without pruning.
scan_all = pl.scan_parquet(file_path)

# Define a lazy scan that selects only needed columns for aggregation.
scan_pruned = pl.scan_parquet(file_path).select([
    "customer_id",
    "order_amount",
    "channel",
])

# Define a simple aggregation that uses only selected columns.
# The lazy optimizer can prune unused columns here too, so timings may be close.
agg_all = scan_all.group_by("channel").agg([
    pl.col("order_amount").mean().alias("avg_order_amount"),
])

# Define the same aggregation using the pruned lazy scan.
agg_pruned = scan_pruned.group_by("channel").agg([
    pl.col("order_amount").mean().alias("avg_order_amount"),
])

# Helper function to time lazy query execution with minimal overhead.
def run_and_time_lazy(query, label):
    start = time.time()
    result = query.collect()
    duration = time.time() - start
    print(f"{label} took {duration:.4f} seconds, columns: {result.width}.")
    return result

# Run both queries and compare timings and column counts.
result_all = run_and_time_lazy(agg_all, "Aggregation with all columns")

result_pruned = run_and_time_lazy(agg_pruned, "Aggregation with pruned columns")

# Show that both results are logically identical despite different column usage.
print("Results equal:", result_all.equals(result_pruned))



### **2.2. Compact Data Types**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_02_02.jpg?v=1766900526" width="250">



>* Smaller data types cut per-column memory use
>* Match type to value range to save gigabytes

>* Match data types to real value ranges
>* Use categorical codes instead of repeated strings

>* Use dates and booleans to save memory
>* Right-sized types compound into faster, larger pipelines



In [None]:
#@title Python Code - Compact Data Types

# Show how compact data types reduce memory usage in Polars pipelines.
# Compare wide default types with smaller integer and categorical types.
# Print memory usage for both schemas using a simple retail style example.

import polars as pl
import numpy as np

# Create a small example dataset representing daily store sales counts.
num_rows = 1_000_000
store_ids = np.random.randint(1, 51, size=num_rows)
items_sold = np.random.randint(0, 500, size=num_rows)
category_names = np.random.choice(["Grocery", "Clothing", "Electronics", "Toys"], size=num_rows)

# Build a lazy frame using default wide integer and string types.
lf_wide = pl.LazyFrame(
    {
        "store_id": store_ids,
        "items_sold": items_sold,
        "category": category_names,
    }
)

# Build a lazy frame using compact integer and categorical types.
lf_compact = pl.LazyFrame(
    {
        "store_id": store_ids.astype("int16"),
        "items_sold": items_sold.astype("int16"),
        "category": category_names,
    }
).with_columns(pl.col("category").cast(pl.Categorical))

# Collect both frames into memory so we can inspect their schemas and sizes.
df_wide = lf_wide.collect()
df_compact = lf_compact.collect()

# Helper function to estimate memory usage using Polars estimated_size method.
def estimate_megabytes(df: pl.DataFrame) -> float:
    return df.estimated_size() / (1024 * 1024)

# Print schemas and memory usage for wide and compact representations.
print("Wide schema types:", df_wide.schema)
print("Compact schema types:", df_compact.schema)
print("Wide frame megabytes:", round(estimate_megabytes(df_wide), 2))
print("Compact frame megabytes:", round(estimate_megabytes(df_compact), 2))
print("Memory reduction megabytes:", round(estimate_megabytes(df_wide) - estimate_megabytes(df_compact), 2))
print("Memory reduction percent:", round(100 * (estimate_megabytes(df_wide) - estimate_megabytes(df_compact)) / estimate_megabytes(df_wide), 2))



### **2.3. Minimizing Data Copies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_02_03.jpg?v=1766900540" width="250">



>* Limit full dataset copies between transformations
>* Use views and delay materialization to save memory

>* Delay materializing intermediates; avoid unnecessary full copies
>* Let optimizer combine steps and reuse buffers

>* Avoid hidden copy operations and unnecessary conversions
>* Share base data, reuse heavy work across branches



In [None]:
#@title Python Code - Minimizing Data Copies

# Demonstrate minimizing data copies using Polars lazy evaluation pipeline.
# Compare eager materialization versus single lazy pipeline execution memory behavior.
# Show that delaying materialization reduces temporary memory usage for transformations.

import polars as pl
import numpy as np

n_rows = 1_000_000
np.random.seed(42)

base_df = pl.DataFrame({"user_id": np.random.randint(0, 100_000, n_rows)})
base_df = base_df.with_columns(pl.col("user_id").cast(pl.Int32))

# Eager pipeline: each step runs immediately and returns a new DataFrame.
eager_df = base_df.with_columns(pl.col("user_id").alias("user_id_copy"))
eager_df = eager_df.with_columns((pl.col("user_id_copy") % 100).alias("bucket"))

print("Eager pipeline shape and columns:", eager_df.shape, list(eager_df.columns))

# Lazy pipeline: start from the same base data without writing or copying it.
lazy_df = base_df.lazy()

# Steps are only recorded here; nothing runs until collect() is called.
lazy_pipeline = (
    lazy_df
    .with_columns(pl.col("user_id").alias("user_id_copy"))
    .with_columns((pl.col("user_id_copy") % 100).alias("bucket"))
)

result_df = lazy_pipeline.collect()

print("Lazy pipeline shape and columns:", result_df.shape, list(result_df.columns))




## **3. Profiling Polars Pipelines**

### **3.1. Fair Timing Methods**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_03_01.jpg?v=1766900556" width="250">



>* Measure realistic, end-to-end pipelines, not fragments
>* Keep data, steps, and logic identical across libraries

>* Separate setup from runs; repeat measurements carefully
>* Consider cold versus warm runs and caching

>* Include full lazy execution, not just planning
>* Define consistent timing window, including IO costs



In [None]:
#@title Python Code - Fair Timing Methods

# Demonstrate fair timing for Pandas and Polars pipelines end to end.
# Separate setup from execution and discard slow warmup runs fairly.
# Ensure equivalent logic and clear timing windows for both libraries.

import timeit, textwrap, statistics as stats

setup_code = textwrap.dedent(
    """
import numpy as np
import pandas as pd
import polars as pl

n_rows = 500_000
np.random.seed(0)

values = np.random.rand(n_rows)
keys = np.random.randint(0, 1000, size=n_rows)

pdf = pd.DataFrame({"key": keys, "value": values})

pldf = pl.DataFrame({"key": keys, "value": values})

lazy_pldf = pldf.lazy()
"""
)

pandas_stmt = textwrap.dedent(
    """
result = (
    pdf[pdf["value"] > 0.5]
    .assign(value_squared=lambda df: df["value"] ** 2)
    .groupby("key", as_index=False)["value_squared"]
    .mean()
)
"""
)

polars_stmt = textwrap.dedent(
    """
result = (
    lazy_pldf
    .filter(pl.col("value") > 0.5)
    .with_columns((pl.col("value") ** 2).alias("value_squared"))
    .group_by("key")
    .agg(pl.col("value_squared").mean())
    .collect()
)
"""
)

warmup_runs = 1

repeat_runs = 5

pandas_times = timeit.repeat(
    stmt=pandas_stmt,
    setup=setup_code,
    repeat=repeat_runs + warmup_runs,
    number=1,
)

polars_times = timeit.repeat(
    stmt=polars_stmt,
    setup=setup_code,
    repeat=repeat_runs + warmup_runs,
    number=1,
)

pandas_steady = sorted(pandas_times[warmup_runs:])

polars_steady = sorted(polars_times[warmup_runs:])

pandas_mean = stats.mean(pandas_steady)

polars_mean = stats.mean(polars_steady)

print("Pandas steady runs seconds:", [round(t, 4) for t in pandas_steady])

print("Polars steady runs seconds:", [round(t, 4) for t in polars_steady])

print("Pandas mean seconds:", round(pandas_mean, 4))

print("Polars mean seconds:", round(polars_mean, 4))

print("Speedup factor Polars versus Pandas:", round(pandas_mean / polars_mean, 2))



### **3.2. Reading Profiling Results**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_03_02.jpg?v=1766900658" width="250">



>* Compare time spent in each pipeline stage
>* Identify consistent savings and operations benefiting most

>* Check memory and CPU, not time only
>* Compare efficiency, scalability, and multicore usage patterns

>* Use profiling insights to refine pipeline design
>* Iterate to build intuition for performance gains



In [None]:
#@title Python Code - Reading Profiling Results

# Compare simple Pandas and Polars timings and memory usage side by side.
# Show how to read profiling style results beyond single wall clock numbers.
# Keep output short, clear, and beginner friendly for quick Colab experimentation.

import time
import psutil
import numpy as np

import pandas as pd
import polars as pl

process = psutil.Process()
np.random.seed(42)

n_rows = 500000
n_groups = 50

sizes = np.random.randint(1, 500, size=n_rows)
weights = np.random.rand(n_rows) * 10.0

groups = np.random.randint(0, n_groups, size=n_rows)

pdf = pd.DataFrame({"group": groups, "size": sizes, "weight": weights})

pldf = pl.from_pandas(pdf)

# RSS deltas are only a rough signal: allocators may reuse or hold memory between steps.
start_mem = process.memory_info().rss
start_time = time.perf_counter()

pd_result = pdf.groupby("group").agg({"size": "sum", "weight": "mean"})

pd_time = time.perf_counter() - start_time
pd_mem = process.memory_info().rss - start_mem

start_mem = process.memory_info().rss
start_time = time.perf_counter()

pl_result = (
    pldf.lazy()
    .group_by("group")
    .agg([pl.col("size").sum(), pl.col("weight").mean()])
    .collect()
)

pl_time = time.perf_counter() - start_time
pl_mem = process.memory_info().rss - start_mem

print("Pandas time seconds:", round(pd_time, 4))
print("Polars time seconds:", round(pl_time, 4))
print("Pandas memory bytes:", pd_mem)
print("Polars memory bytes:", pl_mem)

print("Faster library overall:", "Polars" if pl_time < pd_time else "Pandas")
print("Lower memory library:", "Polars" if pl_mem < pd_mem else "Pandas")

print("Pandas result sample:")
print(pd_result.head(3))

print("Polars result sample:")
print(pl_result.head(3))



### **3.3. Practical Performance Targets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_04/Lecture_B/image_03_03.jpg?v=1766900672" width="250">



>* Turn raw timings into context-specific performance goals
>* Tie benchmarks to real business and user needs

>* Set relative targets like speedup and memory cuts
>* Tie targets to real workflows and user expectations

>* Set tiered targets from baseline to ambitious
>* Link tiers to profiling metrics and know when to stop tuning
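
Tiered targets can be written down as data and checked against measured timings. The sketch below is a minimal illustration. The tier names, thresholds, and timings are hypothetical placeholders, not recommended values.

```python
def classify_speedup(baseline_seconds, candidate_seconds, tiers):
    """Return the speedup factor and the highest tier it reaches."""
    speedup = baseline_seconds / candidate_seconds
    reached = "below baseline"
    # Walk the tiers from lowest to highest threshold and keep the last one met.
    for name, minimum in sorted(tiers.items(), key=lambda item: item[1]):
        if speedup >= minimum:
            reached = name
    return speedup, reached


# Hypothetical tiers expressed as minimum speedup factors over the Pandas baseline.
tiers = {"baseline": 1.0, "good": 2.0, "ambitious": 5.0}

# Hypothetical measurements: 12 seconds for Pandas, 4 seconds for Polars.
speedup, tier = classify_speedup(12.0, 4.0, tiers)

print(f"Speedup: {speedup:.1f}x, tier reached: {tier}")
```

Once the agreed tier is reached, further tuning can stop unless a new requirement raises the target.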





In this lecture, you learned to:
- Optimize Polars IO operations by choosing appropriate file formats and scan options. 
- Reduce memory usage in Polars pipelines through column pruning, type choices, and lazy evaluation. 
- Profile and benchmark Polars pipelines against existing Pandas implementations to quantify performance improvements. 

In the next module (Module 5), we will cover 'Migration Patterns and Tools'.