# <font color="#418FDE" size="6.5" uppercase>**Selecting and Filtering**</font>

>Last update: 20260101.
    
By the end of this lecture, you will be able to:
- Apply Polars column selection and projection syntax to reproduce common pandas selection patterns. 
- Translate pandas boolean indexing and query-style filters into Polars expressions on DataFrames and LazyFrames. 
- Refactor chained pandas selection and filtering code into readable, idiomatic Polars pipelines. 


## **1. Polars Column Selection**

### **1.1. Column Name Patterns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_01_01.jpg?v=1767312877" width="250">



>* Use name patterns instead of listing columns
>* Group related columns by prefixes or suffixes

>* Consistent column naming enables powerful pattern selection
>* Patterns reduce typing, mistakes, and improve transparency

>* Consistent name patterns lead to better schemas
>* Patterns keep selection code stable and portable



In [None]:
#@title Python Code - Column Name Patterns

# Demonstrate selecting Polars columns using simple name patterns.
# Show prefix and suffix based column selection with clear printed output.
# Compare manual column listing with concise pattern based selection.
#pip install polars.

#import polars library for DataFrame creation and selection.
import polars as pl

#create a small DataFrame with patterned column names.
df = pl.DataFrame({
    "price_usd": [10, 20, 30],
    "price_eur": [9, 18, 27],
    "sales_count": [2, 3, 4],
    "cost_usd": [5, 7, 9],
})

#print the full DataFrame to understand available columns.
print("Full DataFrame with all columns:\n", df)

#select columns manually by listing each price related column.
manual_price = df.select(["price_usd", "price_eur"])

#print manual selection result for comparison clarity.
print("\nManual selection of price columns:\n", manual_price)

#select columns automatically using prefix pattern for price columns.
pattern_price = df.select(pl.col("^price_.*$"))

#print pattern based selection showing same result more concisely.
print("\nPattern selection using prefix 'price_*':\n", pattern_price)

#select columns automatically using suffix pattern for usd currency.
pattern_usd = df.select(pl.col("^.*_usd$"))

#print suffix based selection to highlight flexible name patterns.
print("\nPattern selection using suffix '*_usd':\n", pattern_usd)



### **1.2. Dropping Columns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_01_02.jpg?v=1767312919" width="250">



>* Dropping columns shapes the dataset flowing forward
>* Select columns to keep; others drop implicitly

>* Sometimes explicitly drop irrelevant or duplicate columns
>* Use projections to implement dropping by name pattern

>* Combine exclusions and inclusions in one projection
>* Keeps important columns as schemas grow and change



In [None]:
#@title Python Code - Dropping Columns

# Demonstrate dropping columns using Polars projections.
# Compare keeping columns versus explicitly dropping columns.
# Show how projection shapes downstream analysis data.

# pip install polars for DataFrame operations.
# pip install pyarrow for optional backend support.

# Import polars library for column operations.
import polars as pl

# Create a small customer DataFrame example.
df = pl.DataFrame({"customer_id": [1, 2, 3], "age_years": [25, 40, 31], "raw_notes": ["VIP", "Late payer", "New"], "temp_score": [0.8, 0.3, 0.6]})

# Show original DataFrame with all columns.
print("Original DataFrame with all columns:")
print(df)

# Keep only important identifier and metric columns.
project_keep = df.select([pl.col("customer_id"), pl.col("temp_score")])

# Show DataFrame after positive selection projection.
print("\nAfter projection keeping id and score:")
print(project_keep)

# Drop sensitive or temporary columns using exclude pattern.
project_drop = df.select(pl.all().exclude(["raw_notes", "temp_score"]))

# Show DataFrame after dropping notes and temporary score.
print("\nAfter dropping notes and temporary score:")
print(project_drop)



### **1.3. Expression Based Selection**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_01_03.jpg?v=1767312936" width="250">



>* Define output columns using Polars expressions directly
>* Combine selection and transformations into one readable step

>* Like spreadsheet formulas that create visible columns
>* Select columns and define new ones in-place

>* Lazy expressions improve efficiency and avoid redundancy
>* Selections stay readable, showing final engineered columns



In [None]:
#@title Python Code - Expression Based Selection

# Demonstrate Polars expression based column selection with simple sales data.
# Show selecting original, transformed, and newly derived columns together.
# Keep everything beginner friendly and runnable inside Google Colab.

# !pip install polars pyarrow fsspec.

# Import Polars library for DataFrame operations.
import polars as pl

# Create a small DataFrame with simple sales information.
data = pl.DataFrame({"product": ["A", "B", "C"], "price_usd": [10, 20, 15], "quantity": [2, 1, 4]})

# Use expression based selection to define resulting columns directly.
result = data.select([
    pl.col("product"),
    (pl.col("price_usd") * 1.1).alias("price_with_tax_usd"),
    (pl.col("price_usd") * pl.col("quantity")).alias("revenue_usd"),
    pl.when(pl.col("price_usd") * pl.col("quantity") > 30).then(pl.lit("high")).otherwise(pl.lit("low")).alias("revenue_flag"),
])

# Print original DataFrame to compare with expression based selection result.
print("Original DataFrame:\n", data)

# Print resulting DataFrame showing selected and derived columns together.
print("\nSelected and derived columns:\n", result)



## **2. Filtering Rows with Expressions**

### **2.1. Boolean Filters with pl.col**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_02_01.jpg?v=1767312993" width="250">



>* Filters are boolean expressions built from columns
>* Expressions return true rows, replacing pandas masks

>* Use column helper to define row conditions
>* Write clear filters directly on relevant columns

>* Combine multiple column conditions into one expression
>* Use composite boolean expressions for clear, predictable filtering



In [None]:
#@title Python Code - Boolean Filters with pl.col

# Demonstrate basic boolean filtering using Polars column helper expressions.
# Show how column expressions create boolean masks for filtering rows.
# Compare multiple simple filters on a small customer transactions DataFrame.

# pip install polars if running outside Google Colab environment.

# Import Polars library for DataFrame creation and filtering.
import polars as pl

# Create a small DataFrame with customer transaction information.
customers_df = pl.DataFrame({"age": [25, 32, 45, 29], "amount_usd": [40, 120, 75, 200], "country": ["US", "US", "UK", "CA"]})

# Show the original DataFrame to understand available columns.
print("Original DataFrame:")
print(customers_df)

# Build a boolean expression selecting customers older than thirty years.
age_filter_expr = pl.col("age") > 30

# Apply the age filter expression using filter method on DataFrame.
older_customers_df = customers_df.filter(age_filter_expr)

# Display filtered rows where age condition is true only.
print("\nCustomers older than thirty:")
print(older_customers_df)

# Build a boolean expression selecting large purchases above one hundred dollars.
large_purchase_expr = pl.col("amount_usd") > 100

# Apply the large purchase filter expression using filter method.
large_purchases_df = customers_df.filter(large_purchase_expr)

# Display filtered rows where purchase amount condition is true only.
print("\nPurchases greater than one hundred dollars:")
print(large_purchases_df)



### **2.2. Logical Condition Chaining**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_02_02.jpg?v=1767313011" width="250">



>* Combine multiple conditions using logical boolean operators
>* Think in optimized expression trees instead of masks

>* Use clear, grouped conditions to maintain readability
>* Combine multiple rules carefully to avoid logic mistakes

>* Combine many conditions into one Polars expression
>* Improves performance, readability, and query optimization



In [None]:
#@title Python Code - Logical Condition Chaining

# Demonstrate logical condition chaining with simple Polars filters.
# Compare separate filters versus one combined chained condition expression.
# Show how parentheses clarify complex logical filter groupings.

# !pip install polars pyarrow.

# Import Polars for DataFrame creation and filtering.
import polars as pl

# Create a small customer DataFrame with simple example data.
customers = pl.DataFrame({"name": ["Ann", "Bob", "Cara", "Dan"], "active": [True, False, True, True], "spend_usd": [120.0, 40.0, 300.0, 80.0], "region": ["US", "US", "EU", "US"]})

# Show the original DataFrame for reference before filtering operations.
print("Original customers DataFrame:")
print(customers)

# Build separate simple conditions for activity and spending threshold.
cond_active = pl.col("active")  # a boolean column is already a valid predicate
cond_spend = pl.col("spend_usd") >= 100

# Chain conditions using logical AND to keep active high spending customers.
filtered_and = customers.filter(cond_active & cond_spend)

# Display result of chained AND condition filtering operation.
print("\nActive customers spending at least 100 USD:")
print(filtered_and)

# Build a third condition for region being United States specifically.
cond_region_us = pl.col("region") == "US"

# Chain conditions mixing AND and OR with explicit parentheses grouping.
filtered_complex = customers.filter(cond_active & (cond_spend | cond_region_us))

# Display result showing logical chaining with mixed operators and parentheses.
print("\nActive customers with high spend or located in US region:")
print(filtered_complex)



### **2.3. Eager vs Lazy Filtering**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_02_03.jpg?v=1767313026" width="250">



>* Eager mode runs each filter immediately, materializing DataFrames
>* Lazy mode stores a plan, executing filters later

>* Eager filtering creates many intermediate DataFrames
>* Lazy filtering optimizes whole query, saving resources

>* Use eager filtering for small, interactive exploration
>* Use lazy filtering for scalable, optimized data pipelines



In [None]:
#@title Python Code - Eager vs Lazy Filtering

# Demonstrate eager versus lazy filtering using Polars DataFrames and LazyFrames.
# Show that eager executes immediately while lazy builds a deferred query plan.
# Compare printed results to see identical logic with different execution strategies.

# !pip install polars pyarrow fsspec.

# Import polars library for DataFrame and LazyFrame operations.
import polars as pl

# Create a small DataFrame representing daily temperatures in Fahrenheit.
df = pl.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu", "Fri"], "temp_f": [70, 65, 80, 90, 75]})

# Show original DataFrame to understand starting point before filtering.
print("Original DataFrame:")
print(df)

# Apply eager filtering directly, executed immediately on the DataFrame.
filtered_eager = df.filter(pl.col("temp_f") > 75)

# Print eager filtering result, showing only hotter days above threshold.
print("\nEager filtering result:")
print(filtered_eager)

# Build a LazyFrame from the same DataFrame for lazy execution.
lazy_plan = df.lazy()

# Add lazy filter expression, not executed until collect is called.
lazy_filtered = lazy_plan.filter(pl.col("temp_f") > 75)

# Print the lazy query plan to highlight deferred execution behavior.
print("\nLazy query plan:")
print(lazy_filtered.explain())

# Collect lazy result, triggering actual execution and materialization of filtered data.
filtered_lazy = lazy_filtered.collect()

# Print lazy filtering result, matching eager result but optimized internally.
print("\nLazy filtering collected result:")
print(filtered_lazy)



## **3. Refactoring Filter Chains**

### **3.1. Replacing chained pandas loc calls**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_03_01.jpg?v=1767313089" width="250">



>* Replace many pandas filters with one pipeline
>* Polars groups all filter conditions for clarity

>* Describe the final filtered dataset in Polars
>* Combine all filter rules into one pipeline

>* Separate filtering from column selection and calculations
>* Single Polars pipeline makes logic clearer and maintainable



In [None]:
#@title Python Code - Replacing chained pandas loc calls

# Demonstrate replacing chained pandas loc filters with one clear Polars pipeline.
# Show equivalent customer filtering logic in pandas and Polars side by side.
# Keep data tiny and output short for easy beginner friendly understanding.

# !pip install polars pandas.

# Import required libraries for pandas and Polars examples.
import pandas as pd
import polars as pl

# Create small customer dataset with country, status, and lifetime value columns.
customers_data = {
    "customer_id": [1, 2, 3, 4],
    "country": ["USA", "USA", "Canada", "USA"],
    "status": ["active", "inactive", "active", "active"],
    "lifetime_value_usd": [1200, 300, 2500, 800],
}

# Build pandas DataFrame from dictionary for chained loc filtering demonstration.
df_pd = pd.DataFrame(customers_data)

# Show original pandas DataFrame to understand starting customer records clearly.
print("Pandas original DataFrame:")
print(df_pd)

# Apply chained pandas loc filters stepwise for country, status, and value.
df_pd_filtered = df_pd.loc[df_pd["country"] == "USA"]

# Continue pandas filtering chain for active status customers only.
df_pd_filtered = df_pd_filtered.loc[df_pd_filtered["status"] == "active"]

# Final pandas filter keeps customers with lifetime value above threshold.
df_pd_filtered = df_pd_filtered.loc[df_pd_filtered["lifetime_value_usd"] > 1000]

# Display final pandas filtered result after multiple chained loc calls.
print("\nPandas filtered customers:")
print(df_pd_filtered)

# Build Polars DataFrame from same dictionary for pipeline style filtering.
df_pl = pl.DataFrame(customers_data)

# Apply single Polars filter combining all conditions in one clear expression.
df_pl_filtered = df_pl.filter(
    (pl.col("country") == "USA")
    & (pl.col("status") == "active")
    & (pl.col("lifetime_value_usd") > 1000)
)

# Display Polars filtered result showing same customers using one pipeline.
print("\nPolars filtered customers:")
print(df_pl_filtered)



### **3.2. Readable Polars Pipelines**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_03_02.jpg?v=1767313104" width="250">



>* Use one clear Polars pipeline for transformations
>* Tell a readable story from raw to result

>* Group related filters into clear conceptual steps
>* Match pipeline structure and names to domain language

>* Keep steps concise, explicit, and conceptually complete
>* Name intermediate columns clearly to support understanding



In [None]:
#@title Python Code - Readable Polars Pipelines

# Demonstrate readable Polars filtering pipelines with simple customer orders example.
# Chain related filters into a single clear Polars pipeline expression.
# Show how a clearly named derived column improves understanding for beginners and collaborators.

# !pip install polars pyarrow --quiet.

# Import Polars for DataFrame and expression based pipelines.
import polars as pl

# Create a small example DataFrame with customer orders data.
orders = pl.DataFrame({"customer": ["Alice", "Bob", "Cara", "Dan"],
                       "region": ["West", "East", "West", "South"],
                       "order_value_usd": [120.0, 40.0, 300.0, 80.0],
                       "days_since_order": [10, 50, 5, 20]})

# Show the original raw orders DataFrame for quick reference.
print("Raw orders DataFrame:")
print(orders)

# Build a readable pipeline describing high value recent western customers.
pipeline_result = (
    orders
    .filter(pl.col("region") == "West")
    .filter(pl.col("order_value_usd") >= 100.0)
    .filter(pl.col("days_since_order") <= 30)
    .with_columns((pl.col("order_value_usd") > 250.0)
                  .alias("is_very_high_value"))
)

# Print the final filtered result with clear narrative meaning.
print("\nHigh value recent western customers:")
print(pipeline_result)



### **3.3. Validating Filtered Outputs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_03_03.jpg?v=1767313143" width="250">



>* Verify refactored Polars filters match intended results
>* Compare with trusted baselines, crucial in high-stakes domains

>* Compare global summaries between old and new pipelines
>* Check local edge-case rows to confirm matching behavior

>* Make validation ongoing with regression-style checks
>* Document filter intent to prevent hidden data issues



In [None]:
#@title Python Code - Validating Filtered Outputs

# Demonstrate validating filtered outputs between pandas and Polars pipelines.
# Show global checks like row counts and simple aggregate comparisons.
# Show local checks using specific known example customer records.

# !pip install polars pandas.

# Import required libraries for pandas and Polars usage.
import pandas as pd
import polars as pl

# Create small example sales data with simple customer transactions.
data = {
    "customer_id": [1, 2, 3, 4, 5, 6],
    "state": ["CA", "CA", "NY", "CA", "TX", "CA"],
    "amount_usd": [120.0, 80.0, 200.0, 150.0, 50.0, 300.0],
    "is_test_account": [False, True, False, False, False, True],
}

# Build pandas DataFrame representing original trusted filtering logic.
df_pd = pd.DataFrame(data)

# Apply original pandas filter for California real customers over threshold.
filtered_pd = df_pd[(df_pd["state"] == "CA") & (df_pd["amount_usd"] >= 100.0) & (~df_pd["is_test_account"])]

# Build Polars DataFrame representing refactored pipeline version.
df_pl = pl.DataFrame(data)

# Apply Polars filter using expression based pipeline style.
filtered_pl = (
    df_pl
    .filter(
        (pl.col("state") == "CA")
        & (pl.col("amount_usd") >= 100.0)
        & (~pl.col("is_test_account"))
    )
)

# Perform global validation by comparing row counts between both filtered results.
print("Row counts pandas versus polars:", len(filtered_pd), len(filtered_pl))

# Perform global validation by comparing total revenue for filtered transactions.
print("Total revenue pandas versus polars:", filtered_pd["amount_usd"].sum(), filtered_pl["amount_usd"].sum())

# Select local check customers that should pass filters for manual verification.
check_ids = [1, 4]

# Show pandas inclusion results for selected customer identifiers.
print("Pandas included ids:", sorted(filtered_pd["customer_id"].tolist()))

# Show Polars inclusion results for selected customer identifiers.
print("Polars included ids:", sorted(filtered_pl["customer_id"].to_list()))

# Confirm both systems include exactly the benchmark identifiers for confidence.
print("Local check passed:", sorted(check_ids) == sorted(filtered_pd["customer_id"].tolist()) == sorted(filtered_pl["customer_id"].to_list()))



# <font color="#418FDE" size="6.5" uppercase>**Selecting and Filtering**</font>


In this lecture, you learned to:
- Apply Polars column selection and projection syntax to reproduce common pandas selection patterns. 
- Translate pandas boolean indexing and query-style filters into Polars expressions on DataFrames and LazyFrames. 
- Refactor chained pandas selection and filtering code into readable, idiomatic Polars pipelines. 

In the next module (Module 3), we will cover 'Transformations and Pipelines'.