# <font color="#418FDE" size="6.5" uppercase>**Selecting and Filtering**</font>

>Last update: 20251227.
    
By the end of this lecture, you will be able to:
- Select and rename columns in Polars using expressions that parallel common Pandas patterns. 
- Filter rows in Polars based on boolean conditions and combined predicates. 
- Apply simple column expressions such as arithmetic, string operations, and conditional logic in Polars. 


## **1. Selecting Polars Columns**

### **1.1. Column Name Patterns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_01_01.jpg?v=1766893954" width="250">



>* Use name patterns instead of listing columns
>* Select column groups by shared prefixes, suffixes, keywords

>* Pattern-based selection adapts as schemas change
>* Automatically includes new related columns, reducing errors

>* Patterned names promote clear, consistent data schemas
>* They simplify selecting, documenting, and reusing related columns



In [None]:
#@title Python Code - Column Name Patterns

# Demonstrate selecting Polars columns using simple name patterns.
# Show prefix, suffix, and substring based column selections clearly.
# Compare pattern selections with manual column listing briefly.

import polars as pl

# Create a small example DataFrame with patterned column names.
data = {
    "demo_age_years": [25, 40, 32],
    "demo_income_usd": [50000, 82000, 61000],
    "temp_F_morning": [68.0, 70.5, 69.2],
    "temp_F_evening": [72.3, 73.1, 71.8],
}

# Build the DataFrame from the dictionary using Polars constructor.
df = pl.DataFrame(data)

# Show the full DataFrame so patterns in column names are visible.
print("Full DataFrame with patterned column names:")
print(df)

# pl.col treats a string as a regex only when it starts with ^ and ends with $.
# Select all columns starting with prefix "demo_" using an anchored regex.
demo_cols = df.select(pl.col("^demo_.*$"))

# Select all columns ending with suffix "_evening" using an anchored regex.
evening_cols = df.select(pl.col("^.*_evening$"))

# Select all columns containing substring "temp" anywhere in their names.
temp_cols = df.select(pl.col("^.*temp.*$"))

# Print the three pattern based selections to compare results clearly.
print("\nColumns starting with 'demo_':")
print(demo_cols)

print("\nColumns ending with '_evening':")
print(evening_cols)

print("\nColumns containing 'temp' substring:")
print(temp_cols)



### **1.2. Managing Columns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_01_02.jpg?v=1766893969" width="250">



>* Choose, drop, and reorder columns intentionally
>* Use expressions to combine selection, exclusion, reordering

>* Combine keep and drop in one selection
>* Reorder key columns first for easier inspection

>* Rename columns during selection for cleaner schemas
>* Standardized names simplify multi-source data analysis



In [None]:
#@title Python Code - Managing Columns

# Demonstrate managing Polars columns with selection and exclusion.
# Show keeping important columns and dropping temporary helper columns.
# Also show renaming and reordering columns in a single expression.

import polars as pl

# Create a simple transactions DataFrame with extra temporary columns.
transactions = pl.DataFrame({
    "transaction_id": [1, 2, 3],
    "customer_id": [101, 102, 103],
    "amount_usd": [25.0, 40.5, 12.0],
    "temp_check": [0.1, 0.2, 0.3],
    "debug_flag": [True, False, True],
})

# Show the original DataFrame with all columns included.
print("Original transactions DataFrame with all columns:")
print(transactions)

# Select important business columns and drop temporary helper columns.
cleaned = transactions.select([
    pl.col("transaction_id"),
    pl.col("customer_id"),
    pl.col("amount_usd"),
])

# Show the cleaned DataFrame after dropping temporary helper columns.
print("\nCleaned DataFrame without temporary helper columns:")
print(cleaned)

# Reorder columns and rename amount_usd to total_amount_usd for clarity.
# Only the renamed column needs an alias; the others keep their names.
reordered = transactions.select([
    pl.col("customer_id"),
    pl.col("transaction_id"),
    pl.col("amount_usd").alias("total_amount_usd"),
])

# Show the reordered and renamed columns in the final DataFrame.
print("\nReordered and renamed columns in final DataFrame:")
print(reordered)



### **1.3. Expression Based Selection**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_01_03.jpg?v=1766893983" width="250">



>* Columns become dynamic transformations, not fixed labels
>* One expression defines, transforms, and renames output columns

>* Combine select, transform, and rename in one step
>* Describe final columns; Polars optimizes execution automatically

>* Think declaratively; Polars handles execution details
>* Define, transform, and label columns in one expression



In [None]:
#@title Python Code - Expression Based Selection

# Demonstrate Polars expression based column selection and transformation.
# Show selecting, creating, and renaming columns in one expression.
# Compare original DataFrame with transformed selection result clearly.

import polars as pl

# Create a simple retail transactions DataFrame with three columns.
df = pl.DataFrame({"customer_id": [1, 2, 3], "items": [2, 1, 4], "price_usd": [5.0, 10.0, 3.5]})

# Show the original DataFrame for clear comparison later.
print("Original DataFrame:")
print(df)

# Use expression based selection to define the final schema.
result = df.select([
    pl.col("customer_id").alias("customer"),
    (pl.col("items") * pl.col("price_usd")).alias("total_revenue_usd"),
    (pl.col("price_usd") * 1.1).round(2).alias("price_usd_with_tax"),
])

# Print the transformed DataFrame showing selected and computed columns.
print("\nTransformed selection with expressions:")
print(result)



## **2. Filtering Rows in Polars**

### **2.1. Boolean Masks Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_02_01.jpg?v=1766893998" width="250">



>* Filtering uses a boolean mask per row
>* True rows are kept, false rows removed

>* Boolean masks are first-class Polars expression results
>* Each mask value aligns with a specific row

>* Boolean masks are reusable, composable filter expressions
>* They map human conditions to concrete filtered rows



In [None]:
#@title Python Code - Boolean Masks Basics

# Demonstrate basic boolean masks with simple Polars DataFrame filtering.
# Show how conditions create true or false values per DataFrame row.
# Apply boolean masks to keep only rows that satisfy chosen conditions.

import polars as pl

# Create a small DataFrame representing simple customer orders in dollars.
orders = pl.DataFrame({"customer": ["Ann", "Bob", "Cara", "Dan"], "total_dollars": [15, 45, 8, 60]})

# Build a boolean mask expression for orders above a chosen dollar threshold.
mask_expr = pl.col("total_dollars") > 20

# Show the DataFrame and the evaluated boolean mask side by side.
print("Original orders DataFrame and boolean mask values:")
print(orders.with_columns(mask_expr.alias("above_20")))

# Use the same boolean mask expression to filter and keep only expensive orders.
filtered_orders = orders.filter(mask_expr)

# Display the filtered DataFrame that only includes rows where mask was true.
print("\nFiltered orders where total_dollars is greater than twenty:")
print(filtered_orders)



### **2.2. Logical Conditions in Polars**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_02_02.jpg?v=1766894013" width="250">



>* Combine conditions with and, or, and not
>* Use logical building blocks for precise filters

>* Each comparison creates a boolean column per row
>* Combine booleans with AND or OR for filtering

>* Use NOT to exclude unwanted rows precisely
>* Combine AND, OR, NOT and verify logic carefully



In [None]:
#@title Python Code - Logical Conditions in Polars

# Demonstrate logical conditions with Polars filters.
# Show and, or, and not combinations clearly.
# Print filtered tables with a few example rows.

import polars as pl

# Create a small orders DataFrame with simple columns.
orders = pl.DataFrame({"order_id": [1, 2, 3, 4, 5], "amount_usd": [40, 120, 75, 200, 15], "country": ["USA", "USA", "Canada", "Mexico", "USA"], "is_expedited": [True, False, True, False, True]})

# Define a condition for high value orders above one hundred dollars.
high_value = pl.col("amount_usd") > 100

# Define a condition for domestic orders shipped inside the United States.
domestic_usa = pl.col("country") == "USA"

# Filter orders that are both high value and domestic using logical and.
filtered_and = orders.filter(high_value & domestic_usa)

# Filter orders that are high value or expedited using logical or.
filtered_or = orders.filter(high_value | pl.col("is_expedited"))

# Filter orders that are not domestic using logical not negation.
filtered_not = orders.filter(~domestic_usa)

# Print results for each logical combination with clear labels.
print("High value and domestic orders:")
print(filtered_and)

print("\nHigh value or expedited orders:")
print(filtered_or)

print("\nOrders not from USA:")
print(filtered_not)



### **2.3. Null Aware Filtering**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_02_03.jpg?v=1766894026" width="250">



>* Nulls mean missing or unknown, not regular values
>* Comparisons with null become null and exclude rows

>* Combined conditions can drop rows with nulls
>* Explicitly test or recode nulls to control filtering

>* Decide how missing values affect your analysis
>* Build filters that explicitly include or reinterpret nulls



In [None]:
#@title Python Code - Null Aware Filtering

# Demonstrate null aware filtering behavior using simple Polars DataFrame.
# Show how comparisons with null values behave during filtering operations.
# Compare default filtering with explicit null handling using is_not_null conditions.

import polars as pl

# Create a small DataFrame with ages and incomes including null values.
df = pl.DataFrame({"name": ["Ann", "Bob", "Cara", "Dan"], "age": [28, None, 42, None], "income_dollars": [50000, 62000, None, 45000]})

# Show the original DataFrame so we can see null positions clearly.
print("Original DataFrame with possible null values:\n", df)

# Filter rows where age is greater than thirty; rows with a null age are excluded automatically.
filtered_age = df.filter(pl.col("age") > 30)

# Display filtered result, notice rows with null age are missing from this output.
print("\nFiltered where age greater than thirty, null ages excluded:\n", filtered_age)

# Now filter using explicit null handling, require age not null before comparison.
filtered_explicit = df.filter(pl.col("age").is_not_null() & (pl.col("age") > 30))

# Display explicitly handled result, which matches previous but shows intentional null treatment.
print("\nFiltered with explicit null handling on age column:\n", filtered_explicit)



## **3. Core Column Expressions**

### **3.1. Numeric Column Arithmetic**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_03_01.jpg?v=1766894042" width="250">



>* Use arithmetic on columns to create insights
>* Operations run columnwise with lazy, fast execution

>* Combine columns, constants, and math functions flexibly
>* Chain multiple arithmetic steps without temporary columns

>* Watch dtypes, nulls, and precision in arithmetic
>* Handle zeros, null denominators, and extreme outliers



In [None]:
#@title Python Code - Numeric Column Arithmetic

# Demonstrate basic numeric column arithmetic using Polars DataFrame operations.
# Show how to compute revenue and discounts from price and quantity columns.
# Highlight how expressions operate on entire columns without manual loops.

import polars as pl

# Create a simple DataFrame with item prices and quantities.
data = {"item": ["apple", "banana", "orange"], "price_usd": [1.5, 0.75, 1.25], "quantity": [4, 6, 3]}

df = pl.DataFrame(data)

# Compute line revenue and discounted revenue using numeric column arithmetic.
result = df.with_columns([
    (pl.col("price_usd") * pl.col("quantity")).alias("line_revenue_usd"),
    (pl.col("price_usd") * 0.9).alias("discount_price_usd"),
])

# Print the original and transformed DataFrames to observe numeric arithmetic effects.
print("Original DataFrame with prices and quantities:")
print(df)

print("\nDataFrame with computed revenue and discounted prices:")
print(result)



### **3.2. String Operations and Regex**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_03_02.jpg?v=1766894057" width="250">



>* Use vectorized string transforms on text columns
>* Define column-wide operations instead of looping rows

>* Parse semi-structured text into meaningful columns
>* Chain column-level string steps into clean pipelines

>* Use regex patterns to match and extract text
>* Apply patterns columnwise to clean and validate data



In [None]:
#@title Python Code - String Operations and Regex

# Demonstrate basic Polars string operations on text columns.
# Show regex usage for extracting structured information from text.
# Compare original and transformed columns with concise printed output.

import polars as pl

# Create a simple DataFrame with messy text and product codes.
df = pl.DataFrame({"name": ["  alice smith  ", "BOB JONES", "cArOl king"], "email": ["alice@example.com", "bob_sales@shop.co.uk", "carol-2025@data.io"], "product_code": ["ELEC-123-US", "TOY-77-CA", "BOOK-999-UK"]})

# Build expressions that clean names and extract email domains.
clean_name_expr = pl.col("name").str.strip_chars().str.to_titlecase()
email_domain_expr = pl.col("email").str.extract(r"@(.+)$", group_index=1)

# Use regex to split product codes into category and numeric identifier parts.
category_expr = pl.col("product_code").str.extract(r"^([A-Z]+)-", group_index=1)
number_expr = pl.col("product_code").str.extract(r"-([0-9]+)-", group_index=1)

# Select original columns plus new transformed columns for comparison.
result = df.select([pl.col("name"), clean_name_expr.alias("clean_name"), pl.col("email"), email_domain_expr.alias("email_domain"), pl.col("product_code"), category_expr.alias("category"), number_expr.alias("item_number")])

# Print the final DataFrame to show string and regex results.
print(result)



### **3.3. Conditional Column Logic**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_B/image_03_03.jpg?v=1766894070" width="250">



>* Use if-then logic to build new columns
>* Encode business rules and classifications directly in transformations

>* Define clear true-or-false predicates for rows
>* Map predicates to outcomes for readable classifications

>* Combine conditional logic with other column expressions
>* Scales to large data using declarative rules



In [None]:
#@title Python Code - Conditional Column Logic

# Demonstrate simple conditional column logic using Polars expressions.
# Classify orders based on total amount and shipping distance thresholds.
# Show how conditions create readable business rule driven columns.

import polars as pl

# Create a small example DataFrame with order information.
orders = pl.DataFrame({"order_id": [1, 2, 3, 4], "amount_usd": [20, 120, 260, 80], "distance_miles": [5, 40, 15, 70]})

# Build a conditional expression that classifies order value tiers.
# Wrap labels in pl.lit: bare strings in then/otherwise are parsed as column names.
value_tier_expr = (
    pl.when(pl.col("amount_usd") < 50)
    .then(pl.lit("small"))
    .when(pl.col("amount_usd") < 200)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("large"))
)

# Build another conditional expression for shipping distance categories.
shipping_status_expr = (
    pl.when(pl.col("distance_miles") <= 10)
    .then(pl.lit("local"))
    .when(pl.col("distance_miles") <= 50)
    .then(pl.lit("regional"))
    .otherwise(pl.lit("long_haul"))
)

# Apply both conditional expressions to create new descriptive columns.
result = orders.with_columns([
    value_tier_expr.alias("value_tier"),
    shipping_status_expr.alias("shipping_status"),
])

# Print the final DataFrame to inspect conditional logic results.
print(result)



# <font color="#418FDE" size="6.5" uppercase>**Selecting and Filtering**</font>


In this lecture, you learned to:
- Select and rename columns in Polars using expressions that parallel common Pandas patterns. 
- Filter rows in Polars based on boolean conditions and combined predicates. 
- Apply simple column expressions such as arithmetic, string operations, and conditional logic in Polars. 

In the next module (Module 3), we will cover 'Transformations and Joins'.