# Pandas Fundamentals II - Part 2: Filtering Data

## Week 3, Day 1 (Wednesday) - April 23rd, 2025

### Overview
This is the second part of our Pandas Fundamentals II session, focusing on filtering data. Part 1 touched on basic filtering; here we go further, covering compound boolean expressions, string and date filters, and methods such as `.isin()`, `.between()`, and `.query()` for extracting subsets of a DataFrame.

### Learning Objectives
- Master complex filtering techniques in Pandas
- Use advanced boolean expressions to filter data
- Understand pattern matching for string filtering
- Apply SQL-like thinking to complex filtering operations
- Learn specialized filtering methods like `.query()` and `.isin()`

### Prerequisites
- Python fundamentals (Week 1)
- Pandas Fundamentals I (Week 2, Day 2)
- Indexing and Selection (Week 3, Part 1)

## 1. Introduction to Data Filtering

In the previous section, we covered basic data selection and indexing. Here, we'll focus on filtering, which allows us to extract specific rows from a DataFrame based on conditions. In SQL terms, we're going deeper into the `WHERE` clause functionality.

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Create a sample e-commerce dataset
data = {
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004', 'ORD005', 'ORD006', 'ORD007', 'ORD008'],
    'customer_id': ['C101', 'C102', 'C101', 'C103', 'C104', 'C103', 'C105', 'C102'],
    'product_category': ['Electronics', 'Clothing', 'Home', 'Electronics', 'Books', 'Home', 'Electronics', 'Clothing'],
    'product_name': ['Laptop', 'T-shirt', 'Lamp', 'Smartphone', 'Python Book', 'Chair', 'Headphones', 'Jeans'],
    'price': [1200.50, 25.99, 45.75, 800.00, 35.50, 120.99, 75.25, 49.99],
    'quantity': [1, 2, 1, 1, 3, 2, 1, 1],
    'order_date': ['2025-01-15', '2025-01-20', '2025-01-22', '2025-02-05', '2025-02-10', '2025-02-15', '2025-02-20', '2025-03-01'],
    'is_delivered': [True, True, True, True, False, False, True, False]
}

# Create DataFrame
orders_df = pd.DataFrame(data)

# Convert date column to datetime type
orders_df['order_date'] = pd.to_datetime(orders_df['order_date'])

# Add total amount column
orders_df['total_amount'] = orders_df['price'] * orders_df['quantity']

print("Sample E-commerce Orders DataFrame:")
print(orders_df)

## 2. Review of Basic Boolean Filtering

Before diving into advanced techniques, let's quickly review basic boolean filtering:

In [None]:
# Filter orders for Electronics products
electronics_orders = orders_df[orders_df['product_category'] == 'Electronics']
print("Orders for Electronics products:")
print(electronics_orders)

# Filter expensive orders (total over $100)
expensive_orders = orders_df[orders_df['total_amount'] > 100]
print("\nExpensive orders (>$100):")
print(expensive_orders)

# Combining conditions with & (AND) and | (OR)
# Electronics orders that are delivered
# 'is_delivered' is already boolean, so it can be used as a mask without '== True'
delivered_electronics = orders_df[(orders_df['product_category'] == 'Electronics') &
                                  orders_df['is_delivered']]
print("\nDelivered Electronics orders:")
print(delivered_electronics)

## 3. Advanced Boolean Expressions

Now let's explore more complex boolean expressions for filtering:

In [None]:
# Multiple OR conditions
# Orders for Electronics OR Books
tech_orders = orders_df[(orders_df['product_category'] == 'Electronics') | 
                        (orders_df['product_category'] == 'Books')]
print("Orders for Electronics or Books:")
print(tech_orders)

# Complex conditions with both AND and OR
# High-value Electronics orders OR orders with multiple items
complex_filter = orders_df[((orders_df['product_category'] == 'Electronics') & 
                           (orders_df['price'] > 500)) | 
                           (orders_df['quantity'] > 1)]
print("\nComplex filter result:")
print(complex_filter)

# Using NOT condition with ~
# Orders that are not delivered
undelivered = orders_df[~orders_df['is_delivered']]
print("\nUndelivered orders:")
print(undelivered)

### SQL Comparison for Complex Filters

The complex filter above would be equivalent to this SQL:

```sql
SELECT * FROM orders
WHERE (product_category = 'Electronics' AND price > 500)
   OR (quantity > 1)
```
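
The parentheses matter in both versions. In Python, `&` and `|` bind more tightly than comparisons such as `==` and `>`, so each comparison needs its own parentheses; leaving them out typically raises an error rather than silently giving a wrong answer. Between the two, `&` binds more tightly than `|`, much like `AND` and `OR` in SQL. A minimal sketch of the grouping rules, using a small hypothetical `demo` DataFrame rather than `orders_df`:

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration
demo = pd.DataFrame({
    'product_category': ['Electronics', 'Electronics', 'Books'],
    'price': [1200.50, 75.25, 35.50],
    'quantity': [1, 1, 3],
})

# Name each condition once so the grouping is easy to read
is_electronics = demo['product_category'] == 'Electronics'
is_expensive = demo['price'] > 500
is_multi = demo['quantity'] > 1

# & binds more tightly than |, so these two masks are identical
implicit = is_electronics & is_expensive | is_multi
explicit = (is_electronics & is_expensive) | is_multi
print(implicit.equals(explicit))

# Moving the parentheses changes which rows are selected
regrouped = is_electronics & (is_expensive | is_multi)
print(demo[explicit])
print(demo[regrouped])
```

Naming the individual masks, as done here, is one way to keep long filters readable; explicit parentheses around each group are still worth writing even when precedence would give the same result.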

## 4. Filtering with .isin() Method

When you want to filter based on multiple possible values (like SQL's `IN` clause), the `.isin()` method is very useful:

In [None]:
# Filter orders for specific categories
# SQL equivalent: WHERE product_category IN ('Electronics', 'Books')
selected_categories = orders_df[orders_df['product_category'].isin(['Electronics', 'Books'])]
print("Orders for Electronics or Books using .isin():")
print(selected_categories)

# Filter for specific customers
vip_customers = ['C101', 'C103']
vip_orders = orders_df[orders_df['customer_id'].isin(vip_customers)]
print("\nOrders from VIP customers:")
print(vip_orders)

# NOT IN using ~
# Orders that are not Electronics or Clothing
other_categories = orders_df[~orders_df['product_category'].isin(['Electronics', 'Clothing'])]
print("\nOrders not in Electronics or Clothing categories:")
print(other_categories)

## 5. Filtering with .between() for Range Checks

To filter values within a range (like SQL's `BETWEEN` operator), use the `.between()` method. Both endpoints are included by default:

In [None]:
# Orders with total amount between $50 and $500
# SQL equivalent: WHERE total_amount BETWEEN 50 AND 500
mid_range_orders = orders_df[orders_df['total_amount'].between(50, 500)]
print("Mid-range orders ($50-$500):")
print(mid_range_orders)

# Orders from February 2025
feb_orders = orders_df[orders_df['order_date'].between('2025-02-01', '2025-02-28')]
print("\nOrders from February 2025:")
print(feb_orders)
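
The February filter above works because every `order_date` in this dataset falls exactly at midnight. With real timestamps, an upper bound of `'2025-02-28'` means midnight at the start of that day, so later orders on the 28th would be excluded. A half-open range is one way around this. The sketch below uses hypothetical timestamps and assumes pandas 1.3 or later, where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'`, and `'right'`:

```python
import pandas as pd

# Hypothetical timestamps, including one late on the last day of February
stamps = pd.Series(pd.to_datetime([
    '2025-01-31 23:59',
    '2025-02-01 00:00',
    '2025-02-28 18:30',
    '2025-03-01 00:00',
]))

# Closed range: the 18:30 timestamp on Feb 28 is after the upper bound (midnight)
closed = stamps.between('2025-02-01', '2025-02-28')

# Half-open range [Feb 1, Mar 1): includes all of Feb 28, excludes Mar 1
# (assumes pandas 1.3+ for the string form of `inclusive`)
half_open = stamps.between('2025-02-01', '2025-03-01', inclusive='left')

print(closed.tolist())
print(half_open.tolist())
```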

## 6. String Filtering Methods

Pandas provides powerful string filtering methods through the `.str` accessor. These are similar to SQL's `LIKE` clause:

In [None]:
# Startswith - products that start with 'L'
# SQL equivalent: WHERE product_name LIKE 'L%'
l_products = orders_df[orders_df['product_name'].str.startswith('L')]
print("Products starting with 'L':")
print(l_products)

# Contains - products with 'phone' in the name
# SQL equivalent: WHERE product_name LIKE '%phone%'
phone_products = orders_df[orders_df['product_name'].str.contains('phone', case=False)]
print("\nProducts containing 'phone':")
print(phone_products)

# Regular expressions
# Products with names that contain 'p' followed by any character and then 't'
# SQL equivalent: WHERE product_name REGEXP 'p.t'
pattern_products = orders_df[orders_df['product_name'].str.contains(r'p.t', case=False, regex=True)]
print("\nProducts matching pattern 'p.t':")
print(pattern_products)
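
Two `.str.contains()` parameters are worth knowing when the data is less tidy than our sample. `na=False` makes missing names count as non-matches, so the result stays a usable boolean mask, and `regex=False` treats the pattern as literal text. With the default `regex=True`, a pattern such as `'('` is an invalid regular expression and raises an error. A minimal sketch with hypothetical product names:

```python
import pandas as pd

# Hypothetical product names: one missing value, one containing a regex metacharacter
names = pd.Series(['Laptop', None, 'USB-C Hub (4-port)', 'Smartphone'])

# na=False: the missing entry becomes False instead of a missing value in the mask
has_phone = names.str.contains('phone', case=False, na=False)

# regex=False: '(' is matched literally rather than parsed as a regular expression
has_paren = names.str.contains('(', regex=False, na=False)

print(names[has_phone].tolist())
print(names[has_paren].tolist())
```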

## 7. The .query() Method

Pandas provides the `.query()` method for a more SQL-like syntax that can be more readable for complex filters:

In [None]:
# Simple query
# SQL equivalent: WHERE product_category = 'Electronics'
electronics = orders_df.query("product_category == 'Electronics'")
print("Electronics products using query():")
print(electronics)

# Complex query with multiple conditions
# SQL equivalent: WHERE (product_category = 'Electronics' OR product_category = 'Books') AND price > 50
complex_query = orders_df.query("(product_category == 'Electronics' or product_category == 'Books') and price > 50")
print("\nComplex query result:")
print(complex_query)

# Using variables in queries with @
min_price = 100
max_price = 500
price_range_query = orders_df.query("price >= @min_price and price <= @max_price")
print("\nOrders with price between $100 and $500:")
print(price_range_query)
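
`.query()` also understands `in` and `not in`, which behave like `.isin()`, and it accepts backtick-quoted column names for columns that are not valid Python identifiers. The sketch below uses a hypothetical `demo` frame whose `unit price` column deliberately contains a space; neither the frame nor that column name comes from `orders_df`:

```python
import pandas as pd

# Hypothetical frame; 'unit price' contains a space on purpose
demo = pd.DataFrame({
    'customer_id': ['C101', 'C102', 'C103', 'C104'],
    'unit price': [1200.50, 25.99, 120.99, 35.50],
})

vip_customers = ['C101', 'C103']

# `in` with an @-variable works like .isin()
# SQL equivalent: WHERE customer_id IN ('C101', 'C103')
vip = demo.query("customer_id in @vip_customers")

# Backticks quote a column name containing a space
cheap_non_vip = demo.query("customer_id not in @vip_customers and `unit price` < 30")

print(vip)
print(cheap_non_vip)
```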

## 8. Using .where() and .mask() for Conditional Replacement

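Unlike boolean indexing, the `.where()` and `.mask()` methods do not drop rows. They return an object of the same shape: `.where()` keeps values where the condition is `True` and replaces the rest, while `.mask()` replaces values where the condition is `True` and keeps the rest: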

In [None]:
# Using .where() to keep values that meet a condition, replace others with a value
# Keep original price for expensive items (>500), set others to the median price
median_price = orders_df['price'].median()
price_modified = orders_df['price'].where(orders_df['price'] > 500, median_price)
print("Modified prices using .where():")
print(price_modified)

# Using .mask() (opposite of .where()) to replace values that meet a condition
# Apply a 10% discount to Electronics items
discounted_prices = orders_df['price'].mask(orders_df['product_category'] == 'Electronics', 
                                           orders_df['price'] * 0.9)
print("\nDiscounted prices using .mask():")
print(discounted_prices)
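
To see how this differs from ordinary filtering, the sketch below applies the same condition three ways to a hypothetical `prices` Series. When no replacement value is given, `.where()` and `.mask()` fill the replaced positions with `NaN`:

```python
import pandas as pd

# Hypothetical prices, used only for this comparison
prices = pd.Series([1200.50, 25.99, 800.00, 35.50])
is_expensive = prices > 500

filtered = prices[is_expensive]          # boolean indexing: non-matching rows are dropped
kept_shape = prices.where(is_expensive)  # same length; NaN where the condition is False
masked = prices.mask(is_expensive)       # same length; NaN where the condition is True

print(len(filtered), len(kept_shape), len(masked))
print(kept_shape.isna().tolist())
print(masked.isna().tolist())

# Dropping the NaN values from .where() recovers the filtered result
print(kept_shape.dropna().equals(filtered))
```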

## 9. Filtering Time Series Data

Datetime columns expose the `.dt` accessor, whose components (month, quarter, day of week, and so on) can be used directly in boolean filters:

In [None]:
# Filter by specific date periods
# Orders from January (dt.month ignores the year; every order in this dataset is from 2025)
jan_orders = orders_df[orders_df['order_date'].dt.month == 1]
print("January 2025 orders:")
print(jan_orders)

# Orders from the first quarter (dt.quarter also ignores the year)
q1_orders = orders_df[orders_df['order_date'].dt.quarter == 1]
print("\nQ1 2025 orders:")
print(q1_orders)

# Orders from weekdays (Monday-Friday)
weekday_orders = orders_df[orders_df['order_date'].dt.dayofweek < 5]
print("\nWeekday orders:")
print(weekday_orders)
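
When the dates live in the index rather than in a column, `.loc` accepts partial date strings, which can be shorter than building a mask. A minimal sketch, assuming a hypothetical `by_date` frame indexed by order date (the values are illustrative):

```python
import pandas as pd

# Hypothetical orders indexed by order date; sorted so that date slicing is well defined
by_date = pd.DataFrame(
    {'total_amount': [1200.50, 51.98, 800.00, 106.50, 49.99]},
    index=pd.to_datetime(['2025-01-15', '2025-01-20', '2025-02-05',
                          '2025-02-10', '2025-03-01']),
).sort_index()

# A partial date string selects every row in that month
feb = by_date.loc['2025-02']

# Date-string slices include both endpoints
jan_to_feb = by_date.loc['2025-01-20':'2025-02-05']

print(feb)
print(jan_to_feb)
```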

## 10. Chaining Filters

You can chain multiple filter operations for a step-by-step filtering approach:

In [None]:
# Chain filters to find high-value Electronics orders from January 2025
result = (orders_df
          .query("product_category == 'Electronics'")
          .query("order_date >= '2025-01-01' and order_date <= '2025-01-31'")
          .query("total_amount > 1000")
         )
print("High-value Electronics orders from January 2025:")
print(result)

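# Alternative method: combine the boolean masks with &
# (indexing [mask1][mask2] with masks built on the full DataFrame makes pandas
# reindex each mask and emit a UserWarning)
result_alt = orders_df[
    (orders_df['product_category'] == 'Electronics')
    & (orders_df['order_date'].dt.month == 1)
    & (orders_df['total_amount'] > 1000)
]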
print("\nSame result using boolean filters:")
print(result_alt)
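
Another way to chain boolean filters step by step is `.loc` with a callable. The function receives the intermediate DataFrame, so each mask is built from the rows that survived the previous step rather than from the original frame. A minimal sketch on a small hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical frame with the same column names as the orders data
demo = pd.DataFrame({
    'product_category': ['Electronics', 'Clothing', 'Electronics', 'Electronics'],
    'order_date': pd.to_datetime(['2025-01-15', '2025-01-20', '2025-02-05', '2025-02-20']),
    'total_amount': [1200.50, 51.98, 800.00, 75.25],
})

# Each lambda is called with the result of the previous step
result = (
    demo
    .loc[lambda d: d['product_category'] == 'Electronics']
    .loc[lambda d: d['order_date'].dt.month == 1]
    .loc[lambda d: d['total_amount'] > 1000]
)
print(result)
```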

## 11. SQL to Pandas Filtering Translation Guide

Let's expand our SQL translation guide with filtering-focused operations:

| SQL Operation | Pandas Equivalent |
|--------------|-------------------|
| `WHERE col = value` | `df[df['col'] == value]` or `df.query("col == value")` |
| `WHERE col > value` | `df[df['col'] > value]` or `df.query("col > value")` |
| `WHERE col1 = val1 AND col2 = val2` | `df[(df['col1'] == val1) & (df['col2'] == val2)]` |
| `WHERE col1 = val1 OR col2 = val2` | `df[(df['col1'] == val1) \| (df['col2'] == val2)]` |
| `WHERE col IN (val1, val2)` | `df[df['col'].isin([val1, val2])]` |
| `WHERE col NOT IN (val1, val2)` | `df[~df['col'].isin([val1, val2])]` |
| `WHERE col BETWEEN val1 AND val2` | `df[df['col'].between(val1, val2)]` |
| `WHERE col LIKE 'pattern%'` | `df[df['col'].str.startswith('pattern')]` |
| `WHERE col LIKE '%pattern%'` | `df[df['col'].str.contains('pattern', regex=False)]` |
| `WHERE col IS NULL` | `df[df['col'].isna()]` |
| `WHERE col IS NOT NULL` | `df[df['col'].notna()]` |
| `WHERE DATE_PART('month', date_col) = 1` | `df[df['date_col'].dt.month == 1]` |
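
Our sample dataset has no missing values, so the `IS NULL` and `IS NOT NULL` rows are not exercised above. The sketch below shows them on a hypothetical `coupon_code` column, which is an assumption for illustration and not part of `orders_df`:

```python
import pandas as pd
import numpy as np

# Hypothetical frame: two of the three orders have no coupon
demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003'],
    'coupon_code': ['SPRING10', np.nan, None],
})

# SQL equivalent: WHERE coupon_code IS NULL
no_coupon = demo[demo['coupon_code'].isna()]

# SQL equivalent: WHERE coupon_code IS NOT NULL
with_coupon = demo[demo['coupon_code'].notna()]

print(no_coupon)
print(with_coupon)
```

Both `np.nan` and `None` count as missing here, which is why `.isna()` is preferred over comparing against a specific sentinel value.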

## 12. Practice Exercises

Let's practice these filtering techniques with some exercises:

### Exercise 1: Basic Filtering
Find all orders from customer C102 or C104.

In [None]:
# Your code here

### Exercise 2: Complex Filtering
Find all Electronics products that cost less than $100 or any product purchased in quantities greater than 2.

In [None]:
# Your code here

### Exercise 3: String Filtering
Find all products whose names contain the letter 'a' (case insensitive).

In [None]:
# Your code here

### Exercise 4: SQL to Pandas Translation
Translate the following SQL query to Pandas:
```sql
SELECT * FROM orders
WHERE (product_category = 'Electronics' OR product_category = 'Home')
AND is_delivered = TRUE
AND order_date >= '2025-02-01'
ORDER BY total_amount DESC
```

In [None]:
# Your code here

### Exercise 5: Advanced Filtering
Find all orders that meet these criteria:
- Placed on a weekend (Saturday or Sunday)
- Total amount is above the average total amount for all orders
- For products in either the 'Electronics' or 'Books' categories

In [None]:
# Your code here

## Next Steps

In the next part, we'll cover "Handling missing values" in Pandas DataFrames, another essential skill for data preprocessing.

Continue to Part 3: Handling Missing Values when you're ready to proceed.