# Pandas Fundamentals I - Part 3: Selection and Filtering

## Week 2, Day 2 (Thursday) - April 17th, 2025

### Overview
This is the third part of our introduction to Pandas, focusing on data selection and filtering operations. We'll translate SQL SELECT and WHERE clauses to Pandas and learn about label-based and position-based indexing.

### Learning Objectives
- Select columns and rows using different methods
- Filter data using boolean conditions
- Understand loc and iloc indexers
- Combine selection and filtering operations
- Translate SQL queries to Pandas code

### Prerequisites
- Pandas Fundamentals I - Parts 1 & 2

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Create a sample DataFrame to work with - e-commerce sales data
data = {
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004', 'ORD005', 'ORD006', 'ORD007', 'ORD008'],
    'customer_id': ['CUST01', 'CUST02', 'CUST03', 'CUST01', 'CUST04', 'CUST02', 'CUST05', 'CUST03'],
    'product_id': ['PROD01', 'PROD02', 'PROD03', 'PROD02', 'PROD01', 'PROD04', 'PROD05', 'PROD01'],
    'category': ['Electronics', 'Clothing', 'Books', 'Clothing', 'Electronics', 'Home', 'Electronics', 'Books'],
    'quantity': [1, 2, 3, 1, 1, 2, 1, 2],
    'price': [1200.00, 89.99, 24.95, 89.99, 1200.00, 149.50, 399.99, 1200.00],
    'order_date': ['2025-01-05', '2025-01-07', '2025-01-10', '2025-01-12', '2025-01-15', '2025-01-18', '2025-01-20', '2025-01-22'],
    'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'Debit Card', 'PayPal', 'Credit Card', 'Debit Card', 'PayPal']
}

# Create DataFrame
sales_df = pd.DataFrame(data)

# Convert order_date to datetime
sales_df['order_date'] = pd.to_datetime(sales_df['order_date'])

# Calculate total amount for each order
sales_df['total_amount'] = sales_df['quantity'] * sales_df['price']

# Display the DataFrame
print("Sales DataFrame:")
print(sales_df)

## 1. Selecting Columns (SQL SELECT)

In SQL, you use the SELECT statement to choose specific columns from a table. In Pandas, there are several ways to select columns from a DataFrame.

In [None]:
# SQL: SELECT order_id, customer_id, total_amount FROM sales
# Pandas - Method 1: Using column names in square brackets (preferred for multiple columns)
selected_columns = sales_df[['order_id', 'customer_id', 'total_amount']]
print("Selected columns (method 1):")
print(selected_columns.head())

# Method 2: Using dot notation (only works when the column name is a valid Python identifier
# and does not clash with a DataFrame attribute or method such as count or index)
# SQL: SELECT total_amount FROM sales
print("\nSelected column (dot notation):")
print(sales_df.total_amount.head())

# Method 3: Using the .loc indexer with column names
# SQL: SELECT order_id, product_id, price FROM sales
print("\nSelected columns using .loc:")
print(sales_df.loc[:, ['order_id', 'product_id', 'price']].head())

# Method 4: Using the .iloc indexer with column positions
# SQL: no standard equivalent (SQL selects columns by name, not by position)
print("\nSelected columns using .iloc:")
print(sales_df.iloc[:, [0, 2, 4]].head())
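
Two details of the methods above are easy to miss: single brackets with one name return a Series while a list of names returns a DataFrame, and dot notation can silently resolve to a DataFrame method instead of a column. The sketch below illustrates both on a small hypothetical frame; the column names `count` and `unit price` are assumptions chosen only to trigger the pitfalls.

```python
import pandas as pd

# Hypothetical mini-frame; 'count' and 'unit price' are illustrative names
demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002'],
    'count': [1, 2],
    'unit price': [1200.00, 89.99],
})

# One name in single brackets returns a Series...
as_series = demo['order_id']
# ...while a list of names returns a DataFrame, even for a single column
as_frame = demo[['order_id']]
print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame

# Dot notation finds the DataFrame.count method, not the 'count' column
print(callable(demo.count))      # True

# Bracket notation always reaches the column, including names with spaces
print(demo['count'].tolist())    # [1, 2]
print(demo['unit price'].tolist())
```

For this reason bracket notation is usually the safer default in reusable code.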

### Selecting a range of columns

You can also select a range of columns using slicing:

In [None]:
# Using .loc to select a range of columns
# SQL: SELECT order_id, customer_id, product_id, category FROM sales
print("Range of columns using .loc:")
print(sales_df.loc[:, 'order_id':'category'].head())

# Using .iloc to select a range of column positions
# SQL: no standard equivalent; this selects the first 4 columns (positions 0-3)
print("\nRange of columns using .iloc:")
print(sales_df.iloc[:, 0:4].head())
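
The two slices above return the same four columns even though one ends at `'category'` and the other at position 4, because `.loc` slices include the end label while `.iloc` slices exclude the end position. A minimal sketch on a hypothetical five-column frame makes the difference visible:

```python
import pandas as pd

# Hypothetical frame with five columns
demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002'],
    'customer_id': ['CUST01', 'CUST02'],
    'product_id': ['PROD01', 'PROD02'],
    'category': ['Electronics', 'Clothing'],
    'quantity': [1, 2],
})

# Label slice: the end label 'category' is included
by_label = demo.loc[:, 'order_id':'category']
# Position slice: the end position 3 ('category') is excluded
by_position = demo.iloc[:, 0:3]

print(list(by_label.columns))     # 4 columns, ending with 'category'
print(list(by_position.columns))  # 3 columns, ending with 'product_id'
```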

### Excluding columns

Standard SQL has no syntax for "all columns except these", although some dialects provide one, for example BigQuery's `SELECT * EXCEPT (column1, column2) FROM table`. In Pandas, we can use the `drop()` method or filter the columns list:

In [None]:
# Method 1: Using drop()
# SQL (BigQuery dialect): SELECT * EXCEPT (payment_method) FROM sales
print("DataFrame with payment_method dropped:")
print(sales_df.drop(columns=['payment_method']).head())

# Method 2: Using a list comprehension to filter column names
columns_to_exclude = ['payment_method', 'order_date']
filtered_columns = [col for col in sales_df.columns if col not in columns_to_exclude]
print("\nDataFrame with multiple columns excluded:")
print(sales_df[filtered_columns].head())
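
Note that `drop()` returns a new DataFrame and leaves the original untouched, which is why the examples above can keep using `sales_df` afterwards. The sketch below shows this on a hypothetical frame, along with the `errors='ignore'` option for names that may not exist; `coupon_code` is an assumed column name that is deliberately absent.

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002'],
    'price': [1200.00, 89.99],
    'payment_method': ['Credit Card', 'PayPal'],
})

# drop() returns a new DataFrame; demo itself is unchanged
trimmed = demo.drop(columns=['payment_method'])
print(list(trimmed.columns))  # ['order_id', 'price']
print(list(demo.columns))     # still all three columns

# errors='ignore' skips missing names instead of raising KeyError
# ('coupon_code' is a hypothetical column that does not exist here)
safe = demo.drop(columns=['payment_method', 'coupon_code'], errors='ignore')
print(list(safe.columns))     # ['order_id', 'price']
```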

## 2. Filtering Rows with Boolean Masks (SQL WHERE)

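In SQL, you use the WHERE clause to filter rows based on conditions. In Pandas, we filter with a boolean mask: a Series of True/False values, one per row, that keeps only the rows marked True when passed inside square brackets.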

In [None]:
# Basic filtering with a single condition
# SQL: SELECT * FROM sales WHERE category = 'Electronics'
electronics = sales_df[sales_df['category'] == 'Electronics']
print("Electronics products:")
print(electronics)

# Filtering with numeric comparison
# SQL: SELECT * FROM sales WHERE price > 100
expensive_items = sales_df[sales_df['price'] > 100]
print("\nExpensive items (price > 100):")
print(expensive_items)
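
It can help to look at the mask on its own before using it as a filter. In this sketch (a hypothetical four-row frame), the comparison produces one True/False value per row, and summing the mask counts the matching rows:

```python
import pandas as pd

demo = pd.DataFrame({
    'category': ['Electronics', 'Clothing', 'Books', 'Electronics'],
    'price': [1200.00, 89.99, 24.95, 399.99],
})

# The comparison yields a boolean Series aligned with the rows
mask = demo['price'] > 100
print(mask.tolist())  # [True, False, False, True]

# True counts as 1, so the sum is the number of matching rows
print(mask.sum())     # 2

# Passing the mask in square brackets keeps only the True rows
print(demo[mask])
```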

### Combining multiple conditions

In SQL, you use AND, OR, and NOT operators to combine conditions. In Pandas, we use the operators &, |, and ~ respectively.

In [None]:
# AND condition
# SQL: SELECT * FROM sales WHERE category = 'Electronics' AND price > 1000
expensive_electronics = sales_df[(sales_df['category'] == 'Electronics') & (sales_df['price'] > 1000)]
print("Expensive electronics:")
print(expensive_electronics)

# OR condition
# SQL: SELECT * FROM sales WHERE category = 'Electronics' OR category = 'Books'
electronics_or_books = sales_df[(sales_df['category'] == 'Electronics') | (sales_df['category'] == 'Books')]
print("\nElectronics or Books:")
print(electronics_or_books)

# NOT condition
# SQL: SELECT * FROM sales WHERE NOT category = 'Electronics'
not_electronics = sales_df[~(sales_df['category'] == 'Electronics')]
print("\nNon-electronics items:")
print(not_electronics)

### Important Note on Combining Conditions

When combining multiple conditions in Pandas, you must use parentheses around each condition. This is different from SQL where parentheses are optional.

```python
# Correct:
filtered_df = df[(df['col1'] > 10) & (df['col2'] < 20)]

# Incorrect - will raise an error:
# filtered_df = df[df['col1'] > 10 & df['col2'] < 20]
```

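The reason is that the `&` operator has higher precedence than the comparison operators, so without parentheses Python first evaluates `10 & df['col2']` and then reads the rest as the chained comparison `df['col1'] > (10 & df['col2']) < 20`. A chained comparison joins its two halves with an implicit `and`, which needs a single True/False value from an entire Series, so Pandas typically raises `ValueError: The truth value of a Series is ambiguous`.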
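
The failure is easy to reproduce. The following sketch assumes a small frame with integer columns named `col1` and `col2` as in the snippet above; with integer data the unparenthesized expression fails with a `ValueError`, because Python ends up asking for the truth value of a whole Series.

```python
import pandas as pd

# Hypothetical frame with integer columns
df = pd.DataFrame({'col1': [5, 15, 25], 'col2': [30, 10, 5]})

raised = None
try:
    # Missing parentheses: & binds tighter than > and <
    df[df['col1'] > 10 & df['col2'] < 20]
except ValueError as exc:
    raised = type(exc).__name__
    print(raised, '-', exc)

# With parentheses each comparison is evaluated first, then combined row by row
filtered_df = df[(df['col1'] > 10) & (df['col2'] < 20)]
print(filtered_df)
```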

### Filter with the IN operator

In SQL, the IN operator allows you to specify multiple values in a WHERE clause. In Pandas, we use the `isin()` method.

In [None]:
# SQL: SELECT * FROM sales WHERE payment_method IN ('Credit Card', 'PayPal')
card_or_paypal = sales_df[sales_df['payment_method'].isin(['Credit Card', 'PayPal'])]
print("Orders paid with Credit Card or PayPal:")
print(card_or_paypal)

# SQL: SELECT * FROM sales WHERE customer_id IN ('CUST01', 'CUST03')
selected_customers = sales_df[sales_df['customer_id'].isin(['CUST01', 'CUST03'])]
print("\nOrders from selected customers:")
print(selected_customers)
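
The SQL `NOT IN` operator maps to the same method combined with `~`. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004'],
    'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'Debit Card'],
})

# SQL: SELECT * FROM demo WHERE payment_method NOT IN ('Credit Card', 'PayPal')
not_card_or_paypal = demo[~demo['payment_method'].isin(['Credit Card', 'PayPal'])]
print(not_card_or_paypal)
```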

### Filtering with string operations

In SQL, you might use the LIKE operator for pattern matching. In Pandas, we use string methods with the `str` accessor.

In [None]:
# SQL: SELECT * FROM sales WHERE product_id LIKE 'PROD0%'
# Pandas: startswith
starts_with_prod0 = sales_df[sales_df['product_id'].str.startswith('PROD0')]
print("Product IDs starting with 'PROD0':")
print(starts_with_prod0)

# SQL: SELECT * FROM sales WHERE payment_method LIKE '%Card%'
# Pandas: contains (the pattern is treated as a regular expression by default)
contains_card = sales_df[sales_df['payment_method'].str.contains('Card')]
print("\nPayment methods containing 'Card':")
print(contains_card)
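
Real data is often messier than this sample: mixed case, missing values, or search text containing regex metacharacters. The sketch below shows three `str.contains` parameters that cover those cases, using hypothetical values that are not part of the sales data.

```python
import pandas as pd

# Hypothetical column with mixed case and a missing value
demo = pd.DataFrame({
    'payment_method': ['Credit Card', 'debit card', 'PayPal', None],
})

# case=False ignores case; na=False treats missing values as non-matches,
# which keeps the mask purely boolean so it can be used for filtering
card_mask = demo['payment_method'].str.contains('card', case=False, na=False)
print(card_mask.tolist())     # [True, True, False, False]

# regex=False matches the text literally; as a regex, '.' would match any character
literal_dot = demo['payment_method'].str.contains('.', regex=False, na=False)
print(literal_dot.tolist())   # [False, False, False, False]

print(demo[card_mask])
```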

### Filtering with date operations

In SQL, you might use date functions to filter by date. In Pandas, we compare datetime columns directly and use the `.dt` accessor to extract parts such as the month.

In [None]:
# SQL: SELECT * FROM sales WHERE order_date >= '2025-01-15'
recent_orders = sales_df[sales_df['order_date'] >= '2025-01-15']
print("Orders on or after Jan 15, 2025:")
print(recent_orders)

# SQL: SELECT * FROM sales WHERE EXTRACT(MONTH FROM order_date) = 1
january_orders = sales_df[sales_df['order_date'].dt.month == 1]
print("\nOrders in January:")
print(january_orders)  # Should be all orders in our example dataset
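
A date range, the counterpart of SQL `BETWEEN`, can be expressed with `Series.between`, which includes both endpoints by default. A minimal sketch on a hypothetical subset of the orders:

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD003', 'ORD005', 'ORD008'],
    'order_date': pd.to_datetime(
        ['2025-01-05', '2025-01-10', '2025-01-15', '2025-01-22']
    ),
})

# SQL: SELECT * FROM demo WHERE order_date BETWEEN '2025-01-10' AND '2025-01-15'
mid_january = demo[demo['order_date'].between('2025-01-10', '2025-01-15')]
print(mid_january['order_id'].tolist())  # ['ORD003', 'ORD005']
```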

## 3. The loc and iloc Indexers

We've already seen `.loc` and `.iloc` for column selection, but they're also powerful tools for selecting both rows and columns together. This is equivalent to SELECT-WHERE combinations in SQL.

### Using `loc` for label-based indexing

The `.loc` indexer selects data by row and column labels. The syntax is `df.loc[row_labels, column_labels]`.

In [None]:
# Set the index to order_id for demonstration
sales_indexed = sales_df.set_index('order_id')
print("DataFrame with order_id as index:")
print(sales_indexed.head())

# Select a specific row by label
# SQL: SELECT * FROM sales WHERE order_id = 'ORD001'
order_001 = sales_indexed.loc['ORD001']
print("\nOrder ORD001:")
print(order_001)

# Select specific rows and columns by label
# SQL: SELECT customer_id, product_id, total_amount FROM sales WHERE order_id IN ('ORD001', 'ORD003')
selected_orders = sales_indexed.loc[['ORD001', 'ORD003'], ['customer_id', 'product_id', 'total_amount']]
print("\nSelected orders and columns:")
print(selected_orders)
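
Two further behaviors of `.loc` are worth seeing in action: a label slice includes its end label, and asking for a label that is not in the index raises `KeyError`. The sketch uses a hypothetical four-order frame; `ORD999` is an assumed label that does not exist.

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004'],
    'total_amount': [1200.00, 179.98, 74.85, 89.99],
}).set_index('order_id')

# Label slice: both 'ORD002' and 'ORD003' are included
label_slice = demo.loc['ORD002':'ORD003']
print(label_slice.index.tolist())  # ['ORD002', 'ORD003']

# A missing label raises KeyError rather than returning an empty result
missing_found = True
try:
    demo.loc['ORD999']
except KeyError:
    missing_found = False
    print("ORD999 is not in the index")
```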

### Using `iloc` for position-based indexing

The `.iloc` indexer selects data by row and column positions (integers). The syntax is `df.iloc[row_positions, column_positions]`.

In [None]:
# Reset index for demonstration with numeric positions
sales_df_reset = sales_df.reset_index(drop=True)
print("DataFrame with numeric index:")
print(sales_df_reset.head())

# Select first row, all columns
# SQL: SELECT * FROM sales LIMIT 1
first_row = sales_df_reset.iloc[0]
print("\nFirst row:")
print(first_row)

# Select rows at positions 2-4 (the slice end, 5, is excluded) and columns 1, 3, 5
# SQL: Hard to express exactly in SQL - would need row numbers
selected_data = sales_df_reset.iloc[2:5, [1, 3, 5]]
print("\nRows 2-4, columns 1, 3, 5:")
print(selected_data)
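
Because `.iloc` follows Python's list indexing rules, negative positions and step values work as well. A short sketch on a hypothetical five-row frame:

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004', 'ORD005'],
    'quantity': [1, 2, 3, 1, 1],
})

last_two = demo.iloc[-2:]      # last two rows
every_other = demo.iloc[::2]   # rows at positions 0, 2, 4
single_cell = demo.iloc[0, 1]  # row 0, column 1 as a scalar

print(last_two['order_id'].tolist())     # ['ORD004', 'ORD005']
print(every_other['order_id'].tolist())  # ['ORD001', 'ORD003', 'ORD005']
print(single_cell)                       # 1
```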

### loc vs. iloc: Key Differences

Understanding the differences between `loc` and `iloc` is crucial:

1. **loc**:
   - Uses labels (row/column names) and also accepts boolean masks
   - Inclusive of both start and end labels in slices
   - Similar to a SQL WHERE on a key column (e.g., `WHERE order_id = 'ORD001'`)

2. **iloc**:
   - Uses integer positions (0-based)
   - Inclusive of start, exclusive of end in slices (like Python lists)
   - No direct SQL equivalent for row positions
   
When to use each:
- Use `loc` when you have specific row/column labels in mind (e.g., IDs, names)
- Use `iloc` when you need specific positions regardless of the labels
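
The distinction matters most after filtering, because the surviving rows keep their original integer labels and labels no longer match positions. The sketch below, on a hypothetical four-row frame, shows `loc` and `iloc` diverging on the same numbers:

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004'],
    'price': [1200.00, 89.99, 24.95, 89.99],
})

# Filtering keeps the original labels 1, 2, 3; label 0 is gone
cheap = demo[demo['price'] < 100]
print(cheap.index.tolist())                   # [1, 2, 3]

# Position 0 is the first remaining row, which carries label 1
print(cheap.iloc[0]['order_id'])              # ORD002
print(cheap.loc[1, 'order_id'])               # ORD002

# The same slice numbers select different rows
by_label = cheap.loc[1:2, 'order_id'].tolist()    # labels 1 and 2, end included
by_position = cheap.iloc[1:2]['order_id'].tolist()  # position 1 only
print(by_label)                               # ['ORD002', 'ORD003']
print(by_position)                            # ['ORD003']
```

Calling `reset_index(drop=True)` on the filtered frame, as done earlier for `sales_df_reset`, brings labels and positions back in line.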

## 4. Combined Selection and Filtering

Now let's combine column selection with row filtering to perform more complex queries.

In [None]:
# Method 1: Chaining operations
# SQL: SELECT order_id, product_id, price FROM sales WHERE category = 'Electronics'
electronics_info = sales_df[sales_df['category'] == 'Electronics'][['order_id', 'product_id', 'price']]
print("Electronics info (chaining):")
print(electronics_info)

# Method 2: Using loc (preferred: rows and columns are selected in a single indexing step,
# which avoids chained indexing)
# SQL: SELECT order_id, product_id, price FROM sales WHERE category = 'Electronics'
electronics_info_loc = sales_df.loc[sales_df['category'] == 'Electronics', ['order_id', 'product_id', 'price']]
print("\nElectronics info (using loc):")
print(electronics_info_loc)

# More complex example
# SQL: SELECT order_id, customer_id, total_amount FROM sales 
#      WHERE category IN ('Electronics', 'Books') AND price > 50 AND order_date >= '2025-01-10'
complex_query = sales_df.loc[
    (sales_df['category'].isin(['Electronics', 'Books'])) & 
    (sales_df['price'] > 50) & 
    (sales_df['order_date'] >= '2025-01-10'),
    ['order_id', 'customer_id', 'total_amount']
]
print("\nComplex query result:")
print(complex_query)
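
When a query has several conditions, one readable option is to give each mask a name and combine them in the `.loc` call. This is a stylistic sketch on a hypothetical four-row subset, not the only way to structure such a query; the variable names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': ['ORD003', 'ORD005', 'ORD006', 'ORD008'],
    'category': ['Books', 'Electronics', 'Home', 'Books'],
    'price': [24.95, 1200.00, 149.50, 1200.00],
    'order_date': pd.to_datetime(
        ['2025-01-10', '2025-01-15', '2025-01-18', '2025-01-22']
    ),
})

# One named mask per WHERE condition
in_category = demo['category'].isin(['Electronics', 'Books'])
over_50 = demo['price'] > 50
since_jan_10 = demo['order_date'] >= '2025-01-10'

# Named masks are already complete expressions, so no extra parentheses are needed
result = demo.loc[in_category & over_50 & since_jan_10, ['order_id', 'price']]
print(result['order_id'].tolist())  # ['ORD005', 'ORD008']
```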