# Data Filtering and Selection in Pandas

This notebook covers various techniques to filter and subset data efficiently using pandas. We'll explore different methods from basic comparison operators to advanced filtering techniques.

## Import Required Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.width', 1000)

# For reproducibility
np.random.seed(42)

## Creating Sample Data

Let's create sample DataFrames with various data types including numeric, string, categorical, and datetime data to demonstrate different filtering techniques.

In [None]:
# Create a sample dataset with different data types
dates = pd.date_range('2023-01-01', periods=100)
categories = ['A', 'B', 'C', 'D']
regions = ['North', 'South', 'East', 'West']
status = ['Active', 'Pending', 'Completed', 'Cancelled']

df = pd.DataFrame({
    'date': dates,
    'numeric_value': np.random.randint(1, 100, 100),
    'float_value': np.random.normal(50, 15, 100),
    'category': np.random.choice(categories, 100),
    'region': np.random.choice(regions, 100),
    'status': np.random.choice(status, 100),
    'text': ['Sample text ' + str(i) for i in range(100)],
    'is_valid': np.random.choice([True, False], 100)
})

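# Add some missing values
for col in df.columns:
    if col != 'date':  # Keep dates intact
        mask = np.random.choice([True, False], 100, p=[0.05, 0.95])  # ~5% NaN on average
        # Integer and boolean columns cannot hold NaN, so cast them to a
        # NaN-capable dtype first (float for integers, object for booleans)
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = df[col].astype(float)
        elif pd.api.types.is_bool_dtype(df[col]):
            df[col] = df[col].astype(object)
        df.loc[mask, col] = np.nan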

# Convert category to categorical data type
df['category'] = df['category'].astype('category')

# Display the first few rows of the DataFrame
print("DataFrame shape:", df.shape)
df.head()

## Basic Data Filtering

Pandas provides several ways to filter data. Let's start with the most basic approach using comparison operators (`>`, `<`, `==`, `!=`, `>=`, `<=`).

In [None]:
# Filter rows where numeric_value > 50
high_values = df[df['numeric_value'] > 50]
print(f"Rows with numeric_value > 50: {len(high_values)}")
high_values.head()

In [None]:
# Filter rows where category equals 'A'
category_a = df[df['category'] == 'A']
print(f"Rows with category 'A': {len(category_a)}")
category_a.head()

In [None]:
# Filter rows where status is not 'Completed'
# (rows with a missing status also pass, because NaN != 'Completed' is True)
not_completed = df[df['status'] != 'Completed']
print(f"Rows with status not 'Completed': {len(not_completed)}")
not_completed.head()

In [None]:
# Filter rows by date
recent_data = df[df['date'] >= '2023-03-01']
print(f"Rows with date on or after March 1, 2023: {len(recent_data)}")
recent_data.head()

## Advanced Filtering Techniques

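We can combine multiple conditions using the element-wise logical operators `&` (and), `|` (or), and `~` (not). Each condition must be enclosed in parentheses because these operators bind more tightly than comparison operators such as `>` and `==`. Python's `and`, `or`, and `not` keywords cannot be used to combine Series element-wise.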
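
As a minimal sketch of why the parentheses matter, the example below uses a small hypothetical DataFrame (`demo`, with made-up columns `a` and `b`) rather than the notebook's `df`. Without parentheses, Python evaluates `1 & demo['b']` first, and the resulting chained comparison raises a `ValueError`.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]})

# Correct: each comparison is wrapped in parentheses before combining with &
both = demo[(demo['a'] > 1) & (demo['b'] > 1)]
print(both)

# Incorrect: parsed as demo['a'] > (1 & demo['b']) > 1, a chained comparison
# that asks for the truth value of a whole Series
try:
    demo[demo['a'] > 1 & demo['b'] > 1]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# ~ negates a mask; the expression being negated needs its own parentheses
not_both = demo[~((demo['a'] > 1) & (demo['b'] > 1))]
print(not_both)
```

The same rule applies to the `|` examples in this section.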

In [None]:
# Filter with multiple conditions using AND (&)
# Find rows where numeric_value > 50 AND region is 'North'
filtered_df = df[(df['numeric_value'] > 50) & (df['region'] == 'North')]
print(f"Rows matching both conditions: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Filter with multiple conditions using OR (|)
# Find rows where category is 'A' OR status is 'Completed'
filtered_df = df[(df['category'] == 'A') | (df['status'] == 'Completed')]
print(f"Rows matching either condition: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Combining AND and OR conditions
# Find rows where (numeric_value > 50 AND region is 'North') OR category is 'A'
complex_filter = df[((df['numeric_value'] > 50) & (df['region'] == 'North')) | (df['category'] == 'A')]
print(f"Rows matching complex condition: {len(complex_filter)}")
complex_filter.head()

## Boolean Indexing

Boolean indexing is a powerful technique where you create a boolean mask (Series of True/False values) and apply it to filter the DataFrame.

In [None]:
# Create a boolean mask
mask = df['float_value'] > 60
print("First 10 values of the mask:")
print(mask.head(10))

# Apply the mask to filter the DataFrame
filtered_df = df[mask]
print(f"\nNumber of rows with float_value > 60: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Creating and applying a more complex mask
mask = (df['numeric_value'] > 50) & (df['float_value'] < 70) & (df['is_valid'] == True)
filtered_df = df[mask]
print(f"Rows matching the complex mask: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Visualizing the distribution of data before and after filtering
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(df['float_value'], kde=True)
plt.title('Original Distribution')
plt.axvline(x=60, color='red', linestyle='--')

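plt.subplot(1, 2, 2)
# filtered_df was overwritten by the complex mask in the previous cell, so
# rebuild the float_value > 60 subset that the red threshold line refers to
sns.histplot(df.loc[df['float_value'] > 60, 'float_value'], kde=True)
plt.title('Filtered Distribution (float_value > 60)')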

plt.tight_layout()
plt.show()

## Using Query Method

The `query()` method provides a more readable way to filter data, especially for complex conditions. It accepts string expressions and is often more concise than boolean indexing.
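
Two features of the expression syntax are worth a quick sketch: backticks for column names that are not valid Python identifiers, and `@` for local variables. The DataFrame and the `unit price` column here are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data with a column name that contains a space
demo = pd.DataFrame({
    'unit price': [5.0, 12.5, 20.0],
    'region': ['North', 'South', 'North'],
})
threshold = 10

# Backticks quote the column name; @ pulls in the local variable
expensive_north = demo.query('`unit price` > @threshold and region == "North"')

# Equivalent boolean-indexing form, for comparison
same_rows = demo[(demo['unit price'] > threshold) & (demo['region'] == 'North')]
print(expensive_north.equals(same_rows))
```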

In [None]:
# Basic query
filtered_df = df.query('numeric_value > 50')
print(f"Rows with numeric_value > 50: {len(filtered_df)}")
filtered_df.head()

In [None]:
# More complex query with multiple conditions
filtered_df = df.query('numeric_value > 50 and region == "North"')
print(f"Rows matching the complex query: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Using OR conditions in query
filtered_df = df.query('category == "A" or status == "Completed"')
print(f"Rows matching either condition: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Using variables in query with @ symbol
min_value = 70
max_value = 90
filtered_df = df.query('numeric_value >= @min_value and numeric_value <= @max_value')
print(f"Rows with numeric_value between {min_value} and {max_value}: {len(filtered_df)}")
filtered_df.head()

## Filtering with isin()

The `isin()` method lets you filter rows where values are in a specified list or set of values.

In [None]:
# Filter rows where category is either 'A' or 'B'
categories_of_interest = ['A', 'B']
filtered_df = df[df['category'].isin(categories_of_interest)]
print(f"Rows with category in {categories_of_interest}: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Filter rows where region is neither 'North' nor 'South'
regions_to_exclude = ['North', 'South']
filtered_df = df[~df['region'].isin(regions_to_exclude)]  # Note the ~ operator for negation
print(f"Rows with region NOT in {regions_to_exclude}: {len(filtered_df)}")
filtered_df.head()
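
One subtlety of negating `isin()`: missing values are never "in" the list, so `~isin()` keeps them. The sketch below uses a small hypothetical Series (`regions_demo`) to show this, and adds `notna()` when missing values should be excluded as well.

```python
import pandas as pd

# Hypothetical toy data with one missing entry
regions_demo = pd.Series(['North', 'East', None, 'South', 'West'])
excluded = ['North', 'South']

# isin() is False for the missing entry, so ~isin() is True for it
not_excluded = regions_demo[~regions_demo.isin(excluded)]
print(not_excluded)

# Add notna() to drop the missing entry as well
not_excluded_known = regions_demo[~regions_demo.isin(excluded) & regions_demo.notna()]
print(not_excluded_known)
```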

In [None]:
# Combine isin() with other conditions
filtered_df = df[(df['category'].isin(['A', 'B'])) & (df['numeric_value'] > 50)]
print(f"Rows matching combined conditions: {len(filtered_df)}")
filtered_df.head()

## String Filtering with str Accessor

The `.str` accessor provides string manipulation methods for filtering text data.
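
`str.contains()` takes a few parameters that change what counts as a match. The sketch below, on a hypothetical Series (`texts`), shows `case`, `regex`, and `na`. Without `na=False`, a missing entry produces a missing value in the mask, and such a mask cannot be used to index a DataFrame.

```python
import pandas as pd

# Hypothetical toy data, including one missing entry
texts = pd.Series(['Sample text 1', 'sample TEXT 2', None, 'a.b', 'ab'])

# case=False ignores case; na=False turns the missing entry into False
has_text = texts.str.contains('text', case=False, na=False)

# By default the pattern is a regular expression, so '.' matches any character
dot_as_regex = texts.str.contains('.', na=False)

# regex=False searches for the literal substring instead
dot_literal = texts.str.contains('.', regex=False, na=False)

print(pd.DataFrame({'text': texts, 'has_text': has_text,
                    'dot_as_regex': dot_as_regex, 'dot_literal': dot_literal}))
```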

In [None]:
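# Filter rows where text contains the word 'text'
# (na=False treats missing text as a non-match; without it the mask contains NaN and indexing fails)
filtered_df = df[df['text'].str.contains('text', na=False)]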
print(f"Rows where text contains 'text': {len(filtered_df)}")
filtered_df.head()

In [None]:
# Filter rows where text starts with 'Sample'
filtered_df = df[df['text'].str.startswith('Sample', na=False)]  # na=False: missing text is a non-match
print(f"Rows where text starts with 'Sample': {len(filtered_df)}")
filtered_df.head()

In [None]:
# Filter rows where text ends with specific numbers
filtered_df = df[df['text'].str.endswith(('1', '2', '3'), na=False)]  # na=False: missing text is a non-match
print(f"Rows where text ends with '1', '2', or '3': {len(filtered_df)}")
filtered_df.head()

In [None]:
# Using regex with str.contains()
# Filter rows where text contains a number from 10 to 19
# (na=False treats missing text as a non-match)
filtered_df = df[df['text'].str.contains(r'Sample text 1[0-9]', na=False)]
print(f"Rows where text contains numbers 10-19: {len(filtered_df)}")
filtered_df

## Filtering with loc and iloc

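`loc` is used for label-based indexing, while `iloc` is used for integer-position-based indexing. Slices with `loc` include both endpoints, whereas `iloc` slices exclude the stop position, as in standard Python slicing.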
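
The difference is easiest to see when the index labels do not coincide with positions. A minimal sketch with a hypothetical DataFrame (`demo`) whose index starts at 100:

```python
import pandas as pd

# Hypothetical toy data with a non-default integer index
demo = pd.DataFrame({'value': [10, 20, 30, 40, 50]},
                    index=[100, 101, 102, 103, 104])

# loc slices by label and includes both endpoints: labels 101, 102, 103
by_label = demo.loc[101:103]

# iloc slices by position and excludes the stop: positions 1 and 2
by_position = demo.iloc[1:3]

print(by_label)
print(by_position)
```

In the notebook's `df` the default index runs from 0 to 99, so labels and positions happen to coincide until rows are filtered out or the index is reset.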

In [None]:
# Select rows by index and columns by label using loc
# Get rows with index labels 5 through 10 (both included) and columns 'date', 'numeric_value', 'category'
subset = df.loc[5:10, ['date', 'numeric_value', 'category']]
subset

In [None]:
# Combine loc with boolean indexing
# Select specific columns for rows where numeric_value > 50
subset = df.loc[df['numeric_value'] > 50, ['date', 'numeric_value', 'category', 'region']]
print(f"Shape of subset: {subset.shape}")
subset.head()

In [None]:
# Using iloc for position-based selection
# Get first 5 rows and first 3 columns
subset = df.iloc[0:5, 0:3]
subset

In [None]:
# Combining iloc with specific positions
# Get rows at positions 10, 20, 30 and columns at positions 1, 3, 5
subset = df.iloc[[10, 20, 30], [1, 3, 5]]
subset

## Combining Multiple Filters

Let's explore different ways to chain and combine multiple filtering operations.
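
When the set of conditions is built dynamically, one option is to collect the masks in a list and fold them together. This is a sketch of that idea using `functools.reduce` on a hypothetical DataFrame (`demo`). It is not one of the notebook's three methods, just a variation on the `&` approach.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical toy data, used only for this illustration
demo = pd.DataFrame({
    'value': [60, 70, 30, 90],
    'category': ['A', 'B', 'A', 'A'],
    'region': ['North', 'North', 'East', 'West'],
})

# Each entry is an ordinary boolean mask
conditions = [
    demo['value'] > 50,
    demo['category'] == 'A',
    demo['region'] != 'West',
]

# Fold the list with & so that a row must satisfy every condition
combined_mask = reduce(operator.and_, conditions)
result = demo[combined_mask]
print(result)
```

Swapping `operator.and_` for `operator.or_` keeps rows that satisfy at least one condition.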

In [None]:
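# Method 1: Chaining filters
# (works, but the second mask is built from the full df, so pandas has to
# realign it to the intermediate result and may emit a UserWarning)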
filtered_df = df[df['numeric_value'] > 50][df['category'] == 'A']
print(f"Rows after chaining filters: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Method 2: Using & operator (generally preferred for clarity and performance)
filtered_df = df[(df['numeric_value'] > 50) & (df['category'] == 'A')]
print(f"Rows using & operator: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Method 3: Using query method
filtered_df = df.query('numeric_value > 50 and category == "A"')
print(f"Rows using query method: {len(filtered_df)}")
filtered_df.head()

In [None]:
# Complex filtering example
# Find rows where:
# - numeric_value is strictly between 40 and 80
# - category is either A or B
# - region is not West
# - date is after February 15, 2023

complex_filter = df[
    (df['numeric_value'] > 40) & 
    (df['numeric_value'] < 80) & 
    (df['category'].isin(['A', 'B'])) & 
    (df['region'] != 'West') & 
    (df['date'] > '2023-02-15')
]

print(f"Rows matching complex filter: {len(complex_filter)}")
complex_filter.head()

In [None]:
# Same filter using query method
complex_query = df.query(
    'numeric_value > 40 and '
    'numeric_value < 80 and '
    'category in ["A", "B"] and '
    'region != "West" and '
    'date > "2023-02-15"'
)

print(f"Rows matching complex query: {len(complex_query)}")

# Verify that both methods give the same result
print(f"Results are identical: {complex_filter.equals(complex_query)}")
complex_query.head()

## Date and Time Filtering

Let's explore techniques for filtering data based on date and time fields.
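
If the dates are the index rather than a column, `loc` also accepts partial date strings. The sketch below assumes a hypothetical DataFrame (`ts`) with a daily `DatetimeIndex`. The notebook's `df` keeps `date` as a regular column, so this only applies after something like `df.set_index('date')`.

```python
import pandas as pd

# Hypothetical toy data: 90 consecutive days starting 2023-01-01
ts = pd.DataFrame({'value': range(90)},
                  index=pd.date_range('2023-01-01', periods=90))

# A partial string selects every row in that month
february = ts.loc['2023-02']

# String slices on a DatetimeIndex include both endpoints
mid_january = ts.loc['2023-01-15':'2023-01-20']

print(len(february), len(mid_january))
```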

In [None]:
# Basic date filtering
start_date = '2023-02-01'
end_date = '2023-03-31'

date_filtered = df[(df['date'] >= start_date) & (df['date'] <= end_date)]
print(f"Rows between {start_date} and {end_date}: {len(date_filtered)}")
date_filtered.head()

In [None]:
# Using datetime properties to filter
# Filter for days that are weekends (Saturday or Sunday)
weekend_data = df[df['date'].dt.dayofweek.isin([5, 6])]  # 5=Saturday, 6=Sunday
print(f"Rows falling on weekends: {len(weekend_data)}")
weekend_data.head()

In [None]:
# Filter by month
march_data = df[df['date'].dt.month == 3]  # March
print(f"Rows from March: {len(march_data)}")
march_data.head()

In [None]:
# Complex date filtering
# Get data for Mondays in February with numeric_value > 50
monday_feb_data = df[
    (df['date'].dt.month == 2) &  # February
    (df['date'].dt.dayofweek == 0) &  # Monday
    (df['numeric_value'] > 50)
]
print(f"Mondays in February with numeric_value > 50: {len(monday_feb_data)}")
monday_feb_data

## Numeric Data Filtering

Let's look at specific techniques for filtering numeric data.
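
`between()` includes both endpoints by default, which differs from a pair of strict `>` and `<` comparisons. A short sketch on a hypothetical Series (`values`), assuming a pandas version where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'`, and `'right'` (1.3 or later):

```python
import pandas as pd

# Hypothetical toy data chosen to sit on and around the boundaries
values = pd.Series([40, 45, 60, 61])

closed = values.between(40, 60)                          # 40 <= x <= 60
open_interval = values.between(40, 60, inclusive='neither')  # 40 < x < 60
left_closed = values.between(40, 60, inclusive='left')   # 40 <= x < 60

print(pd.DataFrame({'value': values, 'closed': closed,
                    'open': open_interval, 'left': left_closed}))
```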

In [None]:
# Using between() method for range filtering (both endpoints are included by default)
range_filtered = df[df['numeric_value'].between(40, 60)]
print(f"Rows with numeric_value between 40 and 60: {len(range_filtered)}")
range_filtered.head()

In [None]:
# Filtering based on quantiles
q1 = df['float_value'].quantile(0.25)
q3 = df['float_value'].quantile(0.75)
iqr = q3 - q1

middle_range = df[(df['float_value'] >= q1) & (df['float_value'] <= q3)]
print(f"Q1: {q1:.2f}, Q3: {q3:.2f}, IQR: {iqr:.2f}")
print(f"Rows with float_value in middle 50% (between Q1 and Q3): {len(middle_range)}")
middle_range.head()

In [None]:
# Identifying outliers using IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df[(df['float_value'] < lower_bound) | (df['float_value'] > upper_bound)]
print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
print(f"Number of outliers: {len(outliers)}")
outliers.head()

In [None]:
# Visualizing the distribution with outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['float_value'])
plt.axvline(x=lower_bound, color='red', linestyle='--', label=f'Lower bound ({lower_bound:.2f})')
plt.axvline(x=upper_bound, color='red', linestyle='--', label=f'Upper bound ({upper_bound:.2f})')
plt.legend()
plt.title('Distribution of float_value with Outlier Boundaries')
plt.show()

## Handling NaN Values in Filters

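Missing values (NaN) require special handling in filtering operations. Ordering and equality comparisons with NaN evaluate to False, so rows with missing values silently drop out of filters such as `> 50`, while `!=` evaluates to True and keeps them.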
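
A minimal sketch of how NaN behaves in comparisons, using a hypothetical three-element Series (`values`) so the effect is easy to see:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the middle
values = pd.Series([10, np.nan, 60])

# Ordering comparisons with NaN are False, so the missing row is dropped
greater = values > 50

# != with NaN is True, so the missing row is kept
not_ten = values != 10

# Combine with notna() to exclude missing rows explicitly
not_ten_known = (values != 10) & values.notna()

print(pd.DataFrame({'value': values, 'greater': greater,
                    'not_ten': not_ten, 'not_ten_known': not_ten_known}))
```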

In [None]:
# Count missing values in each column
missing_values = df.isna().sum()
print("Missing values by column:")
missing_values

In [None]:
# Filter rows with missing numeric_value
missing_numeric = df[df['numeric_value'].isna()]
print(f"Rows with missing numeric_value: {len(missing_numeric)}")
missing_numeric.head()

In [None]:
# Filter rows with no missing values in any column
complete_rows = df.dropna()
print(f"Rows with no missing values: {len(complete_rows)}")
print(f"Original DataFrame shape: {df.shape}")
complete_rows.head()

In [None]:
# Filter rows that have at least 6 non-null values
mostly_complete_rows = df.dropna(thresh=6)
print(f"Rows with at least 6 non-null values: {len(mostly_complete_rows)}")
mostly_complete_rows.head()

In [None]:
# Filling NaN values before filtering
# Fill NaN in numeric_value with the median
median_value = df['numeric_value'].median()
filled_df = df.copy()
filled_df['numeric_value'] = filled_df['numeric_value'].fillna(median_value)

# Now filter with no NaN issues
filtered_with_filled = filled_df[filled_df['numeric_value'] > 50]
print(f"Rows after filling NaNs and filtering: {len(filtered_with_filled)}")
filtered_with_filled.head()

## Performance Considerations

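Different filtering methods can have different performance characteristics, especially on large datasets. The single-run timings below are only a rough guide; results vary with data size, dtypes, and whether the optional `numexpr` package is installed.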
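
For steadier numbers than a single wall-clock measurement, one option is the standard-library `timeit` module, which runs each statement several times. The sketch below takes the best of three repeats. The DataFrame size and the helper names (`bench_df`, `with_mask`, `with_query`) are assumptions chosen for illustration, and no particular ranking of the methods is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data; smaller than large_df so it runs quickly
rng = np.random.default_rng(0)
bench_df = pd.DataFrame({'A': rng.integers(1, 100, 100_000)})

def with_mask():
    # Boolean indexing
    return bench_df[bench_df['A'] > 50]

def with_query():
    # query() with the same condition
    return bench_df.query('A > 50')

# Run each function 10 times per repeat and keep the fastest repeat
mask_time = min(timeit.repeat(with_mask, number=10, repeat=3))
query_time = min(timeit.repeat(with_query, number=10, repeat=3))

print(f"Boolean indexing: {mask_time / 10:.5f} seconds per run")
print(f"query():          {query_time / 10:.5f} seconds per run")
```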

In [None]:
# Create a larger DataFrame for performance testing
large_df = pd.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.choice(['X', 'Y', 'Z'], 1000000),
    'C': np.random.normal(0, 1, 1000000)
})

large_df.head()

In [None]:
# Compare performance of different filtering methods
import time

# Method 1: Boolean indexing
start_time = time.time()
filtered_1 = large_df[large_df['A'] > 50]
time_1 = time.time() - start_time

# Method 2: Query method
start_time = time.time()
filtered_2 = large_df.query('A > 50')
time_2 = time.time() - start_time

# Method 3: NumPy where + iloc
start_time = time.time()
mask = np.where(large_df['A'] > 50)[0]
filtered_3 = large_df.iloc[mask]
time_3 = time.time() - start_time

print(f"Boolean indexing time: {time_1:.5f} seconds")
print(f"Query method time: {time_2:.5f} seconds")
print(f"NumPy where + iloc time: {time_3:.5f} seconds")

In [None]:
# Compare performance for complex filters
start_time = time.time()
filtered_1 = large_df[(large_df['A'] > 50) & (large_df['B'] == 'X') & (large_df['C'] > 0)]
time_1 = time.time() - start_time

start_time = time.time()
filtered_2 = large_df.query('A > 50 and B == "X" and C > 0')
time_2 = time.time() - start_time

print(f"Complex boolean indexing time: {time_1:.5f} seconds")
print(f"Complex query method time: {time_2:.5f} seconds")
print(f"Number of rows in result: {len(filtered_1)}")

## Summary and Best Practices

Here's a summary of the filtering techniques we've covered and some best practices:

1. **Basic Filtering**:
   - Use comparison operators (`>`, `<`, `==`, etc.) for simple conditions
   - Enclose each condition in parentheses when combining with `&` and `|`

2. **Boolean Indexing**:
   - Create boolean masks with conditions and apply them to filter DataFrames
   - Great for complex conditions and reusable filters

3. **Query Method**:
   - More readable for complex conditions 
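   - Can be faster on large DataFrames, particularly when the optional `numexpr` package is installed; on small DataFrames the parsing overhead usually outweighs any gain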
   - Use `@` to reference external variables

4. **isin() Method**:
   - Efficient for checking membership in lists of values
   - Use `~` to negate the condition

5. **String Filtering**:
   - Use the `.str` accessor with methods like `contains()`, `startswith()`, etc.
   - `contains()` treats its pattern as a regex by default; pass `na=False` when the column has missing values

6. **Loc and Iloc**:
   - `loc` for label-based indexing
   - `iloc` for integer position-based indexing
   - Combine with boolean masks for powerful subsetting

7. **Date Filtering**:
   - Use date properties with `.dt` accessor
   - Compare with date strings or datetime objects

8. **Handling NaN Values**:
   - Use `isna()` and `notna()` to identify missing values
   - Consider filling NaN values before filtering
   - Use `dropna()` with appropriate parameters

9. **Performance Tips**:
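   - The `query()` method can be faster for large datasets, but measure on your own data before relying on it
   - Combine conditions into a single mask with `&` and `|` instead of chaining `df[...][...]`, which builds an intermediate DataFrame at each step
   - Positional selection with `np.where` and `iloc` is an alternative worth benchmarking for very large datasets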