# Advanced Pandas Operations - Part 1: GroupBy Operations

## Week 3, Day 2 (Thursday) - April 24th, 2025

### Overview
This session covers GroupBy operations in Pandas, the counterpart of SQL's GROUP BY clause. GroupBy splits your data into groups based on one or more keys, applies a function to each group independently, and then combines the results. It is one of the most widely used Pandas features for data analysis.

### Learning Objectives
- Understand the split-apply-combine paradigm in Pandas
- Master GroupBy operations and their SQL equivalents
- Learn various aggregation techniques with GroupBy
- Apply transformations to groups
- Perform complex grouping operations with multiple criteria

### Prerequisites
- Python fundamentals (Week 1)
- Pandas Fundamentals I & II (Week 2, Day 2 & Week 3, Day 1)
- SQL knowledge (prior to course)

## 1. Introduction to GroupBy Operations

GroupBy operations are built on the **split-apply-combine** paradigm:

1. **Split**: Data is split into groups based on one or more keys
2. **Apply**: A function is applied to each group independently
3. **Combine**: The results are combined into a new data structure
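
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with a single `groupby` call. The `toy` DataFrame and its column names are made up for illustration and are not part of this session's dataset.

```python
import pandas as pd

# Hypothetical toy data (an assumption for this sketch, not the sales dataset)
toy = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Split: one sub-DataFrame per distinct key
pieces = {key: toy[toy["key"] == key] for key in toy["key"].unique()}

# Apply: reduce each piece independently
sums = {key: piece["value"].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name="value").rename_axis("key").sort_index()

# The same three steps in a single GroupBy call
with_groupby = toy.groupby("key")["value"].sum()
```

Both `manual` and `with_groupby` hold the same per-key sums; GroupBy packages the split, apply, and combine steps into one expression.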

This is analogous to SQL's GROUP BY clause. Let's start by creating a sample dataset to explore these concepts:

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

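# Fix the random seed (42 is an arbitrary choice) so the sample data is reproducible
np.random.seed(42)

# Create a sample sales dataset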
data = {
    'date': pd.date_range('2025-01-01', periods=30, freq='D'),
    'product_id': np.random.choice(['P001', 'P002', 'P003', 'P004', 'P005'], size=30),
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], size=30),
    'store_id': np.random.choice(['S01', 'S02', 'S03'], size=30),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=30),
    'quantity': np.random.randint(1, 10, size=30),
    'unit_price': np.random.uniform(10, 100, size=30).round(2),
    'discount': np.random.choice([0, 0.1, 0.2, 0.3], size=30)
}

# Create DataFrame
sales_df = pd.DataFrame(data)

# Calculate total price
sales_df['total_price'] = (sales_df['quantity'] * sales_df['unit_price'] * (1 - sales_df['discount'])).round(2)

# Display the first few rows
print("Sample Sales DataFrame:")
print(sales_df.head())

## 2. Basic GroupBy Operations

The simplest GroupBy operation involves grouping data by a single column and applying an aggregation function:

In [None]:
# Group by product category and calculate total sales
category_sales = sales_df.groupby('category')['total_price'].sum().reset_index()
print("Total sales by category:")
print(category_sales)

# SQL equivalent:
# SELECT category, SUM(total_price) as total_price
# FROM sales
# GROUP BY category

# Visualize the results
plt.figure(figsize=(10, 6))
plt.bar(category_sales['category'], category_sales['total_price'])
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Total Sales ($)')
plt.tight_layout()
plt.show()

In [None]:
# Group by store and calculate multiple statistics
store_stats = sales_df.groupby('store_id').agg({
    'total_price': 'sum',
    'quantity': 'sum',
    'unit_price': 'mean'
}).reset_index()

# Rename columns for clarity
store_stats.columns = ['store_id', 'total_sales', 'total_quantity', 'avg_unit_price']

print("Store statistics:")
print(store_stats)

# SQL equivalent:
# SELECT 
#     store_id, 
#     SUM(total_price) as total_sales, 
#     SUM(quantity) as total_quantity, 
#     AVG(unit_price) as avg_unit_price
# FROM sales
# GROUP BY store_id

## 3. Grouping by Multiple Columns

Just like in SQL, we can group by multiple columns to create more detailed aggregations:

In [None]:
# Group by store and category
store_category_sales = sales_df.groupby(['store_id', 'category'])['total_price'].sum().reset_index()
print("Total sales by store and category:")
print(store_category_sales)

# SQL equivalent:
# SELECT store_id, category, SUM(total_price) as total_price
# FROM sales
# GROUP BY store_id, category

# Pivot the result for better visualization
pivoted = store_category_sales.pivot(index='store_id', columns='category', values='total_price')
print("\nPivoted view:")
print(pivoted)

# Visualize the results with a grouped bar chart
store_category_sales.pivot(index='category', columns='store_id', values='total_price').plot(kind='bar', figsize=(12, 6))
plt.title('Sales by Category and Store')
plt.ylabel('Total Sales ($)')
plt.xlabel('Category')
plt.legend(title='Store')
plt.tight_layout()
plt.show()
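
With randomly generated data, some store/category combinations may never occur, and `pivot` leaves `NaN` in those cells. As a hedged alternative, the sketch below reshapes the grouped result with `unstack(fill_value=0)` so that missing combinations show up as zero sales. The `toy` data is a made-up example, not the session's `sales_df`.

```python
import pandas as pd

# Hypothetical toy data: S01 has no Clothing sales, S02 has no Home sales
toy = pd.DataFrame({
    "store_id": ["S01", "S01", "S02", "S02", "S02"],
    "category": ["Books", "Home", "Books", "Books", "Clothing"],
    "total_price": [10.0, 20.0, 5.0, 7.0, 30.0],
})

# Group by both keys; the result is a Series with a two-level index
totals = toy.groupby(["store_id", "category"])["total_price"].sum()

# Move the inner level (category) into columns, filling absent pairs with 0
wide = totals.unstack(fill_value=0)
print(wide)
```

Whether zero or `NaN` is the right placeholder depends on the analysis: zero reads well in totals and charts, while `NaN` keeps "no data" distinct from "no sales".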

In [None]:
# Group by date (truncated to week) and region
sales_df['week'] = sales_df['date'].dt.to_period('W')
weekly_region_sales = sales_df.groupby(['week', 'region'])['total_price'].sum().reset_index()
print("Weekly sales by region:")
print(weekly_region_sales)

# SQL equivalent (PostgreSQL):
# SELECT 
#     DATE_TRUNC('week', date) as week,
#     region,
#     SUM(total_price) as total_price
# FROM sales
# GROUP BY DATE_TRUNC('week', date), region
# ORDER BY week, region

# Create a line plot for weekly sales by region
weekly_pivot = weekly_region_sales.pivot(index='week', columns='region', values='total_price')
weekly_pivot.plot(figsize=(12, 6), marker='o')
plt.title('Weekly Sales by Region')
plt.ylabel('Total Sales ($)')
plt.xlabel('Week')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## 4. Aggregation Functions

Pandas provides several built-in aggregation functions, similar to SQL's aggregate functions:

In [None]:
# Basic aggregation functions
category_aggs = sales_df.groupby('category').agg({
    'total_price': ['sum', 'mean', 'min', 'max', 'count', 'std'],
    'quantity': ['sum', 'mean', 'min', 'max']
})

print("Multiple aggregations by category:")
print(category_aggs)

# SQL equivalent:
# SELECT 
#     category,
#     SUM(total_price) as total_price_sum,
#     AVG(total_price) as total_price_mean,
#     MIN(total_price) as total_price_min,
#     MAX(total_price) as total_price_max,
#     COUNT(total_price) as total_price_count,
#     STDDEV(total_price) as total_price_std,
#     SUM(quantity) as quantity_sum,
#     AVG(quantity) as quantity_mean,
#     MIN(quantity) as quantity_min,
#     MAX(quantity) as quantity_max
# FROM sales
# GROUP BY category

In [None]:
# Flattening multi-level column names for easier access
category_aggs.columns = ['_'.join(col).strip() for col in category_aggs.columns.values]
category_aggs.reset_index(inplace=True)
print("Flattened column names:")
print(category_aggs)

## 5. Custom Aggregation Functions

In addition to built-in aggregation functions, we can define custom aggregations:

In [None]:
# Define custom aggregation functions
def range_calc(x):
    return x.max() - x.min()

def discount_pct(x):
    return (x > 0).mean() * 100  # Percentage of sales with discounts

# Apply custom aggregations
custom_aggs = sales_df.groupby('category').agg({
    'total_price': ['sum', range_calc],  # Mix of built-in and custom
    'unit_price': lambda x: x.quantile(0.75) - x.quantile(0.25),  # IQR
    'discount': discount_pct  # Custom function
})

# Rename columns
custom_aggs.columns = ['_'.join(col).strip() for col in custom_aggs.columns.values]
custom_aggs.rename(columns={
    'total_price_range_calc': 'total_price_range',
    'unit_price_<lambda>': 'unit_price_iqr',
    'discount_discount_pct': 'pct_discounted'
}, inplace=True)

print("Custom aggregations:")
print(custom_aggs)

## 6. The GroupBy Object

When you call `.groupby()`, Pandas returns a GroupBy object that you can use to perform various operations on the groups:

In [None]:
# Create a GroupBy object
grouped = sales_df.groupby('category')

# Examine the GroupBy object
print(f"Type of grouped: {type(grouped)}")
print(f"Groups: {list(grouped.groups.keys())}")

# Get the group for 'Electronics'
electronics_group = grouped.get_group('Electronics')
print("\nElectronics group:")
print(electronics_group.head())

# Iterate through groups
print("\nFirst 2 rows of each group:")
for name, group in grouped:
    print(f"\nGroup: {name}")
    print(group.head(2))  # Show first 2 rows of each group

## 7. Transformation with GroupBy

The `.transform()` method applies a function to each group and returns a Series or DataFrame with the same index and length as the input, so every row receives its group's result. This makes it ideal for creating features based on group statistics:

In [None]:
# Create a copy of the DataFrame to avoid modifying the original
sales_transform = sales_df.copy()

# Add group averages to each row
sales_transform['category_avg_price'] = sales_transform.groupby('category')['total_price'].transform('mean')

# Calculate percentage of category average
sales_transform['pct_of_category_avg'] = (sales_transform['total_price'] / sales_transform['category_avg_price'] * 100).round(2)

print("DataFrame with transformed values:")
print(sales_transform[['category', 'total_price', 'category_avg_price', 'pct_of_category_avg']].head(10))

# SQL equivalent (using window functions):
# SELECT
#     category,
#     total_price,
#     AVG(total_price) OVER (PARTITION BY category) as category_avg_price,
#     (total_price / AVG(total_price) OVER (PARTITION BY category)) * 100 as pct_of_category_avg
# FROM sales

# Find sales above/below category average
above_avg = sales_transform[sales_transform['pct_of_category_avg'] > 100]
print(f"\nPercentage of sales above category average: {len(above_avg) / len(sales_transform) * 100:.2f}%")
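
The difference between aggregating and transforming is easiest to see side by side. This minimal sketch uses a made-up `toy` DataFrame (an assumption, not the session's data) to compare the output shapes of `.agg('mean')` and `.transform('mean')`.

```python
import pandas as pd

# Hypothetical toy data: two categories, five rows
toy = pd.DataFrame({
    "category": ["Books", "Books", "Home", "Home", "Home"],
    "total_price": [10.0, 30.0, 5.0, 10.0, 15.0],
})

prices = toy.groupby("category")["total_price"]

# Aggregation: one value per group, indexed by the group key
per_group = prices.agg("mean")

# Transformation: one value per original row, indexed like the input
per_row = prices.transform("mean")

print(per_group)
print(per_row)
```

Because `per_row` is aligned with `toy`'s index, it can be assigned straight back as a new column, which is what the cell above does with `category_avg_price`.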

In [None]:
# Multiple transformations at once
sales_transform = sales_df.copy()

# Apply several statistics as transforms.
# GroupBy.transform takes one function at a time, so loop over the statistic names.
price_by_category = sales_transform.groupby('category')['total_price']
for stat in ['mean', 'min', 'max', 'std']:
    sales_transform[stat] = price_by_category.transform(stat)

# Calculate Z-score relative to category
sales_transform['z_score'] = (sales_transform['total_price'] - sales_transform['mean']) / sales_transform['std']

print("DataFrame with multiple transforms and Z-score:")
cols_to_show = ['category', 'total_price', 'mean', 'std', 'z_score']
print(sales_transform[cols_to_show].head(10))

# Identify outlier sales (Z-score > 1.5 or < -1.5)
outliers = sales_transform[abs(sales_transform['z_score']) > 1.5]
print(f"\nNumber of outlier sales: {len(outliers)}")
if len(outliers) > 0:
    print("Sample of outliers:")
    print(outliers[cols_to_show].head())

## 8. Filtering with GroupBy

The `.filter()` method allows you to filter entire groups based on a condition:

In [None]:
# Filter categories that have more than 8 sales
high_volume_categories = sales_df.groupby('category').filter(lambda x: len(x) > 8)
print(f"Categories with more than 8 sales: {high_volume_categories['category'].unique()}")
print(f"Original DataFrame shape: {sales_df.shape}, Filtered shape: {high_volume_categories.shape}")

# Filter stores with total sales > $400
high_sales_stores = sales_df.groupby('store_id').filter(lambda x: x['total_price'].sum() > 400)
print(f"\nStores with total sales > $400: {high_sales_stores['store_id'].unique()}")
print(f"Original DataFrame shape: {sales_df.shape}, Filtered shape: {high_sales_stores.shape}")

# SQL equivalent (for the high sales stores):
# SELECT *
# FROM sales
# WHERE store_id IN (
#     SELECT store_id 
#     FROM sales 
#     GROUP BY store_id 
#     HAVING SUM(total_price) > 400
# )
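
Note that `.filter()` returns the original rows of the qualifying groups, whereas a plain SQL `HAVING` query returns one aggregated row per group. The sketch below, on made-up data (an assumption for illustration), shows both results, plus a boolean-mask version built with `transform` that selects the same rows as `.filter()`.

```python
import pandas as pd

# Hypothetical toy data: store totals are S01=500, S02=100, S03=500
toy = pd.DataFrame({
    "store_id": ["S01", "S01", "S02", "S03", "S03"],
    "total_price": [300.0, 200.0, 100.0, 250.0, 250.0],
})

# Row-level result with .filter(): keep every row of stores whose total exceeds 400
via_filter = toy.groupby("store_id").filter(lambda g: g["total_price"].sum() > 400)

# Same rows via a boolean mask: broadcast each store's total back onto its rows
store_total = toy.groupby("store_id")["total_price"].transform("sum")
via_mask = toy[store_total > 400]

# Group-level result (what GROUP BY ... HAVING returns): one row per store
totals = toy.groupby("store_id")["total_price"].sum()
groups_kept = totals[totals > 400]

print(via_filter)
print(groups_kept)
```

The mask version avoids calling a Python lambda once per group, which may matter when there are many groups; for a handful of groups either form is fine.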

## 9. GroupBy with Time Series Data

GroupBy is particularly useful with time series data, allowing you to aggregate data by various time periods:

In [None]:
# Group by day
daily_sales = sales_df.groupby(sales_df['date'].dt.date)['total_price'].sum().reset_index()
daily_sales.columns = ['date', 'daily_sales']
print("Daily sales:")
print(daily_sales.head())

# Group by week
weekly_sales = sales_df.groupby(pd.Grouper(key='date', freq='W'))['total_price'].sum().reset_index()
weekly_sales.columns = ['week_ending', 'weekly_sales']
print("\nWeekly sales:")
print(weekly_sales)

# Group by day of week
sales_df['day_of_week'] = sales_df['date'].dt.day_name()
day_of_week_sales = sales_df.groupby('day_of_week')['total_price'].sum().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
print("\nSales by day of week:")
print(day_of_week_sales)

# Plot daily and weekly sales
fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Daily sales
axes[0].plot(daily_sales['date'], daily_sales['daily_sales'], marker='o')
axes[0].set_title('Daily Sales')
axes[0].set_ylabel('Sales ($)')
axes[0].grid(True, linestyle='--', alpha=0.7)

# Weekly sales
axes[1].plot(weekly_sales['week_ending'], weekly_sales['weekly_sales'], marker='s', linewidth=2)
axes[1].set_title('Weekly Sales')
axes[1].set_ylabel('Sales ($)')
axes[1].grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Plot sales by day of week
plt.figure(figsize=(10, 6))
day_of_week_sales.plot(kind='bar')
plt.title('Sales by Day of Week')
plt.ylabel('Total Sales ($)')
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
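
For a single time-based key, `resample` is the time-series counterpart of grouping with `pd.Grouper`. The sketch below uses a made-up two-week DataFrame (an assumption, not `sales_df`) to show that both approaches produce the same weekly totals.

```python
import pandas as pd

# Hypothetical toy data: 14 consecutive days starting Wednesday 2025-01-01
toy = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=14, freq="D"),
    "total_price": [10.0] * 14,
})

# Weekly totals with GroupBy and pd.Grouper (weeks end on Sunday by default)
by_grouper = toy.groupby(pd.Grouper(key="date", freq="W"))["total_price"].sum()

# The same totals with resample, which needs a DatetimeIndex
by_resample = toy.set_index("date")["total_price"].resample("W").sum()

print(by_grouper)
```

`pd.Grouper` remains the better fit when the time key has to be combined with other keys such as `region`, since it can sit in a list passed to `groupby`.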

## 10. Applying Multiple Functions with Named Aggregation

Pandas 0.25 introduced named aggregation, a more readable way to name output columns while aggregating:

In [None]:
# Named aggregation
named_aggs = sales_df.groupby('category').agg(
    total_sales=('total_price', 'sum'),
    avg_price=('total_price', 'mean'),
    num_sales=('total_price', 'count'),
    min_price=('total_price', 'min'),
    max_price=('total_price', 'max'),
    total_quantity=('quantity', 'sum'),
    avg_quantity=('quantity', 'mean')
)

print("Named aggregation results:")
print(named_aggs)

# SQL equivalent:
# SELECT 
#     category,
#     SUM(total_price) as total_sales,
#     AVG(total_price) as avg_price,
#     COUNT(total_price) as num_sales,
#     MIN(total_price) as min_price,
#     MAX(total_price) as max_price,
#     SUM(quantity) as total_quantity,
#     AVG(quantity) as avg_quantity
# FROM sales
# GROUP BY category
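
Named aggregation also accepts custom callables, which sidesteps the column-flattening and renaming step used in Section 5. A minimal sketch on made-up data follows; `price_range` and the `toy` DataFrame are hypothetical names introduced for this example.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "category": ["Books", "Books", "Home", "Home"],
    "total_price": [10.0, 30.0, 5.0, 25.0],
    "discount": [0.0, 0.1, 0.0, 0.0],
})

def price_range(x):
    """Spread between the largest and smallest value in the group."""
    return x.max() - x.min()

# Each keyword becomes the output column name: (source column, function)
summary = toy.groupby("category").agg(
    total_price_range=("total_price", price_range),
    pct_discounted=("discount", lambda x: (x > 0).mean() * 100),
)

print(summary)
```

The output columns are already flat and carry the names given as keywords, so no `rename` call is needed afterwards.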

## 11. SQL to Pandas GroupBy Translation Guide

Here's a reference guide for translating common SQL GROUP BY operations to Pandas GroupBy:

| SQL Operation | Pandas Equivalent |
|--------------|-------------------|
| `SELECT col1, SUM(col2) FROM table GROUP BY col1` | `df.groupby('col1')['col2'].sum()` |
| `SELECT col1, col2, SUM(col3) FROM table GROUP BY col1, col2` | `df.groupby(['col1', 'col2'])['col3'].sum()` |
| `SELECT col1, AVG(col2) FROM table GROUP BY col1` | `df.groupby('col1')['col2'].mean()` |
| `SELECT col1, COUNT(*) FROM table GROUP BY col1` | `df.groupby('col1').size()` |
| `SELECT col1, COUNT(col2) FROM table GROUP BY col1` | `df.groupby('col1')['col2'].count()` |
| `SELECT col1, SUM(col2), AVG(col3) FROM table GROUP BY col1` | `df.groupby('col1').agg({'col2': 'sum', 'col3': 'mean'})` |
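| `SELECT col1, SUM(col2) FROM table GROUP BY col1 HAVING SUM(col2) > 100` | `df.groupby('col1')['col2'].sum().loc[lambda s: s > 100]` |
| `SELECT * FROM table WHERE col1 IN (SELECT col1 FROM table GROUP BY col1 HAVING SUM(col2) > 100)` | `df.groupby('col1').filter(lambda x: x['col2'].sum() > 100)` |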
| `SELECT col1, SUM(col2) AS sum_col2 FROM table GROUP BY col1 ORDER BY sum_col2 DESC` | `df.groupby('col1')['col2'].sum().sort_values(ascending=False)` |
| `SELECT col1, col2, AVG(col3) OVER (PARTITION BY col1) FROM table` | `df['avg_col3'] = df.groupby('col1')['col3'].transform('mean')` |
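
The `COUNT(*)` and `COUNT(col2)` rows in the table differ only when `col2` contains missing values. The sketch below uses a made-up DataFrame (an assumption for illustration) to show that `size()` counts every row while `count()` skips `NaN`, mirroring the SQL behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in col2
toy = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "b"],
    "col2": [1.0, np.nan, 2.0, 3.0, np.nan],
})

# COUNT(*): every row in the group, missing or not
row_counts = toy.groupby("col1").size()

# COUNT(col2): only the non-missing values of col2
non_null_counts = toy.groupby("col1")["col2"].count()

print(row_counts)
print(non_null_counts)
```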

## 12. Practice Exercises

### Exercise 1: Basic GroupBy
Calculate the total quantity and average unit price for each region in the `sales_df` DataFrame.

In [None]:
# Your code here

### Exercise 2: Multiple Group Keys
Group the `sales_df` DataFrame by both `region` and `store_id`, and calculate the total sales. Sort the results by total sales in descending order.

In [None]:
# Your code here

### Exercise 3: Transform
Add a column to the `sales_df` DataFrame that shows what percentage each transaction's quantity is of the store's average quantity.

In [None]:
# Your code here

### Exercise 4: SQL to Pandas Translation
Translate the following SQL query to Pandas code:
```sql
SELECT 
    category,
    region,
    COUNT(*) as num_sales,
    SUM(total_price) as total_sales,
    AVG(discount) as avg_discount
FROM sales
GROUP BY category, region
HAVING COUNT(*) > 2
ORDER BY total_sales DESC
```

In [None]:
# Your code here

### Exercise 5: Time-Based Grouping
Group the `sales_df` DataFrame by week and calculate the average daily sales for each week. Then, identify the week with the highest average daily sales.

In [None]:
# Your code here

## Next Steps

In the next parts of today's session, we'll continue with:
- Part 2: More advanced aggregation functions
- Part 3: Pivot tables and cross-tabulations

Continue to Part 2: Advanced Aggregation Functions when you're ready to proceed.