# GroupBy and Aggregation in Pandas

## Overview

**GroupBy** implements the split-apply-combine strategy: split the rows into groups, apply a function to each group, and combine the results.

### The Split-Apply-Combine Pattern

```
1. SPLIT    → Divide data into groups
2. APPLY    → Apply function to each group
3. COMBINE  → Combine results back together
```

### Visual Example

```
Original Data:
Product   Region   Sales
Laptop    North    1000
Phone     South     500
Laptop    South     800
Phone     North     600

           ↓ SPLIT by Product

Group 1: Laptop    [1000, 800]
Group 2: Phone     [500, 600]

           ↓ APPLY sum()

Laptop: 1800
Phone:  1100

           ↓ COMBINE

Product   Total_Sales
Laptop    1800
Phone     1100
```
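
The diagram above can be reproduced in a few lines. This is a minimal sketch that assumes the four rows shown in the diagram; the column names `Product`, `Region`, `Sales`, and `Total_Sales` are copied from it.

```python
import pandas as pd

# The four rows from the diagram
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Laptop", "Phone"],
    "Region": ["North", "South", "South", "North"],
    "Sales": [1000, 500, 800, 600],
})

# SPLIT by Product, APPLY sum() to Sales, COMBINE into one Series
totals = df.groupby("Product")["Sales"].sum()

# Turn the Series into the two-column table shown at the end of the diagram
summary = totals.rename("Total_Sales").reset_index()
print(summary)
```

Pandas performs the split, apply, and combine steps inside the single `groupby(...).sum()` call; the intermediate groups are never materialized unless you ask for them.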

### Key Concepts

| Concept | Description | Example |
|---------|-------------|----------|
| **GroupBy** | Split data into groups | `df.groupby('category')` |
| **Aggregation** | Summarize each group | `sum()`, `mean()`, `count()` |
| **Transform** | Return same shape | Add group mean to each row |
| **Filter** | Keep/remove groups | Keep groups with sum > 1000 |
| **Apply** | Custom operations | Any function |
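
To make the table concrete, the sketch below runs all four operations on one small, made-up DataFrame (the `category` and `sales` columns and their values are illustrative assumptions, not part of the sales dataset used later).

```python
import pandas as pd

# Hypothetical toy data: two groups, A and B
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "sales": [600, 700, 100, 200, 300],
})
g = df.groupby("category")["sales"]

# Aggregation: one value per group
agg_result = g.sum()

# Transform: one value per original row (same length as df)
transform_result = g.transform("mean")

# Filter: keep every row of the groups whose total exceeds 1000
filter_result = df.groupby("category").filter(lambda grp: grp["sales"].sum() > 1000)

# Apply: arbitrary function per group, here the range (max - min)
apply_result = g.apply(lambda s: s.max() - s.min())

print(agg_result, transform_result, filter_result, apply_result, sep="\n\n")
```

Note how the result shapes differ: aggregation and apply return one entry per group, transform returns one entry per row, and filter returns a subset of the original rows.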

### What We'll Learn
1. ✅ Basic groupby operations
2. ✅ Single and multiple aggregations
3. ✅ Multiple groupby columns
4. ✅ Transform vs aggregation
5. ✅ Filtering groups
6. ✅ Custom aggregation functions
7. ✅ Named aggregations
8. ✅ Real-world business analysis

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 2)

print("✅ Libraries imported")
print(f"Pandas version: {pd.__version__}")

## Sample Dataset: E-commerce Sales

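We'll use a synthetic sales dataset of 200 randomly generated orders to demonstrate groupby operations. Because the values are random, the numbers are only meaningful as illustrations of the techniques.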

In [None]:
# Create realistic e-commerce sales data
np.random.seed(42)

n_orders = 200

# Generate dates over 3 months
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=np.random.randint(0, 90)) for _ in range(n_orders)]

sales_df = pd.DataFrame({
    'order_id': range(1001, 1001 + n_orders),
    'date': dates,
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones', 'Watch', 'Camera'], n_orders),
    'category': np.random.choice(['Electronics', 'Accessories'], n_orders),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_orders),
    'customer_type': np.random.choice(['New', 'Returning', 'VIP'], n_orders, p=[0.4, 0.4, 0.2]),
    'quantity': np.random.randint(1, 6, n_orders),
    'unit_price': np.random.choice([299, 599, 899, 1299, 1999, 2999], n_orders),
    'discount_%': np.random.choice([0, 5, 10, 15, 20], n_orders, p=[0.4, 0.25, 0.2, 0.1, 0.05])
})

# Calculate derived columns
sales_df['subtotal'] = sales_df['quantity'] * sales_df['unit_price']
sales_df['discount_amount'] = sales_df['subtotal'] * sales_df['discount_%'] / 100
sales_df['total_amount'] = sales_df['subtotal'] - sales_df['discount_amount']
sales_df['month'] = pd.to_datetime(sales_df['date']).dt.month_name()
sales_df['year'] = pd.to_datetime(sales_df['date']).dt.year

print("Sales Dataset:")
print(sales_df.head(15))
print(f"\nShape: {sales_df.shape}")
print(f"\nColumns: {sales_df.columns.tolist()}")
print(f"\nData types:\n{sales_df.dtypes}")

## 1. GroupBy Basics

### What is GroupBy?

**GroupBy** creates a grouped object that splits data by unique values in a column.

### Syntax

```python
# Basic groupby
grouped = df.groupby('column')

# Apply aggregation
result = df.groupby('column')['value_column'].sum()
```

### Understanding the GroupBy Object

```python
grouped = df.groupby('product')
# This creates a GroupBy object (not a DataFrame yet)

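# To see results, apply an aggregation.
# numeric_only=True skips text columns, which mean() cannot average
# (recent pandas versions raise an error instead of dropping them silently).
grouped.sum(numeric_only=True)    # Sum for each group
grouped.mean(numeric_only=True)   # Mean for each group
grouped.count()                   # Non-null count for each group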
```

### Basic Workflow

```python
# Step 1: Group by category
df.groupby('category')

# Step 2: Select column to aggregate
df.groupby('category')['sales']

# Step 3: Apply aggregation
df.groupby('category')['sales'].sum()
```

In [None]:
print("=== GROUPBY BASICS ===\n")

# Example 1: Simple groupby with sum
print("Example 1: Total sales by product")
product_sales = sales_df.groupby('product')['total_amount'].sum()
print(product_sales.sort_values(ascending=False))
print(f"\nType: {type(product_sales)}")
print()

# Example 2: Count orders by product
print("Example 2: Number of orders by product")
product_count = sales_df.groupby('product')['order_id'].count()
print(product_count.sort_values(ascending=False))
print()

# Example 3: Average order value by region
print("Example 3: Average order value by region")
region_avg = sales_df.groupby('region')['total_amount'].mean()
print(region_avg.sort_values(ascending=False))
print()

# Example 4: Understanding GroupBy object
print("Example 4: Understanding the GroupBy object")
grouped = sales_df.groupby('product')
print(f"GroupBy object: {grouped}")
print(f"Number of groups: {grouped.ngroups}")
print(f"Group names: {list(grouped.groups.keys())}")
print()

# Example 5: First non-null value per column in each group (use .last() for the last)
print("Example 5: First order for each product")
first_orders = sales_df.groupby('product').first()
print(first_orders[['date', 'quantity', 'total_amount']].head())
print()

# Example 6: Group size (number of items in each group)
print("Example 6: Size of each product group")
group_sizes = sales_df.groupby('product').size()
print(group_sizes.sort_values(ascending=False))

## 2. Common Aggregation Functions

### Built-in Aggregation Functions

| Function | Description | Example Use Case |
|----------|-------------|------------------|
| `sum()` | Sum of values | Total revenue |
| `mean()` | Average | Average order value |
| `median()` | Middle value | Median price |
| `min()` | Minimum value | Lowest price |
| `max()` | Maximum value | Highest revenue |
| `count()` | Number of items | Order count |
| `std()` | Standard deviation | Price volatility |
| `var()` | Variance | Revenue variance |
| `first()` | First value | First order date |
| `last()` | Last value | Last order date |
| `size()` | Group size | Items per group |
| `nunique()` | Unique count | Unique customers |
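
`size()`, `count()`, and `nunique()` are easy to confuse, and the difference only shows up when a group contains missing or repeated values. The following sketch uses a hypothetical five-row frame with one `NaN` to show it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the Laptop group has one missing amount and one repeated amount
df = pd.DataFrame({
    "product": ["Laptop", "Laptop", "Laptop", "Phone", "Phone"],
    "amount": [1000.0, np.nan, 1000.0, 500.0, 600.0],
})
g = df.groupby("product")["amount"]

sizes = g.size()       # rows per group, NaN included
counts = g.count()     # non-null values per group
uniques = g.nunique()  # distinct non-null values per group

print(pd.DataFrame({"size": sizes, "count": counts, "nunique": uniques}))
```

For the Laptop group the three functions return three different numbers, which is a quick way to spot missing or duplicated values in a group.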

### Statistical Functions

| Function | Description |
|----------|-------------|
| `quantile(q)` | Quantile (e.g., q=0.25 for 25th percentile) |
| `sem()` | Standard error of mean |
| `skew()` | Skewness |
| `kurt()` | Kurtosis |

### String Functions

```python
# For string columns
df.groupby('category')['name'].agg(lambda x: ', '.join(x))
```

In [None]:
print("=== AGGREGATION FUNCTIONS ===\n")

# Example 1: Multiple aggregations on single column
print("Example 1: Product statistics")
product_stats = sales_df.groupby('product')['total_amount'].agg(['sum', 'mean', 'count', 'min', 'max'])
product_stats.columns = ['Total_Revenue', 'Avg_Order', 'Num_Orders', 'Min_Order', 'Max_Order']
print(product_stats.sort_values('Total_Revenue', ascending=False))
print()

# Example 2: Statistical measures
print("Example 2: Revenue statistics by region")
region_stats = sales_df.groupby('region')['total_amount'].agg([
    'mean', 'median', 'std', 'min', 'max'
])
region_stats.columns = ['Mean', 'Median', 'Std_Dev', 'Min', 'Max']
print(region_stats)
print()

# Example 3: Count unique values
print("Example 3: Unique products per region")
unique_products = sales_df.groupby('region')['product'].nunique()
print(unique_products)
print()

# Example 4: First and last dates
print("Example 4: First and last order date by product")
date_range = sales_df.groupby('product')['date'].agg(['first', 'last'])
date_range.columns = ['First_Order', 'Last_Order']
print(date_range)
print()

# Example 5: Quantiles
print("Example 5: 25th, 50th, 75th percentile of order amounts")
quantiles = sales_df.groupby('customer_type')['total_amount'].quantile([0.25, 0.5, 0.75])
print(quantiles)
print()

# Example 6: Size vs count
print("Example 6: Difference between size() and count()")
print("\nsize() - includes NaN:")
print(sales_df.groupby('product').size())
print("\ncount() - excludes NaN:")
print(sales_df.groupby('product')['total_amount'].count())
print("\nNote: They're the same if no NaN values exist")

## 3. Aggregating Multiple Columns

### Different Aggregations for Different Columns

```python
df.groupby('category').agg({
    'sales': 'sum',
    'profit': 'mean',
    'customer': 'nunique'
})
```

### Multiple Aggregations per Column

```python
df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': ['sum', 'mean']
})
```

### Named Aggregations (Pandas 0.25+)

```python
df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_profit=('profit', 'mean'),
    unique_customers=('customer', 'nunique')
)
```

### Benefits of Named Aggregations
- ✅ Clear column names
- ✅ No MultiIndex columns
- ✅ More readable code
- ✅ Easier to work with results
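
The "No MultiIndex columns" point is easiest to see side by side. This sketch assumes a tiny made-up frame with `category`, `sales`, and `profit` columns and compares the column labels produced by the two styles.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "category": ["A", "A", "B"],
    "sales": [100, 200, 50],
    "profit": [10, 30, 5],
})

# Dict with lists of functions: columns become a two-level MultiIndex
multi = df.groupby("category").agg({"sales": ["sum", "mean"], "profit": ["mean"]})
print(multi.columns.tolist())

# Named aggregation: flat, self-describing column names
named = df.groupby("category").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    avg_profit=("profit", "mean"),
)
print(named.columns.tolist())
```

With the named form, a column can be selected as `named["total_sales"]`; with the dict form the same column is addressed by the tuple `multi[("sales", "sum")]`.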

In [None]:
print("=== MULTIPLE COLUMN AGGREGATION ===\n")

# Example 1: Different aggregations for different columns
print("Example 1: Product performance metrics")
product_metrics = sales_df.groupby('product').agg({
    'total_amount': 'sum',
    'quantity': 'sum',
    'order_id': 'count',
    'discount_%': 'mean'
})
product_metrics.columns = ['Total_Revenue', 'Units_Sold', 'Num_Orders', 'Avg_Discount']
print(product_metrics.sort_values('Total_Revenue', ascending=False))
print()

# Example 2: Multiple aggregations per column
print("Example 2: Comprehensive product analysis")
comprehensive = sales_df.groupby('product').agg({
    'total_amount': ['sum', 'mean', 'max'],
    'quantity': ['sum', 'mean'],
    'order_id': 'count'
})
print(comprehensive)
print()

# Example 3: Named aggregations (cleaner)
print("Example 3: Named aggregations (recommended)")
named_agg = sales_df.groupby('product').agg(
    total_revenue=('total_amount', 'sum'),
    avg_order_value=('total_amount', 'mean'),
    max_order=('total_amount', 'max'),
    units_sold=('quantity', 'sum'),
    num_orders=('order_id', 'count'),
    avg_discount=('discount_%', 'mean')
)
print(named_agg.sort_values('total_revenue', ascending=False))
print()

# Example 4: Custom aggregation names with round
print("Example 4: Formatted results")
formatted = sales_df.groupby('region').agg(
    total_revenue=('total_amount', 'sum'),
    avg_order_value=('total_amount', 'mean'),
    num_orders=('order_id', 'count')
).round(2)
print(formatted.sort_values('total_revenue', ascending=False))
print()

# Example 5: Complex aggregations
print("Example 5: Customer type analysis")
customer_analysis = sales_df.groupby('customer_type').agg(
    total_revenue=('total_amount', 'sum'),
    avg_order=('total_amount', 'mean'),
    median_order=('total_amount', 'median'),
    total_orders=('order_id', 'count'),
    unique_products=('product', 'nunique'),
    avg_quantity=('quantity', 'mean'),
    avg_discount=('discount_%', 'mean')
).round(2)
print(customer_analysis.sort_values('total_revenue', ascending=False))

## 4. Grouping by Multiple Columns

### Syntax

```python
# Group by multiple columns
df.groupby(['col1', 'col2'])['value'].sum()
```

### How It Works

Creates groups for each **unique combination** of values:

```
Original Data:
Product   Region   Sales
Laptop    North    1000
Laptop    South     800
Phone     North     600
Phone     South     500

After groupby(['Product', 'Region']):

Group 1: (Laptop, North)  → 1000
Group 2: (Laptop, South)  → 800
Group 3: (Phone, North)   → 600
Group 4: (Phone, South)   → 500
```

### MultiIndex Result

Result has a **MultiIndex** (hierarchical index):

```python
result = df.groupby(['Product', 'Region'])['Sales'].sum()
# Index: (Laptop, North), (Laptop, South), ...
```

### Reset Index

```python
# Convert MultiIndex back to columns
result.reset_index()
```

### Use Cases
- Sales by product × region
- Revenue by year × month
- Performance by team × employee
- Cross-dimensional analysis
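
As a rough illustration of working with the MultiIndex result, the sketch below uses the four-row Product × Region table from earlier in this section and shows three ways to consume it: tuple lookup, `reset_index()`, and `unstack()`.

```python
import pandas as pd

# The four rows from the "How It Works" example
df = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone"],
    "Region": ["North", "South", "North", "South"],
    "Sales": [1000, 800, 600, 500],
})

by_pair = df.groupby(["Product", "Region"])["Sales"].sum()

# 1. Look up one combination with a tuple key
laptop_north = by_pair.loc[("Laptop", "North")]

# 2. Flatten the MultiIndex back into ordinary columns
flat = by_pair.reset_index()

# 3. Pivot one index level into columns for a Product x Region grid
wide = by_pair.unstack("Region")

print(laptop_north, flat, wide, sep="\n\n")
```

`reset_index()` is usually the better choice for further filtering or merging, while `unstack()` gives a compact grid for reading or plotting.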

In [None]:
print("=== GROUPING BY MULTIPLE COLUMNS ===\n")

# Example 1: Product × Region analysis
print("Example 1: Sales by product and region")
product_region = sales_df.groupby(['product', 'region'])['total_amount'].sum()
print(product_region.sort_values(ascending=False).head(15))
print(f"\nIndex type: {type(product_region.index)}")
print()

# Example 2: Reset index for cleaner output
print("Example 2: Same data with reset index")
product_region_df = sales_df.groupby(['product', 'region'])['total_amount'].sum().reset_index()
product_region_df.columns = ['Product', 'Region', 'Total_Sales']
print(product_region_df.sort_values('Total_Sales', ascending=False).head(10))
print()

# Example 3: Three-way grouping
print("Example 3: Product × Region × Customer Type")
three_way = sales_df.groupby(['product', 'region', 'customer_type']).agg(
    total_sales=('total_amount', 'sum'),
    num_orders=('order_id', 'count'),
    avg_order=('total_amount', 'mean')
).round(2)
print(three_way.sort_values('total_sales', ascending=False).head(15))
print()

# Example 4: Month × Product analysis
print("Example 4: Monthly product sales")
monthly_product = sales_df.groupby(['month', 'product']).agg(
    total_revenue=('total_amount', 'sum'),
    num_orders=('order_id', 'count')
).reset_index()
print(monthly_product.sort_values('total_revenue', ascending=False).head(15))
print()

# Example 5: Comprehensive multi-dimensional analysis
print("Example 5: Region × Customer Type performance")
region_customer = sales_df.groupby(['region', 'customer_type']).agg(
    total_revenue=('total_amount', 'sum'),
    avg_order=('total_amount', 'mean'),
    num_orders=('order_id', 'count'),
    units_sold=('quantity', 'sum'),
    unique_products=('product', 'nunique')
).round(2).reset_index()
print(region_customer.sort_values('total_revenue', ascending=False))

## 5. Transform vs Aggregation

### Key Difference

| Operation | Returns | Shape | Use Case |
|-----------|---------|-------|----------|
| **Aggregation** | Summary per group | Reduced (fewer rows) | Get group totals |
| **Transform** | Value per row | Same as original | Add group stats to each row |

### Visual Comparison

```
Original (5 rows):
Product   Sales
Laptop    1000
Laptop     800
Phone      500
Phone      600
Tablet     700

AGGREGATION (3 rows - one per group):
Product   Total_Sales
Laptop    1800
Phone     1100
Tablet     700

TRANSFORM (5 rows - same as original):
Product   Sales   Group_Total
Laptop    1000    1800
Laptop     800    1800
Phone      500    1100
Phone      600    1100
Tablet     700     700
```

### Syntax

```python
# Aggregation - reduces rows
df.groupby('category')['sales'].sum()

# Transform - keeps all rows
df['group_total'] = df.groupby('category')['sales'].transform('sum')
```

### Common Use Cases for Transform
- Add group mean/median to each row
- Calculate percentage of group total
- Normalize within groups
- Fill missing values with group average
- Compare individual to group performance
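
Two of these use cases, filling gaps with the group average and computing each row's share of its group total, can be sketched on a small made-up frame. The column names `sales_filled`, `group_total`, and `pct_of_group` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Phone value
df = pd.DataFrame({
    "product": ["Laptop", "Laptop", "Phone", "Phone", "Tablet"],
    "sales": [1000.0, 800.0, 500.0, np.nan, 700.0],
})

# Fill each missing value with the mean of its own group
df["sales_filled"] = df.groupby("product")["sales"].transform(
    lambda s: s.fillna(s.mean())
)

# Broadcast the group total to every row, then compute each row's share
df["group_total"] = df.groupby("product")["sales_filled"].transform("sum")
df["pct_of_group"] = (df["sales_filled"] / df["group_total"] * 100).round(2)

print(df)
```

Every step keeps the original five rows, which is what lets the results be assigned straight back as new columns.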

In [None]:
print("=== TRANSFORM VS AGGREGATION ===\n")

# Example 1: Compare aggregation vs transform
print("Example 1: Aggregation vs Transform")
print("\nAggregation (reduces rows):")
agg_result = sales_df.groupby('product')['total_amount'].sum()
print(f"Shape: {agg_result.shape}")
print(agg_result.head())

print("\nTransform (same number of rows):")
sales_df['product_total'] = sales_df.groupby('product')['total_amount'].transform('sum')
print(f"Shape: {sales_df.shape}")
print(sales_df[['product', 'total_amount', 'product_total']].head(10))
print()

# Example 2: Add group statistics to each row
print("Example 2: Add multiple group statistics")
sales_df['product_mean'] = sales_df.groupby('product')['total_amount'].transform('mean')
sales_df['product_max'] = sales_df.groupby('product')['total_amount'].transform('max')
sales_df['product_min'] = sales_df.groupby('product')['total_amount'].transform('min')
print(sales_df[['product', 'total_amount', 'product_mean', 'product_max', 'product_min']].head(10))
print()

# Example 3: Calculate percentage of group total
print("Example 3: Percentage of product total")
sales_df['pct_of_product_total'] = (
    sales_df['total_amount'] / sales_df['product_total'] * 100
).round(2)
print(sales_df[['product', 'total_amount', 'product_total', 'pct_of_product_total']].head(15))
print()

# Example 4: Normalize within groups (z-score)
print("Example 4: Z-score normalization within products")
sales_df['product_std'] = sales_df.groupby('product')['total_amount'].transform('std')
sales_df['z_score'] = (
    (sales_df['total_amount'] - sales_df['product_mean']) / sales_df['product_std']
).round(2)
print(sales_df[['product', 'total_amount', 'product_mean', 'product_std', 'z_score']].head(10))
print("\nInterpretation: z_score shows how many std devs from product mean")
print()

# Example 5: Rank within groups
print("Example 5: Rank orders within each product")
sales_df['product_rank'] = sales_df.groupby('product')['total_amount'].rank(
    ascending=False, method='dense'
)
top_orders = sales_df[sales_df['product_rank'] <= 3].sort_values(['product', 'product_rank'])
print(top_orders[['product', 'total_amount', 'product_rank']].head(15))
print()

# Example 6: Fill missing values with group mean
print("Example 6: Fill missing with group mean (demo)")
# Create sample with missing values
demo_df = sales_df[['product', 'total_amount']].head(20).copy()
demo_df.loc[2, 'total_amount'] = np.nan
demo_df.loc[5, 'total_amount'] = np.nan
print("Before filling:")
print(demo_df.head(10))
print()
demo_df['total_amount'] = demo_df.groupby('product')['total_amount'].transform(
    lambda x: x.fillna(x.mean())
)
print("After filling with group mean:")
print(demo_df.head(10))

## 6. Filtering Groups

### What is Group Filtering?

**Filter** keeps or removes **entire groups** based on group properties.

### Difference from Row Filtering

```python
# Row filtering: Keep rows where sales > 1000
df[df['sales'] > 1000]

# Group filtering: Keep groups where total sales > 10000
df.groupby('product').filter(lambda x: x['sales'].sum() > 10000)
```

### Syntax

```python
df.groupby('column').filter(function)
```

### Visual Example

```
Original Data:
Product   Sales
Laptop    1000    ┐
Laptop     800    ┘ Total: 1800 ✅ Keep
Phone      200    ┐
Phone      150    ┘ Total: 350 ❌ Remove
Tablet    1200    ┐
Tablet     900    ┘ Total: 2100 ✅ Keep

Filter: Keep groups with total > 500

Result:
Product   Sales
Laptop    1000
Laptop     800
Tablet    1200
Tablet     900
```

### Common Use Cases
- Keep products with > 100 orders
- Remove customers with < $1000 spend
- Keep categories with > 5 unique items
- Filter groups by size
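
Here is the visual example above as code, a minimal sketch using the same six rows. It also shows an alternative that builds a boolean mask with `transform`, which on this data selects the same rows and avoids calling a Python lambda once per group.

```python
import pandas as pd

# The six rows from the visual example
df = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone", "Tablet", "Tablet"],
    "Sales": [1000, 800, 200, 150, 1200, 900],
})

# Group filter: keep all rows of groups whose total exceeds 500
kept = df.groupby("Product").filter(lambda grp: grp["Sales"].sum() > 500)

# Alternative: broadcast the group total to each row and use it as a mask
mask = df.groupby("Product")["Sales"].transform("sum") > 500
kept_via_mask = df[mask]

print(kept)
```

Both results keep the original row order and index labels, so the Phone rows simply disappear from the output.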

In [None]:
print("=== FILTERING GROUPS ===\n")

# Example 1: Keep products with total sales > $20,000
print("Example 1: Products with total sales > $20,000")
high_revenue_products = sales_df.groupby('product').filter(
    lambda x: x['total_amount'].sum() > 20000
)
print(f"Original rows: {len(sales_df)}")
print(f"After filtering: {len(high_revenue_products)}")
print("\nProducts kept:")
print(high_revenue_products.groupby('product')['total_amount'].sum().sort_values(ascending=False))
print()

# Example 2: Keep regions with > 40 orders
print("Example 2: Regions with more than 40 orders")
busy_regions = sales_df.groupby('region').filter(
    lambda x: len(x) > 40
)
print("Region order counts:")
print(busy_regions.groupby('region').size().sort_values(ascending=False))
print()

# Example 3: Keep products with average order > $2000
print("Example 3: Products with high average order value")
high_aov_products = sales_df.groupby('product').filter(
    lambda x: x['total_amount'].mean() > 2000
)
print(high_aov_products.groupby('product')['total_amount'].agg(['mean', 'count']).sort_values('mean', ascending=False))
print()

# Example 4: Keep groups with low variability (std < 500)
print("Example 4: Products with consistent pricing (low std dev)")
consistent_products = sales_df.groupby('product').filter(
    lambda x: x['total_amount'].std() < 500
)
print(consistent_products.groupby('product')['total_amount'].agg(['mean', 'std']).sort_values('std'))
print()

# Example 5: Complex filter - multiple conditions
print("Example 5: Products with >30 orders AND avg order >$1500")
premium_products = sales_df.groupby('product').filter(
    lambda x: (len(x) > 30) and (x['total_amount'].mean() > 1500)
)
print(premium_products.groupby('product').agg(
    num_orders=('order_id', 'count'),
    avg_order=('total_amount', 'mean'),
    total_revenue=('total_amount', 'sum')
).round(2).sort_values('total_revenue', ascending=False))
print()

# Example 6: Filter vs normal boolean indexing comparison
print("Example 6: Filter (entire groups) vs Boolean indexing (individual rows)")
print("\nBoolean indexing (rows with sales > 3000):")
high_sales_rows = sales_df[sales_df['total_amount'] > 3000]
print(f"Rows kept: {len(high_sales_rows)}")
print()
print("Group filter (products with total > 20000):")
print(f"Rows kept: {len(high_revenue_products)}")
print("\nNote: Filter keeps ALL rows of matching groups!")

## 7. Apply and Custom Functions

### Using apply() with GroupBy

**`apply()`** allows you to apply **any function** to each group.

### Syntax

```python
df.groupby('category').apply(function)
```

### When to Use apply()

| Scenario | Use |
|----------|-----|
| **Built-in aggregation** (sum, mean) | `groupby().sum()` ✅ Faster |
| **Custom logic** | `groupby().apply()` |
| **Multiple operations** | `groupby().apply()` |
| **Return DataFrame per group** | `groupby().apply()` |

### Function Types

**1. Lambda functions**
```python
df.groupby('cat').apply(lambda x: x['sales'].max() - x['sales'].min())
```

**2. Named functions**
```python
def custom_stats(group):
    return pd.Series({
        'total': group['sales'].sum(),
        'avg': group['sales'].mean()
    })

df.groupby('cat').apply(custom_stats)
```

**3. Return DataFrame**
```python
def top_n(group, n=3):
    return group.nlargest(n, 'sales')

df.groupby('cat').apply(top_n, n=5)
```

### Common Custom Operations
- Calculate range (max - min)
- Get top N rows per group
- Complex statistical calculations
- Custom business logic
- Weighted averages
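
A weighted average is a typical case where a built-in aggregation is not enough, because the function needs two columns at once. The sketch below assumes a hypothetical frame with `unit_price` and `quantity` columns; it selects those columns explicitly before calling `apply()`, so the function never depends on whether the grouping column is passed in. An `apply()`-free alternative is included for comparison.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "product": ["Laptop", "Laptop", "Phone", "Phone"],
    "unit_price": [1000.0, 1200.0, 500.0, 700.0],
    "quantity": [1, 3, 2, 2],
})

def weighted_avg_price(group):
    """Average unit price, weighted by quantity sold."""
    return (group["unit_price"] * group["quantity"]).sum() / group["quantity"].sum()

# apply() receives one sub-DataFrame per product
result = df.groupby("product")[["unit_price", "quantity"]].apply(weighted_avg_price)

# Same numbers without apply(): aggregate numerator and denominator separately
revenue = df["unit_price"] * df["quantity"]
alt = revenue.groupby(df["product"]).sum() / df.groupby("product")["quantity"].sum()

print(result)
```

The second form uses only built-in aggregations, so it is usually the faster option on large data, while `apply()` keeps the business logic in one readable function.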

In [None]:
print("=== APPLY AND CUSTOM FUNCTIONS ===\n")

# Example 1: Simple lambda with apply
print("Example 1: Revenue range per product (max - min)")
revenue_range = sales_df.groupby('product')['total_amount'].apply(
    lambda x: x.max() - x.min()
)
print(revenue_range.sort_values(ascending=False))
print()

# Example 2: Custom function returning Series
print("Example 2: Custom statistics function")
def custom_stats(group):
    return pd.Series({
        'total_revenue': group['total_amount'].sum(),
        'avg_order': group['total_amount'].mean(),
        'revenue_range': group['total_amount'].max() - group['total_amount'].min(),
        'num_orders': len(group),
        'total_units': group['quantity'].sum()
    })

product_stats = sales_df.groupby('product').apply(custom_stats).round(2)
print(product_stats.sort_values('total_revenue', ascending=False))
print()

# Example 3: Get top 3 orders per product
print("Example 3: Top 3 orders for each product")
def top_orders(group, n=3):
    return group.nlargest(n, 'total_amount')[['order_id', 'total_amount', 'date']]

top_3_per_product = sales_df.groupby('product').apply(top_orders)
print(top_3_per_product.head(15))
print()

# Example 4: Calculate weighted average
print("Example 4: Weighted average price by quantity")
def weighted_avg(group):
    return (group['unit_price'] * group['quantity']).sum() / group['quantity'].sum()

weighted_prices = sales_df.groupby('product').apply(weighted_avg).round(2)
print(weighted_prices.sort_values(ascending=False))
print()

# Example 5: Complex business logic
print("Example 5: Performance score (custom business logic)")
def performance_score(group):
    """Calculate performance score: revenue * order_count * consistency"""
    revenue = group['total_amount'].sum()
    order_count = len(group)
    consistency = 1 / (1 + group['total_amount'].std())  # Lower std = more consistent
    score = revenue * order_count * consistency
    return score

performance = sales_df.groupby('product').apply(performance_score).round(2)
print(performance.sort_values(ascending=False))
print()

# Example 6: Multiple group analysis
print("Example 6: Region × Customer Type performance")
def region_customer_analysis(group):
    return pd.Series({
        'total_revenue': group['total_amount'].sum(),
        'num_orders': len(group),
        'avg_order': group['total_amount'].mean(),
        'avg_discount': group['discount_%'].mean(),
        'units_sold': group['quantity'].sum()
    })

region_customer_perf = sales_df.groupby(['region', 'customer_type']).apply(
    region_customer_analysis
).round(2)
print(region_customer_perf.sort_values('total_revenue', ascending=False).head(10))

## 8. Iterating Over Groups

### When to Iterate

Use iteration when you need to:
- Process each group separately
- Generate reports per group
- Debug groupby operations
- Apply operations that can't be vectorized

### Syntax

```python
for name, group in df.groupby('column'):
    # name: group identifier
    # group: DataFrame for that group
    print(f"Processing {name}")
    print(group)
```

### Multiple Groupby Columns

```python
for (col1_val, col2_val), group in df.groupby(['col1', 'col2']):
    print(f"Group: {col1_val}, {col2_val}")
```

### Access Specific Group

```python
# Get specific group
grouped = df.groupby('product')
laptop_group = grouped.get_group('Laptop')
```

### Performance Note
⚠️ **Iteration is slower** than vectorized operations. Use only when necessary!
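
To illustrate the trade-off, the sketch below computes the same per-group totals twice on a small made-up frame: once with an explicit loop and once with a single vectorized aggregation. No timings are claimed here; the point is only that the two forms are interchangeable for simple aggregations.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "product": ["Laptop", "Phone", "Laptop", "Phone"],
    "sales": [1000, 500, 800, 600],
})

# Loop version: one Python-level iteration per group
totals_loop = {}
for name, group in df.groupby("product"):
    totals_loop[name] = group["sales"].sum()

# Vectorized version: the same result in one call
totals_vec = df.groupby("product")["sales"].sum()

print(totals_loop)
print(totals_vec.to_dict())
```

Reserve the loop for work that genuinely needs the whole sub-DataFrame, such as writing one report or file per group.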

In [None]:
print("=== ITERATING OVER GROUPS ===\n")

# Example 1: Basic iteration
print("Example 1: Iterate and print summary for each product")
for product_name, product_group in sales_df.groupby('product'):
    total = product_group['total_amount'].sum()
    count = len(product_group)
    print(f"{product_name}: {count} orders, ${total:,.2f} revenue")
print()

# Example 2: Get specific group
print("Example 2: Access specific group (Laptop)")
grouped = sales_df.groupby('product')
laptop_data = grouped.get_group('Laptop')
print(f"Laptop orders: {len(laptop_data)}")
print(laptop_data[['order_id', 'quantity', 'total_amount']].head())
print()

# Example 3: Multiple groupby columns
print("Example 3: Iterate over region × customer type")
for (region, cust_type), group in sales_df.groupby(['region', 'customer_type']):
    revenue = group['total_amount'].sum()
    if revenue > 10000:  # Only show significant segments
        print(f"{region} - {cust_type}: ${revenue:,.2f}")
print()

# Example 4: Generate report per group
print("Example 4: Generate mini-reports for each region")
for region, region_data in sales_df.groupby('region'):
    print(f"\n{'='*50}")
    print(f"REGION: {region}")
    print(f"{'='*50}")
    print(f"Total Orders: {len(region_data)}")
    print(f"Total Revenue: ${region_data['total_amount'].sum():,.2f}")
    print(f"Avg Order: ${region_data['total_amount'].mean():,.2f}")
    print(f"Most Ordered Product: {region_data['product'].mode()[0]}")
    print(f"Customer Types: {region_data['customer_type'].nunique()}")
print()

# Example 5: Filter and process
print("Example 5: Process only high-revenue products")
for product, product_data in sales_df.groupby('product'):
    total_revenue = product_data['total_amount'].sum()
    if total_revenue > 30000:
        avg_discount = product_data['discount_%'].mean()
        print(f"{product}: ${total_revenue:,.2f} (Avg discount: {avg_discount:.1f}%)")
print()

# Example 6: Store groups in dictionary
print("Example 6: Store each group in a dictionary")
product_dict = {}
for product, product_data in sales_df.groupby('product'):
    product_dict[product] = product_data

print("Available products:", list(product_dict.keys()))
print(f"\nLaptop group shape: {product_dict['Laptop'].shape}")
print(f"Phone group shape: {product_dict['Phone'].shape}")

## 9. Comprehensive Business Analysis Example

### Scenario: E-commerce Performance Report

**Business Questions:**
1. Which products are top performers?
2. How do different regions compare?
3. What's the performance by customer type?
4. What are the monthly trends?
5. Which product-region combinations are best?
6. What's the discount impact analysis?

We'll use multiple groupby techniques to answer these questions.

In [None]:
print("="*80)
print("COMPREHENSIVE E-COMMERCE PERFORMANCE ANALYSIS")
print("="*80)
print()

# Question 1: Top performing products
print("1. PRODUCT PERFORMANCE ANALYSIS")
print("-" * 80)
product_performance = sales_df.groupby('product').agg(
    total_revenue=('total_amount', 'sum'),
    num_orders=('order_id', 'count'),
    avg_order_value=('total_amount', 'mean'),
    total_units=('quantity', 'sum'),
    avg_discount=('discount_%', 'mean'),
    max_order=('total_amount', 'max'),
    min_order=('total_amount', 'min')
).round(2).sort_values('total_revenue', ascending=False)

# Add market share
total_revenue = sales_df['total_amount'].sum()
product_performance['market_share_%'] = (
    product_performance['total_revenue'] / total_revenue * 100
).round(2)

print(product_performance)
print()

# Question 2: Regional comparison
print("2. REGIONAL PERFORMANCE")
print("-" * 80)
regional_performance = sales_df.groupby('region').agg(
    total_revenue=('total_amount', 'sum'),
    num_orders=('order_id', 'count'),
    avg_order=('total_amount', 'mean'),
    unique_products=('product', 'nunique'),
    total_units=('quantity', 'sum')
).round(2).sort_values('total_revenue', ascending=False)
print(regional_performance)
print()

# Question 3: Customer type analysis
print("3. CUSTOMER TYPE ANALYSIS")
print("-" * 80)
customer_analysis = sales_df.groupby('customer_type').agg(
    total_revenue=('total_amount', 'sum'),
    num_orders=('order_id', 'count'),
    avg_order=('total_amount', 'mean'),
    median_order=('total_amount', 'median'),
    avg_units=('quantity', 'mean'),
    avg_discount=('discount_%', 'mean')
).round(2).sort_values('total_revenue', ascending=False)

# Add revenue per order
customer_analysis['revenue_per_order'] = (
    customer_analysis['total_revenue'] / customer_analysis['num_orders']
).round(2)

print(customer_analysis)
print()

# Question 4: Monthly trends
print("4. MONTHLY TRENDS")
print("-" * 80)
monthly_trends = sales_df.groupby('month').agg(
    total_revenue=('total_amount', 'sum'),
    num_orders=('order_id', 'count'),
    avg_order=('total_amount', 'mean'),
    total_units=('quantity', 'sum')
).round(2)

# Sort by month order
month_order = ['January', 'February', 'March']
monthly_trends = monthly_trends.reindex([m for m in month_order if m in monthly_trends.index])
print(monthly_trends)
print()

# Question 5: Product × Region combinations
print("5. TOP PRODUCT-REGION COMBINATIONS")
print("-" * 80)
product_region = sales_df.groupby(['product', 'region']).agg(
    total_revenue=('total_amount', 'sum'),
    num_orders=('order_id', 'count'),
    avg_order=('total_amount', 'mean')
).round(2).sort_values('total_revenue', ascending=False).head(15)
print(product_region)
print()

# Question 6: Discount impact
print("6. DISCOUNT IMPACT ANALYSIS")
print("-" * 80)
# Create discount bins
sales_df['discount_category'] = pd.cut(
    sales_df['discount_%'],
    bins=[-1, 0, 10, 20, 100],
    labels=['No Discount', 'Low (1-10%)', 'Medium (11-20%)', 'High (>20%)']
)

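# observed=False keeps empty bins such as 'High (>20%)' in the result and
# makes the behaviour explicit for categorical keys across pandas versions
discount_impact = sales_df.groupby('discount_category', observed=False).agg(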
    num_orders=('order_id', 'count'),
    total_revenue=('total_amount', 'sum'),
    avg_order=('total_amount', 'mean'),
    avg_units=('quantity', 'mean'),
    avg_discount=('discount_%', 'mean')
).round(2)
print(discount_impact)
print()

# Summary metrics
print("="*80)
print("EXECUTIVE SUMMARY")
print("="*80)
print(f"Total Revenue: ${sales_df['total_amount'].sum():,.2f}")
print(f"Total Orders: {len(sales_df):,}")
print(f"Average Order Value: ${sales_df['total_amount'].mean():,.2f}")
print(f"Total Units Sold: {sales_df['quantity'].sum():,}")
# idxmax()/max() keep the summary correct even if the tables above are not sorted by revenue
print(f"\nTop Product: {product_performance['total_revenue'].idxmax()} (${product_performance['total_revenue'].max():,.2f})")
print(f"Top Region: {regional_performance['total_revenue'].idxmax()} (${regional_performance['total_revenue'].max():,.2f})")
print(f"Best Customer Type: {customer_analysis['total_revenue'].idxmax()} (${customer_analysis['total_revenue'].max():,.2f})")
print(f"\nAverage Discount: {sales_df['discount_%'].mean():.2f}%")
print(f"Orders with Discount: {(sales_df['discount_%'] > 0).sum()} ({(sales_df['discount_%'] > 0).sum()/len(sales_df)*100:.1f}%)")
print("="*80)

## 10. Advanced GroupBy Techniques

### Multiple Column Selection

```python
# Select multiple columns for aggregation
df.groupby('category')[['sales', 'profit']].sum()
```

### Grouping by Calculated Columns

```python
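# Group by binned values (numeric_only=True skips text and date columns)
df.groupby(pd.cut(df['age'], bins=[0, 18, 35, 60, 100]), observed=False).mean(numeric_only=True)

# Group by date components (datetime columns cannot be summed, so restrict to numeric ones)
df.groupby(df['date'].dt.year).sum(numeric_only=True)
df.groupby(df['date'].dt.to_period('M')).sum(numeric_only=True)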
```

### Grouping with Custom Index

```python
# Group by index level
df.groupby(level=0).sum()  # For MultiIndex
```
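
As a minimal sketch of grouping by index level, the toy frame below (its `region`/`product` index and `sales` values are made up for illustration, not taken from `sales_df`) shows that a level can be addressed by position or by name:

```python
import pandas as pd

# Toy data with a two-level index; names and values are illustrative only
idx = pd.MultiIndex.from_tuples(
    [('North', 'Laptop'), ('North', 'Phone'), ('South', 'Laptop'), ('South', 'Phone')],
    names=['region', 'product'],
)
toy = pd.DataFrame({'sales': [1000, 600, 800, 500]}, index=idx)

by_region = toy.groupby(level=0).sum()            # group on the first index level
by_product = toy.groupby(level='product').sum()   # or refer to a level by name

print(by_region)
print(by_product)
```

Referring to levels by name tends to be easier to read once an index has more than two levels.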

### Handling MultiIndex Results

```python
# Flatten MultiIndex columns
result = df.groupby('cat').agg({'sales': ['sum', 'mean']})
result.columns = ['_'.join(col) for col in result.columns]
```

### Combining GroupBy with Other Operations

```python
# GroupBy + Sort + Head
df.groupby('category').apply(lambda x: x.nlargest(3, 'sales'))

# GroupBy + Pivot
df.groupby(['product', 'region'])['sales'].sum().unstack()
```

In [None]:
print("=== ADVANCED GROUPBY TECHNIQUES ===\n")

# Example 1: Group by date components
print("Example 1: Group by year-month period")
monthly_sales = sales_df.groupby(
    sales_df['date'].dt.to_period('M')
).agg(
    revenue=('total_amount', 'sum'),
    orders=('order_id', 'count')
).round(2)
print(monthly_sales)
print()

# Example 2: Group by binned values
print("Example 2: Group by order value bins")
value_bins = pd.cut(
    sales_df['total_amount'],
    bins=[0, 1000, 3000, 5000, np.inf],  # open-ended top bin so orders above $10k are not dropped
    labels=['Small (<$1k)', 'Medium ($1-3k)', 'Large ($3-5k)', 'XLarge (>$5k)']
)
bin_analysis = sales_df.groupby(value_bins, observed=False).agg(
    num_orders=('order_id', 'count'),
    avg_amount=('total_amount', 'mean'),
    total_revenue=('total_amount', 'sum')
).round(2)
print(bin_analysis)
print()

# Example 3: Multiple column selection
print("Example 3: Aggregate multiple columns simultaneously")
multi_col = sales_df.groupby('product')[['total_amount', 'quantity', 'discount_%']].agg(['mean', 'sum']).round(2)
print(multi_col.head())
print()

# Example 4: Custom grouping function
print("Example 4: Group by custom function (even/odd order IDs)")
def even_odd(order_id):
    return 'Even' if order_id % 2 == 0 else 'Odd'

even_odd_analysis = sales_df.groupby(sales_df['order_id'].apply(even_odd)).agg(
    num_orders=('order_id', 'count'),
    total_revenue=('total_amount', 'sum')
).round(2)
print(even_odd_analysis)
print()

# Example 5: GroupBy with unstack (pivot-like)
print("Example 5: Product × Region matrix using unstack")
product_region_matrix = sales_df.groupby(['product', 'region'])['total_amount'].sum().unstack(fill_value=0)
print(product_region_matrix)
print()

# Example 6: Cumulative sum within groups
print("Example 6: Cumulative revenue by product (sorted by date)")
sales_sorted = sales_df.sort_values('date')
sales_sorted['cumulative_revenue'] = sales_sorted.groupby('product')['total_amount'].cumsum()
print(sales_sorted[['product', 'date', 'total_amount', 'cumulative_revenue']].head(15))
print()

# Example 7: Rolling mean within groups
print("Example 7: 3-order moving average per product")
sales_sorted['rolling_avg'] = sales_sorted.groupby('product')['total_amount'].transform(
    lambda x: x.rolling(window=3, min_periods=1).mean()
).round(2)
print(sales_sorted[['product', 'total_amount', 'rolling_avg']].head(20))

## 11. Best Practices & Performance Tips

### Best Practices ✅

**1. Use Named Aggregations**
```python
# ✅ Clear and readable
df.groupby('cat').agg(
    total_sales=('sales', 'sum'),
    avg_price=('price', 'mean')
)

# ❌ A list of functions per column produces MultiIndex columns, harder to work with
df.groupby('cat').agg({'sales': ['sum', 'mean'], 'price': ['mean']})
```

**2. Use Built-in Functions When Possible**
```python
# ✅ Fast - built-in aggregation
df.groupby('cat')['sales'].sum()

# ❌ Slower - apply with lambda
df.groupby('cat')['sales'].apply(lambda x: x.sum())
```

**3. Reset Index for Cleaner Results**
```python
# ✅ Clean DataFrame
result = df.groupby('cat')['sales'].sum().reset_index()

# ❌ Grouped column becomes index
result = df.groupby('cat')['sales'].sum()
```

**4. Use Transform for Same-Shape Operations**
```python
# ✅ Add group mean to each row
df['group_mean'] = df.groupby('cat')['sales'].transform('mean')

# ❌ Aggregation reduces rows
df.groupby('cat')['sales'].mean()
```

**5. Filter Groups, Not Rows**
```python
# ✅ Keeps every row of each product whose total sales exceed 1000
df.groupby('product').filter(lambda x: x['sales'].sum() > 1000)

# ❌ Answers a different question: keeps only individual rows with sales > 1000
df[df['sales'] > 1000]
```
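
To make the difference concrete, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). Laptop totals 1500, so both of its rows survive the group filter, while the row filter keeps only the single row above 1000:

```python
import pandas as pd

toy = pd.DataFrame({
    'product': ['Laptop', 'Laptop', 'Phone', 'Phone'],
    'sales': [1200, 300, 400, 500],
})

# Group filter: keep all rows of any product whose total sales exceed 1000
group_filtered = toy.groupby('product').filter(lambda g: g['sales'].sum() > 1000)

# Row filter: keep only the individual rows above 1000
row_filtered = toy[toy['sales'] > 1000]

print(group_filtered)
print(row_filtered)
```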

### Performance Tips 🚀

**1. Avoid Unnecessary apply()**
```python
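# Fast: the built-in aggregation runs in optimized, vectorized code
df.groupby('cat')['sales'].sum()

# Slow: calls a Python function once per group; the gap widens as the number of groups grows
df.groupby('cat')['sales'].apply(lambda x: x.sum())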
```
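
Actual timings depend on the data size, the number of groups, and the pandas version, so it is worth measuring rather than assuming. A rough sketch with `timeit` on synthetic data (the sizes below are arbitrary assumptions):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 100,000 rows spread over about 1,000 groups (arbitrary sizes)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'cat': rng.integers(0, 1000, size=100_000),
    'sales': rng.random(100_000),
})

builtin = toy.groupby('cat')['sales'].sum()
via_apply = toy.groupby('cat')['sales'].apply(lambda s: s.sum())

# Time five repetitions of each approach
t_builtin = timeit.timeit(lambda: toy.groupby('cat')['sales'].sum(), number=5)
t_apply = timeit.timeit(
    lambda: toy.groupby('cat')['sales'].apply(lambda s: s.sum()), number=5
)
print(f"built-in sum: {t_builtin:.3f}s   apply(lambda): {t_apply:.3f}s")
```

Both approaches return the same numbers; only the cost of getting them differs.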

**2. Use Categorical Data Types**
```python
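# Grouping on a categorical column is often faster, especially for repetitive string keys
df['category'] = df['category'].astype('category')
# observed=True reports only the categories that actually appear in the data
df.groupby('category', observed=True)['sales'].sum()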
```
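
One thing to watch with categorical keys is the `observed` parameter, whose default has changed across pandas versions. The sketch below uses a made-up frame with an unused category `'C'` to show both settings; passing the parameter explicitly avoids surprises:

```python
import pandas as pd

# 'C' is a declared category that never appears in the data
toy = pd.DataFrame({
    'category': pd.Categorical(['A', 'A', 'B'], categories=['A', 'B', 'C']),
    'sales': [10, 20, 30],
})

# Only categories present in the data
only_observed = toy.groupby('category', observed=True)['sales'].sum()

# Every declared category, with a sum of 0 for the empty one
all_categories = toy.groupby('category', observed=False)['sales'].sum()

print(only_observed)
print(all_categories)
```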

**3. Skip Key Sorting When Order Does Not Matter**
```python
# sort=False skips sorting the group keys, which can help on large datasets
df.groupby('category', sort=False)['sales'].sum()
```

**4. Use as_index=False to Avoid Reset**
```python
# ✅ One step
df.groupby('cat', as_index=False)['sales'].sum()

# Equivalent result, but two steps
df.groupby('cat')['sales'].sum().reset_index()
```

### Common Pitfalls ❌

**1. Forgetting to Aggregate**
```python
# ❌ Returns GroupBy object, not results
grouped = df.groupby('category')

# ✅ Apply aggregation
result = df.groupby('category')['sales'].sum()
```

**2. Confusing Transform and Aggregate**
```python
# Aggregate: Reduces rows
df.groupby('cat')['sales'].sum()  # Returns 1 value per group

# Transform: Same number of rows
df.groupby('cat')['sales'].transform('sum')  # Returns value for each row
```
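
A quick way to tell the two apart is to compare result lengths. In this sketch (toy data, assumed for illustration) the aggregate has one entry per group, while the transform has one entry per original row:

```python
import pandas as pd

toy = pd.DataFrame({'cat': ['a', 'a', 'b'], 'sales': [10, 20, 5]})

agg_result = toy.groupby('cat')['sales'].sum()                    # indexed by group key
transform_result = toy.groupby('cat')['sales'].transform('sum')   # indexed like toy

print(len(agg_result), len(transform_result))
print(transform_result.tolist())
```

Because the transform result shares the original index, it can be assigned straight back as a new column.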

**3. Not Handling MultiIndex**
```python
# ❌ MultiIndex can be confusing
result = df.groupby(['col1', 'col2'])['val'].sum()

# ✅ Flatten with reset_index()
result = df.groupby(['col1', 'col2'])['val'].sum().reset_index()
```
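
Flattening is not the only option. If the hierarchical result is kept, it can be queried directly; the sketch below (toy data, assumed for illustration) shows tuple lookup with `.loc`, a cross-section with `.xs`, and `unstack` for a wide table:

```python
import pandas as pd

toy = pd.DataFrame({
    'col1': ['x', 'x', 'y', 'y'],
    'col2': ['p', 'q', 'p', 'q'],
    'val': [1, 2, 3, 4],
})
result = toy.groupby(['col1', 'col2'])['val'].sum()

one_cell = result.loc[('x', 'q')]        # a single (col1, col2) combination
all_p = result.xs('p', level='col2')     # every col1 value where col2 == 'p'
wide = result.unstack('col2')            # col1 as rows, col2 as columns

print(one_cell)
print(all_p)
print(wide)
```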

## 12. Practice Exercises

### Beginner Level (1-5)

1. **Calculate total revenue by product**
   - Use `groupby('product')['total_amount'].sum()`

2. **Count number of orders per region**
   - Use `groupby('region').size()`

3. **Find average order value by customer type**
   - Use `groupby('customer_type')['total_amount'].mean()`

4. **Get total units sold per product**
   - Use `groupby('product')['quantity'].sum()`

5. **Find maximum order amount in each region**
   - Use `groupby('region')['total_amount'].max()`

### Intermediate Level (6-10)

6. **Product statistics: total, average, count**
   - Use `groupby('product')['total_amount'].agg(['sum', 'mean', 'count'])`

7. **Revenue by product and region**
   - Use `groupby(['product', 'region'])['total_amount'].sum()`

8. **Add group mean to each row**
   - Use `groupby('product')['total_amount'].transform('mean')`

9. **Keep only products with > 30 orders**
   - Use `groupby('product').filter(lambda x: len(x) > 30)`

10. **Calculate market share % for each product**
    - Sum by product, divide by total, multiply by 100

### Advanced Level (11-15)

11. **Top 3 orders for each product**
    - Use `groupby('product').apply(lambda x: x.nlargest(3, 'total_amount'))`

12. **Calculate weighted average price by quantity**
    - Use custom function with apply

13. **Monthly revenue with month-over-month growth**
    - Group by month, calculate percentage change

14. **Rank orders within each product**
    - Use `groupby('product')['total_amount'].rank()`

15. **Find products with high revenue variability (std > 1000)**
    - Use `groupby('product').filter(lambda x: x['total_amount'].std() > 1000)`
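
As a hint for exercise 12, here is one possible shape for a quantity-weighted average, sketched on toy data. The `unit_price` column is hypothetical; substitute whichever price column the sales dataset actually has:

```python
import numpy as np
import pandas as pd

# Toy data; 'unit_price' is a hypothetical column name used only for this sketch
toy = pd.DataFrame({
    'product': ['Laptop', 'Laptop', 'Phone', 'Phone'],
    'unit_price': [1000.0, 800.0, 500.0, 600.0],
    'quantity': [1, 3, 2, 2],
})

def weighted_avg_price(group):
    # Average price where each row counts in proportion to its quantity
    return np.average(group['unit_price'], weights=group['quantity'])

# Selecting the needed columns first keeps the grouping column out of the function's input
weighted = toy.groupby('product')[['unit_price', 'quantity']].apply(weighted_avg_price)
print(weighted)
```

For Laptop this gives (1000×1 + 800×3) / 4 = 850, compared with an unweighted mean of 900.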

### Challenge Problems (16-20)

16. **Create RFM analysis (Recency, Frequency, Monetary)**
    - Group by customer, calculate days since last order, count, total spend

17. **Identify "star" products: high revenue + high order count + low discount**
    - Multiple aggregations with conditions

18. **Calculate cohort analysis by first purchase month**
    - Complex grouping with date transformations

19. **Find cross-sell opportunities (products often bought together)**
    - Would require order-level grouping (beyond current dataset)

20. **Create a custom performance score combining multiple metrics**
    - Custom function with weighted combination of revenue, orders, consistency

In [None]:
print("=== PRACTICE EXERCISE SOLUTIONS ===\n")
print("Try solving the exercises first, then check solutions!\n")

# Solution 1
print("Solution 1: Total revenue by product")
revenue_by_product = sales_df.groupby('product')['total_amount'].sum().sort_values(ascending=False)
print(revenue_by_product)
print()

# Solution 6
print("Solution 6: Product statistics")
product_stats = sales_df.groupby('product')['total_amount'].agg(['sum', 'mean', 'count'])
product_stats.columns = ['Total', 'Average', 'Count']
print(product_stats.sort_values('Total', ascending=False))
print()

# Solution 10
print("Solution 10: Market share % for each product")
total_revenue = sales_df['total_amount'].sum()
market_share = sales_df.groupby('product')['total_amount'].sum()
market_share_pct = (market_share / total_revenue * 100).sort_values(ascending=False).round(2)
print(market_share_pct)
print()

# Solution 11
print("Solution 11: Top 3 orders for each product")
def top3(group):
    return group.nlargest(3, 'total_amount')

# Select the columns before apply(); the default group_keys=True keeps the product label in the index
top3_per_product = sales_df.groupby('product')[['order_id', 'total_amount']].apply(top3)
print(top3_per_product.head(15))
print()

# Solution 14
print("Solution 14: Rank orders within each product")
sales_df['product_rank'] = sales_df.groupby('product')['total_amount'].rank(
    ascending=False, method='dense'
)
print(sales_df[['product', 'total_amount', 'product_rank']].sort_values(['product', 'product_rank']).head(20))
print()

# Solution 20
print("Solution 20: Custom performance score")
def performance_score(group):
    revenue = group['total_amount'].sum()
    order_count = len(group)
    consistency = 1 / (1 + group['total_amount'].std())  # Reward consistency
    avg_discount = group['discount_%'].mean()
    
    # Weighted score
    score = (revenue * 0.5 + 
             order_count * 100 * 0.3 + 
             consistency * 1000 * 0.1 - 
             avg_discount * 50 * 0.1)
    return score

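# Pass only the columns the function needs, so the grouping column is not part of each group
performance_scores = (
    sales_df.groupby('product')[['total_amount', 'discount_%']]
    .apply(performance_score)
    .sort_values(ascending=False)
)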
print(performance_scores.round(2))
print("\nInterpretation: Higher score = better overall performance")

print("\n" + "="*80)
print("Try solving the remaining exercises on your own!")
print("="*80)

## Quick Reference Card

### Basic GroupBy Syntax

```python
# Simple groupby + aggregation
df.groupby('column')['value'].sum()

# Multiple columns
df.groupby(['col1', 'col2'])['value'].sum()

# Multiple aggregations
df.groupby('col')['value'].agg(['sum', 'mean', 'count'])
```

### Named Aggregations (Recommended)

```python
df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_price=('price', 'mean'),
    num_orders=('order_id', 'count')
)
```

### Common Aggregation Functions

```python
sum()      # Total
mean()     # Average
median()   # Middle value
min()      # Minimum
max()      # Maximum
count()    # Count non-null
size()     # Count all (including null)
std()      # Standard deviation
var()      # Variance
nunique()  # Count unique
first()    # First value
last()     # Last value
```
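
The difference between `count()` and `size()` only shows up when a group contains missing values. A small sketch on made-up data with one `NaN`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'cat': ['a', 'a', 'b'], 'value': [1.0, np.nan, 3.0]})
grouped = toy.groupby('cat')['value']

non_null = grouped.count()   # skips the NaN in group 'a'
all_rows = grouped.size()    # counts every row in each group

print(non_null)
print(all_rows)
```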

### Transform vs Aggregate

```python
# Aggregate: Reduces rows
df.groupby('cat')['sales'].sum()  # One row per group

# Transform: Same number of rows
df['group_sum'] = df.groupby('cat')['sales'].transform('sum')
```

### Filter Groups

```python
# Keep groups where sum > 1000
df.groupby('product').filter(lambda x: x['sales'].sum() > 1000)
```

### Apply Custom Functions

```python
# Custom aggregation
def custom_func(group):
    return group['value'].max() - group['value'].min()

df.groupby('category').apply(custom_func)
```

### Iterate Over Groups

```python
for name, group in df.groupby('category'):
    print(f"Processing {name}")
    print(group.head())
```

### Reset Index

```python
# Reset index to columns
result = df.groupby('cat')['sales'].sum().reset_index()

# Or use as_index=False
result = df.groupby('cat', as_index=False)['sales'].sum()
```

### Common Patterns

```python
# Top N per group
df.groupby('category').apply(lambda x: x.nlargest(3, 'sales'))

# Percentage of group total
df['pct'] = df['value'] / df.groupby('cat')['value'].transform('sum') * 100

# Rank within groups
df['rank'] = df.groupby('cat')['value'].rank(ascending=False)

# Cumulative sum within groups
df['cumsum'] = df.groupby('cat')['value'].cumsum()

# Fill missing with group mean
df['value'] = df.groupby('cat')['value'].transform(lambda x: x.fillna(x.mean()))
```
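
Two of these patterns chain together naturally. The sketch below, on toy data assumed for illustration, first fills a missing value with its group mean and then computes each row's share of its group total:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'cat': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, np.nan, 4.0],
})

# Fill the NaN in group 'b' with that group's mean (4.0)
toy['filled'] = toy.groupby('cat')['value'].transform(lambda x: x.fillna(x.mean()))

# Share of each row within its group, in percent
toy['pct'] = toy['filled'] / toy.groupby('cat')['filled'].transform('sum') * 100

print(toy)
```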

## Summary

### Key Concepts Mastered ✅

**1. Split-Apply-Combine Pattern**
- Split data into groups
- Apply function to each group
- Combine results

**2. Aggregation Operations**
- Single aggregations: `sum()`, `mean()`, `count()`
- Multiple aggregations: `agg(['sum', 'mean'])`
- Named aggregations: Clean, readable results
- Different aggregations per column

**3. Transform Operations**
- Add group statistics to each row
- Maintain original DataFrame shape
- Calculate percentages, z-scores
- Fill missing values with group stats
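
As one example of a transform-based calculation, a within-group z-score can be sketched as follows (toy data assumed for illustration; `std` here is the pandas default sample standard deviation):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'cat': ['a', 'a', 'a', 'b', 'b'],
    'sales': [10.0, 20.0, 30.0, 5.0, 15.0],
})

grouped = toy.groupby('cat')['sales']

# Subtract the group mean and divide by the group standard deviation, row by row
toy['z'] = (toy['sales'] - grouped.transform('mean')) / grouped.transform('std')

print(toy)
```

Each group's z-scores average to zero, which makes rows comparable across groups with different scales.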

**4. Filter Operations**
- Keep/remove entire groups
- Based on group properties
- Different from row filtering

**5. Apply Custom Functions**
- Lambda functions for simple operations
- Named functions for complex logic
- Return Series or DataFrame

**6. Multiple GroupBy Columns**
- Create groups for unique combinations
- Results in MultiIndex
- Use `reset_index()` for flat structure

---

### Method Selection Guide

| Task | Method | Example |
|------|--------|----------|
| Get group totals | `agg('sum')` | Revenue per product |
| Add group mean to rows | `transform('mean')` | Compare to group avg |
| Keep high-performing groups | `filter()` | Products with >1000 orders |
| Custom calculations | `apply()` | Weighted averages |
| Multiple metrics | `agg(name=(col, func), ...)` | Total + average + count |
| Top N per group | `apply(nlargest)` | Top 3 orders per product |

---

### Common Patterns

**Pattern 1: Basic Analysis**
```python
df.groupby('category')['sales'].sum().sort_values(ascending=False)
```

**Pattern 2: Multi-Metric Dashboard**
```python
df.groupby('product').agg(
    total=('sales', 'sum'),
    average=('sales', 'mean'),
    count=('order_id', 'count')
)
```

**Pattern 3: Percentage Contribution**
```python
total = df['sales'].sum()
df.groupby('product')['sales'].sum() / total * 100
```

**Pattern 4: Rank Within Groups**
```python
# method='first' breaks ties by row order, so each category yields exactly 3 rows
df['rank'] = df.groupby('category')['sales'].rank(ascending=False, method='first')
df[df['rank'] <= 3]  # Top 3 per category
```

---

### Next Steps

After mastering GroupBy:
1. **Pivot Tables** - Reshape grouped data
2. **Time Series** - Date-based grouping and resampling
3. **Window Functions** - Rolling calculations within groups
4. **Multi-Index** - Advanced hierarchical indexing
5. **Performance Optimization** - Speed up large GroupBy operations

---

### Remember

- 🎯 **Use named aggregations** for clarity
- ⚡ **Built-in functions** are faster than `apply()`
- 📊 **Transform** preserves shape, **aggregate** reduces
- 🔍 **Filter** operates on groups, not rows
- 🔄 **Reset index** for flat, easy-to-use results

---

**Happy Grouping! 🐼📊**