# GroupBy and Aggregation

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the split-apply-combine paradigm
2. Use `groupby()` to group data by one or more columns
3. Apply aggregation functions with `agg()`
4. Use `transform()` for group-level calculations
5. Create pivot tables for data summarization
6. Filter entire groups based on group-level conditions

---

## Setup

In [None]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)

In [None]:
# Create sample sales data
np.random.seed(42)

sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=20, freq='D').tolist() * 2,
    'region': ['North', 'South', 'East', 'West'] * 10,
    'product': ['Widget', 'Gadget'] * 20,
    'salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana'], 40),
    'quantity': np.random.randint(1, 50, 40),
    'unit_price': np.random.choice([10.0, 15.0, 20.0, 25.0], 40),
    'discount': np.random.choice([0.0, 0.05, 0.10, 0.15], 40)
})

# Calculate total sale
sales_data['total'] = sales_data['quantity'] * sales_data['unit_price'] * (1 - sales_data['discount'])

print("Sales Data (first 10 rows):")
print(sales_data.head(10))
print(f"\nShape: {sales_data.shape}")

---

## 1. The Split-Apply-Combine Paradigm

GroupBy operations follow a three-step process:

1. **Split**: Divide the data into groups based on some criteria
2. **Apply**: Apply a function to each group independently
3. **Combine**: Combine the results back into a data structure
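
The three steps can be written out by hand to see what `groupby()` automates. This is a minimal sketch on a small hypothetical DataFrame (`toy` is made up for illustration and is not the notebook's `sales_data`); in practice you would use the one-line `groupby()` form.

```python
import pandas as pd

# Hypothetical toy data for illustration (not the notebook's sales_data)
toy = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South', 'East'],
    'total': [100.0, 50.0, 25.0, 75.0, 10.0],
})

# Split: select the rows of each region with a boolean mask
pieces = {
    region: toy[toy['region'] == region]
    for region in sorted(toy['region'].unique())
}

# Apply: compute the sum for each piece independently
sums = {region: piece['total'].sum() for region, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='total').rename_axis('region')

# groupby() performs all three steps in one expression
builtin = toy.groupby('region')['total'].sum()

print(manual)
print(builtin)
```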

In [None]:
# Create a GroupBy object
grouped = sales_data.groupby('region')
print(f"Type: {type(grouped)}")
print(f"Number of groups: {grouped.ngroups}")
print(f"Groups: {list(grouped.groups)}")

In [None]:
# View group sizes
print("Group sizes:")
print(grouped.size())

In [None]:
# Access a specific group
print("North region data:")
print(grouped.get_group('North'))

---

## 2. Basic Aggregations

### 2.1 Single Aggregation Function

In [None]:
# Sum by region
print("Total sales by region:")
print(sales_data.groupby('region')['total'].sum())

In [None]:
# Mean by region
print("Average sale by region:")
print(sales_data.groupby('region')['total'].mean().round(2))

In [None]:
# Count by region (count() skips NaN values; size() counts every row)
print("Number of sales by region:")
print(sales_data.groupby('region')['total'].count())
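
The sample data has no missing values, so `count()` and `size()` agree there. The difference only shows up with NaNs, as in this small sketch on a hypothetical `toy` DataFrame (made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the North group
toy = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'total': [100.0, np.nan, 50.0, 75.0],
})

non_null = toy.groupby('region')['total'].count()  # skips the NaN in North
all_rows = toy.groupby('region').size()            # counts every row

print(non_null)
print(all_rows)
```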

In [None]:
# Common aggregation functions
print("Various aggregations for 'total' by region:")
region_group = sales_data.groupby('region')['total']
print(f"Sum: \n{region_group.sum()}")
print(f"\nMean: \n{region_group.mean().round(2)}")
print(f"\nMin: \n{region_group.min()}")
print(f"\nMax: \n{region_group.max()}")
print(f"\nStd: \n{region_group.std().round(2)}")

### 2.2 Grouping by Multiple Columns

In [None]:
# Group by region and product
grouped = sales_data.groupby(['region', 'product'])['total'].sum()
print("Total sales by region and product:")
print(grouped)

In [None]:
# Convert to DataFrame with reset_index
grouped_df = sales_data.groupby(['region', 'product'])['total'].sum().reset_index()
print("As DataFrame:")
print(grouped_df)

In [None]:
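# Unstack to create a matrix view
# Region/product pairs that never occur show as NaN; in this sample data
# each region sells only one product, so half the cells are NaN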
matrix = sales_data.groupby(['region', 'product'])['total'].sum().unstack()
print("Matrix view:")
print(matrix)

---

## 3. The `agg()` Method

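The `agg()` method accepts a single function, a list of functions, a dictionary mapping columns to functions, or keyword-style named aggregations, so one call can compute several statistics at once.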

### 3.1 Multiple Aggregations on One Column

In [None]:
# Multiple aggregations on total
result = sales_data.groupby('region')['total'].agg(['sum', 'mean', 'min', 'max', 'count'])
print("Multiple aggregations on 'total':")
print(result.round(2))

### 3.2 Different Aggregations for Different Columns

In [None]:
# Different aggregations per column using a dictionary
result = sales_data.groupby('region').agg({
    'quantity': 'sum',
    'total': ['sum', 'mean'],
    'discount': 'mean'
})
print("Different aggregations per column:")
print(result.round(2))

In [None]:
# Flatten the two-level column index into names like 'total_sum'
result.columns = ['_'.join(col) for col in result.columns]
print("With flattened column names:")
print(result.round(2))

### 3.3 Named Aggregations

In [None]:
# Named aggregations for cleaner output
result = sales_data.groupby('region').agg(
    total_quantity=('quantity', 'sum'),
    total_sales=('total', 'sum'),
    avg_sale=('total', 'mean'),
    num_transactions=('total', 'count'),
    avg_discount=('discount', 'mean')
)
print("Named aggregations:")
print(result.round(2))

### 3.4 Custom Aggregation Functions

In [None]:
# Custom aggregation function
def range_func(x):
    return x.max() - x.min()

def coefficient_of_variation(x):
    return x.std() / x.mean() * 100

result = sales_data.groupby('region')['total'].agg(
    ['mean', 'std', range_func, coefficient_of_variation]
)
print("Custom aggregation functions:")
print(result.round(2))

In [None]:
# Lambda functions in aggregation
result = sales_data.groupby('region').agg({
    'total': [('total_sum', 'sum'), ('above_100', lambda x: (x > 100).sum())],
    'quantity': [('avg_qty', 'mean')]
})
print("Lambda aggregations:")
print(result.round(2))
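
Lambdas also work with the named-aggregation syntax from section 3.3, which gives flat column names directly. A minimal sketch on a hypothetical `toy` DataFrame (the data and the output column names are chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'total': [150.0, 50.0, 200.0, 120.0],
    'quantity': [3, 1, 4, 2],
})

# Each keyword becomes an output column: name=(input column, function)
flat = toy.groupby('region').agg(
    total_sum=('total', 'sum'),
    above_100=('total', lambda x: (x > 100).sum()),  # number of sales over 100
    avg_qty=('quantity', 'mean'),
)

print(flat)
```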

---

## 4. The `transform()` Method

`transform()` returns a result with the same index as the input: it computes a value per group and broadcasts it back to every row of that group, so the result can be assigned directly as a new column.
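
The contrast with `agg()` is easiest to see side by side. The sketch below uses a hypothetical `toy` DataFrame (made up for illustration): the aggregation returns one value per group, while the transform returns one value per input row.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'total': [100.0, 50.0, 300.0, 150.0],
})

reduced = toy.groupby('region')['total'].agg('mean')          # one row per group
broadcast = toy.groupby('region')['total'].transform('mean')  # one row per input row

print(reduced)
print(broadcast)
print(len(reduced), len(broadcast))
```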

In [None]:
# Add group mean as a new column
sales_data['region_avg'] = sales_data.groupby('region')['total'].transform('mean')
print("With region average:")
print(sales_data[['region', 'total', 'region_avg']].head(10).round(2))

In [None]:
# Calculate difference from group mean
sales_data['diff_from_avg'] = sales_data['total'] - sales_data['region_avg']
print("Difference from region average:")
print(sales_data[['region', 'total', 'region_avg', 'diff_from_avg']].head(10).round(2))

In [None]:
# Standardize within groups (z-score)
def standardize(x):
    return (x - x.mean()) / x.std()

sales_data['total_zscore'] = sales_data.groupby('region')['total'].transform(standardize)
print("Z-scores within region:")
print(sales_data[['region', 'total', 'total_zscore']].head(10).round(2))

In [None]:
# Percentage of group total
sales_data['pct_of_region'] = (sales_data['total'] / 
                               sales_data.groupby('region')['total'].transform('sum') * 100)
print("Percentage of region total:")
print(sales_data[['region', 'total', 'pct_of_region']].head(10).round(2))

In [None]:
# Rank within group (1 = highest total; ties share their average rank by default)
sales_data['rank_in_region'] = sales_data.groupby('region')['total'].rank(ascending=False)
print("Rank within region:")
print(sales_data[['region', 'total', 'rank_in_region']].sort_values(
    ['region', 'rank_in_region']).head(12))

---

## 5. Pivot Tables

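`pd.pivot_table()` summarizes data by aggregating `values` for every combination of an `index` (row) key and a `columns` key, much like a `groupby()` followed by `unstack()`.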
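
To connect pivot tables with the `groupby()` tools from section 2.2, here is a minimal sketch on a hypothetical `toy` DataFrame (made up for illustration) that builds the same region-by-product table both ways.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget'],
    'total': [10.0, 20.0, 30.0, 40.0, 5.0],
})

# Pivot table: regions as rows, products as columns, summed totals as cells
via_pivot = pd.pivot_table(
    toy, values='total', index='region', columns='product', aggfunc='sum'
)

# Same table from groupby: sum per (region, product), then move product to columns
via_groupby = toy.groupby(['region', 'product'])['total'].sum().unstack()

print(via_pivot)
print(via_groupby)
```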

In [None]:
# Basic pivot table
pivot = pd.pivot_table(
    sales_data,
    values='total',
    index='region',
    columns='product',
    aggfunc='sum'
)
print("Basic pivot table:")
print(pivot.round(2))

In [None]:
# Pivot table with multiple aggregations
pivot = pd.pivot_table(
    sales_data,
    values='total',
    index='region',
    columns='product',
    aggfunc=['sum', 'mean', 'count']
)
print("Pivot with multiple aggregations:")
print(pivot.round(2))

In [None]:
# Pivot table with margins (totals)
pivot = pd.pivot_table(
    sales_data,
    values='total',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print("Pivot with totals:")
print(pivot.round(2))

In [None]:
# Pivot table with multiple values
pivot = pd.pivot_table(
    sales_data,
    values=['total', 'quantity'],
    index='region',
    columns='product',
    aggfunc='sum'
)
print("Pivot with multiple values:")
print(pivot)

In [None]:
# Pivot table with multiple index levels
pivot = pd.pivot_table(
    sales_data,
    values='total',
    index=['region', 'salesperson'],
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print("Pivot with hierarchical index:")
print(pivot.round(2))

---

## 6. Filtering Groups
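
`filter()` keeps or drops whole groups: the function receives each group's DataFrame and returns a single `True` or `False`. As a rough sketch of what that means, the example below (on a hypothetical `toy` DataFrame, with an arbitrary threshold of 250) compares `filter()` with an equivalent boolean mask built from `transform()`.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East'],
    'total': [100.0, 200.0, 50.0, 25.0, 500.0],
})

# filter() evaluates the condition once per group and keeps the groups that pass
via_filter = toy.groupby('region').filter(lambda g: g['total'].sum() > 250)

# Equivalent mask: broadcast each group's sum to its rows, then compare
mask = toy.groupby('region')['total'].transform('sum') > 250
via_mask = toy[mask]

print(via_filter)
print(via_mask)
```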

In [None]:
# Filter groups using filter()
# Keep only regions with total sales > 2000
filtered = sales_data.groupby('region').filter(lambda x: x['total'].sum() > 2000)
print(f"Original rows: {len(sales_data)}")
print(f"Filtered rows: {len(filtered)}")
print("\nRegions kept:")
print(filtered['region'].unique())

In [None]:
# Filter groups with more than N rows
# Keep only salespersons with more than 8 transactions
filtered = sales_data.groupby('salesperson').filter(lambda x: len(x) > 8)
print("Salespersons with more than 8 transactions:")
print(filtered.groupby('salesperson').size())

In [None]:
# Filter based on group statistics
# Keep regions where average sale is above 100
filtered = sales_data.groupby('region').filter(lambda x: x['total'].mean() > 100)
print("Regions with average sale > 100:")
print(filtered.groupby('region')['total'].mean().round(2))

---

## 7. Iterating Over Groups

In [None]:
# Iterate over groups
for name, group in sales_data.groupby('region'):
    print(f"\nRegion: {name}")
    print(f"  Rows: {len(group)}")
    print(f"  Total Sales: ${group['total'].sum():.2f}")

In [None]:
# Iterate over multiple grouping columns
for (region, product), group in sales_data.groupby(['region', 'product']):
    if len(group) > 5:  # Only show groups with more than 5 rows
        print(f"Region: {region}, Product: {product}, Sales: ${group['total'].sum():.2f}")

---

## Exercises

In [None]:
# Create exercise data - employee performance
np.random.seed(123)

employee_data = pd.DataFrame({
    'employee_id': range(1, 51),
    'name': [f'Employee_{i}' for i in range(1, 51)],
    'department': np.random.choice(['Sales', 'Engineering', 'Marketing', 'HR'], 50),
    'team': np.random.choice(['A', 'B', 'C'], 50),
    'salary': np.random.randint(40000, 120000, 50),
    'years_exp': np.random.randint(1, 20, 50),
    'performance_score': np.random.randint(60, 100, 50),
    'projects_completed': np.random.randint(1, 15, 50)
})

print("Employee Data (first 10 rows):")
print(employee_data.head(10))

### Exercise 1: Basic GroupBy

1. Calculate the average salary by department
2. Find the total projects completed by each team
3. Count the number of employees in each department

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Average salary by department
print("Average salary by department:")
print(employee_data.groupby('department')['salary'].mean().round(2))

# 2. Total projects by team
print("\nTotal projects by team:")
print(employee_data.groupby('team')['projects_completed'].sum())

# 3. Employee count by department
print("\nEmployee count by department:")
print(employee_data.groupby('department').size())
# or: print(employee_data.groupby('department')['employee_id'].count())
```
</details>

### Exercise 2: Multiple Aggregations

Create a summary table by department that includes:
- Total number of employees
- Average salary
- Minimum and maximum performance score
- Total years of experience

Use named aggregations for clean column names.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
summary = employee_data.groupby('department').agg(
    employee_count=('employee_id', 'count'),
    avg_salary=('salary', 'mean'),
    min_performance=('performance_score', 'min'),
    max_performance=('performance_score', 'max'),
    total_experience=('years_exp', 'sum')
)

print("Department Summary:")
print(summary.round(2))
```
</details>

### Exercise 3: Transform

1. Add a column showing the department average salary
2. Add a column showing how each employee's salary compares to their department average (as a percentage)
3. Add a column ranking employees by performance score within their department

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
df = employee_data.copy()

# 1. Department average salary
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')

# 2. Salary comparison (percentage of department average)
df['salary_vs_dept'] = (df['salary'] / df['dept_avg_salary'] * 100).round(1)

# 3. Performance rank within department (1 = highest)
df['dept_performance_rank'] = df.groupby('department')['performance_score'].rank(
    ascending=False, method='dense'
).astype(int)

print("Employee data with transforms:")
print(df[['name', 'department', 'salary', 'dept_avg_salary', 
          'salary_vs_dept', 'performance_score', 'dept_performance_rank']].head(15))
```
</details>

### Exercise 4: Pivot Table

Create a pivot table showing:
- Rows: Department
- Columns: Team
- Values: Average performance score

Include row and column totals.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
pivot = pd.pivot_table(
    employee_data,
    values='performance_score',
    index='department',
    columns='team',
    aggfunc='mean',
    margins=True,
    margins_name='Average'
)

print("Performance Score by Department and Team:")
print(pivot.round(1))
```
</details>

### Exercise 5: Filter and Analyze

1. Filter to keep only departments with average performance score above 75
2. For the remaining departments, find the top performer (highest performance score) in each department
3. Display their name, department, and performance score

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# 1. Filter departments with avg performance > 75
high_performing_depts = employee_data.groupby('department').filter(
    lambda x: x['performance_score'].mean() > 75
)

print("Departments with average performance > 75:")
print(high_performing_depts['department'].unique())

# 2 & 3. Find top performer in each remaining department
# (idxmax() returns the first matching row when scores are tied)
top_performers = high_performing_depts.loc[
    high_performing_depts.groupby('department')['performance_score'].idxmax()
]

print("\nTop performers by department:")
print(top_performers[['name', 'department', 'performance_score']])
```
</details>

---

## Summary

In this notebook, you learned:

1. **Split-Apply-Combine**:
   - `groupby()` creates a GroupBy object
   - Groups data by one or more columns
   - Apply functions, combine results

2. **Aggregations**:
   - Built-in: `sum()`, `mean()`, `count()`, `min()`, `max()`, `std()`
   - `agg()` for multiple aggregations
   - Named aggregations for clean output
   - Custom aggregation functions

3. **Transform**:
   - Returns same-shaped output
   - Broadcasts group results to rows
   - Useful for: group averages, z-scores, rankings, percentages

4. **Pivot Tables**:
   - `pd.pivot_table()` for summarization
   - Multiple values, indexes, columns
   - Margins for totals

5. **Filtering**:
   - `filter()` to keep/remove entire groups
   - Based on group-level conditions

---

## Next Steps

Continue to the next notebook: **[06_merging_and_joining.ipynb](06_merging_and_joining.ipynb)** to learn how to combine DataFrames using concat, merge, and join operations.