# Advanced Pandas Operations - Part 2: Aggregation Functions

## Week 3, Day 2 (Thursday) - April 24th, 2025

### Overview
This session builds on the GroupBy operations covered in Part 1 and dives deeper into the aggregation functions available in Pandas. Aggregation functions summarize many values as one or a few numbers, much like SQL's aggregate functions, with the added ability to apply several functions at once or plug in your own Python functions.

### Learning Objectives
- Master a wide range of built-in aggregation functions in Pandas
- Create and apply custom aggregation functions
- Use multiple aggregation functions simultaneously
- Apply aggregations with and without grouping
- Understand the relationship between Pandas aggregations and SQL aggregations

### Prerequisites
- Python fundamentals (Week 1)
- Pandas Fundamentals I & II (Week 2, Day 2 & Week 3, Day 1)
- GroupBy operations (Week 3, Day 2, Part 1)
- SQL knowledge (prior to course)

## 1. Introduction to Aggregation Functions

Aggregation functions compute summary statistics over a set of values. While we briefly covered some aggregation methods in the GroupBy lesson, here we'll explore them more extensively. Let's start with a dataset to work with:

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample sales dataset (more detailed than the one in Part 1)
np.random.seed(42)  # For reproducibility
n = 100  # Number of records

# Generate dates for Q1 2025
dates = pd.date_range('2025-01-01', '2025-03-31', periods=n)

# Create dictionary of data
data = {
    'transaction_id': [f'T{i:04d}' for i in range(1, n+1)],
    'date': dates,
    'customer_id': np.random.choice([f'C{i:03d}' for i in range(1, 21)], size=n),  # 20 customers
    'product_id': np.random.choice([f'P{i:03d}' for i in range(1, 51)], size=n),   # 50 products
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Sports'], 
                                         size=n, p=[0.3, 0.25, 0.2, 0.15, 0.1]),  # Weighted categories
    'store_id': np.random.choice(['S01', 'S02', 'S03', 'S04'], size=n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=n),
    'quantity': np.random.randint(1, 10, size=n),
    'unit_price': np.random.uniform(10, 500, size=n).round(2),
    'discount_pct': np.random.choice([0, 5, 10, 15, 20, 25], size=n),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'Cash', 'Mobile Payment'], size=n),
    'is_online': np.random.choice([True, False], size=n, p=[0.6, 0.4]),  # 60% online
    'order_status': np.random.choice(['Completed', 'Returned', 'Canceled'], size=n, p=[0.85, 0.1, 0.05]),
    'customer_rating': np.random.choice([1, 2, 3, 4, 5, np.nan], size=n, p=[0.05, 0.1, 0.15, 0.3, 0.3, 0.1])  # np.nan (not None) keeps the column numeric
}

# Create DataFrame
sales_df = pd.DataFrame(data)

# Calculate derived columns
sales_df['discount_amount'] = (sales_df['unit_price'] * sales_df['discount_pct'] / 100).round(2)
sales_df['net_price'] = (sales_df['unit_price'] - sales_df['discount_amount']).round(2)
sales_df['total_amount'] = (sales_df['quantity'] * sales_df['net_price']).round(2)

# Add some additional time-related columns
sales_df['month'] = sales_df['date'].dt.month_name()
sales_df['day_of_week'] = sales_df['date'].dt.day_name()
sales_df['week'] = sales_df['date'].dt.isocalendar().week
sales_df['is_weekend'] = sales_df['day_of_week'].isin(['Saturday', 'Sunday'])

# Display the first few rows
print("Sample Sales DataFrame:")
print(sales_df.head())

# Display summary information
print("\nDataFrame Info:")
print(f"Shape: {sales_df.shape}")
print(f"Columns: {sales_df.columns.tolist()}")
print("\nColumn Data Types:")
print(sales_df.dtypes)

## 2. Basic Aggregation Functions

Let's start with the basic aggregation functions built into Pandas. These are similar to SQL's aggregate functions:

In [None]:
# Single column aggregations
print("Basic aggregations on unit_price:")
print(f"Count: {sales_df['unit_price'].count()}")
print(f"Sum: ${sales_df['unit_price'].sum():.2f}")
print(f"Mean: ${sales_df['unit_price'].mean():.2f}")
print(f"Median: ${sales_df['unit_price'].median():.2f}")
print(f"Standard Deviation: ${sales_df['unit_price'].std():.2f}")
print(f"Minimum: ${sales_df['unit_price'].min():.2f}")
print(f"Maximum: ${sales_df['unit_price'].max():.2f}")

# SQL equivalents:
# SELECT 
#     COUNT(unit_price),
#     SUM(unit_price),
#     AVG(unit_price),
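#     PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY unit_price),  -- median; syntax and support vary by dialect
#     STDDEV_SAMP(unit_price),  -- sample std dev (STDDEV in some dialects), matching pandas' default ddof=1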
#     MIN(unit_price),
#     MAX(unit_price)
# FROM sales
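
The difference between `COUNT(*)` and `COUNT(column)` carries over to Pandas: `len(df)` counts every row, while `.count()` skips missing values. The sketch below uses a small hypothetical DataFrame (`toy` is not part of the lesson's dataset) to show this, along with the default NaN-skipping behavior of `mean()` and `sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one customer rating is missing
toy = pd.DataFrame({
    "unit_price": [10.0, 20.0, 30.0, 40.0],
    "customer_rating": [5.0, np.nan, 3.0, 4.0],
})

row_count = len(toy)                            # like COUNT(*): every row
rating_count = toy["customer_rating"].count()   # like COUNT(customer_rating): non-null only
rating_mean = toy["customer_rating"].mean()     # NaN is skipped by default, as AVG skips NULL
rating_sum_strict = toy["customer_rating"].sum(skipna=False)  # NaN propagates when skipna=False

print(f"Rows: {row_count}, ratings present: {rating_count}")
print(f"Mean rating (NaN skipped): {rating_mean}")
print(f"Sum with skipna=False: {rating_sum_strict}")
```

Passing `skipna=False` is useful when a missing value should invalidate the summary rather than be silently ignored.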

In [None]:
# Multiple column aggregations with .agg()
basic_aggs = sales_df.agg({
    'unit_price': ['count', 'sum', 'mean', 'median', 'std', 'min', 'max'],
    'quantity': ['count', 'sum', 'mean', 'median', 'std', 'min', 'max'],
    'total_amount': ['count', 'sum', 'mean', 'median', 'std', 'min', 'max']
})

print("Multiple column aggregations:")
print(basic_aggs)

In [None]:
# Calculate summary statistics for all numeric columns
summary_stats = sales_df.describe()
print("Summary statistics for numeric columns:")
print(summary_stats)

# Calculate summary statistics for categorical columns
cat_summary = sales_df.describe(include=['object', 'bool'])
print("\nSummary statistics for categorical columns:")
print(cat_summary)

## 3. Advanced Aggregation Functions

Beyond the basics, Pandas provides quantiles, mode, variance, skewness, kurtosis, and cumulative statistics:

In [None]:
# Calculate percentiles
percentiles = sales_df['total_amount'].quantile([0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
print("Percentiles of total_amount:")
print(percentiles)

# Interquartile Range (IQR)
q1 = sales_df['total_amount'].quantile(0.25)
q3 = sales_df['total_amount'].quantile(0.75)
iqr = q3 - q1
print(f"\nInterquartile Range (IQR) for total_amount: ${iqr:.2f}")

# Mode (most frequent value)
mode_product = sales_df['product_category'].mode()[0]
mode_rating = sales_df['customer_rating'].mode()[0]
print(f"\nMost common product category: {mode_product}")
print(f"Most common customer rating: {mode_rating}")

# Variance
price_variance = sales_df['unit_price'].var()
print(f"\nVariance of unit_price: {price_variance:.2f}")

# Skewness and Kurtosis
skew = sales_df['total_amount'].skew()
kurt = sales_df['total_amount'].kurt()
print(f"\nSkewness of total_amount: {skew:.4f}")
print(f"Kurtosis of total_amount: {kurt:.4f}")

# Plot a histogram to visualize distribution
plt.figure(figsize=(10, 6))
plt.hist(sales_df['total_amount'], bins=20, edgecolor='black', alpha=0.7)
plt.axvline(sales_df['total_amount'].mean(), color='red', linestyle='dashed', linewidth=1, label=f'Mean: ${sales_df["total_amount"].mean():.2f}')
plt.axvline(sales_df['total_amount'].median(), color='green', linestyle='dashed', linewidth=1, label=f'Median: ${sales_df["total_amount"].median():.2f}')
plt.title('Distribution of Total Sales Amount')
plt.xlabel('Total Amount ($)')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
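
One detail worth knowing about `var()` and `std()`: Pandas computes the sample versions by default (dividing by n - 1, i.e. `ddof=1`), whereas NumPy's `np.var` and `np.std` default to the population versions (`ddof=0`). A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical values chosen so the population variance is a round number
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()            # divides by n - 1 (ddof=1, the pandas default)
population_var = values.var(ddof=0)  # divides by n
population_std = values.std(ddof=0)  # square root of the population variance

print(f"Sample variance:     {sample_var:.4f}")
print(f"Population variance: {population_var:.4f}")
print(f"Population std dev:  {population_std:.4f}")
```

For a sample of 100 transactions the two versions differ only slightly, but the gap grows as groups get smaller, which matters once you aggregate within GroupBy groups.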

In [None]:
# Calculating cumulative statistics
sales_df_sorted = sales_df.sort_values('date').reset_index(drop=True)

# Cumulative sum of sales over time
sales_df_sorted['cumulative_sales'] = sales_df_sorted['total_amount'].cumsum()

# Cumulative average price
sales_df_sorted['cumulative_avg_price'] = sales_df_sorted['unit_price'].expanding().mean().round(2)

# Display the data
print("Cumulative statistics:")
cumulative_cols = ['date', 'total_amount', 'cumulative_sales', 'unit_price', 'cumulative_avg_price']
print(sales_df_sorted[cumulative_cols].head(10))

# Visualize cumulative sales
plt.figure(figsize=(12, 6))
plt.plot(sales_df_sorted['date'], sales_df_sorted['cumulative_sales'], marker='', linewidth=2)
plt.title('Cumulative Sales Over Time (Q1 2025)')
plt.xlabel('Date')
plt.ylabel('Cumulative Sales ($)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
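
Cumulative statistics can also be computed within groups by calling `cumsum()` on a GroupBy object. The sketch below uses a hypothetical four-row DataFrame (the `region_cumulative_sales` column name is an assumption for illustration) to contrast an overall running total with a per-region one.

```python
import pandas as pd

# Hypothetical toy transactions, deliberately out of date order
toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-03", "2025-01-01", "2025-01-02", "2025-01-04"]),
    "region": ["North", "South", "North", "South"],
    "total_amount": [30.0, 10.0, 20.0, 40.0],
})

# Sort first: a running total only makes sense in chronological order
toy = toy.sort_values("date").reset_index(drop=True)

# Running total across all rows
toy["cumulative_sales"] = toy["total_amount"].cumsum()

# Running total that restarts for each region
toy["region_cumulative_sales"] = toy.groupby("region")["total_amount"].cumsum()

print(toy)
```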

## 4. Aggregating with GroupBy

Now, let's combine what we learned in Part 1 (GroupBy) with more advanced aggregation techniques:

In [None]:
# Group by product category and calculate multiple statistics
category_stats = sales_df.groupby('product_category').agg({
    'transaction_id': 'count',            # Number of transactions
    'quantity': 'sum',                    # Total quantity sold
    'total_amount': ['sum', 'mean', 'median', 'std', 'min', 'max'],  # Transaction amount stats
    'discount_pct': ['mean', 'median'],   # Mean and median discount
    'customer_rating': ['mean', 'count', 'median']  # Rating stats (count excludes missing ratings)
})

# Clean up column names
category_stats.columns = ['_'.join(col).strip() for col in category_stats.columns.values]
category_stats = category_stats.rename(columns={
    'transaction_id_count': 'num_transactions',
    'quantity_sum': 'total_quantity',
    'total_amount_sum': 'total_sales',
    'total_amount_mean': 'avg_transaction_value',
    'total_amount_median': 'median_transaction_value',
    'discount_pct_mean': 'avg_discount_pct',
    'customer_rating_mean': 'avg_rating',
    'customer_rating_count': 'num_ratings'
})

print("Detailed statistics by product category:")
print(category_stats)

# Visualize sales by category
plt.figure(figsize=(12, 6))
plt.bar(category_stats.index, category_stats['total_sales'])
plt.title('Total Sales by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Total Sales ($)')
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# Monthly sales trends
monthly_sales = sales_df.groupby('month').agg({
    'transaction_id': 'count',
    'total_amount': 'sum',
    'customer_rating': ['mean', 'count']
})

# Clean up column names
monthly_sales.columns = ['_'.join(col).strip() for col in monthly_sales.columns.values]
monthly_sales = monthly_sales.rename(columns={
    'transaction_id_count': 'num_transactions',
    'total_amount_sum': 'total_sales',
    'customer_rating_mean': 'avg_rating',
    'customer_rating_count': 'num_ratings'
})

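# Reorder chronologically, keeping only the months present in the data (Q1 here);
# reindexing against all twelve names would add empty NaN rows for April-December
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_sales = monthly_sales.reindex([m for m in month_order if m in monthly_sales.index])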

print("Monthly sales statistics:")
print(monthly_sales)

# Visualize monthly trends
fig, ax1 = plt.subplots(figsize=(12, 6))

# Plot total sales on the first y-axis
color = 'tab:blue'
ax1.set_xlabel('Month')
ax1.set_ylabel('Total Sales ($)', color=color)
ax1.bar(monthly_sales.index, monthly_sales['total_sales'], color=color, alpha=0.7)
ax1.tick_params(axis='y', labelcolor=color)

# Create a second y-axis for average rating
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Average Rating', color=color)
ax2.plot(monthly_sales.index, monthly_sales['avg_rating'], color=color, marker='o', linewidth=2)
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim(1, 5)  # Set y-axis limits for ratings

plt.title('Monthly Sales and Average Customer Rating')
plt.grid(True, axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
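
Month names sort alphabetically when used as group keys, which is why the result needs reordering. One alternative, sketched here on hypothetical data, is to convert the column to an ordered `Categorical` before grouping so that the groups come out in calendar order directly.

```python
import pandas as pd

# Hypothetical toy data with months in arbitrary order
toy = pd.DataFrame({
    "month": ["March", "January", "February", "January"],
    "total_amount": [30.0, 10.0, 20.0, 5.0],
})

# Declare the calendar order once; groupby then sorts by it
month_order = ["January", "February", "March"]
toy["month"] = pd.Categorical(toy["month"], categories=month_order, ordered=True)

# observed=True keeps only the categories that actually appear in the data
monthly_totals = toy.groupby("month", observed=True)["total_amount"].sum()

print(monthly_totals)
```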

## 5. Custom Aggregation Functions

While Pandas provides many built-in aggregation functions, you can also define your own custom functions:

In [None]:
# Define custom aggregation functions
def range_func(x):
    """Calculate the range (max - min)"""
    return x.max() - x.min()

def pct_change_func(x):
    """Calculate the percentage change from first to last value"""
    if len(x) > 1 and x.iloc[0] != 0:
        return ((x.iloc[-1] - x.iloc[0]) / x.iloc[0] * 100).round(2)
    return 0

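def discount_savings(x):
    """Calculate total savings from discounts (per-unit discount times quantity).

    Note: quantities are looked up in the global sales_df by index label, so this
    only works on groups taken from sales_df with its original index.
    """
    return (x * sales_df.loc[x.index, 'quantity']).sum().round(2)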

def rating_distribution(x):
    """Calculate the distribution of ratings as a dictionary"""
    # Remove NaN values
    x = x.dropna()
    if len(x) == 0:
        return {}
    # Count occurrences of each rating
    counts = x.value_counts().to_dict()
    # Convert to percentages
    total = sum(counts.values())
    return {f"{k}": round(v/total*100, 1) for k, v in counts.items()}

# Apply custom functions to each product category
custom_stats = sales_df.groupby('product_category').agg({
    'unit_price': [range_func, 'mean'],
    'discount_amount': discount_savings,
    'customer_rating': rating_distribution
})

# Clean up column names
custom_stats.columns = ['_'.join(col).strip() for col in custom_stats.columns.values]
custom_stats = custom_stats.rename(columns={
    'unit_price_range_func': 'price_range',
    'unit_price_mean': 'avg_price',
    'discount_amount_discount_savings': 'total_discount_savings',
    'customer_rating_rating_distribution': 'rating_distribution'
})

print("Custom aggregation results:")
print(custom_stats)
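
A function passed to `.agg()` receives one column at a time, so an aggregation that combines two columns needs a different approach. Two common options are sketched below on hypothetical data: precompute a row-level column and sum it, or select the needed columns and use `.apply()`, which receives each group as a DataFrame. The names `line_discount`, `savings_precomputed`, and `savings_applied` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "product_category": ["Books", "Books", "Home"],
    "discount_amount": [1.0, 2.0, 5.0],   # discount per unit
    "quantity": [2, 3, 1],
})

# Option 1: compute the row-level value first, then use a built-in aggregation
savings_precomputed = (
    toy.assign(line_discount=toy["discount_amount"] * toy["quantity"])
       .groupby("product_category")["line_discount"]
       .sum()
)

# Option 2: apply a function that sees both columns of each group
savings_applied = (
    toy.groupby("product_category")[["discount_amount", "quantity"]]
       .apply(lambda g: (g["discount_amount"] * g["quantity"]).sum())
)

print(savings_precomputed)
print(savings_applied)
```

Option 1 is usually faster because the built-in `sum` runs in optimized code, while `.apply()` calls the Python function once per group.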

In [None]:
# Time-based custom aggregations
weekly_data = sales_df.sort_values('date').set_index('date')
weekly_agg = weekly_data.resample('W').agg({
    'transaction_id': 'count',
    'total_amount': ['sum', 'mean', pct_change_func],
    'customer_rating': ['mean', 'count']
})

# Clean up column names
weekly_agg.columns = ['_'.join(col).strip() for col in weekly_agg.columns.values]
weekly_agg = weekly_agg.rename(columns={
    'transaction_id_count': 'num_transactions',
    'total_amount_sum': 'total_sales',
    'total_amount_mean': 'avg_transaction',
    'total_amount_pct_change_func': 'weekly_pct_change',  # first-to-last change within each week, not week-over-week
    'customer_rating_mean': 'avg_rating',
    'customer_rating_count': 'num_ratings'
})

print("Weekly aggregations with custom functions:")
print(weekly_agg)

# Plot weekly sales trend
plt.figure(figsize=(12, 6))
plt.plot(weekly_agg.index, weekly_agg['total_sales'], marker='o', linewidth=2)
plt.title('Weekly Sales Trend (Q1 2025)')
plt.xlabel('Week Ending')
plt.ylabel('Total Sales ($)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
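
To measure change from one week to the next (rather than within a week), one option is to aggregate to weekly totals first and then call the built-in `pct_change()` on the result. A minimal sketch on hypothetical dates and amounts:

```python
import pandas as pd

# Hypothetical toy transactions spanning three calendar weeks
toy = pd.DataFrame(
    {"total_amount": [100.0, 50.0, 200.0, 300.0]},
    index=pd.to_datetime(["2025-01-06", "2025-01-08", "2025-01-14", "2025-01-21"]),
)

# Weekly totals; 'W' bins end on Sunday by default
weekly_sales = toy["total_amount"].resample("W").sum()

# Percentage change relative to the previous week (the first week has no predecessor)
week_over_week = weekly_sales.pct_change() * 100

print(pd.DataFrame({"weekly_sales": weekly_sales, "wow_pct_change": week_over_week.round(2)}))
```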

## 6. Advanced GroupBy Aggregations

Let's explore more complex aggregation patterns with GroupBy:

In [None]:
# Nested grouping: region → store → product_category
nested_agg = sales_df.groupby(['region', 'store_id', 'product_category']).agg({
    'transaction_id': 'count',
    'total_amount': 'sum'
}).reset_index()

# Rename columns
nested_agg = nested_agg.rename(columns={
    'transaction_id': 'num_transactions',
    'total_amount': 'total_sales'
})

print("Nested grouping aggregation:")
print(nested_agg.head(15))

# Create a pivot table for easier analysis
pivot_region_store = pd.pivot_table(
    nested_agg, 
    values='total_sales',
    index=['region', 'store_id'],
    columns='product_category',
    aggfunc='sum',
    fill_value=0
)

print("\nPivot table of sales by region, store, and category:")
print(pivot_region_store)

In [None]:
# Analyzing sales by online vs. in-store and payment method
channel_payment_agg = sales_df.groupby(['is_online', 'payment_method']).agg({
    'transaction_id': 'count',
    'total_amount': ['sum', 'mean'],
    'customer_rating': ['mean', 'count']
}).reset_index()

# Clean up column names
channel_payment_agg.columns = ['_'.join(col).strip() for col in channel_payment_agg.columns.values]
channel_payment_agg = channel_payment_agg.rename(columns={
    'is_online_': 'is_online',
    'payment_method_': 'payment_method',
    'transaction_id_count': 'num_transactions',
    'total_amount_sum': 'total_sales',
    'total_amount_mean': 'avg_transaction',
    'customer_rating_mean': 'avg_rating',
    'customer_rating_count': 'num_ratings'
})

print("Sales by channel and payment method:")
print(channel_payment_agg)

# Create a grouped bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x='payment_method', y='total_sales', hue='is_online', data=channel_payment_agg)
plt.title('Sales by Payment Method and Channel')
plt.xlabel('Payment Method')
plt.ylabel('Total Sales ($)')
plt.legend(title='Is Online')
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## 7. Named Aggregations

Pandas 0.25 introduced named aggregation, a more readable syntax that lets you choose the output column names directly. We touched on it in Part 1; let's explore it further:

In [None]:
# Named aggregations
named_aggs = sales_df.groupby('product_category').agg(
    num_transactions=('transaction_id', 'count'),
    total_sales=('total_amount', 'sum'),
    avg_transaction=('total_amount', 'mean'),
    max_transaction=('total_amount', 'max'),
    total_quantity=('quantity', 'sum'),
    avg_rating=('customer_rating', 'mean'),
    avg_discount=('discount_pct', 'mean')
)

print("Named aggregations:")
print(named_aggs)

# Calculate a derived metric: average revenue per unit
named_aggs['revenue_per_unit'] = (named_aggs['total_sales'] / named_aggs['total_quantity']).round(2)

# Calculate another derived metric: discount impact ratio
# (illustrative only: it divides a percentage by a dollar amount, so it has no natural unit)
named_aggs['discount_impact'] = (named_aggs['avg_discount'] / named_aggs['avg_transaction'] * 100).round(2)

print("\nWith derived metrics:")
print(named_aggs)
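
The `(column, function)` tuples used above are shorthand for `pd.NamedAgg`, a named tuple with `column` and `aggfunc` fields. Spelling it out can make long aggregation specs easier to read. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "product_category": ["Books", "Books", "Home"],
    "total_amount": [10.0, 30.0, 50.0],
    "quantity": [1, 3, 5],
})

# Same result as total_sales=('total_amount', 'sum'), written explicitly
summary = toy.groupby("product_category").agg(
    total_sales=pd.NamedAgg(column="total_amount", aggfunc="sum"),
    total_quantity=pd.NamedAgg(column="quantity", aggfunc="sum"),
)

# Derived metrics are ordinary column arithmetic on the aggregated result
summary["revenue_per_unit"] = summary["total_sales"] / summary["total_quantity"]

print(summary)
```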

In [None]:
# Mixed named and callable aggregations
def top_products(x):
    """Returns the top 3 most common product IDs in a group"""
    return x.value_counts().nlargest(3).index.tolist()

mixed_aggs = sales_df.groupby('product_category').agg(
    num_transactions=('transaction_id', 'count'),
    total_sales=('total_amount', 'sum'),
    price_range=('unit_price', lambda x: x.max() - x.min()),
    popular_products=('product_id', top_products),
    rating_summary=('customer_rating', rating_distribution)
)

print("Mixed named and callable aggregations:")
print(mixed_aggs)

## 8. Advanced Statistical Aggregations

Aggregations also feed statistical analysis, such as correlation matrices and two-sample tests:

In [None]:
# Correlation analysis between numeric variables
numeric_cols = ['quantity', 'unit_price', 'discount_pct', 'discount_amount', 
                'net_price', 'total_amount', 'customer_rating']
correlation_matrix = sales_df[numeric_cols].corr()

print("Correlation matrix:")
print(correlation_matrix.round(2))

# Visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Matrix of Numeric Variables')
plt.tight_layout()
plt.show()

In [None]:
# Statistical tests and aggregations
# Compare online vs. in-store sales
channel_comparison = sales_df.groupby('is_online').agg(
    num_transactions=('transaction_id', 'count'),
    total_sales=('total_amount', 'sum'),
    avg_transaction=('total_amount', 'mean'),
    std_transaction=('total_amount', 'std'),
    min_transaction=('total_amount', 'min'),
    max_transaction=('total_amount', 'max'),
    avg_rating=('customer_rating', 'mean')
)

print("Comparison of online vs. in-store sales:")
print(channel_comparison)

# Perform Welch's t-test (equal_var=False) to check whether mean online and in-store transaction values differ
from scipy import stats

online_sales = sales_df[sales_df['is_online']]['total_amount']
instore_sales = sales_df[~sales_df['is_online']]['total_amount']

t_stat, p_value = stats.ttest_ind(online_sales, instore_sales, equal_var=False)
print(f"\nT-test for difference in transaction amounts between online and in-store:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Statistically significant difference: {p_value < 0.05}")

# Visualize the distributions
plt.figure(figsize=(12, 6))
# Fix the hue colors explicitly so the mean lines below match their distributions
sns.histplot(data=sales_df, x='total_amount', hue='is_online', bins=20, element='step', common_norm=False, stat='density', palette={True: 'tab:blue', False: 'tab:orange'})
plt.axvline(online_sales.mean(), color='tab:blue', linestyle='dashed', linewidth=1, label=f'Online Mean: ${online_sales.mean():.2f}')
plt.axvline(instore_sales.mean(), color='tab:orange', linestyle='dashed', linewidth=1, label=f'In-store Mean: ${instore_sales.mean():.2f}')
plt.title('Distribution of Transaction Amounts by Channel')
plt.xlabel('Transaction Amount ($)')
plt.ylabel('Density')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
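
The Welch t-statistic is itself built from three per-group aggregations: mean, sample variance, and count. The sketch below recomputes it by hand on hypothetical data, using t = (mean1 - mean2) / sqrt(var1/n1 + var2/n2). The string `channel` column is an assumed stand-in for `is_online`, and the sketch computes only the statistic, not the p-value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three online and three in-store transactions
toy = pd.DataFrame({
    "channel": ["online"] * 3 + ["in_store"] * 3,
    "total_amount": [10.0, 12.0, 14.0, 20.0, 22.0, 27.0],
})

# One aggregation call gives every ingredient of the test statistic
stats_by_channel = toy.groupby("channel")["total_amount"].agg(["mean", "var", "count"])
online = stats_by_channel.loc["online"]
in_store = stats_by_channel.loc["in_store"]

# Welch's standard error does not assume equal variances
standard_error = np.sqrt(
    online["var"] / online["count"] + in_store["var"] / in_store["count"]
)
welch_t = (online["mean"] - in_store["mean"]) / standard_error

print(stats_by_channel)
print(f"Welch t-statistic: {welch_t:.4f}")
```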

## 9. SQL to Pandas Aggregation Translation Guide

Let's expand our SQL-to-Pandas translation guide with aggregation-specific operations:

| SQL Aggregation | Pandas Equivalent |
|-----------------|-------------------|
| `COUNT(*)` | `df.shape[0]` or `len(df)` |
| `COUNT(column)` | `df['column'].count()` |
| `SUM(column)` | `df['column'].sum()` |
| `AVG(column)` | `df['column'].mean()` |
| `MIN(column)` | `df['column'].min()` |
| `MAX(column)` | `df['column'].max()` |
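| `STDDEV_SAMP(column)` (`STDDEV` in some dialects) | `df['column'].std()` (sample, `ddof=1`) |
| `VAR_SAMP(column)` (`VARIANCE` in some dialects) | `df['column'].var()` (sample, `ddof=1`) |
| `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column)` (where supported) | `df['column'].median()` |
| `PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY column)` (where supported) | `df['column'].quantile(0.25)` |
| `MODE() WITHIN GROUP (ORDER BY column)` in PostgreSQL; no direct equivalent in many dialects | `df['column'].mode()[0]` |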
| `GROUP BY col1, col2` | `df.groupby(['col1', 'col2'])` |
| `SELECT col1, SUM(col2) FROM table GROUP BY col1` | `df.groupby('col1')['col2'].sum()` |
| `SELECT col1, AVG(col2), MAX(col3) FROM table GROUP BY col1` | `df.groupby('col1').agg({'col2': 'mean', 'col3': 'max'})` |
| `HAVING COUNT(*) > 10` | `df.groupby('col1').filter(lambda x: len(x) > 10)` (returns the original rows of qualifying groups, not aggregated rows) |
| `ORDER BY SUM(col2) DESC` | `df.groupby('col1')['col2'].sum().sort_values(ascending=False)` |
| window function: `SUM(col2) OVER (PARTITION BY col1)` | `df.groupby('col1')['col2'].transform('sum')` |
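
Three rows of the table are easy to confuse, so here is a minimal sketch on hypothetical data: a `HAVING`-style filter applied to aggregated output, `.filter()` returning the original rows of qualifying groups, and `.transform()` acting like a window function. The threshold of 2 and the names `n`, `total`, and `group_total` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: group 'a' has three rows, group 'b' has one
toy = pd.DataFrame({
    "col1": ["a", "a", "a", "b"],
    "col2": [1, 2, 3, 10],
})

# SELECT col1, COUNT(*) AS n, SUM(col2) AS total ... GROUP BY col1 HAVING COUNT(*) > 2
grouped = toy.groupby("col1").agg(n=("col2", "size"), total=("col2", "sum"))
having = grouped[grouped["n"] > 2]

# .filter() keeps the un-aggregated rows that belong to qualifying groups
rows_kept = toy.groupby("col1").filter(lambda g: len(g) > 2)

# SUM(col2) OVER (PARTITION BY col1): one value per original row
toy["group_total"] = toy.groupby("col1")["col2"].transform("sum")

print(having)
print(rows_kept)
print(toy)
```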

## 10. Practice Exercises

### Exercise 1: Basic Aggregation
For each payment method in the `sales_df` DataFrame, calculate the total sales, average transaction value, number of transactions, and average customer rating.

In [None]:
# Your code here

### Exercise 2: Custom Aggregation Function
Create a custom aggregation function that calculates the percentage of high-value transactions (over $500) for each product category. Then apply this function in a groupby operation.

In [None]:
# Your code here

### Exercise 3: Multiple Aggregations
Group the sales data by both day of the week and is_online status. Calculate the total sales, number of transactions, and average transaction value for each group. Then create a visualization that compares online vs. in-store sales for each day of the week.

In [None]:
# Your code here

### Exercise 4: Statistical Aggregation
For each region, calculate the following statistics for transaction amounts: mean, median, standard deviation, interquartile range (IQR), and coefficient of variation (CV = standard deviation / mean). Which region has the most consistent transaction values?

In [None]:
# Your code here

### Exercise 5: SQL to Pandas Translation
Translate the following SQL query to Pandas code using aggregation functions:
```sql
SELECT 
    CASE 
        WHEN unit_price < 100 THEN 'Low'
        WHEN unit_price BETWEEN 100 AND 250 THEN 'Medium'
        ELSE 'High'
    END as price_category,
    COUNT(*) as num_transactions,
    SUM(total_amount) as total_sales,
    AVG(customer_rating) as avg_rating,
    SUM(CASE WHEN is_online = true THEN 1 ELSE 0 END) as online_count,
    SUM(CASE WHEN is_online = false THEN 1 ELSE 0 END) as instore_count
FROM sales
GROUP BY price_category
ORDER BY total_sales DESC
```

In [None]:
# Your code here

## Next Steps

The next part of today's session is Part 3, which covers pivot tables and cross-tabulations.

Continue to Part 3: Pivot Tables and Cross-Tabulations when you're ready to proceed.