# Advanced Pandas Operations - Part 3: Pivot Tables and Cross-Tabulations

## Week 3, Day 2 (Thursday) - April 24th, 2025

### Overview
This is the third part of our Advanced Pandas Operations session, focusing on pivot tables and cross-tabulations. These powerful tools allow you to reshape, summarize, and analyze data in ways similar to pivot tables in Excel and OLAP cubes in business intelligence systems.

### Learning Objectives
- Understand the concept of data reshaping and pivoting
- Master Pandas pivot table functionality
- Create and interpret cross-tabulations
- Apply aggregation functions in pivot tables
- Visualize pivot table results
- Translate SQL CUBE and ROLLUP concepts to Pandas

### Prerequisites
- Python fundamentals (Week 1)
- Pandas Fundamentals I & II (Week 2, Day 2 & Week 3, Day 1)
- GroupBy operations (Week 3, Day 2, Part 1)
- Aggregation functions (Week 3, Day 2, Part 2)
- SQL knowledge (prior to course)

## 1. Introduction to Pivot Tables and Data Reshaping

Pivot tables transform data from a long format to a wide format, making it easier to summarize and analyze. Let's begin by creating a dataset to work with:

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set Pandas display options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)

# Create a sample sales dataset (similar to previous parts)
np.random.seed(42)  # For reproducibility
n = 200  # Number of records

# Generate dates for Q1 2025
dates = pd.date_range('2025-01-01', '2025-03-31', periods=n)

# Create dictionary of data
data = {
    'date': dates,
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Sports'], size=n, p=[0.3, 0.25, 0.2, 0.15, 0.1]),
    'product_name': np.random.choice(['Laptop', 'Smartphone', 'T-shirt', 'Jeans', 'Lamp', 'Chair', 'Novel', 'Textbook', 'Basketball', 'Tennis Racket'], size=n),
    'store_id': np.random.choice(['S01', 'S02', 'S03', 'S04'], size=n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=n),
    'sales_channel': np.random.choice(['Online', 'In-store'], size=n, p=[0.6, 0.4]),
    'quantity': np.random.randint(1, 6, size=n),
    'unit_price': np.random.uniform(10, 500, size=n).round(2),
    'customer_type': np.random.choice(['Regular', 'Premium', 'New'], size=n, p=[0.5, 0.3, 0.2])
}

# Create DataFrame
sales_df = pd.DataFrame(data)

# Calculate total sales
sales_df['total_sales'] = (sales_df['quantity'] * sales_df['unit_price']).round(2)

# Add time-related columns
sales_df['month'] = sales_df['date'].dt.month_name()
sales_df['week'] = sales_df['date'].dt.isocalendar().week
sales_df['day_of_week'] = sales_df['date'].dt.day_name()

# Display the first few rows
print("Sample Sales DataFrame:")
print(sales_df.head())

# Display summary information
print("\nDataFrame Info:")
print(f"Shape: {sales_df.shape}")
print(f"Columns: {sales_df.columns.tolist()}")
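
Before pivoting the full dataset, it may help to see the long-to-wide idea on a handful of rows. The sketch below uses a small made-up table (`long_df` and its values are illustrative, not part of the sales data) and shows one way to move between the two layouts.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per (store, quarter) observation
long_df = pd.DataFrame({
    'store': ['S01', 'S01', 'S02', 'S02'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'sales': [100, 120, 80, 95],
})

# Wide format: one row per store, one column per quarter
wide_df = long_df.pivot(index='store', columns='quarter', values='sales')
print(wide_df)

# melt() goes the other way, from wide back to long
back_to_long = wide_df.reset_index().melt(
    id_vars='store', var_name='quarter', value_name='sales'
)
print(back_to_long)
```

The wide table has as many rows as there are stores, while the long table has one row per store/quarter pair.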

## 2. Basic Pivot Tables with pivot() Method

Let's start with the basic `pivot()` method, which rearranges data from long to wide format:

In [None]:
# Basic pivot: Product category by region
# First, aggregate the data to prevent duplicates
category_region_sales = sales_df.groupby(['product_category', 'region'])['total_sales'].sum().reset_index()
print("Aggregated data before pivoting:")
print(category_region_sales.head(10))

# Create a pivot table
sales_pivot = category_region_sales.pivot(index='product_category', columns='region', values='total_sales')
print("\nPivot table of sales by product category and region:")
print(sales_pivot)

# Visualize the pivot table as a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(sales_pivot, annot=True, fmt='.0f', cmap='YlGnBu', linewidths=.5)
plt.title('Sales by Product Category and Region')
plt.tight_layout()
plt.show()

In [None]:
# Pivot with multiple values columns
# First, aggregate the data to prevent duplicates
multi_metrics = sales_df.groupby(['product_category', 'sales_channel']).agg({
    'total_sales': 'sum',
    'quantity': 'sum'
}).reset_index()

# pivot() also accepts a list of values columns, but the result then has hierarchical (MultiIndex) columns:
# multi_pivot = multi_metrics.pivot(index='product_category', columns='sales_channel', values=['total_sales', 'quantity'])

# To get flat, prefixed column names instead, create separate pivots and concatenate them
sales_pivot = multi_metrics.pivot(index='product_category', columns='sales_channel', values='total_sales')
sales_pivot.columns = [f'sales_{col}' for col in sales_pivot.columns]

quantity_pivot = multi_metrics.pivot(index='product_category', columns='sales_channel', values='quantity')
quantity_pivot.columns = [f'quantity_{col}' for col in quantity_pivot.columns]

# Combine the two pivot tables
combined_pivot = pd.concat([sales_pivot, quantity_pivot], axis=1)
print("Combined pivot table with multiple metrics:")
print(combined_pivot)

# Add total columns for sales and quantity
combined_pivot['sales_Total'] = combined_pivot['sales_Online'] + combined_pivot['sales_In-store']
combined_pivot['quantity_Total'] = combined_pivot['quantity_Online'] + combined_pivot['quantity_In-store']

print("\nWith totals:")
print(combined_pivot)

### Limitations of the basic pivot() method

The `pivot()` method has several limitations:
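1. It requires every index/columns combination to be unique and raises a `ValueError` when it finds duplicates
2. It doesn't perform any aggregation on its own
3. With several values columns it returns hierarchical (MultiIndex) columns that usually need flattening
4. It has no `fill_value` option, so missing combinations are left as `NaN`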

To overcome these limitations, Pandas provides the more powerful `pivot_table()` method.
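
The first and last limitations can be illustrated with a small sketch. The `toy` DataFrame below is made up for this purpose; it contains a duplicated category/region pair, which `pivot()` rejects and `pivot_table()` aggregates.

```python
import pandas as pd

# Hypothetical toy data with a duplicated (category, region) pair
toy = pd.DataFrame({
    'category': ['Books', 'Books', 'Home'],
    'region': ['North', 'North', 'South'],
    'sales': [10.0, 15.0, 7.0],
})

# pivot() cannot reshape when an index/columns combination appears twice
try:
    toy.pivot(index='category', columns='region', values='sales')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print(f"pivot() failed: {exc}")

# pivot_table() aggregates the duplicates and can fill empty cells
summary = pd.pivot_table(
    toy, index='category', columns='region',
    values='sales', aggfunc='sum', fill_value=0
)
print(summary)
```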

## 3. Advanced Pivot Tables with pivot_table() Method

The `pivot_table()` method is more flexible and powerful than the basic `pivot()` method:

In [None]:
# Basic pivot table
basic_pivot_table = pd.pivot_table(
    sales_df,
    index='product_category',
    columns='region',
    values='total_sales',
    aggfunc='sum'
)

print("Basic pivot table:")
print(basic_pivot_table)

# Add row and column totals
basic_pivot_table['Total'] = basic_pivot_table.sum(axis=1)
basic_pivot_table.loc['Total'] = basic_pivot_table.sum()

print("\nPivot table with totals:")
print(basic_pivot_table)
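
As an alternative to adding totals by hand, `pivot_table()` has a `margins` parameter. The sketch below uses a small made-up DataFrame (`toy` is illustrative only) to show that the margins are computed from the underlying rows, which matters for aggregations such as the mean.

```python
import pandas as pd

# Hypothetical toy data: Books/North has two transactions
toy = pd.DataFrame({
    'category': ['Books', 'Books', 'Books', 'Home', 'Home'],
    'region': ['North', 'North', 'South', 'North', 'South'],
    'sales': [10.0, 20.0, 60.0, 6.0, 4.0],
})

# Sums: the 'Total' row and column match what manual .sum() calls would give
sum_table = pd.pivot_table(
    toy, index='category', columns='region', values='sales',
    aggfunc='sum', margins=True, margins_name='Total'
)
print(sum_table)

# Means: the margin is the mean of all underlying rows (10, 20, 60),
# not the mean of the two cell means (15 and 60)
mean_table = pd.pivot_table(
    toy, index='category', columns='region', values='sales',
    aggfunc='mean', margins=True, margins_name='Total'
)
print(mean_table)
```

For sums the two approaches agree, so the manual version above is fine; for averages, summing or averaging the cells of the finished table would give a different answer than `margins=True`.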

In [None]:
# Pivot table with multiple aggregation functions
multi_agg_pivot = pd.pivot_table(
    sales_df,
    index='product_category',
    columns='region',
    values='total_sales',
    aggfunc=['sum', 'mean', 'count']
)

print("Pivot table with multiple aggregation functions:")
print(multi_agg_pivot)

# Flatten the hierarchical column index
multi_agg_pivot.columns = ['_'.join(col).strip() for col in multi_agg_pivot.columns.values]
print("\nWith flattened column names:")
print(multi_agg_pivot)

In [None]:
# Pivot table with multiple values columns
multi_values_pivot = pd.pivot_table(
    sales_df,
    index='product_category',
    columns='sales_channel',
    values=['total_sales', 'quantity'],
    aggfunc='sum'
)

print("Pivot table with multiple values columns:")
print(multi_values_pivot)

# Flatten the hierarchical column index
multi_values_pivot.columns = ['_'.join(col).strip() for col in multi_values_pivot.columns.values]
print("\nWith flattened column names:")
print(multi_values_pivot)

In [None]:
# Pivot table with multiple index and column levels
multi_index_pivot = pd.pivot_table(
    sales_df,
    index=['product_category', 'customer_type'],
    columns=['region', 'sales_channel'],
    values='total_sales',
    aggfunc='sum',
    fill_value=0  # Replace NaN with 0
)

print("Pivot table with multiple index and column levels:")
print(multi_index_pivot)

# Calculate subtotals for each product category
category_totals = sales_df.groupby('product_category')['total_sales'].sum()
print("\nProduct category totals:")
print(category_totals)

## 4. Custom Aggregation in Pivot Tables

You can use custom aggregation functions in pivot tables:

In [None]:
# Define custom aggregation functions
def price_range(x):
    """Calculate the range between max and min"""
    return x.max() - x.min()

def pct_of_total(x):
    """Calculate percentage of the total"""
    return (x.sum() / sales_df['total_sales'].sum() * 100).round(2)

# Pivot table with custom aggregation functions
# Pass the functions as a list: a dict that maps a column to a list of functions
# is not reliably supported together with margins=True
custom_pivot = pd.pivot_table(
    sales_df,
    index='product_category',
    columns='region',
    values='total_sales',
    aggfunc=['sum', 'mean', price_range, pct_of_total],
    margins=True,  # Add row and column totals
    margins_name='Total'
)

print("Pivot table with custom aggregation functions:")
print(custom_pivot)

# Flatten hierarchical column index: names become '<function>_<region>',
# e.g. 'price_range_North' or 'pct_of_total_Total'
custom_pivot.columns = ['_'.join(col).strip() for col in custom_pivot.columns.values]

# Clean up the custom function names in the flattened labels
custom_pivot.columns = (
    custom_pivot.columns
    .str.replace('price_range', 'Price_Range')
    .str.replace('pct_of_total', 'Pct_of_Total')
)

print("\nWith flattened column names:")
print(custom_pivot)
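
The way a custom function's name ends up in the column labels can be checked on a tiny example. This is a minimal sketch on made-up data; `toy` and `value_range` are illustrative names, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data with a single region to keep the output small
toy = pd.DataFrame({
    'category': ['Books', 'Books', 'Home', 'Home'],
    'region': ['North', 'North', 'North', 'North'],
    'sales': [10.0, 25.0, 7.0, 3.0],
})

def value_range(x):
    """Difference between the largest and smallest value in the group."""
    return x.max() - x.min()

# A list of aggregations yields (function name, region) column labels
table = pd.pivot_table(
    toy, index='category', columns='region',
    values='sales', aggfunc=['sum', value_range]
)

# Flatten the two-level labels into single strings
table.columns = ['_'.join(col) for col in table.columns]
print(table)
```

Giving the function a descriptive name pays off here, because that name becomes part of every flattened column label.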

## 5. Pivot Table Visualization

Pivot tables are excellent for creating insightful visualizations:

In [None]:
# Create a pivot table for monthly sales by product category
monthly_category_pivot = pd.pivot_table(
    sales_df,
    index='month',
    columns='product_category',
    values='total_sales',
    aggfunc='sum'
)

# Reorder months
month_order = ['January', 'February', 'March']
monthly_category_pivot = monthly_category_pivot.reindex(month_order)

print("Monthly sales by product category:")
print(monthly_category_pivot)

# Visualize as a stacked bar chart
monthly_category_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Monthly Sales by Product Category')
plt.xlabel('Month')
plt.ylabel('Total Sales ($)')
plt.legend(title='Product Category')
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Visualize as a heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(monthly_category_pivot, annot=True, fmt='.0f', cmap='viridis', linewidths=.5)
plt.title('Monthly Sales by Product Category')
plt.tight_layout()
plt.show()

In [None]:
# Create a pivot table for sales channel and customer type
channel_customer_pivot = pd.pivot_table(
    sales_df,
    index='customer_type',
    columns='sales_channel',
    values=['total_sales', 'quantity'],
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

print("Sales by channel and customer type:")
print(channel_customer_pivot)

# Visualize the total_sales part as a grouped bar chart
channel_customer_data = pd.pivot_table(
    sales_df,
    index='customer_type',
    columns='sales_channel',
    values='total_sales',
    aggfunc='sum'
)

channel_customer_data.plot(kind='bar', figsize=(10, 6))
plt.title('Sales by Channel and Customer Type')
plt.xlabel('Customer Type')
plt.ylabel('Total Sales ($)')
plt.legend(title='Sales Channel')
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## 6. Cross-Tabulations with crosstab()

Cross-tabulations (or contingency tables) are a special kind of pivot table that shows the frequency distribution of categorical variables:

In [None]:
# Basic cross-tabulation
basic_crosstab = pd.crosstab(sales_df['product_category'], sales_df['region'])
print("Basic cross-tabulation of product category and region:")
print(basic_crosstab)

# Add row and column margins (totals)
crosstab_with_margins = pd.crosstab(
    sales_df['product_category'], 
    sales_df['region'],
    margins=True, 
    margins_name='Total'
)
print("\nCross-tabulation with margins:")
print(crosstab_with_margins)

# Normalize by rows (convert to percentages)
crosstab_normalized = pd.crosstab(
    sales_df['product_category'], 
    sales_df['region'],
    normalize='index'  # 'index' for row percentages, 'columns' for column percentages, 'all' for cell percentages
).round(2)
print("\nCross-tabulation with row percentages:")
print(crosstab_normalized)

# Visualize the cross-tabulation
plt.figure(figsize=(10, 6))
sns.heatmap(crosstab_normalized, annot=True, cmap='YlGnBu', fmt='.0%')
plt.title('Product Category Distribution by Region (Row Percentages)')
plt.tight_layout()
plt.show()
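
To see what `crosstab()` computes, it can be compared with a plain `groupby()` count. The sketch below runs on a small made-up DataFrame (`toy` is illustrative) and also checks that row-normalized shares add up to 1 in each row.

```python
import pandas as pd

# Hypothetical toy data: five transactions
toy = pd.DataFrame({
    'channel': ['Online', 'Online', 'In-store', 'Online', 'In-store'],
    'customer': ['New', 'Regular', 'Regular', 'Regular', 'New'],
})

# Frequency table of customer type by channel
counts = pd.crosstab(toy['customer'], toy['channel'])
print(counts)

# The same counts via groupby + size + unstack
same_counts = toy.groupby(['customer', 'channel']).size().unstack(fill_value=0)
print(same_counts)

# Row shares: each row sums to 1
row_share = pd.crosstab(toy['customer'], toy['channel'], normalize='index')
print(row_share.sum(axis=1))
```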

In [None]:
# Cross-tabulation with aggregation
sales_crosstab = pd.crosstab(
    [sales_df['product_category'], sales_df['customer_type']],  # Multiple row indices
    [sales_df['region'], sales_df['sales_channel']],  # Multiple column indices
    values=sales_df['total_sales'],
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

print("Cross-tabulation with aggregation and multiple indices:")
print(sales_crosstab)

# Visualize each category's regional share of sales
# normalize='index' converts the summed sales in each row to fractions of the row total
sales_share = pd.crosstab(
    sales_df['product_category'],
    sales_df['region'],
    values=sales_df['total_sales'],
    aggfunc='sum',
    normalize='index'  # Row percentages
).round(2)

plt.figure(figsize=(10, 6))
sns.heatmap(sales_share, annot=True, cmap='YlOrRd', fmt='.0%')
plt.title('Sales Distribution by Region for Each Product Category')
plt.tight_layout()
plt.show()

## 7. Unstacking and Stacking Data

Pandas provides methods to reshape data between long and wide formats:

In [None]:
# Start with a multi-index DataFrame
grouped_data = sales_df.groupby(['product_category', 'region'])['total_sales'].sum()
print("Grouped data with multi-index:")
print(grouped_data.head(10))

# Unstack the data (convert from long to wide format)
unstacked_data = grouped_data.unstack(level='region')
print("\nUnstacked data (wide format):")
print(unstacked_data)

# Stack the data back (convert from wide to long format)
stacked_data = unstacked_data.stack()
print("\nStacked data (back to long format):")
print(stacked_data.head(10))

# Reset index to convert multi-index to columns
long_format = stacked_data.reset_index()
long_format.columns = ['product_category', 'region', 'total_sales']
print("\nLong format data with regular columns:")
print(long_format.head(10))

In [None]:
# Working with multiple levels
# Create a three-level groupby
multi_grouped = sales_df.groupby(['product_category', 'region', 'sales_channel'])['total_sales'].sum()
print("Three-level grouped data:")
print(multi_grouped.head(10))

# Unstack the last level
partially_unstacked = multi_grouped.unstack(level='sales_channel')
print("\nPartially unstacked data (last level):")
print(partially_unstacked.head(10))

# Unstack one more level
further_unstacked = partially_unstacked.unstack(level='region')
print("\nFurther unstacked data:")
print(further_unstacked)
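
When some index combinations are absent, `unstack()` leaves gaps in the wide table. The following sketch, on a small made-up DataFrame (`toy` is illustrative), shows the default behavior next to the `fill_value` option.

```python
import pandas as pd

# Hypothetical toy data: there is no (Home, South) combination
toy = pd.DataFrame({
    'category': ['Books', 'Books', 'Home'],
    'region': ['North', 'South', 'North'],
    'sales': [10.0, 15.0, 7.0],
})
grouped = toy.groupby(['category', 'region'])['sales'].sum()

# Default: the missing combination becomes NaN
wide_nan = grouped.unstack('region')
print(wide_nan)

# fill_value replaces the gap while unstacking
wide_zero = grouped.unstack('region', fill_value=0)
print(wide_zero)

# Stacking the filled table gives one row per category/region pair
long_again = wide_zero.stack()
print(long_again)
```

Whether a gap should become 0 or stay `NaN` depends on the metric: 0 is reasonable for summed sales, but it would be misleading for an average.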

## 8. Subtotals and Grand Totals (SQL ROLLUP equivalent)

In SQL, you can use ROLLUP to include subtotals and a grand total in your results. Pandas doesn't have a direct equivalent, but `margins=True` provides grand totals, and subtotals can be assembled manually:

In [None]:
# Create a pivot table with margins
rollup_pivot = pd.pivot_table(
    sales_df,
    index=['product_category', 'customer_type'],
    columns='region',
    values='total_sales',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

print("Pivot table with margins (similar to SQL ROLLUP):")
print(rollup_pivot)

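# Closest SQL query. Note that ROLLUP also returns subtotals for each product_category
# and for each (product_category, customer_type) pair, whereas margins=True adds only grand totals: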
# SELECT 
#     product_category, 
#     customer_type, 
#     region, 
#     SUM(total_sales) as total_sales
# FROM sales
# GROUP BY 
#     ROLLUP(product_category, customer_type, region)

In [None]:
# For a more complete ROLLUP equivalent, we need to manually create subtotals
# First, get the total by product_category
category_totals = sales_df.groupby(['product_category'])['total_sales'].sum().reset_index()
category_totals['customer_type'] = 'Subtotal'

# Get the detailed data
detailed_data = sales_df.groupby(['product_category', 'customer_type'])['total_sales'].sum().reset_index()

# Combine the detailed data and subtotals
combined_data = pd.concat([detailed_data, category_totals])

# Add grand total
grand_total = pd.DataFrame({
    'product_category': ['Grand Total'],
    'customer_type': ['Total'],
    'total_sales': [sales_df['total_sales'].sum()]
})

all_data = pd.concat([combined_data, grand_total])

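# Sort so each category's detail rows are followed by its subtotal, with the grand total last
# (sorting on product_category alone would place 'Grand Total' between 'Electronics' and 'Home')
all_data['is_grand_total'] = all_data['product_category'] == 'Grand Total'
all_data['sort_key'] = (all_data['customer_type'] == 'Subtotal').astype(int)
all_data = (
    all_data.sort_values(['is_grand_total', 'product_category', 'sort_key'])
    .drop(columns=['is_grand_total', 'sort_key'])
    .reset_index(drop=True)
)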

print("Manual ROLLUP equivalent:")
print(all_data)
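
The manual steps above can be generalized to any number of grouping columns. The helper below is a hypothetical sketch, not a Pandas API: `rollup`, its parameters, and the `'(all)'` placeholder are assumptions made for illustration (SQL would return `NULL` in the rolled-up columns).

```python
import pandas as pd

def rollup(df, keys, value, label='(all)'):
    """Hypothetical helper: sum `value` for every prefix of `keys`, like GROUP BY ROLLUP."""
    pieces = []
    # Go from the most detailed grouping down to the grand total
    for depth in range(len(keys), -1, -1):
        group_keys = keys[:depth]
        if group_keys:
            part = df.groupby(group_keys, as_index=False)[value].sum()
        else:
            # No grouping columns left: this is the grand total
            part = pd.DataFrame({value: [df[value].sum()]})
        # Mark the rolled-up columns with the placeholder label
        for key in keys[depth:]:
            part[key] = label
        pieces.append(part[keys + [value]])
    return pd.concat(pieces, ignore_index=True)

# Hypothetical toy data
toy = pd.DataFrame({
    'category': ['Books', 'Books', 'Home'],
    'customer': ['New', 'Regular', 'New'],
    'sales': [10.0, 15.0, 7.0],
})
result = rollup(toy, ['category', 'customer'], 'sales')
print(result)
```

The result lists detail rows first, then one subtotal per category, then the grand total; sort it afterwards if an interleaved layout is preferred.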

## 9. Advanced Pivot Table Applications

Let's explore some more advanced applications of pivot tables:

In [None]:
# Pivot table by week and day of week
time_pivot = pd.pivot_table(
    sales_df,
    index='week',
    columns='day_of_week',
    values='total_sales',
    aggfunc='sum'
)

# Reorder days of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
time_pivot = time_pivot.reindex(columns=day_order)

print("Sales by week and day of week:")
print(time_pivot)

# Visualize as a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(time_pivot, annot=True, fmt='.0f', cmap='viridis', linewidths=.5)
plt.title('Sales by Week and Day of Week')
plt.tight_layout()
plt.show()

In [None]:
# Create a cohort-like analysis by month and product category
# Calculate percentage of monthly total for each category
monthly_category_sales = pd.pivot_table(
    sales_df,
    index='month',
    columns='product_category',
    values='total_sales',
    aggfunc='sum'
)

# Reorder months
monthly_category_sales = monthly_category_sales.reindex(month_order)

# Calculate percentages of monthly total (multiply before rounding to avoid floating-point residue)
monthly_pct = (monthly_category_sales.div(monthly_category_sales.sum(axis=1), axis=0) * 100).round(2)

print("Percentage of monthly sales by product category:")
print(monthly_pct)

# Visualize the percentage distribution
# DataFrame.plot() creates its own figure, so a separate plt.figure() call would leave an empty one
monthly_pct.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Monthly Sales Distribution by Product Category')
plt.xlabel('Month')
plt.ylabel('Percentage of Monthly Sales')
plt.legend(title='Product Category')
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## 10. SQL to Pandas Pivot Table Translation Guide

Here's a reference guide for translating SQL GROUP BY, CUBE, and ROLLUP operations to Pandas pivot tables:

| SQL Operation | Pandas Equivalent |
|--------------|-------------------|
| `SELECT col1, col2, SUM(col3) FROM table GROUP BY col1, col2` | `pd.pivot_table(df, index=['col1', 'col2'], values='col3', aggfunc='sum')` |
| `SELECT col1, col2, SUM(col3), AVG(col4) FROM table GROUP BY col1, col2` | `pd.pivot_table(df, index=['col1', 'col2'], values=['col3', 'col4'], aggfunc={'col3': 'sum', 'col4': 'mean'})` |
| `SELECT col1, col2, SUM(col3) FROM table GROUP BY col1, col2 WITH ROLLUP` | `pd.pivot_table(df, index=['col1', 'col2'], values='col3', aggfunc='sum', margins=True)` (adds the grand total only; per-`col1` subtotals must be built manually) |
| `SELECT col1, col2, SUM(col3) FROM table GROUP BY CUBE(col1, col2)` | No direct equivalent; need to manually create the subtotals |
| `SELECT col1, col2, COUNT(*) FROM table GROUP BY col1, col2` | `pd.crosstab(df['col1'], df['col2'])` |
| `SELECT col1, col2, SUM(col3) FROM table GROUP BY col1, col2` | `pd.crosstab(df['col1'], df['col2'], values=df['col3'], aggfunc='sum')` |
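
Since the table lists no direct equivalent for `CUBE`, here is one possible way to build it by hand. The `cube` helper, its parameters, and the `'(all)'` placeholder are hypothetical and chosen for illustration; the sketch groups by every subset of the key columns and stacks the results.

```python
from itertools import combinations

import pandas as pd

def cube(df, keys, value, label='(all)'):
    """Hypothetical helper: sum `value` for every subset of `keys`, like GROUP BY CUBE."""
    pieces = []
    for size in range(len(keys), -1, -1):
        for subset in combinations(keys, size):
            subset = list(subset)
            if subset:
                part = df.groupby(subset, as_index=False)[value].sum()
            else:
                # Empty subset: the grand total
                part = pd.DataFrame({value: [df[value].sum()]})
            # Columns left out of this grouping get the placeholder label
            for key in keys:
                if key not in subset:
                    part[key] = label
            pieces.append(part[keys + [value]])
    return pd.concat(pieces, ignore_index=True)

# Hypothetical toy data
toy = pd.DataFrame({
    'category': ['Books', 'Books', 'Home'],
    'customer': ['New', 'Regular', 'New'],
    'sales': [10.0, 15.0, 7.0],
})
result = cube(toy, ['category', 'customer'], 'sales')
print(result)
```

Unlike a rollup, the cube also contains totals per `customer` across all categories, so the number of groupings doubles with each extra key column.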

## 11. Practice Exercises

### Exercise 1: Basic Pivot Table
Create a pivot table showing the total sales by customer type and product category. Include row and column totals.

In [None]:
# Your code here

### Exercise 2: Multiple Aggregations
Create a pivot table with region as the index and sales channel as the columns. Show both the total sales and average transaction value for each combination.

In [None]:
# Your code here

### Exercise 3: Cross-Tabulation
Create a cross-tabulation showing the distribution of sales channels across different customer types. Show the results as percentages, and visualize them with an appropriate chart.

In [None]:
# Your code here

### Exercise 4: Advanced Visualization
Create a pivot table showing the sales by product category and week. Visualize this as both a stacked bar chart and a heatmap. Which visualization is more effective for this data?

In [None]:
# Your code here

### Exercise 5: SQL to Pandas Translation
Translate the following SQL query to Pandas code using pivot tables:
```sql
SELECT
    product_category,
    customer_type,
    region,
    SUM(total_sales) as total_sales,
    COUNT(*) as num_transactions,
    AVG(unit_price) as avg_price
FROM sales
GROUP BY ROLLUP(product_category, customer_type, region)
```

In [None]:
# Your code here

## Next Steps

This session completes the three parts on Advanced Pandas Operations:
1. GroupBy operations
2. Aggregation functions
3. Pivot tables and cross-tabulations

You now have a solid foundation in data transformation and analysis using Pandas. In next week's sessions, we'll explore:
- Data reshaping operations
- Merge, join, and concatenate
- Time series data manipulation

These skills will be essential as we move toward working with the Olist Brazilian E-commerce dataset for your capstone project.