# Python for Data Analysis - Week 3
## Major Group Assignment: Advanced Pandas Operations

**Due Date:** Thursday, April 24, 2025

### Overview
In this assignment, your group will apply advanced Pandas operations covered in today's lecture: GroupBy operations, aggregation functions, and pivot tables/cross-tabulations. You'll use these techniques to analyze the customer purchase dataset and derive meaningful business insights.

### Learning Objectives
By completing this assignment, you will be able to:
- Apply GroupBy operations to segment and analyze data
- Use various aggregation functions to summarize data
- Create and interpret pivot tables and cross-tabulations
- Translate SQL GROUP BY queries to equivalent Pandas operations
- Visualize aggregated data for better interpretation

### Dataset
You will continue working with the customer purchase dataset (`customer_purchase_data.csv`) from the previous assignment. It contains information about customers, their demographics, and their purchase transactions.

### Assignment Structure
This assignment is divided into five parts:
1. Data Preparation
2. GroupBy Operations
3. Aggregation Functions
4. Pivot Tables and Cross-Tabulations
5. Business Analysis and Insights

### Submission Guidelines
- Submit your completed notebook via the course portal
- Include the names of all team members in the notebook
- Ensure all code cells are executed and outputs are visible
- Add comments to explain your code and reasoning
- One submission per group is sufficient

### Grading Criteria
- Correctness of code: 50%
- Quality of analysis and insights: 30%
- Visualizations and presentation: 15%
- Code documentation: 5%

Let's begin!

## Team Information

**Team Members:**
1. 
2. 
3. 
4. 

## Setup

First, let's import the necessary libraries and load the dataset.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

# For plotting in the notebook
%matplotlib inline

# Set plot styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

In [4]:
# Load the dataset
import pandas as pd
df = pd.read_csv('../Data/customer_purchase_data.csv')

# Display the first 10 rows
df.head(10)

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
0,1001,Male,34,72000,Engineer,Bachelor,East,Married,2024-03-05,P001,Laptop HP Elite,Electronics,Computers,1200.5,1.0,Credit Card
1,1001,Male,34,72000,Engineer,Bachelor,East,Married,2024-02-15,P045,External Hard Drive,Electronics,Accessories,89.99,2.0,Credit Card
2,1002,Female,28,65000,Teacher,Master,West,Single,2024-03-10,P012,Yoga Mat,Sports,Fitness,35.5,1.0,PayPal
3,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-03-07,P023,Coffee Maker,Home,Kitchen,149.99,,Debit Card
4,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-03-15,P056,Professional Blender,Home,Kitchen,299.95,1.0,Credit Card
5,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-01-22,P078,Smart Watch,Electronics,Wearables,249.99,1.0,Debit Card
6,1004,Male,52,120000,Lawyer,Master,East,Divorced,2024-03-12,P034,Business Suit,Clothing,Formal,599.99,1.0,Credit Card
7,1005,Female,21,35000,Student,High School,West,Single,2024-02-28,P045,External Hard Drive,Electronics,Accessories,89.99,1.0,PayPal
8,1005,Female,21,35000,Student,High School,West,Single,2024-03-14,P098,Wireless Earbuds,Electronics,Audio,129.95,1.0,PayPal
9,1006,Male,39,85000,Manager,Bachelor,Central,Married,2024-03-02,P056,Professional Blender,Home,Kitchen,299.95,,Credit Card


## Part 1: Data Preparation (10 points)

Before diving into the advanced operations, let's prepare the dataset by cleaning and transforming it. In this section, you'll handle missing values and create derived columns that will be useful for the analysis.

### 1.1 Data Cleaning

Perform the following data cleaning steps:
1. Convert the 'PurchaseDate' column to datetime format
2. Fill missing values in the 'Quantity' column with the median quantity for the corresponding product category
3. Check if there are any other missing values and handle them appropriately
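
The cleaning steps above can be sketched on a small toy frame. The values below are made up for illustration and are not taken from the course dataset; only the column names follow the assignment. The sketch assumes a group-wise median via `transform` is an acceptable way to fill gaps.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "PurchaseDate": ["2024-03-05", "2024-02-15", "2024-03-10", "2024-03-07"],
    "Category": ["Electronics", "Electronics", "Home", "Home"],
    "Quantity": [1.0, np.nan, 2.0, np.nan],
})

# Parse the date strings into datetime64 values
toy["PurchaseDate"] = pd.to_datetime(toy["PurchaseDate"])

# Median quantity per category, aligned row by row with the original frame
category_median = toy.groupby("Category")["Quantity"].transform("median")

# Fill only the missing quantities with the median of their category
toy["Quantity"] = toy["Quantity"].fillna(category_median)

# Count the remaining missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```

`transform` returns a result with the same index as the input, which is why it can be passed straight to `fillna` without a merge.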

In [None]:
# 1. Convert PurchaseDate to datetime
# Your code here

In [None]:
# 2. Fill missing Quantity values with median by category
# Your code here

In [None]:
# 3. Check and handle other missing values
# Your code here

### 1.2 Feature Engineering

Create the following derived columns that will be useful for the analysis:
1. 'TotalAmount': Price × Quantity
2. 'PurchaseYear': Extract year from PurchaseDate
3. 'PurchaseMonth': Extract month from PurchaseDate
4. 'PurchaseQuarter': Extract quarter from PurchaseDate (Q1, Q2, Q3, Q4)
5. 'AgeGroup': Categorize customers into age groups ('18-25', '26-35', '36-45', '46-55', '56+')
6. 'IncomeGroup': Categorize customers into income groups ('Low', 'Medium', 'High')
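
A minimal sketch of these derived columns on toy data follows. The income cut-offs (50,000 and 100,000) and the upper age bound of 120 are assumptions chosen for illustration; the assignment does not specify them, so pick and justify your own.

```python
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "PurchaseDate": pd.to_datetime(["2024-01-22", "2024-05-03", "2024-11-30"]),
    "Price": [10.0, 20.0, 5.0],
    "Quantity": [2, 1, 3],
    "Age": [21, 34, 60],
    "Income": [35000, 72000, 120000],
})

# Row-wise product of two columns
toy["TotalAmount"] = toy["Price"] * toy["Quantity"]

# Date parts via the .dt accessor
toy["PurchaseYear"] = toy["PurchaseDate"].dt.year
toy["PurchaseMonth"] = toy["PurchaseDate"].dt.month
toy["PurchaseQuarter"] = "Q" + toy["PurchaseDate"].dt.quarter.astype(str)

# pd.cut uses right-inclusive bins by default: (17, 25] -> '18-25', (25, 35] -> '26-35', ...
age_bins = [17, 25, 35, 45, 55, 120]  # 120 is an assumed upper bound
age_labels = ["18-25", "26-35", "36-45", "46-55", "56+"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=age_bins, labels=age_labels)

# Hypothetical income thresholds; the assignment leaves these to you
income_bins = [0, 50000, 100000, float("inf")]
toy["IncomeGroup"] = pd.cut(toy["Income"], bins=income_bins, labels=["Low", "Medium", "High"])

print(toy[["TotalAmount", "PurchaseQuarter", "AgeGroup", "IncomeGroup"]])
```

`pd.cut` returns a categorical column, so the group labels keep their logical order when sorted or plotted.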

In [None]:
# 1. Create TotalAmount column
# Your code here

In [None]:
# 2-4. Extract PurchaseYear, PurchaseMonth, and PurchaseQuarter
# Your code here

In [None]:
# 5. Create AgeGroup column
# Your code here

In [None]:
# 6. Create IncomeGroup column
# Your code here

## Part 2: GroupBy Operations (25 points)

In this section, you'll use the GroupBy functionality to segment and analyze the data. GroupBy operations are similar to SQL's GROUP BY clause and are essential for aggregating data based on one or more columns.

### 2.1 Basic GroupBy Operations

Perform the following GroupBy operations:
1. Group by 'Category' and calculate the count, sum, mean, min, and max of 'Price'
2. Group by 'Region' and calculate the average 'Income' and 'Age' of customers
3. Group by 'AgeGroup' and 'Gender' and calculate the average 'TotalAmount' spent
4. Group by 'IncomeGroup' and count the number of purchases in each product category
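
The patterns behind these tasks can be sketched on toy data (hypothetical values). One point to consider: each row of the dataset is a purchase, so a plain mean of a customer attribute such as 'Income' weights customers by how often they bought. The sketch shows both the per-purchase and the per-customer version; the assignment does not state which is intended.

```python
import pandas as pd

# Toy data (hypothetical values): customer 1 appears twice
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 4],
    "Region": ["East", "East", "East", "West", "West"],
    "Category": ["Electronics", "Home", "Home", "Electronics", "Sports"],
    "Income": [50000, 50000, 70000, 40000, 60000],
    "Price": [100.0, 50.0, 150.0, 300.0, 40.0],
})

# Several statistics of one column per group
price_stats = toy.groupby("Category")["Price"].agg(["count", "sum", "mean", "min", "max"])

# Mean over purchase rows: repeat buyers count more than once
income_per_purchase = toy.groupby("Region")["Income"].mean()

# Mean over distinct customers: keep one row per customer first
customers = toy.drop_duplicates(subset="CustomerID")
income_per_customer = customers.groupby("Region")["Income"].mean()

# Purchase counts per (Region, Category), reshaped into a grid
category_counts = toy.groupby(["Region", "Category"]).size().unstack(fill_value=0)

print(price_stats)
print(category_counts)
```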

In [None]:
# 1. Group by Category and calculate price statistics
# Your code here

In [None]:
# 2. Group by Region and calculate average Income and Age
# Your code here

In [None]:
# 3. Group by AgeGroup and Gender and calculate average TotalAmount
# Your code here

In [None]:
# 4. Group by IncomeGroup and count purchases by category
# Your code here

### 2.2 Multi-level GroupBy

Now, let's explore multi-level groupby operations:
1. Group by 'Region', 'Category', and 'Subcategory' to calculate the total sales amount
2. Group by 'PurchaseYear', 'PurchaseMonth', and 'Category' to analyze monthly sales trends
3. Group by 'MaritalStatus', 'Gender', and 'AgeGroup' to compare purchasing behavior
4. Group by 'PaymentMethod', 'Category', and 'IncomeGroup' to analyze payment preferences
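
Grouping by several keys returns a result indexed by a `MultiIndex`. The following sketch, on hypothetical toy values, shows how to select from such a result and how to flatten it back into ordinary columns.

```python
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "Region": ["East", "East", "East", "West"],
    "Category": ["Home", "Home", "Sports", "Home"],
    "Subcategory": ["Kitchen", "Kitchen", "Fitness", "Decor"],
    "TotalAmount": [100.0, 50.0, 30.0, 80.0],
})

# One total per (Region, Category, Subcategory) combination
sales = toy.groupby(["Region", "Category", "Subcategory"])["TotalAmount"].sum()

# Select a single combination with a tuple of keys
east_kitchen = sales.loc[("East", "Home", "Kitchen")]

# Turn the index levels back into columns for sorting, merging, or plotting
sales_flat = sales.reset_index(name="TotalSales")

print(sales)
print(sales_flat)
```

Only combinations that occur in the data appear in the result, so the toy frame yields three rows rather than every possible Region × Category × Subcategory triple.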

In [None]:
# 1. Group by Region, Category, and Subcategory
# Your code here

In [None]:
# 2. Group by PurchaseYear, PurchaseMonth, and Category
# Your code here

In [None]:
# 3. Group by MaritalStatus, Gender, and AgeGroup
# Your code here

In [None]:
# 4. Group by PaymentMethod, Category, and IncomeGroup
# Your code here

### 2.3 Visualization of GroupBy Results

Create at least three visualizations to illustrate key insights from your GroupBy operations. For example:
1. Bar chart showing average purchase amount by category
2. Grouped bar chart comparing spending by gender across different age groups
3. Line chart showing sales trends over time by category
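
As one possible approach to the grouped bar chart, the sketch below unstacks a two-key GroupBy result so that each 'Gender' becomes its own bar series. The data values are hypothetical, and the labels and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "AgeGroup": ["18-25", "18-25", "26-35", "26-35"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "TotalAmount": [40.0, 60.0, 120.0, 90.0],
})

# Rows = age groups, columns = genders, values = mean spend
avg_spend = toy.groupby(["AgeGroup", "Gender"])["TotalAmount"].mean().unstack("Gender")

# Each column of the frame becomes one bar series
fig, ax = plt.subplots(figsize=(6, 4))
avg_spend.plot(kind="bar", ax=ax)
ax.set_xlabel("Age group")
ax.set_ylabel("Average total amount")
ax.set_title("Average spend by age group and gender")
fig.tight_layout()
```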

In [None]:
# Visualization 1
# Your code here

In [None]:
# Visualization 2
# Your code here

In [None]:
# Visualization 3
# Your code here

## Part 3: Aggregation Functions (25 points)

In this section, you'll explore various aggregation functions that can be applied to GroupBy objects to summarize data in different ways.

### 3.1 Standard Aggregation Functions

Apply the following standard aggregation functions to different groupby objects:
1. Use `agg()` with multiple functions (count, sum, mean, median, min, max) on 'TotalAmount' grouped by 'Category'
2. Use `describe()` to get a comprehensive summary of 'Price' grouped by 'Subcategory'
3. Use `size()` to count the number of purchases by 'Region' and 'PaymentMethod'
4. Use `nunique()` to count the number of unique customers in each 'Region' and 'Category'
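
The four methods named above behave as sketched here on hypothetical toy values. Note the difference between `size()`, which counts rows, and `nunique()`, which counts distinct values.

```python
import pandas as pd

# Toy data (hypothetical values): customers 1 and 3 each bought twice
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "Region": ["East", "East", "East", "West", "West"],
    "Category": ["Home", "Home", "Sports", "Home", "Home"],
    "TotalAmount": [100.0, 50.0, 30.0, 80.0, 20.0],
})

# agg() with a list applies every function to the selected column
summary = toy.groupby("Category")["TotalAmount"].agg(
    ["count", "sum", "mean", "median", "min", "max"]
)

# describe() adds std and quartiles in one call
described = toy.groupby("Category")["TotalAmount"].describe()

# size() counts rows per group
purchases = toy.groupby(["Region", "Category"]).size()

# nunique() counts distinct values per group
unique_customers = toy.groupby(["Region", "Category"])["CustomerID"].nunique()

print(summary)
print(purchases)
print(unique_customers)
```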

In [None]:
# 1. Use agg() with multiple functions
# Your code here

In [None]:
# 2. Use describe() on Price grouped by Subcategory
# Your code here

In [None]:
# 3. Use size() to count purchases by Region and PaymentMethod
# Your code here

In [None]:
# 4. Use nunique() to count unique customers by Region and Category
# Your code here

### 3.2 Custom Aggregation Functions

Now, let's apply custom aggregation functions to our grouped data:
1. Define a custom function to calculate the range (max - min) of prices and apply it to data grouped by 'Category'
2. Define a custom function to calculate the coefficient of variation (std/mean) of 'TotalAmount' by 'PaymentMethod'
3. Use a lambda function to calculate the percentage of high-value orders ('TotalAmount' > $500) in each 'Category'
4. Use `agg()` with a dictionary to apply different aggregation functions to different columns grouped by 'AgeGroup' and 'Gender'
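
A minimal sketch of custom aggregations on hypothetical toy values follows. The helper names `value_range` and `coeff_var` are illustrative, not part of pandas. The coefficient of variation here uses the pandas default sample standard deviation (`ddof=1`).

```python
import pandas as pd

def value_range(s):
    """Spread of a numeric Series: max minus min."""
    return s.max() - s.min()

def coeff_var(s):
    """Coefficient of variation: sample std divided by mean."""
    return s.std() / s.mean()

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B"],
    "Price": [100.0, 600.0, 200.0, 50.0, 70.0],
    "Quantity": [1, 1, 2, 1, 3],
})
toy["TotalAmount"] = toy["Price"] * toy["Quantity"]

# Any function that maps a Series to a scalar can be passed to agg()
price_range = toy.groupby("Category")["Price"].agg(value_range)
cv = toy.groupby("Category")["TotalAmount"].agg(coeff_var)

# The mean of a boolean Series is the fraction of True values
pct_high = toy.groupby("Category")["TotalAmount"].agg(lambda s: (s > 500).mean() * 100)

# A dictionary maps each column to its own function or list of functions
mixed = toy.groupby("Category").agg(
    {"Price": "mean", "Quantity": "sum", "TotalAmount": ["sum", "max"]}
)

print(price_range)
print(pct_high)
print(mixed)
```

Passing a list for any column in the dictionary gives the result two-level column labels such as `("TotalAmount", "sum")`.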

In [None]:
# 1. Custom function for price range by Category
# Your code here

In [None]:
# 2. Custom function for coefficient of variation by PaymentMethod
# Your code here

In [None]:
# 3. Lambda function for percentage of high-value orders
# Your code here

In [None]:
# 4. Using agg() with a dictionary for different aggregations
# Your code here

### 3.3 Group Filter Operations

Use the `filter()` method with custom functions to filter groups based on aggregated values:
1. Filter for categories with an average purchase amount greater than $300
2. Filter for regions with more than 10 unique customers
3. Filter for combinations of 'AgeGroup' and 'Gender' with at least 5 purchases
4. Filter for subcategories where the maximum price is at least twice the minimum price
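
`GroupBy.filter()` takes a function that receives each group as a DataFrame and returns a single boolean. It returns the original rows of the groups that pass, not an aggregated table. A sketch on hypothetical toy values:

```python
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "Category": ["Electronics", "Electronics", "Home", "Home", "Sports"],
    "CustomerID": [1, 2, 3, 3, 4],
    "TotalAmount": [500.0, 300.0, 100.0, 200.0, 40.0],
})

# Keep every row of the categories whose mean TotalAmount exceeds 300
high_avg = toy.groupby("Category").filter(lambda g: g["TotalAmount"].mean() > 300)

# Keep categories bought by at least two distinct customers
multi_customer = toy.groupby("Category").filter(lambda g: g["CustomerID"].nunique() >= 2)

# To list only the qualifying group names, aggregate and then mask
means = toy.groupby("Category")["TotalAmount"].mean()
qualifying = means[means > 300].index.tolist()

print(high_avg)
print(qualifying)
```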

In [None]:
# 1. Filter for categories with avg purchase > $300
# Your code here

In [None]:
# 2. Filter for regions with >10 unique customers
# Your code here

In [None]:
# 3. Filter for AgeGroup and Gender combinations with ≥5 purchases
# Your code here

In [None]:
# 4. Filter for subcategories where max price is ≥2× min price
# Your code here

## Part 4: Pivot Tables and Cross-Tabulations (25 points)

In this section, you'll create and analyze pivot tables and cross-tabulations, which are powerful tools for summarizing data and identifying patterns and relationships.

### 4.1 Basic Pivot Tables

Create the following pivot tables:
1. A pivot table with 'Category' as index, 'Region' as columns, and the sum of 'TotalAmount' as values
2. A pivot table with 'AgeGroup' as index, 'Gender' as columns, and the mean of 'Price' as values
3. A pivot table with 'PurchaseMonth' as index, 'Category' as columns, and the count of purchases as values
4. A pivot table with 'PaymentMethod' as index, 'IncomeGroup' as columns, and the sum of 'TotalAmount' as values
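
A basic `pivot_table` call can be sketched as follows on hypothetical toy values. Using `fill_value=0` for empty cells is a presentation choice, not a requirement of the assignment.

```python
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "Category": ["Home", "Home", "Sports", "Home"],
    "Region": ["East", "West", "East", "East"],
    "TotalAmount": [100.0, 80.0, 30.0, 50.0],
})

# Rows = Category, columns = Region, cells = summed TotalAmount
revenue = toy.pivot_table(
    index="Category", columns="Region", values="TotalAmount",
    aggfunc="sum", fill_value=0,
)

# Counting purchases: count the non-missing entries of any fully populated column
counts = toy.pivot_table(
    index="Category", columns="Region", values="TotalAmount",
    aggfunc="count", fill_value=0,
)

print(revenue)
print(counts)
```

Without `fill_value`, combinations that never occur (here Sports in the West) show up as `NaN`.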

In [None]:
# 1. Pivot table: Category × Region with sum of TotalAmount
# Your code here

In [None]:
# 2. Pivot table: AgeGroup × Gender with mean of Price
# Your code here

In [None]:
# 3. Pivot table: PurchaseMonth × Category with count of purchases
# Your code here

In [None]:
# 4. Pivot table: PaymentMethod × IncomeGroup with sum of TotalAmount
# Your code here

### 4.2 Advanced Pivot Tables

Create more complex pivot tables with multiple levels and aggregations:
1. A pivot table with 'Region' and 'Category' as a multi-level index, 'Gender' as columns, and the sum of 'TotalAmount' as values
2. A pivot table with 'AgeGroup' as index, 'PaymentMethod' and 'Category' as a multi-level column, and the count of purchases as values
3. A pivot table with 'PurchaseMonth' as index, 'Category' as columns, and multiple aggregations (sum, mean, count) of 'TotalAmount' as values
4. A pivot table with 'MaritalStatus' and 'Gender' as a multi-level index, 'IncomeGroup' as columns, and the median of 'TotalAmount' as values
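
Multi-level pivot tables follow the same call, with lists passed to `index`, `columns`, or `aggfunc`. The sketch below uses hypothetical toy values and shows one variant of each.

```python
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Category": ["Home", "Sports", "Home", "Home", "Home"],
    "Gender": ["F", "M", "F", "M", "M"],
    "PurchaseMonth": [1, 1, 2, 2, 2],
    "TotalAmount": [100.0, 30.0, 80.0, 20.0, 50.0],
})

# Two index keys give a two-level row index
by_region_category = toy.pivot_table(
    index=["Region", "Category"], columns="Gender",
    values="TotalAmount", aggfunc="sum", fill_value=0,
)

# Two column keys give two-level column labels such as ("M", "Home")
by_gender_category = toy.pivot_table(
    index="PurchaseMonth", columns=["Gender", "Category"],
    values="TotalAmount", aggfunc="count", fill_value=0,
)

# A list of functions adds the function name as the top column level
multi_agg = toy.pivot_table(
    index="PurchaseMonth", columns="Category",
    values="TotalAmount", aggfunc=["sum", "mean", "count"],
)

print(by_region_category)
print(multi_agg)
```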

In [None]:
# 1. Multi-level index pivot table: Region × Category by Gender
# Your code here

In [None]:
# 2. Multi-level column pivot table: AgeGroup by PaymentMethod × Category
# Your code here

In [None]:
# 3. Pivot table with multiple aggregations
# Your code here

In [None]:
# 4. Multi-level index pivot table: MaritalStatus × Gender by IncomeGroup
# Your code here

### 4.3 Cross-Tabulations

Create the following cross-tabulations and analyze the results:
1. A cross-tabulation of 'Category' vs 'Region' showing the count of purchases
2. A cross-tabulation of 'PaymentMethod' vs 'AgeGroup' showing the count of purchases
3. A cross-tabulation of 'MaritalStatus' vs 'Gender' showing the count of purchases
4. Normalize each of the above cross-tabulations to show percentages instead of raw counts, and interpret the results
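
`pd.crosstab` counts co-occurrences of two columns, and its `normalize` argument controls what the percentages are relative to. A sketch on hypothetical toy values:

```python
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
toy = pd.DataFrame({
    "Category": ["Home", "Home", "Sports", "Home"],
    "Region": ["East", "West", "East", "East"],
})

# Raw counts of purchases per (Category, Region) pair
counts = pd.crosstab(toy["Category"], toy["Region"])

# normalize="index": each row sums to 100 (regional mix within a category)
row_pct = pd.crosstab(toy["Category"], toy["Region"], normalize="index") * 100

# normalize="columns": each column sums to 100 (category mix within a region)
col_pct = pd.crosstab(toy["Category"], toy["Region"], normalize="columns") * 100

# normalize="all": the whole table sums to 100
overall_pct = pd.crosstab(toy["Category"], toy["Region"], normalize="all") * 100

print(counts)
print(row_pct.round(1))
```

The three normalizations answer different questions, so state which one you used when you interpret the percentages.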

In [None]:
# 1. Cross-tabulation of Category vs Region
# Your code here

In [None]:
# 2. Cross-tabulation of PaymentMethod vs AgeGroup
# Your code here

In [None]:
# 3. Cross-tabulation of MaritalStatus vs Gender
# Your code here

In [None]:
# 4. Normalized cross-tabulations
# Your code here

**Interpretation of Cross-Tabulations:**

*Write your interpretation here*

### 4.4 Visualization of Pivot Tables

Create at least three visualizations to illustrate key insights from your pivot tables and cross-tabulations.

In [None]:
# Visualization 1
# Your code here

In [None]:
# Visualization 2
# Your code here

In [None]:
# Visualization 3
# Your code here

## Part 5: Business Analysis and Insights (15 points)

In this section, you'll translate SQL queries into Pandas, answer business questions based on your analysis in the previous sections, and summarize your findings with actionable recommendations.

### 5.1 SQL to Pandas Translation

Translate the following SQL queries into equivalent Pandas code and execute them. Queries 2 and 3 use the derived columns 'AgeGroup' and 'IncomeGroup' from Part 1.2.

**SQL Query 1: Monthly Sales by Category**
```sql
SELECT 
    EXTRACT(YEAR FROM PurchaseDate) AS Year,
    EXTRACT(MONTH FROM PurchaseDate) AS Month,
    Category,
    COUNT(*) AS NumPurchases,
    SUM(Price * Quantity) AS TotalSales,
    AVG(Price * Quantity) AS AvgOrderValue
FROM 
    customer_purchase_data
GROUP BY 
    EXTRACT(YEAR FROM PurchaseDate),
    EXTRACT(MONTH FROM PurchaseDate),
    Category
ORDER BY 
    Year, Month, TotalSales DESC;
```
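
As a general guide to this kind of translation, the sketch below maps a simpler, hypothetical query onto Pandas. The table and column names (`orders`, `Amount`) are invented for illustration. The mapping assumed here is: `GROUP BY` to `groupby`, `COUNT(*)` to `"size"`, `COUNT(DISTINCT x)` to `"nunique"`, and `ORDER BY` to `sort_values`.

```python
import pandas as pd

# Hypothetical query being translated:
#   SELECT Region,
#          COUNT(*)                   AS NumOrders,
#          COUNT(DISTINCT CustomerID) AS NumCustomers,
#          SUM(Amount)                AS Total,
#          AVG(Amount)                AS AvgAmount
#   FROM orders
#   GROUP BY Region
#   ORDER BY Total DESC;

# Toy data (hypothetical values)
orders = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "CustomerID": [1, 1, 2, 3, 3],
    "Amount": [100.0, 50.0, 80.0, 20.0, 30.0],
})

# Named aggregation: output_name=(input_column, function)
result = (
    orders.groupby("Region", as_index=False)
    .agg(
        NumOrders=("Amount", "size"),
        NumCustomers=("CustomerID", "nunique"),
        Total=("Amount", "sum"),
        AvgAmount=("Amount", "mean"),
    )
    .sort_values("Total", ascending=False)
)

print(result)
```

For a mixed sort order such as `ORDER BY a, b DESC`, `sort_values` accepts a list of columns together with a matching list for `ascending`, for example `sort_values(["a", "b"], ascending=[True, False])`.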

In [None]:
# Pandas equivalent of SQL Query 1
# Your code here

**SQL Query 2: Customer Demographics Analysis**
```sql
SELECT 
    AgeGroup,
    Gender,
    MaritalStatus,
    COUNT(DISTINCT CustomerID) AS NumCustomers,
    COUNT(*) AS NumPurchases,
    SUM(Price * Quantity) AS TotalSpend,
    AVG(Price * Quantity) AS AvgOrderValue,
    SUM(Price * Quantity) / COUNT(DISTINCT CustomerID) AS AvgSpendPerCustomer
FROM 
    customer_purchase_data
GROUP BY 
    AgeGroup, Gender, MaritalStatus
ORDER BY 
    TotalSpend DESC;
```

In [None]:
# Pandas equivalent of SQL Query 2
# Your code here

**SQL Query 3: Payment Method Analysis by Income Group**
```sql
SELECT 
    PaymentMethod,
    IncomeGroup,
    COUNT(*) AS NumPurchases,
    SUM(Price * Quantity) AS TotalSpend,
    AVG(Price * Quantity) AS AvgOrderValue,
    COUNT(*) * 100.0 / (SELECT COUNT(*) FROM customer_purchase_data WHERE IncomeGroup = cp.IncomeGroup) AS PctOfIncomeGroup
FROM 
    customer_purchase_data cp
GROUP BY 
    PaymentMethod, IncomeGroup
ORDER BY 
    IncomeGroup, NumPurchases DESC;
```
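
The correlated subquery in this query computes each row's share of its group total. One way to express that pattern in Pandas is `transform`, sketched here on hypothetical toy values with an invented `Segment` column standing in for the grouping key.

```python
import pandas as pd

# Toy data (hypothetical values and column name "Segment")
toy = pd.DataFrame({
    "PaymentMethod": ["Card", "Card", "PayPal", "Card", "PayPal"],
    "Segment": ["Low", "Low", "Low", "High", "High"],
})

# Rows per (Segment, PaymentMethod); as_index=False yields a column named "size"
counts = toy.groupby(["Segment", "PaymentMethod"], as_index=False).size()

# Total rows per Segment, broadcast back onto every row of that segment
segment_total = counts.groupby("Segment")["size"].transform("sum")

# Share of each payment method within its segment, in percent
counts["PctOfSegment"] = counts["size"] / segment_total * 100

print(counts)
```

Because `transform("sum")` keeps the original row alignment, no join back to a separate totals table is needed.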

In [None]:
# Pandas equivalent of SQL Query 3
# Your code here

### 5.2 Business Insights

Based on your analysis, answer the following business questions:

**Question 1: Customer Segmentation**

Which customer segments (based on demographics) appear to be the most valuable to the business in terms of total spend and average order value? What strategies would you recommend to target these segments more effectively?

*Your answer here*

**Question 2: Product Category Performance**

Which product categories show the highest and lowest performance in terms of sales volume, total revenue, and average order value? Are there any seasonal patterns or regional differences in category performance? What recommendations would you make to optimize the product mix?

*Your answer here*

**Question 3: Payment Method Preferences**

How do payment method preferences vary across different customer segments (age, income, region)? What does this suggest about customer behavior? What strategies could be implemented to optimize payment options?

*Your answer here*

**Question 4: Sales Trends and Seasonality**

Based on your analysis of sales over time, what trends or patterns do you observe? Is there evidence of seasonality in certain product categories? How should the business plan inventory and marketing based on these patterns?

*Your answer here*

**Question 5: Regional Analysis**

How do purchasing patterns differ across regions? Which regions show the highest potential for growth? What region-specific strategies would you recommend?

*Your answer here*

### 5.3 Executive Summary

Provide a concise executive summary (300-500 words) of your key findings and recommendations based on your analysis. This should highlight the most important insights and actionable recommendations for the business.

**Executive Summary:**

*Your executive summary here*

## Conclusion

In this assignment, you applied advanced Pandas operations (GroupBy, aggregation functions, pivot tables, and cross-tabulations) to analyze customer purchase data and derive business insights. These techniques are widely used in business analytics to turn analytical questions into data-driven answers.

Summarize what you've learned from this assignment and how you might apply these techniques in future data analysis projects.

*Your conclusion here*