# Python for Data Analysis - Week 3
## Minor Assignment: Pandas Fundamentals II

**Due Date:** Wednesday, April 23, 2025

### Overview
In this assignment, you will practice the core Pandas concepts covered in today's lecture: indexing and selection, filtering data, and handling missing values. You'll work with a customer purchase dataset to clean, transform, and extract insights from the data.

### Learning Objectives
By completing this assignment, you will be able to:
- Use different methods for indexing and selecting data in Pandas
- Apply filtering operations to extract specific subsets of data
- Identify and handle missing values using various techniques
- Apply these techniques to solve real-world data cleaning challenges

### Dataset
You will be working with a customer purchase dataset (`customer_purchase_data.csv`) containing information about customers, their demographics, and their purchase transactions.

### Submission Guidelines
- Submit your completed notebook via the course portal
- Include your name and student ID in the notebook
- Ensure all code cells are executed and outputs are visible
- Add comments to explain your code and reasoning

Let's begin!

## Student Information

**Name:**  
**Student ID:**  

## Setup

First, let's import the necessary libraries and load the dataset.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

# For plotting in the notebook
%matplotlib inline

In [None]:
# Load the dataset
df = pd.read_csv('../../Data/customer_purchase_data.csv')

# Display the first few rows
df.head()

## Part 1: Exploring the Dataset (10 points)

Before we dive into the main tasks, let's first explore the dataset to understand its structure and content.

### 1.1 Dataset Information

Examine the basic information about the dataset by answering the following questions:

1. How many rows and columns does the dataset have?
2. What are the column names and their data types?
3. Are there any missing values in the dataset? If so, in which columns?
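
As a warm-up, the three questions above map onto three standard pandas tools: `shape`, `dtypes` (or `info()`), and `isna().sum()`. The sketch below applies them to a small made-up DataFrame; the `toy` frame and its column names are hypothetical and are not part of the assignment dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "item": ["pen", "book", "lamp", "desk"],
    "price": [1.5, 12.0, np.nan, 150.0],
    "qty": [10, 2, 1, np.nan],
})

n_rows, n_cols = toy.shape               # (rows, columns) tuple
dtypes = toy.dtypes                      # data type of each column
missing_per_column = toy.isna().sum()    # number of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing_per_column)
toy.info()                               # column names, non-null counts, dtypes
```

The same calls should work unchanged on `df` once the dataset is loaded.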

In [None]:
# Check the shape of the dataset
# Your code here

In [None]:
# Display information about the dataset
# Your code here

In [None]:
# Check for missing values
# Your code here

## Part 2: Indexing and Selection (30 points)

In this section, you will practice various methods for selecting and indexing data in Pandas.

### 2.1 Basic Indexing

Use different indexing methods to extract the following from the dataset:

1. Select the 'CustomerID', 'Age', 'Income', and 'Region' columns using bracket notation
2. Select each of the same columns individually using dot notation (for example, `df.Age`); dot notation returns one column at a time
3. Select rows 10 through 20 (inclusive) using iloc
4. Select the first 5 customers who made purchases in the 'Electronics' category using loc
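
The four access styles listed above can be sketched on a small made-up DataFrame. The `toy` frame and its columns are hypothetical; adapt the pattern to the assignment's columns yourself.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay"],
    "dept": ["HR", "IT", "IT", "HR", "IT", "Ops"],
    "score": [71, 88, 93, 65, 79, 84],
})

# Bracket notation with a list of names returns a DataFrame
two_cols = toy[["name", "score"]]

# Dot notation returns a single column as a Series
one_col = toy.score

# iloc slices by position; the end position is exclusive,
# so 1:4 selects positions 1, 2 and 3
rows_1_to_3 = toy.iloc[1:4]

# loc accepts a boolean mask for rows and a list of labels for columns
first_two_it = toy.loc[toy["dept"] == "IT", ["name", "score"]].head(2)

print(rows_1_to_3)
print(first_two_it)
```

Note that dot notation only works for column names that are valid Python identifiers and do not clash with existing DataFrame attributes.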

In [None]:
# 1. Select columns using bracket notation
# Your code here

In [None]:
# 2. Select columns using dot notation
# Your code here

In [None]:
# 3. Select rows 10 through 20 using iloc
# Your code here

In [None]:
# 4. Select the first 5 customers who made purchases in the 'Electronics' category
# Your code here

### 2.2 Advanced Indexing

Now, let's explore more advanced indexing techniques:

1. Set the 'CustomerID' column as the index of the DataFrame
2. Select all purchase information for the customer with ID 1003 using the index
3. Create a MultiIndex using 'Region' and 'Category' as the index levels
4. Select all purchases in the 'East' region for the 'Electronics' category using the MultiIndex
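
A minimal sketch of label-based lookups with a single index and with a MultiIndex, using a made-up `toy` frame (all names are hypothetical). Sorting the MultiIndex before selecting is an assumption made here to keep lookups efficient; it is not required by the assignment.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "kind": ["A", "B", "A", "A"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Single-level index: look up a row by its label
by_id = toy.set_index("order_id")
row_102 = by_id.loc[102]

# Two-level index: pass a tuple with one label per level
by_city_kind = toy.set_index(["city", "kind"]).sort_index()
rome_a = by_city_kind.loc[("Rome", "A"), :]

print(row_102)
print(rome_a)
```

`set_index` returns a new DataFrame by default, so assigning the result to a new variable leaves the original frame untouched.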

In [None]:
# 1. Set 'CustomerID' as the index
# Your code here

In [None]:
# 2. Select all purchase information for customer 1003
# Your code here

In [None]:
# 3. Create a MultiIndex using 'Region' and 'Category'
# Your code here

In [None]:
# 4. Select all purchases in the 'East' region for the 'Electronics' category
# Your code here

### 2.3 Practical Application: Creating Customer Profiles

Now, use your indexing skills to create a customer profile DataFrame that contains the following information for each unique customer:
- CustomerID
- Gender
- Age
- Income
- Education
- Region
- MaritalStatus

Hint: You'll need to remove duplicate customer entries since the same customer may have made multiple purchases.
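
As an illustration of the hint, `drop_duplicates` with a `subset` keeps the first row for each key. The `toy` frame below is made up, and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per purchase, so members repeat
toy = pd.DataFrame({
    "member_id": [1, 1, 2, 3, 3],
    "city": ["Oslo", "Oslo", "Rome", "Lima", "Lima"],
    "item": ["pen", "book", "lamp", "desk", "pen"],
})

# Keep only the member-level columns, then keep one row per member
members = (
    toy[["member_id", "city"]]
    .drop_duplicates(subset="member_id")
    .reset_index(drop=True)
)

print(members)
```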

In [None]:
# Create customer profile DataFrame
# Your code here

## Part 3: Filtering Data (30 points)

In this section, you will practice applying filters to extract specific subsets of data.

### 3.1 Basic Filtering

Apply filters to find the following information:

1. Customers who are younger than 30 years old
2. Purchases made in the 'Electronics' category with a price greater than $500
3. Female customers who have made purchases in the 'Books' category
4. Customers from the 'West' region who are married
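
Filters like these are usually written as boolean masks. The sketch below uses a made-up `toy` frame (hypothetical names) to show a single condition and two conditions combined with `&`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "age": [25, 41, 33, 29],
    "dept": ["IT", "HR", "IT", "IT"],
    "salary": [50, 70, 90, 40],
})

# Single condition
under_30 = toy[toy["age"] < 30]

# Two conditions: both must hold for a row to be kept
it_high = toy[(toy["dept"] == "IT") & (toy["salary"] > 45)]

print(under_30)
print(it_high)
```

Use `|` for "or" and `~` for "not" in the same way.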

In [None]:
# Reset the index if an earlier step replaced the default one
if isinstance(df.index, pd.MultiIndex) or df.index.name is not None:
    df = df.reset_index()

# 1. Customers younger than 30
# Your code here

In [None]:
# 2. Electronics purchases with price > $500
# Your code here

In [None]:
# 3. Female customers who purchased books
# Your code here

In [None]:
# 4. Married customers from West region
# Your code here

### 3.2 Advanced Filtering

Now let's apply more complex filtering conditions:

1. Find high-value customers (Income > $90,000) who have made purchases in the 'Furniture' or 'Electronics' categories
2. Find customers who made purchases in January 2024 (hint: extract month and year from the PurchaseDate)
3. Find customers who have made multiple purchases (more than one transaction)
4. Find the top 5 most expensive products purchased using 'Credit Card' as the payment method
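
The tasks above rely on a few building blocks: `isin` for membership tests, the `.dt` accessor for date parts, `value_counts` for counting repeats, and `nlargest` for top-N rows. Here is a sketch on a made-up `toy` frame; all names and dates are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "member_id": [1, 2, 1, 3, 2, 1],
    "kind": ["A", "B", "C", "A", "B", "C"],
    "paid_on": pd.to_datetime([
        "2023-03-05", "2023-03-20", "2023-04-01",
        "2023-04-15", "2023-05-02", "2023-03-28",
    ]),
    "amount": [10.0, 55.0, 20.0, 80.0, 35.0, 60.0],
})

# Membership test: rows whose kind is either "A" or "C"
kind_a_or_c = toy[toy["kind"].isin(["A", "C"])]

# Date parts via the .dt accessor
in_march_2023 = toy[
    (toy["paid_on"].dt.year == 2023) & (toy["paid_on"].dt.month == 3)
]

# IDs that appear in more than one row
counts = toy["member_id"].value_counts()
repeat_ids = counts[counts > 1].index

# The two rows with the largest amount, in descending order
top_2 = toy.nlargest(2, "amount")

print(sorted(repeat_ids))
print(top_2)
```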

In [None]:
# 1. High-value customers who purchased Furniture or Electronics
# Your code here

In [None]:
# Convert PurchaseDate to datetime if not already
if not pd.api.types.is_datetime64_any_dtype(df['PurchaseDate']):
    df['PurchaseDate'] = pd.to_datetime(df['PurchaseDate'])

# 2. Customers who made purchases in January 2024
# Your code here

In [None]:
# 3. Customers with multiple purchases
# Your code here

In [None]:
# 4. Top 5 most expensive products purchased with Credit Card
# Your code here

### 3.3 Practical Application: Customer Segmentation

Use filtering to segment customers based on the following criteria:

1. Create a 'CustomerValue' column that categorizes customers as follows:
   - 'High': Income > $90,000
   - 'Medium': Income between $60,000 and $90,000 (inclusive)
   - 'Low': Income < $60,000
   
2. Create an 'AgeGroup' column that categorizes customers as follows:
   - 'Young': Age < 30
   - 'Middle-aged': Age between 30 and 45 (inclusive)
   - 'Senior': Age > 45
   
3. Create a 'PurchaseFrequency' column that categorizes customers as follows:
   - 'Frequent': More than 2 purchases
   - 'Occasional': 1-2 purchases

4. Create a cross-tabulation of CustomerValue and AgeGroup to see the distribution of customers
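
Two common ways to turn a numeric column into labeled bands are `pd.cut` (bin edges) and `np.select` (explicit conditions), and `pd.crosstab` then counts the combinations. The sketch below uses a made-up `toy` frame with hypothetical names and thresholds. The `pd.cut` bin edges assume whole-number ages, since the default intervals are closed on the right.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "age": [22, 30, 45, 46, 60],
    "salary": [40, 60, 75, 90, 120],
})

# pd.cut: intervals are (0, 29], (29, 45], (45, inf)
toy["age_band"] = pd.cut(
    toy["age"],
    bins=[0, 29, 45, np.inf],
    labels=["Young", "Middle", "Senior"],
)

# np.select: conditions are checked in order, the first match wins
conditions = [toy["salary"] > 90, toy["salary"] >= 60]
toy["pay_band"] = np.select(conditions, ["High", "Medium"], default="Low")

# Count how many rows fall into each combination of bands
table = pd.crosstab(toy["pay_band"], toy["age_band"])

print(toy)
print(table)
```

`np.select` makes the boundary handling explicit, which helps when a threshold must be inclusive on one side and exclusive on the other.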

In [None]:
# Create customer profile DataFrame if not already created
if 'customer_profiles' not in locals():
    customer_profiles = df[['CustomerID', 'Gender', 'Age', 'Income', 'Education', 'Region', 'MaritalStatus']].drop_duplicates(subset=['CustomerID'])

# 1. Create CustomerValue column
# Your code here

In [None]:
# 2. Create AgeGroup column
# Your code here

In [None]:
# 3. Create PurchaseFrequency column
# Your code here

In [None]:
# 4. Create cross-tabulation
# Your code here

## Part 4: Handling Missing Values (30 points)

In this section, you will identify and handle missing values in the dataset.

### 4.1 Identifying Missing Values

Let's first identify all missing values in the dataset:

1. Calculate the number of missing values in each column
2. Calculate the percentage of missing values in each column
3. Create a visualization that shows the pattern of missing values across columns
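
Counts and percentages of missing values both come from the boolean frame returned by `isna()`: summing it counts the `True` values, and averaging it gives the fraction. A sketch on a made-up `toy` frame (hypothetical columns):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

is_missing = toy.isna()                 # True where a value is missing
missing_count = is_missing.sum()        # missing values per column
missing_pct = is_missing.mean() * 100   # share of missing values per column, in percent

print(missing_count)
print(missing_pct)
```

One simple option for the visualization is a bar chart of the percentages, for example `missing_pct.plot(kind="bar")` with matplotlib installed.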

In [None]:
# 1. Count missing values in each column
# Your code here

In [None]:
# 2. Calculate percentage of missing values
# Your code here

In [None]:
# 3. Visualize missing values
# Your code here

### 4.2 Handling Missing Values

Now, let's apply different techniques to handle missing values:

1. Create a new DataFrame with the rows that have at least one missing value
2. Create a new DataFrame with rows that have no missing values
3. Fill missing Quantity values with the median quantity for that product category
4. Fill missing Price values with the mean price for that product category
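
Group-wise imputation is commonly written with `groupby(...).transform(...)`, which returns a value for every row that lines up with the original index and can therefore be passed to `fillna`. The sketch below uses a made-up `toy` frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "kind": ["A", "A", "A", "B", "B", "B"],
    "qty": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    "price": [2.0, 4.0, np.nan, 10.0, np.nan, 30.0],
})

# Rows with at least one missing value, and rows with none
has_missing = toy[toy.isna().any(axis=1)]
complete = toy.dropna()

# Fill each gap with a statistic computed within the row's own group
filled = toy.copy()
filled["qty"] = filled["qty"].fillna(
    filled.groupby("kind")["qty"].transform("median")
)
filled["price"] = filled["price"].fillna(
    filled.groupby("kind")["price"].transform("mean")
)

print(filled)
```

Working on a copy keeps the original frame available for comparison before and after filling.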

In [None]:
# 1. Rows with missing values
# Your code here

In [None]:
# 2. Rows with no missing values
# Your code here

In [None]:
# 3. Fill missing Quantity with median by category
# Your code here

In [None]:
# 4. Fill missing Price with mean by category
# Your code here

### 4.3 Practical Application: Creating a Clean Dataset

Create a clean version of the dataset by applying the following steps:

1. Fill missing Quantity values with the median quantity for that product category
2. Fill missing Price values with the mean price for that product category
3. Create a 'TotalAmount' column that multiplies Price by Quantity
4. Convert PurchaseDate to datetime format if not already
5. Create 'PurchaseMonth' and 'PurchaseYear' columns
6. Create a 'CustomerSpend' DataFrame that shows the total amount spent by each customer
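
The derived-column and aggregation steps above follow a common pattern: element-wise arithmetic on columns, `.dt` for date parts, and `groupby(...).sum()` for per-key totals. A sketch on a made-up `toy` frame with hypothetical names:

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "member_id": [1, 2, 1],
    "paid_on": ["2023-03-05", "2023-04-20", "2023-04-01"],
    "price": [2.0, 5.0, 10.0],
    "qty": [3, 1, 2],
})

# Parse the date strings so that the .dt accessor becomes available
toy["paid_on"] = pd.to_datetime(toy["paid_on"])

# Element-wise product of two columns
toy["total"] = toy["price"] * toy["qty"]

# Date parts as separate integer columns
toy["month"] = toy["paid_on"].dt.month
toy["year"] = toy["paid_on"].dt.year

# One row per member with the summed total
spend = toy.groupby("member_id", as_index=False)["total"].sum()

print(toy)
print(spend)
```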

In [None]:
# Create a copy of the DataFrame to work with
clean_df = df.copy()

# 1. Fill missing Quantity values
# Your code here

In [None]:
# 2. Fill missing Price values
# Your code here

In [None]:
# 3. Create TotalAmount column
# Your code here

In [None]:
# 4. Convert PurchaseDate to datetime
# Your code here

In [None]:
# 5. Create PurchaseMonth and PurchaseYear columns
# Your code here

In [None]:
# 6. Create CustomerSpend DataFrame
# Your code here

## Bonus Challenge (10 extra points)

Analyze the purchasing patterns of customers based on demographic factors:

1. For each age group, determine the most popular product category (by number of purchases)
2. For each income level ('High', 'Medium', 'Low'), calculate the average spend per purchase
3. Compare spending patterns between male and female customers across different product categories
4. Identify which regions have the highest average purchase amounts
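
Questions of this kind usually reduce to grouped aggregations. The sketch below shows three patterns on a made-up `toy` frame (hypothetical names): the most frequent value per group, a sorted group mean, and a two-way pivot table. It is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "band": ["Young", "Young", "Young", "Senior", "Senior"],
    "kind": ["A", "A", "B", "B", "B"],
    "sex": ["F", "M", "F", "M", "F"],
    "total": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Most frequent kind within each band
top_kind = toy.groupby("band")["kind"].agg(lambda s: s.value_counts().idxmax())

# Mean total per band, largest first
mean_by_band = toy.groupby("band")["total"].mean().sort_values(ascending=False)

# Mean total for every kind/sex combination
by_sex_kind = toy.pivot_table(
    index="kind", columns="sex", values="total", aggfunc="mean"
)

print(top_kind)
print(mean_by_band)
print(by_sex_kind)
```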

In [None]:
# 1. Most popular product category by age group
# Your code here

In [None]:
# 2. Average spend per purchase by income level
# Your code here

In [None]:
# 3. Spending patterns by gender across product categories
# Your code here

In [None]:
# 4. Regions with highest average purchase amounts
# Your code here

## Summary

In this assignment, you've practiced various techniques for indexing, selecting, and filtering data in Pandas, as well as identifying and handling missing values. These skills are essential for any data analysis project and form the foundation for more advanced data manipulation operations.

Summarize what you've learned from this assignment and how you might apply these techniques in future data analysis tasks.

*Your summary here*