# Pandas Exercises

## Overview
These exercises use realistic sales data that is intentionally messy.
You'll work with three CSV files in the **/files** folder:
- **products_master.csv** (8 rows): Product catalog with pricing
- **sales_records.csv** (260 rows): Transaction data
- **regions_dim.csv** (5 rows): Region code mappings

### Data Issues to Solve
- Some products have **no sales**
- Some sales reference **unknown products**
- Some region codes appear in only **one** of the two tables (sales or regions_dim)

## Exercise 1: Basic Filtering & Boolean Logic
**Goal**: Filter sales data using multiple conditions

### Tasks:
1. Find all sales in AMER or EMEA regions
2. Filter for Plus (P-PLS) or Pro (P-PRO) products only
3. Show transactions with 20 or more units
4. Combine all three filters: AMER/EMEA + Plus/Pro + units ≥ 20

### Hints:
```python
# Multiple conditions with &, |, and parentheses
df[(df['region_code'].isin(['AMER', 'EMEA'])) & (df['units'] >= 20)]

# OR logic for products
df[df['product_id'].isin(['P-PLS', 'P-PRO'])]
```

### Expected Output:
- Task 1: ~130-150 rows
- Task 4: ~10-15 rows (highly filtered)
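
A minimal sketch of Task 4 on a toy frame, assuming the column names used in the hints. The rows, and the `APAC` and `P-BAS` codes, are made up for illustration. Naming each mask keeps the combined condition readable and avoids precedence mistakes with `&` and `|`.

```python
import pandas as pd

# Toy stand-in for sales_records.csv (all values are made up)
toy = pd.DataFrame({
    "region_code": ["AMER", "EMEA", "APAC", "AMER", "EMEA"],
    "product_id":  ["P-PLS", "P-PRO", "P-PLS", "P-BAS", "P-PRO"],
    "units":       [25, 10, 40, 30, 20],
})

# One boolean mask per condition
in_region = toy["region_code"].isin(["AMER", "EMEA"])
plus_or_pro = toy["product_id"].isin(["P-PLS", "P-PRO"])
big_order = toy["units"] >= 20

# Combine with & (element-wise AND)
combined = toy[in_region & plus_or_pro & big_order]
print(combined)
```

Only the first and last rows satisfy all three conditions.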

```python
# Code here
```

## Exercise 2: Aggregation & Grouping
**Goal**: Summarize sales by different dimensions

### Tasks:
1. Calculate total revenue and units sold by product_id
2. Find average unit_price per region_code
3. Create a summary table: product_id × region_code with total revenue
4. Extract year-month from order_date and sum revenue by month
5. Find the top 3 best-selling products (by units) in each region

### Hints:
```python
# Basic groupby
df.groupby('product_id')['revenue'].sum()

# Multiple aggregations
df.groupby('region_code').agg({
    'revenue': 'sum',
    'units': 'mean'
})

# Year-month extraction
df['year_month'] = pd.to_datetime(df['order_date']).dt.to_period('M')

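# Top N per group: total units per region/product first,
# then keep the 3 largest totals within each region
totals = df.groupby(['region_code', 'product_id'], as_index=False)['units'].sum()
totals.sort_values('units', ascending=False).groupby('region_code').head(3)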
```

### Expected Output:
- Task 1: 8-9 products (some in sales but not in master)
- Task 3: ~20-30 product-region combinations
- Task 5: 15 rows (3 products × 5 regions, though some regions may have fewer)
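
Task 5 asks for the best-selling *products*, so units have to be summed per region and product before ranking. Ranking raw rows would return the largest individual transactions instead. A small sketch on made-up data (the product codes `A` to `D` are placeholders):

```python
import pandas as pd

# Toy sales: several transactions per product (values are made up)
toy = pd.DataFrame({
    "region_code": ["AMER"] * 5 + ["EMEA"] * 3,
    "product_id":  ["A", "A", "B", "C", "D", "A", "B", "B"],
    "units":       [5, 6, 10, 2, 1, 3, 4, 4],
})

# Step 1: total units per region/product
totals = toy.groupby(["region_code", "product_id"], as_index=False)["units"].sum()

# Step 2: sort within region, then keep the first 3 rows of each region
top3 = (
    totals.sort_values(["region_code", "units"], ascending=[True, False])
          .groupby("region_code")
          .head(3)
          .reset_index(drop=True)
)
print(top3)
```

In this toy data EMEA has only two products, so it contributes two rows rather than three.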

```python
# Code here
```

## Exercise 3: Date Parsing & Time-Based Analysis
**Goal**: Work with date columns and extract time components

### Tasks:
1. Convert order_date to datetime format (it's already clean, use `pd.to_datetime()`)
2. Extract year, month, quarter, and day_of_week from order_date
3. Create a new column for "fiscal_quarter" (assuming fiscal year starts in April)
4. Calculate days_since_order (from October 22, 2025)
5. Group sales by quarter and calculate total revenue

### Hints:
```python
# Parse dates
df['order_date'] = pd.to_datetime(df['order_date'])

# Extract components
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['quarter'] = df['order_date'].dt.quarter
df['weekday'] = df['order_date'].dt.day_name()

# Fiscal quarter (Q1 = Apr-Jun, Q2 = Jul-Sep, etc.)
df['fiscal_quarter'] = ((df['order_date'].dt.month - 4) % 12 // 3) + 1

# Days since (subtract datetimes directly; .dt.days needs a timedelta Series)
df['days_since'] = (pd.Timestamp('2025-10-22') - df['order_date']).dt.days
```

### Expected Output:
- Sales span from early 2024 to late 2025
- You should see 7-8 calendar quarters
- Most recent orders have days_since near 0
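
To sanity-check the fiscal-quarter arithmetic, it can help to run it on a few hand-picked dates first. The sketch below uses made-up dates and the April fiscal-year start from Task 3.

```python
import pandas as pd

# Hand-picked dates: one per fiscal quarter (values chosen for illustration)
dates = pd.Series(pd.to_datetime(
    ["2024-01-15", "2024-04-01", "2024-09-30", "2025-10-21"]
))

# Shift months so April becomes 0, wrap with % 12, then bucket into 3-month blocks
fiscal_quarter = ((dates.dt.month - 4) % 12 // 3) + 1

# Whole days between each order and the reference date
days_since = (pd.Timestamp("2025-10-22") - dates).dt.days

print(pd.DataFrame({"date": dates, "fq": fiscal_quarter, "days_since": days_since}))
```

January lands in fiscal Q4 because the `% 12` wraps the negative month offset back into the 0-11 range.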

```python
# Code here
```

## Exercise 3B (Advanced): Handling Messy Date Formats
**Goal**: Parse dates with multiple formats and handle errors

If you want extra practice, create a test dataset with these date variations:
```python
messy_dates = [
    "2025-03-05",          # ISO format
    "2025/03/15",          # Slash format
    "15-03-2025",          # DD-MM-YYYY
    "March 25, 2025",      # Text format
    "1719878400",          # Unix timestamp
    "2025-02-29",          # Invalid date!
    "Not a date"
]
```

### Hints:
```python
# Coerce unparseable values to NaT instead of raising
df['date_clean'] = pd.to_datetime(df['date_messy'], errors='coerce')

# For mixed formats: pandas 2.0+ needs format='mixed'; otherwise the format is
# inferred from the first value and non-matching rows become NaT.
# infer_datetime_format=True only applies to pandas < 2.0 (deprecated since)

# Unix timestamps need numeric values and a unit
pd.to_datetime(pd.to_numeric(df['unix_timestamp']), unit='s')

# Check for NaT (Not a Time) values
df['date_clean'].isna().sum()
```
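
One version-independent way to handle the list above is to try each known format explicitly and fill only the rows that are still unparsed. This is a sketch, not the only approach. The format list and the rule that a 10-digit string is a Unix timestamp in seconds are assumptions made for this example.

```python
import pandas as pd

messy = pd.Series([
    "2025-03-05", "2025/03/15", "15-03-2025", "March 25, 2025",
    "1719878400", "2025-02-29", "Not a date",
])

# Candidate formats, tried in order (%B assumes English month names)
formats = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y", "%B %d, %Y"]

parsed = pd.to_datetime(messy, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    # fillna only touches rows that are still NaT
    parsed = parsed.fillna(pd.to_datetime(messy, format=fmt, errors="coerce"))

# Assumption: a 10-digit string is a Unix timestamp in seconds
is_epoch = messy.str.fullmatch(r"\d{10}")
epoch = pd.to_datetime(
    pd.to_numeric(messy.where(is_epoch), errors="coerce"), unit="s"
)
parsed = parsed.fillna(epoch)

print(parsed)
print("Unparsed:", parsed.isna().sum())
```

The invalid date `2025-02-29` and the free-text entry stay as `NaT`, which is what Task-style quality checks with `.isna()` rely on.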

```python
# Code here
```

## Exercise 4: Inner Join of Products with Sales
**Goal**: Understand inner joins and identify matched records

### Tasks:
1. Perform an inner join: sales → products on product_id
2. How many sales records are retained? How many are lost?
3. Which product_ids appear in sales but NOT in products_master?
4. Add product name and category to each sale
5. Calculate the discount: (list_price - unit_price) / list_price × 100

### Hints:
```python
# Inner join keeps only matching records
merged = sales_df.merge(
    products_df, 
    on='product_id', 
    how='inner'
)

# Find unmatched sales
unmatched = sales_df[~sales_df['product_id'].isin(products_df['product_id'])]

# Discount calculation
merged['discount_pct'] = (
    (merged['list_price'] - merged['unit_price']) / merged['list_price'] * 100
)
```

### Expected Output:
- Inner join should drop rows with P-TRI and P-OEM (unknown products)
- You'll lose ~20-30 sales records
- Products P-LGC and P-BTA won't appear (no sales for them)

```python
# Code here
```

## Exercise 5: Left Join - Keeping All Sales
**Goal**: Preserve all sales records even if products are unknown

### Tasks:
1. Perform a left join: sales (left) → products (right) on product_id
2. Identify sales with missing product information (NaN in product name)
3. What's the total revenue from "unknown" products?
4. Fill missing product names with "Unknown Product"
5. Create a data quality flag: `is_orphan` = True if product info is missing

### Hints:
```python
# Left join keeps ALL rows from left table
merged_left = sales_df.merge(
    products_df, 
    on='product_id', 
    how='left'
)

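# Find rows with missing product info
orphans = merged_left[merged_left['product'].isna()]

# Create the flag BEFORE filling; afterwards isna() would be False everywhere
merged_left['is_orphan'] = merged_left['product'].isna()

# Fill missing values (assign back instead of inplace=True on a column)
merged_left['product'] = merged_left['product'].fillna('Unknown Product')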
```

### Expected Output:
- All 260 sales records retained
- ~20-30 orphan records (P-TRI and P-OEM)
- These orphans represent real revenue that needs investigation!
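
The order of the flag and fill steps matters: once the missing names are filled, nothing is NaN any more. A small sketch with toy frames (revenue values are made up; the product names follow the ones mentioned in these exercises):

```python
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product_id": ["P-PLS", "P-TRI", "P-PRO"],
    "revenue": [100.0, 50.0, 200.0],
})
products = pd.DataFrame({
    "product_id": ["P-PLS", "P-PRO", "P-LGC"],
    "product": ["Plus", "Pro", "Legacy"],
})

merged = sales.merge(products, on="product_id", how="left")

# Flag first, while the missing names are still NaN
merged["is_orphan"] = merged["product"].isna()

# Then fill, assigning the result back to the column
merged["product"] = merged["product"].fillna("Unknown Product")

# Revenue attached to unknown products
orphan_revenue = merged.loc[merged["is_orphan"], "revenue"].sum()
print(merged)
print("Orphan revenue:", orphan_revenue)
```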

```python
# Code here
```

## Exercise 6: Right Join - Finding Unsold Products
**Goal**: Identify products with no sales history

### Tasks:
1. Perform a right join: sales (left) → products (right) on product_id
2. Which products have no sales? (NaN in order_id)
3. What's the total "potential revenue" from unsold products' list prices?
4. Are non-active products (status != 'Active', i.e. Discontinued or Preview) more likely to have no sales?

### Hints:
```python
# Right join keeps ALL rows from right table (products)
merged_right = sales_df.merge(
    products_df, 
    on='product_id', 
    how='right'
)

# Find products with no sales
no_sales = merged_right[merged_right['order_id'].isna()]

# Count by status
no_sales.groupby('status').size()
```

### Expected Output:
- P-LGC (Legacy, Discontinued) and P-BTA (Beta, Preview) have no sales
- These appear with all their product info but NaN sales data

```python
# Code here
```

## Exercise 7: Full Outer Join - Complete Picture
**Goal**: See ALL products and ALL sales, matched or not

### Tasks:
1. Perform a full outer join: sales ↔ products on product_id
2. Count records in three categories:
   - Matched (both sales and product info)
   - Sales orphans (sales only)
   - Product orphans (products only)
3. Calculate what % of revenue comes from "complete" records

### Hints:
```python
# Full outer join keeps ALL rows from both tables
merged_full = sales_df.merge(
    products_df, 
    on='product_id', 
    how='outer',
    indicator=True  # Adds _merge column!
)

# Check merge results
merged_full['_merge'].value_counts()
# 'both': matched records
# 'left_only': sales orphans
# 'right_only': product orphans

# Revenue from complete records
complete_revenue = merged_full[
    merged_full['_merge'] == 'both'
]['revenue'].sum()
```

### Expected Output:
- Total rows > 260 (includes products with no sales)
- ~230-240 "both" records
- ~20-30 "left_only" (unknown products)
- 2 "right_only" (unsold products)
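
The three categories and the revenue share can be checked on a toy pair of frames before running against the real files. The values below are made up; only the column names follow the hints.

```python
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product_id": ["P-PLS", "P-OEM", "P-PRO"],
    "revenue": [100.0, 50.0, 200.0],
})
products = pd.DataFrame({
    "product_id": ["P-PLS", "P-PRO", "P-BTA"],
    "product": ["Plus", "Pro", "Beta"],
})

merged = sales.merge(products, on="product_id", how="outer", indicator=True)

# How many rows fall into each category
counts = merged["_merge"].value_counts()

# Share of revenue on fully matched rows (sum() skips the NaN revenue
# of products that have no sales)
complete = merged.loc[merged["_merge"] == "both", "revenue"].sum()
share_pct = complete / merged["revenue"].sum() * 100

print(counts)
print(f"Complete-record revenue share: {share_pct:.1f}%")
```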

```python
# Code here
```

## Exercise 8: Region Mapping with Left Join
**Goal**: Add region names to sales using the regions dimension

### Tasks:
1. Join sales → regions_dim on region_code (left join)
2. Are there any sales with unmapped region codes? Which ones?
3. Create a clean region column that shows the full region name
4. Calculate revenue by region (using the full name)
5. Identify the "LATAM" mystery: this code appears in sales but not in regions_dim

### Hints:
```python
# Join with regions
sales_with_regions = sales_df.merge(
    regions_df,
    on='region_code',
    how='left'
)

# Find unmapped regions
unmapped = sales_with_regions[sales_with_regions['region'].isna()]

# Group by full region name
sales_with_regions.groupby('region')['revenue'].sum()
```

### Expected Output:
- Most regions map correctly
- LATAM appears in sales but has no mapping (NaN region name)
- ANZ appears in regions_dim but has no sales

```python
# Code here
```

## Exercise 9: Data Quality Check - Revenue Validation
**Goal**: Verify that revenue = units × unit_price

### Tasks:
1. Calculate expected_revenue = units × unit_price
2. Compare expected_revenue to recorded revenue
3. Flag records where the difference is > $0.01 (rounding tolerance)
4. What % of records have revenue discrepancies?
5. Investigate: are discrepancies related to specific products or regions?

### Hints:
```python
# Calculate expected revenue
df['expected_revenue'] = df['units'] * df['unit_price']

# Compare with tolerance
df['revenue_diff'] = abs(df['revenue'] - df['expected_revenue'])
df['has_discrepancy'] = df['revenue_diff'] > 0.01

# Analysis
df['has_discrepancy'].mean() * 100  # % with errors

# By product
df.groupby('product_id')['has_discrepancy'].mean()
```

### Expected Output:
- Most records should be accurate (expected_revenue ≈ revenue)
- Some records will have small rounding differences
- Flag any records with large discrepancies (>1%)
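
The absolute tolerance from Task 3 and the relative threshold mentioned above can sit side by side. A sketch on made-up rows, where only the last row is deliberately wrong:

```python
import pandas as pd

toy = pd.DataFrame({
    "units":      [2, 3, 10],
    "unit_price": [9.99, 19.95, 4.50],
    "revenue":    [19.98, 59.85, 50.00],  # last row should be 45.00
})

toy["expected_revenue"] = toy["units"] * toy["unit_price"]
toy["revenue_diff"] = (toy["revenue"] - toy["expected_revenue"]).abs()

# Absolute tolerance: more than one cent off
toy["has_discrepancy"] = toy["revenue_diff"] > 0.01

# Relative check: more than 1% off the expected value
toy["large_discrepancy"] = toy["revenue_diff"] / toy["expected_revenue"] > 0.01

pct_flagged = toy["has_discrepancy"].mean() * 100
print(toy)
print(f"{pct_flagged:.1f}% of rows flagged")
```

Comparing with a tolerance rather than `==` avoids false alarms from floating-point rounding in the multiplication.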

```python
# Code here
```

## Exercise 10: String Cleaning for Better Joins
**Goal**: Handle common string matching issues

### Tasks:
1. Check for whitespace issues: leading/trailing spaces in product_id or region_code
2. Check for case inconsistencies (AMER vs amer vs Amer)
3. Create cleaned versions of string columns using `.str.strip()` and `.str.upper()`
4. Re-run joins with cleaned data and compare results
5. Create a function to clean string columns before joining

### Hints:
```python
# Check for leading/trailing whitespace (spaces, tabs, newlines)
df['product_id'].str.contains(r'^\s|\s$', regex=True).any()

# Clean strings
df['product_id_clean'] = df['product_id'].str.strip().str.upper()

# Reusable function
def clean_string_column(series):
    """Clean a string column for joining"""
    return series.str.strip().str.upper().str.replace(' +', ' ', regex=True)

# Apply before join
sales_df['product_id'] = clean_string_column(sales_df['product_id'])
products_df['product_id'] = clean_string_column(products_df['product_id'])
```

### Expected Output:
- The current data is already clean, so this won't find issues
- But in real data, this cleaning step is ESSENTIAL
- Whitespace and case mismatches are a common cause of failed joins in production

### NOTE:
**ALWAYS clean string keys before joining.** Otherwise mismatched keys fail silently: pandas raises no error, and the affected rows simply go unmatched.
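
Since the exercise files are already clean, the effect is easier to see on deliberately dirty keys. In this sketch the messy `product_id` values are invented, and `clean_key` is a hypothetical helper name.

```python
import pandas as pd

# Sales keys with stray whitespace and lowercase (made up for illustration)
sales = pd.DataFrame({
    "product_id": [" p-pls", "P-PRO ", "P-PLS"],
    "units": [1, 2, 3],
})
products = pd.DataFrame({
    "product_id": ["P-PLS", "P-PRO"],
    "product": ["Plus", "Pro"],
})

def clean_key(series):
    """Strip surrounding whitespace and uppercase a string key column."""
    return series.str.strip().str.upper()

# Join on raw keys: no error, but two of three rows silently drop out
before = sales.merge(products, on="product_id", how="inner")

# Join on cleaned keys: every row finds its product
sales_clean = sales.assign(product_id=clean_key(sales["product_id"]))
after = sales_clean.merge(products, on="product_id", how="inner")

print(len(before), "rows before cleaning,", len(after), "rows after")
```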

```python
# Code here
```

## Exercise 11: Multi-Table Join Chain
**Goal**: Combine all three tables in sequence

### Tasks:
1. Start with sales_df
2. Left join → products_master (on product_id)
3. Then left join → regions_dim (on region_code)
4. Create a comprehensive view with all dimensions
5. Calculate revenue by product category and region

### Hints:
```python
# Chain joins
result = (sales_df
    .merge(products_df, on='product_id', how='left')
    .merge(regions_df, on='region_code', how='left')
)

# Multi-level grouping
result.groupby(['category', 'region'])['revenue'].sum().unstack()
```

### Expected Output:
- A table with 260 rows and ~15 columns (all dimensions)
- Some NaN values where lookups failed
- Pivot table showing category × region performance
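
A chained left join should never change the row count when each dimension key is unique. The sketch below checks that on toy frames; the `category` and `region` values are assumptions for illustration. The `validate="many_to_one"` argument makes pandas raise if a dimension table has duplicate keys, and `dropna=False` keeps failed lookups visible in the grouped result.

```python
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "product_id": ["P-PLS", "P-PRO", "P-TRI", "P-PLS"],
    "region_code": ["AMER", "EMEA", "AMER", "LATAM"],
    "revenue": [100.0, 200.0, 50.0, 80.0],
})
products = pd.DataFrame({
    "product_id": ["P-PLS", "P-PRO"],
    "product": ["Plus", "Pro"],
    "category": ["Core", "Core"],          # assumed categories
})
regions = pd.DataFrame({
    "region_code": ["AMER", "EMEA", "ANZ"],
    "region": ["Americas", "Europe", "Australia/NZ"],  # assumed names
})

result = (
    sales
    .merge(products, on="product_id", how="left", validate="many_to_one")
    .merge(regions, on="region_code", how="left", validate="many_to_one")
)

# Left joins against unique keys keep exactly one row per sale
assert len(result) == len(sales)

# dropna=False keeps groups whose category or region lookup failed
summary = result.groupby(["category", "region"], dropna=False)["revenue"].sum()
print(summary)
```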

```python
# Code here
```

## Exercise 12: Aggregation After Joins (Advanced)
**Goal**: Answer complex business questions using joined data

### Tasks:
1. What's the total revenue for each product category?
2. Which region has the highest average discount rate?
3. What's the average order size (units) for Core products vs. Add-ons?
4. Calculate revenue per product status (Active, Discontinued, Preview)
5. Find the month with the highest revenue for each product category

### Hints:
```python
# After joining all tables...

# Revenue by category
result.groupby('category')['revenue'].sum()

# Average discount by region (from Exercise 4)
result['discount_rate'] = (result['list_price'] - result['unit_price']) / result['list_price']
result.groupby('region')['discount_rate'].mean()

# Time-based aggregation
result['month'] = pd.to_datetime(result['order_date']).dt.to_period('M')
result.groupby(['category', 'month'])['revenue'].sum().groupby(level=0).idxmax()
```

### Expected Output:
- Core category should dominate revenue
- Discount rates vary by region
- Clear seasonal patterns may emerge
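
For Task 5, the `idxmax` pattern from the hints returns a `(category, month)` tuple per category. A short sketch on made-up rows shows what comes back:

```python
import pandas as pd

toy = pd.DataFrame({
    "category":   ["Core", "Core", "Core", "Add-on", "Add-on"],
    "order_date": ["2024-01-10", "2024-02-05", "2024-02-20",
                   "2024-01-15", "2024-03-01"],
    "revenue":    [100.0, 60.0, 70.0, 40.0, 30.0],
})

toy["month"] = pd.to_datetime(toy["order_date"]).dt.to_period("M")

# Revenue per category and month (MultiIndex Series)
monthly = toy.groupby(["category", "month"])["revenue"].sum()

# For each category, the index label of its highest-revenue month
best = monthly.groupby(level="category").idxmax()

for category, (_, month) in best.items():
    print(category, "->", month, monthly[(category, month)])
```

Here Core peaks in February because its two February orders are summed before comparing months.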

```python
# Code here
```

## Exercise 13: Pivot Tables & Cross-Tabs
**Goal**: Create Excel-style pivot tables in pandas

### Tasks:
1. Create a pivot table: products (rows) × regions (columns), values = total revenue
2. Create another pivot: months (rows) × products (columns), values = units sold
3. Calculate percentage of total revenue by product within each region
4. Find which product-region combination has the highest average order value

### Hints:
```python
# Pivot table
pivot_revenue = result.pivot_table(
    index='product',
    columns='region',
    values='revenue',
    aggfunc='sum',
    fill_value=0
)

# Percentage of total
pivot_pct = pivot_revenue.div(pivot_revenue.sum(axis=0), axis=1) * 100

# Cross-tab (alternative)
pd.crosstab(result['product'], result['region'], values=result['revenue'], aggfunc='sum')
```

### Expected Output:
- A matrix showing revenue distribution
- Empty cells (0 or NaN) where product-region combinations don't exist
- Clear winners and losers by segment


```python
# Code here
```

## Exercise 14: Unpivot (Melt) for Reshaping
**Goal**: Convert wide-format data back to long format

### Tasks:
1. Take the pivot table from Exercise 13 (products × regions)
2. Unpivot it back to long format using `.melt()`
3. Verify the total revenue matches the original
4. Filter to show only non-zero combinations

### Hints:
```python
# Melt (unpivot)
long_format = pivot_revenue.reset_index().melt(
    id_vars='product',
    var_name='region',
    value_name='revenue'
)

# Remove zeros
long_format = long_format[long_format['revenue'] > 0]

# Verify totals
print(f"Original: {result['revenue'].sum():.2f}")
print(f"After pivot/melt: {long_format['revenue'].sum():.2f}")
```

### Expected Output:
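- Long format with roughly 20-30 non-zero product-region combinations at most (8 products × 5 regions caps it at 40)
- Total revenue should match the original for rows that have both a product and a region; `pivot_table` drops rows whose `product` or `region` is NaN, so orphan sales and unmapped regions are excluded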

### When to Use:
- Pivot: For analysis and visualization (wide format)
- Melt: For database loading or further processing (long format)
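
A round trip on a toy frame makes the total check concrete. One row below has no product name, as an orphan sale would after a left join. All values and the region names are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "product": ["Plus", "Plus", "Pro", None],   # None = orphan sale
    "region":  ["Americas", "Europe", "Americas", "Americas"],
    "revenue": [100.0, 50.0, 200.0, 25.0],
})

# Wide: one row per product, one column per region
pivot = toy.pivot_table(
    index="product", columns="region", values="revenue",
    aggfunc="sum", fill_value=0,
)

# Back to long: one row per product/region pair
long_format = pivot.reset_index().melt(
    id_vars="product", var_name="region", value_name="revenue"
)
long_format = long_format[long_format["revenue"] > 0]

# Compare against rows that could actually enter the pivot
matched_total = toy.dropna(subset=["product", "region"])["revenue"].sum()
print("All rows:      ", toy["revenue"].sum())
print("Matched rows:  ", matched_total)
print("After melt:    ", long_format["revenue"].sum())
```

The melted total equals the matched-row total, not the grand total, because the row with a missing product never made it into the pivot.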

```python
# Code here
```

## Exercise 15: Finding Duplicates & Near-Duplicates
**Goal**: Identify and handle duplicate records

### Tasks:
1. Check for exact duplicates across all columns
2. Check for duplicates based only on order_id + product_id (same order, same product)
3. Find "near-duplicates": same order_id, product_id, units, but different revenue
4. Create a strategy to handle duplicates:
   - Keep first occurrence?
   - Keep highest revenue?
   - Flag for manual review?

### Hints:
```python
# Exact duplicates
duplicates = df[df.duplicated(keep=False)]

# Duplicates on specific columns
order_dupes = df[df.duplicated(subset=['order_id', 'product_id'], keep=False)]

# Find near-duplicates (same order/product/units, different revenue)
near_dupes = df.groupby(['order_id', 'product_id', 'units']).filter(
    lambda g: g['revenue'].nunique() > 1
)

# Remove duplicates
df_deduped = df.drop_duplicates(subset=['order_id', 'product_id'], keep='first')

# Or keep the one with highest revenue
df_deduped = df.sort_values('revenue', ascending=False).drop_duplicates(
    subset=['order_id', 'product_id'], 
    keep='first'
)
```

### Expected Output:
- The current dataset is clean (no duplicates)
- But know how to find them!
- In real data, 2-5% of rows are often duplicates
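
Because the exercise data has no duplicates, a toy frame with planted ones is a quick way to confirm each check behaves as intended. The rows below are invented: order 1 is a near-duplicate (revenue differs) and order 2 is an exact duplicate.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 3],
    "product_id": ["A", "A", "B", "B", "C"],
    "units":      [5, 5, 2, 2, 1],
    "revenue":    [50.0, 55.0, 20.0, 20.0, 10.0],
})

# Exact duplicates: every column identical
exact = toy[toy.duplicated(keep=False)]

# Near-duplicates: same key columns but more than one distinct revenue
keys = ["order_id", "product_id", "units"]
n_revenues = toy.groupby(keys)["revenue"].transform("nunique")
near = toy[n_revenues > 1]

# One possible strategy: keep the highest-revenue row per order/product
best = (
    toy.sort_values("revenue", ascending=False)
       .drop_duplicates(subset=["order_id", "product_id"], keep="first")
       .sort_index()
)
print(exact, near, best, sep="\n\n")
```

Using `transform("nunique")` instead of `filter` gives a per-row count that can also be stored as a review flag.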

```python
# Code here
```

## Bonus Exercise: Create a Reusable Data Cleaning Pipeline
**Goal**: Build a function that cleans and joins all data

### Task:
Create a function that:
1. Loads all three CSVs
2. Cleans string columns (strip, uppercase)
3. Parses dates
4. Performs all joins
5. Adds calculated fields (discount, margin, flags)
6. Returns a clean, analysis-ready DataFrame

### Template:
```python
import pandas as pd

def prepare_sales_data(sales_path, products_path, regions_path):
    """
    Load and prepare sales data for analysis
    
    Returns:
        pd.DataFrame: Clean, joined, analysis-ready data
    """
    # Load
    sales = pd.read_csv(sales_path)
    products = pd.read_csv(products_path)
    regions = pd.read_csv(regions_path)
    
    # Clean strings
    for df in [sales, products, regions]:
        for col in df.select_dtypes(include='object').columns:
            df[col] = df[col].str.strip().str.upper()
    
    # Parse dates
    sales['order_date'] = pd.to_datetime(sales['order_date'])
    
    # Join
    result = (sales
        .merge(products, on='product_id', how='left')
        .merge(regions, on='region_code', how='left')
    )
    
    # Add calculated fields
    result['discount_pct'] = ((result['list_price'] - result['unit_price']) 
                               / result['list_price'] * 100)
    result['is_orphan'] = result['product'].isna()
    
    # Add time dimensions
    result['year'] = result['order_date'].dt.year
    result['quarter'] = result['order_date'].dt.quarter
    result['month'] = result['order_date'].dt.month
    
    return result

# Use it
df = prepare_sales_data('sales_records.csv', 'products_master.csv', 'regions_dim.csv')
```

```python
# Code here
```