# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 01 · Lab 01B — Data Wrangling
**Instructor:** Amir Charkhi  |  **Duration:** 45 minutes  |  **Difficulty:** ⭐⭐⭐☆☆

> **Goal:** Master groupby, merge, pivot, and real-world data cleaning.


## Learning Objectives
- Master groupby operations and aggregations
- Perform different types of merges and joins
- Reshape data with pivot and melt
- Handle real-world messy data

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Ready for data wrangling! 🔧")

Ready for data wrangling! 🔧


## Part 1: GroupBy Mastery (15 minutes)

### Exercise 1.1 — Sales Team Performance (medium)
Analyze sales team performance across regions.

In [34]:
# Sales data
np.random.seed(42)
sales = pd.DataFrame({
    'salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'quantity': np.random.randint(1, 50, 100),
    'revenue': np.random.uniform(100, 5000, 100).round(2),
    'date': pd.date_range('2025-01-01', periods=100)
})

# TODO: Use groupby to find:
# 1. Total revenue by salesperson
revenue_by_person = sales.groupby('salesperson')['revenue'].sum().sort_values(ascending=False)
print(revenue_by_person)

# 2. Average sale amount by region
average_sales = sales.groupby('region')['revenue'].mean().sort_values(ascending=False)
print(average_sales)

# 3. Top product by quantity in each region
top_product = sales.groupby(['region', 'product'])['quantity'].sum().groupby('region').idxmax()
top_product_clean = top_product.apply(lambda x: x[1])  # Get second element of tuple
print(top_product_clean)

# 4. Sales performance by salesperson AND region (multi-level groupby)
# Your code here:
performance = sales.groupby(['salesperson', 'region'])['revenue'].agg(['sum', 'mean', 'count'])
print(performance)

salesperson
Diana      74760.62
Bob        59313.60
Charlie    53546.82
Alice      40206.96
Name: revenue, dtype: float64
region
East     2643.599333
South    2350.263500
North    2282.157308
West     1757.444167
Name: revenue, dtype: float64
region
East     A
North    A
South    C
West     C
Name: quantity, dtype: object
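
The `idxmax()` step in task 3 returns `(region, product)` tuples because the summed Series carries a two-level index, which is why the product has to be picked out of each tuple afterwards. A minimal sketch on a tiny made-up table (the `demo` frame and its values are illustrative assumptions, not the lab data):

```python
import pandas as pd

# Hypothetical toy data, not the lab's `sales` frame
demo = pd.DataFrame({
    'region':   ['North', 'North', 'North', 'South', 'South'],
    'product':  ['A', 'B', 'A', 'A', 'B'],
    'quantity': [5, 20, 10, 7, 3],
})

# Sum quantity per (region, product): the result has a two-level MultiIndex
totals = demo.groupby(['region', 'product'])['quantity'].sum()

# Within each region, idxmax returns the full index label, a (region, product) tuple
best_label = totals.groupby('region').idxmax()

# Keep only the product part of each tuple
best_product = best_label.apply(lambda label: label[1])
print(best_product)
```

Here `best_label` holds the tuples and `best_product` holds only the product codes, one per region.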

### Exercise 1.2 — Custom Aggregations (medium)
Apply multiple aggregation functions simultaneously.

In [35]:
# Customer orders
orders = pd.DataFrame({
    'customer_id': np.random.randint(1, 21, 100),
    'order_date': pd.date_range('2025-06-01', periods=100),
    'amount': np.random.uniform(20, 500, 100).round(2),
    'items': np.random.randint(1, 10, 100),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'], 100)
})

# TODO: Create a customer summary with:
# 1. Total spending
customer_summary = orders.groupby('customer_id')['amount'].sum().nlargest(5)
print(customer_summary)

# 2. Average order value
aov = orders['amount'].mean().round(2)
print(f"Average order value: ${aov}")

# 3. Number of orders
total_orders = len(orders)  # or orders.shape[0]
print(f"Total number of orders: {total_orders}")

# 4. Most frequent category
frequent_category = orders['category'].value_counts().idxmax()
print(f"Most frequent category: {frequent_category}")

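# 5. Days since last order
# Note: the orders run through 2025-09-08, so a reference date of 2025-09-01
# falls before the most recent order and the result is negative.
most_recent_order = orders['order_date'].max()
today = pd.Timestamp('2025-09-01')
days_since_last = (today - most_recent_order).days
print(f"Days since last order: {days_since_last}")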

# Use .agg() with dictionary or list of functions
# Your code here:
customer_summary = orders.groupby('customer_id')['amount'].agg(['sum', 'mean', 'count'])
print(customer_summary)

customer_id
17    2288.07
4     2209.61
5     2111.84
12    2090.49
16    1868.49
Name: amount, dtype: float64
Average order value: $273.05
Total number of orders: 100
Most frequent category: Clothing
Days since last order: -7
                 sum        mean  count
customer_id                            
1            1747.46  291.243333      6
2            1009.81  252.452500      4
3            1679.48  279.913333      6
4            2209.61  276.201250      8
5            2111.84  301.691429      7
7            1113.72  278.430000      4
8             693.39  346.695000      2
9            1603.07  320.614000      5
10           1186.71  237.342000      5
11            360.46  120.153333      3
12           2090.49  261.311250      8
13           1094.00  218.800000      5
14            823.82  274.606667      3
15            670.96  335.480000      2
16           1868.49  266.927143      7
17           2288.07  254.230000      9
18           1320.14  330.035000      4
19           
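
The task list asks for a summary per customer, while several of the values above are computed over the whole table. One way to get all five metrics per customer in a single pass is named aggregation. This is a sketch under stated assumptions: `demo_orders`, `reference_date`, and the output column names are illustrative, not part of the lab.

```python
import pandas as pd

# Hypothetical toy orders, not the lab's randomly generated `orders` frame
demo_orders = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'order_date': pd.to_datetime(['2025-06-01', '2025-06-10', '2025-07-01',
                                  '2025-06-05', '2025-06-20']),
    'amount': [100.0, 50.0, 30.0, 200.0, 100.0],
    'category': ['Books', 'Food', 'Books', 'Food', 'Food'],
})

# Reference date chosen after the last order so the day counts are non-negative
reference_date = pd.Timestamp('2025-07-31')

# Named aggregation: output_column=(input_column, function)
summary = demo_orders.groupby('customer_id').agg(
    total_spent=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    n_orders=('amount', 'count'),
    top_category=('category', lambda s: s.value_counts().idxmax()),
    last_order=('order_date', 'max'),
)
summary['days_since_last'] = (reference_date - summary['last_order']).dt.days
print(summary)
```

Each keyword names an output column, so the result has flat, readable column names rather than a two-level header.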

### Exercise 1.3 — Transform vs Aggregate (hard)
Understand the difference between transform and aggregate operations.

In [46]:
# Store sales data
store_sales = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'month': ['Jan', 'Feb', 'Mar'] * 3,
    'sales': [1000, 1200, 1100, 800, 900, 950, 1500, 1600, 1550]
})

# TODO: Use groupby to:
# 1. Add a column showing each store's average sales (use transform)
store_sales['average_sales'] = store_sales.groupby('store')['sales'].transform('mean')
print(store_sales)

# 2. Add a column showing percentage of store's total sales
store_sales['percentage_total_sales'] = (store_sales['sales'] / store_sales.groupby('store')['sales'].transform('sum')) * 100
print(store_sales)

# 3. Add a column indicating if sales are above store average
store_sales['sales_above_store_average'] = store_sales['sales'] > store_sales['average_sales']
print(store_sales)

# 4. Rank months within each store by sales
# Your code here:
store_sales['months_ranking_bystore'] = store_sales.groupby('store')['sales'].rank(ascending=False)
print(store_sales)

  store month  sales  average_sales
0     A   Jan   1000    1100.000000
1     A   Feb   1200    1100.000000
2     A   Mar   1100    1100.000000
3     B   Jan    800     883.333333
4     B   Feb    900     883.333333
5     B   Mar    950     883.333333
6     C   Jan   1500    1550.000000
7     C   Feb   1600    1550.000000
8     C   Mar   1550    1550.000000
  store month  sales  average_sales  percentage_total_sales
0     A   Jan   1000    1100.000000               30.303030
1     A   Feb   1200    1100.000000               36.363636
2     A   Mar   1100    1100.000000               33.333333
3     B   Jan    800     883.333333               30.188679
4     B   Feb    900     883.333333               33.962264
5     B   Mar    950     883.333333               35.849057
6     C   Jan   1500    1550.000000               32.258065
7     C   Feb   1600    1550.000000               34.408602
8     C   Mar   1550    1550.000000               33.333333
  store month  sales  average_sales  per
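
The difference between the two operations shows up in the shape of the result. A minimal sketch on a small made-up frame (`demo` is an illustrative assumption, not the lab's `store_sales`):

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'sales': [100, 300, 50, 150],
})

grouped = demo.groupby('store')['sales']

# agg collapses each group to a single value: one row per store
per_store = grouped.agg('mean')

# transform broadcasts the group value back to every original row,
# keeping the original index so it can be assigned as a new column
per_row = grouped.transform('mean')

print(per_store.shape)  # one entry per store
print(per_row.shape)    # one entry per original row
```

Because `transform` keeps the original index, its result can be assigned straight back to the frame, which is what makes the "percentage of store total" and "above store average" columns possible.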

## Part 2: Merging and Joining (15 minutes)

### Exercise 2.1 — Customer Database Integration (medium)
Merge customer information from multiple sources.

In [4]:
# Customer basic info
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
             'diana@email.com', 'eve@email.com']
})

# Customer addresses
addresses = pd.DataFrame({
    'customer_id': [1, 2, 3, 6],  # Note: customer 6 doesn't exist, 4 & 5 missing
    'city': ['Perth', 'Sydney', 'Melbourne', 'Brisbane'],
    'state': ['WA', 'NSW', 'VIC', 'QLD']
})

# Customer orders
customer_orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3, 4],
    'order_total': [150, 200, 75, 300, 125, 180, 90]
})

# TODO: Perform different types of merges:
# 1. Inner join: customers with addresses
inner_result = customers.merge(addresses, on='customer_id', how='inner')
print("Inner join result:")
print(inner_result)

# 2. Left join: all customers with their addresses (if available)
left_result = customers.merge(addresses, on='customer_id', how='left')
print("Left join result:")
print(left_result)

# 3. Outer join: all records from both tables
outer_result = customers.merge(addresses, on='customer_id', how='outer')
print("Outer join result:")
print(outer_result)

# 4. Merge all three tables to create complete customer profile
# Your code here:
complete_profile = customers.merge(addresses, on='customer_id', how='left') \
                           .merge(customer_orders, on='customer_id', how='left')
print("Complete customer profile:")
print(complete_profile)

Inner join result:
   customer_id     name              email       city state
0            1    Alice    alice@email.com      Perth    WA
1            2      Bob      bob@email.com     Sydney   NSW
2            3  Charlie  charlie@email.com  Melbourne   VIC
Left join result:
   customer_id     name              email       city state
0            1    Alice    alice@email.com      Perth    WA
1            2      Bob      bob@email.com     Sydney   NSW
2            3  Charlie  charlie@email.com  Melbourne   VIC
3            4    Diana    diana@email.com        NaN   NaN
4            5      Eve      eve@email.com        NaN   NaN
Outer join result:
   customer_id     name              email       city state
0            1    Alice    alice@email.com      Perth    WA
1            2      Bob      bob@email.com     Sydney   NSW
2            3  Charlie  charlie@email.com  Melbourne   VIC
3            4    Diana    diana@email.com        NaN   NaN
4            5      Eve      eve@email.com  
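
When join results differ like this, it can help to see which side each row came from. The sketch below uses two `merge` options, `indicator` and `validate`, on small made-up frames (`left`, `right`, and their values are illustrative assumptions, not the lab tables):

```python
import pandas as pd

# Hypothetical toy tables with partially overlapping keys
left = pd.DataFrame({'customer_id': [1, 2, 4], 'name': ['Alice', 'Bob', 'Diana']})
right = pd.DataFrame({'customer_id': [1, 2, 6], 'city': ['Perth', 'Sydney', 'Brisbane']})

audit = left.merge(
    right,
    on='customer_id',
    how='outer',
    indicator=True,         # adds a `_merge` column: left_only / right_only / both
    validate='one_to_one',  # raises MergeError if either side has duplicate keys
)
print(audit[['customer_id', '_merge']])

# Rows that did not find a partner on the other side
unmatched = audit[audit['_merge'] != 'both']
print(unmatched)
```

Filtering on `_merge` gives a quick list of customers without an address and addresses without a customer.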

### Exercise 2.2 — Product Catalog Merge (hard)
Handle complex merges with multiple keys and conditions.

In [9]:
# Product catalog
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004'],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],
    'base_price': [1000, 25, 75, 350]
})

# Store-specific pricing
store_prices = pd.DataFrame({
    'store_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3'],
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P004', 'P001'],
    'price_multiplier': [1.1, 1.0, 1.05, 0.95, 1.15, 1.2]
})

# Store information
stores = pd.DataFrame({
    'store_id': ['S1', 'S2', 'S3', 'S4'],
    'store_name': ['MegaMart', 'QuickShop', 'TechZone', 'BudgetBuy'],
    'location': ['Downtown', 'Suburb', 'Mall', 'Online']
})

# TODO: Create a complete price list:
# 1. Merge to get actual prices (base_price * multiplier) for each store
merged_data = products.merge(store_prices, on='product_id', how='inner')
merged_data['actual_price'] = merged_data['base_price'] * merged_data['price_multiplier']
print("Products with actual prices:")
print(merged_data[['product_name', 'store_id', 'base_price', 'price_multiplier', 'actual_price']])

# 2. Include store names and locations
complete_data = merged_data.merge(stores, on='store_id', how='left')
print("Complete price list with store info:")
print(complete_data[['store_name', 'location', 'product_name', 'base_price', 'price_multiplier', 'actual_price']])

# 3. Find products not available in certain stores
all_combinations = products[['product_id', 'product_name']].merge(
    stores[['store_id', 'store_name']], how='cross'
)
availability_check = all_combinations.merge(
    store_prices, on=['product_id', 'store_id'], how='left', indicator=True
)
not_available = availability_check[availability_check['_merge'] == 'left_only']
print("Products not available in certain stores:")
print(not_available[['product_name', 'store_name']].sort_values(['store_name', 'product_name']))

# 4. Calculate price variance across stores for each product
# Your code here:

price_variance = merged_data.groupby(['product_id', 'product_name'])['actual_price'].agg([
    'min', 'max', 'std', 'count'
]).round(2)
price_variance['price_range'] = price_variance['max'] - price_variance['min']

price_variance = price_variance.reset_index()

print("Price variance across stores:")
print(price_variance)

Products with actual prices:
  product_name store_id  base_price  price_multiplier  actual_price
0       Laptop       S1        1000              1.10       1100.00
1       Laptop       S2        1000              0.95        950.00
2       Laptop       S3        1000              1.20       1200.00
3        Mouse       S1          25              1.00         25.00
4     Keyboard       S1          75              1.05         78.75
5      Monitor       S2         350              1.15        402.50
Complete price list with store info:
  store_name  location product_name  base_price  price_multiplier  \
0   MegaMart  Downtown       Laptop        1000              1.10   
1  QuickShop    Suburb       Laptop        1000              0.95   
2   TechZone      Mall       Laptop        1000              1.20   
3   MegaMart  Downtown        Mouse          25              1.00   
4   MegaMart  Downtown     Keyboard          75              1.05   
5  QuickShop    Suburb      Monitor         

## Part 3: Pivoting and Reshaping (15 minutes)

### Exercise 3.1 — Sales Matrix Creation (medium)
Reshape data from long to wide format.

In [7]:
# Monthly sales by product and region (long format)
long_sales = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Feb',
              'Mar', 'Mar', 'Mar', 'Mar'],
    'region': ['North', 'South', 'East', 'West'] * 3,
    'product': ['A', 'A', 'B', 'B'] * 3,
    'sales': np.random.randint(1000, 5000, 12)
})

print("Long format data:")
print(long_sales)
print()

# TODO: Reshape the data:
# 1. Pivot to show regions as columns, months as rows
print("1. PIVOT: Regions as columns, months as rows")
monthly_regional_sales = long_sales.groupby(['month', 'region'])['sales'].sum().reset_index()
pivot_regions = monthly_regional_sales.pivot(index='month', columns='region', values='sales')
print(pivot_regions)
print()

# 2. Create a pivot table with product-region sales totals
print("2. PIVOT TABLE: Product-Region sales totals")
pivot_table_product_region = pd.pivot_table(
    long_sales, 
    values='sales', 
    index='product', 
    columns='region', 
    aggfunc='sum'
)
print(pivot_table_product_region)
print()

# 3. Add row and column totals (margins)
print("3. PIVOT TABLE WITH MARGINS: Adding row and column totals")
print("   Using margins=True to add totals")

pivot_with_margins = pd.pivot_table(
    long_sales,
    values='sales',
    index='product',
    columns='region',
    aggfunc='sum',
    margins=True,  # This adds row and column totals
    margins_name='Total'  # Name for the margin rows/columns
)
print(pivot_with_margins)
print()

# 4. Calculate month-over-month growth for each region
# Your code here:
print("4. MONTH-OVER-MONTH GROWTH: Calculate growth rate for each region")
print("   Using pct_change() to calculate percentage change")

# Start with our pivoted data (regions as columns, months as rows)
print("Original monthly data by region:")
print(pivot_regions)
print()

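# Calculate month-over-month growth (percentage change).
# The pivot sorts months alphabetically (Feb, Jan, Mar), so restore calendar order first.
mom_growth = pivot_regions.reindex(['Jan', 'Feb', 'Mar']).pct_change() * 100  # Convert to percentage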

print("Month-over-month growth (%):")
print(mom_growth.round(2))  # Round to 2 decimal places for readability
print()


Long format data:
   month region product  sales
0    Jan  North       A   1994
1    Jan  South       A   1220
2    Jan   East       B   3804
3    Jan   West       B   3014
4    Feb  North       A   1188
5    Feb  South       A   3265
6    Feb   East       B   1648
7    Feb   West       B   3946
8    Mar  North       A   4524
9    Mar  South       A   2674
10   Mar   East       B   4155
11   Mar   West       B   2402

1. PIVOT: Regions as columns, months as rows
region  East  North  South  West
month                           
Feb     1648   1188   3265  3946
Jan     3804   1994   1220  3014
Mar     4155   4524   2674  2402

2. PIVOT TABLE: Product-Region sales totals
region     East   North   South    West
product                                
A           NaN  7706.0  7159.0     NaN
B        9607.0     NaN     NaN  9362.0

3. PIVOT TABLE WITH MARGINS: Adding row and column totals
   Using margins=True to add totals
region     East   North   South    West  Total
product              
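
Month labels stored as plain strings sort alphabetically (Feb, Jan, Mar), and `pct_change()` compares each row with the one above it, so row order matters for growth calculations. One option is to declare the month column as an ordered categorical before pivoting. A minimal sketch on made-up numbers (`demo` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data in long format
demo = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'] * 2,
    'region': ['North'] * 3 + ['South'] * 3,
    'sales': [100, 150, 300, 200, 100, 150],
})

# Declare calendar order so sorting and pivoting follow it instead of the alphabet
demo['month'] = pd.Categorical(demo['month'], categories=['Jan', 'Feb', 'Mar'], ordered=True)

wide = demo.pivot(index='month', columns='region', values='sales')

# Each row is now compared with the previous calendar month
growth = wide.pct_change() * 100
print(growth.round(1))
```

The first month has no predecessor, so its growth is `NaN` by construction.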

### Exercise 3.2 — Melt and Stack Operations (hard)
Convert wide format data to long format for analysis.

In [12]:
# Wide format grade data
grades_wide = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Math': [85, 78, 92, 88],
    'Science': [90, 82, 88, 85],
    'English': [78, 85, 80, 92],
    'History': [82, 80, 85, 88]
})

print("Wide format grades:")
print(grades_wide)
print()

# TODO: Reshape the data:
# 1. Melt to long format (student, subject, grade)
print("1. MELT: Convert wide format to long format")
print("   Using melt() to transform columns into rows")

grades_long = pd.melt(grades_wide, id_vars=['student'], var_name='subject', value_name='grade')

print(grades_long)
print()
# 2. Calculate average grade per subject
print("2. AVERAGE GRADE PER SUBJECT")
print("   Using groupby() to aggregate by subject")

subject_avg_df = grades_long.groupby('subject')['grade'].agg(['mean', 'count']).round(2)
subject_avg_df.columns = ['Average Grade', 'Student Count']
print("Detailed subject statistics:")
print(subject_avg_df)
print()

# 3. Find each student's best and worst subjects
print("3. EACH STUDENT'S BEST AND WORST SUBJECTS")
print("   Using groupby() with idxmax() and idxmin()")

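# idxmax/idxmin return the row labels of each student's highest and lowest grade
best_idx = grades_long.groupby('student')['grade'].idxmax()
worst_idx = grades_long.groupby('student')['grade'].idxmin()
student_performance = pd.DataFrame({
    'Best Subject': grades_long.loc[best_idx, 'subject'].values,
    'Best Grade': grades_long.loc[best_idx, 'grade'].values,
    'Worst Subject': grades_long.loc[worst_idx, 'subject'].values,
    'Worst Grade': grades_long.loc[worst_idx, 'grade'].values,
}, index=best_idx.index)
print(student_performance)
print()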
print()


# 4. Create a ranking within each subject
# Your code here:
print("4. RANKING WITHIN EACH SUBJECT")
print("   Using rank() to rank students within each subject")

# Add ranking within each subject (1 = best grade in that subject)
grades_with_ranking = grades_long.copy()
grades_with_ranking['rank'] = grades_long.groupby('subject')['grade'].rank(
    method='min',        # How to handle ties (min rank for tied values)
    ascending=False      # False = highest grade gets rank 1
)

# Sort by subject and rank for easy viewing
grades_ranked = grades_with_ranking.sort_values(['subject', 'rank'])
print(grades_ranked)
print()

# Pivot table to see rankings clearly
print("Rankings in pivot table format (1 = best in subject):")
ranking_pivot = grades_with_ranking.pivot(index='student', columns='subject', values='rank')
print(ranking_pivot)
print()


Wide format grades:
   student  Math  Science  English  History
0    Alice    85       90       78       82
1      Bob    78       82       85       80
2  Charlie    92       88       80       85
3    Diana    88       85       92       88

1. MELT: Convert wide format to long format
   Using melt() to transform columns into rows
    student  subject  grade
0     Alice     Math     85
1       Bob     Math     78
2   Charlie     Math     92
3     Diana     Math     88
4     Alice  Science     90
5       Bob  Science     82
6   Charlie  Science     88
7     Diana  Science     85
8     Alice  English     78
9       Bob  English     85
10  Charlie  English     80
11    Diana  English     92
12    Alice  History     82
13      Bob  History     80
14  Charlie  History     85
15    Diana  History     88

2. AVERAGE GRADE PER SUBJECT
   Using groupby() to aggregate by subject
Detailed subject statistics:
         Average Grade  Student Count
subject                              
English       
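
The exercise title mentions stacking as well as melting. The two give the same long-format rows in a different order, as this minimal sketch on a two-student table suggests (`wide` and its values are illustrative assumptions, not the lab's `grades_wide`):

```python
import pandas as pd

# Hypothetical toy data in wide format
wide = pd.DataFrame({
    'student': ['Alice', 'Bob'],
    'Math': [85, 78],
    'Science': [90, 82],
})

# melt: subject columns become rows, ordered subject by subject
melted = wide.melt(id_vars='student', var_name='subject', value_name='grade')

# stack: column labels move into an inner index level, ordered student by student
stacked = (
    wide.set_index('student')
        .rename_axis(columns='subject')
        .stack()
        .rename('grade')
        .reset_index()
)

# unstack reverses stack and returns the wide layout
round_trip = stacked.set_index(['student', 'subject'])['grade'].unstack()
print(melted)
print(stacked)
print(round_trip)
```

Depending on the installed pandas version, `stack()` may emit a `FutureWarning` about its new implementation; the result for this data is the same either way.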

### Exercise 3.3 — Cross-tabulation Analysis (hard)
Use crosstab for categorical analysis.

In [17]:
# Survey responses
np.random.seed(50)
survey = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-45', '46+'], 200),
    'product_preference': np.random.choice(['A', 'B', 'C'], 200),
    'satisfaction': np.random.choice(['Low', 'Medium', 'High'], 200),
    'would_recommend': np.random.choice(['Yes', 'No'], 200, p=[0.7, 0.3])
})

print("Survey Data Sample:")
print(survey.head(10))
print(f"\nDataset shape: {survey.shape}")
print(f"Number of responses: {len(survey)}")
print("\n" + "="*70 + "\n")

# TODO: Analyze survey data:
# 1. Create crosstab of age_group vs product_preference
print("1. CROSSTAB: Age Group vs Product Preference")
print("   Using pd.crosstab() to create frequency table")

age_product_crosstab = pd.crosstab(
    survey['age_group'], 
    survey['product_preference'],
    margins=True  # Add row and column totals
)

print(age_product_crosstab)
print()

# 2. Add percentages (normalize by row)
print("2. CROSSTAB WITH PERCENTAGES: Normalized by row")
print("   Using normalize='index' to get row percentages")

age_product_percent = pd.crosstab(
    survey['age_group'], 
    survey['product_preference'],
    normalize='index'  # Normalize by row (each row sums to 1)
) * 100  # Convert to percentage

print(age_product_percent.round(1))
print()

# 3. Create crosstab with satisfaction levels
print("3. CROSSTAB: Age Group vs Satisfaction Levels")
print("   Analyzing satisfaction distribution across age groups")

age_satisfaction = pd.crosstab(
    survey['age_group'],
    survey['satisfaction'],
    margins=True
)

print("Raw counts:")
print(age_satisfaction)
print()

print("Row percentages (satisfaction distribution within each age group):")
age_satisfaction_percent = pd.crosstab(
    survey['age_group'],
    survey['satisfaction'],
    normalize='index'
) * 100

print(age_satisfaction_percent.round(1))
print()

# 4. Analyze recommendation rates by age and product
# Your code here:
print("4. RECOMMENDATION RATES: By Age and Product")
print("   Multi-level analysis using multiple grouping variables")

print("Recommendation rates by age group:")
age_recommend = pd.crosstab(
    survey['age_group'],
    survey['would_recommend'],
    normalize='index'
) * 100

print(age_recommend.round(1))
print()

print("Recommendation rates by product preference:")
product_recommend = pd.crosstab(
    survey['product_preference'],
    survey['would_recommend'],
    normalize='index'
) * 100
print(product_recommend.round(1))
print()

Survey Data Sample:
  age_group product_preference satisfaction would_recommend
0     18-25                  B          Low              No
1     18-25                  B       Medium             Yes
2       46+                  B          Low              No
3     26-35                  C       Medium              No
4     26-35                  C          Low             Yes
5     36-45                  A          Low              No
6     18-25                  C         High             Yes
7     36-45                  A         High             Yes
8     26-35                  C         High             Yes
9     36-45                  C       Medium             Yes

Dataset shape: (200, 4)
Number of responses: 200


1. CROSSTAB: Age Group vs Product Preference
   Using pd.crosstab() to create frequency table
product_preference   A   B   C  All
age_group                          
18-25               15  17  17   49
26-35               15  13  12   40
36-45               17  21  19
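
Task 4 asks for recommendation rates by age and product together. One way to get a single age-by-product table of "Yes" rates is to pass `values` and `aggfunc` to `pd.crosstab`. A minimal sketch on made-up responses (`demo` and the `recommends` column are illustrative assumptions, not the lab's `survey`):

```python
import pandas as pd

# Hypothetical toy survey responses
demo = pd.DataFrame({
    'age_group': ['18-25', '18-25', '18-25', '26-35', '26-35', '26-35'],
    'product_preference': ['A', 'A', 'B', 'A', 'B', 'B'],
    'would_recommend': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
})

# Encode Yes as 1 and No as 0: the mean of this column is the share of "Yes" answers
demo['recommends'] = demo['would_recommend'].eq('Yes').astype(int)

rate = pd.crosstab(
    demo['age_group'],
    demo['product_preference'],
    values=demo['recommends'],
    aggfunc='mean',
) * 100
print(rate.round(1))
```

Each cell is the percentage of respondents in that age group and product segment who answered "Yes"; cells with few respondents should be read with caution.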

## Part 4: Real-World Data Cleaning (15 minutes)

### Exercise 4.1 — Messy Contact Data (hard)
Clean real-world messy contact information.
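
The name normalisation this exercise asks for (trim, collapse repeated spaces, proper case) can also be sketched with pandas' vectorized `.str` methods instead of a row-wise function. The `names` Series below is an illustrative assumption:

```python
import pandas as pd

# Hypothetical sample of messy names, including a missing value
names = pd.Series(['  John Smith  ', 'jane doe', 'BOB JOHNSON', 'Alice    Brown', None])

cleaned = (
    names.str.strip()                           # drop leading/trailing whitespace
         .str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace
         .str.title()                           # proper case
)
print(cleaned.tolist())
```

The `.str` methods pass missing values through unchanged, so no explicit `pd.isna` check is needed here.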

In [25]:
import re

# Messy contact data
contacts = pd.DataFrame({
    'name': ['  John Smith  ', 'jane doe', 'BOB JOHNSON', 'Alice    Brown', 'charlie davis'],
    'email': ['John.Smith@GMAIL.com', 'JANE@COMPANY.COM', 'bob@email..com', 
             'alice@@email.net', 'charlie@'],
    'phone': ['0412-345-678', '(04) 9876 5432', '0401234567', '04 1111 2222', 'not provided'],
    'address': ['123 Main St, Perth', '456 Oak Ave', 'Sydney, NSW', None, '789 Pine Rd, Melbourne, VIC']
})

print("Messy data:")
print(contacts)
print()

# TODO: Clean the data:
cleaned_contacts = contacts.copy()

# 1. Standardize names (proper case, remove extra spaces)
print("1. CLEANING NAMES: Proper case and removing extra spaces")
print("   Using str.strip(), str.title(), and regex")

def clean_name(name):
    if pd.isna(name):
        return name
    # Remove extra spaces and convert to title case
    cleaned = re.sub(r'\s+', ' ', str(name).strip())  # Replace multiple spaces with single space
    return cleaned.title()  # Convert to proper case

cleaned_contacts['name_cleaned'] = cleaned_contacts['name'].apply(clean_name)

print("Before and after name cleaning:")
name_comparison = pd.DataFrame({
    'Original': contacts['name'],
    'Cleaned': cleaned_contacts['name_cleaned']
})
print(name_comparison)
print()

# 2. Validate and clean email addresses
print("2. CLEANING EMAIL ADDRESSES: Validation and standardization")
print("   Using regex patterns to validate and clean emails")

def clean_email(email):
    if pd.isna(email):
        return email, False
    
    email_str = str(email).strip().lower()
    
    # Basic email regex pattern
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    
    # Fix common issues
    # Remove double dots
    email_str = re.sub(r'\.+', '.', email_str)
    # Remove double @ symbols
    email_str = re.sub(r'@+', '@', email_str)
    
    # Check if email ends with @ (incomplete)
    if email_str.endswith('@'):
        return email_str, False
    
    # Validate with regex
    is_valid = bool(re.match(email_pattern, email_str))
    
    return email_str, is_valid

# Apply email cleaning
email_results = cleaned_contacts['email'].apply(clean_email)
cleaned_contacts['email_cleaned'] = [result[0] for result in email_results]
cleaned_contacts['email_valid'] = [result[1] for result in email_results]

print("Email cleaning results:")
email_comparison = pd.DataFrame({
    'Original': contacts['email'],
    'Cleaned': cleaned_contacts['email_cleaned'],
    'Valid': cleaned_contacts['email_valid']
})
print(email_comparison)
print()

# 3. Standardize phone numbers to single format
print("3. CLEANING PHONE NUMBERS: Standardizing to single format")
print("   Converting to format: 04XX XXX XXX")

def clean_phone(phone):
    if pd.isna(phone):
        return phone, False
    
    phone_str = str(phone).strip()
    
    # Check for non-numeric indicators
    if 'not provided' in phone_str.lower() or phone_str.lower() == 'nan':
        return None, False
    
    # Extract only digits
    digits_only = re.sub(r'\D', '', phone_str)
    
    # Australian mobile numbers should have 10 digits starting with 04
    if len(digits_only) == 10 and digits_only.startswith('04'):
        # Format as 04XX XXX XXX
        formatted = f"{digits_only[:4]} {digits_only[4:7]} {digits_only[7:]}"
        return formatted, True
    else:
        return phone_str, False

# Apply phone cleaning
phone_results = cleaned_contacts['phone'].apply(clean_phone)
cleaned_contacts['phone_cleaned'] = [result[0] for result in phone_results]
cleaned_contacts['phone_valid'] = [result[1] for result in phone_results]

print("Phone cleaning results:")
phone_comparison = pd.DataFrame({
    'Original': contacts['phone'],
    'Cleaned': cleaned_contacts['phone_cleaned'],
    'Valid': cleaned_contacts['phone_valid']
})
print(phone_comparison)
print()

# 4. Parse addresses to extract city and state
print("4. PARSING ADDRESSES: Extracting city and state")
print("   Using string operations and regex to parse address components")

def parse_address(address):
    if pd.isna(address):
        return None, None, False
    
    address_str = str(address).strip()
    
    # Australian states (common abbreviations)
    aus_states = ['NSW', 'VIC', 'QLD', 'SA', 'WA', 'TAS', 'NT', 'ACT']
    
    # Street indicators that suggest this is a street address, not a city
    street_indicators = ['st', 'street', 'ave', 'avenue', 'rd', 'road', 'dr', 'drive', 
                        'ln', 'lane', 'cres', 'crescent', 'pl', 'place', 'ct', 'court']
    
    # Split by commas
    parts = [part.strip() for part in address_str.split(',')]
    
    city = None
    state = None
    
    if len(parts) >= 2:
        # Look for state in the last part
        last_part = parts[-1].upper()
        state_found = False
        
        for aus_state in aus_states:
            # Match the state as a whole word so 'SA' does not match 'USA'
            if re.search(rf'\b{aus_state}\b', last_part):
                state = aus_state
                state_found = True
                # The city is the part immediately before the state
                city = parts[-2].title()
                break
        
        # If no state found, check if last part looks like a city (not a street address)
        if not state_found:
            last_part_lower = parts[-1].lower()
            # Only treat as city if it doesn't contain street indicators
            # Match indicators as whole words so 'st' does not match 'Easton'
            contains_street_indicator = any(
                re.search(rf'\b{indicator}\b', last_part_lower) for indicator in street_indicators
            )
            if not contains_street_indicator:
                city = parts[-1].title()
            
    elif len(parts) == 1:
        # Single part - check if it contains state
        single_part = parts[0]
        state_found = False
        
        for aus_state in aus_states:
            # Match the state as a whole word so 'WA' does not match 'Wagga'
            state_pattern = rf'\b{aus_state}\b'
            if re.search(state_pattern, single_part.upper()):
                state = aus_state
                # Extract city by removing the state token
                city_part = re.sub(state_pattern, '', single_part.upper()).replace(',', '').strip()
                if city_part:
                    city = city_part.title()
                state_found = True
                break
        
        # If no state found, check if it looks like a city (not a street address)
        if not state_found:
            single_part_lower = single_part.lower()
            # Match indicators as whole words so 'ct' does not match 'Victoria'
            contains_street_indicator = any(
                re.search(rf'\b{indicator}\b', single_part_lower) for indicator in street_indicators
            )
            # Only treat as city if it doesn't contain street indicators AND doesn't start with numbers
            if not contains_street_indicator and not re.match(r'^\d+', single_part.strip()):
                city = single_part.title()
    
    # Determine if parsing was successful (only if we found meaningful city/state info)
    parsed_successfully = bool(city or state)
    
    return city, state, parsed_successfully

# Apply address parsing
address_results = cleaned_contacts['address'].apply(parse_address)
cleaned_contacts['city'] = [result[0] for result in address_results]
cleaned_contacts['state'] = [result[1] for result in address_results]
cleaned_contacts['address_parsed'] = [result[2] for result in address_results]

print("Address parsing results:")
address_comparison = pd.DataFrame({
    'Original': contacts['address'],
    'City': cleaned_contacts['city'],
    'State': cleaned_contacts['state'],
    'Parsed': cleaned_contacts['address_parsed']
})
print(address_comparison)
print()

# TODO 5: Create data quality flags for each record
print("5. DATA QUALITY ASSESSMENT: Creating quality flags")
print("   Calculating completeness and validity scores")

# Calculate quality metrics for each record
cleaned_contacts['name_complete'] = cleaned_contacts['name_cleaned'].notna()
cleaned_contacts['email_complete'] = cleaned_contacts['email_cleaned'].notna()
cleaned_contacts['phone_complete'] = cleaned_contacts['phone_cleaned'].notna()
cleaned_contacts['address_complete'] = cleaned_contacts['address'].notna()

# Calculate overall quality score (0-100)
quality_weights = {
    'name_complete': 20,
    'email_valid': 30,
    'phone_valid': 25,
    'address_parsed': 25
}

def calculate_quality_score(row):
    # Add the weight of every quality flag that is True for this record
    return sum(weight for flag, weight in quality_weights.items() if row[flag])

cleaned_contacts['quality_score'] = cleaned_contacts.apply(calculate_quality_score, axis=1)

# Create quality categories
def categorize_quality(score):
    if score >= 80:
        return 'High'
    elif score >= 60:
        return 'Medium'
    else:
        return 'Low'

cleaned_contacts['quality_category'] = cleaned_contacts['quality_score'].apply(categorize_quality)

print("Data Quality Assessment:")
quality_summary = cleaned_contacts[[
    'name_cleaned', 'email_valid', 'phone_valid', 'address_parsed', 
    'quality_score', 'quality_category'
]].copy()

print(quality_summary)
print()

# Final cleaned dataset
print("6. FINAL CLEANED DATASET")
print("="*50)

final_dataset = cleaned_contacts[[
    'name_cleaned', 'email_cleaned', 'phone_cleaned', 
    'city', 'state', 'quality_score', 'quality_category'
]].copy()

final_dataset.columns = ['Name', 'Email', 'Phone', 'City', 'State', 'Quality_Score', 'Quality_Category']
print(final_dataset)
print()

# Summary statistics
print("7. CLEANING SUMMARY STATISTICS")
print("="*40)

total_records = len(contacts)
valid_emails = cleaned_contacts['email_valid'].sum()
valid_phones = cleaned_contacts['phone_valid'].sum()
parsed_addresses = cleaned_contacts['address_parsed'].sum()
high_quality = (cleaned_contacts['quality_category'] == 'High').sum()

print(f"Total records processed: {total_records}")
print(f"Valid emails: {valid_emails}/{total_records} ({valid_emails/total_records*100:.1f}%)")
print(f"Valid phones: {valid_phones}/{total_records} ({valid_phones/total_records*100:.1f}%)")
print(f"Parsed addresses: {parsed_addresses}/{total_records} ({parsed_addresses/total_records*100:.1f}%)")
print(f"High quality records: {high_quality}/{total_records} ({high_quality/total_records*100:.1f}%)")
print(f"Average quality score: {cleaned_contacts['quality_score'].mean():.1f}")

print("\n" + "="*80)

Messy data:
             name                 email           phone  \
0    John Smith    John.Smith@GMAIL.com    0412-345-678   
1        jane doe      JANE@COMPANY.COM  (04) 9876 5432   
2     BOB JOHNSON        bob@email..com      0401234567   
3  Alice    Brown      alice@@email.net    04 1111 2222   
4   charlie davis              charlie@    not provided   

                       address  
0           123 Main St, Perth  
1                  456 Oak Ave  
2                  Sydney, NSW  
3                         None  
4  789 Pine Rd, Melbourne, VIC  

1. CLEANING NAMES: Proper case and removing extra spaces
   Using str.strip(), str.title(), and regex
Before and after name cleaning:
         Original        Cleaned
0    John Smith       John Smith
1        jane doe       Jane Doe
2     BOB JOHNSON    Bob Johnson
3  Alice    Brown    Alice Brown
4   charlie davis  Charlie Davis

2. CLEANING EMAIL ADDRESSES: Validation and standardization
   Using regex patterns to validate and c
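
The weighted quality score from the data quality step can also be computed without a row-wise `apply`. The following is a minimal sketch, not part of the original solution: the `flags` DataFrame and its values are hypothetical, and only the weights and the 60/80 thresholds are taken from the code above.

```python
import pandas as pd

# Same weights as in the quality assessment step
quality_weights = {
    'name_complete': 20,
    'email_valid': 30,
    'phone_valid': 25,
    'address_parsed': 25,
}

# Hypothetical flags for three records (illustrative values only)
flags = pd.DataFrame({
    'name_complete': [True, True, True],
    'email_valid': [True, True, False],
    'phone_valid': [True, True, False],
    'address_parsed': [True, False, True],
})

# Multiply each boolean column by its weight, then sum across columns
weights = pd.Series(quality_weights)
quality_score = flags[weights.index].astype(int).mul(weights, axis=1).sum(axis=1)

# Bin scores into the same bands: >= 80 High, >= 60 Medium, otherwise Low
# (bin edges assume integer scores, which holds for these weights)
quality_category = pd.cut(
    quality_score,
    bins=[-1, 59, 79, 100],
    labels=['Low', 'Medium', 'High'],
)

print(pd.DataFrame({'score': quality_score, 'category': quality_category}))
```

On a five-row table the difference is negligible, but column-wise arithmetic tends to scale better than `apply(axis=1)` on larger datasets.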

### Exercise 4.2 — Duplicate Detection and Resolution (hard)
Find and handle duplicate records intelligently.

In [28]:
from difflib import SequenceMatcher
import re

# Customer records with potential duplicates
customers_dup = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['John Smith', 'J. Smith', 'Jane Doe', 'Jane M. Doe', 
            'Bob Johnson', 'Robert Johnson', 'Alice Brown', 'Alice B.'],
    'email': ['john@email.com', 'j.smith@email.com', 'jane@gmail.com', 'jane.doe@gmail.com',
             'bob@email.com', 'bob@email.com', 'alice@email.com', 'alice.brown@email.com'],
    'last_purchase': pd.to_datetime(['2025-01-15', '2025-02-20', '2025-03-10', '2025-03-12',
                                     '2025-01-20', '2025-02-15', '2025-03-01', '2025-03-05']),
    'total_spent': [500, 750, 1000, 200, 300, 450, 800, 150]
})

print("Customer records:")
print(customers_dup)
print()

# TODO: Handle duplicates:
# 1. Find exact email duplicates
print("1. EXACT EMAIL DUPLICATES")
print("   Finding records with identical email addresses")

# Find duplicated emails
email_duplicates = customers_dup[customers_dup.duplicated(subset=['email'], keep=False)]
email_duplicates_sorted = email_duplicates.sort_values('email')

print("Records with duplicate emails:")
print(email_duplicates_sorted)
print()

# Group by email to see duplicates clearly
print("Duplicate email groups:")
for email, group in customers_dup.groupby('email'):
    if len(group) > 1:
        print(f"\nEmail: {email}")
        print(group[['customer_id', 'name', 'last_purchase', 'total_spent']])
print()


# 2. Find potential name duplicates (similar names)
print("2. SIMILAR NAME DUPLICATES")
print("   Using string similarity to find potential name matches")

def name_similarity(name1, name2):
    """Calculate similarity between two names using SequenceMatcher"""
    # Normalize names (lowercase, remove extra spaces)
    name1_clean = re.sub(r'\s+', ' ', str(name1).lower().strip())
    name2_clean = re.sub(r'\s+', ' ', str(name2).lower().strip())
    
    # Calculate similarity ratio
    similarity = SequenceMatcher(None, name1_clean, name2_clean).ratio()
    return similarity

def find_name_duplicates(df, threshold=0.7):
    """Find potential name duplicates based on similarity threshold"""
    potential_duplicates = []
    
    for i in range(len(df)):
        for j in range(i+1, len(df)):
            name1 = df.iloc[i]['name']
            name2 = df.iloc[j]['name']
            similarity = name_similarity(name1, name2)
            
            if similarity >= threshold:
                potential_duplicates.append({
                    'customer_id_1': df.iloc[i]['customer_id'],
                    'name_1': name1,
                    'customer_id_2': df.iloc[j]['customer_id'], 
                    'name_2': name2,
                    'similarity': similarity
                })
    
    return pd.DataFrame(potential_duplicates)

# Find similar names
name_duplicates = find_name_duplicates(customers_dup, threshold=0.6)
print("Potential name duplicates (similarity >= 0.6):")
print(name_duplicates.round(3))
print()


# 3. Merge duplicate records (keep most recent, sum totals)
# 4. Create a deduplication report
# Your code here:


Customer records:
   customer_id            name                  email last_purchase  \
0            1      John Smith         john@email.com    2025-01-15   
1            2        J. Smith      j.smith@email.com    2025-02-20   
2            3        Jane Doe         jane@gmail.com    2025-03-10   
3            4     Jane M. Doe     jane.doe@gmail.com    2025-03-12   
4            5     Bob Johnson          bob@email.com    2025-01-20   
5            6  Robert Johnson          bob@email.com    2025-02-15   
6            7     Alice Brown        alice@email.com    2025-03-01   
7            8        Alice B.  alice.brown@email.com    2025-03-05   

   total_spent  
0          500  
1          750  
2         1000  
3          200  
4          300  
5          450  
6          800  
7          150  

1. EXACT EMAIL DUPLICATES
   Finding records with identical email addresses
Records with duplicate emails:
   customer_id            name          email last_purchase  total_spent
4       
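
Steps 3 and 4 of this exercise are left open. One possible approach is sketched below, under the assumption that only records sharing an exact email are safe to merge automatically, while similar-name pairs are left for manual review. The `records_merged` column and the `report` layout are illustrative choices, not requirements of the exercise.

```python
import pandas as pd

# Same customer records as in the exercise
customers_dup = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['John Smith', 'J. Smith', 'Jane Doe', 'Jane M. Doe',
             'Bob Johnson', 'Robert Johnson', 'Alice Brown', 'Alice B.'],
    'email': ['john@email.com', 'j.smith@email.com', 'jane@gmail.com', 'jane.doe@gmail.com',
              'bob@email.com', 'bob@email.com', 'alice@email.com', 'alice.brown@email.com'],
    'last_purchase': pd.to_datetime(['2025-01-15', '2025-02-20', '2025-03-10', '2025-03-12',
                                     '2025-01-20', '2025-02-15', '2025-03-01', '2025-03-05']),
    'total_spent': [500, 750, 1000, 200, 300, 450, 800, 150],
})

# Sort so the most recent purchase is the last row within each email group
ordered = customers_dup.sort_values('last_purchase')

# 3. Merge duplicates: keep the most recent record's id and name, sum the totals
merged = (
    ordered.groupby('email', as_index=False)
    .agg(
        customer_id=('customer_id', 'last'),
        name=('name', 'last'),
        last_purchase=('last_purchase', 'max'),
        total_spent=('total_spent', 'sum'),
        records_merged=('customer_id', 'count'),
    )
    .sort_values('customer_id')
    .reset_index(drop=True)
)

# 4. Deduplication report: record counts before and after merging
report = pd.Series({
    'records_before': len(customers_dup),
    'records_after': len(merged),
    'records_removed': len(customers_dup) - len(merged),
    'groups_merged': int((merged['records_merged'] > 1).sum()),
})

print(merged)
print(report)
```

With this data only the two `bob@email.com` records are merged. The similar-name pairs found in step 2 have different emails, so they stay separate until someone confirms they are the same customer.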

## 🚀 Challenge: Complete Data Pipeline
Build an end-to-end data processing pipeline.

In [None]:
# E-commerce data pipeline challenge
# You have three data sources that need to be combined and analyzed

# Fix the seed so the randomly generated data is reproducible across runs
np.random.seed(42)

# Source 1: Order data
orders = pd.DataFrame({
    'order_id': range(1, 101),
    'customer_id': np.random.randint(1, 31, 100),
    'product_id': np.random.choice(['P1', 'P2', 'P3', 'P4', 'P5'], 100),
    'quantity': np.random.randint(1, 5, 100),
    'order_date': pd.date_range('2025-07-01', periods=100),
    'status': np.random.choice(['Completed', 'Pending', 'Cancelled'], 100, p=[0.8, 0.15, 0.05])
})

# Source 2: Product data
products = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'unit_price': [1200, 25, 80, 350, 120],
    'cost': [800, 15, 50, 250, 70]
})

# Source 3: Customer data
customers = pd.DataFrame({
    'customer_id': range(1, 31),
    'customer_name': [f'Customer_{i}' for i in range(1, 31)],
    'segment': np.random.choice(['Premium', 'Standard', 'Basic'], 30, p=[0.2, 0.5, 0.3]),
    'join_date': pd.date_range('2024-01-01', periods=30, freq='W')
})

# TODO: Build a complete analysis pipeline:
# 1. Merge all three datasets
# 2. Calculate order values and profit margins
# 3. Analyze sales by customer segment and product category
# 4. Find top customers and products
# 5. Calculate customer lifetime value
# 6. Create monthly sales trend
# 7. Identify cross-selling opportunities
# 8. Generate executive summary DataFrame
# Your code here:
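
As a starting point for the challenge, the sketch below covers steps 1, 2, 3, 4, and 6. It is one possible approach and not a reference solution. It rebuilds the three sources with `np.random.default_rng(42)` so that it runs on its own, and it assumes that only `Completed` orders count as sales. Both choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator so the sketch is reproducible
rng = np.random.default_rng(42)

orders = pd.DataFrame({
    'order_id': range(1, 101),
    'customer_id': rng.integers(1, 31, 100),
    'product_id': rng.choice(['P1', 'P2', 'P3', 'P4', 'P5'], 100),
    'quantity': rng.integers(1, 5, 100),
    'order_date': pd.date_range('2025-07-01', periods=100),
    'status': rng.choice(['Completed', 'Pending', 'Cancelled'], 100, p=[0.8, 0.15, 0.05]),
})
products = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'unit_price': [1200, 25, 80, 350, 120],
    'cost': [800, 15, 50, 250, 70],
})
customers = pd.DataFrame({
    'customer_id': range(1, 31),
    'segment': rng.choice(['Premium', 'Standard', 'Basic'], 30, p=[0.2, 0.5, 0.3]),
})

# 1. Merge all three sources; validate raises if a lookup table has duplicate keys
full = (
    orders
    .merge(products, on='product_id', how='left', validate='many_to_one')
    .merge(customers, on='customer_id', how='left', validate='many_to_one')
)

# 2. Order value, profit, and profit margin per order
full['order_value'] = full['quantity'] * full['unit_price']
full['profit'] = full['quantity'] * (full['unit_price'] - full['cost'])
full['margin_pct'] = full['profit'] / full['order_value'] * 100

# Assumption: only completed orders count as realised sales
completed = full[full['status'] == 'Completed']

# 3. Sales by customer segment and product category
segment_category = completed.pivot_table(
    index='segment', columns='category', values='order_value',
    aggfunc='sum', fill_value=0,
)

# 4. Top five customers by completed order value
top_customers = completed.groupby('customer_id')['order_value'].sum().nlargest(5)

# 6. Monthly sales trend
monthly_sales = (
    completed.groupby(completed['order_date'].dt.to_period('M'))['order_value'].sum()
)

print(segment_category)
print(top_customers)
print(monthly_sales)
```

Left joins keep every order even if a lookup key were missing, which makes unmatched rows easy to spot as `NaN` values. Customer lifetime value, cross-selling, and the executive summary (steps 5, 7, and 8) can build on `completed` in the same way.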


## 📊 Lab Summary Checklist

**Core Skills Practiced:**
- [ ] GroupBy with single and multiple columns
- [ ] Custom aggregations with agg()
- [ ] Different types of merges (inner, left, outer)
- [ ] Pivot tables and reshaping
- [ ] Data cleaning and deduplication
- [ ] Complete data pipeline

**Self-Assessment:**
- I can group and aggregate data efficiently ✅
- I understand different join types ✅
- I can reshape data between wide and long formats ✅
- I can clean messy real-world data ✅
- I can build data processing pipelines ✅

## 🎯 What's Next?
**Lab 01C:** Advanced EDA techniques and statistical analysis!