# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 01 · Lab 01B — Data Wrangling
**Instructor:** Amir Charkhi  |  **Duration:** 45 minutes  |  **Difficulty:** ⭐⭐⭐☆☆

> **Goal:** Master groupby, merge, pivot, and real-world data cleaning.


## Learning Objectives
- Master groupby operations and aggregations
- Perform different types of merges and joins
- Reshape data with pivot and melt
- Handle real-world messy data

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Ready for data wrangling! 🔧")

Ready for data wrangling! 🔧


## Part 1: GroupBy Mastery (15 minutes)

### Exercise 1.1 — Sales Team Performance (medium)
Analyze sales team performance across regions.

In [19]:
# Sales data
np.random.seed(42)
sales = pd.DataFrame({
    'salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'quantity': np.random.randint(1, 50, 100),
    'revenue': np.random.uniform(100, 5000, 100).round(2),
    'date': pd.date_range('2025-01-01', periods=100)
})
sales

,salesperson,region,product,quantity,revenue,date
0,Charlie,East,C,20,1227.91,2025-01-01
1,Diana,South,C,36,2546.05,2025-01-02
2,Alice,South,A,19,2902.82,2025-01-03
3,Charlie,West,A,26,3865.91,2025-01-04
4,Charlie,South,B,3,313.66,2025-01-05
...,...,...,...,...,...,...
95,Bob,South,A,33,2873.58,2025-04-06
96,Bob,South,C,28,877.37,2025-04-07
97,Diana,West,B,47,688.81,2025-04-08
98,Bob,North,A,33,1775.21,2025-04-09


In [20]:
# TODO: Use groupby to find:
# 1. Total revenue by salesperson
salesperson_unique = sales['salesperson'].unique().tolist()
print(f"Salespeople: {salesperson_unique}")

revenue_per_salesperson = sales.groupby('salesperson')['revenue'].sum()
print(f"\nTotal revenue by salesperson ($):\n{revenue_per_salesperson}")

Salespeople: ['Charlie', 'Diana', 'Alice', 'Bob']

Total revenue by salesperson ($):
salesperson
Alice      40206.96
Bob        59313.60
Charlie    53546.82
Diana      74760.62
Name: revenue, dtype: float64


In [21]:
# 2. Average sale amount by region
avg_sales_by_region = sales.groupby('region')['revenue'].mean().round(2)
print(f"Average sales by region ($): \n{avg_sales_by_region}")

Average sales by region ($): 
region
East     2643.60
North    2282.16
South    2350.26
West     1757.44
Name: revenue, dtype: float64


In [37]:
# 3. Top product by quantity in each region
product_quantity_by_region = sales.groupby(['region','product'])['quantity'].max()
print(f"Max product quantity by region (multi-index series): \n{product_quantity_by_region}")

df_product_quantity_by_region = product_quantity_by_region.reset_index(name="quantity")
print(f"\nMax product quantity by region (reset index - dataframe): \n{df_product_quantity_by_region}")

top_product_qty_by_region = df_product_quantity_by_region.loc[df_product_quantity_by_region.groupby('region')['quantity'].idxmax()]
print(f"\nTop product by quantity in each region: \n{top_product_qty_by_region}")

Max product quantity by region (multi-index series): 
region  product
East    A          45
        B          38
        C          41
North   A          41
        B          48
        C          49
South   A          42
        B          48
        C          38
West    A          48
        B          47
        C          49
Name: quantity, dtype: int32

Max product quantity by region (reset index - dataframe): 
   region product  quantity
0    East       A        45
1    East       B        38
2    East       C        41
3   North       A        41
4   North       B        48
5   North       C        49
6   South       A        42
7   South       B        48
8   South       C        38
9    West       A        48
10   West       B        47
11   West       C        49

Top product by quantity in each region: 
   region product  quantity
0    East       A        45
5   North       C        49
7   South       B        48
11   West       C        49
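
The cell above keeps the largest single-sale quantity for each region and product. If "top product by quantity" is instead read as the product with the most units sold in total, one possible sketch is to sum first and then pick the maximum per region. The `demo_sales` frame and its values are hypothetical and only mimic the shape of `sales`.

```python
import pandas as pd

# Hypothetical miniature of the `sales` frame (illustrative values only)
demo_sales = pd.DataFrame({
    'region':   ['East', 'East', 'East', 'West', 'West', 'West'],
    'product':  ['A', 'B', 'B', 'A', 'A', 'C'],
    'quantity': [40, 25, 30, 10, 15, 20],
})

# Total units per (region, product)
totals = demo_sales.groupby(['region', 'product'], as_index=False)['quantity'].sum()

# Keep, for each region, the row whose total is largest
top_by_total = totals.loc[totals.groupby('region')['quantity'].idxmax()]
print(top_by_total)
```

In this toy data the largest single sale in East belongs to product A, but product B sells more units overall, so the two readings of the task can give different answers.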


In [53]:
# 4. Sales performance by salesperson AND region (multi-level groupby)
sales_performance = sales.groupby(['salesperson','region'])['revenue'].max()
print(f"Largest single sale by salesperson and region ($) \n{sales_performance}")

df_sales_performance = sales_performance.reset_index(name = 'revenue')
df_top_sales = df_sales_performance.loc[df_sales_performance.groupby('salesperson')['revenue'].idxmax()]
print(f"\nRegion with each salesperson's largest single sale ($): \n{df_top_sales}")

Largest single sale by salesperson and region ($) 
salesperson  region
Alice        East      3601.52
             North     4258.48
             South     2902.82
             West      4829.73
Bob          East      3763.82
             North     4850.73
             South     4236.64
             West      2248.52
Charlie      East      3392.93
             North     4210.77
             South     4429.12
             West      4770.05
Diana        East      4911.02
             North     2156.93
             South     4973.30
             West      2044.24
Name: revenue, dtype: float64

Region with each salesperson's largest single sale ($): 
   salesperson region  revenue
3        Alice   West  4829.73
5          Bob  North  4850.73
11     Charlie   West  4770.05
14       Diana  South  4973.30
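
A multi-level groupby can also report several statistics at once, which gives a fuller picture of performance than a single maximum. The sketch below uses named aggregations on a hypothetical `demo_sales` frame (illustrative values, not the lab data) and then unstacks the totals into a salesperson-by-region matrix.

```python
import pandas as pd

# Hypothetical sample with the same columns as the lab's `sales` frame
demo_sales = pd.DataFrame({
    'salesperson': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'region':      ['North', 'North', 'South', 'North', 'South', 'South'],
    'revenue':     [100.0, 300.0, 250.0, 400.0, 150.0, 50.0],
})

# One row per (salesperson, region) with total, mean and number of sales
performance = (
    demo_sales
    .groupby(['salesperson', 'region'])['revenue']
    .agg(total='sum', average='mean', n_sales='count')
)
print(performance)

# Wide view: salespeople as rows, regions as columns, total revenue as values
total_matrix = performance['total'].unstack('region')
print(total_matrix)
```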


### Exercise 1.2 — Custom Aggregations (medium)
Apply multiple aggregation functions simultaneously.

In [75]:
# Customer orders
orders = pd.DataFrame({
    'customer_id': np.random.randint(1, 21, 100),
    'order_date': pd.date_range('2025-05-01', periods=100),
    'amount': np.random.uniform(20, 500, 100).round(2),
    'items': np.random.randint(1, 10, 100),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'], 100)
})
orders


,customer_id,order_date,amount,items,category
0,10,2025-05-01,118.40,9,Food
1,3,2025-05-02,202.03,5,Books
2,9,2025-05-03,464.70,9,Electronics
3,13,2025-05-04,366.37,3,Food
4,18,2025-05-05,43.09,4,Books
...,...,...,...,...,...
95,12,2025-08-04,129.15,7,Food
96,18,2025-08-05,311.79,7,Books
97,14,2025-08-06,202.07,1,Clothing
98,11,2025-08-07,377.24,7,Books


In [62]:
orders.describe()

,customer_id,order_date,amount,items
count,100.0,100,100.0,100.0
mean,10.74,2025-07-20 12:00:00,273.0526,4.47
min,1.0,2025-06-01 00:00:00,25.45,1.0
25%,5.0,2025-06-25 18:00:00,154.43,2.0
50%,11.5,2025-07-20 12:00:00,284.81,5.0
75%,16.25,2025-08-14 06:00:00,402.505,6.25
max,20.0,2025-09-08 00:00:00,496.62,9.0
std,6.102856,,150.550964,2.583768


In [None]:
# TODO: Create a customer summary with:
# Use .agg() with dictionary or list of functions
# Your code here:

In [61]:
# 1. Total spending
unique_customer_id = orders['customer_id'].unique().tolist()
print(f"Number of customers: {len(unique_customer_id)}")
print(f"\nUnique customer IDs: {unique_customer_id}")

total_spend_per_customer = orders.groupby('customer_id')['amount'].sum()
print(f"\nTotal spend per customer ($): \n{total_spend_per_customer}")
print(f"\nTotal spend from all customers ($): {total_spend_per_customer.sum()}")

Number of customers: 19

Unique customer IDs: [15, 17, 14, 20, 5, 12, 16, 7, 4, 1, 10, 2, 19, 13, 18, 3, 11, 8, 9]

Total spend per customer ($): 
customer_id
1     1747.46
2     1009.81
3     1679.48
4     2209.61
5     2111.84
7     1113.72
8      693.39
9     1603.07
10    1186.71
11     360.46
12    2090.49
13    1094.00
14     823.82
15     670.96
16    1868.49
17    2288.07
18    1320.14
19    1797.28
20    1636.46
Name: amount, dtype: float64

Total spend from all customers ($): 27305.26


In [68]:
# 2. Average order value
avg_order_per_customer = round(orders.groupby('customer_id')['amount'].mean(),2)
print(f"Average order value per customer ($): \n{avg_order_per_customer}")
# Note: this is the unweighted mean of the per-customer averages; the average over all orders is orders['amount'].mean()
print(f"\nMean of per-customer average order values ($): {avg_order_per_customer.mean():.2f}")

Average order value per customer ($): 
customer_id
1     291.24
2     252.45
3     279.91
4     276.20
5     301.69
7     278.43
8     346.70
9     320.61
10    237.34
11    120.15
12    261.31
13    218.80
14    274.61
15    335.48
16    266.93
17    254.23
18    330.04
19    256.75
20    327.29
Name: amount, dtype: float64

Mean of per-customer average order values ($): 275.27


In [74]:
# 3. Number of orders
number_of_orders = orders.groupby('customer_id')['customer_id'].count()
print(f"Number of orders per customer: \n{number_of_orders}")
print(f"\nCustomer with the most orders: {number_of_orders.idxmax()}")

Number of orders per customer: 
customer_id
1     6
2     4
3     6
4     8
5     7
7     4
8     2
9     5
10    5
11    3
12    8
13    5
14    3
15    2
16    7
17    9
18    4
19    7
20    5
Name: customer_id, dtype: int64

Customer with the most orders: 17


In [73]:
# 4. Most frequent category
category_count = orders.groupby('category')['category'].count()
print(f"Orders count for each category: \n{category_count}")

most_frequent_category = category_count.idxmax()
print(f"\nMost frequently ordered category: {most_frequent_category}")

Orders count for each category: 
category
Books          17
Clothing       29
Electronics    28
Food           26
Name: category, dtype: int64

Most frequently ordered category: Clothing


In [80]:
# 5. Days since last order
last_order = orders.groupby('customer_id')['order_date'].max()
print(f"Customer's last order date: \n{last_order}")

today = pd.Timestamp.today().normalize()  # midnight today, so the differences are whole days

days_since_last_order = today - last_order
print(f"\nDays since last order: \n{days_since_last_order}")

Customer's last order date: 
customer_id
1    2025-07-24
2    2025-07-22
3    2025-08-01
4    2025-07-08
5    2025-08-03
6    2025-07-31
8    2025-07-21
9    2025-08-08
10   2025-07-17
11   2025-08-07
12   2025-08-04
13   2025-07-20
14   2025-08-06
15   2025-07-30
16   2025-05-28
17   2025-08-02
18   2025-08-05
19   2025-06-26
20   2025-07-23
Name: order_date, dtype: datetime64[ns]

Days since last order: 
customer_id
1    38 days
2    40 days
3    30 days
4    54 days
5    28 days
6    31 days
8    41 days
9    23 days
10   45 days
11   24 days
12   27 days
13   42 days
14   25 days
15   32 days
16   95 days
17   29 days
18   26 days
19   66 days
20   39 days
Name: order_date, dtype: timedelta64[ns]
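
The five steps above can also be collected into one customer-level table with a single `.agg()` call, which is what the TODO hints at. The following is a minimal sketch using named aggregations. The `demo_orders` frame and the fixed `reference_date` are assumptions made so that the result is reproducible; the lab itself uses random data and today's date.

```python
import pandas as pd

# Hypothetical orders (illustrative values, same columns as the lab's `orders`)
demo_orders = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'order_date': pd.to_datetime(
        ['2025-05-01', '2025-05-10', '2025-06-01', '2025-05-03', '2025-05-20']
    ),
    'amount': [100.0, 50.0, 150.0, 80.0, 120.0],
    'category': ['Books', 'Food', 'Books', 'Food', 'Food'],
})

# Assumed fixed "today" so the day counts do not change between runs
reference_date = pd.Timestamp('2025-06-30')

customer_summary = demo_orders.groupby('customer_id').agg(
    total_spend=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    n_orders=('amount', 'count'),
    # mode() returns every tied value in sorted order; keep the first one
    top_category=('category', lambda s: s.mode().iloc[0]),
    last_order=('order_date', 'max'),
)
customer_summary['days_since_last_order'] = (
    reference_date - customer_summary['last_order']
).dt.days
print(customer_summary)
```

Here "most frequent category" is computed per customer, whereas the earlier cell counts categories over all orders; which of the two the exercise intends is a judgment call.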


### Exercise 1.3 — Transform vs Aggregate (hard)
Understand the difference between transform and aggregate operations.
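
As a quick illustration of that difference: an aggregation returns one value per group, while a transform returns one value per original row, aligned with the input index. The tiny `demo` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical three-row example
demo = pd.DataFrame({'store': ['A', 'A', 'B'], 'sales': [100, 300, 50]})

# Aggregate: one row per store (length 2)
per_group = demo.groupby('store')['sales'].agg('mean')

# Transform: one value per original row (length 3), so it can be assigned as a new column
per_row = demo.groupby('store')['sales'].transform('mean')

print(per_group)
print(per_row)
```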

In [81]:
# Store sales data
store_sales = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'month': ['Jan', 'Feb', 'Mar'] * 3,
    'sales': [1000, 1200, 1100, 800, 900, 950, 1500, 1600, 1550]
})

In [82]:
store_sales

,store,month,sales
0,A,Jan,1000
1,A,Feb,1200
2,A,Mar,1100
3,B,Jan,800
4,B,Feb,900
5,B,Mar,950
6,C,Jan,1500
7,C,Feb,1600
8,C,Mar,1550


In [83]:
store_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   store   9 non-null      object
 1   month   9 non-null      object
 2   sales   9 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 348.0+ bytes


In [85]:
# TODO: Use groupby to:
# 1. Add a column showing each store's average sales (use transform)
store_sales['average_sales'] = store_sales.groupby('store')['sales'].transform('mean').round(2)
store_sales

,store,month,sales,average_sales
0,A,Jan,1000,1100.0
1,A,Feb,1200,1100.0
2,A,Mar,1100,1100.0
3,B,Jan,800,883.33
4,B,Feb,900,883.33
5,B,Mar,950,883.33
6,C,Jan,1500,1550.0
7,C,Feb,1600,1550.0
8,C,Mar,1550,1550.0


In [86]:
# 2. Add a column showing percentage of store's total sales
# (this cell adds the store total; the percentage itself is sales / store_total_sales * 100)
store_sales['store_total_sales'] = store_sales.groupby('store')['sales'].transform('sum')
store_sales

,store,month,sales,average_sales,store_total_sales
0,A,Jan,1000,1100.0,3300
1,A,Feb,1200,1100.0,3300
2,A,Mar,1100,1100.0,3300
3,B,Jan,800,883.33,2650
4,B,Feb,900,883.33,2650
5,B,Mar,950,883.33,2650
6,C,Jan,1500,1550.0,4650
7,C,Feb,1600,1550.0,4650
8,C,Mar,1550,1550.0,4650
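
The task asks for each month's share of its store's total, so the store total needs one more step. A sketch of that step, rebuilding the same data under the hypothetical name `store_sales_demo` so the block runs on its own:

```python
import pandas as pd

# Same values as the lab's store_sales frame
store_sales_demo = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'month': ['Jan', 'Feb', 'Mar'] * 3,
    'sales': [1000, 1200, 1100, 800, 900, 950, 1500, 1600, 1550],
})

# transform('sum') broadcasts each store's total back onto its rows
store_total = store_sales_demo.groupby('store')['sales'].transform('sum')

# Each row's share of its own store's total, in percent
store_sales_demo['pct_of_store_total'] = (
    store_sales_demo['sales'] / store_total * 100
).round(2)
print(store_sales_demo)
```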


In [88]:
# 3. Add a column indicating if sales are above store average
store_sales['above_average'] = store_sales['sales'] > store_sales['average_sales']
store_sales

,store,month,sales,average_sales,store_total_sales,above_average
0,A,Jan,1000,1100.0,3300,False
1,A,Feb,1200,1100.0,3300,True
2,A,Mar,1100,1100.0,3300,False
3,B,Jan,800,883.33,2650,False
4,B,Feb,900,883.33,2650,True
5,B,Mar,950,883.33,2650,True
6,C,Jan,1500,1550.0,4650,False
7,C,Feb,1600,1550.0,4650,True
8,C,Mar,1550,1550.0,4650,False


In [97]:
# 4. Rank months within each store by sales
store_rank = store_sales.sort_values(by=['store','sales'],ascending=[True,False])
print(f"Rank months within each store by sales: \n{store_rank}")

Rank months within each store by sales: 
  store month  sales  average_sales  store_total_sales  above_average
1     A   Feb   1200        1100.00               3300           True
2     A   Mar   1100        1100.00               3300          False
0     A   Jan   1000        1100.00               3300          False
5     B   Mar    950         883.33               2650           True
4     B   Feb    900         883.33               2650           True
3     B   Jan    800         883.33               2650          False
7     C   Feb   1600        1550.00               4650           True
8     C   Mar   1550        1550.00               4650          False
6     C   Jan   1500        1550.00               4650          False
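
Sorting shows the order, but it does not store a rank. If an explicit rank column is wanted, `groupby(...).rank()` is one way to get it. A minimal sketch on the same store data (the frame name `ranked` and the column name `rank_in_store` are illustrative):

```python
import pandas as pd

# Same values as the lab's store_sales frame
ranked = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'month': ['Jan', 'Feb', 'Mar'] * 3,
    'sales': [1000, 1200, 1100, 800, 900, 950, 1500, 1600, 1550],
})

# Rank 1 = best month within each store; 'dense' leaves no gaps after ties
ranked['rank_in_store'] = (
    ranked.groupby('store')['sales']
    .rank(ascending=False, method='dense')
    .astype(int)
)
print(ranked.sort_values(['store', 'rank_in_store']))
```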


## Part 2: Merging and Joining (15 minutes)

### Exercise 2.1 — Customer Database Integration (medium)
Merge customer information from multiple sources.

In [98]:
# Customer basic info
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
             'diana@email.com', 'eve@email.com']
})

# Customer addresses
addresses = pd.DataFrame({
    'customer_id': [1, 2, 3, 6],  # Note: customer 6 doesn't exist, 4 & 5 missing
    'city': ['Perth', 'Sydney', 'Melbourne', 'Brisbane'],
    'state': ['WA', 'NSW', 'VIC', 'QLD']
})

# Customer orders
customer_orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3, 4],
    'order_total': [150, 200, 75, 300, 125, 180, 90]
})

# TODO: Perform different types of merges:
# Your code here:

In [99]:
customers

,customer_id,name,email
0,1,Alice,alice@email.com
1,2,Bob,bob@email.com
2,3,Charlie,charlie@email.com
3,4,Diana,diana@email.com
4,5,Eve,eve@email.com


In [100]:
addresses

,customer_id,city,state
0,1,Perth,WA
1,2,Sydney,NSW
2,3,Melbourne,VIC
3,6,Brisbane,QLD


In [101]:
customer_orders

,customer_id,order_total
0,1,150
1,1,200
2,2,75
3,3,300
4,3,125
5,3,180
6,4,90


In [103]:
# 1. Inner join: customers with addresses
customer_address = pd.merge(
    customers,addresses,
    on = 'customer_id',
    how = 'inner'
)
customer_address

,customer_id,name,email,city,state
0,1,Alice,alice@email.com,Perth,WA
1,2,Bob,bob@email.com,Sydney,NSW
2,3,Charlie,charlie@email.com,Melbourne,VIC


In [104]:
# 2. Left join: all customers with their addresses (if available)
customer_address_v2 = pd.merge(
    customers,addresses,
    on = 'customer_id',
    how = 'left'
)
customer_address_v2

,customer_id,name,email,city,state
0,1,Alice,alice@email.com,Perth,WA
1,2,Bob,bob@email.com,Sydney,NSW
2,3,Charlie,charlie@email.com,Melbourne,VIC
3,4,Diana,diana@email.com,,
4,5,Eve,eve@email.com,,


In [105]:
# 3. Outer join: all records from both tables
customer_address_v3 = pd.merge(
    customers,addresses,
    on = 'customer_id',
    how = 'outer'
)
customer_address_v3

,customer_id,name,email,city,state
0,1,Alice,alice@email.com,Perth,WA
1,2,Bob,bob@email.com,Sydney,NSW
2,3,Charlie,charlie@email.com,Melbourne,VIC
3,4,Diana,diana@email.com,,
4,5,Eve,eve@email.com,,
5,6,,,Brisbane,QLD


In [108]:
# 4. Merge all three tables to create complete customer profile
complete_profile = pd.merge(
    customer_orders, customer_address_v3,
    on = 'customer_id',
    how = 'left'
)
complete_profile

,customer_id,order_total,name,email,city,state
0,1,150,Alice,alice@email.com,Perth,WA
1,1,200,Alice,alice@email.com,Perth,WA
2,2,75,Bob,bob@email.com,Sydney,NSW
3,3,300,Charlie,charlie@email.com,Melbourne,VIC
4,3,125,Charlie,charlie@email.com,Melbourne,VIC
5,3,180,Charlie,charlie@email.com,Melbourne,VIC
6,4,90,Diana,diana@email.com,,
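
Because the merge above starts from `customer_orders`, customers without any order (Eve) do not appear in the result. If the profile should contain one row per customer, a possible alternative is to summarise the orders first and then left-join everything onto `customers`. The column names `n_orders` and `total_spent` are illustrative choices.

```python
import pandas as pd

# Same three tables as in the exercise
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
})
addresses = pd.DataFrame({
    'customer_id': [1, 2, 3, 6],
    'city': ['Perth', 'Sydney', 'Melbourne', 'Brisbane'],
})
customer_orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3, 4],
    'order_total': [150, 200, 75, 300, 125, 180, 90],
})

# Collapse orders to one row per customer so the merge cannot duplicate customers
order_summary = customer_orders.groupby('customer_id', as_index=False).agg(
    n_orders=('order_total', 'count'),
    total_spent=('order_total', 'sum'),
)

# Left joins keep every customer, with NaN where no match exists
profile = (
    customers
    .merge(addresses, on='customer_id', how='left')
    .merge(order_summary, on='customer_id', how='left')
)

# Customers with no orders get 0 instead of NaN
profile[['n_orders', 'total_spent']] = (
    profile[['n_orders', 'total_spent']].fillna(0).astype(int)
)
print(profile)
```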


### Exercise 2.2 — Product Catalog Merge (hard)
Handle complex merges with multiple keys and conditions.

In [109]:
# Product catalog
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004'],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics'],
    'base_price': [1000, 25, 75, 350]
})

# Store-specific pricing
store_prices = pd.DataFrame({
    'store_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3'],
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P004', 'P001'],
    'price_multiplier': [1.1, 1.0, 1.05, 0.95, 1.15, 1.2]
})

# Store information
stores = pd.DataFrame({
    'store_id': ['S1', 'S2', 'S3', 'S4'],
    'store_name': ['MegaMart', 'QuickShop', 'TechZone', 'BudgetBuy'],
    'location': ['Downtown', 'Suburb', 'Mall', 'Online']
})

# TODO: Create a complete price list:
# Your code here:

In [110]:
products

,product_id,product_name,category,base_price
0,P001,Laptop,Electronics,1000
1,P002,Mouse,Accessories,25
2,P003,Keyboard,Accessories,75
3,P004,Monitor,Electronics,350


In [111]:
store_prices

,store_id,product_id,price_multiplier
0,S1,P001,1.1
1,S1,P002,1.0
2,S1,P003,1.05
3,S2,P001,0.95
4,S2,P004,1.15
5,S3,P001,1.2


In [112]:
stores

,store_id,store_name,location
0,S1,MegaMart,Downtown
1,S2,QuickShop,Suburb
2,S3,TechZone,Mall
3,S4,BudgetBuy,Online


In [114]:
# 1. Merge to get actual prices (base_price * multiplier) for each store
merge_store_prices = pd.merge(
    store_prices,products,
    on = 'product_id',
    how = 'left'
)
merge_store_prices['actual_price'] = merge_store_prices['price_multiplier'] * merge_store_prices['base_price']
merge_store_prices

,store_id,product_id,price_multiplier,product_name,category,base_price,actual_price
0,S1,P001,1.1,Laptop,Electronics,1000,1100.0
1,S1,P002,1.0,Mouse,Accessories,25,25.0
2,S1,P003,1.05,Keyboard,Accessories,75,78.75
3,S2,P001,0.95,Laptop,Electronics,1000,950.0
4,S2,P004,1.15,Monitor,Electronics,350,402.5
5,S3,P001,1.2,Laptop,Electronics,1000,1200.0


In [121]:
# 2. Include store names and locations
merge_store_names_location = pd.merge(
    merge_store_prices, stores,
    on='store_id',
    how='left'
)
merge_store_names_location

,store_id,product_id,price_multiplier,product_name,category,base_price,actual_price,store_name,location
0,S1,P001,1.1,Laptop,Electronics,1000,1100.0,MegaMart,Downtown
1,S1,P002,1.0,Mouse,Accessories,25,25.0,MegaMart,Downtown
2,S1,P003,1.05,Keyboard,Accessories,75,78.75,MegaMart,Downtown
3,S2,P001,0.95,Laptop,Electronics,1000,950.0,QuickShop,Suburb
4,S2,P004,1.15,Monitor,Electronics,350,402.5,QuickShop,Suburb
5,S3,P001,1.2,Laptop,Electronics,1000,1200.0,TechZone,Mall


In [139]:
# 3. Find products not available in certain stores
# find all store x product combo
store_products_combo = pd.merge(
    stores[['store_id']],
    products[['product_id','base_price']],
    how='cross'
)
print(f"All possible store & product combos: \n{store_products_combo}")

# left join with store_prices
store_products_prices_combo = pd.merge(
    store_products_combo,
    store_prices,
    on=['store_id', 'product_id'],
    how='left'
)
print(f"\nLeft join prices table with above table (assuming products exist when price multipliers exist) \n{store_products_prices_combo}")

# not available products are the lines with missing price multipliers
missing_items = store_products_prices_combo[store_products_prices_combo['price_multiplier'].isna()][['store_id', 'product_id']]
print(f"\nMissing products (items without price multipliers): \n{missing_items}")


All possible store & product combos: 
   store_id product_id  base_price
0        S1       P001        1000
1        S1       P002          25
2        S1       P003          75
3        S1       P004         350
4        S2       P001        1000
5        S2       P002          25
6        S2       P003          75
7        S2       P004         350
8        S3       P001        1000
9        S3       P002          25
10       S3       P003          75
11       S3       P004         350
12       S4       P001        1000
13       S4       P002          25
14       S4       P003          75
15       S4       P004         350

Left join prices table with above table (assuming products exist when price multipliers exist) 
   store_id product_id  base_price  price_multiplier
0        S1       P001        1000              1.10
1        S1       P002          25              1.00
2        S1       P003          75              1.05
3        S1       P004         350               NaN
4    
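
The same "what is missing" question can be answered without relying on a NaN in a value column, by asking `merge` to label where each row came from. This is a sketch of that anti-join pattern using `indicator=True`. It assumes pandas 1.2 or later for `how='cross'`, and the names `flagged` and `not_stocked` are illustrative.

```python
import pandas as pd

# Keys only, same values as in the exercise
stores = pd.DataFrame({'store_id': ['S1', 'S2', 'S3', 'S4']})
products = pd.DataFrame({'product_id': ['P001', 'P002', 'P003', 'P004']})
store_prices = pd.DataFrame({
    'store_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3'],
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P004', 'P001'],
})

# Every possible store/product pair
all_combos = stores.merge(products, how='cross')

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
flagged = all_combos.merge(
    store_prices, on=['store_id', 'product_id'], how='left', indicator=True
)

# Pairs with no price entry are treated as "not available" (same assumption as above)
not_stocked = flagged.loc[flagged['_merge'] == 'left_only', ['store_id', 'product_id']]

# Compact view: list of missing products per store
missing_per_store = not_stocked.groupby('store_id')['product_id'].apply(list)
print(missing_per_store)
```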

In [143]:
# 4. Calculate price variance across stores for each product
# use the merged table from step 1
merge_store_prices

,store_id,product_id,price_multiplier,product_name,category,base_price,actual_price
0,S1,P001,1.1,Laptop,Electronics,1000,1100.0
1,S1,P002,1.0,Mouse,Accessories,25,25.0
2,S1,P003,1.05,Keyboard,Accessories,75,78.75
3,S2,P001,0.95,Laptop,Electronics,1000,950.0
4,S2,P004,1.15,Monitor,Electronics,350,402.5
5,S3,P001,1.2,Laptop,Electronics,1000,1200.0


In [165]:
print("Only product P001 is available across 3 stores.")
P001 = merge_store_prices.loc[merge_store_prices['product_id']=='P001',['store_id','actual_price']]
print(f"Actual price of P001 across stores: \n{P001}")

max_price = P001['actual_price'].max()
max_price_store = P001.loc[P001['actual_price']==max_price, 'store_id'].iloc[0]
min_price = P001['actual_price'].min()
min_price_store = P001.loc[P001['actual_price']==min_price, 'store_id'].iloc[0]
price_diff = max_price - min_price

print(f"\nHighest price is ${max_price} at store {max_price_store}")
print(f"Lowest price is ${min_price} at store {min_price_store}")
print(f"Price difference between both stores: ${price_diff}")

Only product P001 is available across 3 stores.
Actual price of P001 across stores: 
  store_id  actual_price
0       S1        1100.0
3       S2         950.0
5       S3        1200.0

Highest price is $1200.0 at store S3
Lowest price is $950.0 at store S2
Price difference between both stores: $250.0
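
The comparison above is written by hand for P001. A more general sketch computes the spread for every product in one groupby. The prices below are the `actual_price` values from step 1; the output column names are illustrative. Note that the sample variance is undefined for a product sold in only one store, so pandas returns NaN there.

```python
import pandas as pd

# actual_price values taken from the merged table in step 1
prices = pd.DataFrame({
    'store_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S3'],
    'product_id': ['P001', 'P002', 'P003', 'P001', 'P004', 'P001'],
    'actual_price': [1100.0, 25.0, 78.75, 950.0, 402.5, 1200.0],
})

price_stats = prices.groupby('product_id')['actual_price'].agg(
    n_stores='count',
    min_price='min',
    max_price='max',
    variance='var',   # sample variance (ddof=1); NaN when only one store sells the product
)
price_stats['price_range'] = price_stats['max_price'] - price_stats['min_price']
print(price_stats)
```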


## Part 3: Pivoting and Reshaping (15 minutes)

### Exercise 3.1 — Sales Matrix Creation (medium)
Reshape data from long to wide format.

In [166]:
# Monthly sales by product and region (long format)
long_sales = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Feb',
              'Mar', 'Mar', 'Mar', 'Mar'],
    'region': ['North', 'South', 'East', 'West'] * 3,
    'product': ['A', 'A', 'B', 'B'] * 3,
    'sales': np.random.randint(1000, 5000, 12)
})

print("Long format data:")
print(long_sales)
print()

# TODO: Reshape the data:
# Your code here:

Long format data:
   month region product  sales
0    Jan  North       A   3040
1    Jan  South       A   3868
2    Jan   East       B   1454
3    Jan   West       B   3953
4    Feb  North       A   2708
5    Feb  South       A   4472
6    Feb   East       B   3969
7    Feb   West       B   4028
8    Mar  North       A   4547
9    Mar  South       A   1701
10   Mar   East       B   3681
11   Mar   West       B   4629



In [183]:
# order the month
#month_order = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
month_order = ["Jan","Feb","Mar"]
long_sales['month']=pd.Categorical(long_sales['month'], categories = month_order , ordered = True)

In [177]:
# 1. Pivot to show regions as columns, months as rows

pivot1 = long_sales.pivot(index='month', columns='region', values='sales')
pivot1 = pivot1.sort_index(level='month')
pivot1

region,East,North,South,West
month,,,,
Jan,1454,3040,3868,3953
Feb,3969,2708,4472,4028
Mar,3681,4547,1701,4629


In [179]:
# 2. Create a pivot table with product-region sales totals
pivot2 = long_sales.pivot(index=['product','month'], columns='region', values='sales')
pivot2 = pivot2.sort_index(level=['product','month'])
pivot2

,region,East,North,South,West
product,month,,,,
A,Jan,,3040.0,3868.0,
A,Feb,,2708.0,4472.0,
A,Mar,,4547.0,1701.0,
B,Jan,1454.0,,,3953.0
B,Feb,3969.0,,,4028.0
B,Mar,3681.0,,,4629.0
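
The table above keeps one row per product and month. If "product-region sales totals" is read as one total per product and region, `pivot_table` with `aggfunc='sum'` is a possible fit, because unlike `pivot` it aggregates repeated index/column pairs. The sketch below reuses the sales figures printed in the long-format output; `fill_value=0` is a presentation choice for combinations with no sales.

```python
import pandas as pd

# Values copied from the long-format output above
long_sales_demo = pd.DataFrame({
    'month': ['Jan'] * 4 + ['Feb'] * 4 + ['Mar'] * 4,
    'region': ['North', 'South', 'East', 'West'] * 3,
    'product': ['A', 'A', 'B', 'B'] * 3,
    'sales': [3040, 3868, 1454, 3953, 2708, 4472, 3969, 4028, 4547, 1701, 3681, 4629],
})

# Sum over months for each (product, region) pair
product_region_totals = long_sales_demo.pivot_table(
    index='product',
    columns='region',
    values='sales',
    aggfunc='sum',
    fill_value=0,
)
print(product_region_totals)
```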


In [184]:
# 3. Add row and column totals (margins)
pivot3 = long_sales.pivot_table(
    index = ['month'],
    columns = 'region',
    values = 'sales',
    aggfunc = 'sum',
    margins = True,
    margins_name = 'Total',
    observed = False  # 'month' is categorical; state the grouping behaviour explicitly
)
pivot3


region,East,North,South,West,Total
month,,,,,
Jan,1454,3040,3868,3953,12315
Feb,3969,2708,4472,4028,15177
Mar,3681,4547,1701,4629,14558
Total,9104,10295,10041,12610,42050


In [193]:
# 4. Calculate month-over-month growth for each region
#using pivot 1 table from step 1
print("sales by region, $")
print(pivot1)
print()
monthly_growth = round(pivot1.pct_change()*100,2)
print("month on month growth, %")
print(monthly_growth)

sales by region, $
region  East  North  South  West
month                           
Jan     1454   3040   3868  3953
Feb     3969   2708   4472  4028
Mar     3681   4547   1701  4629

month on month growth, %
region    East  North  South   West
month                              
Jan        NaN    NaN    NaN    NaN
Feb     172.97 -10.92  15.62   1.90
Mar      -7.26  67.91 -61.96  14.92


### Exercise 3.2 — Melt and Stack Operations (hard)
Convert wide format data to long format for analysis.

In [194]:
# Wide format grade data
grades_wide = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Math': [85, 78, 92, 88],
    'Science': [90, 82, 88, 85],
    'English': [78, 85, 80, 92],
    'History': [82, 80, 85, 88]
})

print("Wide format grades:")
print(grades_wide)
print()

# TODO: Reshape the data:
# Your code here:

Wide format grades:
   student  Math  Science  English  History
0    Alice    85       90       78       82
1      Bob    78       82       85       80
2  Charlie    92       88       80       85
3    Diana    88       85       92       88



In [195]:
# 1. Melt to long format (student, subject, grade)
grades_long = pd.melt(
    grades_wide,
    id_vars = ['student'],
    var_name = 'subject',
    value_name = 'grade'
)
grades_long

,student,subject,grade
0,Alice,Math,85
1,Bob,Math,78
2,Charlie,Math,92
3,Diana,Math,88
4,Alice,Science,90
5,Bob,Science,82
6,Charlie,Science,88
7,Diana,Science,85
8,Alice,English,78
9,Bob,English,85


In [196]:
# 2. Calculate average grade per subject
avg_grade_per_subject = grades_long.groupby('subject')['grade'].mean()
avg_grade_per_subject

subject
English    83.75
History    83.75
Math       85.75
Science    86.25
Name: grade, dtype: float64

In [210]:
# 3. Find each student's best and worst subjects
best_grade = grades_long.groupby('student')['grade'].max().reset_index()
best_grade_subject = pd.merge(best_grade, grades_long, on=['student', 'grade'], how='inner')
print(f"Student's best subject: \n{best_grade_subject}")

worst_grade = grades_long.groupby('student')['grade'].min().reset_index()
worst_grade_subject = pd.merge(worst_grade, grades_long, on=['student', 'grade'], how='inner')
print(f"\nStudent's worst subject: \n{worst_grade_subject}")

Student's best subject: 
   student  grade  subject
0    Alice     90  Science
1      Bob     85  English
2  Charlie     92     Math
3    Diana     92  English

Student's worst subject: 
   student  grade  subject
0    Alice     78  English
1      Bob     78     Math
2  Charlie     80  English
3    Diana     85  Science
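
A shorter route to the same answer works directly on the wide table: with `student` as the index, `idxmax(axis=1)` and `idxmin(axis=1)` return the column label of each row's highest and lowest grade. One difference worth noting: these return only the first subject when grades tie, whereas the merge approach above returns every tied subject. The frame name `best_worst` is illustrative.

```python
import pandas as pd

# Same values as the lab's grades_wide frame
grades_wide = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Math': [85, 78, 92, 88],
    'Science': [90, 82, 88, 85],
    'English': [78, 85, 80, 92],
    'History': [82, 80, 85, 88],
})

scores = grades_wide.set_index('student')

best_worst = pd.DataFrame({
    'best_subject': scores.idxmax(axis=1),   # column label of the row maximum
    'best_grade': scores.max(axis=1),
    'worst_subject': scores.idxmin(axis=1),  # column label of the row minimum
    'worst_grade': scores.min(axis=1),
})
print(best_worst)
```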


In [230]:
# 4. Create a ranking within each subject
ranking = grades_long.groupby(['subject','student'], as_index = False).agg(grade=('grade','max')).sort_values(['subject','grade'], ascending=[True,False])
ranking

,subject,student,grade
3,English,Diana,92
1,English,Bob,85
2,English,Charlie,80
0,English,Alice,78
7,History,Diana,88
6,History,Charlie,85
4,History,Alice,82
5,History,Bob,80
10,Math,Charlie,92
11,Math,Diana,88


In [232]:
# option 2:
ranking2 = grades_long.sort_values(['subject','grade'],ascending=[True,False])
ranking2

,student,subject,grade
11,Diana,English,92
9,Bob,English,85
10,Charlie,English,80
8,Alice,English,78
15,Diana,History,88
14,Charlie,History,85
12,Alice,History,82
13,Bob,History,80
2,Charlie,Math,92
3,Diana,Math,88
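
Both options order the rows but do not record a position. To attach an explicit rank to each student within a subject, `groupby('subject')['grade'].rank()` can be used. A sketch, with `rank_in_subject` as an illustrative column name and `method='min'` chosen so that tied students share the better rank:

```python
import pandas as pd

# Same values as the lab's grades_wide frame
grades_wide = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Math': [85, 78, 92, 88],
    'Science': [90, 82, 88, 85],
    'English': [78, 85, 80, 92],
    'History': [82, 80, 85, 88],
})
grades_long = grades_wide.melt(id_vars='student', var_name='subject', value_name='grade')

# Rank 1 = highest grade in the subject
grades_long['rank_in_subject'] = (
    grades_long.groupby('subject')['grade']
    .rank(ascending=False, method='min')
    .astype(int)
)
print(grades_long.sort_values(['subject', 'rank_in_subject']))
```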


### Exercise 3.3 — Cross-tabulation Analysis (hard)
Use crosstab for categorical analysis.

In [233]:
# Survey responses
np.random.seed(50)
survey = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-45', '46+'], 200),
    'product_preference': np.random.choice(['A', 'B', 'C'], 200),
    'satisfaction': np.random.choice(['Low', 'Medium', 'High'], 200),
    'would_recommend': np.random.choice(['Yes', 'No'], 200, p=[0.7, 0.3])
})

# TODO: Analyze survey data:
# Your code here:

In [234]:
survey

,age_group,product_preference,satisfaction,would_recommend
0,18-25,B,Low,No
1,18-25,B,Medium,Yes
2,46+,B,Low,No
3,26-35,C,Medium,No
4,26-35,C,Low,Yes
...,...,...,...,...
195,18-25,C,Medium,Yes
196,18-25,B,High,Yes
197,36-45,C,High,Yes
198,18-25,B,Low,Yes


In [239]:
# 1. Create crosstab of age_group vs product_preference
crosstab_age_product = pd.crosstab(
    survey['age_group'],
    survey['product_preference'],
    margins = True,
    margins_name = 'Total'
)
crosstab_age_product

product_preference,A,B,C,Total
age_group,,,,
18-25,15,17,17,49
26-35,15,13,12,40
36-45,17,21,19,57
46+,13,21,20,54
Total,60,72,68,200


In [238]:
# 2. Add percentages (normalize by row)
row_percentage = pd.crosstab(
    survey['age_group'],
    survey['product_preference'],
    normalize = 'index'
).round(2)
row_percentage

product_preference,A,B,C
age_group,,,
18-25,0.31,0.35,0.35
26-35,0.38,0.32,0.3
36-45,0.3,0.37,0.33
46+,0.24,0.39,0.37


In [240]:
# 3. Create crosstab with satisfaction levels
crosstab_age_satisfaction = pd.crosstab(
    survey['age_group'],
    survey['satisfaction'],
    margins = True,
    margins_name = 'Total'
)
crosstab_age_satisfaction

satisfaction,High,Low,Medium,Total
age_group,,,,
18-25,16,14,19,49
26-35,11,13,16,40
36-45,19,22,16,57
46+,12,22,20,54
Total,58,71,71,200


In [244]:
# 4. Analyze recommendation rates by age and product
recommendation_rates = pd.crosstab(
    index = [survey['age_group'], survey['product_preference']],
    columns = survey['would_recommend'],
    normalize='index'
).round(2)
recommendation_rates

,would_recommend,No,Yes
age_group,product_preference,,
18-25,A,0.2,0.8
18-25,B,0.24,0.76
18-25,C,0.47,0.53
26-35,A,0.4,0.6
26-35,B,0.46,0.54
26-35,C,0.25,0.75
36-45,A,0.18,0.82
36-45,B,0.29,0.71
36-45,C,0.21,0.79
46+,A,0.23,0.77


## Part 4: Real-World Data Cleaning (15 minutes)

### Exercise 4.1 — Messy Contact Data (hard)
Clean real-world messy contact information.

In [287]:
# Messy contact data
contacts = pd.DataFrame({
    'name': ['  John Smith  ', 'jane doe', 'BOB JOHNSON', 'Alice    Brown', 'charlie davis'],
    'email': ['John.Smith@GMAIL.com', 'JANE@COMPANY.COM', 'bob@email..com', 
             'alice@@email.net', 'charlie@'],
    'phone': ['0412-345-678', '(04) 9876 5432', '0401234567', '04 1111 2222', 'not provided'],
    'address': ['123 Main St, Perth', '456 Oak Ave', 'Sydney, NSW', None, '789 Pine Rd, Melbourne, VIC']
})

print("Messy data:")
print(contacts)
print()
# TODO: Clean the data:
# Your code here:

Messy data:
             name                 email           phone  \
0    John Smith    John.Smith@GMAIL.com    0412-345-678   
1        jane doe      JANE@COMPANY.COM  (04) 9876 5432   
2     BOB JOHNSON        bob@email..com      0401234567   
3  Alice    Brown      alice@@email.net    04 1111 2222   
4   charlie davis              charlie@    not provided   

                       address  
0           123 Main St, Perth  
1                  456 Oak Ave  
2                  Sydney, NSW  
3                         None  
4  789 Pine Rd, Melbourne, VIC  



In [288]:
# 1. Standardize names (proper case, remove extra spaces)
contacts = contacts.rename(columns={'name':'raw_name', 'address': 'raw_address'}) # keep the original name column
contacts[['first_name','last_name']]=contacts['raw_name'].str.strip().str.split(n=1, expand=True)  # remove extra spaces, split into 2 columns
contacts[['first_name','last_name']]=contacts[['first_name','last_name']].apply(lambda s: s.str.capitalize()) # capitalize first letter
contacts

Unnamed: 0,raw_name,email,phone,raw_address,first_name,last_name
0,John Smith,John.Smith@GMAIL.com,0412-345-678,"123 Main St, Perth",John,Smith
1,jane doe,JANE@COMPANY.COM,(04) 9876 5432,456 Oak Ave,Jane,Doe
2,BOB JOHNSON,bob@email..com,0401234567,"Sydney, NSW",Bob,Johnson
3,Alice Brown,alice@@email.net,04 1111 2222,,Alice,Brown
4,charlie davis,charlie@,not provided,"789 Pine Rd, Melbourne, VIC",Charlie,Davis
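
If a single cleaned name column is preferred over separate first and last names, one possible sketch follows. The sample values are copied from the lab data and `clean_name` is a hypothetical column name. Using `str.title()` is an assumption that suits these names but can mis-case names such as "McDonald".

```python
import pandas as pd

raw = pd.Series(['  John Smith  ', 'jane doe', 'BOB JOHNSON', 'Alice    Brown'], name='raw_name')

clean_name = (raw.str.strip()                            # drop leading/trailing spaces
                 .str.replace(r'\s+', ' ', regex=True)   # collapse inner runs of whitespace
                 .str.title()                            # capitalise each word
                 .rename('clean_name'))
print(clean_name.tolist())
```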


In [289]:
# 2. Validate and clean email addresses
# remove duplicated '@'
contacts['email']=contacts['email'].str.replace(r'@{2,}','@', regex=True) 
# remove duplicated '.'
contacts['email']=contacts['email'].str.replace(r'\.{2,}','.', regex=True) 
# split the string at '@'
parts = contacts['email'].str.split('@', n=1, expand=True) 
# when email domains exist (not blank or missing)
domains = parts[1].notna() & parts[1].str.strip().ne('')  
# transform characters after '@' to lower case
contacts.loc[domains,'email'] = parts.loc[domains,0] + '@' + parts.loc[domains,1].str.lower()  
# replace email with missing domain with NA
contacts.loc[~domains, 'email'] = pd.NA  

contacts

Unnamed: 0,raw_name,email,phone,raw_address,first_name,last_name
0,John Smith,John.Smith@gmail.com,0412-345-678,"123 Main St, Perth",John,Smith
1,jane doe,JANE@company.com,(04) 9876 5432,456 Oak Ave,Jane,Doe
2,BOB JOHNSON,bob@email.com,0401234567,"Sydney, NSW",Bob,Johnson
3,Alice Brown,alice@email.net,04 1111 2222,,Alice,Brown
4,charlie davis,,not provided,"789 Pine Rd, Melbourne, VIC",Charlie,Davis
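
The cell above repairs addresses and drops those without a domain. A complementary approach is to flag invalid addresses without changing them. The sketch below is a minimal example under stated assumptions: `EMAIL_PATTERN` is a deliberately simple, hypothetical pattern (not a full RFC 5322 validator) and the sample values mirror the problems in the lab data.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of problems in the lab data
emails = pd.Series(
    ['John.Smith@gmail.com', 'bob@email..com', 'alice@@email.net', 'charlie@', None],
    dtype='string')

# Assumption: exactly one '@', a non-empty local part,
# and a dotted domain whose labels are all non-empty
EMAIL_PATTERN = r'[^@\s]+@[^@\s.]+(?:\.[^@\s.]+)+'

# fullmatch returns NA for missing values; treat those as invalid
is_valid = emails.str.fullmatch(EMAIL_PATTERN).fillna(False)
print(pd.DataFrame({'email': emails, 'is_valid': is_valid}))
```

Flagging first makes it possible to review what would be repaired or discarded before the original values are overwritten.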


In [290]:
# 3. Standardize phone numbers to a single format (digits only, kept as strings so the leading 0 survives)
contacts['phone'] = (contacts['phone'].astype('string').str.strip()     # remove surrounding spaces
                 .mask(lambda x: x.str.fullmatch(r'(?i)\s*not\s*provided\s*'))    # replace 'not provided' with NA
                 .str.replace(r'\D', '', regex=True)  # drop every non-digit character
                 .mask(lambda x: x == '')) # replace empty strings with NA

In [282]:
contacts['phone']

0    0412345678
1    0498765432
2    0401234567
3    0411112222
4          <NA>
Name: phone, dtype: string
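
Once only digits remain, the numbers can be validated and given a display format. The sketch below assumes that a valid mobile number is exactly 10 digits starting with `04`, and that `XXXX XXX XXX` is the target format. Both are assumptions for illustration, not requirements stated in the lab.

```python
import pandas as pd

digits = pd.Series(['0412345678', '0498765432', '04123', pd.NA], dtype='string')

# Assumption: a valid mobile number is exactly 10 digits starting with '04'
is_mobile = digits.str.fullmatch(r'04\d{8}').fillna(False).astype(bool)

# Reformat as 'XXXX XXX XXX'; numbers that fail the check become missing
formatted = (digits.str.replace(r'(\d{4})(\d{3})(\d{3})', r'\1 \2 \3', regex=True)
                   .where(is_mobile))
print(formatted)
```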

In [295]:
# 4. Parse addresses to extract city and state
# extract the city from the raw address
cities = r'Perth|Sydney|Melbourne|Adelaide|Hobart|Brisbane|Darwin|Canberra'
city_pattern = rf'\b({cities})\b'
contacts['city'] = contacts['raw_address'].str.extract(city_pattern)[0]

# extract the state abbreviation from the raw address
states = r'ACT|NSW|NT|QLD|SA|TAS|VIC|WA'
state_pattern = rf'\b({states})\b'
contacts['state'] = contacts['raw_address'].str.extract(state_pattern)[0]

# where the state is missing, infer it from the city (every city in the pattern needs an entry)
city_state_dict = {'Perth': 'WA',
                   'Melbourne': 'VIC',
                   'Adelaide': 'SA',
                   'Sydney': 'NSW',
                   'Hobart': 'TAS',
                   'Brisbane': 'QLD',
                   'Darwin': 'NT',
                   'Canberra': 'ACT'
                  }
contacts['state'] = contacts['state'].fillna(contacts['city'].map(city_state_dict))

In [296]:
contacts

Unnamed: 0,raw_name,email,phone,raw_address,first_name,last_name,state,city
0,John Smith,John.Smith@gmail.com,412345678.0,"123 Main St, Perth",John,Smith,WA,Perth
1,jane doe,JANE@company.com,498765432.0,456 Oak Ave,Jane,Doe,,
2,BOB JOHNSON,bob@email.com,401234567.0,"Sydney, NSW",Bob,Johnson,NSW,Sydney
3,Alice Brown,alice@email.net,411112222.0,,Alice,Brown,,
4,charlie davis,,,"789 Pine Rd, Melbourne, VIC",Charlie,Davis,VIC,Melbourne


In [299]:
# 5. Create data quality flags for each record
columns_check = ['email', 'phone', 'first_name', 'last_name','raw_address','city']
contacts['missing']=contacts[columns_check].isna().sum(axis=1)

# map the number of missing fields to a quality label (2 or more missing fields count as 'bad')
data_quality={0:'good',1:'fair',2:'bad'}
contacts['data_quality']=contacts['missing'].clip(upper=2).map(data_quality)

contacts

Unnamed: 0,raw_name,email,phone,raw_address,first_name,last_name,state,city,missing,data_quality
0,John Smith,John.Smith@gmail.com,412345678.0,"123 Main St, Perth",John,Smith,WA,Perth,0,good
1,jane doe,JANE@company.com,498765432.0,456 Oak Ave,Jane,Doe,,,1,fair
2,BOB JOHNSON,bob@email.com,401234567.0,"Sydney, NSW",Bob,Johnson,NSW,Sydney,0,good
3,Alice Brown,alice@email.net,411112222.0,,Alice,Brown,,,2,bad
4,charlie davis,,,"789 Pine Rd, Melbourne, VIC",Charlie,Davis,VIC,Melbourne,2,bad
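
A single count hides which field is missing. One possible alternative, sketched below on hypothetical records, keeps one boolean flag per field and derives the label with `pd.cut`, whose open-ended last bin covers any number of missing fields. The bin edges and labels are assumptions chosen to match the good/fair/bad scheme above.

```python
import pandas as pd

# Hypothetical cleaned records (column names follow the lab)
records = pd.DataFrame({
    'email': ['a@x.com', None, None, None],
    'phone': ['0412345678', '0498765432', None, None],
    'city':  ['Perth', 'Sydney', 'Hobart', None],
})

# One boolean flag per field, so the reason for a low score stays visible
flags = records.isna().add_prefix('missing_')
n_missing = flags.sum(axis=1)

# 0 missing -> good, 1 -> fair, 2 or more -> bad
quality = pd.cut(n_missing, bins=[-1, 0, 1, float('inf')], labels=['good', 'fair', 'bad'])
print(pd.concat([records, flags, quality.rename('data_quality')], axis=1))
```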


### Exercise 4.2 — Duplicate Detection and Resolution (hard)
Find and handle duplicate records intelligently.

In [300]:
# Customer records with potential duplicates
customers_dup = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['John Smith', 'J. Smith', 'Jane Doe', 'Jane M. Doe', 
            'Bob Johnson', 'Robert Johnson', 'Alice Brown', 'Alice B.'],
    'email': ['john@email.com', 'j.smith@email.com', 'jane@gmail.com', 'jane.doe@gmail.com',
             'bob@email.com', 'bob@email.com', 'alice@email.com', 'alice.brown@email.com'],
    'last_purchase': pd.to_datetime(['2025-01-15', '2025-02-20', '2025-03-10', '2025-03-12',
                                     '2025-01-20', '2025-02-15', '2025-03-01', '2025-03-05']),
    'total_spent': [500, 750, 1000, 200, 300, 450, 800, 150]
})

print("Customer records:")
print(customers_dup)
print()

# TODO: Handle duplicates:
# Your code here:

Customer records:
   customer_id            name                  email last_purchase  \
0            1      John Smith         john@email.com    2025-01-15   
1            2        J. Smith      j.smith@email.com    2025-02-20   
2            3        Jane Doe         jane@gmail.com    2025-03-10   
3            4     Jane M. Doe     jane.doe@gmail.com    2025-03-12   
4            5     Bob Johnson          bob@email.com    2025-01-20   
5            6  Robert Johnson          bob@email.com    2025-02-15   
6            7     Alice Brown        alice@email.com    2025-03-01   
7            8        Alice B.  alice.brown@email.com    2025-03-05   

   total_spent  
0          500  
1          750  
2         1000  
3          200  
4          300  
5          450  
6          800  
7          150  



In [307]:
# 1. Find exact email duplicates
exact_email_duplicates = customers_dup[customers_dup['email'].duplicated(keep=False)]
exact_email_duplicates

Unnamed: 0,customer_id,name,email,last_purchase,total_spent
4,5,Bob Johnson,bob@email.com,2025-01-20,300
5,6,Robert Johnson,bob@email.com,2025-02-15,450


In [308]:
# 2. Find potential name duplicates (similar names)
# rename column 'name' to 'full_name'
customers_dup = customers_dup.rename(columns = {'name':'full_name'})

# strip surrounding spaces and split at the first space into two columns
# (a middle initial therefore stays in last_name, e.g. 'Jane M. Doe' -> 'M. Doe')
customers_dup[['first_name','last_name']]= customers_dup['full_name'].str.strip().str.split(n=1, expand=True)

customers_dup

Unnamed: 0,customer_id,full_name,email,last_purchase,total_spent,first_name,last_name
0,1,John Smith,john@email.com,2025-01-15,500,John,Smith
1,2,J. Smith,j.smith@email.com,2025-02-20,750,J.,Smith
2,3,Jane Doe,jane@gmail.com,2025-03-10,1000,Jane,Doe
3,4,Jane M. Doe,jane.doe@gmail.com,2025-03-12,200,Jane,M. Doe
4,5,Bob Johnson,bob@email.com,2025-01-20,300,Bob,Johnson
5,6,Robert Johnson,bob@email.com,2025-02-15,450,Robert,Johnson
6,7,Alice Brown,alice@email.com,2025-03-01,800,Alice,Brown
7,8,Alice B.,alice.brown@email.com,2025-03-05,150,Alice,B.


In [315]:
# blocking key: first four letters of the last name (letters only, lower case)
customers_dup['block'] = customers_dup['last_name'].str.replace(r'[^A-Za-z]','', regex=True).str.lower().str[:4]

In [316]:
# self-join on the block and keep each candidate pair once (customer_id_a < customer_id_b)
duplicated_name = (customers_dup[['customer_id','first_name','last_name','block']]
        .merge(customers_dup[['customer_id','first_name','last_name','block']],
               on='block', suffixes=('_a','_b'))
        .query('customer_id_a < customer_id_b'))
duplicated_name

Unnamed: 0,customer_id_a,first_name_a,last_name_a,block,customer_id_b,first_name_b,last_name_b
1,1,John,Smith,smit,2,J.,Smith
7,5,Bob,Johnson,john,6,Robert,Johnson


In [317]:
customers_dup

Unnamed: 0,customer_id,full_name,email,last_purchase,total_spent,first_name,last_name,block
0,1,John Smith,john@email.com,2025-01-15,500,John,Smith,smit
1,2,J. Smith,j.smith@email.com,2025-02-20,750,J.,Smith,smit
2,3,Jane Doe,jane@gmail.com,2025-03-10,1000,Jane,Doe,doe
3,4,Jane M. Doe,jane.doe@gmail.com,2025-03-12,200,Jane,M. Doe,mdoe
4,5,Bob Johnson,bob@email.com,2025-01-20,300,Bob,Johnson,john
5,6,Robert Johnson,bob@email.com,2025-02-15,450,Robert,Johnson,john
6,7,Alice Brown,alice@email.com,2025-03-01,800,Alice,Brown,brow
7,8,Alice B.,alice.brown@email.com,2025-03-05,150,Alice,B.,b
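
The blocking key only pairs rows whose last-name prefix is identical, so 'Jane Doe' / 'Jane M. Doe' (blocks `doe` and `mdoe`) and 'Alice Brown' / 'Alice B.' (blocks `brow` and `b`) are not matched. A possible complement is to score every pair of full names with a string-similarity measure. The sketch below uses `difflib.SequenceMatcher` from the standard library; the `normalize` helper and the `THRESHOLD` of 0.7 are hypothetical choices that would need tuning on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

# Names copied from the lab data; the index plays the role of customer_id
names = pd.Series(
    ['John Smith', 'J. Smith', 'Jane Doe', 'Jane M. Doe',
     'Bob Johnson', 'Robert Johnson', 'Alice Brown', 'Alice B.'],
    index=range(1, 9), name='full_name')

def normalize(name: str) -> str:
    """Lower-case the name and keep only letters and single spaces."""
    kept = ''.join(ch for ch in name.lower() if ch.isalpha() or ch == ' ')
    return ' '.join(kept.split())

THRESHOLD = 0.7  # hypothetical cut-off, tune on real data

pairs = []
for (id_a, a), (id_b, b) in combinations(names.items(), 2):
    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    if score >= THRESHOLD:
        pairs.append({'id_a': id_a, 'id_b': id_b,
                      'name_a': a, 'name_b': b, 'score': round(score, 2)})

candidates = pd.DataFrame(pairs)
print(candidates)
```

Comparing all pairs grows quadratically with the number of rows, so on larger tables this kind of scoring is usually applied within blocks rather than instead of them. The flagged pairs are candidates for review, not confirmed duplicates.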


In [328]:
# 3. Merge duplicate records (keep most recent, sum totals)
# total spent across each group of possible duplicates (rows sharing a 'block')
customers_dup['sum_total_spent']=customers_dup.groupby('block')['total_spent'].transform('sum')
# most recent purchase date within each group
customers_dup['most_recent_purchase']=customers_dup.groupby('block')['last_purchase'].transform('max')
customers_dup


Unnamed: 0,customer_id,full_name,email,last_purchase,total_spent,first_name,last_name,block,sum_total_spent,most_recent_purchase
0,1,John Smith,john@email.com,2025-01-15,500,John,Smith,smit,1250,2025-02-20
1,2,J. Smith,j.smith@email.com,2025-02-20,750,J.,Smith,smit,1250,2025-02-20
2,3,Jane Doe,jane@gmail.com,2025-03-10,1000,Jane,Doe,doe,1000,2025-03-10
3,4,Jane M. Doe,jane.doe@gmail.com,2025-03-12,200,Jane,M. Doe,mdoe,200,2025-03-12
4,5,Bob Johnson,bob@email.com,2025-01-20,300,Bob,Johnson,john,750,2025-02-15
5,6,Robert Johnson,bob@email.com,2025-02-15,450,Robert,Johnson,john,750,2025-02-15
6,7,Alice Brown,alice@email.com,2025-03-01,800,Alice,Brown,brow,800,2025-03-01
7,8,Alice B.,alice.brown@email.com,2025-03-05,150,Alice,B.,b,150,2025-03-05


In [329]:
# identify rows whose block appears more than once (possible duplicates)
last_name_duplicates = customers_dup[customers_dup['block'].duplicated(keep=False)]
last_name_duplicates

Unnamed: 0,customer_id,full_name,email,last_purchase,total_spent,first_name,last_name,block,sum_total_spent,most_recent_purchase
0,1,John Smith,john@email.com,2025-01-15,500,John,Smith,smit,1250,2025-02-20
1,2,J. Smith,j.smith@email.com,2025-02-20,750,J.,Smith,smit,1250,2025-02-20
4,5,Bob Johnson,bob@email.com,2025-01-20,300,Bob,Johnson,john,750,2025-02-15
5,6,Robert Johnson,bob@email.com,2025-02-15,450,Robert,Johnson,john,750,2025-02-15


In [334]:
# within each block, retain the row with the most recent purchase
retain_rows = last_name_duplicates.sort_values('last_purchase').drop_duplicates(subset='block', keep ='last')
retain_rows

Unnamed: 0,customer_id,full_name,email,last_purchase,total_spent,first_name,last_name,block,sum_total_spent,most_recent_purchase
5,6,Robert Johnson,bob@email.com,2025-02-15,450,Robert,Johnson,john,750,2025-02-15
1,2,J. Smith,j.smith@email.com,2025-02-20,750,J.,Smith,smit,1250,2025-02-20


In [338]:
# apply the same rule to the full table, then drop the helper columns
customer_unique = (customers_dup.sort_values('last_purchase').drop_duplicates(subset='block', keep ='last')).sort_values('customer_id')
customer_unique=customer_unique.drop(columns =['block','total_spent','most_recent_purchase'])
customer_unique

Unnamed: 0,customer_id,full_name,email,last_purchase,first_name,last_name,sum_total_spent
1,2,J. Smith,j.smith@email.com,2025-02-20,J.,Smith,1250
2,3,Jane Doe,jane@gmail.com,2025-03-10,Jane,Doe,1000
3,4,Jane M. Doe,jane.doe@gmail.com,2025-03-12,Jane,M. Doe,200
5,6,Robert Johnson,bob@email.com,2025-02-15,Robert,Johnson,750
6,7,Alice Brown,alice@email.com,2025-03-01,Alice,Brown,800
7,8,Alice B.,alice.brown@email.com,2025-03-05,Alice,B.,150
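
The same "keep most recent, sum totals" rule can also be written as a single `groupby().agg()` with named aggregations. The sketch below is one possible formulation on hypothetical records; `group` stands in for whatever duplicate key is used (the `block` column above) and `merged_ids` is an illustrative extra column that records which rows were combined.

```python
import pandas as pd

# Hypothetical records that already carry a duplicate-group key
records = pd.DataFrame({
    'group': ['smit', 'smit', 'doe'],
    'customer_id': [1, 2, 3],
    'email': ['john@email.com', 'j.smith@email.com', 'jane@gmail.com'],
    'last_purchase': pd.to_datetime(['2025-01-15', '2025-02-20', '2025-03-10']),
    'total_spent': [500, 750, 1000],
})

merged = (records.sort_values('last_purchase')        # oldest first, so 'last' is the newest row
          .groupby('group', as_index=False)
          .agg(customer_id=('customer_id', 'last'),    # id of the most recent record
               email=('email', 'last'),                # email of the most recent record
               last_purchase=('last_purchase', 'max'),
               total_spent=('total_spent', 'sum'),     # combined spend of the group
               merged_ids=('customer_id', list)))      # audit trail of merged ids
print(merged)
```

Keeping the list of merged ids makes the deduplication reversible and gives the report something concrete to show.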


In [355]:
# 4. Create a deduplication report
report_cols = ['customer_id', 'full_name', 'email', 'last_purchase']
print(f"Original table has {len(customers_dup)} rows.")
print(f"\nPossible duplicates ({len(last_name_duplicates)} rows): \n{last_name_duplicates[report_cols + ['total_spent']]}")
print(f"\nRetained ({len(retain_rows)} rows): \n{retain_rows[report_cols + ['sum_total_spent']]}")
print(f"\nDeduplicated table has {len(customer_unique)} rows: \n{customer_unique[report_cols + ['sum_total_spent']]}")

Original table has 8 rows.

Possible duplicates (4 rows): 
   customer_id       full_name              email last_purchase  total_spent
0            1      John Smith     john@email.com    2025-01-15          500
1            2        J. Smith  j.smith@email.com    2025-02-20          750
4            5     Bob Johnson      bob@email.com    2025-01-20          300
5            6  Robert Johnson      bob@email.com    2025-02-15          450

Retained (2 rows): 
   customer_id       full_name              email last_purchase  \
5            6  Robert Johnson      bob@email.com    2025-02-15   
1            2        J. Smith  j.smith@email.com    2025-02-20   

   sum_total_spent  
5              750  
1             1250  

Deduplicated table has 6 rows: 
   customer_id       full_name                  email last_purchase  \
1            2        J. Smith      j.smith@email.com    2025-02-20   
2            3        Jane Doe         jane@gmail.com    2025-03-10   
3            4     Jane M.

## 🚀 Challenge: Complete Data Pipeline
Build an end-to-end data processing pipeline.

In [356]:
# E-commerce data pipeline challenge
# You have three data sources that need to be combined and analyzed

# Source 1: Order data
orders = pd.DataFrame({
    'order_id': range(1, 101),
    'customer_id': np.random.randint(1, 31, 100),
    'product_id': np.random.choice(['P1', 'P2', 'P3', 'P4', 'P5'], 100),
    'quantity': np.random.randint(1, 5, 100),
    'order_date': pd.date_range('2025-07-01', periods=100),
    'status': np.random.choice(['Completed', 'Pending', 'Cancelled'], 100, p=[0.8, 0.15, 0.05])
})

# Source 2: Product data
products = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'unit_price': [1200, 25, 80, 350, 120],
    'cost': [800, 15, 50, 250, 70]
})

# Source 3: Customer data
customers = pd.DataFrame({
    'customer_id': range(1, 31),
    'customer_name': [f'Customer_{i}' for i in range(1, 31)],
    'segment': np.random.choice(['Premium', 'Standard', 'Basic'], 30, p=[0.2, 0.5, 0.3]),
    'join_date': pd.date_range('2024-01-01', periods=30, freq='W')
})

# TODO: Build a complete analysis pipeline:
# Your code here:

In [357]:
orders

Unnamed: 0,order_id,customer_id,product_id,quantity,order_date,status
0,1,5,P5,1,2025-07-01,Completed
1,2,20,P4,3,2025-07-02,Completed
2,3,18,P1,1,2025-07-03,Completed
3,4,26,P4,2,2025-07-04,Completed
4,5,18,P4,1,2025-07-05,Completed
...,...,...,...,...,...,...
95,96,17,P4,4,2025-10-04,Completed
96,97,1,P2,2,2025-10-05,Cancelled
97,98,28,P3,2,2025-10-06,Completed
98,99,17,P3,4,2025-10-07,Cancelled


In [358]:
products

Unnamed: 0,product_id,product_name,category,unit_price,cost
0,P1,Laptop,Electronics,1200,800
1,P2,Mouse,Accessories,25,15
2,P3,Keyboard,Accessories,80,50
3,P4,Monitor,Electronics,350,250
4,P5,Webcam,Accessories,120,70


In [359]:
customers

Unnamed: 0,customer_id,customer_name,segment,join_date
0,1,Customer_1,Basic,2024-01-07
1,2,Customer_2,Standard,2024-01-14
2,3,Customer_3,Standard,2024-01-21
3,4,Customer_4,Basic,2024-01-28
4,5,Customer_5,Basic,2024-02-04
...,...,...,...,...
25,26,Customer_26,Standard,2024-06-30
26,27,Customer_27,Premium,2024-07-07
27,28,Customer_28,Premium,2024-07-14
28,29,Customer_29,Standard,2024-07-21


In [362]:
# 1. Merge all three datasets
merge_df = pd.merge(orders, customers, on = 'customer_id', how ='left')
merge_df = pd.merge(merge_df, products, on = 'product_id', how = 'left')
merge_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   order_id       100 non-null    int64         
 1   customer_id    100 non-null    int32         
 2   product_id     100 non-null    object        
 3   quantity       100 non-null    int32         
 4   order_date     100 non-null    datetime64[ns]
 5   status         100 non-null    object        
 6   customer_name  100 non-null    object        
 7   segment        100 non-null    object        
 8   join_date      100 non-null    datetime64[ns]
 9   product_name   100 non-null    object        
 10  category       100 non-null    object        
 11  unit_price     100 non-null    int64         
 12  cost           100 non-null    int64         
dtypes: datetime64[ns](2), int32(2), int64(3), object(6)
memory usage: 9.5+ KB
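
The `info()` output confirms that no rows were lost or left unmatched. `merge` can also check this directly through its `validate` and `indicator` parameters. The sketch below uses hypothetical miniature tables (`orders_small`, `products_small`) in which one order refers to an unknown product.

```python
import pandas as pd

# Hypothetical miniature versions of the orders and products tables
orders_small = pd.DataFrame({'order_id': [1, 2, 3], 'product_id': ['P1', 'P2', 'P9']})
products_small = pd.DataFrame({'product_id': ['P1', 'P2'], 'unit_price': [1200, 25]})

checked = orders_small.merge(
    products_small,
    on='product_id',
    how='left',
    validate='many_to_one',   # raises MergeError if product_id repeats in products_small
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Orders whose product_id has no match in the product table
unmatched = checked[checked['_merge'] == 'left_only']
print(f"{len(unmatched)} order(s) without a matching product")
print(unmatched)
```

With `validate='many_to_one'`, an accidental duplicate key in the lookup table fails loudly instead of silently multiplying order rows.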


In [433]:
# 2. Calculate order values and profit margins
# keep completed orders only
completed_df= merge_df.loc[merge_df['status']=='Completed'].copy()

# calculate order values
completed_df['order_values']= completed_df['quantity']* completed_df['unit_price']

# calculate profit per order in dollars
# (the column is named 'profit_margin', but it holds an absolute amount, not a percentage)
completed_df['profit_margin']=completed_df['quantity']*(completed_df['unit_price']-completed_df['cost'])
completed_df.head()

Unnamed: 0,order_id,customer_id,product_id,quantity,order_date,status,customer_name,segment,join_date,product_name,category,unit_price,cost,order_values,profit_margin
0,1,5,P5,1,2025-07-01,Completed,Customer_5,Basic,2024-02-04,Webcam,Accessories,120,70,120,50
1,2,20,P4,3,2025-07-02,Completed,Customer_20,Standard,2024-05-19,Monitor,Electronics,350,250,1050,300
2,3,18,P1,1,2025-07-03,Completed,Customer_18,Standard,2024-05-05,Laptop,Electronics,1200,800,1200,400
3,4,26,P4,2,2025-07-04,Completed,Customer_26,Standard,2024-06-30,Monitor,Electronics,350,250,700,200
4,5,18,P4,1,2025-07-05,Completed,Customer_18,Standard,2024-05-05,Monitor,Electronics,350,250,350,100
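
If a margin in the usual sense (profit as a share of revenue) is wanted, it can be derived from the same columns. The sketch below uses hypothetical order lines with the lab's prices and costs; `profit` and `margin_pct` are illustrative column names, not columns of `completed_df`.

```python
import pandas as pd

# Hypothetical order lines using the lab's product prices and costs
lines = pd.DataFrame({
    'product_name': ['Laptop', 'Webcam', 'Monitor'],
    'quantity': [1, 2, 3],
    'unit_price': [1200, 120, 350],
    'cost': [800, 70, 250],
})

lines['order_value'] = lines['quantity'] * lines['unit_price']
lines['profit'] = lines['quantity'] * (lines['unit_price'] - lines['cost'])

# Margin as a share of revenue, in percent
lines['margin_pct'] = (lines['profit'] / lines['order_value'] * 100).round(1)
print(lines)
```

The percentage does not depend on quantity, so it compares products fairly: the Webcam earns less profit per unit than the Laptop but has the higher margin.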


In [434]:
# 3. Analyze sales by customer segment and product category
unique_category = completed_df['category'].unique()
print(f"Categories: {unique_category}")

total_sales_category = completed_df.groupby(['segment','category'])['order_values'].sum()
print(f"\nTotal sales by product category in each segment ($): \n{total_sales_category}")

total_sales_segment = completed_df.groupby('segment')['order_values'].sum()
print(f"\nTotal sales by segment ($): \n{total_sales_segment}")


Categories: ['Accessories' 'Electronics']

Total sales by product category in each segment ($): 
segment   category   
Basic     Accessories     2875
          Electronics    13800
Premium   Accessories     2345
          Electronics    14550
Standard  Accessories     2425
          Electronics    22900
Name: order_values, dtype: int64

Total sales by segment ($): 
segment
Basic       16675
Premium     16895
Standard    25325
Name: order_values, dtype: int64


In [443]:
# 4. Find top customers and products
unique_customers = completed_df['customer_name'].unique()
print(f"There are {len(unique_customers)} customers with completed orders.")

total_sales_bycustomers = completed_df.groupby('customer_name')['order_values'].sum().sort_values(ascending=False)
print(f"\nTop 5 customers by total sales ($): \n{total_sales_bycustomers.nlargest(5)}")

unique_products = completed_df['product_name'].unique()
print(f"\nProduct names: {unique_products}")

total_sales_byproducts = completed_df.groupby('product_name')['order_values'].sum().sort_values(ascending=False)
print(f"\nTotal sales by product ($): \n{total_sales_byproducts}")

There are 28 customers with completed orders.

Top 5 customers by total sales ($): 
customer_name
Customer_29    8480
Customer_19    6585
Customer_1     6170
Customer_11    4940
Customer_6     3680
Name: order_values, dtype: int64

Product names: ['Webcam' 'Monitor' 'Laptop' 'Keyboard' 'Mouse']

Total sales by product ($): 
product_name
Laptop      34800
Monitor     16450
Keyboard     3680
Webcam       3240
Mouse         725
Name: order_values, dtype: int64


In [450]:
# 5. Calculate customer lifetime value (approximated as profit per year since joining)

completed_df['last_purchase'] = completed_df.groupby('customer_name')['order_date'].transform('max')
completed_df['joined_duration']=round((completed_df['last_purchase']-completed_df['join_date'])/pd.Timedelta(days=365.25),2)

completed_df['total_profit'] = completed_df.groupby('customer_name')['profit_margin'].transform('sum')
completed_df['total_profit_peryear']=round(completed_df['total_profit']/completed_df['joined_duration'],2)
# the per-year value is repeated on every order row, so take one value per customer
# (summing it would multiply the result by the customer's number of orders)
profit_peryear = completed_df.groupby('customer_name')['total_profit_peryear'].first()
print(f"Customer with highest total profit per year: {profit_peryear.idxmax()} with {profit_peryear.max():.2f} ($/yr)")


Customer with highest total profit per year: Customer_29 with 2695.24 ($/yr)


In [451]:
# 6. Create monthly sales trend
completed_df['month']=completed_df['order_date'].dt.month
monthly_trend = completed_df.groupby('month')['order_values'].sum()
print(f"Total order values by month ($): \n{monthly_trend}")

Total order values by month ($): 
month
7     16850
8     23165
9     15640
10     3240
Name: order_values, dtype: int64
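
Grouping by `dt.month` works here because all orders fall within 2025. For data that spans several years, one possible alternative is to group by a monthly period so that the year is kept. The sketch below contrasts the two on hypothetical orders that cross a year boundary.

```python
import pandas as pd

# Hypothetical orders spanning a year boundary
demo = pd.DataFrame({
    'order_date': pd.to_datetime(['2024-12-30', '2025-01-02', '2025-01-15', '2025-12-01']),
    'order_values': [100, 200, 300, 400],
})

# Month number only: December 2024 and December 2025 collapse into month 12
by_month_number = demo.groupby(demo['order_date'].dt.month)['order_values'].sum()

# Monthly period: one row per calendar month, year included
by_period = demo.groupby(demo['order_date'].dt.to_period('M'))['order_values'].sum()

print(by_month_number)
print(by_period)
```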


In [452]:
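# 7. Identify cross-selling opportunities
# count completed orders per segment and product; low counts point to products a segment rarely buys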
cross_selling = pd.crosstab(
    completed_df['segment'],
    completed_df['product_name'],
    margins = True,
    margins_name = 'Total'
)
cross_selling

product_name,Keyboard,Laptop,Monitor,Mouse,Webcam,Total
segment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Basic,7,5,4,4,3,23
Premium,6,3,7,4,4,24
Standard,6,6,6,3,6,27
Total,19,14,17,11,13,74
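
The table above counts orders per segment and product. Another common way to look for cross-selling opportunities is a co-purchase matrix that counts how many customers bought each pair of products. The sketch below is a minimal example on hypothetical purchases; `basket` and `co_purchase` are illustrative names.

```python
import pandas as pd

# Hypothetical completed orders: one row per purchase
purchases = pd.DataFrame({
    'customer_id':  [1, 1, 2, 2, 3, 3, 3],
    'product_name': ['Laptop', 'Mouse', 'Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Laptop'],
})

# Customer x product indicator matrix (1 = bought at least once)
basket = (pd.crosstab(purchases['customer_id'], purchases['product_name']) > 0).astype(int)

# Product x product matrix: number of customers who bought both products
# (the diagonal holds the number of customers who bought each product)
co_purchase = basket.T.dot(basket)
print(co_purchase)
```

Pairs with a high off-diagonal count are often bought by the same customers, so customers who own only one product of such a pair are natural targets for the other.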


In [468]:
# 8. Generate executive summary DataFrame
# filter to top 5 customers based on the total sales
top5_customers=total_sales_bycustomers.nlargest(5)
df = top5_customers.reset_index(name='total_sales') 

# merge with customers table to obtain segment and join date
df = pd.merge(df, customers, on='customer_name', how='left')

# add each customer's total profit (sum of 'profit_margin' over completed orders)
total_profit = completed_df.groupby('customer_name')['profit_margin'].sum()

df = pd.merge(df, total_profit, on ='customer_name', how='left')
df

Unnamed: 0,customer_name,total_sales,customer_id,segment,join_date,profit_margin
0,Customer_29,8480,29,Standard,2024-07-21,2830
1,Customer_19,6585,19,Premium,2024-05-12,2150
2,Customer_1,6170,1,Basic,2024-01-07,2020
3,Customer_11,4940,11,Basic,2024-03-17,1640
4,Customer_6,3680,6,Standard,2024-02-11,1230


## 📊 Lab Summary Checklist

**Core Skills Practiced:**
- [ ] GroupBy with single and multiple columns
- [ ] Custom aggregations with agg()
- [ ] Different types of merges (inner, left, outer)
- [ ] Pivot tables and reshaping
- [ ] Data cleaning and deduplication
- [ ] Complete data pipeline

**Self-Assessment:**
- I can group and aggregate data efficiently ✅
- I understand different join types ✅
- I can reshape data between wide and long formats ✅
- I can clean messy real-world data ✅
- I can build data processing pipelines ✅

## 🎯 What's Next?
**Lab 01C:** Advanced EDA techniques and statistical analysis!