# Exercise 4.8 Task: Grouping & Aggregating Data

This notebook contains the task work for Exercise 4.8.
The objective is to apply grouping and aggregation techniques
to the full Instacart dataset and analyze the results.

# Step 1: Import libraries and load the merged dataset

In [12]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [13]:
# Set base path to Prepared Data folder
path = r'/Users/jessduong/Documents/CF/Achievement 4_Python/12-2025 Instacart Basket Analysis/02 Data/Prepared Data'

In [3]:
# Load the merged dataset with derived columns
ords_prods_merge = pd.read_pickle(os.path.join(path, 'ords_prods_merge.pkl'))

In [11]:
# Quick shape check
ords_prods_merge.shape

(32435059, 19)

# Step 2: Aggregate mean of order_number by department_id (Full dataset)

In [6]:
# Calculate mean order_number grouped by department_id for the full dataset
dept_order_mean_full = (ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']}))

In [14]:
# Check
dept_order_mean_full

department_id,mean_order_number
1.0,15.457838
2.0,17.27792
3.0,17.170395
4.0,17.811403
5.0,15.215751
6.0,16.439806
7.0,17.225802
8.0,15.34065
9.0,15.895474
10.0,20.197148


In [10]:
# Rename column for clarity
dept_order_mean_full.columns = ['mean_order_number']

In [9]:
# Check rename of dept_order_mean_full
dept_order_mean_full

department_id,mean_order_number
1.0,15.457838
2.0,17.27792
3.0,17.170395
4.0,17.811403
5.0,15.215751
6.0,16.439806
7.0,17.225802
8.0,15.34065
9.0,15.895474
10.0,20.197148


# Step 3. Analysis of Aggregated Results

In the lesson, the mean of the 'order_number' column grouped by 'department_id' was calculated using only a subset of the dataset (the first 1,000,000 rows). In this task, the same aggregation was performed using the entire dataset.

Using the full dataset results in more stable and representative mean values. With the subset, some departments may have been overrepresented or underrepresented depending on which rows were included. When the entire dataset is used, the averages reflect the full range of customer ordering behavior.

Overall, the structure of the results remains the same, but the values from the full dataset provide a more accurate picture of the average order number by department.
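
To illustrate how a subset can shift group means, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not Instacart data; `toy`, `subset_mean`, `full_mean`, and `comparison` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for ords_prods_merge (hypothetical values, not Instacart data)
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 1, 2],
    'order_number':  [1, 3, 2, 4, 11, 12],
})

# Mean computed from the first four rows only, mimicking the lesson's subset approach
subset_mean = toy[:4].groupby('department_id')['order_number'].mean()

# Mean computed from every row, mimicking this task
full_mean = toy.groupby('department_id')['order_number'].mean()

# Put both results side by side and show how far apart they are
comparison = pd.DataFrame({'subset_mean': subset_mean, 'full_mean': full_mean})
comparison['difference'] = comparison['full_mean'] - comparison['subset_mean']
print(comparison)
```

In this toy case the later rows contain higher order numbers, so the subset understates both department means. The same kind of comparison could be run on the real data by swapping in `ords_prods_merge[:1000000]`.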

# Step 4. Create loyalty_flag based on max_order thresholds

In [38]:
# Create max_order column using groupby and transform
# This stores the maximum order_number for each user_id
ords_prods_merge['max_order'] = (ords_prods_merge.groupby('user_id')['order_number'].transform('max'))

In [37]:
# Create loyalty_flag based on max_order thresholds

In [28]:
ords_prods_merge.loc[ords_prods_merge['max_order'] > 40,'loyalty_flag'] = 'Loyal customer'

In [26]:
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [18]:
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [27]:
# Confirm loyalty_flag distribution
ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15891507
Loyal customer      10294027
New customer         6249525
Name: count, dtype: int64

In [29]:
# Spot-check loyalty flag against max_order
ords_prods_merge[['user_id', 'max_order', 'loyalty_flag']].drop_duplicates().head(10)

Unnamed: 0,user_id,max_order,loyalty_flag
0,1,10,New customer
59,2,14,Regular customer
254,3,12,Regular customer
342,4,5,New customer
360,5,4,New customer
397,6,3,New customer
411,7,20,Regular customer
617,8,3,New customer
666,9,3,New customer
742,10,5,New customer


The loyalty_flag column was validated using value counts and a spot-check
against max_order to ensure correct classification.
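
As a possible alternative to three separate `.loc` assignments, the same thresholds can be expressed in one step with `pd.cut`. This is a sketch on toy data, not the method used in this notebook. The bin edges assume the cutoffs above (10 and 40), and the result is a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ords_prods_merge (hypothetical values)
toy = pd.DataFrame({
    'user_id':      [1, 1, 2, 2, 3, 3],
    'order_number': [1, 3, 5, 12, 40, 41],
})

# transform('max') returns one value per row, so every row of a user
# carries that user's maximum order_number
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

# Right-closed bins: (0, 10] -> New, (10, 40] -> Regular, (40, inf) -> Loyal
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[0, 10, 40, np.inf],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)
print(toy)
```

Because the bins are closed on the right, a `max_order` of exactly 10 lands in 'New customer' and exactly 40 in 'Regular customer', which matches the `<=` comparisons used above.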

# Step 5. Group by loyalty_flag and summarize prices

In [31]:
# Checking column names
ords_prods_merge.columns

Index(['order_id', 'user_id', 'eval_set', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_previous_order', 'first_order_flag',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices', 'price_range_loc', 'busiest_day',
       'busiest_days_2', 'busiest_period_of_day', 'max_order', 'loyalty_flag'],
      dtype='object')

In [33]:
# Compare product price statistics by loyalty group
price_by_loyalty = (ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'median', 'min', 'max']}))

In [34]:
# Check the aggregated result
price_by_loyalty

loyalty_flag,prices (mean),prices (median),prices (min),prices (max)
Loyal customer,7.773575,7.4,1.0,25.0
New customer,7.801206,7.4,1.0,25.0
Regular customer,7.798262,7.4,1.0,25.0


In [35]:
# Rename columns for readability
price_by_loyalty.columns = ['mean_price','median_price','min_price','max_price']

In [36]:
# Check for rename of columns
price_by_loyalty

loyalty_flag,mean_price,median_price,min_price,max_price
Loyal customer,7.773575,7.4,1.0,25.0
New customer,7.801206,7.4,1.0,25.0
Regular customer,7.798262,7.4,1.0,25.0


# Step 5. Analysis of Product Prices by Loyalty Group

Descriptive statistics were calculated for product prices across loyal, regular, and new customer groups. The results show that while the median product price is identical across all three groups ($7.40), small differences exist in mean prices.

New customers exhibit the highest average product price (7.80), followed closely by regular customers (7.80), while loyal customers have the lowest average price (7.77). However, these differences are minimal, and all groups share the same minimum and maximum prices (1.0 and 25.0).

Overall, this suggests that loyalty status does not meaningfully impact the price point of products purchased. Differences in customer value are therefore more likely driven by purchase frequency and total order volume than by higher-priced items.
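
The aggregate-then-rename pattern used for `price_by_loyalty` could also be written with pandas named aggregation, which produces flat column names directly. The sketch below uses toy prices (hypothetical values) and is an alternative, not the approach taken in this notebook.

```python
import pandas as pd

# Toy stand-in for ords_prods_merge (hypothetical values)
toy = pd.DataFrame({
    'loyalty_flag': ['New customer', 'New customer', 'Loyal customer', 'Loyal customer'],
    'prices':       [2.0, 4.0, 1.0, 5.0],
})

# Named aggregation: output_column=(input_column, function)
# The result has single-level column names, so no rename step is needed
price_by_loyalty_toy = toy.groupby('loyalty_flag').agg(
    mean_price=('prices', 'mean'),
    median_price=('prices', 'median'),
    min_price=('prices', 'min'),
    max_price=('prices', 'max'),
)
print(price_by_loyalty_toy)
```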


# Step 6. Calculate mean product price per user

In [39]:
# Calculate mean product price per user
ords_prods_merge['mean_price_user'] = (ords_prods_merge.groupby('user_id')['prices'].transform('mean'))

In [40]:
# Create spending flag based on average product price

In [45]:
ords_prods_merge.loc[ords_prods_merge['mean_price_user'] < 10,'spending_flag'] = 'Low spender'

In [47]:
ords_prods_merge.loc[ords_prods_merge['mean_price_user'] >= 10,'spending_flag'] = 'High spender'

In [49]:
# Check distribution
ords_prods_merge['spending_flag'].value_counts(dropna = False)

spending_flag
Low spender     32315179
High spender      119880
Name: count, dtype: int64

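Only about 0.4% of rows (119,880 of 32,435,059) belong to high spenders. Because loyal customers account for more than 10 million rows, most of those rows must belong to low spenders per item, which suggests that loyalty is driven by convenience rather than premium purchasing.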
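
The relationship between loyalty and spending could be checked more directly by cross-tabulating the two flags at the user level. The following is a minimal sketch on toy data (hypothetical values). It assumes each user has a single `loyalty_flag` and `spending_flag`, so dropping duplicate `user_id` rows leaves one row per customer.

```python
import pandas as pd

# Toy stand-in for ords_prods_merge (hypothetical values)
toy = pd.DataFrame({
    'user_id':       [1, 1, 2, 3, 3, 4],
    'loyalty_flag':  ['Loyal customer', 'Loyal customer', 'Loyal customer',
                      'New customer', 'New customer', 'New customer'],
    'spending_flag': ['Low spender', 'Low spender', 'High spender',
                      'Low spender', 'Low spender', 'Low spender'],
})

# Keep one row per user so that heavy buyers are not counted many times
users = toy.drop_duplicates(subset='user_id')

# Share of each spending flag within each loyalty group (each row sums to 1)
share = pd.crosstab(users['loyalty_flag'], users['spending_flag'], normalize='index')
print(share)
```

Counting users rather than rows matters here because `value_counts()` on the full DataFrame weights each customer by the number of products they bought.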

# Step 7. Order frequency flag (Median-based)

In [73]:
# Calculate median days since prior order per user
ords_prods_merge['median_days_since_prior_order'] = (ords_prods_merge.groupby('user_id')['days_since_previous_order'].transform('median'))

In [54]:
# Create order frequency flag based on median days since previous order

In [74]:
# Non-frequent customer
ords_prods_merge.loc[ords_prods_merge['median_days_since_prior_order'] > 20,'order_frequency_flag'] = 'Non-frequent customer'

In [76]:
# Regular customer
ords_prods_merge.loc[(ords_prods_merge['median_days_since_prior_order'] > 10) & (ords_prods_merge['median_days_since_prior_order'] <= 20), 'order_frequency_flag'] = 'Regular customer'

In [78]:
# Frequent customer
ords_prods_merge.loc[ords_prods_merge['median_days_since_prior_order'] <= 10,'order_frequency_flag'] = 'Frequent customer'

In [69]:
# validate frequency flag creations
ords_prods_merge['order_frequency_flag'].value_counts(dropna = False)

order_frequency_flag
Frequent customer        22816041
Regular customer          6929012
Non-frequent customer     2690006
Name: count, dtype: int64

In [79]:
# Validation check
ords_prods_merge[['user_id', 'median_days_since_prior_order', 'order_frequency_flag']].drop_duplicates().head(10)

Unnamed: 0,user_id,median_days_since_prior_order,order_frequency_flag
0,1,20.0,Regular customer
59,2,13.0,Regular customer
254,3,9.0,Frequent customer
342,4,17.0,Regular customer
360,5,11.0,Regular customer
397,6,6.0,Frequent customer
411,7,8.5,Frequent customer
617,8,30.0,Non-frequent customer
666,9,6.0,Frequent customer
742,10,23.0,Non-frequent customer
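
The three frequency thresholds can also be applied in a single call with `np.select`, which makes the handling of missing medians explicit. This is a sketch on toy values, not the notebook's method. The 'Unknown' label is a hypothetical default for users whose median is NaN; the value counts above show no such rows in the real data.

```python
import numpy as np
import pandas as pd

# Toy per-row medians (hypothetical values), including one missing value
toy = pd.DataFrame({
    'median_days_since_prior_order': [20.0, 9.0, 30.0, np.nan, 10.0, 10.5],
})
med = toy['median_days_since_prior_order']

# Conditions are checked in order; comparisons with NaN are always False
conditions = [
    med > 20,
    (med > 10) & (med <= 20),
    med <= 10,
]
choices = ['Non-frequent customer', 'Regular customer', 'Frequent customer']

# Rows that match no condition (NaN medians) receive the hypothetical default
toy['order_frequency_flag'] = np.select(conditions, choices, default='Unknown')
print(toy)
```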


# Exercise 4.8 Summary: Grouping & Aggregation

In this exercise, I used grouping and aggregation techniques in Python to derive customer-level features related to loyalty, spending behavior, and order frequency. These derived variables were validated through summary statistics and spot checks and will be used to support further analysis in the final project.

# Step 9. Export as pickle

In [80]:
# Step 9: Export the final dataframe as a pickle file

ords_prods_merge.to_pickle(os.path.join(path, 'ords_prods_merge.pkl'))
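
Since the export writes to the same file name the data was loaded from, a round-trip check can confirm that the new columns survive the save. The sketch below uses a toy DataFrame and a temporary directory; the file name `toy_flags.pkl` is hypothetical and is not part of the project folder.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final dataframe (hypothetical values)
toy = pd.DataFrame({
    'user_id':      [1, 2],
    'loyalty_flag': ['New customer', 'Loyal customer'],
})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_flags.pkl')
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# equals() compares shape, dtypes, and values
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```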