# Grouping Data & Aggregating Variables

## This script contains the following points:

1. Import libraries and ords_prods_merge dataframe.


2. Find the aggregated mean of the “order_number” column grouped by “department_id” for the entire dataframe.


3. Analyze the results of Step 2 in a markdown cell.


4. Create a loyalty flag for existing customers using the transform() function and the loc indexer.


5. Check the basic statistics of the product prices for each loyalty category ("loyal customer," "regular customer," and "new customer") to determine whether the prices of products purchased by loyal customers differ from those purchased by regular or new customers.


6. Create a spending flag for each user based on the average price across all their orders.


7. Create an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column.


8. Export dataframe as a pickle file and store it correctly in “Prepared Data” folder.


## 1. Import libraries and ords_prods_merge dataframe.

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [2]:
# Turn project folder path into a string

path = r'C:\Users\danie\Desktop\CareerFoundry\Achievement 4-Python\11-2023 Instacart Basket Analysis'
path

'C:\\Users\\danie\\Desktop\\CareerFoundry\\Achievement 4-Python\\11-2023 Instacart Basket Analysis'

In [3]:
# Import ords_prods_merge dataframe from Task 4.7.

ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_derived.pkl'))

## 2. Find the aggregated mean of the “order_number” column grouped by “department_id” for the entire dataframe.

In [5]:
ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148
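
To make the grouped aggregation above easier to follow, here is a minimal sketch on a toy dataframe. The values and the names `toy` and `dept_means` are made up for illustration and are not taken from the Instacart data.

```python
import pandas as pd

# Toy data (hypothetical values, not the Instacart dataset)
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 2],
    'order_number': [4, 6, 10, 20, 30],
})

# Same pattern as the notebook cell: one mean of order_number per department
dept_means = toy.groupby('department_id').agg({'order_number': ['mean']})

# The result has two-level column labels: ('order_number', 'mean')
print(dept_means)
```

Passing a list such as `['mean']` to `agg()` is what produces the two-level column header seen in the output; passing the plain string `'mean'` would give a single-level column instead.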


## 3. Analyze the results of Step 2 in a markdown cell.

Unlike the subset used in the course content, the full dataframe yields an average for every department. The averages are also generally lower than those calculated from the subset. This makes sense because the subset did not reflect the breadth of the entire data set.

## 4. Create a loyalty flag for existing customers using the transform() function and the loc indexer.

In [7]:
# Split data into groups based on user_id column, apply transform() to order_number to find maximum orders for each user, and place results in new column, "max_order."

ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')
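
As a rough illustration of why transform() is used here rather than a plain aggregation, the sketch below compares the two on a toy dataframe. The data and the names `toy` and `per_user` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two users with three and two orders
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# A plain aggregation reduces each group to one row per user
per_user = toy.groupby('user_id')['order_number'].max()

# transform() keeps the original length and repeats the group max on every row,
# so the result can be assigned straight back as a new column
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

print(per_user)
print(toy)
```

Because the transformed result lines up with the original index, every row of a user carries that user's maximum order number, which is what the loyalty flag is built on.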


In [35]:
# Check output of new max_order column.

ords_prods_merge.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_prices,spending_flag,median_days_since_prior_order,order_frequency_flag
0,2539329,1,1,2,8,11.114836,196,1,0,both,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer
1,2398795,1,2,3,7,15.0,196,1,1,both,...,Mid-range product,Regularly busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer
2,473747,1,3,3,12,21.0,196,1,1,both,...,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.0,Regular customer
3,2254736,1,4,4,7,29.0,196,1,1,both,...,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer
4,431534,1,5,4,15,28.0,196,1,1,both,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.0,Regular customer
5,3367565,1,6,2,7,19.0,196,1,1,both,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer
6,550135,1,7,1,9,20.0,196,1,1,both,...,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.0,Regular customer
7,3108588,1,8,1,14,14.0,196,2,1,both,...,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.0,Regular customer
8,2295261,1,9,1,16,0.0,196,4,1,both,...,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.0,Regular customer
9,2550362,1,10,4,8,30.0,196,1,1,both,...,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer


Customer loyalty flag criteria: 
If the maximum number of orders is over 40, then the customer is labeled "loyal customer." 
If the maximum number of orders is greater than 10 but less than or equal to 40, then the customer is labeled "regular customer." 
If the maximum number of orders is less than or equal to 10, then the customer is labeled "new customer."
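
The three criteria above amount to binning `max_order` at 10 and 40. As an alternative sketch (not the approach this notebook takes, which uses loc), the same boundaries could be expressed with `pd.cut` on a toy dataframe; the sample values are made up.

```python
import pandas as pd

# Toy data (hypothetical) chosen to hit both sides of each boundary
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41, 99]})

# pd.cut bins are right-inclusive by default: (-inf, 10], (10, 40], (40, inf)
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[float('-inf'), 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)

print(toy)
```

Note that `pd.cut` returns a categorical column rather than plain strings, which may or may not suit later steps.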

In [36]:
# Create flag for customer loyalty with the above criteria.

ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [37]:
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [38]:
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [39]:
# Check frequency counts of loyalty flag.

ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64

## 5. Check the basic statistics of the product prices for each loyalty category ("loyal customer," "regular customer," and "new customer") to determine whether the prices of products purchased by loyal customers differ from those purchased by regular or new customers.

In [16]:
# Check basic statistics of product prices for customer loyalty flag.

ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

loyalty_flag,prices (mean),prices (min),prices (max)
Loyal customer,10.386336,1.0,99999.0
New customer,13.29467,1.0,99999.0
Regular customer,12.495717,1.0,99999.0


## 6. Create a spending flag for each user based on the average price across all their orders.

In [19]:
# Split data into groups based on user_id column, apply transform() to prices to find mean prices for each user, and place results in new column, "mean_prices."

ords_prods_merge['mean_prices'] = ords_prods_merge.groupby(['user_id'])['prices'].transform('mean')


In [22]:
# Check output of mean_prices column.

ords_prods_merge.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,...,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_prices
0,2539329,1,1,2,8,11.114836,196,1,0,both,...,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797
1,2398795,1,2,3,7,15.0,196,1,1,both,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Average orders,10,New customer,6.367797
2,473747,1,3,3,12,21.0,196,1,1,both,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797
3,2254736,1,4,4,7,29.0,196,1,1,both,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797
4,431534,1,5,4,15,28.0,196,1,1,both,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797
5,3367565,1,6,2,7,19.0,196,1,1,both,...,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797
6,550135,1,7,1,9,20.0,196,1,1,both,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
7,3108588,1,8,1,14,14.0,196,2,1,both,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
8,2295261,1,9,1,16,0.0,196,4,1,both,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
9,2550362,1,10,4,8,30.0,196,1,1,both,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797


User spending criteria:
If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”
If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”
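
Since the spending criteria form a single two-way split at 10, one possible shorthand is `np.where`. The sketch below is an assumption-laden illustration on made-up data, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): user 1 averages 6.0, user 2 averages 12.0
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'prices': [4.0, 8.0, 9.0, 15.0],
})

# Per-user mean price, repeated on every row of that user
toy['mean_prices'] = toy.groupby('user_id')['prices'].transform('mean')

# Below 10 is a low spender; everything else is a high spender
toy['spending_flag'] = np.where(toy['mean_prices'] < 10, 'Low spender', 'High spender')

print(toy)
```

One caveat with this shorthand: a missing mean would fail the `< 10` test and fall into "High spender", whereas two separate loc assignments would leave it unflagged.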

In [23]:
# Create a flag for user spending based on the above criteria.

ords_prods_merge.loc[ords_prods_merge['mean_prices'] < 10, 'spending_flag'] = 'Low spender'


In [25]:
ords_prods_merge.loc[ords_prods_merge['mean_prices'] >= 10, 'spending_flag'] = 'High spender'

In [26]:
# Check frequency counts of spending flag.

ords_prods_merge['spending_flag'].value_counts(dropna = False)

spending_flag
Low spender     31770614
High spender      634245
Name: count, dtype: int64

## 7. Create an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column.

In [28]:
# Split data into groups based on user_id column, apply transform() to days_since_prior_order to find the median for each user, and place results in new column, "median_days_since_prior_order."

ords_prods_merge['median_days_since_prior_order'] = ords_prods_merge.groupby(['user_id'])['days_since_prior_order'].transform('median')


In [29]:
# Check output of median_days_since_prior_order column.

ords_prods_merge.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,...,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_prices,spending_flag,median_days_since_prior_order
0,2539329,1,1,2,8,11.114836,196,1,0,both,...,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.0
1,2398795,1,2,3,7,15.0,196,1,1,both,...,9.0,Mid-range product,Regularly busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.0
2,473747,1,3,3,12,21.0,196,1,1,both,...,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.0
3,2254736,1,4,4,7,29.0,196,1,1,both,...,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.0
4,431534,1,5,4,15,28.0,196,1,1,both,...,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.0
5,3367565,1,6,2,7,19.0,196,1,1,both,...,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.0
6,550135,1,7,1,9,20.0,196,1,1,both,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.0
7,3108588,1,8,1,14,14.0,196,2,1,both,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.0
8,2295261,1,9,1,16,0.0,196,4,1,both,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.0
9,2550362,1,10,4,8,30.0,196,1,1,both,...,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.0


Order frequency flag criteria:
If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled “Non-frequent customer.”
If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled “Regular customer.”
If the median is lower than or equal to 10, then the customer should be labeled “Frequent customer.”
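
The three-way rule above can also be written as an ordered list of conditions. The following is a minimal sketch using `np.select` on made-up data; the names `toy`, `med`, `conditions`, and `labels` are hypothetical, and the `'Unknown'` default is an assumption added for rows with a missing median.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): per-user medians are 7, 15, and 30 days
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'days_since_prior_order': [5.0, 7.0, 9.0, 10.0, 15.0, 20.0, 21.0, 30.0, 30.0],
})

med = toy.groupby('user_id')['days_since_prior_order'].transform('median')

# np.select uses the first condition that is True for each row
conditions = [med > 20, med > 10, med <= 10]
labels = ['Non-frequent customer', 'Regular customer', 'Frequent customer']
toy['order_frequency_flag'] = np.select(conditions, labels, default='Unknown')

print(toy.drop_duplicates('user_id'))
```

Because the conditions are checked in order, the second one only needs `med > 10`; rows above 20 have already been claimed by the first.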

In [30]:
# Create a flag for order frequency based on the above criteria.

ords_prods_merge.loc[ords_prods_merge['median_days_since_prior_order'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'


In [32]:
ords_prods_merge.loc[(ords_prods_merge['median_days_since_prior_order'] <= 20) & (ords_prods_merge['median_days_since_prior_order'] > 10), 'order_frequency_flag'] = 'Regular customer'

In [33]:
ords_prods_merge.loc[ords_prods_merge['median_days_since_prior_order'] <= 10, 'order_frequency_flag'] = 'Frequent customer'

In [40]:
# Check frequency counts of order frequency flag.

ords_prods_merge['order_frequency_flag'].value_counts(dropna = False)

order_frequency_flag
Frequent customer        20535136
Regular customer          9168905
Non-frequent customer     2700818
Name: count, dtype: int64

## 8. Export dataframe as a pickle file and store it correctly in “Prepared Data” folder.

In [41]:
ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_agg.pkl'))
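
A quick way to gain confidence in an export like this is to read the file back and compare it with the original. The sketch below does so for a toy dataframe in a temporary folder; the file name `toy_agg.pkl` and the data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical) standing in for the full merged dataframe
toy = pd.DataFrame({
    'user_id': [1, 2],
    'loyalty_flag': ['New customer', 'Loyal customer'],
})

# Write to a temporary folder, then read the pickle back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_agg.pkl')
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# True when values, dtypes, and index all survived the round trip
same = toy.equals(restored)
print(same)
```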