# Exercise 4.8 Task

## This script will cover the following:
### Step 1 - Create Notebook and Import Data
### Step 2 - Aggregate the Mean of order_number Grouped by department_id for entire dataset
### Step 3 - Analyze the Result
### Step 4 - Create loyalty flag using transform() and loc()
### Step 5 - Spending Habits by Customer Loyalty Type (with troubleshooting for prices column included)
### Step 6 - Spending Flag
### Step 7 - Determining Frequent v. Non-Frequent Users of the App
### Step 8 - Export Data

In [1]:
# Import Libraries

import os
import pandas as pd
import numpy as np

In [2]:
# Set Path

path = r'C:\Users\Josh Wattay\anaconda3\Instacart Basket Analysis'

In [3]:
# Import Data

ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_4.8.pkl'))

In [4]:
# Check output

ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,orders_time_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,_merge,price_range_loc,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_purchase_cost,spending_flag,median_days_btwn_orders,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,both,Mid-range product,Regularly busy,Average Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Average Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Most Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Average Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,both,Mid-range product,Slowest days,Most Orders,10,New customer,6.367535,Low spender,20.5,Non-frequent customer


In [5]:
ords_prods_merge.tail()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,orders_time_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,_merge,price_range_loc,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_purchase_cost,spending_flag,median_days_btwn_orders,order_frequency_flag
32404854,1320836,202557,17,2,15,1.0,43553,2,1,Orange Energy Shots,...,both,Low-range product,Regularly busy,Most Orders,31,Regular customer,6.905718,Low spender,8.0,Frequent customer
32404855,31526,202557,18,5,11,3.0,43553,2,1,Orange Energy Shots,...,both,Low-range product,Regularly busy,Most Orders,31,Regular customer,6.905718,Low spender,8.0,Frequent customer
32404856,758936,203436,1,2,7,,42338,4,0,"Zucchini Chips, Pesto",...,both,Mid-range product,Regularly busy,Average Orders,3,New customer,7.631527,Low spender,15.0,Regular customer
32404857,2745165,203436,2,3,5,15.0,42338,16,1,"Zucchini Chips, Pesto",...,both,Mid-range product,Slowest days,Fewest Orders,3,New customer,7.631527,Low spender,15.0,Regular customer
32404858,3093936,205420,1,4,14,,28818,8,0,Hot Oatmeal Multigrain Raisin,...,both,Mid-range product,Slowest days,Most Orders,16,Regular customer,7.684844,Low spender,13.0,Regular customer


In [6]:
# Shape Check

ords_prods_merge.shape

(32404859, 23)

### Step 2 - Aggregate the Mean of order_number Grouped by department_id for entire dataset

In [7]:
# Apply groupby() and agg() function

ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number mean
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148


### Step 3 - Analyze the Result

#### This result differs notably from the result obtained on the subset of 1 million records.
#### Firstly, all 21 departments are represented in this result.
#### Secondly, the averages for the departments that were represented in the subset are all different.
#### This makes sense, given that the full dataset contains about 32.4 times as many records as the 1 million record subset.
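
To illustrate why a subset and the full data can disagree, here is a minimal sketch on made-up data. The toy DataFrame and the two-row "subset" are assumptions for illustration only, not the Instacart records.

```python
import pandas as pd

# Hypothetical toy data standing in for ords_prods_merge
toy = pd.DataFrame({
    'department_id': [1, 2, 3, 1, 2, 3],
    'order_number':  [2, 4, 6, 10, 8, 12],
})

# Mean of order_number per department over all rows
full_means = toy.groupby('department_id')['order_number'].mean()

# Same aggregation over only the first two rows (the "subset")
subset_means = toy.head(2).groupby('department_id')['order_number'].mean()

print(full_means)
print(subset_means)
```

In this toy case, department 3 is missing from the subset result and the means for departments 1 and 2 change once all rows are included, which mirrors the two observations above.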

### Step 4 - Create loyalty flag using transform() and loc()

In [8]:
# These steps were completed during the exercise using the following code (which I will not repeat here for practical purposes)

# Creating a column for max_order using transform()

ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')

# Creating a loyalty flag using loc()

ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'


In [9]:
# Frequency Check

ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64

In [10]:
# Aggregate Check using Sum of Results

15876776 + 10284093 + 6243990

32404859

In [11]:
# Shape Check to Verify Aggregate Check Results

ords_prods_merge.shape

(32404859, 23)

The 32,404,859 records from the shape check MATCH the results from our aggregate check. Therefore, all records are accounted for.
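
The flag logic and the aggregate check can also be tried on a small scale. The following is a minimal sketch, assuming a made-up three-user DataFrame; it uses the same thresholds as above and replaces the manual addition with a comparison against the row count.

```python
import pandas as pd

# Hypothetical toy orders: user 1 peaks at 3 orders, user 2 at 12, user 3 at 41
toy = pd.DataFrame({
    'user_id':      [1, 1, 2, 2, 3, 3],
    'order_number': [1, 3, 5, 12, 20, 41],
})

# Broadcast each user's highest order_number to all of that user's rows
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

# Same thresholds as in the notebook, assigned with .loc
toy.loc[toy['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'
toy.loc[(toy['max_order'] <= 40) & (toy['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'
toy.loc[toy['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

# Programmatic aggregate check: the category counts must add up to the row count
counts = toy['loyalty_flag'].value_counts(dropna=False)
all_rows_flagged = counts.sum() == toy.shape[0]
print(all_rows_flagged)
```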

### Step 5 - Spending Habits by Customer Loyalty Type (with prices column troubleshooting)

#### The marketing team at Instacart wants to know whether there’s a difference between the spending habits of the three types of customers you identified. Use the loyalty flag you created and check the basic statistics of the product prices for each loyalty category (Loyal Customer, Regular Customer, and New Customer). What you’re trying to determine is whether the prices of products purchased by loyal customers differ from those purchased by regular or new customers.

In [12]:
# Let's group by "loyalty_flag" and find the mean/average, minimum, and maximum values for the "prices" column

ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

loyalty_flag,prices mean,prices min,prices max
Loyal customer,7.775655,1.0,25.0
New customer,7.80426,1.0,25.0
Regular customer,7.801007,1.0,25.0


In [13]:
# Check for division by zero

if (ords_prods_merge['prices'] == 0).any():
    print("Warning: Division by zero may occur. Check your data.")

In [14]:
# Check data type of prices column

ords_prods_merge['prices'].dtype

dtype('float16')

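This explains why we were getting infinite results when trying to calculate the mean: float16 can only represent finite values up to 65504, so large values and large sums overflow to infinity.
I will now change the datatype to a wider type and reattempt. A float type such as float32 or float64 is the safer choice here, since int32 would drop the decimal part of the prices.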
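
The effect of float16's limited range can be reproduced in a few lines of NumPy. This is a minimal sketch with made-up values, not the Instacart prices; it only shows that a sum can overflow in float16 and stays finite once the data is widened.

```python
import numpy as np

# Largest finite value a float16 can hold
f16_max = float(np.finfo(np.float16).max)

# Two made-up values that fit individually but whose sum does not
vals16 = np.array([40000.0, 40000.0], dtype=np.float16)
with np.errstate(over='ignore'):
    total16 = vals16.sum()  # exceeds the float16 range and becomes inf

# Widening to float64 before aggregating keeps the sum finite
total64 = vals16.astype(np.float64).sum()

print(f16_max, total16, total64)
```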

In [15]:
# Verify descriptive stats of prices column

ords_prods_merge['prices'].describe()


count    3.240486e+07
mean              NaN
std      0.000000e+00
min      1.000000e+00
25%      4.199219e+00
50%      7.398438e+00
75%      1.129688e+01
max      2.500000e+01
Name: prices, dtype: float64

In [16]:
ords_prods_merge['prices'].max()

25.0

In [17]:
# Create a boolean mask for rows with infinite values in the prices column

mask_inf_values = ords_prods_merge['prices'] == float('inf')

In [19]:
mask_inf_values.value_counts()

prices
False    32404859
Name: count, dtype: int64

In [20]:
# Use the boolean mask to filter the DataFrame

rows_with_inf_values = ords_prods_merge[mask_inf_values]

In [21]:
# Display the rows with infinite values

print(rows_with_inf_values)

Empty DataFrame
Columns: [order_id, user_id, order_number, orders_day_of_week, orders_time_of_day, days_since_prior_order, product_id, add_to_cart_order, reordered, product_name, aisle_id, department_id, prices, _merge, price_range_loc, busiest_days, busiest_period_of_day, max_order, loyalty_flag, average_purchase_cost, spending_flag, median_days_btwn_orders, order_frequency_flag]
Index: []

[0 rows x 23 columns]
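
An empty result here means no row currently holds positive infinity. As a side note, comparing against `float('inf')` does not catch negative infinity. A slightly broader check, sketched on assumed toy data, uses NumPy's `isinf`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices containing both signs of infinity
toy = pd.DataFrame({'prices': [4.5, np.inf, 12.0, -np.inf]})

# Equality with float('inf') only matches positive infinity
mask_pos_inf = toy['prices'] == float('inf')

# np.isinf flags positive and negative infinity alike
mask_any_inf = np.isinf(toy['prices'])

print(mask_pos_inf.sum(), mask_any_inf.sum())
```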


In [22]:
# Sort the DataFrame by the 'prices' column in descending order

df_sorted = ords_prods_merge.sort_values(by='prices', ascending=False)

In [23]:
df_sorted.head(5130)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,orders_time_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,_merge,price_range_loc,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_purchase_cost,spending_flag,median_days_btwn_orders,order_frequency_flag
22675639,2346836,104650,18,0,11,26.0,40486,43,0,Chicken Tenders,...,both,High-range product,Busiest days,Most Orders,22,Regular customer,7.148770,Low spender,11.0,Regular customer
18506618,2551921,34636,4,2,8,18.0,9020,9,0,Boneless Skinless Chicken Thighs,...,both,High-range product,Regularly busy,Average Orders,7,New customer,7.430911,Low spender,18.0,Regular customer
18506620,576874,34643,3,3,20,24.0,9020,3,0,Boneless Skinless Chicken Thighs,...,both,High-range product,Slowest days,Average Orders,27,Regular customer,9.028491,Low spender,9.0,Frequent customer
18506621,950934,34643,11,1,9,8.0,9020,22,1,Boneless Skinless Chicken Thighs,...,both,High-range product,Busiest days,Most Orders,27,Regular customer,9.028491,Low spender,9.0,Frequent customer
18506622,613500,34643,12,2,11,8.0,9020,9,1,Boneless Skinless Chicken Thighs,...,both,High-range product,Regularly busy,Most Orders,27,Regular customer,9.028491,Low spender,9.0,Frequent customer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18508245,1657919,65038,13,0,21,3.0,9020,11,1,Boneless Skinless Chicken Thighs,...,both,High-range product,Busiest days,Average Orders,26,Regular customer,7.427334,Low spender,7.0,Frequent customer
18508246,592147,65082,8,1,13,7.0,9020,15,0,Boneless Skinless Chicken Thighs,...,both,High-range product,Busiest days,Most Orders,16,Regular customer,8.772306,Low spender,16.0,Regular customer
18508247,2671829,65084,5,2,16,9.0,9020,19,0,Boneless Skinless Chicken Thighs,...,both,High-range product,Regularly busy,Most Orders,8,New customer,7.591705,Low spender,9.0,Frequent customer
18508248,1021170,65084,7,4,12,8.0,9020,14,1,Boneless Skinless Chicken Thighs,...,both,High-range product,Slowest days,Most Orders,8,New customer,7.591705,Low spender,9.0,Frequent customer


In [24]:
mask_product_id_21553 = ords_prods_merge['product_id'] == 21553

In [25]:
# Use the boolean mask to filter the DataFrame

records_with_product_id_21553 = ords_prods_merge[mask_product_id_21553]

In [26]:
print(records_with_product_id_21553)

          order_id  user_id  order_number  orders_day_of_week  \
10030345    912404       17            12                   2   
10030346    603376       17            22                   6   
10030347   3264360      135             2                   2   
10030348    892534      135             3                   0   
10030349    229704      342             8                   1   
...            ...      ...           ...                 ...   
10034769   3172853   205650            18                   1   
10034770   2504315   205818             3                   5   
10034771   1108388   205818             5                   4   
10034772   1916142   206049             1                   2   
10034773    379732   206049             4                   1   

          orders_time_of_day  days_since_prior_order  product_id  \
10030345                  14                     5.0       21553   
10030346                  16                     4.0       21553   
10030347       

In [27]:
# Update the price values for rows with infinite values

# I will replace 'new_value' with the desired value for these records, which is the highest non-infinite value of 14896.0

new_value = 14896.0  # Replace with desired value
ords_prods_merge.loc[mask_inf_values, 'prices'] = new_value

In [28]:
# Create a boolean mask for rows with price equal to 14896.0
mask_price_14896 = ords_prods_merge['prices'] == 14896.0

In [29]:
# Update the price values for rows with price equal to 14896.0
new_price_value = 25.00
ords_prods_merge.loc[mask_price_14896, 'prices'] = new_price_value

#### I have discovered a problem in the prices column.
#### The 2% Milk has an infinite value in the prices column.
#### The Cottage Cheese has a value of 14896. Obviously this is an error, or the world's best cottage cheese.
#### Therefore, above I used a boolean mask to locate the infinite values of the Milk and set them to the Cottage Cheese value of 14896.
#### I then used the same technique to change the erroneous value shared by the milk and cottage cheese from 14896 to 25.
#### The value of 25 was chosen by sorting the dataframe by price and scanning the top rows with head() for the next highest value below 14896.
#### I will now re-attempt finding aggregate values for prices grouped by customer loyalty type.
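
The two-step replacement described above can be sketched on a tiny example. The three-row DataFrame below is made up for illustration; only the values 14896 and 25 come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one infinite and one implausibly high price
toy = pd.DataFrame({
    'product_name': ['2% Milk', 'Cottage Cheese', 'Soda'],
    'prices': [np.inf, 14896.0, 9.0],
})

# Step 1: set infinite prices to the highest finite value
mask_inf = toy['prices'] == float('inf')
toy.loc[mask_inf, 'prices'] = 14896.0

# Step 2: set every row carrying that outlier value to the chosen cap of 25
mask_outlier = toy['prices'] == 14896.0
toy.loc[mask_outlier, 'prices'] = 25.0

print(toy)
```

Capping at the next highest observed price is the choice made in this notebook. Marking such values as missing instead would be another option, with different effects on the mean.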

In [30]:
# Re-attempt to find mean, min, and max for customer types 

ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

loyalty_flag,prices mean,prices min,prices max
Loyal customer,7.775655,1.0,25.0
New customer,7.80426,1.0,25.0
Regular customer,7.801007,1.0,25.0


#### Based on our data, there is no noticeable difference between the spending habits of the three loyalty groups: the mean price is about 7.78 to 7.80 for each, and the minimum (1.0) and maximum (25.0) are identical.

### Step 6 - Spending Flag

#### The team now wants to target different types of spenders in their marketing campaigns. This can be achieved by looking at the prices of the items people are buying. Create a spending flag for each user based on the average price across all their orders using the following criteria:

#### If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”

#### If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”
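
The two criteria can also be expressed in a single pass. The following is a minimal sketch on an assumed toy DataFrame. Note that `np.where` is swapped in here as an alternative to the pair of `.loc` assignments used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy purchases: user 1 averages 7.0, user 2 averages exactly 10.0
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'prices':  [5.0, 9.0, 8.0, 12.0],
})

# Broadcast each user's mean price to all of that user's rows
toy['average_purchase_cost'] = toy.groupby('user_id')['prices'].transform('mean')

# A mean of 10 or more is a High spender, anything below is a Low spender
toy['spending_flag'] = np.where(
    toy['average_purchase_cost'] >= 10, 'High spender', 'Low spender'
)

print(toy)
```

One caveat of this variant: a missing mean would fall into the "Low spender" branch, whereas the `.loc` approach leaves such rows unflagged.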

In [31]:
# Let's start by using the transform() function to determine the mean price of products for each user

# We will create a new column with this information called 'average_purchase_cost'

ords_prods_merge['average_purchase_cost'] = ords_prods_merge.groupby(['user_id'])['prices'].transform('mean')


In [32]:
# Check output for user_id and new column

ords_prods_merge[['user_id', 'average_purchase_cost']].head(50)

Unnamed: 0,user_id,average_purchase_cost
0,1,6.367535
1,1,6.367535
2,1,6.367535
3,1,6.367535
4,1,6.367535
5,1,6.367535
6,1,6.367535
7,1,6.367535
8,1,6.367535
9,1,6.367535


#### Now we can use .loc to create a Spending Flag that meets the criteria provided by the stakeholders.

In [33]:
# For High spender category

ords_prods_merge.loc[ords_prods_merge['average_purchase_cost'] >= 10, 'spending_flag'] = 'High spender'

In [34]:
# For Low spender category

ords_prods_merge.loc[ords_prods_merge['average_purchase_cost'] < 10, 'spending_flag'] = 'Low spender'

In [35]:
# Check output of columns

ords_prods_merge[['user_id', 'average_purchase_cost', 'spending_flag']].head(50)

Unnamed: 0,user_id,average_purchase_cost,spending_flag
0,1,6.367535,Low spender
1,1,6.367535,Low spender
2,1,6.367535,Low spender
3,1,6.367535,Low spender
4,1,6.367535,Low spender
5,1,6.367535,Low spender
6,1,6.367535,Low spender
7,1,6.367535,Low spender
8,1,6.367535,Low spender
9,1,6.367535,Low spender


In [36]:
# Frequency Check

ords_prods_merge['spending_flag'].value_counts(dropna = False)

spending_flag
Low spender     32282827
High spender      122032
Name: count, dtype: int64

In [37]:
# Aggregate Check
32282827 + 122032

32404859

In [38]:
# Shape Check to verify Aggregate Check

ords_prods_merge.shape

(32404859, 23)

The 32,404,859 result of the Aggregate check MATCHES our dataframe Shape check. Therefore, all records are accounted for.

The vast majority of records (about 99.6%) belong to users whose average item price is less than $10. Note that these counts are rows, not distinct users.
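
To put the two counts in proportion, the shares can be computed directly. This sketch simply re-enters the counts from the frequency check above into a small Series; on the real DataFrame, `value_counts(normalize=True)` would give the same fractions.

```python
import pandas as pd

# Row counts copied from the frequency check above
counts = pd.Series({'Low spender': 32282827, 'High spender': 122032})

# Share of rows in each category, in percent
shares = (counts / counts.sum() * 100).round(2)

print(shares)
```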

In [39]:
ords_prods_spending_flag = ords_prods_merge['spending_flag'].value_counts()

ords_prods_spending_flag.to_clipboard()

### Step 7 - Determining Frequent v. Non-Frequent Users of the App

#### In order to send relevant notifications to users within the app (for instance, asking users if they want to buy the same item again), the Instacart team wants you to determine frequent versus non-frequent customers. Create an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column. The criteria for the flag should be as follows:

#### If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.”
#### If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a "Regular customer."
#### If the median is lower than or equal to 10, then the customer should be labeled a "Frequent customer."
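
The three ranges above map naturally onto bins. As a hedged alternative to separate `.loc` assignments, the sketch below uses `pd.cut` on an assumed Series of medians that includes the boundary values 10 and 20 and one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical medians, including both boundaries and a missing value
medians = pd.Series([5.0, 10.0, 10.5, 20.0, 20.5, np.nan])

# Right-closed bins: up to 10, above 10 up to 20, above 20
flags = pd.cut(
    medians,
    bins=[-np.inf, 10, 20, np.inf],
    labels=['Frequent customer', 'Regular customer', 'Non-frequent customer'],
)

print(flags.tolist())
```

Because the bins are closed on the right by default, a median of exactly 10 lands in "Frequent customer" and exactly 20 in "Regular customer", matching the criteria. A missing median stays missing rather than being assigned a label.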

In [40]:
# Let's start by using transform() again with a groupby() to determine a user's median days since prior order

# We will create a median_days_btwn_orders column to give each user the median values needed to qualify for the Order Frequency Flag

ords_prods_merge['median_days_btwn_orders'] = ords_prods_merge.groupby(['user_id'])['days_since_prior_order'].transform('median')


In [41]:
# Check output for user_id and new column

ords_prods_merge[['user_id', 'median_days_btwn_orders']].head(50)

Unnamed: 0,user_id,median_days_btwn_orders
0,1,20.5
1,1,20.5
2,1,20.5
3,1,20.5
4,1,20.5
5,1,20.5
6,1,20.5
7,1,20.5
8,1,20.5
9,1,20.5


#### Now we can use .loc to set the conditions for our Order Frequency Flag.

In [42]:
# For Non-frequent customer

ords_prods_merge.loc[ords_prods_merge['median_days_btwn_orders'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'

In [43]:
# For Regular customer

ords_prods_merge.loc[(ords_prods_merge['median_days_btwn_orders'] > 10) & (ords_prods_merge['median_days_btwn_orders'] <= 20), 'order_frequency_flag'] = 'Regular customer'

In [44]:
# For Frequent customer

ords_prods_merge.loc[ords_prods_merge['median_days_btwn_orders'] <= 10, 'order_frequency_flag'] = 'Frequent customer'

In [45]:
# Check output for new order_frequency_flag column with user_id and median_days_btwn_orders

ords_prods_merge[['user_id', 'median_days_btwn_orders', 'order_frequency_flag']].head(50)

Unnamed: 0,user_id,median_days_btwn_orders,order_frequency_flag
0,1,20.5,Non-frequent customer
1,1,20.5,Non-frequent customer
2,1,20.5,Non-frequent customer
3,1,20.5,Non-frequent customer
4,1,20.5,Non-frequent customer
5,1,20.5,Non-frequent customer
6,1,20.5,Non-frequent customer
7,1,20.5,Non-frequent customer
8,1,20.5,Non-frequent customer
9,1,20.5,Non-frequent customer


In [46]:
# Frequency Check

ords_prods_merge['order_frequency_flag'].value_counts(dropna = False)

order_frequency_flag
Frequent customer        21559853
Regular customer          7208564
Non-frequent customer     3636437
nan                             5
Name: count, dtype: int64

In [49]:
ords_prods_freq_flag = ords_prods_merge['order_frequency_flag'].value_counts(dropna = False)

ords_prods_freq_flag.to_clipboard()

In [47]:
# Aggregate Check

21559853 + 7208564 + 3636437 + 5

32404859

In [48]:
# Shape Check to verify Aggregate Check

ords_prods_merge.shape

(32404859, 23)

#### The 32,404,859 rows from our shape check MATCH the results of the aggregate check. Therefore, all records are accounted for.
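#### It should be noted that 5 rows have a NaN order_frequency_flag. These rows belong to users with no recorded days_since_prior_order value, that is, first-time purchases by new users who have never made another purchase, so no median could be computed for them.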
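
How a missing flag arises can be shown on a small example. This is a minimal sketch on made-up data in which one user has only a first order; the column names follow the notebook, everything else is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 2 has only a first order, so the gap is missing
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, 9.0, np.nan],
})

# The group median skips missing values; it is NaN only if every value is missing
toy['median_days_btwn_orders'] = (
    toy.groupby('user_id')['days_since_prior_order'].transform('median')
)

# Users for whom no median could be computed
users_without_median = toy.loc[
    toy['median_days_btwn_orders'].isna(), 'user_id'
].unique()

print(users_without_median)
```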

### Step 8 - Export Data

In [None]:
ords_prods_merge.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'ords_prods_merge_4.8.pkl'))
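
After an export it can be worth confirming that the file reads back unchanged. The sketch below does this for a small made-up DataFrame in a temporary directory; the file name `toy_export.pkl` is hypothetical and unrelated to the project folders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for ords_prods_merge
toy = pd.DataFrame({
    'user_id': [1, 2],
    'spending_flag': ['Low spender', 'High spender'],
})

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_export.pkl')
    toy.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# True when values, dtypes, and index all match
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```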