### This script contains the following points:

#### 1. Importing libraries
#### 2. Importing orders_products_merged
#### 3. User_order_number aggregated mean for department_id
#### 4. Create max_order column using transform()
#### 5. Assigning loyalty label to user IDs using loc()
#### 6. Descriptive statistics for prices by loyalty flag
#### 7. Create 'average_price' column for each user_id
#### 8. Create 'median_days_prior_order' column for each user
#### 9. Exporting ords_prods_merge to pkl

# 1. Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os

# 2. Importing orders_products_merged

In [2]:
# importing orders_products_merged.pkl from the Prepared Data folder
ords_prods_merge = pd.read_pickle(r'C:\Users\kevan\Documents\Career Foundry\Data Immersion\Achievement 4\Instacart Basket Analysis\02 Data\Prepared Data\orders_products_merged.pkl')

In [3]:
# assigning main project path to variable 'path'
path = r'C:\Users\kevan\Documents\Career Foundry\Data Immersion\Achievement 4\Instacart Basket Analysis'

# 3. User_order_number aggregated mean for department_id

In [6]:
# average user order numbers by dept ID for ENTIRE dataframe
ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148
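
The grouped mean above can be reproduced on a small scale. The sketch below uses a hypothetical toy dataframe (`toy` and its values are made up for illustration, not taken from the Instacart data) to show that `groupby().agg()` returns one row per `department_id`.

```python
import pandas as pd

# Hypothetical toy data standing in for ords_prods_merge
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 2],
    'order_number': [2, 4, 10, 20, 30],
})

# agg() collapses the rows: one row per department, holding the mean order_number
dept_mean = toy.groupby('department_id').agg({'order_number': ['mean']})
print(dept_mean)
```

The result has two-level column labels (`order_number`, `mean`), which is why the exported output shows more than one header row.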


# 4. Create max_order column using transform()

In [7]:
# grouping by user_id and taking the maximum order_number for each user
# transform() broadcasts the result back to every row, stored in the new column "max_order"

ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')
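
To illustrate why `transform()` is used here rather than `agg()`, the following sketch compares the two on a hypothetical toy dataframe (the data is invented for the example). `agg()` returns one value per user, while `transform()` returns a value for every original row, so it can be assigned directly as a new column.

```python
import pandas as pd

# Hypothetical toy data: user 1 has 3 orders, user 2 has 2 orders
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# agg() collapses to one row per user
per_user = toy.groupby('user_id')['order_number'].agg('max')

# transform() keeps the original row count and repeats each user's maximum
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

print(per_user)
print(toy)
```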

In [8]:
# removing the limit on the number of displayed rows so full outputs can be inspected during quality checks

pd.options.display.max_rows = None

# 5. Assigning loyalty label to user IDs using loc()

In [9]:
# using max_order value to assign loyalty flags to each user
# users with max order greater than 40 are loyal customers

ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [10]:
# users with max orders between 11-40 are regular customers

ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [11]:
# users with max orders 10 or less are new customers

ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [12]:
ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: loyalty_flag, dtype: int64
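
As an alternative to three separate `loc()` assignments, the same thresholds could be expressed in one step with `pd.cut()`. This is only a sketch on hypothetical toy values, not the method used in this script; note that `pd.cut()` returns a categorical column rather than plain strings.

```python
import pandas as pd

# Hypothetical max_order values chosen to sit on both sides of each threshold
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41, 99]})

# Bins are right-inclusive by default: (-inf, 10], (10, 40], (40, inf]
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[-float('inf'), 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)

print(toy)
```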

# 6. Descriptive statistics for prices by loyalty flag

In [13]:
ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'count', 'std', 'min', 'max']})

loyalty_flag,prices (mean),prices (count),prices (std),prices (min),prices (max)
Loyal customer,10.386336,10284093,328.017787,1.0,99999.0
New customer,13.29467,6243990,597.560299,1.0,99999.0
Regular customer,12.495717,15876776,539.720919,1.0,99999.0


### Loyal customers have the lowest average price (10.39) of the three groups. Their standard deviation (328.02) is also the smallest, which indicates less price variability in their purchases than in the other groups.

### New customers have the fewest rows (count of 6,243,990) but the highest average price (13.29). Their standard deviation (597.56) is the largest, indicating the most variability in product prices.

### Regular customers have the largest count (15,876,776 rows). Their average price (12.50) is slightly below that of new customers, and their standard deviation (539.72) is smaller than that of new customers but still much larger than that of loyal customers.

### All three groups share the same minimum (1.0) and maximum (99999.0) price. A maximum of 99999.0 is far above every group mean, so the means and standard deviations above are strongly influenced by a small share of extreme values.
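
The price table reports a maximum of 99999.0 in every group against group means of roughly 10 to 13. The sketch below, using hypothetical prices, shows how a single value of that size shifts the mean and standard deviation while leaving the median almost unchanged. It is an illustration of the effect, not a statement about how many such rows exist in the real data.

```python
import pandas as pd

# Hypothetical prices in a typical range
prices = pd.Series([5.0, 7.0, 9.0, 11.0, 13.0])

# The same prices plus one extreme value like the maximum seen in the table
with_outlier = pd.concat([prices, pd.Series([99999.0])], ignore_index=True)

summary = pd.DataFrame({
    'without_outlier': prices.agg(['mean', 'median', 'std']),
    'with_outlier': with_outlier.agg(['mean', 'median', 'std']),
})
print(summary)
```

Comparing medians alongside means is one way to check whether group differences survive once extreme prices are discounted.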

# 7. Create 'average_price' column for each user_id

In [14]:
# calculating the average product price for each user_id

ords_prods_merge['average_price'] = ords_prods_merge.groupby(['user_id'])['prices'].transform('mean')

In [15]:
# assigning spending flags to low spenders if average price is less than 10

ords_prods_merge.loc[ords_prods_merge['average_price'] < 10, 'spending_flag'] = 'Low spender'

In [16]:
# assigning spending flags to high spenders if average price is greater than or equal to 10

ords_prods_merge.loc[ords_prods_merge['average_price'] >= 10, 'spending_flag'] = 'High spender'

In [17]:
# checking counts of flags

ords_prods_merge['spending_flag'].value_counts(dropna = False)

Low spender     31770614
High spender      634245
Name: spending_flag, dtype: int64
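
Because the spending flag has only two outcomes, it could also be written as a single `np.where()` call. The sketch below uses hypothetical toy data and is an alternative, not the approach taken in this script. One caveat is stated in the comments: `np.where()` sends missing averages to the second label, whereas the `loc()` version leaves them as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 1 averages 6.0, user 2 averages 12.0
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'prices': [4.0, 8.0, 9.0, 15.0],
})

toy['average_price'] = toy.groupby('user_id')['prices'].transform('mean')

# Caveat: a NaN average would fail the "< 10" test and be labelled 'High spender'
toy['spending_flag'] = np.where(
    toy['average_price'] < 10, 'Low spender', 'High spender'
)

print(toy)
```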

In [18]:
# checking dataframe to verify new column and flags are correct
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.367797,Low spender
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.367797,Low spender
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Most orders,10,New customer,6.367797,Low spender
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.367797,Low spender
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Most orders,10,New customer,6.367797,Low spender


# 8. Create 'median_days_prior_order' column for each user

In [19]:
# calculating the median days since prior order for each user

ords_prods_merge['median_days_prior_order'] = ords_prods_merge.groupby(['user_id'])['days_since_prior_order'].transform('median')

In [20]:
# assigning frequency flags to non-frequent customers if median days since prior order is greater than 20

ords_prods_merge.loc[ords_prods_merge['median_days_prior_order'] > 20, 'frequency_flag'] = 'Non-frequent customer'

In [21]:
# assigning frequency flags to regular customers if median days since prior order is greater than 10 and less than or equal to 20

ords_prods_merge.loc[(ords_prods_merge['median_days_prior_order'] > 10) & (ords_prods_merge['median_days_prior_order'] <= 20), 'frequency_flag'] = 'Regular customer'

In [22]:
# assigning frequency flags to frequent customers if median days since prior order is less than or equal to 10

ords_prods_merge.loc[ords_prods_merge['median_days_prior_order'] <= 10, 'frequency_flag'] = 'Frequent customer'

In [23]:
# checking counts of flags

ords_prods_merge['frequency_flag'].value_counts(dropna = False)

Frequent customer        21559853
Regular customer          7208564
Non-frequent customer     3636437
NaN                             5
Name: frequency_flag, dtype: int64

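### The 5 NaN values are 5 rows, not necessarily 5 users. They likely belong to user(s) with no recorded days_since_prior_order value (only a first order), so the median is NaN and none of the three conditions assigns a flag.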
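
The sketch below reproduces this situation on hypothetical toy data: a user whose `days_since_prior_order` values are all missing gets a NaN median, while a user with at least one recorded value gets a median computed from the non-missing values. It also shows how rows and distinct users with a NaN median could be counted separately.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 1 has prior-order gaps, user 2 has none recorded
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'days_since_prior_order': [np.nan, 10.0, 30.0, np.nan, np.nan],
})

# The groupby median skips NaN values; an all-NaN group yields NaN
toy['median_days_prior_order'] = (
    toy.groupby('user_id')['days_since_prior_order'].transform('median')
)

is_missing = toy['median_days_prior_order'].isna()
nan_rows = int(is_missing.sum())                         # rows without a median
nan_users = toy.loc[is_missing, 'user_id'].nunique()     # distinct users affected

print(toy)
print(nan_rows, nan_users)
```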

In [24]:
# checking dataframe to verify new column and flags are correct
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,price_range_loc,busiest_day,busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag,median_days_prior_order,frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


# 9. Exporting ords_prods_merge to pkl

In [25]:
ords_prods_merge.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_merged.pkl'))
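
A quick way to confirm that an export like this preserves the new flag columns is to read the file back and compare. The sketch below does so with a hypothetical toy dataframe and a temporary directory (the file name `toy_flags.pkl` is made up), so it does not touch the project folders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with one of the flag columns
toy = pd.DataFrame({
    'user_id': [1, 2],
    'loyalty_flag': ['New customer', 'Loyal customer'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_flags.pkl')
    toy.to_pickle(out_path)              # write the dataframe
    restored = pd.read_pickle(out_path)  # read it back

# equals() checks values, column names, and dtypes
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```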