### This Script Contains The Following Points ###

### 1. Imports and Subset Creation 
### 2. Grouping Data in Subset
### 3. Aggregating Data
### 4. Aggregating with Transform
### 5. Deriving Columns with loc()
### 6. Spending Flag

## 1. Imports and Subset Creation

In [60]:
import pandas as pd
import numpy as np
import os

In [61]:
# creating path
path = r'/Users/kimkmiz/Documents/Instacart Basket Analysis 2024'

In [62]:
#import orders_products_merged from previous lesson
ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'IC24 Prepared Data', 'ords_prods_merge_derived.pkl'))

In [63]:
#create a subset called df of the first million entries
df = ords_prods_merge[:1000000]

In [64]:
# Check shape of new subset
df.shape

(1000000, 18)

In [65]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered,_merge,price_range_loc,busiest_days,busiest_period_of_day
0,0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,3.0,5,0,both,Mid-range product,Regularly busy,Most orders
1,0,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,20.0,1,1,both,Mid-range product,Regularly busy,Average orders
2,0,1,Chocolate Sandwich Cookies,61,19,5.8,389851,709,2,0,21,6.0,20,0,both,Mid-range product,Busiest days,Average orders
3,0,1,Chocolate Sandwich Cookies,61,19,5.8,652770,764,1,3,13,,10,0,both,Mid-range product,Slowest days,Most orders
4,0,1,Chocolate Sandwich Cookies,61,19,5.8,1813452,764,3,4,17,9.0,11,1,both,Mid-range product,Slowest days,Average orders
5,0,1,Chocolate Sandwich Cookies,61,19,5.8,1701441,777,16,1,7,26.0,7,0,both,Mid-range product,Busiest days,Average orders
6,0,1,Chocolate Sandwich Cookies,61,19,5.8,1871483,825,3,2,14,30.0,2,0,both,Mid-range product,Regularly busy,Most orders
7,0,1,Chocolate Sandwich Cookies,61,19,5.8,1290456,910,12,3,10,30.0,1,0,both,Mid-range product,Slowest days,Most orders
8,0,1,Chocolate Sandwich Cookies,61,19,5.8,369558,1052,10,1,20,19.0,1,0,both,Mid-range product,Busiest days,Average orders
9,0,1,Chocolate Sandwich Cookies,61,19,5.8,589712,1052,15,1,12,15.0,2,1,both,Mid-range product,Busiest days,Most orders


### Observations:
- The subset has 1,000,000 rows and 18 columns
- The 'Unnamed: 0' column is 0 in every row shown and is not needed
- The first 10 rows are all the same product, 'Chocolate Sandwich Cookies', from aisle 61

## 2. Grouping Data in Subset

In [68]:
#group data from subset by product name
df.groupby('product_name')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1493241d0>

In [69]:
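#the grouped data is not displayed; groupby() only returns a DataFrameGroupBy object
#this is only the first step of the workflow (splitting the rows into groups)
#an aggregation or another function must be applied next to get a visible result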
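
The split-then-apply idea can be tried on a small frame. This is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the Instacart file); it shows that the grouped object only stores the split until a function such as `size()` is applied.

```python
import pandas as pd

# Made-up stand-in for the subset; values are illustrative only
toy = pd.DataFrame({
    'product_name': ['Cookies', 'Cookies', 'Milk', 'Bread', 'Milk'],
    'order_number': [28, 30, 2, 1, 3],
})

grouped = toy.groupby('product_name')

# The grouped object holds the split only; nothing has been computed yet
print(type(grouped).__name__)
print(grouped.ngroups)  # number of distinct product names

# size() is one of the simplest functions to apply: rows per group
rows_per_product = grouped.size()
print(rows_per_product)
```

Any aggregation (`mean()`, `max()`, `agg()`) can take the place of `size()` in the last step.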

## 3. Aggregating Data

In [71]:
#split data into groups based on 'department_id'
#apply the agg() function to each group to get mean values of the 'order_number' column
df.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,15.577493
2,17.320781
3,16.084944
4,17.530458
5,14.763075
6,16.658449
7,17.03159
8,15.076662
9,15.44758
10,18.681852


In [72]:
# groupby() is called on the df dataframe and returns a pandas groupby object keyed on 'department_id'
# agg() is applied to this object and returns the mean of the given column, 'order_number'
# The table lists each 'department_id' value in df with its corresponding 'order_number' mean

In [73]:
# There are some aggregations that can be conducted without the use of the agg() function
df.groupby('department_id')['order_number'].mean()

department_id
1     15.577493
2     17.320781
3     16.084944
4     17.530458
5     14.763075
6     16.658449
7     17.031590
8     15.076662
9     15.447580
10    18.681852
11    15.447411
12    14.327957
13    16.548642
14    16.960241
15    16.121948
16    17.803851
17    15.593633
18    19.674252
19    16.899756
20    16.255442
21    25.535479
Name: order_number, dtype: float64

In [74]:
#When using agg(), pass the column you want to aggregate, together with the functions to apply, inside the parentheses of agg()
#When using mean(), index the column with square brackets, then call the function after the dot
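
The two spellings can be compared side by side. A minimal sketch on made-up data (the `toy` frame is an assumption): both compute the same means, but `agg()` with a dict returns a DataFrame with two-level column labels, while selecting the column first returns a Series.

```python
import pandas as pd

# Made-up data; values are illustrative only
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 2],
    'order_number': [10, 20, 3, 6, 9],
})

# agg() with a dict: returns a DataFrame with two-level column labels
via_agg = toy.groupby('department_id').agg({'order_number': ['mean']})

# Select the column, then call the method: returns a Series
via_method = toy.groupby('department_id')['order_number'].mean()

# Same numbers, different container
same_values = (via_agg[('order_number', 'mean')] == via_method).all()
print(same_values)
```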

In [75]:
#Dot notation vs. square brackets
#df.groupby('department_id').order_number.mean()
#Dot notation gives the same output here; however, there are reasons to avoid it
#Square brackets stand out and are more readable
#Dot notation only works when the column name is a valid Python identifier and does not clash with an existing attribute or method (e.g. a column named 'count'), so square brackets are the safer habit
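
One practical reason to prefer square brackets is that dot notation cannot reach every column. A minimal sketch, using a hypothetical frame whose column names were chosen to trigger the problem (neither `count` nor `order number` exists in the Instacart data):

```python
import pandas as pd

# Hypothetical columns chosen to show where dot notation breaks down
toy = pd.DataFrame({
    'department_id': [1, 1, 2],
    'count': [5, 7, 9],            # same name as the DataFrame.count() method
    'order number': [10, 20, 30],  # contains a space
})

# Square brackets always return the column
bracket_total = toy['count'].sum()
print(bracket_total)

# Dot notation finds the method first, not the column
dot_is_method = callable(toy.count)
print(dot_is_method)

# toy.order number would be a SyntaxError, so brackets are the only option
means = toy.groupby('department_id')['order number'].mean()
print(means)
```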

In [76]:
#perform multiple aggregations at once
#find mean, min and max
df.groupby('department_id').agg({'order_number': ['mean', 'min', 'max']})

department_id,order_number (mean),order_number (min),order_number (max)
1,15.577493,1,99
2,17.320781,1,96
3,16.084944,1,99
4,17.530458,1,99
5,14.763075,1,99
6,16.658449,1,99
7,17.03159,1,99
8,15.076662,1,98
9,15.44758,1,99
10,18.681852,1,99
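
Passing a list of functions produces two-level column labels. If flat column names are preferred, pandas also supports named aggregation. The sketch below is an alternative to the dict form used above, not what this notebook ran; the `toy` frame and the output names `mean_order`, `min_order` and `max_order` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data; values are illustrative only
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 2],
    'order_number': [1, 99, 1, 50, 96],
})

# Named aggregation: new_column_name=(source_column, function)
summary = toy.groupby('department_id').agg(
    mean_order=('order_number', 'mean'),
    min_order=('order_number', 'min'),
    max_order=('order_number', 'max'),
)
print(summary)
```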


## 4. Aggregating with Transform

In [78]:
#Create a flag for 'loyalty' customers
#'Loyalty' customers are those who come back time and time again to use the service or buy products
#Loyal customer = max orders are over 40
#Regular customer = max orders are over 10 but less than or equal to 40
#New Customer = max orders are less than or equal to 10

In [79]:
#passing the string 'max' avoids the FutureWarning that recent pandas versions raise for transform(np.max)
ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')


In [97]:
ords_prods_merge.head(15)

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered,_merge,price_range_loc,busiest_days,busiest_period_of_day,max_order
0,0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,3.0,5,0,both,Mid-range product,Regularly busy,Most orders,32
1,0,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,20.0,1,1,both,Mid-range product,Regularly busy,Average orders,32
2,0,1,Chocolate Sandwich Cookies,61,19,5.8,389851,709,2,0,21,6.0,20,0,both,Mid-range product,Busiest days,Average orders,5
3,0,1,Chocolate Sandwich Cookies,61,19,5.8,652770,764,1,3,13,,10,0,both,Mid-range product,Slowest days,Most orders,3
4,0,1,Chocolate Sandwich Cookies,61,19,5.8,1813452,764,3,4,17,9.0,11,1,both,Mid-range product,Slowest days,Average orders,3
5,0,1,Chocolate Sandwich Cookies,61,19,5.8,1701441,777,16,1,7,26.0,7,0,both,Mid-range product,Busiest days,Average orders,26
6,0,1,Chocolate Sandwich Cookies,61,19,5.8,1871483,825,3,2,14,30.0,2,0,both,Mid-range product,Regularly busy,Most orders,9
7,0,1,Chocolate Sandwich Cookies,61,19,5.8,1290456,910,12,3,10,30.0,1,0,both,Mid-range product,Slowest days,Most orders,12
8,0,1,Chocolate Sandwich Cookies,61,19,5.8,369558,1052,10,1,20,19.0,1,0,both,Mid-range product,Busiest days,Average orders,20
9,0,1,Chocolate Sandwich Cookies,61,19,5.8,589712,1052,15,1,12,15.0,2,1,both,Mid-range product,Busiest days,Most orders,20


In [99]:
#New 'max_order' column at the end of ords_prods_merge
#Each value is the highest 'order_number' recorded for that row's user_id, repeated on every row belonging to that user
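
The reason `transform()` is used here rather than a plain aggregation is the shape of the result. A minimal sketch on made-up data (the `toy` frame and its user IDs are assumptions): the aggregation returns one value per user, while `transform()` returns one value per row, so it can be assigned straight back as a new column.

```python
import pandas as pd

# Made-up data: three users, five rows
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'order_number': [28, 30, 2, 1, 3],
})

# Aggregation: one value per user
per_user = toy.groupby('user_id')['order_number'].max()

# Transform: one value per original row, aligned to the original index
per_row = toy.groupby('user_id')['order_number'].transform('max')

print(len(per_user), len(per_row))

toy['max_order'] = per_row
print(toy)
```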

In [101]:
#Check output
ords_prods_merge.head(100)

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered,_merge,price_range_loc,busiest_days,busiest_period_of_day,max_order
0,0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,3.0,5,0,both,Mid-range product,Regularly busy,Most orders,32
1,0,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,20.0,1,1,both,Mid-range product,Regularly busy,Average orders,32
2,0,1,Chocolate Sandwich Cookies,61,19,5.8,389851,709,2,0,21,6.0,20,0,both,Mid-range product,Busiest days,Average orders,5
3,0,1,Chocolate Sandwich Cookies,61,19,5.8,652770,764,1,3,13,,10,0,both,Mid-range product,Slowest days,Most orders,3
4,0,1,Chocolate Sandwich Cookies,61,19,5.8,1813452,764,3,4,17,9.0,11,1,both,Mid-range product,Slowest days,Average orders,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,1,Chocolate Sandwich Cookies,61,19,5.8,602103,10831,8,3,11,23.0,5,0,both,Mid-range product,Slowest days,Most orders,10
96,0,1,Chocolate Sandwich Cookies,61,19,5.8,49629,11119,18,1,14,30.0,1,0,both,Mid-range product,Busiest days,Most orders,23
97,0,1,Chocolate Sandwich Cookies,61,19,5.8,317888,11186,13,5,16,2.0,8,0,both,Mid-range product,Regularly busy,Most orders,26
98,0,1,Chocolate Sandwich Cookies,61,19,5.8,682486,11243,16,3,13,0.0,2,0,both,Mid-range product,Slowest days,Most orders,43


## 5. Deriving Columns with loc()

In [106]:
#create a flag that assigns a 'loyalty' label to each user ID based on its corresponding max_order value

In [108]:
ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [110]:
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [112]:
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'
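
The three conditions can be checked at their boundaries on a small frame. A minimal sketch, assuming a made-up `toy` frame whose `max_order` values sit on and around the cut-offs of 10 and 40:

```python
import pandas as pd

# Made-up values placed on and around the cut-offs
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41, 99]})

toy.loc[toy['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'
toy.loc[(toy['max_order'] <= 40) & (toy['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'
toy.loc[toy['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

# The three conditions do not overlap and leave no gaps, so no row stays NaN
all_flagged = toy['loyalty_flag'].notna().all()
print(toy)
print(all_flagged)
```

A `max_order` of exactly 10 is flagged 'New customer' and exactly 40 is flagged 'Regular customer', matching the rules stated in section 4.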

In [114]:
#Check value counts
ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64

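**Observations:**
- value_counts() counts rows (one per product in an order), not unique customers
- Most rows, about 15.9 million of 32.4 million, fall in the 'Regular customer' category
- The fewest rows, about 6.2 million, fall in the 'New customer' category
- No flag is missing: the three counts sum to 32,404,859, the total number of rows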
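
Because each row is one product in one order, frequent shoppers contribute many more rows than occasional ones, so row counts and customer counts can differ. A minimal sketch on made-up data (the `toy` frame is an assumption) shows one way to count unique users per flag with `nunique()`:

```python
import pandas as pd

# Made-up rows: one row per product in an order, so users repeat
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 3, 3],
    'loyalty_flag': ['Loyal customer'] * 4 + ['New customer'] + ['Regular customer'] * 2,
})

# Rows per flag (what value_counts() reports)
rows_per_flag = toy['loyalty_flag'].value_counts()

# Unique customers per flag
users_per_flag = toy.groupby('loyalty_flag')['user_id'].nunique()

print(rows_per_flag)
print(users_per_flag)
```

In this toy frame the loyal customer accounts for four of seven rows but is only one of three users.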

In [131]:
#Determine if the prices of products purchased by loyal customers differ from those purchased by regular or new customers
ords_prods_merge.groupby('loyalty_flag').agg({'prices' : ['mean', 'min', 'max']})

loyalty_flag,prices (mean),prices (min),prices (max)
Loyal customer,7.772831,1.0,25.0
New customer,7.80032,1.0,25.0
Regular customer,7.797431,1.0,25.0


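**Observations:**
- There is little difference in the prices of products purchased by loyal, regular or new customers
- For each customer type, the average price of products purchased is between about $7.77 and $7.80
- The minimum ($1.00) and maximum ($25.00) prices are identical across all three customer types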

In [117]:
#Check flags assigned properly
ords_prods_merge[['user_id', 'loyalty_flag', 'order_number']].head(60)

Unnamed: 0,user_id,loyalty_flag,order_number
0,138,Regular customer,28
1,138,Regular customer,30
2,709,New customer,2
3,764,New customer,1
4,764,New customer,3
5,777,Regular customer,16
6,825,New customer,3
7,910,Regular customer,12
8,1052,Regular customer,10
9,1052,Regular customer,15


In [119]:
#Export df with new columns
ords_prods_merge.to_pickle(os.path.join(path, '02 Data','IC24 Prepared Data', 'ords_prods_merge_derived_new_variable.pkl'))

In [121]:
ords_prods_merge.shape

(32404859, 20)