# <a id='toc1_'></a>[Final Report (2 of 3)](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Final Report (2 of 3)](#toc1_)    
  - [II. Data Analysis](#toc1_1_)    
    - [II.3. Customer profiling](#toc1_1_1_)    
      - [II.3.1. Create a profiling variable based on age, income, family status, number of dependants and certain goods in the 'department_id' column](#toc1_1_1_1_)    
        - [II.3.1.4. Customer segmentation based on their ordered goods in the 'department_id' column](#toc1_1_1_1_1_)    
  - [IV. Data Export](#toc1_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
# import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import scipy

In [2]:
# create a path to the directory
path = r'C:\Users\Ansgar.S\Uyen\OneDrive\Documents\Data Immersion\Achievement IV - Python Fundamentals for Data Analysts\02-2023 Instacart Basket Analysis'

# import the 'orders_products_secured_high_activity.pkl' dataset
df_secured_high_activity = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_secured_high_activity.pkl'))

# import the 'departments_wrangled.csv' dataset
df_deps = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'))

## <a id='toc1_1_'></a>[II. Data Analysis](#toc0_)

### <a id='toc1_1_1_'></a>[II.3. Customer profiling](#toc0_)

#### <a id='toc1_1_1_1_'></a>[II.3.1. Create a profiling variable based on age, income, family status, number of dependants and certain goods in the 'department_id' column](#toc0_)

##### <a id='toc1_1_1_1_1_'></a>[II.3.1.4. Customer segmentation based on their ordered goods in the 'department_id' column](#toc0_)

In [4]:
# make sure pandas display all columns
pd.set_option('display.max_columns', None)

In [3]:
# check the output of df_secured_high_activity
print('Sample output of df_secured_high_activity:')
df_secured_high_activity.sample(3)

Sample output of df_secured_high_activity:


Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,...,age,date_joined,number_of_dependants,family_status,income,region,low_activity_flag,age_bracket,income_bracket,family_type
4024993,6348,Mini Original Babybel Cheese,21,16,11.8,1428546,60049,5,6,20,...,75,5/23/2019,2,married,104271,Midwest,High activity,Senior,High income,Married with children
26236694,35321,Organic Lowfat Milk,84,16,5.1,2363720,182981,5,0,12,...,22,12/8/2019,2,married,77809,South,High activity,Young adult,Upper-middle income,Married with children
16188837,17600,"YoKids Squeezers Organic Low-Fat Yogurt, Straw...",120,16,14.1,3167230,153631,3,0,8,...,45,1/30/2017,0,single,45287,South,High activity,Middle-aged adult,Lower-middle income,Single with no children


In [None]:
# check the departments
print('All departments:')
df_deps

All departments:


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol
5,6,international
6,7,beverages
7,8,pets
8,9,dry goods pasta
9,10,bulk


In [3]:
# create a copy of df_secured_high_activity with only columns 'user_id', 'product_id' and 'department_id'
df_users_deps = df_secured_high_activity[['user_id', 'product_id', 'department_id']]

In [4]:
# change the data type of column 'department_id' in df_deps to string
df_deps['department_id'] = df_deps['department_id'].astype('str')

In [5]:
# create column 'department' in df_users_deps with the name of corresponding department_id
df_users_deps = df_users_deps.merge(df_deps[['department_id', 'department']], on = 'department_id', how = 'left')

In [7]:
# group the data by user_id and department
print('df_users_deps grouped by user_id and department:')
df_users_deps.groupby(['user_id', 'department'])['product_id'].count().reset_index()

df_users_deps grouped by user_id and department:


Unnamed: 0,user_id,department,product_id
0,1,beverages,13
1,1,breakfast,3
2,1,dairy eggs,13
3,1,household,2
4,1,pantry,1
...,...,...,...
1779491,99999,meat seafood,14
1779492,99999,missing,1
1779493,99999,pantry,3
1779494,99999,produce,129


**Observation**: There are several customer segmentations based on their purchased orders' department:

- *Health-conscious shoppers*: Customers who purchase items from the produce (4) department
- *Alcohol drinkers*: Customers who purchase alcohol (5)
- *Pet owners*: Customers who purchase pet (8) items
- *Meat-eaters*: Customers who purchase items from the meat and seafood (12) departments
- *Family-oriented shoppers*: Customers who frequently buy products from the household (17) and babies (18) departments
- *Convenience shoppers*: Customers who mainly purchase from the frozen (1), canned goods (15), snacks (19) or deli (20) departments
- *Bulk buyers*: Customers who purchase in bulk (10)

a. Create flag for health-conscious customers

In [6]:
# create column 'health_conscious_shopper' in in df_users_deps based on each user's produce-related orders
df_users_deps['health_conscious_shopper'] = np.where(df_users_deps['department'] == 'produce', 'Yes', 'No')

b. Create flag for alcohol drinkers

In [7]:
# create column 'alcohol_drinker' in in df_users_deps based on each user's alcohol-related orders
df_users_deps['alcohol_drinker'] = np.where(df_users_deps['department'] == 'alcohol', 'Yes', 'No')

c. Create flag for pet owners

In [8]:
# create column 'pet_owner' in in df_users_deps based on each user's pets-related orders
df_users_deps['pet_owner'] = np.where(df_users_deps['department'] == 'pets', 'Yes', 'No')

d. Create flag for meat-eaters

In [9]:
# create column 'meat_eater' in in df_users_deps based on each user's meat and seafood-related orders
df_users_deps['meat_eater'] = np.where(df_users_deps['department'] == 'meat seafood', 'Yes', 'No')

e. Create flag for family-oriented customers

In [10]:
# create column 'family_oriented_shopper' in in df_users_deps based on each user's household & babies-related orders
df_users_deps['family_oriented_shopper'] = np.where(df_users_deps['department'].isin(['household', 'babies']), 'Yes', 'No')

f. Create flag for convenience shoppers

In [11]:
# create column 'convenience_shopper' in in df_users_deps based on each user's frozen, canned goods, snacks & deli-related orders
df_users_deps['convenience_shopper'] = np.where(df_users_deps['department'].isin(['frozen', 'canned goods', 'snacks', 'deli']), 'Yes', 'No')

g. Create flag for bulk buyers

In [12]:
# create column 'bulk_buyer' in in df_users_deps based on each user's bulk-related orders
df_users_deps['bulk_buyer'] = np.where(df_users_deps['department'] == 'bulk', 'Yes', 'No')

h. Create customer shopping behavior profiles

In [19]:
# count the values in health_conscious_shopper in df_users_deps grouped by user_id
print('Count of the values in health_conscious_shopper grouped by user_id:')
df_users_deps.groupby(['user_id'])['health_conscious_shopper'].count().reset_index()

Count of the values in health_conscious_shopper grouped by user_id:


Unnamed: 0,user_id,health_conscious_shopper
0,1,59
1,10,143
2,100,27
3,1000,103
4,10000,1092
...,...,...
153800,99994,164
153801,99995,50
153802,99996,128
153803,99998,21


In [13]:
# group df_users_deps by user_id and get the maximum value for each column
df_behavior_grouped = df_users_deps.groupby('user_id').max().reset_index()

In [14]:
# drop columns from df_behavior_grouped as only user_id and customer shopping behavior-related columns are needed for analysis
df_behavior_grouped = df_behavior_grouped.drop(columns = ['product_id', 'department_id', 'department'])

In [17]:
# check the outputs of df_behavior_grouped by user_id 4
print('Outputs of df_behavior_grouped by user_id 4:')
df_behavior_grouped[df_behavior_grouped['user_id'] == 4]

Outputs of df_behavior_grouped by user_id 4:


Unnamed: 0,user_id,health_conscious_shopper,alcohol_drinker,pet_owner,meat_eater,family_oriented_shopper,convenience_shopper,bulk_buyer


In [18]:
# compare the output above to the outputs of df_users_deps by user_id 4
print('Comparing to the outputs of df_users_deps by user_id 4:')
df_users_deps[df_users_deps['user_id'] == 4]

Comparing to the outputs of df_users_deps by user_id 4:


Unnamed: 0,user_id,product_id,department_id,department,health_conscious_shopper,alcohol_drinker,pet_owner,meat_eater,family_oriented_shopper,convenience_shopper,bulk_buyer


*Dataframe df_behavior_grouped is working as intended.*

In [15]:
# merge df_behavior_grouped with the main dataframe df_secured_high_activity
merged_df = pd.merge(df_secured_high_activity, df_behavior_grouped, on='user_id', how='left')

In [None]:
# check the outputs of merged_df
print('Outputs of merged_df:')
merged_df.head(3)

Outputs of merged_df:


Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,...,age_bracket,income_bracket,family_type,health_conscious_shopper,alcohol_drinker,pet_owner,meat_eater,family_oriented_shopper,convenience_shopper,bulk_buyer
0,1,Chocolate Sandwich Cookies,61,19,5.8,3139998,138,28,6,11,...,Senior,Lower-middle income,Married with no children,Yes,No,No,Yes,Yes,Yes,Yes
1,1,Chocolate Sandwich Cookies,61,19,5.8,1977647,138,30,6,17,...,Senior,Lower-middle income,Married with no children,Yes,No,No,Yes,Yes,Yes,Yes
2,907,Premium Sliced Bacon,106,12,20.0,3160996,138,1,5,13,...,Senior,Lower-middle income,Married with no children,Yes,No,No,Yes,Yes,Yes,Yes


## <a id='toc1_2_'></a>[IV. Data Export](#toc0_)

In [None]:
# export merged_df in .pkl format
merged_df.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_customer_profiles_high_activity.pkl'))