## 4.10 Task Part 1 - PII, Regions, and Spending

### Contents

01 Import Libraries and Data

02 Address Personally Identifiable Information

03 Assign Customers to Regions

04 Create Crosstab of Region and Spending Flag

05 Analyze Crosstab of Region and Spending Flag

06 Export Data

### 01 Import Libraries and Data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# Create path variable for main project folder
path = r'D:\JupyterProjects\06-2022 Instacart Basket Analysis'

In [3]:
# Import the merged data set
df = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'merged_final.pkl'))

In [4]:
# Check data dimensions
df.shape

(32404859, 34)

In [5]:
# View top five rows
df.head()

,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,...,frequency_flag,first_name,last_name,gender,state,age,date_joined,no_of_dependents,fam_status,income
0,2539329,1,1,2,8,,True,196,1,0,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,False,196,1,1,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,False,196,1,1,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423


### 02 Address Personally Identifiable Information

In [6]:
# Drop columns with personally identifiable information. Also drop aisle_id and _merge to reduce memory use.
df_drop = df.drop(columns = ['first_name', 'last_name', 'aisle_id', '_merge'])
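A quick programmatic check can confirm that no name columns survive the drop, instead of relying on a visual scan of `head()`. The following is a minimal sketch on a toy frame; `demo`, `demo_drop`, and `pii_columns` are illustrative names that do not appear in the notebook, and the single row reuses values shown in the output above.

```python
import pandas as pd

# Hypothetical list of columns treated as PII in this sketch
pii_columns = ["first_name", "last_name"]

# Toy stand-in for the merged data set (one row taken from the output above)
demo = pd.DataFrame(
    {
        "user_id": [1],
        "first_name": ["Linda"],
        "last_name": ["Nguyen"],
        "state": ["Alabama"],
    }
)

# errors="ignore" skips any listed column that is already absent
demo_drop = demo.drop(columns=pii_columns, errors="ignore")

# Any PII column still present would show up in this set
leftover = set(pii_columns) & set(demo_drop.columns)
print(leftover)
```

An empty set means every listed column was removed. The same check could be run against `df_drop` in place of `demo_drop`.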

In [7]:
# Verify changes
df_drop.head()

,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,...,spending_flag,median_days_prior,frequency_flag,gender,state,age,date_joined,no_of_dependents,fam_status,income
0,2539329,1,1,2,8,,True,196,1,0,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,False,196,1,1,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,False,196,1,1,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423


### 03 Assign Customers to Regions

In [8]:
# Build a list of region labels, one per row, from each customer's state.
# States that match none of the four lists are labelled 'error'.
result = []

for value in df_drop["state"]:
    if value in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        result.append('Northeast')
    elif value in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        result.append('Midwest')
    elif value in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        result.append('South')
    elif value in ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona', 'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii']:
        result.append('West')
    else:
        result.append('error')

In [9]:
# Create new column for region
df_drop['region'] = result
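The row-by-row loop above runs once per record, which is slow on 32 million rows. A vectorized alternative is to build a state-to-region dictionary and apply it with `Series.map`. This is a minimal sketch rather than the notebook's method; `region_states`, `state_to_region`, and the toy frame `demo` are illustrative names, and the state lists are copied from the loop above.

```python
import pandas as pd

# Region membership, copied from the lists used in the loop
region_states = {
    "Northeast": ["Maine", "New Hampshire", "Vermont", "Massachusetts",
                  "Rhode Island", "Connecticut", "New York", "Pennsylvania",
                  "New Jersey"],
    "Midwest": ["Wisconsin", "Michigan", "Illinois", "Indiana", "Ohio",
                "North Dakota", "South Dakota", "Nebraska", "Kansas",
                "Minnesota", "Iowa", "Missouri"],
    "South": ["Delaware", "Maryland", "District of Columbia", "Virginia",
              "West Virginia", "North Carolina", "South Carolina", "Georgia",
              "Florida", "Kentucky", "Tennessee", "Mississippi", "Alabama",
              "Oklahoma", "Texas", "Arkansas", "Louisiana"],
    "West": ["Idaho", "Montana", "Wyoming", "Nevada", "Utah", "Colorado",
             "Arizona", "New Mexico", "Alaska", "Washington", "Oregon",
             "California", "Hawaii"],
}

# Invert to a flat lookup: state name -> region name
state_to_region = {
    state: region
    for region, states in region_states.items()
    for state in states
}

# Toy data; "Puerto Rico" is included only to exercise the fallback label
demo = pd.DataFrame(
    {"state": ["Alabama", "Maine", "Ohio", "Hawaii", "Puerto Rico"]}
)

# map() returns NaN for unknown states; fillna() mirrors the loop's 'error' branch
demo["region"] = demo["state"].map(state_to_region).fillna("error")
print(demo)
```

On the full data the equivalent call would be `df_drop['state'].map(state_to_region).fillna('error')`, assuming `state` is stored as plain strings.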

In [10]:
# Check region frequencies
df_drop['region'].value_counts(dropna = False)

South        10791885
West          8292913
Midwest       7597325
Northeast     5722736
Name: region, dtype: int64

### 04 Create Crosstab of Region and Spending Flag

In [11]:
# Create crosstab
crosstab = pd.crosstab(df_drop['region'], df_drop['spending_flag'], dropna = False)

In [12]:
# View crosstab
crosstab

region,High spender,Low spender
Midwest,29376,7567949
Northeast,18642,5704094
South,40739,10751146
West,31366,8261547


### 05 Analyze Crosstab of Region and Spending Flag

In [13]:
# Find percentage of high spenders per region: hard-coded high-spender count divided by the region's row total (from value_counts), times 100
Midwest_high_spenders = (155975 / (7597325)) * 100
Northeast_high_spenders = (108225 / (5722736)) * 100
South_high_spenders = (209691 / (10791885)) * 100
West_high_spenders = (160354 / (8292913)) * 100

In [14]:
# View percentage of high spenders for each region
Midwest_high_spenders, Northeast_high_spenders, South_high_spenders, West_high_spenders

(2.053025242437305, 1.8911408808653762, 1.9430433144904713, 1.9336269414619447)

The percentage of high spenders is similar across regions, ranging from about 1.89% in the Northeast to 2.05% in the Midwest, which leads.
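The shares can also be derived from the crosstab itself rather than typed in by hand, which keeps the numerators tied to the table. The sketch below rebuilds the counts shown in section 04 as a small DataFrame (`counts` and `share_high` are illustrative names). Because it uses those crosstab counts, its results will differ from the figures above, whose numerators were entered manually and do not match the crosstab's High spender column.

```python
import pandas as pd

# Counts copied from the crosstab output in section 04
counts = pd.DataFrame(
    {
        "High spender": [29376, 18642, 40739, 31366],
        "Low spender": [7567949, 5704094, 10751146, 8261547],
    },
    index=pd.Index(["Midwest", "Northeast", "South", "West"], name="region"),
)

# Row totals are the per-region record counts
region_totals = counts.sum(axis=1)

# Share of records flagged as high spender within each region, in percent
share_high = counts["High spender"] / region_totals * 100
print(share_high.round(2))
```

On the full data, `pd.crosstab(df_drop['region'], df_drop['spending_flag'], normalize='index')` should give the same row-wise shares as fractions without the intermediate step.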

### 06 Export Data

In [15]:
# Export data; the analysis continues in a new notebook due to memory issues
df_drop.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'final_region.pkl'))
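Since memory pressure comes up more than once in this notebook, one option worth considering before the export is storing low-cardinality text columns such as `region` as the pandas `category` dtype. This is a sketch on toy data, not something the notebook does; `demo`, `before`, and `after` are illustrative names, and the actual savings on the full data set are not measured here.

```python
import pandas as pd

# Toy column with four distinct labels repeated many times
demo = pd.DataFrame(
    {"region": ["South", "West", "Midwest", "Northeast"] * 2500}
)

# deep=True counts the string payloads, not just the pointers
before = demo["region"].memory_usage(deep=True)

# category stores each distinct label once plus a small integer code per row
demo["region"] = demo["region"].astype("category")
after = demo["region"].memory_usage(deep=True)

print(f"{before:,} bytes -> {after:,} bytes")
```

Pickle files preserve pandas dtypes, so a column converted this way would still be categorical when the file is read back with `pd.read_pickle`.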