# Task A4.09 Intro to Data Visualization with Python Part 1

### Table of Contents
##### 01. Import Libraries & Dataframe Creation
##### 02. Wrangle Customers Data
##### 03. Complete Fundamental Data Quality & Consistency Checks
##### 04. Combine Customer Data with Prepared Instacart Data
##### 05. Export New Dataframe

### 01. Import Libraries & Dataframe Creation

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# path shortcut
path = r'/Users/nicolechiu/OneDrive - InterVarsity Christian Fellowship USA/Documents/CF Data Analytics/Achievement 4/05-2023 Instacart Basket Analysis'

In [3]:
# Import customer dataframe
df_customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

### 02. Wrangle Customers Data

In [4]:
# Check shape
df_customers.shape

(206209, 10)

In [5]:
# Check output
df_customers.head()

| | user_id | First Name | Surnam | Gender | STATE | Age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


I will rename some columns so they are easier to read: "Surnam" becomes "last_name" to match the first-name column, "STATE" becomes "state" for consistent formatting, and "n_dependants" becomes "number_of_dependants" to make its meaning clearer.

I will also remove capital letters and use _ for spaces so the column names are consistent with previous dataframes.

I will not drop any columns at this point.

In [6]:
# Renaming surname column to 'last_name'
df_customers.rename(columns = {'Surnam': 'last_name'}, inplace = True)

In [7]:
# Changing formatting of STATE
df_customers.rename(columns = {'STATE': 'state'}, inplace = True)

In [8]:
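# Attempting to rename the dependants column to 'number_of_dependents'
# (the key 'n_dependents' is misspelled, so this call changes nothing; corrected in In [13])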
df_customers.rename(columns = {'n_dependents': 'number_of_dependents'}, inplace = True)

In [9]:
# Changing formatting of First Name
df_customers.rename(columns = {'First Name': 'first_name'}, inplace = True)

In [10]:
# Changing formatting of Gender
df_customers.rename(columns = {'Gender': 'gender'}, inplace = True)

In [11]:
# Changing formatting of Age
df_customers.rename(columns = {'Age': 'age'}, inplace = True)

In [12]:
# Checking the output
df_customers.head()

| | user_id | first_name | last_name | gender | state | age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


In [13]:
# Correcting spelling of dependants...
# Renaming n_dependants column to 'number_of_dependants'
df_customers.rename(columns = {'n_dependants': 'number_of_dependants'}, inplace = True)

In [14]:
# Checking the output
df_customers.head()

| | user_id | first_name | last_name | gender | state | age | date_joined | number_of_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |
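
The renames above could also be collected into one mapping. The following is a minimal sketch on a toy dataframe (the values are made up; only the column names come from customers.csv). Passing `errors='raise'` is a choice I am assuming here so that a misspelled key such as 'n_dependents' fails loudly instead of being silently ignored.

```python
import pandas as pd

# Toy frame with the same raw column names as customers.csv (values are made up)
df = pd.DataFrame({
    'user_id': [1, 2],
    'First Name': ['Ann', 'Bob'],
    'Surnam': ['Lee', 'Cruz'],
    'Gender': ['Female', 'Male'],
    'STATE': ['Iowa', 'Idaho'],
    'Age': [30, 41],
    'n_dependants': [0, 2],
})

# One mapping holds every rename, so the full plan is visible in one place
column_map = {
    'First Name': 'first_name',
    'Surnam': 'last_name',
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'number_of_dependants',
}

# errors='raise' makes pandas raise a KeyError if any key is not a column
df = df.rename(columns=column_map, errors='raise')
print(df.columns.tolist())
```

With the default `errors='ignore'`, a key that matches no column is skipped without any warning, which is why In [8] produced no visible change.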


### 03. Complete Fundamental Data Quality & Consistency Checks

In [15]:
# Checking data types
df_customers.dtypes

user_id                  int64
first_name              object
last_name               object
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependants     int64
fam_status              object
income                   int64
dtype: object

Looking at these data types, most seem appropriate. I will change user_id to a string, since it is an identifier and will not be used in calculations.

In [16]:
# Change user_id to string
df_customers['user_id'] = df_customers['user_id'].astype('str')

In [17]:
# Checking output
df_customers.dtypes

user_id                 object
first_name              object
last_name               object
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependants     int64
fam_status              object
income                   int64
dtype: object

In [18]:
# Checking data's basic statistics
df_customers.describe()

| | age | number_of_dependants | income |
|---|---|---|---|
| count | 206209.0 | 206209.0 | 206209.0 |
| mean | 49.501646 | 1.499823 | 94632.852548 |
| std | 18.480962 | 1.118433 | 42473.786988 |
| min | 18.0 | 0.0 | 25903.0 |
| 25% | 33.0 | 0.0 | 59874.0 |
| 50% | 49.0 | 1.0 | 93547.0 |
| 75% | 66.0 | 3.0 | 124244.0 |
| max | 81.0 | 3.0 | 593901.0 |


Overall these numbers seem reasonable, though the maximum number of dependants (3) seems low: I would expect some family units to have more than 3 dependants. I am also curious whether income is combined income for married customers. If it were individual income and some customers were stay-at-home parents, I would expect the minimum to be 0 rather than 25,903.
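
One way to follow up on these observations is to look at the distribution of dependants and count unusually high incomes. This is only a sketch on made-up toy data, and the `income_ceiling` of 200,000 is a hypothetical threshold chosen for illustration, not a value taken from the project.

```python
import pandas as pd

# Toy data (made up) standing in for df_customers
df = pd.DataFrame({
    'number_of_dependants': [0, 1, 3, 3, 2],
    'income': [25903, 59874, 93547, 124244, 593901],
})

# How many customers fall at each dependant count
dependant_counts = df['number_of_dependants'].value_counts().sort_index()
print(dependant_counts)

# Rows above a hypothetical income ceiling, picked only for illustration
income_ceiling = 200_000
high_income = df[df['income'] > income_ceiling]
print(len(high_income))
```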

In [19]:
# Checking for missing values
df_customers.isnull().sum()

user_id                     0
first_name              11259
last_name                   0
gender                      0
state                       0
age                         0
date_joined                 0
number_of_dependants        0
fam_status                  0
income                      0
dtype: int64

There are 11,259 missing first_name values, about 5.5% of the 206,209 rows. While that is a meaningful number of missing values, I doubt first_name will be an important column for the analysis. I will leave the column as is and simply omit it from printed output when necessary.
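
As a rough sketch of both ideas (sizing the gap and working around it), the share of missing values per column can be computed with `isnull().mean()`, and the sparse column can be left out of a working copy. The toy dataframe and the name `df_no_first_name` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made up); one of four first names is missing
df = pd.DataFrame({
    'user_id': ['1', '2', '3', '4'],
    'first_name': pd.Series(['Ann', np.nan, 'Bob', 'Cy'], dtype=object),
    'income': [40000, 50000, 60000, 70000],
})

# Fraction of missing values in each column (0.0 means complete)
missing_share = df.isnull().mean()
print(missing_share)

# A view of the data without first_name; the original frame is unchanged
df_no_first_name = df.drop(columns=['first_name'])
print(df_no_first_name.columns.tolist())
```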

In [20]:
# Checking for mixed-type data
for col in df_customers.columns.tolist():
    weird = (df_customers[[col]].applymap(type) != df_customers[[col]].iloc[0].apply(type)).any(axis=1)
    if len (df_customers[weird]) > 0:
        print (col)

first_name


The first_name column contains mixed-type data. This is likely caused by the missing values, which pandas stores as float NaN alongside the string names. I will convert the first_name column to string.

In [21]:
# Converting first_name column to string
df_customers['first_name'] = df_customers['first_name'].astype('str')

In [22]:
# Checking if this resolved the mixed-type data 
for col in df_customers.columns.tolist():
    weird = (df_customers[[col]].applymap(type) != df_customers[[col]].iloc[0].apply(type)).any(axis=1)
    if len (df_customers[weird]) > 0:
        print (col)

Because there is no output, this seems to have resolved the issue.
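
To confirm that the flag comes only from missing values, the check can be repeated while ignoring NaN. The sketch below uses a hypothetical helper, `mixed_type_columns`, on made-up data; it relies on `Series.map` rather than `applymap`, which recent pandas releases deprecate. One caveat worth checking in your own environment: in many pandas versions `astype('str')` stores missing values as the text 'nan', so `isnull()` would no longer count them afterwards.

```python
import numpy as np
import pandas as pd

# Toy data (made up): a name column with one missing value
df = pd.DataFrame({
    'first_name': pd.Series(['Ann', np.nan, 'Bob'], dtype=object),
    'age': [30, 41, 52],
})

def mixed_type_columns(frame, skip_missing=False):
    """Return names of columns whose values have more than one Python type."""
    flagged = []
    for col in frame.columns:
        # Optionally drop NaN first so missing values do not count as a type
        values = frame[col].dropna() if skip_missing else frame[col]
        if values.map(type).nunique() > 1:
            flagged.append(col)
    return flagged

print(mixed_type_columns(df))                     # NaN counted as a float
print(mixed_type_columns(df, skip_missing=True))  # NaN ignored
```

If the second call returns an empty list, the column holds a single type apart from its missing values, and no conversion is strictly needed.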

In [23]:
# Creating a dataframe of fully duplicated rows
df_dups = df_customers[df_customers.duplicated()]

In [24]:
# Checking how many duplicated rows were found
df_dups.shape

(0, 10)

No duplicates found.
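
`duplicated()` with no arguments only flags rows that match on every column. Since user_id will be the merge key, it may also be worth checking that no user_id appears twice. This is a sketch on made-up data, not a check that was run in this notebook.

```python
import pandas as pd

# Toy data (made up): no fully identical rows, but user_id '2' repeats
df = pd.DataFrame({
    'user_id': ['1', '2', '2'],
    'state': ['Iowa', 'Idaho', 'Ohio'],
})

# Rows identical across all columns
full_row_dups = df.duplicated().sum()

# Rows that repeat an earlier user_id, regardless of the other columns
id_dups = df.duplicated(subset=['user_id']).sum()

print(full_row_dups, id_dups)
```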

In [25]:
# Checking for number of rows after consistency checks
df_customers.shape

(206209, 10)

### 04. Combine Customer Data with Prepared Instacart Data

In [26]:
# Create dataframe of most recent Instacart data
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_task_4_8.pkl'))

In [27]:
# I expect to merge on the user_id key column, but checking the output to confirm
df_ords_prods_merged.head()

| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | product_name | ... | price_range_loc | busiest_day | busiest_days | busiest_period_of_day | max_order | loyalty_flag | mean_price | spending_flag | median_days_since_prior_order | frequency_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | Soda | ... | Mid range product | Regularly busy | Regularly busy | Average orders | 10 | New customer | 6.367797 | Low spender | 20.5 | Non-frequency customer |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | ... | Mid range product | Regularly busy | Least busiest days | Average orders | 10 | New customer | 6.367797 | Low spender | 20.5 | Non-frequency customer |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | ... | Mid range product | Regularly busy | Least busiest days | Most orders | 10 | New customer | 6.367797 | Low spender | 20.5 | Non-frequency customer |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | ... | Mid range product | Least busy | Least busiest days | Average orders | 10 | New customer | 6.367797 | Low spender | 20.5 | Non-frequency customer |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | ... | Mid range product | Least busy | Least busiest days | Most orders | 10 | New customer | 6.367797 | Low spender | 20.5 | Non-frequency customer |


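As expected, I will merge the two dataframes on the user_id key column. The key must have the same data type in both dataframes. Since user_id is now a string in df_customers, it needs to be a string in df_ords_prods_merged as well.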

In [28]:
# Double-checking data type of user_id in df_ords_prods_merged
df_ords_prods_merged.dtypes

order_id                            int64
user_id                             int64
order_number                        int64
orders_day_of_week                  int64
order_hour_of_day                   int64
days_since_prior_order            float64
product_id                          int64
add_to_cart_order                   int64
reordered                           int64
product_name                       object
aisle_id                            int64
department_id                       int64
prices                            float64
_merge                           category
price_range_loc                    object
busiest_day                        object
busiest_days                       object
busiest_period_of_day              object
max_order                           int64
loyalty_flag                       object
mean_price                        float64
spending_flag                      object
median_days_since_prior_order     float64
frequency_flag                    

In [29]:
# Changing user_id to string
df_ords_prods_merged['user_id'] = df_ords_prods_merged['user_id'].astype('str')

In [30]:
# Checking that this worked
df_ords_prods_merged.dtypes

order_id                            int64
user_id                            object
order_number                        int64
orders_day_of_week                  int64
order_hour_of_day                   int64
days_since_prior_order            float64
product_id                          int64
add_to_cart_order                   int64
reordered                           int64
product_name                       object
aisle_id                            int64
department_id                       int64
prices                            float64
_merge                           category
price_range_loc                    object
busiest_day                        object
busiest_days                       object
busiest_period_of_day              object
max_order                           int64
loyalty_flag                       object
mean_price                        float64
spending_flag                      object
median_days_since_prior_order     float64
frequency_flag                    
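
The key-alignment step can be sketched on two small made-up frames. The names `orders` and `customers` are stand-ins I am assuming for df_ords_prods_merged and df_customers. Casting the customers key back to an integer would presumably work just as well; the only requirement is that both sides match.

```python
import pandas as pd

# Toy stand-ins (made up): integer key on one side, string key on the other
orders = pd.DataFrame({'user_id': [1, 1, 2], 'order_id': [10, 11, 12]})
customers = pd.DataFrame({'user_id': ['1', '2'], 'state': ['Alabama', 'Iowa']})

# Cast the integer key to string so both keys share one type
orders['user_id'] = orders['user_id'].astype('str')

# With matching key types, the default inner merge pairs each order with its customer
merged = orders.merge(customers, on='user_id')
print(merged.shape)
```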

In [31]:
# Merging customers with df_ords_prods_merged
# (indicator = 'True' is a string, so the indicator column is named "True")
df_ords_prods_all = df_ords_prods_merged.merge(df_customers, on = 'user_id', indicator = 'True')

In [32]:
# Checking merged data frame
df_ords_prods_all.head()

| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | product_name | ... | first_name | last_name | gender | state | age | date_joined | number_of_dependants | fam_status | income | True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | Soda | ... | Linda | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | ... | Linda | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | ... | Linda | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | ... | Linda | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | ... | Linda | Nguyen | Female | Alabama | 31 | 2/17/2019 | 3 | married | 40423 | both |


In [33]:
# Checking shape of new dataframe
df_ords_prods_all.shape

(32404859, 34)
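
The merge in In [31] uses the default inner join, so every surviving row is marked "both" and unmatched rows are dropped without notice. A sketch of a fuller check, on made-up data, is an outer join with a named indicator column. The name 'customer_merge' is hypothetical. A custom name is used because df_ords_prods_merged already has a `_merge` column, and as far as I know pandas refuses `indicator=True` when that default name is taken.

```python
import pandas as pd

# Toy stand-ins (made up): user '3' has no customer row, user '2' has no orders
orders = pd.DataFrame({'user_id': ['1', '1', '3'], 'order_id': [10, 11, 12]})
customers = pd.DataFrame({'user_id': ['1', '2'], 'state': ['Alabama', 'Iowa']})

# Outer join keeps unmatched rows; the indicator column records each row's origin
check = orders.merge(customers, on='user_id', how='outer', indicator='customer_merge')

# Count rows found in both frames, only in orders, or only in customers
counts = check['customer_merge'].value_counts()
print(counts)
```

If only "both" has a nonzero count on the real data, the inner merge lost no rows.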

### 05. Export New Dataframe

In [34]:
# Export to pickle
df_ords_prods_all.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_all.pkl'))
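
A quick way to confirm an export is to read the file back and compare it with the dataframe in memory. The sketch below does this on a small made-up frame in a temporary folder; the file name 'orders_products_all_sample.pkl' is hypothetical. On the full 32-million-row dataframe the reload could be slow, so comparing shapes alone may be enough.

```python
import os
import tempfile

import pandas as pd

# Toy data (made up) standing in for df_ords_prods_all
df = pd.DataFrame({'user_id': ['1', '2'], 'income': [40423, 59285]})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'orders_products_all_sample.pkl')
    df.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares shape, dtypes, and values
print(reloaded.equals(df))
```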