# Instacart Customer Data Preparation  
# Achievement 4 – Exercise 4.9 (Part 1)

This notebook prepares the Instacart customer dataset for analysis by conducting data wrangling, consistency checks, and merging it with the existing Instacart order and product data. The cleaned dataset will be exported for use in Part 2 of Exercise 4.9.

In [1]:
# Import core analysis libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Set project root by moving one level up from the current working directory
path = os.path.dirname(os.getcwd())

# Check
path

'/Users/jessduong/Documents/CF/Achievement 4_Python/12-2025 Instacart Basket Analysis'

In [3]:
# Import customer data from Original Data folder
df_cust = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

# Preview first rows to confirm successful import
df_cust.head()

,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [4]:
# Check dataframe dimensions
df_cust.shape

(206209, 10)

In [5]:
# Review column names
df_cust.columns


Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

In [6]:
# Review data types
df_cust.dtypes

user_id          int64
First Name      object
Surnam          object
Gender          object
STATE           object
Age              int64
date_joined     object
n_dependants     int64
fam_status      object
income           int64
dtype: object

In [7]:
# Rename columns for consistency and clarity
df_cust.columns = [
    'user_id',
    'first_name',
    'surname',
    'gender',
    'state',
    'age',
    'date_joined',
    'n_dependents',
    'family_status',
    'income'
]

# Verify column names
df_cust.columns

Index(['user_id', 'first_name', 'surname', 'gender', 'state', 'age',
       'date_joined', 'n_dependents', 'family_status', 'income'],
      dtype='object')

In [8]:
# Check for missing values
df_cust.isnull().sum()

user_id              0
first_name       11259
surname              0
gender               0
state                0
age                  0
date_joined          0
n_dependents         0
family_status        0
income               0
dtype: int64

A missing-values check was run across all customer variables before merging
with the Instacart transactional data. It found 11,259 missing entries (about
5.5% of the 206,209 rows), all in the first_name column.
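
A raw count of missing values is easier to judge next to each column's share of rows. The sketch below shows one way to build that summary. It runs on a small made-up frame (`toy`) rather than the real customer file, and the names `toy` and `missing_summary` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the customer data; values are illustrative only
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'first_name': ['Ann', np.nan, 'Kenneth', np.nan],
    'income': [40374, 59285, 99568, 42049],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean().mul(100).round(2),
})

# Keep only the columns that actually contain gaps
missing_summary = missing_summary[missing_summary['n_missing'] > 0]
print(missing_summary)
```

Applied to the customer dataframe, the same pattern would list only the columns that need a decision before the merge.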

In [9]:
# Check for duplicate rows
df_cust.duplicated().sum()

np.int64(0)

In [10]:
# Check for duplicate user IDs
df_cust['user_id'].duplicated().sum()

np.int64(0)

In [11]:
# Fill missing first names with placeholder
df_cust['first_name'] = df_cust['first_name'].fillna('Unknown')

In [12]:
# Verify missing values resolved
df_cust.isnull().sum()

user_id          0
first_name       0
surname          0
gender           0
state            0
age              0
date_joined      0
n_dependents     0
family_status    0
income           0
dtype: int64

Missing values were identified only in the first_name column. Since first_name is not used in behavioral or demographic analysis, missing values were filled with a placeholder to preserve all records for downstream merging.
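
To illustrate why filling was preferred to dropping, here is a minimal sketch on a made-up three-row frame. It compares the two options and shows that only the placeholder approach keeps every `user_id` available for a later merge. The frame `toy` and the variables `dropped` and `filled` are hypothetical names used for illustration.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the customer data; values are illustrative only
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'first_name': ['Deborah', np.nan, 'Michelle'],
})

# Option 1: drop incomplete rows (user 2 disappears entirely)
dropped = toy.dropna(subset=['first_name'])

# Option 2: fill with a placeholder (every user_id is kept)
filled = toy.assign(first_name=toy['first_name'].fillna('Unknown'))

print(len(dropped), len(filled))
```

With the drop option, any orders placed by the removed users would end up without demographic attributes after a left join.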

In [13]:
# Convert date_joined to datetime format
df_cust['date_joined'] = pd.to_datetime(df_cust['date_joined'], errors='coerce')

In [14]:
# Verify data types
df_cust.dtypes

user_id                   int64
first_name               object
surname                  object
gender                   object
state                    object
age                       int64
date_joined      datetime64[ns]
n_dependents              int64
family_status            object
income                    int64
dtype: object
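
Because `errors='coerce'` turns unparseable strings into `NaT` without raising, it can be worth counting how many values failed to parse. The sketch below shows one way to do that on a made-up series. The explicit `format='%m/%d/%Y'` is an assumption about the date layout (the preview rows show `1/1/2017`, which does not settle month/day order), and `raw`, `parsed`, and `n_failed` are illustrative names.

```python
import pandas as pd

# Toy date strings, including one deliberately invalid entry
raw = pd.Series(['1/1/2017', '12/31/2019', 'not a date'])

# Assumed layout: month/day/year; invalid strings become NaT
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

# Values that were present before parsing but are NaT afterwards
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(n_failed)
```

A result of zero on the real column would confirm that the conversion lost no dates.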

In [15]:
# Age distribution check
df_cust['age'].describe()

count    206209.000000
mean         49.501646
std          18.480962
min          18.000000
25%          33.000000
50%          49.000000
75%          66.000000
max          81.000000
Name: age, dtype: float64

In [16]:
# Income distribution check
df_cust['income'].describe()

count    206209.000000
mean      94632.852548
std       42473.786988
min       25903.000000
25%       59874.000000
50%       93547.000000
75%      124244.000000
max      593901.000000
Name: income, dtype: float64

In [17]:
# Dependents distribution check
df_cust['n_dependents'].value_counts().sort_index()

n_dependents
0    51602
1    51531
2    51482
3    51594
Name: count, dtype: int64

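Numeric customer attributes were reviewed for logical consistency. Age ranges from 18 to 81, income from 25,903 to 593,901, and number of dependents from 0 to 3, so none of the values is impossible and no adjustment was made. The maximum income sits well above the 75th percentile (124,244), which is worth keeping in mind when income is analyzed.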
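
A range review like this can also be written as an explicit check. The following is a minimal sketch on a toy frame built from the minimum, median, and maximum values reported above. The plausibility bounds in `bounds` are assumptions chosen for illustration and do not come from the exercise.

```python
import pandas as pd

# Toy frame using the min, median, and max reported by describe()
toy = pd.DataFrame({
    'age': [18, 49, 81],
    'income': [25903, 93547, 593901],
    'n_dependents': [0, 2, 3],
})

# Hypothetical plausibility bounds (inclusive); adjust to the project's needs
bounds = {
    'age': (18, 120),
    'income': (0, 1_000_000),
    'n_dependents': (0, 10),
}

# Count the rows that fall outside the bounds for each column
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```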

In [18]:
# Load prepared Instacart transactional dataset
ords_prods_merge = pd.read_pickle(
    os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge.pkl'))

In [19]:
# Check dimensions of the transactional dataset
ords_prods_merge.shape

(32435059, 26)

In [20]:
# Review column names of the transactional dataset
ords_prods_merge.columns

Index(['order_id', 'user_id', 'eval_set', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_previous_order', 'first_order_flag',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices', 'price_range_loc', 'busiest_day',
       'busiest_days_2', 'busiest_period_of_day', 'max_order', 'loyalty_flag',
       'mean_price_user', 'spending_flag', 'median_days_since_order',
       'order_frequency_flag', 'median_days_since_prior_order'],
      dtype='object')

In [21]:
# Ensure user_id is integer in both datasets
df_cust['user_id'] = df_cust['user_id'].astype(int)
ords_prods_merge['user_id'] = ords_prods_merge['user_id'].astype(int)

In [22]:
# Merge customer data with Instacart dataset
ords_prods_cust = ords_prods_merge.merge(df_cust, on='user_id', how='left')

In [23]:
# Check merged dataset
ords_prods_cust.shape

(32435059, 35)

In [24]:
# Check for missing customer attributes after merge
ords_prods_cust[['age', 'income', 'family_status']].isnull().sum()

age              0
income           0
family_status    0
dtype: int64

Customer demographic data was merged with the Instacart transactional dataset on user_id using a left join, which preserves all order records. The merged dataset keeps all 32,435,059 rows and gains the nine customer columns (26 + 9 = 35). The post-merge check found no missing values in age, income, or family_status, so every order record matched a customer profile.
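
One way to make the match check explicit is to use the `indicator` and `validate` parameters of `DataFrame.merge`. The sketch below uses two made-up frames, `orders` and `customers`, in which one order belongs to a user with no profile. It is an illustration under those assumptions, not the merge that was run above.

```python
import pandas as pd

# Toy stand-ins: user 3 has an order but no customer profile
orders = pd.DataFrame({'order_id': [10, 11, 12], 'user_id': [1, 1, 3]})
customers = pd.DataFrame({'user_id': [1, 2], 'age': [48, 36]})

# validate raises MergeError if user_id is not unique in customers;
# indicator adds a _merge column recording where each row came from
merged = orders.merge(
    customers,
    on='user_id',
    how='left',
    validate='many_to_one',
    indicator=True,
)

# 'both' = matched a profile, 'left_only' = order without a profile
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

On a fully matched dataset, every row would be labelled `both`, and the `_merge` column could then be dropped before export.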

The merged customer and transactional dataset was exported as a pickle file for use in Part 2 of Exercise 4.9, where exploratory analysis and visualizations will be conducted.

In [25]:
# Export merged dataset for Part 2 visualization analysis
ords_prods_cust.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_cust.pkl'))
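
As a hedged sketch of how an export can be verified, the example below writes a small made-up frame to a temporary pickle file, reads it back, and compares the two. The temporary directory and the names `toy`, `out_path`, and `restored` are assumptions for illustration; the real export path is the one used in the cell above.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime column, mirroring date_joined
toy = pd.DataFrame({
    'user_id': [1, 2],
    'date_joined': pd.to_datetime(['2017-01-01', '2018-06-15']),
})

# Write to a throwaway location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# equals() compares values and dtypes column by column
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Unlike CSV, pickle keeps column dtypes, so the datetime column does not need to be parsed again when the file is loaded.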