## <a id='toc1_1_'></a>[Data Visualization (1 of 1)](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Data Visualization (1 of 1)](#toc1_1_)    
  - [I. Data Wrangling on the Customers Dataset](#toc1_2_)    
    - [I.1. Data overview](#toc1_2_1_)    
    - [I.2. Rename columns](#toc1_2_2_)    
    - [I.3. Change data types of variables](#toc1_2_3_)    
  - [II. Data Consistency Checks](#toc1_3_)    
    - [II.1. Check for mixed-types](#toc1_3_1_)    
    - [II.2. Check for and address missing values and duplicates](#toc1_3_2_)    
  - [III. Data Combining with orders_products_merged_3.pkl](#toc1_4_)    
    - [III.1. Check consistency of key columns](#toc1_4_1_)    
    - [III.2. Merge datasets](#toc1_4_2_)    
  - [IV. Data Export](#toc1_5_)    


In [1]:
# import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# create a path to the directory
path = r'C:\Users\Ansgar.S\Uyen\OneDrive\Documents\Data Immersion\Achievement IV - Python Fundamentals for Data Analysts\02-2023 Instacart Basket Analysis'

# import the 'customers.csv' dataset
df_customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

# import the 'orders_products_merged_3.pkl' dataset
df_ords_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_3.pkl'))

## <a id='toc1_2_'></a>[I. Data Wrangling on the Customers Dataset](#toc0_)

### <a id='toc1_2_1_'></a>[I.1. Data overview](#toc0_)

In [3]:
# check number of rows and columns in df_customers
print('Number of rows and columns in df_customers:')
df_customers.shape

Number of rows and columns in df_customers:


(206209, 10)

In [4]:
# check the output of df_customers
print('Sample output of df_customers:')
df_customers.sample(5)

Sample output of df_customers:


Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
129522,137990,Diane,Valenzuela,Female,Michigan,31,1/17/2019,1,married,52642
44750,87406,Thomas,Humphrey,Male,Tennessee,19,9/15/2017,0,single,67348
63547,136021,John,Holloway,Male,Arkansas,43,1/1/2018,3,married,70158
125952,35924,Willie,Vasquez,Male,Maine,60,12/27/2018,3,married,46174
192479,70495,,Reese,Male,Idaho,36,1/14/2020,0,single,63249


In [5]:
# descriptive statistics on df_customers
print('Descriptive statistics on df_customers:')
df_customers.describe()

Descriptive statistics on df_customers:


Unnamed: 0,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


Observations:

There are 206,209 records in the dataset.

Customers range in age from 18 to 81, with a mean age of 49.5 years.

The number of dependants per customer ranges from 0 to 3.

Customer income ranges from about 25,900 to 593,900 USD, with a mean of about 94,600 USD. The figures appear to be annual income.

### <a id='toc1_2_2_'></a>[I.2. Rename columns](#toc0_)

In [6]:
# check column names in df_customers
print('Column names in df_customers:')
df_customers.columns

Column names in df_customers:


Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

In [7]:
# change column names in df_customers to be more intuitive
df_customers.rename(
    columns = {
        'First Name' : 'first_name',
        'Surnam' : 'last_name',
        'Gender' : 'gender',
        'STATE' : 'state',
        'Age' : 'age',
        'n_dependants' : 'number_of_dependants',
        'fam_status' : 'family_status'
        }, inplace = True)

In [8]:
# check the new column names in df_customers
print('New column names in df_customers:')
df_customers.columns

New column names in df_customers:


Index(['user_id', 'first_name', 'last_name', 'gender', 'state', 'age',
       'date_joined', 'number_of_dependants', 'family_status', 'income'],
      dtype='object')

### <a id='toc1_2_3_'></a>[I.3. Change data types of variables](#toc0_)

In [9]:
# check the data types of each column in df_customers
print('Data types of columns in df_customers:')
df_customers.dtypes

Data types of columns in df_customers:


user_id                  int64
first_name              object
last_name               object
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependants     int64
family_status           object
income                   int64
dtype: object

In [10]:
# change the data type of column 'user_id' to string
df_customers['user_id'] = df_customers['user_id'].astype('str')

In [11]:
# check column 'user_id'
print('Data type of column user_id:')
df_customers['user_id'].dtype

Data type of column user_id:


dtype('O')
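One practical effect of storing an identifier as a string is that summary statistics no longer treat it as a quantity (compare the `user_id` column in the `describe()` output of I.1). The following is a minimal sketch on a made-up three-row frame; the name `toy` and its values are illustrative assumptions, not Instacart data.

```python
import pandas as pd

# hypothetical toy frame standing in for df_customers
toy = pd.DataFrame({'user_id': [26711, 33890, 65803], 'age': [48, 36, 35]})

# while user_id is numeric, describe() summarises it like any other number
numeric_before = toy.describe().columns.tolist()

# cast the identifier to string, as done for df_customers above
toy['user_id'] = toy['user_id'].astype('str')

# describe() now covers only the genuinely numeric column
numeric_after = toy.describe().columns.tolist()

print(numeric_before)
print(numeric_after)
```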

## <a id='toc1_3_'></a>[II. Data Consistency Checks](#toc0_)

### <a id='toc1_3_1_'></a>[II.1. Check for mixed-types](#toc0_)

In [12]:
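# check for mixed-type columns in df_customers
# (DataFrame.applymap is deprecated in newer pandas releases in favour of DataFrame.map)
print('Columns in df_customers:')
for col in df_customers.columns.tolist():
    # flag rows whose value type differs from the type of the first value in the column
    weird = (
        df_customers[[col]].applymap(type) != df_customers[[col]].iloc[0].apply(type)
        ).any(axis = 1)
    if len(df_customers[weird]) > 0:
        print(col, ': mixed types')
    else:
        print(col, ': consistent')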

Columns in df_customers:
user_id : consistent
first_name : mixed types
last_name : consistent
gender : consistent
state : consistent
age : consistent
date_joined : consistent
number_of_dependants : consistent
family_status : consistent
income : consistent


In [13]:
# check the current types of values in column 'first_name' in df_customers
print('Current types of values in column first_name:')
df_customers['first_name'].apply(type).value_counts()

Current types of values in column first_name:


<class 'str'>      194950
<class 'float'>     11259
Name: first_name, dtype: int64

In [14]:
# check a few values in column 'first_name'
print('Sampled values in column first_name:')
df_customers['first_name'].sample(10)

Sampled values in column first_name:


103799         Todd
160259      Anthony
2897      Stephanie
19976      Clarence
184301       Victor
172373       Louise
54784        Andrea
75266        Thomas
33919          Ruth
47487       Shirley
Name: first_name, dtype: object
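The `float` entries counted above are most plausibly missing values: pandas represents a missing entry in an object column as `NaN`, which is a Python float. A small sketch, using a hypothetical three-element Series (`toy_names` is an illustrative name), shows that the float count and the missing-value count coincide in that case.

```python
import numpy as np
import pandas as pd

# hypothetical object column with one missing entry
toy_names = pd.Series(['Diane', np.nan, 'Thomas'], dtype='object', name='first_name')

# count entries whose Python type is float
n_float = toy_names.map(lambda value: isinstance(value, float)).sum()

# count entries that pandas treats as missing
n_missing = toy_names.isna().sum()

print(n_float, n_missing)
```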

### <a id='toc1_3_2_'></a>[II.2. Check for and address missing values and duplicates](#toc0_)

In [15]:
# check for null values in column 'first_name' in df_customers
print('Null values in column first_name in df_customers:')
df_customers[df_customers['first_name'].isna()]

Null values in column first_name in df_customers:


Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,number_of_dependants,family_status,income
53,76659,,Gilbert,Male,Colorado,26,1/1/2017,2,married,41709
73,13738,,Frost,Female,Louisiana,39,1/1/2017,0,single,82518
82,89996,,Dawson,Female,Oregon,52,1/1/2017,3,married,117099
99,96166,,Oconnor,Male,Oklahoma,51,1/1/2017,1,married,155673
105,29778,,Dawson,Female,Utah,63,1/1/2017,3,married,151819
...,...,...,...,...,...,...,...,...,...,...
206038,121317,,Melton,Male,Pennsylvania,28,3/31/2020,3,married,87783
206044,200799,,Copeland,Female,Hawaii,52,4/1/2020,2,married,108488
206090,167394,,Frost,Female,Hawaii,61,4/1/2020,1,married,45275
206162,187532,,Floyd,Female,California,39,4/1/2020,0,single,56325


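Observation: 11,259 of the 206,209 records (about 5.5%) have no first name, likely because it was never recorded. These missing values are stored as `NaN`, a float, which is why `first_name` was flagged as mixed-type.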

In [16]:
# drop rows with a missing first name and store the result in a new dataframe
df_customers_clean = df_customers[df_customers['first_name'].notna()].copy()

print('Number of rows and columns in df_customers_clean:')
df_customers_clean.shape

Number of rows and columns in df_customers_clean:


(194950, 10)

In [17]:
# check the types of values in column 'first_name' in df_customers_clean
print('Types of values in column first_name in df_customers_clean:')
df_customers_clean['first_name'].apply(type).value_counts()

Types of values in column first_name in df_customers_clean:


<class 'str'>    194950
Name: first_name, dtype: int64
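The row filter above can be expressed in more than one way. The sketch below, on a hypothetical four-row frame (`toy` and its values are assumptions for illustration), compares a boolean mask with `dropna(subset=...)` and reports the share of rows removed.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with two missing first names
toy = pd.DataFrame({
    'user_id': ['1', '2', '3', '4'],
    'first_name': pd.Series(['Diane', np.nan, 'Thomas', np.nan], dtype='object'),
})

# boolean-mask version, equivalent to the filter used above
kept_mask = toy[toy['first_name'].notna()].copy()

# dropna restricted to one column keeps the same rows
kept_dropna = toy.dropna(subset=['first_name'])

# fraction of rows lost by the cleaning step
share_removed = 1 - len(kept_mask) / len(toy)

print(kept_mask.equals(kept_dropna), share_removed)
```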

In [18]:
# check for full duplicates in df_customers_clean
df_dupes = df_customers_clean[df_customers_clean.duplicated()]

print('Duplicates in df_customers_clean:')
df_dupes

Duplicates in df_customers_clean:


Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,number_of_dependants,family_status,income


Observation: df_customers_clean contains no fully duplicated rows.
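`duplicated()` without arguments compares whole rows. Because the merge in the next section relies on `user_id`, it may also be worth confirming that the key itself is unique. The sketch below uses a hypothetical toy frame in which two rows share a key but differ elsewhere; it does not imply that the real data has this problem.

```python
import pandas as pd

# hypothetical toy frame: user_id '2' appears twice with different states
toy = pd.DataFrame({
    'user_id': ['1', '2', '2'],
    'state': ['Ohio', 'Maine', 'Idaho'],
})

# full-row duplicates: none, because the rows differ in 'state'
full_dupes = int(toy.duplicated().sum())

# duplicates on the key column only: one repeated user_id
key_dupes = int(toy.duplicated(subset=['user_id']).sum())

# shortcut for the same question
key_is_unique = toy['user_id'].is_unique

print(full_dupes, key_dupes, key_is_unique)
```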

## <a id='toc1_4_'></a>[III. Data Combining with orders_products_merged_3.pkl](#toc0_)

### <a id='toc1_4_1_'></a>[III.1. Check consistency of key columns](#toc0_)

In [19]:
# check the dimensions of both dataframes
print('Number of rows and columns of df_customers_clean:', df_customers_clean.shape)
print('Number of rows and columns of df_ords_merged:', df_ords_merged.shape)

Number of rows and columns of df_customers_clean: (194950, 10)
Number of rows and columns of df_ords_merged: (32404859, 24)


In [20]:
# check the output of df_ords_merged
print('Sample output of df_ords_merged:')
df_ords_merged.sample(5)

Sample output of df_ords_merged:


Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,...,price_range,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,order_freq,order_freq_flag
7417526,12945,Signature Recipes Vodka Sauce Pasta Sauce,9,9,8.8,3051466,202786,4,1,19,...,Mid-range product,Regularly busy,Busiest days,Average orders,4,New customer,8.010448,Low spender,22.0,Non-frequent customer
23580558,36557,Campari Tomato,83,4,2.4,10858,63700,15,1,8,...,Low-range product,Regularly busy,Busiest days,Average orders,19,Regular customer,7.677114,Low spender,9.0,Frequent customer
28038822,43768,Organic Bell Pepper,83,4,9.1,2905929,63942,22,5,15,...,Mid-range product,Regularly busy,Regular days,Most orders,24,Regular customer,8.32769,Low spender,12.0,Regular customer
29776096,46306,100% Pure Avocado Oil,19,13,12.1,77560,180729,14,6,12,...,Mid-range product,Regularly busy,Regular days,Most orders,26,Regular customer,7.973252,Low spender,9.0,Frequent customer
13165357,21616,Organic Baby Arugula,123,4,4.9,1235169,60301,13,2,16,...,Low-range product,Regularly busy,Regular days,Most orders,20,Regular customer,8.400395,Low spender,9.0,Frequent customer


In [21]:
# check the data types of key columns 'user_id' in both df_customers_clean and df_ords_merged
print('Data types of key columns user_id in df_customers_clean:', df_customers_clean['user_id'].dtype)
print('Data types of key columns user_id in df_ords_merged:', df_ords_merged['user_id'].dtype)

Data types of key columns user_id in df_customers_clean: object
Data types of key columns user_id in df_ords_merged: object
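Both keys are already of type object here, so no conversion is needed. If they had differed, one option would be to cast before merging, as in this sketch. The frames `orders` and `customers` and their values are made up for illustration.

```python
import pandas as pd

# hypothetical toy frames: the key is numeric on one side and string on the other
orders = pd.DataFrame({'user_id': [1, 1, 2], 'order_id': [10, 11, 12]})
customers = pd.DataFrame({'user_id': ['1', '2'], 'state': ['Ohio', 'Maine']})

# compare key dtypes before any conversion
keys_match_before = bool(orders['user_id'].dtype == customers['user_id'].dtype)

# align the key type with the customers frame
orders['user_id'] = orders['user_id'].astype('str')
keys_match_after = bool(orders['user_id'].dtype == customers['user_id'].dtype)

# with matching key types the merge pairs each order with its customer
merged = orders.merge(customers, on='user_id', how='inner')

print(keys_match_before, keys_match_after, len(merged))
```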


### <a id='toc1_4_2_'></a>[III.2. Merge datasets](#toc0_)

In [22]:
# Merge df_customers_clean and df_ords_merged using 'user_id' as a key and an indicator flag
df_merged_large = df_ords_merged.merge(df_customers_clean, on = 'user_id', how = 'inner', indicator = True)

In [23]:
# check number of rows and columns in the merged dataframe
print('Number of rows and columns of df_merged_large:')
df_merged_large.shape

Number of rows and columns of df_merged_large:


(30629741, 34)
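The merged frame has 30,629,741 rows, against 32,404,859 in df_ords_merged, so 1,775,118 order rows found no match in df_customers_clean. These presumably belong to the customers dropped for missing first names, although the notebook does not verify this. With `how = 'inner'` the `_merge` column can only contain `both`; a left merge makes unmatched rows visible. The sketch below shows the idea on hypothetical toy frames.

```python
import pandas as pd

# hypothetical toy frames: user_id '3' has orders but no customer record
orders = pd.DataFrame({'user_id': ['1', '1', '2', '3'], 'order_id': [10, 11, 12, 13]})
customers = pd.DataFrame({'user_id': ['1', '2'], 'state': ['Ohio', 'Maine']})

# a left merge keeps every order row and flags where a customer was found
checked = orders.merge(customers, on='user_id', how='left', indicator=True)
merge_counts = checked['_merge'].value_counts()
n_unmatched = int((checked['_merge'] == 'left_only').sum())

# the inner merge silently drops exactly those unmatched rows
inner = orders.merge(customers, on='user_id', how='inner')
rows_lost = len(orders) - len(inner)

print(merge_counts)
print(n_unmatched, rows_lost)
```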

In [24]:
# check the output of df_merged_large
print('Sampled output of df_merged_large:')
df_merged_large.sample(5)

Sampled output of df_merged_large:


Unnamed: 0,product_id,product_name,aisle_id,department_id,prices,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,...,first_name,last_name,gender,state,age,date_joined,number_of_dependants,family_status,income,_merge
22864074,41665,Organic Mexican Blend Finely Shredded Cheese,21,16,4.7,530857,18421,6,1,22,...,Cheryl,Curtis,Female,Florida,37,6/8/2019,0,single,41927,both
27551682,5258,Sparkling Water,115,7,1.1,2539102,108214,22,1,7,...,Lillian,Schaefer,Female,Tennessee,29,12/17/2017,0,single,45366,both
5718930,27966,Organic Raspberries,123,4,4.4,2173478,109017,16,5,11,...,Louis,Schmitt,Male,Ohio,55,3/3/2019,1,married,104654,both
22806763,13500,Mild Sliced Cheddar Cheese,21,16,2.9,2971138,143016,9,1,16,...,Denise,Merritt,Female,Hawaii,67,12/4/2019,0,divorced/widowed,106038,both
981842,9106,Chocolate Pudding,71,16,14.6,2262591,78611,1,3,20,...,Eric,Jefferson,Male,Maine,19,10/25/2019,3,living with parents and siblings,55532,both


## <a id='toc1_5_'></a>[IV. Data Export](#toc0_)

In [None]:
# export df_merged_large in .pkl format
df_merged_large.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_merged_4.pkl'))
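A quick way to confirm that an export is readable is to load it back and compare. The sketch below does this for a hypothetical toy frame in a temporary directory. Dropping `_merge` before export is an optional assumption about what later notebooks need, not a step taken above.

```python
import os
import tempfile

import pandas as pd

# hypothetical toy frame standing in for df_merged_large
toy = pd.DataFrame({
    'user_id': ['1', '2'],
    'income': [52642, 67348],
    '_merge': ['both', 'both'],
})

# optional: remove the indicator column if it is not needed downstream
to_export = toy.drop(columns=['_merge'])

# write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_export.pkl')
    to_export.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# the restored frame should match what was written
round_trip_ok = restored.equals(to_export)
print(round_trip_ok)
```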