### 4.5 Data Consistency Checks

### Contents
01 Import Libraries and Data Files

02 Mixed Type Data

03 Missing Values

04 Duplicates

05 Task 4.5 Exercises

06 Export Data orders_checked.csv


### 01 Import Libraries and Data Files

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Create path variable for main project folder
path = r'C:\Users\julie_y9025s2\06-2022 Instacart Basket Analysis'

In [3]:
# Use path variable to import orders_wrangled.csv as df_ords
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

In [4]:
# Use path variable to import products.csv as df_prods
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

### 02 Mixed Type Data

In [5]:
# Create a test dataframe
df_test = pd.DataFrame()

In [6]:
# Create a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [7]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [8]:
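# Check for mixed type columns: flag rows whose value type differs from the type in the first row
# (applymap is deprecated from pandas 2.1, where DataFrame.map replaces it)
for col in df_test.columns.tolist():
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_test[weird]) > 0:
        print(col)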

mix


In [9]:
# Convert mixed type column to string
df_test['mix'] = df_test['mix'].astype('str')
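
To confirm that the conversion worked, the check can be rerun after the cast. The sketch below is one possible way to do that on a toy dataframe; `df_demo` and the `num` column are hypothetical names used only for illustration, and counting distinct types per column is an alternative to the loop above, not the method used in this notebook.

```python
import pandas as pd

# Toy dataframe mirroring df_test (df_demo and 'num' are illustrative names)
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'num': [1, 2, 3, 4]})

# Count the distinct Python types found in each column
type_counts = df_demo.apply(lambda col: col.map(type).nunique())
mixed_before = type_counts[type_counts > 1].index.tolist()

# Convert the mixed column to string, then count again
df_demo['mix'] = df_demo['mix'].astype('str')
type_counts_after = df_demo.apply(lambda col: col.map(type).nunique())
mixed_after = type_counts_after[type_counts_after > 1].index.tolist()

print(mixed_before)
print(mixed_after)
```

A column with more than one distinct type is reported as mixed; after the cast, no column should be reported.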

### 03 Missing Values

In [10]:
# Find missing values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [11]:
# Create subset for missing values
df_nan = df_prods[df_prods['product_name'].isnull()]

In [12]:
# View missing values dataframe 
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [13]:
# Get row count of original dataframe
df_prods.shape

(49693, 5)

In [14]:
# Create new dataframe, dropping rows with a null product_name
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [15]:
# Verify row count = original row count - null row count
df_prods_clean.shape

(49677, 5)
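
The row counts line up: 49693 original rows minus 16 null product names gives 49677. The same check can be written as an assertion instead of being read off the output by eye. This is a minimal sketch on toy data (`df_demo` is a hypothetical stand-in for df_prods), and it uses `dropna(subset=...)`, which should be equivalent to the boolean filter above.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_prods (illustrative values only)
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Apples', np.nan, 'Bread', np.nan],
})

# Count nulls in the column of interest
n_missing = df_demo['product_name'].isnull().sum()

# Drop only the rows where product_name is null
df_demo_clean = df_demo.dropna(subset=['product_name'])

# Cleaned row count should equal original row count minus null count
assert df_demo_clean.shape[0] == df_demo.shape[0] - n_missing
print(df_demo.shape, n_missing, df_demo_clean.shape)
```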

### 04 Duplicates

In [16]:
# Find full row duplicates and create subset of them
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [17]:
# View duplicates
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [18]:
# Drop duplicates and create new dataframe
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [19]:
# Check row count after dropping duplicates
df_prods_clean_no_dups.shape

(49672, 5)
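
Again the counts agree: 49677 rows minus 5 duplicates gives 49672. A small sketch of the same check on toy data follows (`df_demo` is a hypothetical name). Note that `duplicated()` with its default `keep='first'` flags only the second and later occurrences, so the first copy of each row is kept.

```python
import pandas as pd

# Toy dataframe with one fully duplicated row (illustrative values only)
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Apples', 'Bread', 'Bread', 'Cola'],
})

# Count full-row duplicates (first occurrence is not counted)
n_dups = df_demo.duplicated().sum()

# Drop the duplicates, keeping the first occurrence of each row
df_demo_no_dups = df_demo.drop_duplicates()

# Row count after dropping should equal original count minus duplicate count
assert len(df_demo_no_dups) == len(df_demo) - n_dups
print(len(df_demo), n_dups, len(df_demo_no_dups))
```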

In [20]:
# Export clean file
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'))

### 05 Task 4.5 Exercises

1. If you haven’t performed the consistency checks covered in this Exercise on your df_prods dataframe, do so now.

Please see the work above.

2. Run the df.describe() function on your df_prods dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.
Tip: Keep an eye on min and max values!

In [21]:
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


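While product, aisle, and department IDs can reasonably span a wide range, I'm suspicious of the max price of 99999 for a grocery item and want to investigate this further. The standard deviation of prices (453.52) is also very large next to the mean (9.99), which is what a single extreme value like this would likely produce.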
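
One possible way to follow up on the suspicious price is to pull out the rows above a plausibility threshold and, if they turn out to be errors, mark them as missing. This is a sketch on toy data: `df_demo`, `price_cap`, and the cutoff of 100 are all assumptions for illustration, and whether such values should be set to missing depends on what the investigation finds.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_prods (illustrative values only)
df_demo = pd.DataFrame({
    'product_name': ['Apples', 'Bread', 'Mystery item'],
    'prices': [4.1, 7.1, 99999.0],
})

# Assumed plausibility threshold for a grocery price
price_cap = 100

# Inspect the rows above the threshold before changing anything
df_outliers = df_demo[df_demo['prices'] > price_cap]
print(df_outliers)

# If the values are judged to be errors, mark them as missing
df_demo.loc[df_demo['prices'] > price_cap, 'prices'] = np.nan
print(df_demo['prices'].max())
```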

3. Check for mixed-type data in your df_ords dataframe.

In [22]:
# Check for mixed type columns in df_ords: flag rows whose value type differs from the type in the first row
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)

No columns were returned as having mixed-type data.

4. If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

No mixed-type data was found.

5. Run a check for missing values in your df_ords dataframe. In a markdown cell, report your findings and propose an explanation for any missing values you find.

In [23]:
# Find missing values in df_ords
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

The days_since_prior_order column contains 206209 null values. This is expected for a customer's first order, since there is no prior order to measure from.

6. Address the missing values using an appropriate method.
In a markdown cell, explain why you used your method of choice.

In [24]:
# Create subset for missing values to verify that null values are for new customers
df_nan_ords = df_ords[df_ords['days_since_prior_order'].isnull()]

In [25]:
# Check tail of subset 
df_nan_ords.tail(10)

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
3420839,3420839,2508134,206200,1,5,19,
3420863,3420863,28210,206201,1,6,14,
3420896,3420896,799173,206202,1,2,12,
3420919,3420919,1991850,206203,1,1,13,
3420925,3420925,1438269,206204,1,1,11,
3420930,3420930,969311,206205,1,4,12,
3420934,3420934,3189322,206206,1,3,18,
3421002,3421002,2166133,206207,1,6,19,
3421019,3421019,2227043,206208,1,1,15,
3421069,3421069,3154581,206209,1,3,11,


In the 10 rows shown, every null value belongs to order_number 1, which is consistent with the nulls marking first orders.
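
Since tail(10) shows only the last ten rows, a check over the whole dataframe is more convincing. The sketch below shows one way to do it on toy data (`df_demo` is a hypothetical stand-in for df_ords): it tests both that every null sits on a first order and that every first order has a null.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_ords (illustrative values only)
df_demo = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'order_number': [1, 2, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 30.0],
})

is_null = df_demo['days_since_prior_order'].isnull()
is_first = df_demo['order_number'] == 1

# Every null value belongs to a first order
nulls_are_first_orders = df_demo.loc[is_null, 'order_number'].eq(1).all()

# Every first order has a null value
first_orders_are_null = is_null[is_first].all()

print(nulls_are_first_orders, first_orders_are_null)
```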

In [27]:
# Create a new dataframe to address the missing values (copy so that df_ords is left unchanged).
df_ords_newcust = df_ords.copy()

In [28]:
# Add a new column to the new dataframe to indicate whether customer is new.
df_ords_newcust['new_customer'] = df_ords['days_since_prior_order'].isnull()

In [29]:
# Check new_customer column
df_ords_newcust

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer
0,0,2539329,1,1,2,8,,True
1,1,2398795,1,2,3,7,15.0,False
2,2,473747,1,3,3,12,21.0,False
3,3,2254736,1,4,4,7,29.0,False
4,4,431534,1,5,4,15,28.0,False
...,...,...,...,...,...,...,...,...
3421078,3421078,2266710,206209,10,5,18,29.0,False
3421079,3421079,1854736,206209,11,4,10,30.0,False
3421080,3421080,626363,206209,12,1,12,18.0,False
3421081,3421081,2977660,206209,13,1,12,7.0,False


I chose to create a new flag column based on the missing values. These nulls carry meaning (they mark a customer's first order), so the rows should not be deleted, and filling in a number would wrongly suggest that a prior order exists.
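
One caveat when building the flag in a separate dataframe: in pandas, plain assignment binds a second name to the same object, so a column added through one name also appears under the other, whereas `.copy()` gives an independent dataframe. The sketch below illustrates the difference on toy data; `df_demo`, `df_alias`, and `df_copy` are hypothetical names.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_ords (illustrative values only)
df_demo = pd.DataFrame({'days_since_prior_order': [np.nan, 15.0]})

df_alias = df_demo        # second name for the same dataframe
df_copy = df_demo.copy()  # independent dataframe

# Adding the flag to the copy leaves the original untouched
df_copy['new_customer'] = df_copy['days_since_prior_order'].isnull()
copy_leaked = 'new_customer' in df_demo.columns

# Adding the flag through the alias also changes the original
df_alias['new_customer'] = df_alias['days_since_prior_order'].isnull()
alias_leaked = 'new_customer' in df_demo.columns

print(copy_leaked, alias_leaked)
```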

7. Run a check for duplicate values in your df_ords data.
In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

In [30]:
# Find full row duplicates and create subset of them
df_dups = df_ords_newcust[df_ords_newcust.duplicated()]

In [31]:
df_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer


There are no full row duplicates.

8. Address the duplicates using an appropriate method. In a markdown cell, explain why you used your method of choice.

There are no full row duplicates to remove.

### 06 Export Data orders_checked.csv

In [34]:
# Export checked orders dataframe
df_ords_newcust.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv'))
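
A note on the "Unnamed: 0" column seen in df_ords: by default `to_csv` writes the index as an unlabeled first column, and `read_csv` names that column "Unnamed: 0" when the file is read back. Passing `index=False` on export avoids this. The sketch below shows the round trip with an in-memory buffer instead of a file; `df_demo` and the buffer names are hypothetical, and whether to drop the index here is a design choice, not something this notebook does.

```python
import io
import pandas as pd

# Toy stand-in for the orders dataframe (illustrative values only)
df_demo = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# Default export writes the index as an unlabeled first column
buf_with_index = io.StringIO()
df_demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# Export without the index keeps only the real columns
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```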