# Data Consistency Checks

## This Script Contains the Following Points:
### 1. Importing Libraries
### 2. Descriptive Statistics for Numeric Values
### 3. Create a Test Dataframe to work with Mixed Type Data
### 4. Check for Mixed Type Data
### 5. Finding Missing Values 
### 6. Addressing Missing Values
### 7. Duplicates
### 8. Addressing Duplicates
### 9. Exporting Clean Dataframe

# Importing Libraries

In [250]:
import pandas as pd
import numpy as np
import os

In [251]:
#importing data
path = r'/Users/kimkmiz/Documents/Instacart Basket Analysis 2024'


In [252]:
#import products.csv from original data
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'IC24 Original Data', 'products.csv'), index_col = False)

In [253]:
#import orders_wrangled.csv data set from “Prepared Data” folder as df_ords
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'IC24 Prepared Data', 'orders_wrangled.csv'), index_col = False)

# Descriptive Statistics for Numeric Values of df_ords

In [255]:
df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


In [256]:
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


**Observations**
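- The maximum price (99,999.0) is far above both the mean (about 9.99) and the 75th percentile (11.2), and the standard deviation (about 453.5) is large relative to the mean. This points to outlier prices that need to be investigated.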
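
One way to turn this observation into a concrete check is an interquartile-range (IQR) fence. The sketch below is only an illustration: the prices are made-up toy values rather than rows from products.csv, and the 1.5 × IQR rule is a common convention, not something this notebook prescribes.

```python
import pandas as pd

# Toy prices (hypothetical values, not taken from products.csv)
prices = pd.Series(
    [1.0, 2.5, 4.1, 5.3, 7.1, 9.8, 11.2, 14.3, 25.0, 14900.0, 99999.0],
    name="prices",
)

# Quartiles and the conventional 1.5 * IQR upper fence
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are candidates for manual review
suspects = prices[prices > upper_fence]
print(suspects)
```

Values flagged this way are only candidates; whether a price is really an error still has to be decided by looking at the product.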

# Create a Test Dataframe to work with Mixed-Type Columns

In [259]:
#create a dataset
df_test = pd.DataFrame()

In [260]:
#create a mixed-type column
df_test['mix'] = ['a', 'b', 1, True]

In [261]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


## Check for Mixed-Type Columns

In [263]:
#check for mixed-type columns

for col in df_test.columns.tolist():
    weird = (df_test[[col]].map(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)

mix


In [264]:
mixed_type_columns = df_test.select_dtypes(include=['object']).columns
mixed_type_columns_info = df_test[mixed_type_columns].map(type).nunique()

mixed_columns = mixed_type_columns_info[mixed_type_columns_info > 1].index
print(mixed_columns)

Index(['mix'], dtype='object')
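
The two checks above can be wrapped in a small reusable helper. This is a minimal sketch, not part of the original exercise: the function name mixed_type_columns and the demo dataframe are hypothetical, and missing values are ignored on the assumption that a NaN should not count as a second type.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose non-null values have more than one Python type."""
    mixed = []
    for col in df.columns:
        # Count the distinct Python types among the non-missing values
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed

# Hypothetical demo data: one mixed column, one clean column
df_demo = pd.DataFrame({"mix": ["a", "b", 1, True], "clean": ["x", "y", "z", "w"]})
result = mixed_type_columns(df_demo)
print(result)
```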


In [265]:
#convert every value in the mixed-type column to string
df_test['mix'] = df_test['mix'].astype('str')

In [266]:
df_test

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [267]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   mix     4 non-null      object
dtypes: object(1)
memory usage: 164.0+ bytes


In [268]:
df_test['mix'].dtype

dtype('O')
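
Because the column held Python objects both before and after the conversion, the dtype alone does not show that the fix worked. A quick way to confirm it, sketched here on a standalone toy series (the variable names are illustrative), is to look at the element types directly:

```python
import pandas as pd

# Same mixed values as the test column above
s = pd.Series(["a", "b", 1, True])

converted = s.astype("str")

# After the conversion every element is a str, so only one type remains
values = converted.tolist()
n_types = converted.map(type).nunique()
print(values)
print(n_types)
```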

## Finding Missing Values

In [270]:
#finding missing values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [271]:
#product_name column has 16 missing values

In [272]:
#create dataframe with missing values in order to view them
df_nan = df_prods[df_prods['product_name'].isnull()]

In [273]:
#print dataframe 
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


## Addressing Missing Values

In [275]:
df_nan.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,16.0,16.0,16.0,16.0
mean,6684.0,89.9375,10.9375,13.0125
std,12836.665242,33.731229,4.639953,3.881731
min,34.0,26.0,1.0,1.2
25%,459.25,70.75,7.75,12.175
50%,2413.0,98.5,11.5,13.65
75%,3872.75,120.0,14.5,14.425
max,40440.0,126.0,16.0,20.9


In [276]:
df_nan.median()

product_id       2413.0
product_name        NaN
aisle_id           98.5
department_id      11.5
prices            13.65
dtype: object

In [277]:
#Find shape
df_prods.shape

(49693, 5)

Create a new dataframe called df_prods_clean that keeps only the rows with a product name.

In [279]:
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

Check that the number of rows has decreased

In [281]:
df_prods_clean.shape

(49677, 5)

The row count dropped from 49,693 to 49,677, so the 16 rows with a missing product_name were removed.
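
The same filtering can also be written with dropna() restricted to one column. The sketch below uses a small made-up dataframe (not the real products data) to show that both approaches keep the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for df_prods
df_demo = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Milk", np.nan, "Bread"],
    "prices": [4.1, 12.2, 7.1],
})

# Boolean-mask approach, as used in this notebook
by_mask = df_demo[df_demo["product_name"].notnull()]

# Equivalent approach: drop rows where product_name is missing
by_dropna = df_demo.dropna(subset=["product_name"])

same_result = by_mask.equals(by_dropna)
print(same_result)
print(df_demo.shape, by_dropna.shape)
```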

In [283]:
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


## Duplicates

First, identify which rows are full duplicates, i.e. rows whose values match an earlier row in every column.

In [286]:
#finding duplicates
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [287]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9
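
By default, duplicated() marks only the second and later copies of a row, which is why each duplicated product appears once in the output above. To inspect every copy side by side, keep=False can be passed instead. A minimal sketch on toy data (the dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical data with one fully duplicated row
df_demo = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "product_name": ["A", "B", "B", "C"],
})

# Default: only later copies are flagged
later_copies = df_demo[df_demo.duplicated()]

# keep=False: every member of a duplicated group is flagged
all_copies = df_demo[df_demo.duplicated(keep=False)]

print(later_copies)
print(all_copies)
```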


## Addressing Duplicates

First, check the number of rows in df_prods_clean.

In [290]:
df_prods_clean.shape

(49677, 5)

Next, create a new dataframe that excludes the duplicates identified above, using the drop_duplicates() function.

In [292]:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [293]:
df_prods_clean_no_dups.shape

(49672, 5)
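
The shapes are consistent with the five duplicates found earlier: 49,677 − 5 = 49,672. The same sanity check can be done in code rather than by eye; the sketch below uses a hypothetical toy dataframe in place of df_prods_clean.

```python
import pandas as pd

# Hypothetical data with two duplicated rows
df_demo = pd.DataFrame({
    "product_id": [1, 2, 2, 3, 3],
    "product_name": ["A", "B", "B", "C", "C"],
})

n_before = len(df_demo)
n_dups = int(df_demo.duplicated().sum())  # rows flagged as later copies
df_no_dups = df_demo.drop_duplicates()

# The rows removed should equal the rows flagged as duplicates
rows_match = len(df_no_dups) == n_before - n_dups
print(n_before, n_dups, len(df_no_dups), rows_match)
```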

In [294]:
#check for outlier prices
df_prods_clean_no_dups[df_prods_clean_no_dups["prices"]> 100.0]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [295]:
#replace the two outlier prices with corrected values
df_prods_clean_no_dups = df_prods_clean_no_dups.replace({"prices": {99999.0: 9.99, 14900.0: 1.49}})

In [296]:
#check output
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,7.680359
std,14340.705287,38.315784,5.850779,4.199401
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,25.0
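
The nested dictionary passed to replace() limits the substitution to the prices column, so the same number appearing in another column is left alone. The sketch below illustrates this on made-up data, where a product_id deliberately equals one of the outlier values:

```python
import pandas as pd

# Hypothetical data: product_id 14900 must NOT be touched by the price fix
df_demo = pd.DataFrame({
    "product_id": [14900, 2, 3],
    "prices": [14900.0, 99999.0, 7.1],
})

# Outer key = column name, inner dict = old value -> new value
fixed = df_demo.replace({"prices": {99999.0: 9.99, 14900.0: 1.49}})
print(fixed)
```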


## Exporting Clean Dataframe

In [298]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data', 'IC24 Prepared Data', 'products_clean.csv'))
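
Note that to_csv() writes the dataframe index as an unnamed first column by default, which is the likely origin of the "Unnamed: 0" column seen in df_ords. Passing index=False avoids it. The round trip below is a sketch that uses an in-memory buffer instead of the project folders, with a hypothetical two-row dataframe:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"product_id": [1, 2], "prices": [4.1, 7.1]})

# Default export: the index is written as an extra, unnamed column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```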

# Task

**2. Run the df.describe() function on your df_ords dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further**

In [301]:
df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


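**Notes:**
- The minimum of 0 and maximum of 6 for order_dow make sense, since a week has 7 days.
- There are no negative values in the days_since_prior_order column, and its maximum is 30.
- The maximum for order_hour_of_day is 23, which fits a 24-hour clock running from 0 to 23.
- The count for days_since_prior_order (3,214,874) is lower than the count for the other columns (3,421,083), so this column has missing values that should be investigated.
- The "Unnamed: 0" column runs from 0 to 3,421,082 and looks like a leftover index from an earlier export rather than real data.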
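
These range checks can also be expressed in code, so they can be rerun whenever the data changes. This is a minimal sketch on a made-up three-row dataframe; the expected ranges are the ones reasoned about in the notes, and the checks dictionary is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_ords
df_demo = pd.DataFrame({
    "order_dow": [0, 3, 6],
    "order_hour_of_day": [0, 13, 23],
    "days_since_prior_order": [np.nan, 7.0, 30.0],
})

checks = {
    "order_dow within 0-6": df_demo["order_dow"].between(0, 6).all(),
    "order_hour_of_day within 0-23": df_demo["order_hour_of_day"].between(0, 23).all(),
    # Missing values are dropped first so they do not fail the comparison
    "days_since_prior_order not negative": (df_demo["days_since_prior_order"].dropna() >= 0).all(),
}

for name, passed in checks.items():
    print(name, passed)
```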

**3. Check for mixed-type data in your df_ords dataframe.**

In [304]:
for col in df_ords.columns.tolist():
    weird = (df_ords[col].map(type) != df_ords[col].iloc[0].__class__).any()
    if weird:
        print(col)

Unnamed: 0
order_id
user_id
order_number
order_dow
order_hour_of_day
days_since_prior_order


In [305]:
df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Unnamed: 0              int64  
 1   order_id                int64  
 2   user_id                 int64  
 3   eval_set                object 
 4   order_number            int64  
 5   order_dow               int64  
 6   order_hour_of_day       int64  
 7   days_since_prior_order  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 208.8+ MB


**4. If you find mixed-type data, fix it. The column in question should contain observations of a single data type.**

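- No mixed-type data found. The columns printed by the check in step 3 appear to be false positives: map(type) reports the Python types int and float, while .iloc[0].__class__ returns NumPy scalar types such as numpy.int64, so every numeric column is flagged regardless of its contents. df_ords.info() shows a single numeric dtype for each of those columns, and eval_set, the only object column, was not flagged.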
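
A check that compares each element's type with the type of the first element, both obtained through the same map(type) call, avoids mixing Python and NumPy type objects. The sketch below is an assumption about how such a check could look, not the exercise's reference solution, and it runs on toy series rather than df_ords.

```python
import pandas as pd

numeric = pd.Series([1, 2, 3])   # a clean integer column
mixed = pd.Series(["a", 1])      # a genuinely mixed column

def has_mixed_types(s):
    """True if the elements of s do not all share the type of the first element."""
    types = s.map(type)
    # Both sides of the comparison come from the same map(type) call
    return bool((types != types.iloc[0]).any())

numeric_is_mixed = has_mixed_types(numeric)
mixed_is_mixed = has_mixed_types(mixed)
print(numeric_is_mixed, mixed_is_mixed)
```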

**5. Run a check for missing values in your df_ords dataframe.**

In [309]:
#checking for missing values
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

**Notes:**
- The days_since_prior_order column has 206,209 missing values.
- This may be because the information was never entered, or because a user's first order has no prior order to measure from.

**6. Address the missing values using an appropriate method.**

In [312]:
#create new dataframe to show only rows with missing values
df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull()]

In [313]:
df_ords_nan

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,0,2539329,1,prior,1,2,8,
11,11,2168274,2,prior,1,2,11,
26,26,1374495,3,prior,1,1,14,
39,39,3343014,4,prior,1,6,11,
45,45,2717275,5,prior,1,3,12,
...,...,...,...,...,...,...,...,...
3420930,3420930,969311,206205,prior,1,4,12,
3420934,3420934,3189322,206206,prior,1,3,18,
3421002,3421002,2166133,206207,prior,1,6,19,
3421019,3421019,2227043,206208,prior,1,1,15,


**Notes:**
- The missing values in days_since_prior_order appear only in rows with an order_number of 1, i.e. each user's first order.
- This supports the earlier theory: a first order has no prior order, so there is nothing to record in this column.
- I will not impute these values, because no number would be an appropriate replacement.
- When analyzing days_since_prior_order, rows with an order_number of 1 need to be filtered out first.
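
Since only the first and last few rows of df_ords_nan are displayed, the first-order explanation can be confirmed for every row by comparing two boolean masks. The sketch below does this on a small made-up dataframe; on the real data the same comparison would be run against df_ords.

```python
import numpy as np
import pandas as pd

# Hypothetical orders for two users
df_demo = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "order_number": [1, 2, 1, 2, 3],
    "days_since_prior_order": [np.nan, 15.0, np.nan, 10.0, 3.0],
})

missing = df_demo["days_since_prior_order"].isnull()
first_order = df_demo["order_number"] == 1

# True only if the value is missing exactly on first orders and nowhere else
masks_agree = bool((missing == first_order).all())
print(masks_agree)

# Filter out first orders before analyzing the gap between orders
repeat_orders = df_demo[~first_order]
print(repeat_orders["days_since_prior_order"].describe())
```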

**7. Run a check for duplicate values in your df_ords data**

In [316]:
#create dataframe of duplicates
df_ords_dups = df_ords[df_ords.duplicated()]

In [317]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order


**Notes**
- There are no duplicates

**8. Address the duplicates using an appropriate method**
- There are no duplicates to address. If there were, I would remove them with the drop_duplicates() function.

**9. Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names**

In [321]:
#exporting
df_ords.to_csv(os.path.join(path, '02 Data', 'IC24 Prepared Data', 'orders_clean.csv'))