# Table of Contents

## Importing Libraries

## Importing Dataframes

## Data Consistency Checks 
### Products 
#### - Missing Values for Products 
#### - Duplicates for Products 
#### - Export changes made to Products
### Orders
#### - Missing Values for Orders
#### - Duplicate Values for Orders
#### - Export changes made to Orders

----

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

# Importing Dataframes

In [2]:
path= r'C:\Users\princess\Documents\09-2023 Instacart Basket Analysis'

In [21]:
# Import orders_wrangled.csv
df_ords = pd.read_csv(os.path.join(path, 'Data', 'Prepared Data', 'orders_wrangled.csv'), index_col=False)

In [4]:
# Import products.csv
df_prods=pd.read_csv(os.path.join(path, 'Data', 'Original Data', 'products.csv'),index_col=False)

In [5]:
# Import departments.csv
df_deps=pd.read_csv(os.path.join(path, 'Data', 'Original Data', 'departments.csv'),index_col=False)

# Data Consistency Checks

## Products

### Missing Values for Products

In [6]:
#find missing values of df_prods dataframe
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [7]:
# create subset containing only the rows with a missing product_name
df_nan = df_prods[df_prods['product_name'].isnull()]

In [8]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [9]:
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


In [10]:
df_prods.shape

(49693, 5)

In [11]:
# Clean products data: keep only rows that have a product_name
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [12]:
df_prods_clean.shape

(49677, 5)

After removing the 16 rows with a missing product_name, 49,677 rows remain (49,693 - 16).
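
The filtering step above can be sketched on a small made-up dataframe. The names `toy_prods`, `toy_nan`, and `toy_clean` and all values are assumptions for illustration only, not the Instacart data. The sketch shows that the missing and non-missing subsets together account for every original row, which is the same bookkeeping as 49,693 - 16 = 49,677.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_prods; the values are invented for illustration.
toy_prods = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": ["Apples", np.nan, "Milk", np.nan],
    "prices": [1.5, 2.0, 3.2, 4.1],
})

# Rows where product_name is missing
toy_nan = toy_prods[toy_prods["product_name"].isnull()]

# Rows where product_name is present
toy_clean = toy_prods[toy_prods["product_name"].notnull()]

# The two subsets together cover every row of the original frame
print(len(toy_prods), len(toy_nan), len(toy_clean))
```

`toy_prods.dropna(subset=["product_name"])` should give the same result as the `notnull()` filter here.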

### Duplicates for Products

In [19]:
# New dataframe with duplicates
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [14]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [15]:
df_prods_clean.shape

(49677, 5)

In [18]:
# Removing the duplicates from clean data and making new dataframe
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [17]:
df_prods_clean_no_dups.shape

(49672, 5)

After removing the 5 duplicate rows, the row count drops from 49,677 to 49,672.
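
As a minimal sketch of why the row count drops by exactly the number of rows in the duplicates subset: `duplicated()` flags only the second and later copies of a fully identical row, and `drop_duplicates()` keeps the first copy by default. The dataframe below is made up for illustration.

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are invented)
toy = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "product_name": ["Apples", "Milk", "Milk", "Bread"],
    "prices": [1.5, 3.2, 3.2, 2.0],
})

# True for the second and later copies of an identical row
dup_mask = toy.duplicated()
toy_dups = toy[dup_mask]

# Keeps the first copy of each row by default
toy_no_dups = toy.drop_duplicates()

print(len(toy), len(toy_dups), len(toy_no_dups))
```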

### Export changes made to Products

In [25]:
# Export changes 
df_prods_clean_no_dups.to_csv(os.path.join(path, 'Data','Prepared Data', 'products_checked.csv'))

## Orders

In [22]:
# looking to see if anything looks off about the df_ords dataframe

df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


The days_since_last_order column has a lower count than the other columns, which suggests missing values and needs to be investigated.
Additionally, a 50th percentile of 7 days against a maximum of 30 days doesn't seem right and may be worth a closer look.
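
Since the `count` row of `describe()` counts only non-null values, the gap between the two counts gives the number of missing entries directly. A quick arithmetic check using the figures from the output above (the variable names are just for illustration):

```python
# Figures copied from the describe() output above
total_rows = 3421083      # count for order_id
non_null_days = 3214874   # count for days_since_last_order

# count excludes NaN, so the difference is the number of missing values
missing_days = total_rows - non_null_days
print(missing_days)
```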

In [33]:
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


In [31]:
#dropping eval_set 
df_ords=df_ords.drop(columns=['eval_set'])

In [24]:
# Check for mixed data types: flag any column where some cell's type
# differs from the type of the first cell in that column

for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ords[weird]) > 0:
        print(col)

Nothing comes up when this code is run, meaning there are no mixed data types.
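
To see what the check looks like when a column does contain mixed types, here is a small sketch on made-up data. It is a variation on the loop above rather than the same code: it uses `Series.map(type)` and counts distinct types per column. The helper name `mixed_type_columns` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame: "mixed" deliberately holds an int, a str, and a float
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "mixed": [1, "two", 3.0],
})

def mixed_type_columns(df):
    """Return the names of columns whose cells do not all share one Python type."""
    mixed = []
    for col in df.columns:
        # map(type) yields the Python type of every cell in the column
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

print(mixed_type_columns(toy))
```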

### Missing Values for Orders

In [34]:
# finding missing values
df_ords.isnull().sum()

Unnamed: 0                    0
order_id                      0
user_id                       0
order_number                  0
orders_day_of_week            0
order_hour_of_day             0
days_since_last_order    206209
dtype: int64

The days_since_last_order column has a large number of missing values: 206,209. Since there are so many, it is best to isolate them in a new dataframe and inspect them.

In [35]:
# new data frame subset for missing values in days since last order
df_ords_nan = df_ords[df_ords['days_since_last_order'].isnull()]

In [36]:
df_ords_nan

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,0,2539329,1,1,2,8,
11,11,2168274,2,1,2,11,
26,26,1374495,3,1,1,14,
39,39,3343014,4,1,6,11,
45,45,2717275,5,1,3,12,
...,...,...,...,...,...,...,...
3420930,3420930,969311,206205,1,4,12,
3420934,3420934,3189322,206206,1,3,18,
3421002,3421002,2166133,206207,1,6,19,
3421019,3421019,2227043,206208,1,1,15,


The number of user_ids with missing values is equal to the number of rows with missing data, which is 206,209. This also matches the maximum user_id of 206,209 in the describe() output. In the rows shown, the order number is 1 for every customer with missing data, meaning these are first orders with no prior order to count days from. Therefore, I will leave the missing values alone, as these rows are still valuable data.
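
The reasoning above can be checked programmatically instead of by eye. A minimal sketch on made-up data, where `toy_ords` and its values are assumptions for illustration; the same two checks could be run against the missing-values subset of the real orders data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords: each user's first order has no prior order
toy_ords = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_last_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

toy_nan = toy_ords[toy_ords["days_since_last_order"].isnull()]

# Every row with a missing value should be a first order
all_first_orders = bool((toy_nan["order_number"] == 1).all())

# There should be exactly one such row per user
one_per_user = (
    toy_nan["user_id"].nunique() == len(toy_nan) == toy_ords["user_id"].nunique()
)

print(all_first_orders, one_per_user)
```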

### Duplicate Values for Orders

In [37]:
# finding duplicate values for orders
df_ords_dups= df_ords[df_ords.duplicated()]

In [38]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order


The result is empty, which shows that there are no duplicate rows in the df_ords dataframe. No changes need to be made.

### Export changes made to Orders

In [39]:
df_ords.to_csv(os.path.join(path, 'Data','Prepared Data', 'orders_checked.csv'))
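
The `Unnamed: 0` column in the orders output above suggests that the index was written to file during an earlier export. A minimal sketch on made-up data of how that column appears and how passing `index=False` to `to_csv` avoids it. It uses in-memory buffers instead of real files, and all names and values are assumptions for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"order_id": [10, 11], "user_id": [1, 2]})

# Default export writes the index as an unnamed first column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index, index_col=False)

# index=False leaves the index out of the file
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index, index_col=False)

print(list(reloaded_with_index.columns))
print(list(reloaded_no_index.columns))
```

Whether to drop the index on export depends on whether later notebooks expect the extra column, so this is a suggestion rather than a required change.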