# Data consistency checks

# List of contents:
## 1. Import libraries
## 2. Import 'orders_wrangled' and 'products' datasets
## 3. Data consistency checks for df_prods
### 3.1. Practicing mixed-type data with a test dataframe
### 3.2. Check for missing values within df_prods
### 3.3. Check for duplicates
### 3.4. Descriptive statistics for df_prods_clean_no_dupes
## 4. Consistency checks for df_ords
### 4.1. Descriptive statistics
### 4.2. Checking for mixed-type data
### 4.3. Checking for missing values
### 4.4. Checking for duplicates
## 5. Export 'df_prods_clean_final' and 'df_ords' dataframes

## 1. Import libraries

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import os

## 2. Import datasets

In [2]:
# Creating folder path variable
path = r'C:\Users\marta\OneDrive\Documents\2023-09-18 Instacart Basket Analysis'

In [3]:
# Importing orders_wrangled dataset
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

In [5]:
# Removing 'Unnamed: 0' column from df_ords
df_ords = df_ords.drop(columns = ['Unnamed: 0'])

## 3. Data consistency checks for df_prods

### 3.1. Practicing mixed-type data with a test dataframe

In [6]:
# Creating a dataframe
df_test = pd.DataFrame()

In [7]:
# Creating a mixed-type column
df_test['mix'] = ['a', 'b', 1, True]

In [8]:
# Check the output
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [9]:
# Checking for mixed types
for col in df_test.columns.tolist():
    # Flag rows whose value type differs from the type of the first row
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_test[weird]) > 0:
        print(col)

mix


In [10]:
# Changing column data type to string
df_test['mix'] = df_test['mix'].astype('str')
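
The same check can be wrapped in a small reusable helper. The block below is only a sketch: the function name `find_mixed_type_columns` and the `df_example` dataframe are made up for illustration, and it counts distinct Python types per column instead of comparing every row with the first one. Note that a text column containing NaN would also be reported, because NaN is a float.

```python
import pandas as pd

def find_mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Map every value to its Python type and count the distinct types
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Toy data (hypothetical): 'mix' holds str, int and bool; 'clean' holds only ints
df_example = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
print(find_mixed_type_columns(df_example))
```

Returning a list instead of printing makes it easy to loop over the flagged columns afterwards, for example to cast each of them with `astype('str')`.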

### 3.2. Check for missing values within df_prods

In [11]:
# Finding missing values within df_prods
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [12]:
# Creating a subset dataframe for missing values
df_nan = df_prods[df_prods['product_name'].isnull()]

In [13]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [14]:
# Dimensions of df_prods
df_prods.shape

(49693, 5)

In [15]:
# Creating new dataframe without missing values
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [16]:
# Check dimensions
df_prods_clean.shape

(49677, 5)
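
The row count dropped from 49693 to 49677, i.e. by 16, which matches the 16 missing product names found earlier. As an alternative to the boolean filter, `dropna` with the `subset` argument should give the same result. A minimal sketch on a made-up dataframe (`df_example` and its values are assumptions, not the real products data):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one product has no name
df_example = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Milk', np.nan, 'Bread'],
})

# Drop rows only when 'product_name' is missing
df_example_clean = df_example.dropna(subset=['product_name'])

# The number of removed rows should equal the number of missing names
rows_removed = len(df_example) - len(df_example_clean)
print(rows_removed == df_example['product_name'].isnull().sum())
```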

### 3.3. Check for duplicates

In [17]:
# Creating subset df for dupes
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [18]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [19]:
# Check dimensions
df_prods_clean.shape

(49677, 5)

In [20]:
# Removing duplicates from df_prods_clean
df_prods_clean_no_dupes = df_prods_clean.drop_duplicates()

In [21]:
# Check dimensions
df_prods_clean_no_dupes.shape

(49672, 5)
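
The shape went from 49677 to 49672 rows, which agrees with the five duplicates listed in df_dups. By default `duplicated()` marks only the second and later copies. To inspect every copy side by side before dropping, `keep=False` can be used. A small sketch with invented data (`df_example` is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): product 2 appears twice
df_example = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['a', 'b', 'b', 'c'],
})

# keep=False marks all members of a duplicate group, not just the repeats
all_copies = df_example[df_example.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group
deduped = df_example.drop_duplicates()

print(all_copies)
print(deduped.shape)
```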

### 3.4. Descriptive statistics for df_prods_clean_no_dupes

In [22]:
# Calculating descriptive statistics
df_prods_clean_no_dupes.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


#### The product_id, aisle_id and department_id columns look consistent: the min value is 1, which is expected for id columns, and the max values look realistic. The count is the same across all columns, which indicates that no rows have missing data now.
#### The only value that looks unrealistic is the max of the prices column, 99999.00. The large standard deviation (453.6 against a mean of 9.99) points the same way.


In [23]:
# Setting the display option to show all rows
pd.set_option('display.max_rows', None)

In [24]:
# Frequency table for prices column as a dataframe
df = df_prods_clean_no_dupes['prices'].value_counts(dropna = False).to_frame()


In [25]:
# Frequency table for prices column
df_prods_clean_no_dupes['prices'].value_counts(dropna = False)

prices
2.5        470
5.3        458
6.2        451
2.6        447
5.4        444
3.8        442
6.3        442
2.4        442
6.9        442
3.0        441
3.4        438
6.8        438
5.2        437
4.5        437
3.3        435
2.8        435
1.8        434
5.1        431
6.5        430
5.6        428
4.6        427
4.4        426
3.6        425
2.0        424
4.9        423
5.7        423
4.3        420
6.4        419
4.0        419
5.8        418
6.6        417
1.7        417
3.5        416
2.9        416
2.1        415
4.2        415
2.3        415
5.0        414
2.2        413
2.7        412
4.7        411
6.0        408
3.1        407
4.1        406
3.9        406
4.8        405
6.7        404
3.2        403
3.7        399
5.9        399
7.0        396
5.5        386
6.1        385
1.9        377
1.6        371
12.6       356
11.5       348
8.3        347
9.4        340
8.2        334
11.3       333
14.4       330
9.2        329
12.9       328
14.2       327
14.3       326
7.5

#### The frequency table shows that the value 99999 occurred only once in the whole dataframe. It also revealed another unrealistic value, 14900, which also occurred once. I think it's safe to remove the rows that contain these two values.
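
Scrolling through a full frequency table works, but the extreme values can also be pulled out directly. One possible shortcut, sketched here on an invented series (the name `prices` and every value except 99999 and 14900 are assumptions), is `nlargest`:

```python
import pandas as pd

# Toy data (hypothetical): mostly ordinary prices plus two extreme ones
prices = pd.Series([2.5, 5.3, 14900.0, 25.0, 99999.0, 1.0])

# The largest few values are where unrealistic prices would show up
top = prices.nlargest(3)
print(top)
```

The index of the result points back to the original rows, so the suspicious records can be looked up afterwards.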

In [26]:
# Descriptive statistics for df
df.describe()

Unnamed: 0,count
count,242.0
mean,205.256198
std,176.496448
min,1.0
25%,6.0
50%,292.0
75%,332.25
max,470.0


In [27]:
# Filtering rows where price = [99999.00, 14900.00]
df_prods_clean_no_dupes[df_prods_clean_no_dupes['prices'].isin([99999, 14900])]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [28]:
# Creating df_unreal that contains filtered rows
df_unreal = df_prods_clean_no_dupes[df_prods_clean_no_dupes['prices'].isin([99999, 14900])]

In [29]:
# Check the output
df_unreal

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [30]:
# Removing rows with prices in [99999.00, 14900.00] from df_prods_clean_no_dupes
df_prods_clean_final = df_prods_clean_no_dupes.drop(df_unreal.index)

In [31]:
# Check dimensions
df_prods_clean_final.shape

(49670, 5)

In [32]:
# Method 2 of removing unwanted rows
df_prods_clean_no_dupes_v2 = df_prods_clean_no_dupes[df_prods_clean_no_dupes['prices']<500]

In [33]:
# Check dimensions
df_prods_clean_no_dupes_v2.shape

(49670, 5)
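
Both methods give the same shape, (49670, 5). Matching shapes do not strictly prove that the same rows were removed, so a direct comparison with `equals` can be a useful extra check. A sketch on a made-up dataframe (`df_example` is hypothetical; the two prices and the threshold of 500 are taken from the cells above):

```python
import pandas as pd

# Toy data (hypothetical) containing the two unrealistic prices
df_example = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'prices': [4.1, 14900.0, 7.1, 99999.0],
})

# Method 1: drop the rows by their index
unreal = df_example[df_example['prices'].isin([99999, 14900])]
by_index = df_example.drop(unreal.index)

# Method 2: keep only the rows below the threshold
by_threshold = df_example[df_example['prices'] < 500]

# True only if values, index and dtypes all match
same = by_index.equals(by_threshold)
print(same)
```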

## 4. Consistency checks for df_ords

### 4.1. Descriptive statistics

In [34]:
# Descriptive statistics for df_ords with converting scientific notation to decimal form
df_ords.describe().apply(lambda s: s.apply('{0:.2f}'.format))

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.21,17.15,2.78,13.45,11.11
std,987581.74,59533.72,17.73,2.05,4.23,9.21
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.5,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


#### Min and max values look realistic for all columns. A min value of 0 for order_day_of_week and order_hour_of_day is expected, as it indicates the first day of the week and the first hour of the day respectively.
#### days_since_prior_order doesn't have the same count as other columns, so this might be a case of missing values.

### 4.2. Checking for mixed-type data

In [None]:
# Checking for mixed types
for col in df_ords.columns.tolist():
    # Flag rows whose value type differs from the type of the first row
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)

#### According to the check, this dataframe doesn't have mixed-type columns.

### 4.3. Checking for missing values

In [None]:
# Checking for missing values within df_ords
df_ords.isnull().sum()

In [None]:
# Displaying observations with missing values
df_ords[df_ords['days_since_prior_order'].isnull()]

#### The days_since_prior_order column has 206,209 missing values (3,421,083 - 3,214,874). These most likely belong to each customer's first order: the count matches the number of users (the max user_id is 206,209), and a first order has no prior order to measure the days from.
#### I wouldn't delete or impute missing values in this case, as they actually provide useful information about the orders.
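
One way to check where the missing values sit is to compare the missing-value mask with the rows where order_number equals 1. The block below is a sketch on invented data (`df_example` and its values are assumptions); whether the real df_ords behaves the same way is not shown in this notebook and would need to be run there.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two users, the first order of each has no prior order
df_example = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, 30.0, np.nan, 0.0],
})

missing = df_example['days_since_prior_order'].isnull()
first_order = df_example['order_number'] == 1

# True if values are missing on first orders and nowhere else
missing_only_on_first_orders = (missing == first_order).all()

# The number of missing values should then equal the number of users
n_missing = missing.sum()
n_users = df_example['user_id'].nunique()
print(missing_only_on_first_orders, n_missing == n_users)
```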

### 4.4. Checking for duplicates

In [38]:
# Duplicates checks within df_ords
df_dups_ords = df_ords[df_ords.duplicated()]

In [39]:
df_dups_ords

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order


#### No duplicate rows were found within df_ords.

## 5. Export 'df_prods_clean_final' and 'df_ords' dataframes

In [None]:
# Export df_prods_clean_final as 'products_clean'
df_prods_clean_final.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_clean.csv'))

In [None]:
# Export df_ords as 'orders_clean'
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_clean.csv'))
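
A side note on the export: by default `to_csv` also writes the index, and that index comes back as an 'Unnamed: 0' column when the file is read again, like the one dropped from df_ords in section 2. Passing `index=False` is one way to avoid this, if the index is not needed later. A sketch using an in-memory buffer instead of a real file (`df_example` is hypothetical):

```python
import io
import pandas as pd

# Toy data (hypothetical)
df_example = pd.DataFrame({'order_id': [1, 2], 'user_id': [10, 11]})

# Default export: the index is written as an unnamed first column
buf_with = io.StringIO()
df_example.to_csv(buf_with)
buf_with.seek(0)
cols_with = pd.read_csv(buf_with).columns.tolist()

# Export without the index
buf_without = io.StringIO()
df_example.to_csv(buf_without, index=False)
buf_without.seek(0)
cols_without = pd.read_csv(buf_without).columns.tolist()

print(cols_with)
print(cols_without)
```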