# Data Consistency Checks

## Importing Libraries


In [10]:
import pandas as pd
import numpy as np
import os


## Defining Project Path


In [11]:
project_path = r"C:\Users\mshhan\Documents\05-2024 Instacart Basket Analysis\02 Data\Original Data"


## Importing Data


In [12]:
products_path = os.path.join(project_path, "products.csv")
df_prods = pd.read_csv(products_path)

# Loading the orders_wrangled.csv data set
prepared_data_path = os.path.join(project_path, "..", "Prepared Data")
orders_path = os.path.join(prepared_data_path, "orders_wrangled.csv")
df_ords = pd.read_csv(orders_path)


## Perform Consistency Checks on df_prods

In [22]:
def check_mixed_types(df):
    for col in df.columns.tolist():
        # Map the types of the values in the column
        type_map = df[col].map(type)
        # Check if there are any discrepancies in the types within the column
        if type_map.nunique() > 1:
            print(f"Mixed data types found in column: {col}")


# Check for mixed-type data in df_prods
check_mixed_types(df_prods)
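To make the behaviour of the check concrete, here is a minimal sketch on a toy dataframe. The helper name `find_mixed_type_columns` and the toy data are assumptions for illustration only; the helper returns the flagged column names instead of printing them, which makes the result easier to reuse.

```python
import pandas as pd

def find_mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Hypothetical toy frame: 'label' mixes str and int, 'qty' holds only integers
toy = pd.DataFrame({"label": ["a", 2, "c"], "qty": [1, 2, 3]})

mixed_cols = find_mixed_type_columns(toy)
print(mixed_cols)
```

Only `label` should be reported here, because it is the only column whose values have more than one type.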



In [23]:
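# Example conversion (adjust as needed based on the column flagged above).
# Caution: depending on the pandas version, astype(str) can turn missing
# values into the literal string 'nan', which isnull() no longer reports.
df_prods['product_name'] = df_prods['product_name'].astype(str)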
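Because a cast to `str` can, depending on the pandas version, replace missing values with text, it may be safer to count missing values first and to convert only the non-missing entries. The following is a sketch under that assumption, using a hypothetical three-row series rather than the real `product_name` column.

```python
import pandas as pd

# Hypothetical stand-in for a text column with one missing entry
names = pd.Series(["Apple", None, "Milk"], dtype="object")

# Count missing values before any type conversion
n_missing = int(names.isnull().sum())

# Keep missing entries as they are; convert everything else to str
safe = names.where(names.isnull(), names.astype(str))

print(n_missing, int(safe.isnull().sum()))
```

With this approach the missing entry is still visible to `isnull()` after the conversion.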


In [24]:
# Check for missing values in df_prods
missing_values = df_prods.isnull().sum()
print(missing_values)

product_id       0
product_name     0
aisle_id         0
department_id    0
prices           0
dtype: int64


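No column of `df_prods` reports missing values, so we can proceed. Note that `product_name` was cast to `str` in the previous step, which can mask missing names, so this count is most reliable when taken before the cast.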

In [25]:
# Check for duplicate rows in df_prods
duplicates = df_prods.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")


Number of duplicate rows: 0


In [26]:
# Remove duplicate rows (a no-op here, since none were found)
df_prods = df_prods.drop_duplicates()


In [27]:
# Summary statistics for df_ords
df_ords.describe()


| | order_id | user_id | order_number | day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| count | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3421083.0 | 3214874.0 |
| mean | 1710542.0 | 102978.2 | 17.15486 | 2.776219 | 13.45202 | 11.11484 |
| std | 987581.7 | 59533.72 | 17.73316 | 2.046829 | 4.226088 | 9.206737 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 855271.5 | 51394.0 | 5.0 | 1.0 | 10.0 | 4.0 |
| 50% | 1710542.0 | 102689.0 | 11.0 | 3.0 | 13.0 | 7.0 |
| 75% | 2565812.0 | 154385.0 | 23.0 | 5.0 | 16.0 | 15.0 |
| max | 3421083.0 | 206209.0 | 100.0 | 6.0 | 23.0 | 30.0 |


### Analysis of `df_ords.describe()` Output

Based on the summary statistics returned by `df_ords.describe()`, these are the observations and the potential issues that need further investigation:

1. **order_id and user_id**:
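   - Both columns are identifiers, so their summary statistics (mean, quartiles) carry no analytical meaning. `order_id` should be unique and non-missing. `user_id` should be non-missing but is expected to repeat, once for every order a customer placed.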

2. **order_number**:
   - Min: 1, Max: 100
   - This seems reasonable, as `order_number` likely represents the sequence of orders placed by a user.

3. **day_of_week**:
   - Min: 0, Max: 6
   - The seven values 0 to 6 are consistent with days of the week. Which number stands for which day (for example, whether 0 is Sunday) is not visible in the data and should be confirmed against the data dictionary.

4. **order_hour_of_day**:
   - Min: 0, Max: 23
   - These values are within the expected range for hours in a day.

5. **days_since_prior_order**:
   - Min: 0, Max: 30
   - The values range from 0 to 30 days, which is plausible. A maximum of exactly 30 may mean that longer gaps are capped at 30, which is worth confirming against the data documentation. The count (3,214,874) is also lower than for the other columns (3,421,083), so 206,209 values are missing and need to be explained.

### Summary of Potential Issues:
- **days_since_prior_order**: Confirm that the maximum of 30 matches how the value is recorded, and account for the missing values implied by its lower count.
- All other columns appear to be within expected ranges and do not indicate any immediate issues.
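The range observations above can also be expressed as explicit checks rather than read off the table. The sketch below assumes a small hypothetical sample with the same column names as `df_ords`; the expected ranges are the ones discussed in this section, not documented constraints.

```python
import pandas as pd

# Hypothetical sample with the same column names as df_ords
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id": [1, 1, 2, 2],
    "order_number": [1, 2, 1, 2],
    "day_of_week": [0, 6, 3, 2],
    "order_hour_of_day": [0, 23, 12, 8],
    "days_since_prior_order": [None, 30.0, None, 7.0],
})

# Each entry maps a description to a True/False result
checks = {
    "order_id is unique": sample["order_id"].is_unique,
    "day_of_week in 0-6": sample["day_of_week"].between(0, 6).all(),
    "order_hour_of_day in 0-23": sample["order_hour_of_day"].between(0, 23).all(),
    "days_since_prior_order >= 0": (sample["days_since_prior_order"].dropna() >= 0).all(),
}

for name, passed in checks.items():
    print(f"{name}: {bool(passed)}")
```

Running the same dictionary of checks against the full dataframe would show at a glance which assumption, if any, fails.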


## Checking for Mixed-Type Data in df_ords


In [29]:
# Check for mixed-type data in df_ords using the previously defined function
check_mixed_types(df_ords)

In [30]:
# Store order_id as a string so that it is treated as an identifier, not a numeric quantity
df_ords['order_id'] = df_ords['order_id'].astype(str)


## Checking for Missing Values in df_ords


Next, check each column of `df_ords` for missing values:

In [32]:
# Check for missing values in df_ords
missing_values_ords = df_ords.isnull().sum()
print(missing_values_ords)

order_id                       0
user_id                        0
order_number                   0
day_of_week                    0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64


### Analysis of Missing Values in `df_ords`

Based on the output from the `isnull().sum()` function, the following observations were made:

- The `days_since_prior_order` column has 206,209 missing values.
- All other columns (`order_id`, `user_id`, `order_number`, `day_of_week`, and `order_hour_of_day`) have no missing values.

#### Explanation for Missing Values

The missing values in `days_since_prior_order` most likely belong to each customer's first order, for which there is no prior order to measure from. The number of missing values (206,209) equals the maximum `user_id` in the summary statistics above, which is consistent with one missing value per customer.

This is a reasonable explanation and aligns with expected behavior for new customers.
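The first-order explanation can be tested directly instead of assumed. A minimal sketch, using a hypothetical sample with the relevant `df_ords` columns: it checks that a value is missing exactly when `order_number` is 1, and counts how many distinct users have a missing value.

```python
import pandas as pd

# Hypothetical sample: two users, each with a first order and later orders
sample = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "order_number": [1, 2, 1, 2, 3],
    "days_since_prior_order": [None, 5.0, None, 0.0, 12.0],
})

is_missing = sample["days_since_prior_order"].isnull()
is_first = sample["order_number"] == 1

# True only if missing values and first orders coincide row by row
missing_only_on_first = bool((is_missing == is_first).all())

# Number of distinct users that have at least one missing value
users_with_missing = int(sample.loc[is_missing, "user_id"].nunique())

print(missing_only_on_first, users_with_missing)
```

If the first check were to fail on the real data, some missing values would need a different explanation.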


In [33]:
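# Fill missing values in 'days_since_prior_order' with 0 (first orders then
# share the value 0 with same-day reorders). Assign the result back instead of
# using inplace=True on a column selection, which newer pandas warns about.
df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].fillna(0)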
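Filling with 0 makes a first order indistinguishable from a reorder placed on the same day. One possible way to keep that information, sketched here on hypothetical data, is to record a flag before filling. The column name `is_first_order` is an assumption and is not part of the original dataset.

```python
import pandas as pd

# Hypothetical sample; the real df_ords has more columns
sample = pd.DataFrame({
    "order_number": [1, 2, 1, 2],
    "days_since_prior_order": [None, 0.0, None, 9.0],
})

# Record which rows were missing before filling (column name is an assumption)
sample["is_first_order"] = sample["days_since_prior_order"].isnull()
sample["days_since_prior_order"] = sample["days_since_prior_order"].fillna(0)

# Zeros that are genuine same-day reorders rather than filled first orders
is_zero = sample["days_since_prior_order"] == 0
same_day_reorders = int((is_zero & ~sample["is_first_order"]).sum())

print(same_day_reorders)
```

Whether the extra column is worth keeping depends on how `days_since_prior_order` is used later in the analysis.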


In [34]:
# Check for duplicate rows in df_ords
duplicates_ords = df_ords.duplicated().sum()
print(f"Number of duplicate rows: {duplicates_ords}")

Number of duplicate rows: 0


`df_ords` contains no fully duplicated rows, so no deduplication is needed. This check compares entire rows only; it does not by itself confirm that `order_id` is unique.
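A full-row duplicate check can miss rows that share a key but differ in another column. The sketch below, on hypothetical data, contrasts `duplicated()` on whole rows with `duplicated(subset=...)` on the `order_id` key.

```python
import pandas as pd

# Hypothetical sample: order_id 2 appears twice with different user_id values
sample = pd.DataFrame({
    "order_id": [1, 2, 2],
    "user_id": [10, 11, 12],
})

# Rows identical in every column
full_row_duplicates = int(sample.duplicated().sum())

# Rows that repeat an order_id already seen, regardless of other columns
duplicate_order_ids = int(sample.duplicated(subset=["order_id"]).sum())

print(full_row_duplicates, duplicate_order_ids)
```

Here the whole-row check finds nothing, while the key-based check reports the repeated `order_id`.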

In [36]:
# Export cleaned df_prods
df_prods.to_csv(os.path.join(prepared_data_path, "products_cleaned.csv"), index=False)

# Export cleaned df_ords
df_ords.to_csv(os.path.join(prepared_data_path, "orders_cleaned.csv"), index=False)
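CSV files do not store column types, so an `order_id` saved as text is inferred as an integer again when the file is read with default settings. The sketch below illustrates this with an in-memory buffer instead of a real file; the sample data is hypothetical, and passing `dtype` to `pd.read_csv` is one way to restore the intended type.

```python
import io
import pandas as pd

# Hypothetical sample with order_id stored as text
sample = pd.DataFrame({"order_id": ["1", "2"], "user_id": [10, 11]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: order_id is inferred as an integer column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: order_id is read back as text
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"order_id": str})

print(default_read["order_id"].dtype)
print(typed_read["order_id"].tolist())
```

Any notebook that loads `orders_cleaned.csv` would need the same `dtype` argument if it relies on `order_id` being a string.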
