# Data Preparation - Instacart Market Basket Analysis

## Overview

This notebook focuses on cleaning and preparing the Instacart datasets for analysis. Based on the findings from our initial data exploration, we will address data quality issues and ensure our datasets are ready for comprehensive analysis.

## Objectives

**1. Data Quality Assessment**
- Handle missing values across all datasets
- Remove or flag duplicate records
- Validate data consistency and logical constraints

**2. Data Cleaning**
- Clean inconsistent data formats
- Standardize data types where necessary
- Address any anomalies identified during exploration

**3. Data Integration**
- Merge datasets where appropriate for analysis
- Create derived variables if needed
- Ensure referential integrity between tables

**4. Data Validation**
- Verify cleaned data meets quality standards
- Perform final consistency checks
- Document all transformations applied

## Expected Outcome

Clean, validated datasets ready for in-depth analysis of customer shopping patterns, product preferences, and ordering behaviors.

---

*Note: All cleaning steps will be documented with clear explanations and rationale for reproducibility.*

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the raw, semicolon-delimited CSV files into DataFrames
df_instacart_orders = pd.read_csv('../data/raw/instacart_orders.csv', sep=';')
df_products = pd.read_csv('../data/raw/products.csv', sep=';')
df_order_products = pd.read_csv('../data/raw/order_products.csv', sep=';')
df_aisles = pd.read_csv('../data/raw/aisles.csv', sep=';')
df_departments = pd.read_csv('../data/raw/departments.csv', sep=';')

### DataFrame instacart_orders

In [None]:
# Count fully duplicated rows (every column identical)
dup_num = df_instacart_orders.duplicated().sum()
message = f"df_instacart_orders has {dup_num} duplicated rows"
print(message)

print("===========================================")
# Print every copy of each duplicated row, sorted by order_id
dup_rows = df_instacart_orders[df_instacart_orders.duplicated(keep=False)]
print(dup_rows.sort_values(by='order_id'))

#### Duplicated Values Analysis
- 15 duplicated orders
- All were placed on Wednesday (order_dow == 3) at 2 a.m. (order_hour_of_day == 2)
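
The day-and-hour observation above can be checked programmatically rather than by reading the printed rows. The following is a minimal sketch on a small synthetic frame (the values are made up for illustration and are not taken from the Instacart files); it assumes the same column names, `order_dow` and `order_hour_of_day`, used in this notebook.

```python
import pandas as pd

# Synthetic stand-in for df_instacart_orders; values are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 3, 4],
    "user_id": [10, 11, 11, 12, 12, 13],
    "order_dow": [0, 3, 3, 3, 3, 5],
    "order_hour_of_day": [9, 2, 2, 2, 2, 14],
})

# keep=False marks every copy of a duplicated row, not just the repeats.
dup_rows = orders[orders.duplicated(keep=False)]

# Count duplicated rows per (day, hour) combination.
dup_slots = (
    dup_rows.groupby(["order_dow", "order_hour_of_day"])
    .size()
    .reset_index(name="n_rows")
)
print(dup_slots)

# True only if every duplicated row falls on day 3 at hour 2.
all_wed_2am = bool(
    ((dup_rows["order_dow"] == 3) & (dup_rows["order_hour_of_day"] == 2)).all()
)
print(all_wed_2am)
```

On the real data, replacing `orders` with `df_instacart_orders` should yield a single (day, hour) group if the observation holds.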

#### Identified Issues
1. The duplicated rows match on several fields (order_id, user_id, order_number)
2. A user cannot place the same order_id more than once
3. order_id must be a unique key (no repeated values)
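
It can help to separate two cases that the issues above touch on: rows that are identical in every column (safe to drop) and rows that share an `order_id` but disagree elsewhere (which would need investigation). The sketch below illustrates the distinction on synthetic data; the frame and its values are hypothetical.

```python
import pandas as pd

# Synthetic example: order_id 2 is a full-row duplicate,
# order_id 3 repeats with a conflicting user_id.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 3],
    "user_id": [10, 11, 11, 12, 99],
    "order_number": [1, 4, 4, 2, 2],
})

# Full-row duplicates: every column matches an earlier row.
full_dups = int(orders.duplicated().sum())

# Key conflicts: order_id still repeats after exact copies are dropped.
deduped = orders.drop_duplicates()
key_conflicts = deduped[deduped["order_id"].duplicated(keep=False)]

print(f"full-row duplicates: {full_dups}")
print(f"rows with conflicting order_id: {len(key_conflicts)}")
```

If `key_conflicts` is empty on the real data, dropping exact duplicates is enough to make `order_id` unique.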

#### Conclusion
The duplicates were most likely caused by a server failure or an error during backup, so we remove them. Although they account for only a very small share of the rows, removing them preserves data integrity and keeps order_id unique.
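
To put a number on how small the affected share is, the row counts before and after deduplication can be compared. This is a minimal sketch on a synthetic frame, not the notebook's actual figures.

```python
import pandas as pd

# Synthetic frame with one exact duplicate; values are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "user_id": [10, 11, 11, 12, 13],
})

n_before = len(orders)
cleaned = orders.drop_duplicates().reset_index(drop=True)

# Rows removed and their share of the original frame.
n_removed = n_before - len(cleaned)
share_removed = n_removed / n_before
print(f"removed {n_removed} of {n_before} rows ({share_removed:.2%})")
```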

In [50]:
# Remove duplicated rows and reset the index
df_instacart_orders_cleared = df_instacart_orders.drop_duplicates().reset_index(drop=True)

# Confirm that no fully duplicated rows remain
dup_num = df_instacart_orders_cleared.duplicated().sum()
message = f"df_instacart_orders_cleared has {dup_num} duplicated rows"
print(message)

# Confirm that order_id is now unique
dup_order_id = df_instacart_orders_cleared['order_id'].duplicated().sum()
message = f"df_instacart_orders_cleared has {dup_order_id} duplicated order_id values"
print(message)

df_instacart_orders_cleared has 0 duplicated rows
df_instacart_orders_cleared has 0 duplicated order_id values
