# Data Preprocessing - Instacart Dataset

Author: Kelly Li

## Table of Contents:
* [1 Introduction](#one)
* [2 Datasets](#two)
    * [2.1 Data Sources](#twoone)
    * [2.2 Data Import & Summary](#twotwo)
* [3 Data Cleaning](#three)
    * [3.1 Data Types](#threeone) 
    * [3.2 Missing Data](#threetwo)
    * [3.3 Duplicate Data](#threethree)
    * [3.4 Outliers](#threefour)
    * [3.5 Zero Ratings](#threefive)
* [4 Data Preprocessing](#four)
    * [4.1 Feature Engineering](#fourone)
        * [4.1.1 Nutritional Values](#fouroneone)
        * [4.1.2 Time Components](#fouronetwo)
        * [4.1.3 Dietary Preference](#fouronethree)
    * [4.2 Aggregation](#fourtwo)
        * [4.2.1 Ratings](#fourtwoone)
        * [4.2.2 Merge Datasets](#fourtwotwo)
* [5 Conclusion](#five)

---

## 1 Introduction <a class="anchor" id="one"></a>

Welcome to the Data Preprocessing phase of the Instacart Market Basket Analysis (MBA) project. This Jupyter Notebook prepares and cleans the Instacart dataset so that the later analysis and modeling rest on reliable data.

### Project Rationale
**Why Instacart?**

This project was inspired by a passion for data science and a genuine interest in the realm of consumer behavior and online grocery shopping. The Instacart dataset provides a rich and complex landscape of real-world data that presents a multitude of exciting analytical opportunities.

**Understanding Shopping Behavior**

The modern world of e-commerce has revolutionized the way consumers shop for everyday essentials. With the increasing reliance on online platforms for grocery shopping, understanding customer preferences, purchasing patterns, and product associations has become essential for both retailers and data scientists.

**Extracting Insights**

By undertaking this project, I aim to extract insights from the dataset that can benefit both Instacart as a service provider and consumers looking for a more convenient shopping experience. Careful analysis can uncover hidden patterns, identify trends, and inform strategies for optimizing the shopping journey.

**The Power of Data**

Data science is not merely about numbers and algorithms; it's about leveraging the power of data to solve real-world problems. Through this project, I hope to demonstrate how data preprocessing is the critical first step in transforming raw data into actionable insights. Clean and well-structured data empowers us to build accurate models, make informed decisions, and drive positive change.

### What to Expect
In this notebook, we will systematically address data quality issues, handle missing data, encode categorical variables, and prepare the dataset for the subsequent stages of the analysis. Each step contributes to the overarching goal: a deeper understanding of customer behavior in online grocery shopping.

Let's embark on this data-driven adventure and unlock the potential insights hidden within the Instacart dataset.

In [None]:
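import pandas as pd

# Load the basket data; the relative path means baskets.csv must be in the working directory
basket_df = pd.read_csv("baskets.csv")
# Preview the first 10 rows
basket_df.head(10)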
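
The data quality checks outlined under "What to Expect" could begin with a quick per-column summary. The sketch below is only illustrative: the helper `quality_summary` and the columns `order_id`, `product_id`, and `rating` are assumptions made for the example, not the confirmed schema of `baskets.csv`. A small in-memory frame stands in for the real data so the cell runs on its own.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Hypothetical stand-in for the real data; column names are assumptions.
sample_df = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "product_id": [10, 11, 10, 12, 12],
    "rating": [5.0, None, 3.0, 4.0, 4.0],
})

summary = quality_summary(sample_df)

# Count rows that exactly repeat an earlier row.
n_duplicate_rows = int(sample_df.duplicated().sum())

print(summary)
print("duplicate rows:", n_duplicate_rows)
```

Calling `quality_summary(basket_df)` on the loaded data should give a similar overview and indicate which columns need attention in the cleaning steps.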