# Instacart Market Basket Analysis

## Objectives
- Analyze customer shopping patterns
- Identify most popular products
- Discover insights about purchasing behavior

## Datasets
- instacart_orders.csv
- products.csv  
- order_products.csv
- aisles.csv
- departments.csv

In [73]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [74]:
# Reading the data to variables
df_instacart_orders = pd.read_csv('../data/raw/instacart_orders.csv', sep=';')
df_products = pd.read_csv('../data/raw/products.csv', sep=';')
df_order_products = pd.read_csv('../data/raw/order_products.csv', sep=';')
df_aisles = pd.read_csv('../data/raw/aisles.csv', sep=';')
df_departments = pd.read_csv('../data/raw/departments.csv', sep=';')

In [75]:
# Check the orders dataframe: structure, non-null counts, and distinct values per column
print("=== INSTACART ORDERS INFO ===")
df_instacart_orders.info(show_counts=True)
df_instacart_orders.nunique()

=== INSTACART ORDERS INFO ===
<class 'pandas.DataFrame'>
RangeIndex: 478967 entries, 0 to 478966
Data columns (total 6 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   order_id                478967 non-null  int64  
 1   user_id                 478967 non-null  int64  
 2   order_number            478967 non-null  int64  
 3   order_dow               478967 non-null  int64  
 4   order_hour_of_day       478967 non-null  int64  
 5   days_since_prior_order  450148 non-null  float64
dtypes: float64(1), int64(5)
memory usage: 21.9 MB


order_id                  478952
user_id                   157437
order_number                 100
order_dow                      7
order_hour_of_day             24
days_since_prior_order        31
dtype: int64

Data insights
The Instacart orders dataframe has 478967 rows and six columns (in order): order_id, user_id, order_number, order_dow, order_hour_of_day and days_since_prior_order.
All columns are int64 and free of NaN values, except "days_since_prior_order", which is float64 and has 450148 non-null values (28819 missing).
"order_id" has 478952 distinct values across 478967 rows, so 15 rows repeat an order ID that already appears elsewhere.
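
The missing-value and distinct-value counts above can be cross-checked directly. The following is a minimal sketch on a small hypothetical stand-in frame (`orders_sample`, with made-up values), not on the real `df_instacart_orders`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_instacart_orders (illustrative values only)
orders_sample = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "user_id": [10, 11, 11, 12],
    "order_number": [1, 5, 5, 2],
    "days_since_prior_order": [np.nan, 3.0, 3.0, 7.0],
})

# Number of missing values in each column
missing_per_column = orders_sample.isna().sum()

# Rows minus distinct order_id values = rows that repeat an earlier order ID
repeated_order_ids = len(orders_sample) - orders_sample["order_id"].nunique()

# Rows that are identical to an earlier row in every column
duplicate_rows = int(orders_sample.duplicated().sum())

print(missing_per_column)
print("repeated order IDs:", repeated_order_ids)
print("fully duplicated rows:", duplicate_rows)
```

Applied to `df_instacart_orders`, the same expressions should agree with the `info()` and `nunique()` output shown above. Whether the repeated order IDs are fully duplicated rows is something the last check would reveal; it is not established by the output alone.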

In [76]:
# Checking dataframes information
print("=== PRODUCTS INFO ===")
df_products.info()

=== PRODUCTS INFO ===
<class 'pandas.DataFrame'>
RangeIndex: 49694 entries, 0 to 49693
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   product_id     49694 non-null  int64
 1   product_name   48436 non-null  str  
 2   aisle_id       49694 non-null  int64
 3   department_id  49694 non-null  int64
dtypes: int64(3), str(1)
memory usage: 1.5 MB


Data insights
The products dataframe has 49694 rows and four columns (in order): product_id, product_name, aisle_id and department_id.
All columns are int64 and free of NaN values, except "product_name", which is str and has 48436 non-null values (1258 missing).
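
One way to look closer at the missing product names is to isolate those rows and see how they are spread across aisles and departments. This is a sketch on a hypothetical `products_sample` frame; the IDs and names in it are invented for illustration:

```python
import pandas as pd

# Hypothetical stand-in for df_products (illustrative values only)
products_sample = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": ["Banana", None, "Whole Milk", None],
    "aisle_id": [24, 100, 84, 100],
    "department_id": [4, 21, 16, 21],
})

# Keep only the rows where the product name is missing
missing_names = products_sample[products_sample["product_name"].isna()]

# Count the missing names per (aisle_id, department_id) pair
missing_by_group = missing_names.groupby(["aisle_id", "department_id"]).size()

print(len(missing_names), "rows without a product name")
print(missing_by_group)
```

If the missing names cluster in a single aisle or department in the real data, that would suggest a placeholder category rather than random gaps, but that is an assumption to verify, not a finding.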

In [80]:
# Checking dataframes information
print("=== ORDERS PRODUCTS INFO ===")
df_order_products.info(show_counts=True)

=== ORDERS PRODUCTS INFO ===
<class 'pandas.DataFrame'>
RangeIndex: 4545007 entries, 0 to 4545006
Data columns (total 4 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   order_id           4545007 non-null  int64  
 1   product_id         4545007 non-null  int64  
 2   add_to_cart_order  4544171 non-null  float64
 3   reordered          4545007 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 138.7 MB


Data insights
The order products dataframe has 4545007 rows and four columns (in order): order_id, product_id, add_to_cart_order and reordered.
All columns are int64 and free of NaN values, except "add_to_cart_order", which is float64 and has 4544171 non-null values (836 missing).
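
A column of whole numbers that contains NaN is read as float64, because NumPy's int64 cannot represent missing values. If integer semantics are wanted for "add_to_cart_order", pandas' nullable `Int64` dtype is one option. A minimal sketch on a hypothetical `order_products_sample` frame (values invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_order_products (illustrative values only)
order_products_sample = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [100, 101, 100],
    "add_to_cart_order": [1.0, 2.0, np.nan],
    "reordered": [0, 1, 1],
})

# Count missing cart positions before converting
missing_positions = int(order_products_sample["add_to_cart_order"].isna().sum())

# Nullable integer dtype: keeps whole numbers and stores missing entries as <NA>
order_products_sample["add_to_cart_order"] = (
    order_products_sample["add_to_cart_order"].astype("Int64")
)

print(missing_positions)
print(order_products_sample.dtypes)
```

The conversion only succeeds when every non-missing value is a whole number, so it also serves as a sanity check on the column.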

In [78]:
# Checking dataframes information
print("=== AISLES INFO ===")
df_aisles.info()

=== AISLES INFO ===
<class 'pandas.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   aisle_id  134 non-null    int64
 1   aisle     134 non-null    str  
dtypes: int64(1), str(1)
memory usage: 2.2 KB


Data insights
The aisles dataframe has 134 rows and two columns (in order): aisle_id and aisle.
Both are free of NaN values; aisle_id is int64 and aisle is str.

In [79]:
# Checking dataframes information
print("=== DEPARTMENTS INFO ===")
df_departments.info()

=== DEPARTMENTS INFO ===
<class 'pandas.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   department_id  21 non-null     int64
 1   department     21 non-null     str  
dtypes: int64(1), str(1)
memory usage: 468.0 bytes


Data insights
The departments dataframe has 21 rows and two columns (in order): department_id and department.
Both are free of NaN values; department_id is int64 and department is str.
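
The per-dataframe checks above repeat the same steps, so they could be collected into one overview table. The helper below, `summarize_frames`, is a hypothetical name introduced for this sketch, and the sample frames hold made-up values:

```python
import pandas as pd

def summarize_frames(frames):
    """Return one row per DataFrame: row count, column count, missing cells."""
    rows = []
    for name, frame in frames.items():
        rows.append({
            "dataframe": name,
            "rows": len(frame),
            "columns": frame.shape[1],
            "missing_cells": int(frame.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataframe")

# Hypothetical small frames standing in for the five real dataframes
sample_frames = {
    "aisles": pd.DataFrame({"aisle_id": [1, 2], "aisle": ["a", "b"]}),
    "products": pd.DataFrame({
        "product_id": [1, 2, 3],
        "product_name": ["x", None, "z"],
    }),
}

summary = summarize_frames(sample_frames)
print(summary)
```

In the notebook, the dictionary would instead map names to `df_instacart_orders`, `df_products`, `df_order_products`, `df_aisles` and `df_departments`.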

nunique() was also run on the orders dataframe to check that "order_dow" and "order_hour_of_day" hold plausible values:
- "order_dow" should have at most 7 distinct values, since it refers to the day of the week (7 found)
- "order_hour_of_day" should have at most 24 distinct values, since it refers to the hour of the day (24 found)
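
A distinct-value count only bounds how many values occur, not which ones. A stricter check also tests the value ranges. The sketch below uses a hypothetical `orders_sample` frame and assumes days are coded 0 to 6 and hours 0 to 23; that coding is an assumption and is not confirmed by the output shown here:

```python
import pandas as pd

# Hypothetical stand-in for df_instacart_orders (illustrative values only)
orders_sample = pd.DataFrame({
    "order_dow": [0, 3, 6, 6],
    "order_hour_of_day": [0, 13, 23, 8],
})

# Distinct values per column, as nunique() reports them
distinct_counts = orders_sample[["order_dow", "order_hour_of_day"]].nunique()
dow_count_ok = bool(distinct_counts["order_dow"] <= 7)
hour_count_ok = bool(distinct_counts["order_hour_of_day"] <= 24)

# Assumed coding: days 0-6, hours 0-23 (between() is inclusive on both ends)
dow_in_range = bool(orders_sample["order_dow"].between(0, 6).all())
hour_in_range = bool(orders_sample["order_hour_of_day"].between(0, 23).all())

print(dow_count_ok, hour_count_ok, dow_in_range, hour_in_range)
```

If the real columns use a different coding (for example days 1 to 7), the bounds passed to `between` would need to change accordingly.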