# Unpacking Instacart: A Deep Dive into North American Grocery E-Commerce Behavior

# Preliminary Data Exploration

The goal is to understand the relationships within the dataset before loading it into SQL, and to use them to design a database schema.

In [2]:
import pandas as pd

In [9]:
# Read each CSV file into its own DataFrame
df_aisles = pd.read_csv('data/aisles.csv')
df_departments = pd.read_csv('data/departments.csv')
df_order_products_prior = pd.read_csv('data/order_products__prior.csv')
df_order_products_train = pd.read_csv('data/order_products__train.csv')
df_orders = pd.read_csv('data/orders.csv')
df_products = pd.read_csv('data/products.csv')
df_sample_submission = pd.read_csv('data/sample_submission.csv')

## Aisles

In [29]:
print(f'The dataset has {df_aisles.shape[0]} rows and {df_aisles.shape[1]} columns.')
print(f'There are duplicates: {df_aisles.duplicated().values.any()}')

The dataset has 134 rows and 2 columns.
There are duplicates: False


In [28]:
print(df_aisles.info())
df_aisles.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   aisle_id  134 non-null    int64 
 1   aisle     134 non-null    object
dtypes: int64(1), object(1)
memory usage: 2.2+ KB
None


Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


## Departments

In [26]:
print(f'The dataset has {df_departments.shape[0]} rows and {df_departments.shape[1]} columns.')
print(f'There are duplicates: {df_departments.duplicated().values.any()}')

The dataset has 21 rows and 2 columns.
There are duplicates: False


In [31]:
print(df_departments.info())
df_departments.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   department_id  21 non-null     int64 
 1   department     21 non-null     object
dtypes: int64(1), object(1)
memory usage: 464.0+ bytes
None


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


## Products

In [33]:
print(f'The dataset has {df_products.shape[0]} rows and {df_products.shape[1]} columns.')
print(f'There are duplicates: {df_products.duplicated().values.any()}')

The dataset has 49688 rows and 4 columns.
There are duplicates: False


In [32]:
print(df_products.info())
df_products.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_id     49688 non-null  int64 
 1   product_name   49688 non-null  object
 2   aisle_id       49688 non-null  int64 
 3   department_id  49688 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.5+ MB
None


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13
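
Because `products` carries both `aisle_id` and `department_id`, it appears to be the table that links to `aisles` and `departments`. Before loading the data into SQL, it may be worth confirming that every such reference resolves. The following is a minimal sketch on hypothetical miniature frames (not the real files); `find_orphans` is an illustrative helper, not a pandas function.

```python
import pandas as pd

# Hypothetical miniature stand-ins for aisles.csv and products.csv
aisles = pd.DataFrame({"aisle_id": [1, 2, 3], "aisle": ["a", "b", "c"]})
products = pd.DataFrame({
    "product_id": [10, 11, 12],
    "product_name": ["x", "y", "z"],
    "aisle_id": [1, 3, 9],  # 9 has no match in aisles
})


def find_orphans(child, parent, key):
    """Return rows of `child` whose `key` value is absent from `parent[key]`."""
    return child[~child[key].isin(parent[key])]


orphans = find_orphans(products, aisles, "aisle_id")
print(f"Rows violating the foreign key: {len(orphans)}")
```

An empty result for both `aisle_id` and `department_id` on the real data would support declaring them as foreign keys in the schema.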


## Order Products

These files specify which products were purchased in each order.

* **order_products__prior.csv** contains previous order contents for all customers.
* **'reordered'** indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items. You may predict an explicit 'None' value for orders with no reordered items. See the evaluation page for full details.
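
To make the note about orders with no reordered items concrete, the sketch below counts reordered items per order and lists the orders where that count is zero. The frame is a hypothetical sample in the shape of the order products files, not real data.

```python
import pandas as pd

# Hypothetical sample in the shape of order_products__*.csv
order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2],
    "product_id": [100, 101, 102, 100, 103],
    "add_to_cart_order": [1, 2, 3, 1, 2],
    "reordered": [1, 0, 1, 0, 0],
})

# Number of reordered items in each order
reordered_per_order = order_products.groupby("order_id")["reordered"].sum()

# Orders without a single reordered item (candidates for an explicit 'None')
orders_without_reorders = reordered_per_order[reordered_per_order == 0].index.tolist()
print(orders_without_reorders)
```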

### Prior

In [34]:
print(f'The dataset has {df_order_products_prior.shape[0]} rows and {df_order_products_prior.shape[1]} columns.')
print(f'There are duplicates: {df_order_products_prior.duplicated().values.any()}')
print(df_order_products_prior.info())
df_order_products_prior.head()

The dataset has 32434489 rows and 4 columns.
There are duplicates: False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype
---  ------             -----
 0   order_id           int64
 1   product_id         int64
 2   add_to_cart_order  int64
 3   reordered          int64
dtypes: int64(4)
memory usage: 989.8 MB
None


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


### Train

In [35]:
print(f'The dataset has {df_order_products_train.shape[0]} rows and {df_order_products_train.shape[1]} columns.')
print(f'There are duplicates: {df_order_products_train.duplicated().values.any()}')
print(df_order_products_train.info())
df_order_products_train.head()

The dataset has 1384617 rows and 4 columns.
There are duplicates: False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
 #   Column             Non-Null Count    Dtype
---  ------             --------------    -----
 0   order_id           1384617 non-null  int64
 1   product_id         1384617 non-null  int64
 2   add_to_cart_order  1384617 non-null  int64
 3   reordered          1384617 non-null  int64
dtypes: int64(4)
memory usage: 42.3 MB
None


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


## Orders

This file specifies which set (prior, train, test) each order belongs to.
* You are predicting reordered items only for the test set orders.
* **'order_dow'** is the day of the week on which the order was placed.

In [37]:
print(f'The dataset has {df_orders.shape[0]} rows and {df_orders.shape[1]} columns.')
print(f'There are duplicates: {df_orders.duplicated().values.any()}')
print(df_orders.info())
# A notebook cell displays only its last expression, so show just the tail
df_orders.tail()

The dataset has 3421083 rows and 7 columns.
There are duplicates: False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   eval_set                object 
 3   order_number            int64  
 4   order_dow               int64  
 5   order_hour_of_day       int64  
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB
None


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
3421078,2266710,206209,prior,10,5,18,29.0
3421079,1854736,206209,prior,11,4,10,30.0
3421080,626363,206209,prior,12,1,12,18.0
3421081,2977660,206209,prior,13,1,12,7.0
3421082,272231,206209,train,14,6,14,30.0


In [53]:
df_orders.order_dow.unique()

array([2, 3, 4, 1, 5, 0, 6])

### Check if there are any NaN values

In [47]:
print(f'There are NaN values: {df_orders.isna().values.any()}')

There are NaN values: True


In [46]:
df_orders.groupby('eval_set').count()

eval_set,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
prior,3214874,3214874,3214874,3214874,3214874,3008665
test,75000,75000,75000,75000,75000,75000
train,131209,131209,131209,131209,131209,131209


In [48]:
df_orders.count()

order_id                  3421083
user_id                   3421083
eval_set                  3421083
order_number              3421083
order_dow                 3421083
order_hour_of_day         3421083
days_since_prior_order    3214874
dtype: int64

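The NaN values are confined to the column 'days_since_prior_order', and all of them belong to the prior evaluation set: 3,214,874 of 3,421,083 rows are non-null overall, so 206,209 values are missing, which matches the gap within the prior set (3,214,874 - 3,008,665). A likely explanation, not verified here, is that each missing value marks a user's first order, which has no preceding order.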
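
The NaN observation can be checked more directly by counting missing values per evaluation set and comparing them with `order_number`. This is a sketch on a hypothetical sample in the shape of orders.csv; the assumption that a missing value marks a user's first order is tested by the code, not taken from the dataset documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the shape of orders.csv
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "user_id": [1, 1, 1, 2, 2],
    "eval_set": ["prior", "prior", "train", "prior", "test"],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 7.0, 30.0, np.nan, 14.0],
})

is_missing = orders["days_since_prior_order"].isna()

# Missing values per evaluation set
nan_by_set = is_missing.groupby(orders["eval_set"]).sum()

# Do the missing values coincide exactly with each user's first order?
nan_only_on_first_order = bool((is_missing == (orders["order_number"] == 1)).all())

print(nan_by_set)
print(nan_only_on_first_order)
```

If the second check holds on the real data, the column can be declared nullable in the schema with a clear meaning: no previous order exists.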

### Find a primary key in the dataset

In [93]:
df_prior = df_orders[df_orders.eval_set == 'prior']
print(f'There are duplicates: {df_prior.duplicated().values.any()}')

There are duplicates: False


In [100]:
# Count how often each order_id occurs (computed over all of df_orders, not only the prior set)
prior_unique_orders = df_orders.order_id.value_counts().reset_index()
prior_unique_orders

Unnamed: 0,order_id,count
0,2539329,1
1,1591157,1
2,1354759,1
3,1971373,1
4,1558866,1
...,...,...
3421078,3266950,1
3421079,118963,1
3421080,9433,1
3421081,2938641,1


In [99]:
df_train = df_orders[df_orders.eval_set == 'train']
print(f'There are duplicates: {df_train.duplicated().values.any()}')

train_unique_orders = df_train.order_id.value_counts().reset_index()
train_unique_orders

There are duplicates: False


Unnamed: 0,index,order_id
0,1187899,1
1,1452256,1
2,1928992,1
3,1485735,1
4,172103,1
...,...,...
131204,1522652,1
131205,857255,1
131206,3143222,1
131207,3406606,1
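
Scanning a truncated `value_counts()` table for counts above 1 is hard at this size, so a single boolean answer may be easier to read. The sketch below uses `Series.is_unique` for a single-column key and `DataFrame.duplicated(subset=...)` for a composite key. The frames are hypothetical samples, and treating (`order_id`, `product_id`) as the key of the order products files is an assumption that would need to be confirmed on the real data.

```python
import pandas as pd

# Hypothetical samples in the shape of orders.csv and order_products__*.csv
orders = pd.DataFrame({"order_id": [5, 7, 9], "user_id": [1, 1, 2]})
order_products = pd.DataFrame({
    "order_id": [5, 5, 7],
    "product_id": [100, 101, 100],
})

# Single-column key candidate: every order_id occurs exactly once
order_id_is_key = bool(orders["order_id"].is_unique)

# Composite key candidate: no (order_id, product_id) pair may repeat
composite_is_key = not order_products.duplicated(
    subset=["order_id", "product_id"]
).any()

print(order_id_is_key, composite_is_key)
```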


## Sample Submission

For each **order_id in the test set**, you should predict a space-delimited list of product_ids for that order.
* If you wish to predict an empty order, you should submit an explicit 'None' value.
* You may combine 'None' with product_ids.
* The spelling of 'None' is case sensitive in the scoring metric. 
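
The submission rules above can be sketched as a small formatting function. `format_products` and the `predictions` mapping are hypothetical names introduced for illustration; only the output format (space-delimited ids, explicit 'None' for an empty order) comes from the description.

```python
def format_products(product_ids):
    """Join predicted product_ids into the space-delimited submission string."""
    if not product_ids:
        # An empty prediction must be submitted as the explicit, case-sensitive 'None'
        return "None"
    return " ".join(str(p) for p in product_ids)


# Hypothetical predictions: order_id -> list of predicted product_ids
predictions = {17: [39276, 29259], 34: []}
rows = [(order_id, format_products(ids)) for order_id, ids in predictions.items()]
print(rows)
```

Because every element is passed through `str`, the literal string `'None'` can also be mixed into a list of product_ids, which the rules allow.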

In [50]:
print(f'The dataset has {df_sample_submission.shape[0]} rows and {df_sample_submission.shape[1]} columns.')
print(f'There are duplicates: {df_sample_submission.duplicated().values.any()}')
print(df_sample_submission.info())
df_sample_submission.head()

The dataset has 75000 rows and 2 columns.
There are duplicates: False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   order_id  75000 non-null  int64 
 1   products  75000 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.1+ MB
None


Unnamed: 0,order_id,products
0,17,39276 29259
1,34,39276 29259
2,137,39276 29259
3,182,39276 29259
4,257,39276 29259


In [98]:
df_sample_submission.order_id.value_counts().reset_index()

Unnamed: 0,index,order_id
0,17,1
1,2279096,1
2,2279694,1
3,2279628,1
4,2279549,1
...,...,...
74995,1139568,1
74996,1139530,1
74997,1139519,1
74998,1139495,1


## Summary
The dataset is a relational set of files describing customers' orders over time. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

For each user, between 4 and 100 of their orders are provided, with the sequence of products purchased in each order.
The day of the week and hour of day the order was placed, and a relative measure of time between orders, are also provided.
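
Based on the columns inspected above, one possible schema could look like the sketch below. It uses an in-memory SQLite database from Python's standard library as a stand-in, since the target database is not named here. Storing the prior and train files in a single `order_products` table, and using (`order_id`, `product_id`) as its primary key, are assumptions rather than findings of this exploration.

```python
import sqlite3

# Hypothetical schema derived from the column names seen in the CSV files
schema = """
CREATE TABLE aisles (
    aisle_id INTEGER PRIMARY KEY,
    aisle    TEXT NOT NULL
);
CREATE TABLE departments (
    department_id INTEGER PRIMARY KEY,
    department    TEXT NOT NULL
);
CREATE TABLE products (
    product_id    INTEGER PRIMARY KEY,
    product_name  TEXT NOT NULL,
    aisle_id      INTEGER NOT NULL REFERENCES aisles (aisle_id),
    department_id INTEGER NOT NULL REFERENCES departments (department_id)
);
CREATE TABLE orders (
    order_id               INTEGER PRIMARY KEY,
    user_id                INTEGER NOT NULL,
    eval_set               TEXT NOT NULL,
    order_number           INTEGER NOT NULL,
    order_dow              INTEGER NOT NULL,
    order_hour_of_day      INTEGER NOT NULL,
    days_since_prior_order REAL  -- nullable: this column contains NaN values
);
CREATE TABLE order_products (
    order_id          INTEGER NOT NULL REFERENCES orders (order_id),
    product_id        INTEGER NOT NULL REFERENCES products (product_id),
    add_to_cart_order INTEGER NOT NULL,
    reordered         INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)  -- assumed composite key
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)

# List the tables that were created
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```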