# Machine Learning Nanodegree

## Capstone Project: Instacart Market Basket Analysis
### Which products will an Instacart consumer purchase again?

The dataset for this challenge is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the day of the week and hour of day the order was placed, and a relative measure of time between orders. For more information, see the blog post accompanying its public release.

### The Road Ahead

We break the notebook into separate steps.  Feel free to use the links below to navigate the notebook.

* [Step 0](#step0): Import Datasets
* [Step 1](#step1): Data Exploration
* [Step 2](#step2): Exploratory Visualizations
* [Step 3](#step3): Data Preprocessing
* [Step 4](#step4): Benchmarks
* [Step 5](#step5): Algorithm and Techniques
* [Step 6](#step6): Model Refinements
* [Step 7](#step7): Model Evaluation and Validation

---
<a id='step0'></a>
## Step 0: Import Datasets

In [None]:
### Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

%matplotlib inline

In [9]:
### Import Instacart Data
order_products_train_df = pd.read_csv("instacart_2017_05_01/order_products__train.csv")
order_products_prior_df = pd.read_csv("instacart_2017_05_01/order_products__prior.csv")
orders_df = pd.read_csv("instacart_2017_05_01/orders.csv")
products_df = pd.read_csv("instacart_2017_05_01/products.csv")
aisles_df = pd.read_csv("instacart_2017_05_01/aisles.csv")
departments_df = pd.read_csv("instacart_2017_05_01/departments.csv")

print('Total no. of orders: {}'.format(orders_df.shape[0]))
print('Total no. of products: {}'.format(products_df.shape[0]))
print('Total no. of aisles: {}'.format(aisles_df.shape[0]))
print('Total no. of departments: {}'.format(departments_df.shape[0]))

Total no. of orders: 3421083
Total no. of products: 49688
Total no. of aisles: 134
Total no. of departments: 21


---
<a id='step1'></a>
## Step 1: Data Exploration

orders_df records which set (prior, train or test) each order belongs to. Reordered items are predicted only for orders in the 'test' set.

In [4]:
orders_df.describe()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


In [5]:
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [23]:
# Count distinct users per eval_set (dict-based renaming in aggregate() is rejected by recent pandas)
order_cnt = orders_df.groupby("eval_set").user_id.nunique().reset_index(name='total_user')
order_cnt

Unnamed: 0,eval_set,total_user
0,prior,206209
1,test,75000
2,train,131209
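
The counts above add up in a useful way: 131,209 train users plus 75,000 test users equals the 206,209 users with prior orders. This suggests that every user has prior orders plus one final order in either train or test. A minimal sketch of how such a split could be checked and how the test orders could be selected, using a small made-up frame (`toy_orders` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for orders_df; the values are invented for illustration.
toy_orders = pd.DataFrame({
    "order_id": [10, 11, 12, 20, 21],
    "user_id": [1, 1, 1, 2, 2],
    "eval_set": ["prior", "prior", "train", "prior", "test"],
    "order_number": [1, 2, 3, 1, 2],
})

# Distinct users per set, mirroring the groupby above.
users_per_set = toy_orders.groupby("eval_set")["user_id"].nunique()

# Orders whose contents have to be predicted.
test_orders = toy_orders[toy_orders["eval_set"] == "test"]

# Each user should have exactly one non-prior order.
non_prior_per_user = (
    toy_orders[toy_orders["eval_set"] != "prior"].groupby("user_id").size()
)
one_target_each = bool((non_prior_per_user == 1).all())
```

Running the same checks on the real `orders_df` would confirm or refute the assumption before any features are built on it.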


The order_products_[prior/train]_df data frames specify which products were purchased in each order. order_products_prior_df contains the contents of previous orders for all customers.

'reordered' indicates that the customer has a previous order that contains the product. Some orders have no reordered items, so we may predict an explicit 'None' value for those orders.
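
As a rough sketch of how the 'reordered' flag could be turned into a per-order target, including the explicit 'None' case, the snippet below collects the reordered product IDs for each order. The frame `toy_order_products` is made up, and the space-separated string format is an assumption about how predictions might be written out, not something this notebook defines.

```python
import pandas as pd

# Toy stand-in for order_products_train_df; values are illustrative.
toy_order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2],
    "product_id": [49302, 11109, 10246, 33120, 9327],
    "reordered": [1, 1, 0, 0, 0],
})

# Keep only rows flagged as reorders.
reordered_rows = toy_order_products[toy_order_products["reordered"] == 1]

# Join reordered product IDs per order; orders with no reorders get 'None'.
targets = (
    reordered_rows.groupby("order_id")["product_id"]
    .agg(lambda ids: " ".join(str(i) for i in ids))
    .reindex(toy_order_products["order_id"].unique(), fill_value="None")
)
```

Order 2 has no reordered rows, so it drops out of the groupby and only reappears through the `reindex` call, which is where the 'None' value is filled in.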

In [15]:
order_products_prior_df.describe()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
count,32434490.0,32434490.0,32434490.0,32434490.0
mean,1710749.0,25576.34,8.351076,0.5896975
std,987300.7,14096.69,7.126671,0.4918886
min,2.0,1.0,1.0,0.0
25%,855943.0,13530.0,3.0,0.0
50%,1711048.0,25256.0,6.0,1.0
75%,2565514.0,37935.0,11.0,1.0
max,3421083.0,49688.0,145.0,1.0


In [18]:
order_products_prior_df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [16]:
order_products_train_df.describe()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
count,1384617.0,1384617.0,1384617.0,1384617.0
mean,1706298.0,25556.24,8.758044,0.5985944
std,989732.6,14121.27,7.423936,0.4901829
min,1.0,1.0,1.0,0.0
25%,843370.0,13380.0,3.0,0.0
50%,1701880.0,25298.0,7.0,1.0
75%,2568023.0,37940.0,12.0,1.0
max,3421070.0,49688.0,80.0,1.0


In [17]:
order_products_train_df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [6]:
products_df.describe()

Unnamed: 0,product_id,aisle_id,department_id
count,49688.0,49688.0,49688.0
mean,24844.5,67.769582,11.728687
std,14343.834425,38.316162,5.85041
min,1.0,1.0,1.0
25%,12422.75,35.0,7.0
50%,24844.5,69.0,13.0
75%,37266.25,100.0,17.0
max,49688.0,134.0,21.0


In [32]:
print(products_df.head())
products_df.tail()

   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38   
4           5                          Green Chile Anytime Sauce         5   

   department_id  
0             19  
1             13  
2              7  
3              1  
4             13  


Unnamed: 0,product_id,product_name,aisle_id,department_id
49683,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5
49684,49685,En Croute Roast Hazelnut Cranberry,42,1
49685,49686,Artisan Baguette,112,3
49686,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8
49687,49688,Fresh Foaming Cleanser,73,11


In [10]:
aisles_df.describe()

Unnamed: 0,aisle_id
count,134.0
mean,67.5
std,38.826537
min,1.0
25%,34.25
50%,67.5
75%,100.75
max,134.0


In [33]:
print(aisles_df.head())
print(aisles_df.tail())

   aisle_id                       aisle
0         1       prepared soups salads
1         2           specialty cheeses
2         3         energy granola bars
3         4               instant foods
4         5  marinades meat preparation
     aisle_id                       aisle
129       130    hot cereal pancake mixes
130       131                   dry pasta
131       132                      beauty
132       133  muscles joints pain relief
133       134  specialty wines champagnes


In [12]:
departments_df.describe()

Unnamed: 0,department_id
count,21.0
mean,11.0
std,6.204837
min,1.0
25%,6.0
50%,11.0
75%,16.0
max,21.0


In [30]:
departments_df

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol
5,6,international
6,7,beverages
7,8,pets
8,9,dry goods pasta
9,10,bulk


Because the products, aisles and departments data frames are related through their ID columns, we can chain two left merges on those keys to build a single data frame (pad_df, for products, aisles and departments):

In [22]:
# Left-merge departments, then aisles, onto products using explicit key columns
pad_df = products_df.merge(departments_df, on='department_id', how='left').merge(aisles_df, on='aisle_id', how='left')
pad_df.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,department,aisle
0,1,Chocolate Sandwich Cookies,61,19,snacks,cookies cakes
1,2,All-Seasons Salt,104,13,pantry,spices seasonings
2,3,Robust Golden Unsweetened Oolong Tea,94,7,beverages,tea
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,frozen,frozen meals
4,5,Green Chile Anytime Sauce,5,13,pantry,marinades meat preparation
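
A left merge can silently duplicate rows if a lookup table has repeated keys. The sketch below shows one way to guard against that with the `validate` argument of `DataFrame.merge`, using three tiny frames built from the rows shown above (the `toy_*` names are illustrative; this is a sanity check, not part of the original pipeline):

```python
import pandas as pd

toy_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Chocolate Sandwich Cookies", "All-Seasons Salt",
                     "Robust Golden Unsweetened Oolong Tea"],
    "aisle_id": [61, 104, 94],
    "department_id": [19, 13, 7],
})
toy_departments = pd.DataFrame({
    "department_id": [7, 13, 19],
    "department": ["beverages", "pantry", "snacks"],
})
toy_aisles = pd.DataFrame({
    "aisle_id": [61, 94, 104],
    "aisle": ["cookies cakes", "tea", "spices seasonings"],
})

# validate="many_to_one" raises MergeError if a lookup key is duplicated.
toy_pad = (
    toy_products
    .merge(toy_departments, on="department_id", how="left", validate="many_to_one")
    .merge(toy_aisles, on="aisle_id", how="left", validate="many_to_one")
)

# The merge should neither add rows nor leave unmatched lookups.
same_row_count = len(toy_pad) == len(toy_products)
no_missing_labels = bool(toy_pad[["department", "aisle"]].notna().all().all())
```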


<a id='step2'></a>
## Step 2: Exploratory Visualizations

In [25]:
# cnt_srs was never defined in this cell; assuming from the axis labels that it
# counts how many users share each maximum order_number.
cnt_srs = orders_df.groupby('user_id')['order_number'].max().value_counts().sort_index()

plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Maximum order number', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

---
<a id='step3'></a>
## Step 3: Data Preprocessing


### Pre-process the Data

Start by sorting the orders by user_id and, within each user, by order_number.

In [27]:
orders_df.sort_values(by=['user_id', 'order_number'], inplace=True)
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [28]:
# Replace NaN with the column mean (assign back rather than using a chained inplace fillna)
orders_df['days_since_prior_order'] = orders_df['days_since_prior_order'].fillna(orders_df['days_since_prior_order'].mean())
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,11.114836
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
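
The describe() output in Step 1 reports 3,214,874 non-null values of days_since_prior_order out of 3,421,083 orders. The 206,209 missing values match the number of users, which is consistent with the NaN marking each user's first order. A small sketch of how that assumption could be checked before imputing, on made-up data (`toy_orders` is illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for orders_df; values are invented for illustration.
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

# Does a missing gap occur exactly on first orders?
is_first = toy_orders["order_number"] == 1
is_missing = toy_orders["days_since_prior_order"].isna()
nan_only_on_first = bool((is_first == is_missing).all())

# Mean imputation as in the cell above; mean() skips NaN by default.
mean_gap = toy_orders["days_since_prior_order"].mean()
filled = toy_orders["days_since_prior_order"].fillna(mean_gap)
```

If the check holds on the real data, the imputed mean carries no information about the customer, so keeping `is_first` as a separate indicator column may be worth considering.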


---
<a id='step4'></a>
## Step 4: Benchmarks

---
<a id='step5'></a>
## Step 5: Algorithm and Techniques

PCA?

---
<a id='step6'></a>
## Step 6: Model Refinements


---
<a id='step7'></a>
## Step 7: Model Evaluation and Validation
