# Instacart Market Basket Analysis

## Index
1. Extract the data and fill the gaps
2. Understand the data
3. Preprocessing stage
4. Training different models to set a Benchmark
5. Improve the Model and Repeat
6. Predict on Test data.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
import os
# List all the data we have.
print(os.listdir('data'))

## 1. Extract the Data

In [None]:
aisles      = pd.read_csv('data/aisles.csv')
departments = pd.read_csv('data/departments.csv')
order_prior = pd.read_csv('data/order_products__prior.csv')
order_train = pd.read_csv('data/order_products__train.csv')
orders      = pd.read_csv('data/orders.csv')
products    = pd.read_csv('data/products.csv')

In [None]:
print(aisles.shape)
print(aisles.head())

In [None]:
print(departments.shape)
print(departments.head())

In [None]:
print (order_prior.shape)
print(order_prior.sort_values('order_id').head(10))

In [None]:
print(order_train.shape)
print(order_train[ order_train['order_id']==1] )

In [None]:
print(orders.shape,'\n',orders.head())

In [None]:
print(products.shape)
print(products.head())

## 2. Understanding Data

### 2.1 Chance of items being ordered for the first time
About 60% of the items in the training orders are reorders; the rest are items the user is ordering for the first time.
So even if we correctly predict every item that will be reordered, there is still roughly a 40% chance that any given item in the user's next order is one they have never bought before.

In [None]:
detailed_train_data = order_train
# Count rows flagged as reorders vs. first-time purchases
reordered = (detailed_train_data['reordered'] == 1).sum()
not_reordered = (detailed_train_data['reordered'] == 0).sum()
print("percentage of reordered items =", reordered * 100 / (reordered + not_reordered))
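Because `reordered` is a 0/1 flag, the same share can also be read off as the column mean. A minimal sketch on a made-up frame (`toy_train` and its values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for order_train; the values are invented for illustration.
toy_train = pd.DataFrame({
    'order_id':   [1, 1, 1, 2, 2],
    'product_id': [10, 11, 12, 10, 13],
    'reordered':  [1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1.
reordered_pct = 100 * toy_train['reordered'].mean()
new_item_pct = 100 - reordered_pct
print("reordered items: %.1f%%, first-time items: %.1f%%" % (reordered_pct, new_item_pct))
```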

Let's see which departments are reordered most often in the training set.

In [None]:
# Join the order rows with products and departments to attach readable names to each product
detailed_train_data = pd.merge(order_train, products, how='left', on='product_id')
detailed_train_data = pd.merge(detailed_train_data, departments, how='left', on='department_id')

In [None]:
# Extract only elements which have been ordered before
group = detailed_train_data[detailed_train_data['reordered']==1]

In [None]:
# Group by department and count how many reordered items each one has
group = group[['department','reordered','department_id']].groupby(['department','department_id']).sum()
group = group.reset_index()

### 2.2 Each department is popular in its own way

The pie chart below shows that produce accounts for the largest number of reordered items, followed by dairy eggs.

In [None]:
group = group.sort_values(by=['reordered'],ascending=[False])
temp = group.reordered
label = np.array(group.department)
values = np.array(100*temp/temp.sum())
sns.set(font_scale=1.5)
plt.figure(figsize=(15,15))
plt.pie(values, labels=label, autopct='%1.1f%%', startangle=210)
plt.title("Departments distribution", fontsize=15)
plt.show()

#### 2.2.1 It's not always about the quantity
Although produce has the largest number of reorders, we also need to see what fraction of all the produce items ordered are reorders. By that measure, dairy eggs has a higher reorder rate. For example, suppose 10 produce items were ordered and 7 of them were reorders, while 3 dairy eggs items were ordered and all 3 were reorders. Produce has more reorders (7 vs. 3), but dairy eggs has the higher rate (3/3 vs. 7/10), so predicting that dairy eggs items will be reordered would give us better results.

Raw reorder counts from the training data show that a few departments are reordered far more than others, but counts alone are not a fair comparison: if every item ordered from the babies department were a reorder, its measure should be high no matter how few items that is. So we should consider

                       (total reorders in a category) / (total orders in that category)

As we can see below, although snacks has more reorders than, say, pets, the ratio does not show the same ordering. This gives a clearer picture: even though snacks are ordered relatively often, they are not reordered at the same rate as pet products.

This could reflect how people like to experiment with snacks. The many flavors available for each snack product might also contribute to this behaviour.
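The difference between the count and the ratio can be sketched on a toy frame that mirrors the produce vs. dairy eggs example above. The frame `toy` and its values are invented for illustration; the column names follow the merged training data.

```python
import pandas as pd

# Toy data: 10 produce items (7 reordered) and 3 dairy eggs items (all reordered).
# Values are illustrative only, not taken from the dataset.
toy = pd.DataFrame({
    'department': ['produce'] * 10 + ['dairy eggs'] * 3,
    'reordered':  [1] * 7 + [0] * 3 + [1] * 3,
})

# sum   -> number of reordered items per department
# count -> number of ordered items per department
# mean  -> reorder ratio (sum / count), since the flag is 0/1
summary = toy.groupby('department')['reordered'].agg(
    reorders='sum', total='count', ratio='mean'
)
print(summary.sort_values('ratio', ascending=False))
```

Produce wins on `reorders`, while dairy eggs wins on `ratio`, which is the quantity the formula above describes.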

In [None]:
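# Total number of ordered items per department (row counts)
total_group = detailed_train_data.groupby(['department','department_id']).count().reset_index()

# Re-group the reorder counts so the rows are back in department order and line up with total_group
group = group[['department','reordered','department_id']].groupby(['department','department_id']).sum()
group = group.reset_index()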

In [None]:
ratio = pd.DataFrame({'department':total_group.department,'ratio':group['reordered']/total_group.reordered}).sort_values('ratio',ascending=False)

In [None]:
plt.figure(figsize=(20,10))
sns.set(font_scale=3) 
sns.pointplot(x=ratio.department, y=ratio.ratio)
plt.xlabel('Department', fontsize=25)
plt.ylabel('Reorder ratio', fontsize=25)
plt.title("Rate of Reorders", fontsize=30)
plt.xticks(rotation='vertical')
plt.show()

In [None]:
total_products = pd.merge(products,departments,how='left', left_on=['department_id'], right_on = ['department_id'])
total_products = total_products.groupby('department').agg('count').reset_index().sort_values('product_id',ascending=False)
plt.figure(figsize=(20,10))
sns.set(font_scale=3) 
sns.pointplot(x=total_products.department, y=total_products.product_id)
plt.xlabel('Department', fontsize=25)
plt.ylabel('Number of Products', fontsize=25)
plt.title("Size of each Department", fontsize=30)
plt.xticks(rotation='vertical')
plt.show()

In [None]:
joint = pd.merge(total_products[['department','product_id']],ratio,on='department').reset_index()
joint.columns=['i','department','product_count','reorder_ratio']
scaler = MinMaxScaler()
joint[['product_count','reorder_ratio']] = scaler.fit_transform(joint[['product_count','reorder_ratio']])

In [None]:
plt.figure(figsize=(20,10))
sns.set(font_scale=3) 
sns.pointplot(x=joint.department, y=joint.reorder_ratio)
sns.pointplot(x=joint.department, y=joint.product_count, color='red')
plt.xlabel('Department', fontsize=25)
plt.ylabel('Normalized value', fontsize=25)
plt.title("Red = product_count, Blue = reorder ratio", fontsize=30)
plt.xticks(rotation='vertical')
plt.show()

## Freedom of more
When a department offers more products, as in the first 3 departments, the reorder ratio is very low, with snacks as the exception.
This might indicate that these departments are highly experimental: with so many options available, users are less inclined to order the same item again, which makes reorders in these departments hard to predict.

## Average time between reorders.

In [None]:
num_records = orders.shape[0]
print(orders.head(3))

In [None]:
prior_orders = orders[orders.eval_set == 'prior']
train_orders = orders[orders.eval_set == 'train']
test_orders  = orders[orders.eval_set == 'test']

In [None]:
prior_data = pd.merge(prior_orders,order_prior)
train_data = pd.merge(train_orders,order_train)

In [None]:
print(prior_data.sort_values(by=['user_id','order_number']).head())

In [None]:
# Create the list of products that need to be predicted.
predict_products = pd.DataFrame()
# Select distinct user ids
predict_products['user_id'] = prior_data['user_id'].drop_duplicates()
temp = pd.DataFrame( prior_data.groupby(['user_id'])['product_id'].apply(set).reset_index() )
predict_products = pd.merge(predict_products,temp)

In [None]:
print(predict_products[predict_products['user_id']==3]['product_id'].values)

## Create Feature set

To create a benchmark I'll be using 4 basic features.
1. Reorder ratio ( The fraction of a user's purchases of the product that were reorders )
2. Aisle ID ( The aisle from which the product is selected )
3. Add to cart order ( The average add-to-cart position of the product )
4. Avg number of items in the cart ( This is a user-level property )
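The four features listed above can be sketched end to end on a tiny frame. This is only an illustration under stated assumptions: `toy_prior`, `toy_products`, and every value in them are made up, and the output column names (`reorder_ratio`, `avg_add_order`, `avg_basket_size`) are chosen for readability rather than taken from the notebook.

```python
import pandas as pd

# Toy stand-in for prior_data; the values are invented for illustration.
toy_prior = pd.DataFrame({
    'user_id':           [1, 1, 1, 1, 1],
    'order_id':          [100, 100, 101, 101, 101],
    'product_id':        [10, 11, 10, 11, 12],
    'reordered':         [0, 0, 1, 1, 0],
    'add_to_cart_order': [1, 2, 2, 1, 3],
})
# Hypothetical product -> aisle lookup (stand-in for products.csv).
toy_products = pd.DataFrame({'product_id': [10, 11, 12], 'aisle_id': [24, 83, 24]})

# Features 1 and 3: one row per (user, product).
user_product = (toy_prior.groupby(['user_id', 'product_id'])
                .agg(reorder_ratio=('reordered', 'mean'),
                     avg_add_order=('add_to_cart_order', 'mean'))
                .reset_index())

# Feature 4: basket size per order, then averaged per user.
basket_size = toy_prior.groupby(['user_id', 'order_id']).size()
avg_basket = (basket_size.groupby(level='user_id').mean()
              .rename('avg_basket_size').reset_index())

# Feature 2: attach the aisle, then combine everything into one table.
toy_features = (user_product
                .merge(avg_basket, on='user_id')
                .merge(toy_products, on='product_id'))
print(toy_features)
```

Named aggregation keeps the column names flat, which avoids the `mean`-suffix renaming that a `.agg(['mean'])` call requires afterwards.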

In [None]:
# Number of products in each order, one row per (user, order)
prior_user_order   = pd.DataFrame(prior_data.groupby(['user_id','order_id']).size().reset_index())
# Mean reorder flag and mean add-to-cart position per (user, product)
prior_user_product = prior_data[['user_id','product_id','reordered','add_to_cart_order']].groupby(['user_id','product_id']).agg(['mean'])

In [None]:
# Two features obtained.
feature_add_order = pd.DataFrame(prior_user_product['add_to_cart_order']['mean']).reset_index()
feature_reorder   = pd.DataFrame(prior_user_product['reordered']['mean']).reset_index()

In [None]:
# To calculate the avg number of products in a user's basket, we need to group twice:
# first per (user, order) to get the basket size (done above), then per user to average it.
prior_user_order.columns = ['user_id','order_id','num_products']
prior_user_order = prior_user_order[['user_id','num_products']].groupby('user_id').agg(['mean'])
feature_num_orders = pd.DataFrame(prior_user_order['num_products']['mean']).reset_index()

In [None]:
print(feature_add_order.head(),feature_num_orders.head(),feature_reorder.head())

In [None]:
feature_all = feature_add_order.merge(feature_reorder,on=(['user_id','product_id']),suffixes=('_add_order','_reorder'))

feature_all = feature_all.merge(feature_num_orders,on=(['user_id']),suffixes=('_','_num_orders'))

feature_all = feature_all.merge(products[['product_id','aisle_id']],on=(['product_id']))

feature_all = feature_all.rename(columns={'mean':'avg_orders','mean_add_order':'avg_add_order','mean_reorder':'reorder_ratio'})

In [None]:
# These are the features for all products. Now we have to create target variables from train data.
print(feature_all[ feature_all.user_id==1 ])
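One possible way to derive the target variable mentioned in the comment above is to mark each (user, product) feature row with 1 if that product appears in the user's train order and 0 otherwise. This is a sketch under assumptions, not necessarily the approach used later: `toy_features`, `toy_train`, and the `target` column are hypothetical names with made-up values.

```python
import pandas as pd

# Toy feature rows: one per (user, product) seen in prior orders.
toy_features = pd.DataFrame({
    'user_id':    [1, 1, 1],
    'product_id': [10, 11, 12],
})
# Toy stand-in for train_data: what the user bought in the train order.
toy_train = pd.DataFrame({
    'user_id':    [1, 1],
    'product_id': [10, 13],   # 13 is a first-time item with no feature row
})

# Left-merge with indicator=True: '_merge' is 'both' when the pair also
# appears in the train order and 'left_only' otherwise.
labeled = toy_features.merge(
    toy_train[['user_id', 'product_id']].drop_duplicates(),
    on=['user_id', 'product_id'], how='left', indicator=True,
)
labeled['target'] = (labeled['_merge'] == 'both').astype(int)
labeled = labeled.drop(columns='_merge')
print(labeled)
```

Note that first-time items such as product 13 never receive a row, which matches the earlier observation that new items cannot be recovered from a user's order history.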