# Instacart Product Recommendation

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similairity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time

from sklearn.model_selection import train_test_split

**Memory-based methods**
1. **User-based collaborative filtering**: In this model products are recommended to a user based on the fact that the products have been liked by users similar to the user. For example if Derrick and Dennis like the same movies and a new movie comes out that Derick likes,then we can recommend that movie to Dennis because Derrick and Dennis seem to like the same movies.
1. **Item-based collaborative filtering**: These systems identify similar items based on users’ previous ratings. For example if users A,B and C gave a 5 star rating to books X and Y then when a user D buys book Y they also get a recommendation to purchase book X because the system identifies book X and Y as similar based on the ratings of users A,B and C.

### Let's get the data into a way that can handle it. 

In [2]:
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')

order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
order_products__train = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__train.csv')
order_test = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')

In [3]:
len(order_products__prior)

32434489

In [4]:
prod_ailes = products.merge(aisles, 
              how='outer', 
              on='aisle_id', 
               suffixes=('_x', '_y')
              )

In [5]:
product_dataset = prod_ailes.merge(departments, 
                how='outer', 
                on='department_id')

In [6]:
product_dataset.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,aisle,department
0,1,Chocolate Sandwich Cookies,61,19,cookies cakes,snacks
1,78,Nutter Butter Cookie Bites Go-Pak,61,19,cookies cakes,snacks
2,102,Danish Butter Cookies,61,19,cookies cakes,snacks
3,172,Gluten Free All Natural Chocolate Chip Cookies,61,19,cookies cakes,snacks
4,285,Mini Nilla Wafers Munch Pack,61,19,cookies cakes,snacks


Let's make a dataset that allows us to see what items go with what orders 

In [7]:
specific_orders = order_products__prior.merge(product_dataset, 
                how='left', 
                on='product_id')

Now let's add user id information

In [8]:
baskets = specific_orders.merge(order_test, 
                     how='left', 
                     on='order_id')

This dataset is prohibitively large to work with. for right now, let's take a 30% sample and see if it makes the time faster. 

In [9]:
baskets_samp = baskets.sample(frac=0.3, replace=True, random_state=1)

Great, now let's see what we have! 

In [10]:
baskets_samp.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
12710949,1341545,5451,14,1,Organic Zucchini Spirals,83,4,fresh vegetables,produce,33844,prior,16,3,14,12.0
21463275,2263894,16185,4,1,Sharp Cheddar Cheese,21,16,packaged cheese,dairy eggs,171554,prior,5,1,15,5.0
6762380,713824,5876,3,1,Organic Lemon,24,4,fresh fruits,produce,129024,prior,10,6,9,3.0
12325960,1301066,6615,9,0,Mozzarella Cheese,21,16,packaged cheese,dairy eggs,164042,prior,4,0,21,30.0
491263,51941,18465,18,0,Organic Grade A Free Range Large Brown Eggs,86,16,eggs,dairy eggs,158657,prior,1,1,11,


Our data consists of over 200K unique users making orders and over 3 million unique orders. 

In [11]:
baskets_samp.user_id.nunique()

205091

In [12]:
baskets_samp.order_id.nunique()

2674092

Let's take a look at the types of products we have

In [13]:
# 50k unique products
baskets_samp.product_name.nunique()

47973

How many times does each user appear in the dataset? 

In [14]:
baskets_samp.columns

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered',
       'product_name', 'aisle_id', 'department_id', 'aisle', 'department',
       'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'],
      dtype='object')

As we can see, we have over 200K different people making purchases at different frequencies and different amounts. 

In [15]:
baskets_samp.groupby(['user_id', 'order_id']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
user_id,order_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1,431534,2,2,2,2,2,2,2,2,2,2,2,2,2
1,473747,1,1,1,1,1,1,1,1,1,1,1,1,1
1,550135,3,3,3,3,3,3,3,3,3,3,3,3,3
1,2254736,1,1,1,1,1,1,1,1,1,1,1,1,1
1,2295261,1,1,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206209,2129269,4,4,4,4,4,4,4,4,4,4,4,4,4
206209,2266710,4,4,4,4,4,4,4,4,4,4,4,4,4
206209,2558525,3,3,3,3,3,3,3,3,3,3,3,3,3
206209,2977660,1,1,1,1,1,1,1,1,1,1,1,1,1


Let's re-arrange the columns in another order

In [16]:
baskets_samp = baskets_samp[['user_id', 'order_id', 'product_name', 
                             'product_id', 'days_since_prior_order', 
                             'add_to_cart_order', 'reordered', 'aisle_id', 
                             'department_id', 'aisle', 'department', 
                             'eval_set', 'order_number', 'order_dow', 
                             'order_hour_of_day']]

There are some interesting columns that we don't need
1. add to cart order
1. aile_id
1. depaertment_id
1. aisle
1. eval_set
1. order_number
1. order_dow
1. hour of the day
 

In [19]:
baskets_samp.drop(columns = ['add_to_cart_order', 'reordered', 'aisle_id', 
                        'department_id', 'aisle', 'eval_set', 'order_number', 
                        'order_dow', 'order_hour_of_day'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [20]:
baskets_samp.sort_values(by=['order_id'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

### Put the data in the correct format

#### Get rid of all non-food items 

In [21]:
baskets_samp.department.unique()

array(['pantry', 'dairy eggs', 'meat seafood', 'breakfast', 'beverages',
       'produce', 'deli', 'snacks', 'household', 'dry goods pasta',
       'bakery', 'canned goods', 'frozen', 'personal care',
       'international', 'bulk', 'pets', 'missing', 'babies', 'alcohol',
       'other'], dtype=object)

Personal care, household, babies and pets are obvious ones we should get rid of. let's take a look at the categories "other" and "missing" before we drop them. 

In [22]:
baskets_samp.loc[baskets_samp['department']=='other']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
11214,191063,1160,"Detox, Bentonite, Great Plains",34907,3.0,other
11214,191063,1160,"Detox, Bentonite, Great Plains",34907,3.0,other
11920,77115,1238,Max AAA Batteries,44931,8.0,other
14363,22698,1504,Tulips,45884,8.0,other
17817,99220,1876,Roasted Almond Butter,38662,30.0,other
...,...,...,...,...,...,...
32414533,158409,3419015,Cotes De Provence,3622,4.0,other
32414812,61455,3419048,Oral Electrolyte Powder Assorted Flavors,13608,1.0,other
32426211,190400,3420233,93/7 Ground Beef,32115,4.0,other
32426944,68487,3420306,Margarita Salt,27371,,other


In [23]:
baskets_samp.loc[baskets_samp['department']=='missing']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
1420,73310,154,Paleo Blueberry Muffin,11806,0.0,missing
3975,161762,420,Organic Poblano Pepper,7456,2.0,missing
8556,1264,889,Organic Pineapple Cottage Cheese,47105,4.0,missing
9978,136972,1019,Soft & Chewy Strawberry Newtons,30052,9.0,missing
12585,125432,1301,Green Lemonade,48551,7.0,missing
...,...,...,...,...,...,...
32430726,31550,3420689,"Fruit & Nut Bar, Dark Chocolate & Cherry Cashew",34347,2.0,missing
32430726,31550,3420689,"Fruit & Nut Bar, Dark Chocolate & Cherry Cashew",34347,2.0,missing
32431312,149798,3420747,Plain Organic Grassmilk Yogurt Cup,27767,4.0,missing
32434240,206030,3421050,Organic Riced Cauliflower,41149,5.0,missing


Looks like the other and missing categories contain a lot of food items. Let's keep them for now and only drop the ones that we know we will not use. 

In [26]:
baskets_food_samp = baskets_samp.loc[(baskets_samp['department']!='personal care')|
                                      (baskets_samp['department']!='household')|
                                      (baskets_samp['department']!='babies')|
                                      (baskets_samp['department']!='pets')|
                                      (baskets_samp['department']!='other')]

### Drop additional columns that are not needed for this algorithm 

In [27]:
baskets_food_samp.head()

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
4,202279,2,Natural Sweetener,30035,8.0,pantry
2,202279,2,Garlic Powder,9327,8.0,pantry
13,205970,3,Unsweetened Chocolate Almond Breeze Almond Milk,17668,12.0,dairy eggs
15,205970,3,Air Chilled Organic Boneless Skinless Chicken ...,17461,12.0,meat seafood
22,178520,4,Nutri-Grain Soft Baked Strawberry Cereal Break...,21351,7.0,breakfast


In [28]:
baskets_food_samp.drop(columns=['product_id', 'days_since_prior_order', 'department'], inplace=True)

In [29]:
baskets_food_samp

Unnamed: 0,user_id,order_id,product_name
4,202279,2,Natural Sweetener
2,202279,2,Garlic Powder
13,205970,3,Unsweetened Chocolate Almond Breeze Almond Milk
15,205970,3,Air Chilled Organic Boneless Skinless Chicken ...
22,178520,4,Nutri-Grain Soft Baked Strawberry Cereal Break...
...,...,...,...
32434485,25247,3421083,Organic Mini Sandwich Crackers Peanut Butter
32434488,25247,3421083,Organic Sweet & Salty Peanut Pretzel Granola ...
32434484,25247,3421083,Free & Clear Natural Dishwasher Detergent
32434479,25247,3421083,Freeze Dried Mango Slices


I now need rows of users and columns of all items. This means that all items that a user bought will be stored row-wise. If a user bought an item more than once then that will be reflected in the number in that column. 

In [None]:
baskets_food_samp.drop(columns=['order_id'], inplace=True)

### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommendor Systems: The Textbook (pg. 160) 