# Instacart Product Recommendation

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similairity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* collaborative filtering item - item article medium: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use Pyspark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time

from sklearn.model_selection import train_test_split

**Memory-based methods**
1. **User-based collaborative filtering**: In this model products are recommended to a user based on the fact that the products have been liked by users similar to the user. For example if Derrick and Dennis like the same movies and a new movie comes out that Derick likes,then we can recommend that movie to Dennis because Derrick and Dennis seem to like the same movies.
1. **Item-based collaborative filtering**: These systems identify similar items based on users’ previous ratings. For example if users A,B and C gave a 5 star rating to books X and Y then when a user D buys book Y they also get a recommendation to purchase book X because the system identifies book X and Y as similar based on the ratings of users A,B and C.

### Let's get the data into a way that can handle it. 

In [2]:
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')

order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
order_products__train = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__train.csv')
order_test = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')

In [3]:
len(order_products__prior)

32434489

In [4]:
prod_ailes = products.merge(aisles, 
              how='outer', 
              on='aisle_id', 
               suffixes=('_x', '_y')
              )

In [5]:
product_dataset = prod_ailes.merge(departments, 
                how='outer', 
                on='department_id')

In [6]:
product_dataset.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,aisle,department
0,1,Chocolate Sandwich Cookies,61,19,cookies cakes,snacks
1,78,Nutter Butter Cookie Bites Go-Pak,61,19,cookies cakes,snacks
2,102,Danish Butter Cookies,61,19,cookies cakes,snacks
3,172,Gluten Free All Natural Chocolate Chip Cookies,61,19,cookies cakes,snacks
4,285,Mini Nilla Wafers Munch Pack,61,19,cookies cakes,snacks


Let's make a dataset that allows us to see what items go with what orders 

In [6]:
specific_orders = order_products__prior.merge(product_dataset, 
                how='left', 
                on='product_id')

Now let's add user id information

In [7]:
baskets = specific_orders.merge(order_test, 
                     how='left', 
                     on='order_id')

In [8]:
baskets_samp = baskets.sample(frac=0.1, replace=True, random_state=1)

In [9]:
baskets_samp.to_csv('../../data/02_intermediate/baskets_instacart.csv')

This dataset is prohibitively large to work with. for right now, let's take a 30% sample and see if it makes the time faster. 

In [9]:
baskets_samp = baskets.sample(frac=0.001, replace=True, random_state=1)

Great, now let's see what we have! 

In [10]:
baskets_samp.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
12710949,1341545,5451,14,1,Organic Zucchini Spirals,83,4,fresh vegetables,produce,33844,prior,16,3,14,12.0
21463275,2263894,16185,4,1,Sharp Cheddar Cheese,21,16,packaged cheese,dairy eggs,171554,prior,5,1,15,5.0
6762380,713824,5876,3,1,Organic Lemon,24,4,fresh fruits,produce,129024,prior,10,6,9,3.0
12325960,1301066,6615,9,0,Mozzarella Cheese,21,16,packaged cheese,dairy eggs,164042,prior,4,0,21,30.0
491263,51941,18465,18,0,Organic Grade A Free Range Large Brown Eggs,86,16,eggs,dairy eggs,158657,prior,1,1,11,


Our data consists of over 200K unique users making orders and over 3 million unique orders. 

In [11]:
baskets_samp.user_id.nunique()

26912

In [12]:
baskets_samp.order_id.nunique()

32176

Let's take a look at the types of products we have

In [13]:
# 50k unique products
baskets_samp.product_name.nunique()

9303

How many times does each user appear in the dataset? 

In [14]:
baskets_samp.columns

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered',
       'product_name', 'aisle_id', 'department_id', 'aisle', 'department',
       'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'],
      dtype='object')

As we can see, we have over 200K different people making purchases at different frequencies and different amounts. 

In [15]:
baskets_samp.groupby(['user_id', 'order_id']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
user_id,order_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
18,1020460,1,1,1,1,1,1,1,1,1,1,1,1,1
19,2293453,1,1,1,1,1,1,1,1,1,1,1,1,1
27,3359528,1,1,1,1,1,1,1,1,1,1,1,1,1
28,2657750,1,1,1,1,1,1,1,1,1,1,1,1,1
35,2562704,2,2,2,2,2,2,2,2,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206191,541447,1,1,1,1,1,1,1,1,1,1,1,1,1
206202,1251580,1,1,1,1,1,1,1,1,1,1,1,1,1
206204,2511735,1,1,1,1,1,1,1,1,1,1,1,1,1
206206,992810,1,1,1,1,1,1,1,1,1,1,1,1,1


Let's re-arrange the columns in another order

In [16]:
baskets_samp = baskets_samp[['user_id', 'order_id', 'product_name', 
                             'product_id', 'days_since_prior_order', 
                             'add_to_cart_order', 'reordered', 'aisle_id', 
                             'department_id', 'aisle', 'department', 
                             'eval_set', 'order_number', 'order_dow', 
                             'order_hour_of_day']]

There are some interesting columns that we don't need
1. add to cart order
1. aile_id
1. depaertment_id
1. aisle
1. eval_set
1. order_number
1. order_dow
1. hour of the day
 

In [17]:
baskets_samp.drop(columns = ['add_to_cart_order', 'reordered', 'aisle_id', 
                        'department_id', 'aisle', 'eval_set', 'order_number', 
                        'order_dow', 'order_hour_of_day'], inplace=True)

In [18]:
baskets_samp.sort_values(by=['order_id'], inplace=True)

## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

### Put the data in the correct format

#### Get rid of all non-food items 

In [19]:
baskets_samp.department.unique()

array(['bakery', 'canned goods', 'produce', 'snacks', 'beverages',
       'pantry', 'dairy eggs', 'deli', 'household', 'breakfast', 'other',
       'frozen', 'dry goods pasta', 'meat seafood', 'pets', 'babies',
       'international', 'alcohol', 'missing', 'personal care', 'bulk'],
      dtype=object)

Personal care, household, babies and pets are obvious ones we should get rid of. let's take a look at the categories "other" and "missing" before we drop them. 

In [20]:
baskets_samp.loc[baskets_samp['department']=='other']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
35105,100911,3669,Cinnamon Vanilla Creme Liquid Coffee Creamer,7807,,other
1622360,201925,171001,Ultra Thin Condoms,27043,30.0,other
2347720,53545,247667,Light CocoWhip! Coconut Whipped Topping,26756,13.0,other
2807957,68881,296360,Wipe Fresh Pet Wipes,28119,7.0,other
3930048,83993,414701,Facial Mask Age Defying Hydro Serum,34404,8.0,other
5880799,189695,620732,Roasted Almond Butter,38662,7.0,other
6375136,176228,672888,Giraffes Diapers Size 4 L,40110,9.0,other
6551746,18017,691563,Cherry Vanilla Granola,45856,7.0,other
7771323,79473,820296,Sweets Organic Lollipops,24456,1.0,other
10050515,4994,1061230,Children's Grape 24-Hour,46907,8.0,other


In [21]:
baskets_samp.loc[baskets_samp['department']=='missing']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
323743,194500,34162,Organic Riced Cauliflower,41149,7.0,missing
459241,38758,48595,Organic English Seedless Cucumber,21182,5.0,missing
701422,206018,74168,Peanut Butter Ice Cream Cup,7035,5.0,missing
707245,113522,74764,Organic Cashew Nondairy Vanilla Yogurt,15063,7.0,missing
1011643,36308,106842,Raspberry Lemonade Cans,36992,9.0,missing
...,...,...,...,...,...,...
30255978,141875,3191589,Peanut Butter Ice Cream Cup,7035,8.0,missing
30282613,99735,3194385,Lime Grain-Free Tortilla Chips,2265,5.0,missing
31748786,171239,3348602,Matzo Ball Soup,7943,30.0,missing
32045227,149327,3380057,Organic Celery Bunch,38510,5.0,missing


Looks like the other and missing categories contain a lot of food items. Let's keep them for now and only drop the ones that we know we will not use. 

In [22]:
baskets_food_samp = baskets_samp.loc[(baskets_samp['department']!='personal care')|
                                      (baskets_samp['department']!='household')|
                                      (baskets_samp['department']!='babies')|
                                      (baskets_samp['department']!='pets')|
                                      (baskets_samp['department']!='other')]

### Drop additional columns that are not needed for this algorithm 

In [23]:
baskets_food_samp.head()

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
404,160959,51,Artesano Style Bread,30274,7.0,bakery
1701,189157,190,Organic Premium Tomato Paste,23400,30.0,canned goods
3460,120742,364,Organic Baby Spinach,21903,5.0,produce
3587,83896,380,Corn Tortillas,19508,26.0,bakery
3633,110980,384,Limes,26209,6.0,produce


In [24]:
baskets_food_samp.drop(columns=['product_id', 'days_since_prior_order', 'department'], inplace=True)

In [25]:
baskets_food_samp

Unnamed: 0,user_id,order_id,product_name
404,160959,51,Artesano Style Bread
1701,189157,190,Organic Premium Tomato Paste
3460,120742,364,Organic Baby Spinach
3587,83896,380,Corn Tortillas
3633,110980,384,Limes
...,...,...,...
32429587,40405,3420568,Brussels Sprouts
32430191,156190,3420632,Super Greens Salad
32431888,87965,3420807,Organic Tri-Colored Peppers
32432638,6511,3420886,Large Alfresco Eggs


I now need rows of users and columns of all items. This means that all items that a user bought will be stored row-wise. If a user bought an item more than once then that will be reflected in the number in that column. 

In [26]:
baskets_food_samp.drop(columns=['order_id'], inplace=True)

In [27]:
baskets_food_samp

Unnamed: 0,user_id,product_name
404,160959,Artesano Style Bread
1701,189157,Organic Premium Tomato Paste
3460,120742,Organic Baby Spinach
3587,83896,Corn Tortillas
3633,110980,Limes
...,...,...
32429587,40405,Brussels Sprouts
32430191,156190,Super Greens Salad
32431888,87965,Organic Tri-Colored Peppers
32432638,6511,Large Alfresco Eggs


In [28]:
basket_purcahse_count_samp = baskets_food_samp.groupby(['user_id', 'product_name'])\
                                              .agg({'product_name': 'count'})\
                                              .rename(columns={'product_name': 'purchase_count'}).reset_index()

In [33]:
basket_purcahse_count_samp.drop(columns=['purchase_count'], inplace=True)

In [34]:
basket_purcahse_count_samp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32307 entries, 0 to 32306
Data columns (total 2 columns):
user_id         32307 non-null int64
product_name    32307 non-null object
dtypes: int64(1), object(1)
memory usage: 504.9+ KB


### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommendor Systems: The Textbook (pg. 160) 