# Instacart Product Recommendation

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similairity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* collaborative filtering item - item article medium: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use Pyspark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921

**Memory-based methods**
1. **User-based collaborative filtering**: In this model products are recommended to a user based on the fact that the products have been liked by users similar to the user. For example if Derrick and Dennis like the same movies and a new movie comes out that Derick likes,then we can recommend that movie to Dennis because Derrick and Dennis seem to like the same movies.
1. **Item-based collaborative filtering**: These systems identify similar items based on users’ previous ratings. For example if users A,B and C gave a 5 star rating to books X and Y then when a user D buys book Y they also get a recommendation to purchase book X because the system identifies book X and Y as similar based on the ratings of users A,B and C.

### Let's get the data into a way that can handle it. 

In [10]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time

from sklearn.model_selection import train_test_split

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')

order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
order_products__train = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__train.csv')
order_test = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')

In [12]:
len(order_products__prior)

32434489

In [13]:
prod_ailes = products.merge(aisles, 
              how='outer', 
              on='aisle_id', 
               suffixes=('_x', '_y')
              )

In [14]:
product_dataset = prod_ailes.merge(departments, 
                how='outer', 
                on='department_id')

In [15]:
product_dataset.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,aisle,department
0,1,Chocolate Sandwich Cookies,61,19,cookies cakes,snacks
1,78,Nutter Butter Cookie Bites Go-Pak,61,19,cookies cakes,snacks
2,102,Danish Butter Cookies,61,19,cookies cakes,snacks
3,172,Gluten Free All Natural Chocolate Chip Cookies,61,19,cookies cakes,snacks
4,285,Mini Nilla Wafers Munch Pack,61,19,cookies cakes,snacks


Let's make a dataset that allows us to see what items go with what orders 

In [16]:
specific_orders = order_products__prior.merge(product_dataset, 
                how='left', 
                on='product_id')

Now let's add user id information

In [17]:
baskets = specific_orders.merge(order_test, 
                     how='left', 
                     on='order_id')

In [20]:
# baskets_samp = baskets.sample(frac=0.0001, replace=True, random_state=1)

In [48]:
baskets_samp = baskets.head(3000)

In [49]:
len(baskets_samp)

3000

This dataset is prohibitively large to work with. for right now, let's take a 30% sample and see if it makes the time faster. 

Great, now let's see what we have! 

In [51]:
baskets_samp.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2,33120,1,1,Organic Egg Whites,86,16,eggs,dairy eggs,202279,prior,3,5,9,8.0
1,2,28985,2,1,Michigan Organic Kale,83,4,fresh vegetables,produce,202279,prior,3,5,9,8.0
2,2,9327,3,0,Garlic Powder,104,13,spices seasonings,pantry,202279,prior,3,5,9,8.0
3,2,45918,4,1,Coconut Butter,19,13,oils vinegars,pantry,202279,prior,3,5,9,8.0
4,2,30035,5,0,Natural Sweetener,17,13,baking ingredients,pantry,202279,prior,3,5,9,8.0


Our data consists of over 200K unique users making orders and over 3 million unique orders. 

In [52]:
baskets_samp.user_id.nunique()

307

In [53]:
baskets_samp.order_id.nunique()

308

Let's take a look at the types of products we have

In [54]:
# 50k unique products
baskets_samp.product_name.nunique()

1963

How many times does each user appear in the dataset? 

In [55]:
baskets_samp.columns

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered',
       'product_name', 'aisle_id', 'department_id', 'aisle', 'department',
       'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'],
      dtype='object')

As we can see, we have over 200K different people making purchases at different frequencies and different amounts. 

In [56]:
baskets_samp.groupby(['user_id', 'order_id']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
user_id,order_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
382,40,4,4,4,4,4,4,4,4,4,4,4,4,4
503,214,9,9,9,9,9,9,9,9,9,9,9,9,9
971,178,2,2,2,2,2,2,2,2,2,2,2,2,2
1059,280,29,29,29,29,29,29,29,29,29,29,29,29,29
3107,8,1,1,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201744,31,10,10,10,10,10,10,10,10,10,10,10,10,10
202279,2,9,9,9,9,9,9,9,9,9,9,9,9,9
202527,35,5,5,5,5,5,5,5,5,5,5,5,5,0
204184,171,2,2,2,2,2,2,2,2,2,2,2,2,2


Let's re-arrange the columns in another order

In [57]:
baskets_samp = baskets_samp[['user_id', 'order_id', 'product_name', 
                             'product_id', 'days_since_prior_order', 
                             'add_to_cart_order', 'reordered', 'aisle_id', 
                             'department_id', 'aisle', 'department', 
                             'eval_set', 'order_number', 'order_dow', 
                             'order_hour_of_day']]

There are some interesting columns that we don't need
1. add to cart order
1. aile_id
1. depaertment_id
1. aisle
1. eval_set
1. order_number
1. order_dow
1. hour of the day
 

In [58]:
baskets_samp.drop(columns = ['add_to_cart_order', 'reordered', 'aisle_id', 
                        'department_id', 'aisle', 'eval_set', 'order_number', 
                        'order_dow', 'order_hour_of_day'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [59]:
baskets_samp.sort_values(by=['order_id'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

### Put the data in the correct format

#### Get rid of all non-food items 

In [60]:
baskets_samp.department.unique()

array(['dairy eggs', 'produce', 'pantry', 'bakery', 'meat seafood',
       'snacks', 'beverages', 'breakfast', 'personal care', 'household',
       'dry goods pasta', 'deli', 'international', 'frozen',
       'canned goods', 'babies', 'pets', 'alcohol', 'bulk', 'missing',
       'other'], dtype=object)

Personal care, household, babies and pets are obvious ones we should get rid of. let's take a look at the categories "other" and "missing" before we drop them. 

In [61]:
baskets_samp.loc[baskets_samp['department']=='other']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
666,197745,79,Coffee Mate French Vanilla Creamer Packets,39461,3.0,other
996,86865,109,SleepGels Nighttime Sleep Aid,36066,17.0,other
1817,154766,202,Roasted Unsalted Almonds,20406,4.0,other
1876,43756,210,"Camilia, Single Liquid Doses",86,14.0,other


In [62]:
baskets_samp.loc[baskets_samp['department']=='missing']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
629,106387,75,Tomato Basil Bisque Soup,44077,,missing
1419,73310,154,Cold Pressed Watermelon & Lemon Juice Blend,41801,0.0,missing
1420,73310,154,Paleo Blueberry Muffin,11806,0.0,missing


Looks like the other and missing categories contain a lot of food items. Let's keep them for now and only drop the ones that we know we will not use. 

In [63]:
baskets_food_samp = baskets_samp.loc[(baskets_samp['department']!='personal care')|
                                      (baskets_samp['department']!='household')|
                                      (baskets_samp['department']!='babies')|
                                      (baskets_samp['department']!='pets')|
                                      (baskets_samp['department']!='other')]

### Drop additional columns that are not needed for this algorithm 

In [64]:
baskets_food_samp.head()

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
0,202279,2,Organic Egg Whites,33120,8.0,dairy eggs
1,202279,2,Michigan Organic Kale,28985,8.0,produce
2,202279,2,Garlic Powder,9327,8.0,pantry
3,202279,2,Coconut Butter,45918,8.0,pantry
4,202279,2,Natural Sweetener,30035,8.0,pantry


In [65]:
baskets_food_samp.drop(columns=['product_id', 'days_since_prior_order', 'department', 'order_id'], inplace=True)

In [66]:
baskets_food_samp

Unnamed: 0,user_id,order_id,product_name
0,202279,2,Organic Egg Whites
1,202279,2,Michigan Organic Kale
2,202279,2,Garlic Powder
3,202279,2,Coconut Butter
4,202279,2,Natural Sweetener
...,...,...,...
2989,36278,322,Roasted Salted Cashews
2992,36278,322,Lactose Free Sour Cream
2997,36278,322,Artichokes
2998,60766,323,Organic Hass Avocado


I now need rows of users and columns of all items. This means that all items that a user bought will be stored row-wise. If a user bought an item more than once then that will be reflected in the number in that column. 

In [79]:
basket_purcahse_count_samp = baskets_food_samp.groupby(['user_id', 'product_name'])\
                                              .agg({'product_name': 'count'})\
                                              .rename(columns={'product_name': 'purchase_count'}).reset_index()

In [80]:
basket_purcahse_count_samp.head()

Unnamed: 0,user_id,product_name,purchase_count
0,382,Chocolate Milk 1% Milkfat,1
1,382,Macaroni & Cheese,1
2,382,Organic 1% Low Fat Milk,1
3,382,Sparkling Natural Mineral Water,1
4,503,Banana,1
...,...,...,...
2994,205970,Organic Ezekiel 49 Bread Cinnamon Raisin,1
2995,205970,Organic Ginger Root,1
2996,205970,Total 2% with Strawberry Lowfat Greek Strained...,1
2997,205970,Unsweetened Almondmilk,1


In [98]:
basket_purcahse_count_samp['purchase_count'] = 1

In [99]:
basket_purcahse_count_samp.head()

Unnamed: 0,user_id,product_name,purchase_count
0,382,Chocolate Milk 1% Milkfat,1
1,382,Macaroni & Cheese,1
2,382,Organic 1% Low Fat Milk,1
3,382,Sparkling Natural Mineral Water,1
4,503,Banana,1


In [101]:
usr_sparse = basket_purcahse_count_samp.pivot(index='user_id', columns='product_name', values='purchase_count')

In [104]:
usr_matrix = usr_sparse.replace(np.nan, 0)

In [105]:
usr_matrix.shape

(307, 1963)

In [110]:
usr_matrix[0% Greek Strained Yogurt', '1 Apple + 1 Pear Fruit Bar]

SyntaxError: invalid syntax (<ipython-input-110-6856c44b27cd>, line 1)

### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommendor Systems: The Textbook (pg. 160) 