# Instacart Product Recommendation

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similairity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* collaborative filtering item - item article medium: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use Pyspark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921
* **association rule algorithm**: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/
* **appriori**: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

***
* **Practical Business Python**: https://pbpython.com/market-basket-analysis.html
* **Market Basket Analysis Notebook**: https://github.com/chris1610/pbpython/blob/master/notebooks/Market_Basket_Intro.ipynb

**Memory-based methods**
1. **User-based collaborative filtering**: In this model products are recommended to a user based on the fact that the products have been liked by users similar to the user. For example if Derrick and Dennis like the same movies and a new movie comes out that Derick likes,then we can recommend that movie to Dennis because Derrick and Dennis seem to like the same movies.
1. **Item-based collaborative filtering**: These systems identify similar items based on users’ previous ratings. For example if users A,B and C gave a 5 star rating to books X and Y then when a user D buys book Y they also get a recommendation to purchase book X because the system identifies book X and Y as similar based on the ratings of users A,B and C.

### Let's get the data into a way that can handle it. 

In [115]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time

from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
from sklearn.model_selection import train_test_split

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')

order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
order_products__train = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__train.csv')
order_test = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')

In [12]:
len(order_products__prior)

32434489

In [194]:
prod_ailes = products.merge(aisles, 
              how='outer', 
              on='aisle_id', 
               suffixes=('_x', '_y')
              )

In [195]:
product_dataset = prod_ailes.merge(departments, 
                how='outer', 
                on='department_id')

In [196]:
product_dataset.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,aisle,department
0,1,Chocolate Sandwich Cookies,61,19,cookies cakes,snacks
1,78,Nutter Butter Cookie Bites Go-Pak,61,19,cookies cakes,snacks
2,102,Danish Butter Cookies,61,19,cookies cakes,snacks
3,172,Gluten Free All Natural Chocolate Chip Cookies,61,19,cookies cakes,snacks
4,285,Mini Nilla Wafers Munch Pack,61,19,cookies cakes,snacks


Let's make a dataset that allows us to see what items go with what orders 

In [197]:
specific_orders = order_products__prior.merge(product_dataset, 
                how='left', 
                on='product_id')

Now let's add user id information

In [177]:
baskets = specific_orders.merge(order_test, 
                     how='left', 
                     on='order_id')

In [178]:
baskets_samp = baskets.head(3000)

In [179]:
len(baskets_samp)

3000

This dataset is prohibitively large to work with. for right now, let's take a 30% sample and see if it makes the time faster. 

Great, now let's see what we have! 

In [51]:
baskets_samp.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2,33120,1,1,Organic Egg Whites,86,16,eggs,dairy eggs,202279,prior,3,5,9,8.0
1,2,28985,2,1,Michigan Organic Kale,83,4,fresh vegetables,produce,202279,prior,3,5,9,8.0
2,2,9327,3,0,Garlic Powder,104,13,spices seasonings,pantry,202279,prior,3,5,9,8.0
3,2,45918,4,1,Coconut Butter,19,13,oils vinegars,pantry,202279,prior,3,5,9,8.0
4,2,30035,5,0,Natural Sweetener,17,13,baking ingredients,pantry,202279,prior,3,5,9,8.0


Our data consists of over 200K unique users making orders and over 3 million unique orders. 

In [52]:
baskets_samp.user_id.nunique()

307

In [53]:
baskets_samp.order_id.nunique()

308

Let's take a look at the types of products we have

In [54]:
# 50k unique products
baskets_samp.product_name.nunique()

1963

How many times does each user appear in the dataset? 

In [55]:
baskets_samp.columns

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered',
       'product_name', 'aisle_id', 'department_id', 'aisle', 'department',
       'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'],
      dtype='object')

As we can see, we have over 200K different people making purchases at different frequencies and different amounts. 

In [56]:
baskets_samp.groupby(['user_id', 'order_id']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
user_id,order_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
382,40,4,4,4,4,4,4,4,4,4,4,4,4,4
503,214,9,9,9,9,9,9,9,9,9,9,9,9,9
971,178,2,2,2,2,2,2,2,2,2,2,2,2,2
1059,280,29,29,29,29,29,29,29,29,29,29,29,29,29
3107,8,1,1,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201744,31,10,10,10,10,10,10,10,10,10,10,10,10,10
202279,2,9,9,9,9,9,9,9,9,9,9,9,9,9
202527,35,5,5,5,5,5,5,5,5,5,5,5,5,0
204184,171,2,2,2,2,2,2,2,2,2,2,2,2,2


Let's re-arrange the columns in another order

In [193]:
baskets_samp = baskets_samp[['user_id', 'order_id', 'product_name', 
                             'product_id', 'days_since_prior_order', 
                             'add_to_cart_order', 'reordered', 'aisle_id', 
                             'department_id', 'aisle', 'department', 
                             'eval_set', 'order_number', 'order_dow', 
                             'order_hour_of_day']]

KeyError: "['reordered', 'order_hour_of_day', 'add_to_cart_order', 'order_number', 'department_id', 'order_dow', 'aisle_id', 'aisle', 'eval_set'] not in index"

There are some interesting columns that we don't need
1. add to cart order
1. aile_id
1. depaertment_id
1. aisle
1. eval_set
1. order_number
1. order_dow
1. hour of the day
 

In [181]:
baskets_samp.drop(columns = ['add_to_cart_order', 'reordered', 'aisle_id', 
                        'department_id', 'aisle', 'eval_set', 'order_number', 
                        'order_dow', 'order_hour_of_day'], inplace=True)

In [182]:
baskets_samp.sort_values(by=['order_id'], inplace=True)

## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

### Put the data in the correct format

#### Get rid of all non-food items 

In [183]:
baskets_samp.department.unique()

array(['dairy eggs', 'produce', 'pantry', 'bakery', 'meat seafood',
       'snacks', 'beverages', 'breakfast', 'personal care', 'household',
       'dry goods pasta', 'deli', 'international', 'frozen',
       'canned goods', 'babies', 'pets', 'alcohol', 'bulk', 'missing',
       'other'], dtype=object)

Personal care, household, babies and pets are obvious ones we should get rid of. let's take a look at the categories "other" and "missing" before we drop them. 

In [184]:
baskets_samp.loc[baskets_samp['department']=='other']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
666,197745,79,Coffee Mate French Vanilla Creamer Packets,39461,3.0,other
996,86865,109,SleepGels Nighttime Sleep Aid,36066,17.0,other
1817,154766,202,Roasted Unsalted Almonds,20406,4.0,other
1876,43756,210,"Camilia, Single Liquid Doses",86,14.0,other


In [185]:
baskets_samp.loc[baskets_samp['department']=='missing']

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
629,106387,75,Tomato Basil Bisque Soup,44077,,missing
1419,73310,154,Cold Pressed Watermelon & Lemon Juice Blend,41801,0.0,missing
1420,73310,154,Paleo Blueberry Muffin,11806,0.0,missing


Looks like the other and missing categories contain a lot of food items. Let's keep them for now and only drop the ones that we know we will not use. 

In [186]:
baskets_food_samp = baskets_samp.loc[(baskets_samp['department']!='personal care')|
                                      (baskets_samp['department']!='household')|
                                      (baskets_samp['department']!='babies')|
                                      (baskets_samp['department']!='pets')|
                                      (baskets_samp['department']!='other')|
                                     (baskets_samp['department']!='alcohol')|
                                    (baskets_samp['department']!='snacks')]

### Drop additional columns that are not needed for this algorithm 

In [187]:
baskets_food_samp.head()

Unnamed: 0,user_id,order_id,product_name,product_id,days_since_prior_order,department
0,202279,2,Organic Egg Whites,33120,8.0,dairy eggs
1,202279,2,Michigan Organic Kale,28985,8.0,produce
2,202279,2,Garlic Powder,9327,8.0,pantry
3,202279,2,Coconut Butter,45918,8.0,pantry
4,202279,2,Natural Sweetener,30035,8.0,pantry


In [188]:
baskets_food_samp.drop(columns=['product_id', 'days_since_prior_order', 'department'], inplace=True)

In [189]:
baskets_food_samp

Unnamed: 0,user_id,product_name
0,202279,Organic Egg Whites
1,202279,Michigan Organic Kale
2,202279,Garlic Powder
3,202279,Coconut Butter
4,202279,Natural Sweetener
...,...,...
2989,36278,Roasted Salted Cashews
2992,36278,Lactose Free Sour Cream
2997,36278,Artichokes
2998,60766,Organic Hass Avocado


I now need rows of users and columns of all items. This means that all items that a user bought will be stored row-wise. If a user bought an item more than once then that will be reflected in the number in that column. 

In [190]:
basket_purcahse_count_samp = baskets_food_samp.groupby(['user_id', 'product_name'])\
                                              .agg({'product_name': 'count'})\
                                              .rename(columns={'product_name': 'purchase_count'}).reset_index()

In [191]:
basket_purcahse_count_samp.head()

Unnamed: 0,user_id,product_name,purchase_count
0,382,Chocolate Milk 1% Milkfat,1
1,382,Macaroni & Cheese,1
2,382,Organic 1% Low Fat Milk,1
3,382,Sparkling Natural Mineral Water,1
4,503,Banana,1


In [135]:
# basket_purcahse_count_samp['purchase_count'] = 1

In [136]:
# basket_purcahse_count_samp.head()

Unnamed: 0,user_id,product_name,purchase_count
0,382,Chocolate Milk 1% Milkfat,1
1,382,Macaroni & Cheese,1
2,382,Organic 1% Low Fat Milk,1
3,382,Sparkling Natural Mineral Water,1
4,503,Banana,1


In [137]:
# usr_sparse = basket_purcahse_count_samp.pivot(index='user_id', columns='product_name', values='purchase_count')

In [138]:
# usr_matrix = usr_sparse.replace(np.nan, 0)

In [139]:
# usr_matrix.shape

(307, 1963)

### Let's run our first model

In [140]:
# def calculate_similarity(dataframe):
#     """Calculate the column-wise cosine similarity for a sparse
#     matrix. Return a new dataframe matrix with similarities.
#     """
#     data_sparse = sparse.csr_matrix(dataframe)
#     similarities = cosine_similarity(data_sparse.transpose())
#     sim = pd.DataFrame(data=similarities, index= dataframe.columns, columns= dataframe.columns)
#     return sim

In [141]:
# data_matrix = calculate_similarity(usr_matrix)

In [142]:
# list(basket_purcahse_count_samp.product_name.unique())

['Chocolate Milk 1% Milkfat',
 'Macaroni & Cheese',
 'Organic 1% Low Fat Milk',
 'Sparkling Natural Mineral Water',
 'Banana',
 'Berry Medley',
 'Boneless Skinless Chicken Breasts',
 'Broccoli Crown',
 'Colby Jack Cheese',
 'Instant Oatmeal Variety Pack',
 'Just Mayo',
 'Organic Cream Of Chicken Condensed Soup',
 'Organic Hearty Split Pea & Uncured Ham Soup',
 'Chunky Peanut Butter',
 'Organic Half & Half',
 'All Natural Gluten Free Teff Wraps',
 'Bag of Organic Bananas',
 'Blue Corn Tortilla Chips',
 'Boneless Skinless Chicken Breast',
 'Colby Cheese Sticks',
 'EnviroKidz Gluten Free & Wheat Free Gorilla Munch Corn Puff Cereal',
 'Gluten Free Omega Flax & Fiber Bread',
 'Gluten Free Pretzel Sticks',
 'Gluten Free Vanilla Granola',
 'Ground Turkey Breast',
 'Lactose Free Low Fat Vanilla Yogurt',
 'Mild Salsa',
 'Non-Fat Vanilla Yogurt',
 'Organic 1% Milk',
 'Organic Avocado',
 'Organic Balsamic Vinegar Of Modena',
 'Organic Cilantro',
 'Organic Corn Tortillas',
 'Organic Large Brown Gr

In [146]:
# print(data_matrix.loc['Coke Classic'].nlargest(11))

product_name
Chocolate Chip Granola           1.0
Chunky Chocolate Chip Cookies    1.0
Coke Classic                     1.0
Crunchy Oats 'n Honey Granola    1.0
Earl Grey  Black Tea Blend       1.0
Lactose Free 2% Milk             1.0
Lemon Sparkling Water            1.0
Lemon-Lime Fridge Pack Soda      1.0
Natural Pure Sparkling Water     1.0
Organic Green Tea                1.0
White Giant Paper Towel Rolls    1.0
Name: Coke Classic, dtype: float64


### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommendor Systems: The Textbook (pg. 160) 

***

In [171]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [192]:
basket_purcahse_count_samp.head()

Unnamed: 0,user_id,product_name,purchase_count
0,382,Chocolate Milk 1% Milkfat,1
1,382,Macaroni & Cheese,1
2,382,Organic 1% Low Fat Milk,1
3,382,Sparkling Natural Mineral Water,1
4,503,Banana,1


In [172]:
usr_matrix

product_name,0% Greek Strained Yogurt,1 Apple + 1 Pear Fruit Bar,1% Lowfat Milk,100 Calorie Per Bag Popcorn,100% Apple Juice Original,100% Cranberry Juice,100% Guava Juice,100% Juice No Added Sugar Orange Tangerine,100% Juice No Sugar Added Apple,100% Lactose Free Fat Free Milk,...,"Yogurt, Nonfat, Organic, Plain",Yuba Tofu Skin,Yukon Gold Potatoes 5lb Bag,ZBar Organic Chocolate Brownie Energy Snack,Zero Calorie Cream Soda,Zero Calories Berry Nutrient Enhanced Water,Zinfandel,Zucchini Noodles,gel hand wash sea minerals,with Crispy Almonds Cereal
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
503,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
971,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
202279,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
202527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
204184,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)