# Neighborhood Rules - Similarity: Instacart

## Load & Combine the Datasets

In [28]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time
import random

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split



## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

In [29]:
baskets = pd.read_csv('../../data/02_intermediate/baskets_spark.csv')

In [30]:
baskets.columns

Index(['Unnamed: 0', 'order_id', 'product_name', 'user_id', 'product_count'], dtype='object')

In [31]:
baskets.drop(columns=['Unnamed: 0'], inplace=True)

In [32]:
baskets.head()

,order_id,product_name,user_id,product_count
0,2,Organic Egg Whites,202279,1
1,2,Michigan Organic Kale,202279,1
2,2,Garlic Powder,202279,1
3,2,Coconut Butter,202279,1
4,2,Natural Sweetener,202279,1


In [33]:
baskets.shape

(32434489, 4)

In [34]:
baskets.user_id.nunique()

206209

We have over 200k unique users. That is more than my computer can handle, so I am going to take a subsample of 50k users and go from there.

In [35]:
insta_users_lst = list(baskets.user_id.unique())

In [36]:
len(insta_users_lst)

206209

Let's take a random sample of 50k of these user IDs.

In [37]:
random_usrids_50k = random.sample(insta_users_lst, 50000)

In [38]:
mask = baskets['user_id'].isin(random_usrids_50k)

In [39]:
baskets_50k = baskets.loc[mask]

In [40]:
baskets_50k.user_id.nunique()

50000

In [41]:
len(baskets_50k)

7876775
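
Because `random.sample` is unseeded here, the 50k subsample changes on every run. A minimal sketch of a reproducible variant follows; the helper name `sample_user_ids`, the seed value, and the toy ID list are illustrative assumptions, not part of the original notebook.

```python
import random

def sample_user_ids(user_ids, k, seed=42):
    """Return k distinct user IDs; the same seed gives the same sample."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(list(user_ids), k)

# Toy stand-in for insta_users_lst (hypothetical IDs).
toy_ids = list(range(1, 21))
first = sample_user_ids(toy_ids, 5)
second = sample_user_ids(toy_ids, 5)
print(first == second)  # True
```

With the real data this would be called as `sample_user_ids(insta_users_lst, 50000)`.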

## Let's drop columns and get everything into the right shape

In [42]:
baskets_50k = baskets_50k.drop(columns=['order_id'])


In [43]:
baskets_50k.head()

,product_name,user_id,product_count
17,Plain Pre-Sliced Bagels,178520,1
18,Honey/Lemon Cough Drops,178520,1
19,Chewy 25% Low Sugar Chocolate Chip Granola,178520,1
20,Oats & Chocolate Chewy Bars,178520,1
21,Kellogg's Nutri-Grain Apple Cinnamon Cereal,178520,1


df_matrix = pd.pivot_table(baskets_50k, values='product_count', index='user_id', columns='product_name')

In [45]:
baskets_50k.reset_index(inplace=True)

In [46]:
baskets_50k.product_name.nunique()

46945

### Break the products up into chunks of 10k

Unstacking the full user-by-product table keeps raising an int32 overflow error because the result has too many cells. Let's split the products into chunks of 10,000 and pivot each chunk separately.
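
A quick arithmetic check shows why the full pivot fails while a 10k-product chunk fits. This is a sketch that assumes the limit applies to rows times columns of the unstacked result and that all 50,000 sampled users appear in each pivot; the variable names are illustrative.

```python
n_users = 50_000        # sampled users
n_products = 46_945     # unique products in the sample
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold

full_cells = n_users * n_products
chunk_cells = n_users * 10_000

print(full_cells)                # 2347250000
print(full_cells > INT32_MAX)    # True: the full pivot overflows
print(chunk_cells <= INT32_MAX)  # True: a 10k-product chunk fits
```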

In [48]:
product_list = list(baskets_50k.product_name.unique())

#### Product List 1

In [49]:
product_list_1 = product_list[0:10000]

In [50]:
len(product_list_1)

10000

In [56]:
mask_prod1 = baskets_50k['product_name'].isin(product_list_1)

In [57]:
baskets_prod1 = baskets_50k.loc[mask_prod1]

In [58]:
baskets_prod1.product_name.nunique()

10000

In [60]:
basket_matrix_1 = (baskets_prod1.groupby(['user_id', 'product_name'])['product_count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('user_id'))

In [62]:
basket_matrix_1.head()

user_id,0% Fat Free Organic Milk,0% Fat Organic Greek Vanilla Yogurt,0% Fat Strawberry Greek Yogurt,0% Greek Strained Yogurt,0% Greek Yogurt Black Cherry on the Bottom,0% Milkfat Greek Yogurt Honey,1 % Lowfat Milk,1 Apple + 1 Mango Fruit Bar,1 Apple + 1 Pear Fruit Bar,1 Liter,...,Zucchini Banana & Amaranth Organic Baby Food,Zucchini Noodles,"\""Mokaccino\"" Milk + Blue Bottle Coffee Chocolate",for Tots Apple Juice,of Hanover 100 Calorie Pretzels Mini,smartwater® Electrolyte Enhanced Water,vitaminwater® XXX Acai Blueberry Pomegranate,w/Banana Pulp Free Juice,with Crispy Almonds Cereal,with Olive Oil Mayonnaise
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
product_list_2 = product_list[10000:20000]
mask_prod2 = baskets_50k['product_name'].isin(product_list_2)
baskets_prod2 = baskets_50k.loc[mask_prod2]
# pivot the dataset
basket_matrix_2 = (baskets_prod2.groupby(['user_id', 'product_name'])['product_count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('user_id'))

In [66]:
basket_matrix_2.head()

user_id,#2 Coffee Filters,#4 Natural Brown Coffee Filters,& Go! Hazelnut Spread + Pretzel Sticks,0% Fat Black Cherry Greek Yogurt y,0% Fat Blueberry Greek Yogurt,0% Fat Vanilla Greek Yogurt,"0% Greek, Blueberry on the Bottom Yogurt",1 Mg Melatonin Sublingual Orange Tablets,1% Low Fat Chocolate Milk,1% Lowfat Chocolate Milk,...,go fresh Cool Moisture Beauty,in 100% Juice Mixed Fruit,o.b Super Plus Fluid Lock Tampons,rich kiss Olive & Aloe Moisturizer 2 in 1,smart Blend Chicken & Rice Formula Dry Dog Food,with Bleach Powder Cleanser,with Dawn Action Pacs Fresh Scent Dishwasher Detergent Pacs,with Olive Oil Mayonnaise Dressing,with Twist Ties Sandwich & Storage Bags,with Xylitol Original Flavor 18 Sticks Sugar Free Gum
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [67]:
product_list_3 = product_list[20000:30000]
mask_prod3 = baskets_50k['product_name'].isin(product_list_3)
baskets_prod3 = baskets_50k.loc[mask_prod3]
# pivot the dataset
basket_matrix_3 = (baskets_prod3.groupby(['user_id', 'product_name'])['product_count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('user_id'))

In [68]:
basket_matrix_3.head()

user_id,+Energy Black Cherry Vegetable & Fruit Juice,0 Calorie Fuji Apple Pear Water Beverage,0 Calorie Strawberry Dragonfruit Water Beverage,0% Fat Superfruits Greek Yogurt,0% Milkfat Greek Plain Yogurt,1 Cup Measuring Cup,1 Ply White Luncheon Napkins,1 Step Kashmir Spinach Indian Cuisine,1 to 1 Gluten Free Baking Flour,1% Chocolate Milk,...,"\""Darn Good\"" Chili Mix",flings! Original Laundry Detergent Pacs,for Tots Apple White Grape Juice,for Women Maximum Absorbency L Underwear,from Concentrate Mango Nectar,iChef Casserole Pans with Lids (10 7/16 in x 8 in x 1 3/4 in),with Sweet Cinnamon Bunches Cereal,with Xylitol Cinnamon 18 Sticks Sugar Free Gum,with Xylitol Island Berry Lime 18 Sticks Sugar Free Gum,with a Splash of Mango Coconut Water
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [69]:
product_list_4 = product_list[30000:40000]
mask_prod4 = baskets_50k['product_name'].isin(product_list_4)
baskets_prod4 = baskets_50k.loc[mask_prod4]
# pivot the dataset
basket_matrix_4 = (baskets_prod4.groupby(['user_id', 'product_name'])['product_count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('user_id'))

In [70]:
basket_matrix_4.head()

user_id,#2 Cone White Coffee Filters,#2 Mechanical Pencils,(70% Juice!) Mountain Raspberry Juice Squeeze,0 Calorie Acai Raspberry Water Beverage,0% Fat Greek Yogurt Black Cherry on the Bottom,0% Fat Greek Yogurt Vanilla,0% Fat Peach Greek Yogurt,1 Ply Napkins,1 Razor Handle and 2 Freesia Scented Razor Refills Premium BladeRazor System,"1% Hydrocortisone Anti-Itch Cream, Tube Anti-Itch",...,of Norwich Original English Mustard Powder Double Superfine,pumpkin spice,with Bleach Disinfectant Cleanser Scratch Free Lavender Fresh,with Color Safe Brightener Power Paks 2in1 Stain Fighter,with Pump Rebalancing Shampoo,with Seasoned Roasted Potatoes Scrambled Eggs & Sausage,with Xylitol Minty Sweet Twist 18 Sticks Sugar Free Gum,with Xylitol Unwrapped Spearmint 50 Sticks Sugar Free Gum,with Xylitol Watermelon Twist 18 Sticks Sugar Free Gum,with a Splash of Pineapple Coconut Water
31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [73]:
product_list_5 = product_list[40000:]
mask_prod5 = baskets_50k['product_name'].isin(product_list_5)
baskets_prod5 = baskets_50k.loc[mask_prod5]
# pivot the dataset
basket_matrix_5 = (baskets_prod5.groupby(['user_id', 'product_name'])['product_count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('user_id'))

In [74]:
basket_matrix_5.head()

user_id,".5\"" Waterproof Tape",1 Step-1 Minute Noodles Toasted Sesame,1% Hydrocortisone Anti-Itch Liquid Maximum Strength with Healing Aloe,"1,000 Mg Vitamin C Lemon Lime Effervescent Drink Mix",1/2 Caff Medium Ground Coffee,"10.25\"" Cast Iron Skillet",100 Calorie Healthy Pop Butter Microwave Pop Corn,100% All Pomegranate Juice,100% Apple Berry Cherry Juice,100% Black Cherry & Concord Grape Juice,...,Ziti Rigate Penne,ZuZu Luxe Onyx Mascara,Zyflamend Whole Body Liquid Vcaps,"\""Louis Ba-Kahn\"" Chocolate Chip Cookie & Brown Butter Candied Bacon Ice Cream Sandwich",by Mennen Power Antiperspirant/Deodorant Fresh,"flings! Laundry Detergent Pacs, Original, 57 Count Laundry",with Lime Juice Mayonesa Mayonnaise,with Mac & Cheese Fish Sticks,with Sweet & Smoky BBQ Sauce Cheeseburger Sliders,with Xylitol Unwrapped Original Flavor 50 Sticks Sugar Free Gum
31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Let's merge all the small dataframes into a large one
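
The chunked pivots can be produced in a loop and joined on `user_id`. The sketch below runs on a toy frame; `pivot_in_chunks` and the toy data are hypothetical, and it assumes an outer join is wanted so that users missing from a chunk get zeros.

```python
import pandas as pd

# Toy stand-in for baskets_50k (illustrative rows only).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "product_name": ["Kale", "Bagels", "Kale", "Bagels", "Garlic Powder", "Kale"],
    "product_count": [1, 2, 1, 1, 1, 3],
})

def pivot_in_chunks(df, chunk_size):
    """Pivot user x product counts one product chunk at a time, then join."""
    products = list(df["product_name"].unique())
    pieces = []
    for start in range(0, len(products), chunk_size):
        chunk = products[start:start + chunk_size]
        subset = df.loc[df["product_name"].isin(chunk)]
        piece = (subset.groupby(["user_id", "product_name"])["product_count"]
                 .sum().unstack())
        pieces.append(piece)
    # Outer join on user_id: a user absent from a chunk gets NaN, filled with 0.
    return pd.concat(pieces, axis=1).fillna(0).sort_index()

merged = pivot_in_chunks(toy, chunk_size=2)
print(merged.shape)  # (3, 3)
```

At full size the dense result would hold about 2.3 billion float cells, so a sparse representation is likely the more practical route.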

### We are finally ready to put our data in the right shape

In [44]:
basket_matrix = (baskets_50k.groupby(['user_id', 'product_name'])['product_count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('user_id'))

ValueError: Unstacked DataFrame is too big, causing int32 overflow
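
One way around this overflow, offered as an assumption rather than what the notebook does, is to skip `unstack` and build a SciPy sparse matrix directly from the long-format rows. `to_sparse_user_item` and the toy frame are hypothetical names used for illustration.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for baskets_50k; user 1 buys Kale in two separate rows.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "product_name": ["Kale", "Kale", "Bagels", "Kale", "Bagels", "Garlic Powder"],
    "product_count": [1, 2, 1, 1, 1, 1],
})

def to_sparse_user_item(df):
    """Build a users x products CSR matrix of summed counts without unstacking."""
    # factorize maps each label to an integer position and returns the labels.
    user_codes, user_index = pd.factorize(df["user_id"], sort=True)
    product_codes, product_index = pd.factorize(df["product_name"], sort=True)
    matrix = sparse.coo_matrix(
        (df["product_count"].to_numpy(), (user_codes, product_codes)),
        shape=(len(user_index), len(product_index)),
    ).tocsr()  # converting to CSR sums duplicate (user, product) entries
    return matrix, user_index, product_index

matrix, user_index, product_index = to_sparse_user_item(toy)
kale_col = list(product_index).index("Kale")
print(matrix.shape)         # (3, 3)
print(matrix[0, kale_col])  # 3
```

Only the non-zero cells are stored, so memory grows with the number of purchases rather than with users times products.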

In [None]:
# usr_matrix.shape

### Let's run our first model

In [None]:
def calculate_similarity(dataframe):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    # Rows are users and columns are products, so transpose to compare products.
    data_sparse = sparse.csr_matrix(dataframe)
    similarities = cosine_similarity(data_sparse.transpose())
    # Label both axes with the product names so rows can be looked up with .loc.
    sim = pd.DataFrame(data=similarities, index=dataframe.columns, columns=dataframe.columns)
    return sim
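
To make the column-wise cosine similarity concrete, here is a hand check on a toy matrix in plain Python. The product columns and counts are made up for illustration; the helper `cosine` is not part of the notebook.

```python
import math

# Each list is one product column: purchase counts for three users.
counts = {
    "Kale": [1, 0, 2],
    "Bagels": [2, 0, 4],
    "Garlic Powder": [0, 3, 0],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim_kale_bagels = cosine(counts["Kale"], counts["Bagels"])
sim_kale_garlic = cosine(counts["Kale"], counts["Garlic Powder"])
print(round(sim_kale_bagels, 6))  # 1.0
print(sim_kale_garlic)            # 0.0
```

Kale and Bagels are bought by the same users in the same proportions, so their similarity is 1; Kale and Garlic Powder share no buyers, so theirs is 0.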

In [None]:
# data_matrix = calculate_similarity(usr_matrix)

In [None]:
# list(basket_purcahse_count_samp.product_name.unique())

In [None]:
# print(data_matrix.loc['Coke Classic'].nlargest(11))

### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommender Systems: The Textbook (p. 160)

***

In [None]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
basket_purcahse_count_samp.head()

In [None]:
usr_matrix

In [None]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
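
`apriori` keeps only itemsets whose support is at least `min_support`, and association rules are then scored with metrics such as confidence and lift. A minimal plain-Python sketch of those three quantities follows; the toy baskets and helper names are illustrative assumptions, and this is not mlxtend's implementation.

```python
# Four toy baskets (hypothetical data).
toy_baskets = [
    {"Kale", "Bagels"},
    {"Kale", "Garlic Powder"},
    {"Kale", "Bagels", "Garlic Powder"},
    {"Bagels"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent), baskets)
    confidence = s_both / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return {"support": s_both, "confidence": confidence, "lift": lift}

metrics = rule_metrics({"Kale"}, {"Bagels"}, toy_baskets)
print(metrics["support"])  # 0.5
```

Here the rule Kale -> Bagels has confidence 2/3 and lift 8/9; a lift below 1 means the two items co-occur slightly less often than independence would predict.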

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similarity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* **Item-item collaborative filtering (Medium article)**: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use Pyspark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921
* **Association rule mining with Apriori**: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

***
* **Practical Business Python**: https://pbpython.com/market-basket-analysis.html
* **Market Basket Analysis Notebook**: https://github.com/chris1610/pbpython/blob/master/notebooks/Market_Basket_Intro.ipynb

**Memory-based methods**
1. **User-based collaborative filtering**: Products are recommended to a user because similar users liked them. For example, if Derrick and Dennis like the same movies and a new movie comes out that Derrick likes, we can recommend that movie to Dennis.
1. **Item-based collaborative filtering**: These systems identify similar items based on users' previous ratings. For example, if users A, B, and C gave 5-star ratings to books X and Y, then a user D who buys book Y is also recommended book X, because the system treats X and Y as similar based on those ratings.
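
The user-based idea can be sketched in a few lines of plain Python. The ratings, the extra user "Dana", the movie titles, and the `min_rating` threshold are all made up for illustration; a real system would weigh many neighbours rather than only the nearest one.

```python
import math

# Toy ratings on a 1-5 scale (hypothetical data).
ratings = {
    "Derrick": {"Movie A": 5, "Movie B": 4, "Movie C": 5},
    "Dennis": {"Movie A": 5, "Movie B": 4},
    "Dana": {"Movie A": 1, "Movie D": 5},
}

def cosine(a, b):
    """Cosine similarity between two rating dicts; unrated items count as 0."""
    dot = sum(a[item] * b[item] for item in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend_user_based(target, ratings, min_rating=4):
    """Recommend items the most similar user rated highly and target has not seen."""
    others = [user for user in ratings if user != target]
    nearest = max(others, key=lambda user: cosine(ratings[target], ratings[user]))
    seen = ratings[target]
    return sorted(item for item, rating in ratings[nearest].items()
                  if item not in seen and rating >= min_rating)

recs = recommend_user_based("Dennis", ratings)
print(recs)  # ['Movie C']
```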