# Neighborhood Rules - Similarity: Instacart

## Load & Combine the Datasets

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time
import random

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pickle

## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

In [2]:
baskets = pd.read_csv('../../data/05_model_output/baskets_newprodlist_2.csv')

In [3]:
baskets.columns

Index(['order_id', 'product_name', 'user_id', 'all_ones', 'new_prod_list'], dtype='object')

The new product list produced by my CRF model reduced the number of products from about 24K to just over 4K.

In [4]:
baskets.new_prod_list.nunique()

4086

In [5]:
baskets.product_name.nunique()

24495

In [6]:
baskets.shape

(24890363, 5)

In [7]:
baskets.user_id.nunique()

204454

We have over 200k unique users. Since this is too much for my computer to handle, I am going to take a subsample of 100k users and go from there.

In [8]:
insta_users_lst = list(baskets.user_id.unique())

In [9]:
len(insta_users_lst)

204454

Let's take a random sample of 100k of these user IDs.

In [10]:
random_usrids_100k = random.sample(insta_users_lst, 100000)

In [11]:
len(random_usrids_100k)

100000

In [12]:
mask = baskets['user_id'].isin(random_usrids_100k)

In [13]:
baskets_100k = baskets.loc[mask]

In [14]:
baskets_100k.user_id.nunique()

100000

In [15]:
len(baskets_100k)

12165725
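The sample above is drawn without a seed, so a different set of users comes back on every run. Below is a minimal sketch of a repeatable version on a made-up frame; `toy_baskets`, the seed `42`, and the sample size of 3 are illustrative assumptions, not values from this notebook.

```python
import random

import pandas as pd

# Toy stand-in for `baskets` (hypothetical data, same column names).
toy_baskets = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4, 5],
    "user_id": [10, 10, 11, 12, 13, 14],
    "new_prod_list": ["bagels", "oats", "cereal", "bagels", "oats", "cereal"],
})

# A dedicated, seeded generator makes the sample repeatable.
rng = random.Random(42)
user_ids = sorted(toy_baskets["user_id"].unique().tolist())
sampled_ids = rng.sample(user_ids, 3)

# .copy() returns an independent frame, so later edits to the sample
# do not touch (or warn about) the original frame.
toy_sample = toy_baskets.loc[toy_baskets["user_id"].isin(sampled_ids)].copy()
print(toy_sample["user_id"].nunique())
```

Sorting the IDs before sampling keeps the result independent of the row order in the source file.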

## Let's drop columns and get everything into the right shape

In [16]:
# reassign instead of using inplace=True on a slice, which avoids SettingWithCopyWarning
baskets_100k = baskets_100k.drop(columns=['user_id'])


In [17]:
baskets_100k.head()

| | order_id | product_name | all_ones | new_prod_list |
|---|---|---|---|---|
| 17 | 4 | Plain Pre-Sliced Bagels | 1 | bagels |
| 18 | 4 | Oats & Chocolate Chewy Bars | 1 | oats |
| 19 | 4 | Kellogg's Nutri-Grain Apple Cinnamon Cereal | 1 | kellogg |
| 20 | 4 | Nutri-Grain Soft Baked Strawberry Cereal Break... | 1 | breakfast bars |
| 21 | 4 | Kellogg's Nutri-Grain Blueberry Cereal | 1 | cereal |


df_matrix = pd.pivot_table(baskets_50k, values='product_count', index='user_id', columns='product_name')

In [18]:
baskets_100k.reset_index(inplace=True)

In [19]:
len(baskets_100k)

12165725

In [20]:
# reassign instead of using inplace=True on a slice, which avoids SettingWithCopyWarning
baskets_100k = baskets_100k.dropna()


In [21]:
len(baskets_100k)

12155376

In [22]:
baskets_100k.new_prod_list.nunique()

4055

### Break the products up into chunks of about 1,000

We keep getting an unstack overflow error because the pivoted order-by-product table has too many cells. Let's split the product list into chunks of about 1,000 products and pivot each chunk separately.
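One way to sidestep the unstack limit altogether, offered here as an alternative and not what this notebook does, is to build the order-by-product matrix as a SciPy sparse matrix straight from the long-format rows. The sketch below uses a made-up `toy` frame with the same column names.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical long-format baskets: one row per (order, product) purchase.
toy = pd.DataFrame({
    "order_id": [4, 4, 4, 5, 5, 6],
    "new_prod_list": ["bagels", "oats", "cereal", "bagels", "milk", "milk"],
})

# factorize maps each label to an integer code and returns the unique labels.
row_codes, order_ids = pd.factorize(toy["order_id"])
col_codes, products = pd.factorize(toy["new_prod_list"])

# Duplicate (row, col) pairs are summed when the COO matrix is converted to CSR,
# so repeated purchases within an order become counts.
counts = sparse.coo_matrix(
    (np.ones(len(toy), dtype=np.int32), (row_codes, col_codes)),
    shape=(len(order_ids), len(products)),
).tocsr()

print(counts.shape, counts.nnz)
```

Only the non-zero cells are stored, so no dense orders-by-products table is ever materialized. `order_ids` and `products` keep the labels for the rows and columns.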

In [23]:
product_list = list(baskets_100k.new_prod_list.unique())

In [24]:
len(product_list)

4055

#### Product List 1

In [25]:
baskets_100k.columns

Index(['index', 'order_id', 'product_name', 'all_ones', 'new_prod_list'], dtype='object')

In [26]:
baskets_complete = baskets_100k.drop(columns=['index', 'product_name'])

In [27]:
baskets_complete.head()

| | order_id | all_ones | new_prod_list |
|---|---|---|---|
| 0 | 4 | 1 | bagels |
| 1 | 4 | 1 | oats |
| 2 | 4 | 1 | kellogg |
| 3 | 4 | 1 | breakfast bars |
| 4 | 4 | 1 | cereal |


In [29]:
product_list_1 = product_list[:1000]
mask_prod1 = baskets_complete['new_prod_list'].isin(product_list_1)
baskets_prod1 = baskets_complete.loc[mask_prod1]
# pivot the dataset
basket_matrix_1 = (baskets_prod1.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [30]:
product_list_2 = product_list[1000:2000]
mask_prod2 = baskets_complete['new_prod_list'].isin(product_list_2)
baskets_prod2 = baskets_complete.loc[mask_prod2]
# pivot the dataset
basket_matrix_2 = (baskets_prod2.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [31]:
product_list_3 = product_list[2000:3000]
mask_prod3 = baskets_complete['new_prod_list'].isin(product_list_3)
baskets_prod3 = baskets_complete.loc[mask_prod3]
# pivot the dataset
basket_matrix_3 = (baskets_prod3.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [32]:
product_list_4 = product_list[3000:]
mask_prod4 = baskets_complete['new_prod_list'].isin(product_list_4)
baskets_prod4 = baskets_complete.loc[mask_prod4]
# pivot the dataset
basket_matrix_4 = (baskets_prod4.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))
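The four cells above repeat the same filter-and-pivot step. A minimal sketch of the same idea as a loop is shown below; the helper name `pivot_chunk`, the `toy` data, and the chunk size of 2 are assumptions for illustration (the notebook uses slices of 1,000).

```python
import pandas as pd

def pivot_chunk(long_df, products):
    """Order-by-product count matrix for one chunk of products."""
    subset = long_df.loc[long_df["new_prod_list"].isin(products)]
    return (subset.groupby(["order_id", "new_prod_list"])["all_ones"]
            .sum()
            .unstack(fill_value=0))  # missing (order, product) pairs become 0

# Hypothetical long-format data with the same columns as baskets_complete.
toy = pd.DataFrame({
    "order_id": [4, 4, 4, 5, 5, 6],
    "all_ones": [1, 1, 1, 1, 1, 1],
    "new_prod_list": ["bagels", "oats", "cereal", "bagels", "milk", "milk"],
})
toy_products = list(toy["new_prod_list"].unique())
chunk_size = 2

# One pivot per slice of the product list.
chunks = [
    pivot_chunk(toy, toy_products[start:start + chunk_size])
    for start in range(0, len(toy_products), chunk_size)
]
print([chunk.shape for chunk in chunks])
```

Each chunk only contains the orders that include at least one of its products, which is why the row counts differ between chunks.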

#### Let's merge all the small dataframes into a large one

In [33]:
print(basket_matrix_1.shape)
print(basket_matrix_2.shape)
print(basket_matrix_3.shape)
print(basket_matrix_4.shape)

(1482547, 1000)
(331379, 1000)
(72578, 1000)
(13970, 1055)
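These shapes also suggest why the full pivot failed while the chunks went through. The arithmetic below assumes that pandas guards `unstack` by comparing rows times columns against the int32 maximum; that is my understanding of the overflow error, so treat it as an assumption. `n_orders` is the row count of `basket_matrix_1`, which is only a lower bound on the total number of orders.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2147483647

n_orders = 1_482_547   # rows of basket_matrix_1 (lower bound on all orders)
n_products = 4_055     # unique new_prod_list values after dropping NaNs

cells_full = n_orders * n_products   # pivot over every product at once
cells_chunk = n_orders * 1_000       # pivot over one chunk of 1,000 products

print(cells_full, cells_full > INT32_MAX)    # 6011728085 True
print(cells_chunk, cells_chunk > INT32_MAX)  # 1482547000 False
```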


In [34]:
matrix1 = basket_matrix_1.merge(basket_matrix_2, 
                      how='outer', 
                      on='order_id')

In [35]:
matrix2 = matrix1.merge(basket_matrix_3, 
                      how='outer', 
                      on='order_id')

In [36]:
basket_matrix_usr = matrix2.merge(basket_matrix_4, 
                      how='outer', 
                      on='order_id')
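The chain of outer merges can also be written as a single column-wise concatenation, since each chunk is indexed by a unique `order_id`. This is a sketch on two made-up chunks (`m1`, `m2`), not the notebook's own code.

```python
import pandas as pd

# Hypothetical chunks, each indexed by order_id like basket_matrix_1..4.
m1 = pd.DataFrame({"bagels": [1, 1], "oats": [1, 0]},
                  index=pd.Index([4, 5], name="order_id"))
m2 = pd.DataFrame({"cereal": [1, 0, 0], "milk": [0, 1, 1]},
                  index=pd.Index([4, 5, 6], name="order_id"))

# axis=1 with the default join='outer' aligns rows on order_id,
# like a chain of outer merges; orders missing from a chunk get NaN.
combined = pd.concat([m1, m2], axis=1).fillna(0).astype(int)
print(combined)
```

Order 6 appears only in the second chunk, so its `bagels` and `oats` cells are filled with 0.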

In [1]:
basket_matrix_usr.replace(np.nan, 0, inplace=True)

NameError: name 'basket_matrix_usr' is not defined

In [None]:
basket_matrix_usr.shape

In [None]:
basket_matrix_usr.isnull().sum()

### Let's run our first model

In [None]:
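def calculate_similarity(dataframe):
    """Calculate the column-wise cosine similarity of a dataframe.

    The dataframe is converted to a sparse matrix first. Returns a new
    dataframe of similarities labeled by the original column names.
    """
    # CSR stores only the non-zero counts
    data_sparse = sparse.csr_matrix(dataframe)
    # transpose so that products (columns) become the rows being compared
    similarities = cosine_similarity(data_sparse.transpose())
    sim = pd.DataFrame(data=similarities, index=dataframe.columns, columns=dataframe.columns)
    return sim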
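To make the output of `calculate_similarity` concrete, here is a small worked example on a made-up three-order matrix. The function is repeated so the block runs on its own; `toy_matrix` is hypothetical data.

```python
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(dataframe):
    """Column-wise cosine similarity, returned as a labeled dataframe."""
    data_sparse = sparse.csr_matrix(dataframe.values)
    similarities = cosine_similarity(data_sparse.transpose())
    return pd.DataFrame(similarities, index=dataframe.columns,
                        columns=dataframe.columns)

# Rows are orders, columns are products, values are purchase counts.
toy_matrix = pd.DataFrame(
    {"bagels": [1, 1, 0], "oats": [1, 0, 0], "cereal": [0, 1, 1]},
    index=pd.Index([4, 5, 6], name="order_id"),
)

toy_sim = calculate_similarity(toy_matrix)
print(toy_sim.round(3))
```

Bagels and cereal share one order out of two each, giving 1 / (√2 · √2) = 0.5. Oats and cereal never appear together, so their similarity is 0.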

In [None]:
data_matrix = calculate_similarity(basket_matrix_usr)

In [None]:
product_list

In [None]:
print(data_matrix.loc['Teriyaki Veggie Burgers'].nlargest(11))
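Because every product has a similarity of 1 with itself, `nlargest(11)` returns the product itself plus its 10 nearest neighbours. A small helper that drops the product first is sketched below; `top_similar` and `toy_sim` are hypothetical names and values.

```python
import pandas as pd

def top_similar(sim, item, n=10):
    """Return the n most similar items to `item`, excluding the item itself."""
    return sim.loc[item].drop(item).nlargest(n)

# Hypothetical similarity matrix with the same layout as data_matrix.
labels = ["bagels", "oats", "cereal"]
toy_sim = pd.DataFrame(
    [[1.0, 0.7, 0.5],
     [0.7, 1.0, 0.0],
     [0.5, 0.0, 1.0]],
    index=labels, columns=labels,
)

print(top_similar(toy_sim, "bagels", n=2))
```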

In [None]:
data_matrix.to_csv('../../data/04_models/similarity_matrix.csv')

### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommender Systems: The Textbook (p. 160)

***

In [None]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
basket_purcahse_count_samp.head()

In [None]:
usr_matrix

In [None]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
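With `min_support=0.07`, an itemset is kept only if it appears in at least 7% of orders. To show what the rule metrics measure, the sketch below computes support, confidence, and lift for a single product pair by hand in pandas. It is not the mlxtend implementation, and `toy_sets` and `pair_metrics` are made-up names.

```python
import pandas as pd

# Hypothetical one-hot basket matrix: rows are orders, columns are products.
toy_sets = pd.DataFrame(
    {"bagels": [1, 1, 0, 1],
     "cream cheese": [1, 1, 0, 0],
     "milk": [0, 1, 1, 1]},
    dtype=bool,
)

def pair_metrics(basket_sets, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    a = basket_sets[antecedent]
    c = basket_sets[consequent]
    support_a = a.mean()          # share of orders containing the antecedent
    support_c = c.mean()          # share of orders containing the consequent
    support_ac = (a & c).mean()   # share of orders containing both
    confidence = support_ac / support_a
    lift = confidence / support_c
    return {"support": support_ac, "confidence": confidence, "lift": lift}

metrics = pair_metrics(toy_sets, "bagels", "cream cheese")
print(metrics)
```

A lift above 1 means the two products occur together more often than they would if purchases were independent.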

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similarity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* collaborative filtering item - item article medium: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use Pyspark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921
* **Association rule mining with the Apriori algorithm**: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

***
* **Practical Business Python**: https://pbpython.com/market-basket-analysis.html
* **Market Basket Analysis Notebook**: https://github.com/chris1610/pbpython/blob/master/notebooks/Market_Basket_Intro.ipynb

**Memory-based methods**
1. **User-based collaborative filtering**: In this model, products are recommended to a user because similar users liked them. For example, if Derrick and Dennis like the same movies and a new movie comes out that Derrick likes, then we can recommend that movie to Dennis.
1. **Item-based collaborative filtering**: These systems identify similar items based on users' previous ratings. For example, if users A, B, and C gave a 5-star rating to books X and Y, then when user D buys book Y they also get a recommendation for book X, because the system treats books X and Y as similar based on the ratings of users A, B, and C.
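The item-based idea can be sketched with an item-item similarity matrix: score every product by its summed similarity to the products already in the basket, then recommend the best-scoring products that are not yet in it. The `recommend` helper and the numbers in `toy_sim` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical item-item similarity matrix.
items = ["bagels", "oats", "cereal", "milk"]
toy_sim = pd.DataFrame(
    [[1.0, 0.7, 0.5, 0.1],
     [0.7, 1.0, 0.0, 0.2],
     [0.5, 0.0, 1.0, 0.6],
     [0.1, 0.2, 0.6, 1.0]],
    index=items, columns=items,
)

def recommend(sim, basket, n=2):
    """Rank items not in the basket by summed similarity to the basket items."""
    scores = sim.loc[basket].sum(axis=0)    # one score per candidate item
    return scores.drop(basket).nlargest(n)  # never recommend what is already there

recs = recommend(toy_sim, ["bagels", "oats"])
print(recs)
```

Summing similarities is only one possible scoring rule; averaging or weighting by purchase counts would follow the same pattern.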