# Neighborhood Rules - Similarity: Instacart

## Load & Combine the Datasets

In [5]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time
import random

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pickle

## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

In [6]:
baskets = pd.read_csv('../../data/02_intermediate/baskets_spark.csv')

In [8]:
baskets.product_name.nunique()

24495

In [None]:
baskets.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
baskets.shape

In [None]:
baskets.user_id.nunique()

We have over 200k unique users. That is more than my computer can handle, so I am going to take a random subsample of 50k users and work from there.

In [None]:
insta_users_lst = list(baskets.user_id.unique())

In [None]:
len(insta_users_lst)

Let's take a random sample of 50k of these user IDs.

In [None]:
# sample 50k user IDs without replacement (the text and variable names both say 50k)
random_usrids_50k = random.sample(insta_users_lst, 50000)

In [None]:
mask = baskets['user_id'].isin(random_usrids_50k)

In [None]:
# copy so the in-place edits below do not act on a view of `baskets`
baskets_50k = baskets.loc[mask].copy()
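
The subsampling step above can be made repeatable by fixing a random seed. The sketch below is one way to do that on a toy table; `toy_baskets`, `n_users`, and the seed value are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the baskets table (values are made up for illustration).
toy_baskets = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4, 5, 5],
    "order_id": [10, 11, 20, 30, 31, 40, 50, 51],
    "product_name": ["Milk", "Eggs", "Milk", "Bread", "Eggs", "Milk", "Bread", "Milk"],
})

n_users = 3  # hypothetical sample size; the notebook aims for 50k
rng = np.random.default_rng(seed=42)  # fixed seed so the sample is repeatable

# Draw user IDs without replacement, then keep every row of the sampled users.
unique_users = toy_baskets["user_id"].unique()
sampled_users = rng.choice(unique_users, size=n_users, replace=False)
toy_sample = toy_baskets.loc[toy_baskets["user_id"].isin(sampled_users)].copy()

print(sorted(sampled_users))
print(toy_sample["user_id"].nunique())
```

Sampling whole users (rather than rows) keeps each sampled user's full order history intact.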

In [None]:
baskets_50k.order_id.nunique()

In [None]:
len(baskets_50k)

## Let's drop columns and get everything into the right shape

In [None]:
baskets_50k.drop(columns=['user_id'], inplace=True)

In [None]:
baskets_50k.head()

In [None]:
df_matrix = pd.pivot_table(baskets_50k, values='product_count', index='order_id', columns='product_name')  # user_id was dropped above, so pivot on order_id

In [None]:
baskets_50k.reset_index(inplace=True)

In [None]:
baskets_50k.product_name.nunique()

### Break things up into chunks of 10k products

We keep getting an unstack overflow error because the pivoted table has too many cells. Let's break the dataset up further into chunks of products and pivot each chunk separately.
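
An alternative to chunking, which this notebook does not use, is to skip the dense pivot and build the order-by-product matrix directly as a SciPy sparse matrix. The following is a minimal sketch on toy data; the `toy` frame and all variable names are assumptions for illustration.

```python
import pandas as pd
from scipy import sparse

# Toy long-format baskets: one row per (order, product) pair.
toy = pd.DataFrame({
    "order_id": [10, 10, 11, 12, 12, 12],
    "product_name": ["Milk", "Eggs", "Milk", "Bread", "Eggs", "Milk"],
})

# Map each order and each product to a contiguous integer code.
order_cat = toy["order_id"].astype("category")
product_cat = toy["product_name"].astype("category")

rows = order_cat.cat.codes.to_numpy()
cols = product_cat.cat.codes.to_numpy()
data = [1] * len(toy)

# Build in COO form, then convert to CSR; duplicate (row, col) pairs are summed.
order_product = sparse.coo_matrix(
    (data, (rows, cols)),
    shape=(len(order_cat.cat.categories), len(product_cat.cat.categories)),
).tocsr()

print(order_product.shape)
print(list(product_cat.cat.categories))  # column labels, in matrix order
```

Only the non-zero cells are stored, so memory grows with the number of purchases rather than with orders times products.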

In [None]:
product_list = list(baskets_50k.product_name.unique())

In [None]:
len(product_list)

#### Product List 1

In [None]:
product_list_1 = product_list[0:10000]

In [None]:
len(product_list_1)

In [None]:
mask_prod1 = baskets_50k['product_name'].isin(product_list_1)

In [None]:
baskets_prod1 = baskets_50k.loc[mask_prod1]

In [None]:
baskets_prod1.product_name.nunique()

In [None]:
basket_matrix_1 = (baskets_prod1.groupby(['order_id', 'product_name'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [None]:
basket_matrix_1.head()

In [None]:
product_list_2 = product_list[10000:]
mask_prod2 = baskets_50k['product_name'].isin(product_list_2)
baskets_prod2 = baskets_50k.loc[mask_prod2]
# pivot the dataset
basket_matrix_2 = (baskets_prod2.groupby(['order_id', 'product_name'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [None]:
basket_matrix_2.head()

In [None]:
basket_matrix_2.shape

In [None]:
# product_list_3 = product_list[20000:30000]
# mask_prod3 = baskets_50k['product_name'].isin(product_list_3)
# baskets_prod3 = baskets_50k.loc[mask_prod3]
# # pivot the dataset
# basket_matrix_3 = (baskets_prod3.groupby(['user_id', 'product_name'])['product_count']
#           .sum().unstack().reset_index().fillna(0)
#           .set_index('user_id'))

In [None]:
# basket_matrix_3.head()

In [None]:
# product_list_4 = product_list[30000:40000]
# mask_prod4 = baskets_50k['product_name'].isin(product_list_4)
# baskets_prod4 = baskets_50k.loc[mask_prod4]
# # pivot the dataset
# basket_matrix_4 = (baskets_prod4.groupby(['user_id', 'product_name'])['product_count']
#           .sum().unstack().reset_index().fillna(0)
#           .set_index('user_id'))

In [None]:
# basket_matrix_4.head()

In [None]:
# product_list_5 = product_list[40000:]
# mask_prod5 = baskets_50k['product_name'].isin(product_list_5)
# baskets_prod5 = baskets_50k.loc[mask_prod5]
# # pivot the dataset
# basket_matrix_5 = (baskets_prod5.groupby(['user_id', 'product_name'])['product_count']
#           .sum().unstack().reset_index().fillna(0)
#           .set_index('user_id'))

In [None]:
# basket_matrix_5.head()

#### Let's merge all the small dataframes into a large one

In [None]:
print(basket_matrix_1.shape)
print(basket_matrix_2.shape)
# print(basket_matrix_3.shape)
# print(basket_matrix_4.shape)
# print(basket_matrix_5.shape)

In [None]:
basket_matrix_usr = basket_matrix_1.merge(basket_matrix_2, 
                      how='outer', 
                      on='order_id')

In [None]:
# orders that appear in only one chunk get NaN for the other chunk's products
basket_matrix_usr.fillna(0, inplace=True)
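
To see why the outer merge needs the zero fill, here is a small sketch with two toy chunk matrices. The frames `left` and `right` are made-up stand-ins for `basket_matrix_1` and `basket_matrix_2`.

```python
import pandas as pd

# Two toy chunk matrices indexed by order_id, each covering different products.
left = pd.DataFrame(
    {"Eggs": [1.0, 0.0], "Milk": [1.0, 1.0]},
    index=pd.Index([10, 11], name="order_id"),
)
right = pd.DataFrame(
    {"Bread": [1.0, 1.0]},
    index=pd.Index([11, 12], name="order_id"),
)

# Outer merge keeps orders found in either chunk; missing cells become NaN.
merged = left.merge(right, how="outer", left_index=True, right_index=True)

# A missing cell means "product not in this order", so 0 is the right fill.
combined = merged.fillna(0)
print(combined)
```

Order 10 has no row in `right` and order 12 has no row in `left`, so both would carry NaN values without the fill.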

In [None]:
basket_matrix_usr.shape

In [None]:
basket_matrix_usr.isnull().sum()

### Let's run our first model

In [None]:
def calculate_similarity(dataframe):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    data_sparse = sparse.csr_matrix(dataframe)
    similarities = cosine_similarity(data_sparse.transpose())
    sim = pd.DataFrame(data=similarities, index=dataframe.columns, columns=dataframe.columns)
    return sim
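
For intuition, the column-wise cosine similarity can be worked out by hand on a tiny matrix. This sketch uses plain NumPy rather than `cosine_similarity`, and the toy orders are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy order-by-product matrix: 1 means the product was in the order.
toy = pd.DataFrame(
    {"Bread": [1, 0, 1, 0], "Eggs": [1, 1, 0, 0], "Milk": [1, 1, 1, 1]},
    index=pd.Index([10, 11, 12, 13], name="order_id"),
)

X = toy.to_numpy(dtype=float)
dot = X.T @ X                  # entry (i, j): orders containing both products
norms = np.sqrt(np.diag(dot))  # Euclidean length of each product column

# cosine(i, j) = dot(i, j) / (|i| * |j|)
sim_toy = pd.DataFrame(
    dot / np.outer(norms, norms), index=toy.columns, columns=toy.columns
)
print(sim_toy.round(3))
```

Bread and Eggs share one order and each appears in two, so their similarity is 1 / (sqrt(2) * sqrt(2)) = 0.5.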

In [None]:
data_matrix = calculate_similarity(basket_matrix_usr)

In [None]:
product_list

In [None]:
print(data_matrix.loc['Teriyaki Veggie Burgers'].nlargest(11))
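
Note that a product's similarity with itself is 1, so `nlargest(11)` typically returns the product itself plus its 10 nearest neighbors. A small helper that drops the product first might look like this; `top_similar` and the toy matrix are hypothetical names and values, not part of the notebook.

```python
import pandas as pd

def top_similar(sim, product, n=10):
    """Return the n most similar products, excluding the product itself."""
    return sim.loc[product].drop(labels=product).nlargest(n)

# Toy similarity matrix with made-up values.
names = ["Bread", "Eggs", "Milk"]
sim_small = pd.DataFrame(
    [[1.0, 0.5, 0.7],
     [0.5, 1.0, 0.2],
     [0.7, 0.2, 1.0]],
    index=names, columns=names,
)

top = top_similar(sim_small, "Bread", n=2)
print(top)
```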

In [None]:
data_matrix.to_csv('../../data/04_models/similarity_matrix.csv')

### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommender Systems: The Textbook (pg. 160)

***

In [None]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
basket_purcahse_count_samp.head()

In [None]:
usr_matrix

In [None]:
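basket_sets = basket_matrix_usr.astype(bool)  # assumption: basket_sets is the boolean order-by-product matrix built above
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)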
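
The metrics behind association rules (support, confidence, lift) are simple ratios and can be checked by hand. The sketch below computes them directly with pandas on a toy one-hot table. It is a hand-rolled illustration, not mlxtend's implementation, and `rule_metrics` and `basket_sets_toy` are hypothetical names.

```python
import pandas as pd

# Toy one-hot baskets: each row is an order, each column a product.
basket_sets_toy = pd.DataFrame({
    "Bread": [True, False, True, False],
    "Eggs":  [True, True, False, False],
    "Milk":  [True, True, True, False],
})

def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    support_a = baskets[antecedent].mean()   # share of orders with antecedent
    support_c = baskets[consequent].mean()   # share of orders with consequent
    support_both = (baskets[antecedent] & baskets[consequent]).mean()
    confidence = support_both / support_a    # P(consequent | antecedent)
    lift = confidence / support_c            # > 1 suggests positive association
    return {"support": support_both, "confidence": confidence, "lift": lift}

metrics = rule_metrics(basket_sets_toy, "Eggs", "Milk")
print(metrics)
```

Here every order containing Eggs also contains Milk, so the confidence of Eggs -> Milk is 1.0 and the lift is 1.0 / 0.75.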

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similarity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* **Item-item collaborative filtering (Medium article)**: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use Pyspark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921
* **Association rule mining with the Apriori algorithm**: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

***
* **Practical Business Python**: https://pbpython.com/market-basket-analysis.html
* **Market Basket Analysis Notebook**: https://github.com/chris1610/pbpython/blob/master/notebooks/Market_Basket_Intro.ipynb

**Memory-based methods**
1. **User-based collaborative filtering**: In this model, products are recommended to a user because they have been liked by similar users. For example, if Derrick and Dennis like the same movies and a new movie comes out that Derrick likes, then we can recommend that movie to Dennis, because the two seem to share the same taste.
1. **Item-based collaborative filtering**: These systems identify similar items based on users’ previous ratings. For example, if users A, B and C gave a 5-star rating to books X and Y, then when user D buys book Y they also get a recommendation to purchase book X, because the system identifies books X and Y as similar based on the ratings of users A, B and C.
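
The two memory-based approaches differ only in which axis of the ratings matrix is compared. The sketch below mirrors the book example on a toy ratings table. The ratings are invented, and treating an unrated book as 0 is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Toy ratings: rows are users, columns are books, 0 means "not rated".
ratings = pd.DataFrame(
    {"Book X": [5, 5, 5, 0], "Book Y": [5, 5, 5, 5], "Book Z": [0, 1, 0, 0]},
    index=["A", "B", "C", "D"],
    dtype=float,
)

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms
    return unit @ unit.T

values = ratings.to_numpy()

# User-based: compare rows (users) with each other.
user_sim = pd.DataFrame(cosine_matrix(values), index=ratings.index, columns=ratings.index)

# Item-based: compare columns (books) with each other.
item_sim = pd.DataFrame(cosine_matrix(values.T), index=ratings.columns, columns=ratings.columns)

# Item-based recommendation for user D, who only has Book Y.
best_for_d = item_sim["Book Y"].drop("Book Y").idxmax()
print(best_for_d)
```

Because users A, B and C rated books X and Y identically, book X comes out as the closest neighbor of book Y.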