# Neighborhood Rules - Similarity: Instacart

## Load & Combine the Datasets

In [28]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time
import random

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

In [29]:
baskets = pd.read_csv('../../data/02_intermediate/baskets_spark.csv')

In [30]:
baskets.columns

Index(['Unnamed: 0', 'order_id', 'product_name', 'user_id', 'product_count'], dtype='object')

In [31]:
baskets.drop(columns=['Unnamed: 0'], inplace=True)

In [32]:
baskets.head()

Unnamed: 0,order_id,product_name,user_id,product_count
0,2,Organic Egg Whites,202279,1
1,2,Michigan Organic Kale,202279,1
2,2,Garlic Powder,202279,1
3,2,Coconut Butter,202279,1
4,2,Natural Sweetener,202279,1


In [33]:
baskets.shape

(32434489, 4)

In [34]:
baskets.user_id.nunique()

206209

We have over 200k unique users. That is more than my computer can handle, so I am going to take a random subsample of 50k users and work from there.

In [35]:
insta_users_lst = list(baskets.user_id.unique())

In [36]:
len(insta_users_lst)

206209

Let's take a random sample of 50k of these user IDs.

In [37]:
random_usrids_50k = random.sample(insta_users_lst, 50000)

In [38]:
mask = baskets['user_id'].isin(random_usrids_50k)

In [39]:
baskets_50k = baskets.loc[mask]

In [40]:
baskets_50k.user_id.nunique()

50000

In [41]:
len(baskets_50k)

7876775
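
`random.sample` draws a different set of users on every run, so the row count above will change between sessions. A minimal sketch of a repeatable draw, assuming a seeded NumPy generator is acceptable here; the toy frame and the seed value are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `baskets` frame (the real one has about 206k users).
toy_baskets = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4, 5, 5],
    "product_name": ["a", "b", "a", "c", "d", "a", "b", "c"],
    "product_count": [1, 1, 1, 1, 1, 1, 1, 1],
})

# A seeded generator makes the sample repeatable across kernel restarts.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
user_ids = toy_baskets["user_id"].unique()
sampled_ids = rng.choice(user_ids, size=3, replace=False)

# Keep only the rows that belong to the sampled users.
toy_sample = toy_baskets.loc[toy_baskets["user_id"].isin(sampled_ids)]
print(toy_sample["user_id"].nunique())
```

On the real data, `size=3` would become `size=50000` and `toy_baskets` would be `baskets`.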

## Let's drop columns and get everything into the right shape

In [42]:
baskets_50k.drop(columns=['order_id'], inplace=True)

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
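
The warning appears because `baskets_50k` was created by slicing `baskets`, and pandas cannot tell whether an in-place change is meant for the slice or for the parent frame. Two common ways to avoid it are sketched below on a toy frame (the data is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [2, 2, 3],
    "product_name": ["Garlic Powder", "Coconut Butter", "Garlic Powder"],
    "user_id": [202279, 202279, 178520],
    "product_count": [1, 1, 1],
})
mask = toy["user_id"].isin([202279])

# Option 1: take an explicit copy, so later in-place edits are unambiguous.
subset = toy.loc[mask].copy()
subset.drop(columns=["order_id"], inplace=True)

# Option 2: skip inplace and bind the name to the frame that drop() returns.
subset_alt = toy.loc[mask].drop(columns=["order_id"])

print(list(subset.columns))
```

Either option leaves the parent frame untouched and produces the same result.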


In [43]:
baskets_50k.head()

Unnamed: 0,product_name,user_id,product_count
17,Plain Pre-Sliced Bagels,178520,1
18,Honey/Lemon Cough Drops,178520,1
19,Chewy 25% Low Sugar Chocolate Chip Granola,178520,1
20,Oats & Chocolate Chewy Bars,178520,1
21,Kellogg's Nutri-Grain Apple Cinnamon Cereal,178520,1


df_matrix = pd.pivot_table(baskets_50k, values='product_count', index='user_id', columns='product_name')

In [45]:
baskets_50k.reset_index(inplace=True)

In [46]:
baskets_50k.product_name.nunique()

46945

### Break things up into 10k different products

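We keep getting an unstack overflow error because the table is too large: 50,000 users by 46,945 products is about 2.3 billion cells, which is more than a 32-bit integer can index. Let's break up the dataset further into types of products.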
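
The notebook does not show how the split was done. As one possible alternative (an assumption, not the approach described above), the product list can be cut down by keeping only the most frequently purchased products. The toy data and the `top_n` value are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "product_name": ["Kale", "Bagels", "Kale", "Salt", "Kale", "Bagels", "Mint"],
    "product_count": [1, 1, 1, 1, 1, 1, 1],
})

top_n = 2  # hypothetical cut-off; the heading above suggests 10,000

# Total units purchased per product, then the names of the top_n products.
totals = toy.groupby("product_name")["product_count"].sum()
keep = totals.nlargest(top_n).index

# Drop every row whose product is outside the top_n set.
toy_top = toy[toy["product_name"].isin(keep)]
print(sorted(toy_top["product_name"].unique()))
```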

In [None]:
ba

### We are finally ready to put our data in the right shape

In [44]:
basket_matrix = (baskets_50k.groupby(['user_id', 'product_name'])['product_count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('user_id'))  # rows are keyed by user_id; order_id was dropped earlier

ValueError: Unstacked DataFrame is too big, causing int32 overflow
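
The overflow comes from asking pandas to materialize a dense users-by-products table. One way around it, sketched here under the assumption that a SciPy sparse matrix is an acceptable replacement for the DataFrame, is to build the user-item matrix directly from integer codes. The toy data and the variable names are illustrative:

```python
import pandas as pd
from scipy import sparse

toy = pd.DataFrame({
    "user_id": [10, 10, 20, 30, 30, 30],
    "product_name": ["Kale", "Bagels", "Kale", "Bagels", "Kale", "Kale"],
    "product_count": [1, 2, 1, 1, 1, 3],
})

# Map users and products to contiguous integer codes.
user_cat = toy["user_id"].astype("category")
prod_cat = toy["product_name"].astype("category")

# Duplicate (row, col) pairs are summed on conversion to CSR, which plays
# the role of groupby(...).sum().unstack().fillna(0) without a dense table.
user_item = sparse.coo_matrix(
    (toy["product_count"].to_numpy(),
     (user_cat.cat.codes.to_numpy(), prod_cat.cat.codes.to_numpy())),
    shape=(len(user_cat.cat.categories), len(prod_cat.cat.categories)),
).tocsr()

user_index = user_cat.cat.categories      # row labels
product_index = prod_cat.cat.categories   # column labels
print(user_item.shape)
```

Only the non-zero purchases are stored, so memory grows with the number of (user, product) pairs rather than with users times products.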

In [None]:
# usr_matrix.shape

### Let's run our first model

In [None]:
def calculate_similarity(dataframe):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    data_sparse = sparse.csr_matrix(dataframe)
    similarities = cosine_similarity(data_sparse.transpose())
    sim = pd.DataFrame(data=similarities, index=dataframe.columns, columns=dataframe.columns)
    return sim
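
With 46,945 products, a dense product-by-product similarity matrix of 64-bit floats would need roughly 17.6 GB. A hedged variant that keeps the result sparse is sketched below; `item_similarity` is a hypothetical helper name, while `dense_output=False` is an existing parameter of scikit-learn's `cosine_similarity`:

```python
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def item_similarity(user_item, product_names):
    """Column-wise cosine similarity, returned as a sparse matrix."""
    sims = cosine_similarity(user_item.T, dense_output=False)
    return sims, list(product_names)

# Toy user-item matrix: rows are users, columns are products.
toy_user_item = sparse.csr_matrix([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
names = ["Bagels", "Cream Cheese", "Kale"]
sims, labels = item_similarity(toy_user_item, names)

# Densify only a small matrix like this one, for inspection.
sim_df = pd.DataFrame(sims.toarray(), index=labels, columns=labels)
print(sim_df.round(2))
```

In the toy data, Bagels and Cream Cheese are always bought together, so their similarity is 1, while neither is ever bought with Kale.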

In [None]:
# data_matrix = calculate_similarity(usr_matrix)

In [None]:
# list(basket_purcahse_count_samp.product_name.unique())

In [None]:
# print(data_matrix.loc['Coke Classic'].nlargest(11))

### Association Rule Machine Learning Algorithm 

**Sources**: 

1. How to build your own algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html
1. Association Rule Wikipedia: https://en.wikipedia.org/wiki/Association_rule_learning
1. Rule-based collaborative filtering: Recommender Systems: The Textbook (p. 160)

***

In [None]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
basket_purcahse_count_samp.head()

In [None]:
usr_matrix

In [None]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
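
In the call above, `min_support=0.07` keeps only itemsets that appear in at least 7% of baskets. The quantities behind association rules (support, confidence and lift) can be worked out by hand on a toy basket table. This is a plain-pandas sketch for illustration, not the mlxtend implementation; `basket_sets_toy` and `rule_metrics` are hypothetical names:

```python
import pandas as pd

# One row per basket, one boolean column per product (toy data).
basket_sets_toy = pd.DataFrame({
    "Bagels":       [True, True, True, False],
    "Cream Cheese": [True, True, False, False],
    "Kale":         [False, True, True, True],
})

def rule_metrics(df, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    support_a = df[antecedent].mean()                       # P(A)
    support_c = df[consequent].mean()                       # P(C)
    support_both = (df[antecedent] & df[consequent]).mean() # P(A and C)
    confidence = support_both / support_a                   # P(C | A)
    lift = confidence / support_c                           # P(C | A) / P(C)
    return support_both, confidence, lift

support, confidence, lift = rule_metrics(basket_sets_toy, "Bagels", "Cream Cheese")
print(support, confidence, lift)
```

A lift above 1 means the two products show up together more often than they would if purchases were independent.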

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similarity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* **Item-item collaborative filtering (Medium article)**: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use PySpark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921
* **Association rule mining via the Apriori algorithm**: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

***
* **Practical Business Python**: https://pbpython.com/market-basket-analysis.html
* **Market Basket Analysis Notebook**: https://github.com/chris1610/pbpython/blob/master/notebooks/Market_Basket_Intro.ipynb

**Memory-based methods**
1. **User-based collaborative filtering**: In this model, products are recommended to a user because similar users liked them. For example, if Derrick and Dennis like the same movies and a new movie comes out that Derrick likes, then we can recommend that movie to Dennis.
1. **Item-based collaborative filtering**: These systems identify similar items based on users' previous ratings. For example, if users A, B and C gave a 5-star rating to books X and Y, then when user D buys book Y they also get a recommendation for book X, because the system identifies books X and Y as similar based on the ratings of users A, B and C.
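
The book example can be sketched in a few lines. This is a minimal illustration that assumes binary "liked" data and cosine similarity; the matrix, the added book Z and the scoring rule are made up for the example:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users A, B, C, D; columns are books X, Y, Z (1 = liked).
items = ["X", "Y", "Z"]
ratings = np.array([
    [1, 1, 0],   # A
    [1, 1, 0],   # B
    [1, 1, 0],   # C
    [0, 1, 0],   # D has only book Y so far
])

item_sim = cosine_similarity(ratings.T)   # item-based: compare columns
user_sim = cosine_similarity(ratings)     # user-based: compare rows

# Item-based scores for user D: sum the similarities to the items D has.
d = ratings[3]
scores = item_sim @ d
scores[d > 0] = -np.inf                   # do not re-recommend owned items
recommended = items[int(np.argmax(scores))]
print(recommended)
```

The user-based variant would use `user_sim` instead: find the users most similar to D and recommend what they liked that D has not seen.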