# Neighborhood Rules - Similarity: Instacart

## Load & Combine the Datasets

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import time
import random

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pickle

## Similarity Modules as Recommendation Engine

Source: https://surprise.readthedocs.io/en/stable/similarities.html

In [2]:
baskets = pd.read_csv('../../data/05_model_output/baskets_newprodlist_2.csv')

In [None]:
# Junk product labels (bare numbers and fragments) to drop from the new product list.
bad_names = ['1', '100', '11', '118', '2', '24', '3', '3 cheese', '30', '328', '4', '5', '50', '6', '6 cheese', '60', '7', '70', '8', '85', '9', '95', '97', '98', 'a', 'a garlic butter sauce', 'nan']
# Comparing with `!= np.nan` is always True, so missing values are tested with notna() instead.
mask = ~baskets['new_prod_list'].isin(bad_names) & baskets['new_prod_list'].notna()
baskets = baskets[mask]

The new product list produced by my CRF model reduced the number of products from 24k to just over 4k.

In [None]:
print('Number of Products After Running Names through CRF Model: ',baskets.new_prod_list.nunique())
print('Number of products in the original list: ',baskets.product_name.nunique())
print('Number of unique users: ',baskets.user_id.nunique())

In [None]:
baskets.shape

We have over 200k unique users. That is more than my computer can handle, so I am going to take a subsample of 100k users and go from there.

In [None]:
insta_users_lst = list(baskets.user_id.unique())
len(insta_users_lst)

Let's take a random sample of 100k of these user IDs.

In [None]:
random_usrids_100k = random.sample(insta_users_lst, 100000)
mask = baskets['user_id'].isin(random_usrids_100k)
baskets_100k = baskets.loc[mask]
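Because `random.sample` is unseeded here, the subsample changes on every run. A minimal sketch of a reproducible variant follows; the helper name `sample_users`, the seed value, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import random

import pandas as pd


def sample_users(df, n_users, seed=42):
    """Keep only the rows of a reproducible random sample of user IDs.

    `seed` is an arbitrary illustrative value, not one used in the notebook.
    """
    user_ids = sorted(df['user_id'].unique())  # fixed order so the seed is meaningful
    rng = random.Random(seed)                  # local generator, global state untouched
    chosen = rng.sample(user_ids, min(n_users, len(user_ids)))
    # .copy() returns an independent frame, so later in-place edits are safe
    return df.loc[df['user_id'].isin(chosen)].copy()


# Toy stand-in for `baskets`
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 4],
    'new_prod_list': ['milk', 'eggs', 'milk', 'bread', 'eggs', 'potato'],
})
toy_sample = sample_users(toy, n_users=2)
print(toy_sample['user_id'].nunique())
```

Returning a copy also avoids pandas' chained-assignment warnings when columns are dropped from the subsample afterwards.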

In [None]:
print('Number of User IDs: ',baskets_100k.user_id.nunique())

## Let's drop columns and get everything into the right shape

In [None]:
# Reassign rather than modify in place: baskets_100k is a slice of baskets.
baskets_100k = baskets_100k.drop(columns=['user_id'])

In [None]:
baskets_100k.head()

In [None]:
# reset_index() keeps the old index as an 'index' column; dropna() removes incomplete rows.
baskets_100k = baskets_100k.reset_index().dropna()
baskets_100k.new_prod_list.nunique()

### Break things up into chunks of 1,000 products

We keep getting an unstack overflow error when pivoting every order and product at once. Let's break the dataset up further into chunks of products.

In [None]:
product_list = list(baskets_100k.new_prod_list.unique())

#### Product List 1

Let's drop the index and old product name columns.

In [None]:
baskets_complete = baskets_100k.drop(columns=['index', 'product_name'])

In [None]:
product_list_1 = product_list[:1000]
mask_prod1 = baskets_complete['new_prod_list'].isin(product_list_1)
baskets_prod1 = baskets_complete.loc[mask_prod1]
# pivot the dataset
basket_matrix_1 = (baskets_prod1.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [None]:
product_list_2 = product_list[1000:2000]
mask_prod2 = baskets_complete['new_prod_list'].isin(product_list_2)
baskets_prod2 = baskets_complete.loc[mask_prod2]
# pivot the dataset
basket_matrix_2 = (baskets_prod2.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [None]:
product_list_3 = product_list[2000:3000]
mask_prod3 = baskets_complete['new_prod_list'].isin(product_list_3)
baskets_prod3 = baskets_complete.loc[mask_prod3]
# pivot the dataset
basket_matrix_3 = (baskets_prod3.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))

In [None]:
product_list_4 = product_list[3000:]
mask_prod4 = baskets_complete['new_prod_list'].isin(product_list_4)
baskets_prod4 = baskets_complete.loc[mask_prod4]
# pivot the dataset
basket_matrix_4 = (baskets_prod4.groupby(['order_id', 'new_prod_list'])['all_ones']
          .sum().unstack().reset_index().fillna(0)
          .set_index('order_id'))
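One possible way around the unstack overflow is to skip the dense pivot and build the order-by-product matrix directly in sparse form. The sketch below runs on toy data; the helper name `baskets_to_sparse` is hypothetical, and the column names `order_id`, `new_prod_list` and `all_ones` are taken from the cells above.

```python
import pandas as pd
from scipy import sparse


def baskets_to_sparse(df):
    """Build a sparse order x product matrix without calling unstack()."""
    orders = df['order_id'].astype('category')
    products = df['new_prod_list'].astype('category')
    # Duplicate (order, product) pairs are summed when the matrix is built,
    # which mirrors the groupby(...).sum() used in the pivot cells.
    matrix = sparse.csr_matrix(
        (df['all_ones'].to_numpy(),
         (orders.cat.codes.to_numpy(), products.cat.codes.to_numpy())),
        shape=(len(orders.cat.categories), len(products.cat.categories)),
    )
    return matrix, orders.cat.categories, products.cat.categories


# Toy stand-in for `baskets_complete`
toy = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 3],
    'new_prod_list': ['milk', 'eggs', 'milk', 'bread', 'eggs'],
    'all_ones': [1, 1, 1, 1, 1],
})
matrix, order_labels, product_labels = baskets_to_sparse(toy)
print(list(product_labels))
print(matrix.toarray())
```

The returned labels map matrix rows and columns back to order IDs and product names, so the sparse matrix can be passed to `cosine_similarity` without ever materialising the full table.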

#### Let's merge all the small dataframes into a large one

In [None]:
print(basket_matrix_1.shape)
print(basket_matrix_2.shape)
print(basket_matrix_3.shape)
print(basket_matrix_4.shape)

In [None]:
matrix1 = basket_matrix_1.merge(basket_matrix_2, 
                      how='outer', 
                      on='order_id')

In [None]:
matrix2 = matrix1.merge(basket_matrix_3, 
                      how='outer', 
                      on='order_id')

In [None]:
basket_matrix_usr = matrix2.merge(basket_matrix_4, 
                      how='outer', 
                      on='order_id')
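The chained outer merges above leave `NaN` wherever an order has no products in a given chunk. A small sketch of an alternative, assuming each chunk is indexed by `order_id` as in the pivot cells: align all chunks in one `pd.concat` call and fill the gaps with 0. The toy frames `chunk_a` and `chunk_b` are made up for illustration.

```python
import pandas as pd

# Two toy pivoted chunks, both indexed by order_id
chunk_a = pd.DataFrame({'eggs': [1.0, 0.0]},
                       index=pd.Index([1, 2], name='order_id'))
chunk_b = pd.DataFrame({'milk': [1.0, 1.0]},
                       index=pd.Index([2, 3], name='order_id'))

# axis=1 places the product columns side by side and takes the union of the
# order IDs; orders missing from a chunk become NaN and are then set to 0.
combined = pd.concat([chunk_a, chunk_b], axis=1).fillna(0).sort_index()
print(combined)
```

With more chunks, the same call takes a longer list, so no intermediate `matrix1` and `matrix2` frames are needed.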

In [None]:
basket_matrix_usr.to_csv('../../data/03_processed/basket_matrix_usr.csv', index=False)

### Let's run our model

Let's train our model with a script located in the src data folder and write the result to the `data/05_model_output` folder.

#### item-item calculations

In [None]:
def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    data_sparse = sparse.csr_matrix(data_items)
    similarities = cosine_similarity(data_sparse.transpose())
    sim = pd.DataFrame(data=similarities, index=data_items.columns, columns=data_items.columns)
    return sim
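To make the function concrete, here is a small worked example on a made-up three-order basket matrix. The function is repeated so that the block runs on its own; the toy data is an assumption. For two product columns $a$ and $b$, the entry is $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$.

```python
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity


def calculate_similarity(data_items):
    """Column-wise cosine similarity, same logic as the cell above."""
    data_sparse = sparse.csr_matrix(data_items)
    similarities = cosine_similarity(data_sparse.transpose())
    return pd.DataFrame(data=similarities,
                        index=data_items.columns,
                        columns=data_items.columns)


# Rows are orders, columns are products (1 = product was in the order)
toy_baskets = pd.DataFrame(
    {'bread': [1, 1, 0], 'eggs': [1, 0, 0], 'milk': [0, 1, 1]},
    index=pd.Index([101, 102, 103], name='order_id'),
)
toy_sim = calculate_similarity(toy_baskets)
print(toy_sim.round(3))
```

Bread and milk share one order out of two each, giving 1 / (√2 · √2) = 0.5, while eggs and milk never co-occur and score 0.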

In [3]:
data_matrix = pd.read_csv('../../data/05_model_output/data_matrix_sim.csv')

In [4]:
data_matrix.set_index('Unnamed: 0', inplace=True)

In [5]:
print(data_matrix.loc['potato'].nlargest(11))

potato     1.000000
onion      0.120991
milk       0.107052
tomato     0.106734
carrots    0.101938
cheese     0.098204
garlic     0.097276
avocado    0.096195
eggs       0.092582
apple      0.092447
butter     0.092278
Name: potato, dtype: float64


#### user-item model

In [8]:
# Construct a new dataframe with the 10 closest neighbours (most similar)
# for each product.
data_neighbours = pd.DataFrame(index=data_matrix.columns, columns=range(1,11))
for i in range(0, len(data_matrix.columns)):
    # .ix is deprecated (removed in pandas 1.0); .iloc does the same positional lookup.
    data_neighbours.iloc[i, :10] = data_matrix.iloc[:, i].sort_values(ascending=False)[:10].index.tolist()
# Get the products the user has bought.
known_user_likes = ['avocados', 'blueberries', 'bell pepper', 'onions', 'chicken', 'beef', 'chocolate']

# Construct the neighbourhood from the most similar items to the
# ones our user has already liked.
most_similar_to_likes = data_neighbours.loc[known_user_likes]
similar_list = most_similar_to_likes.values.tolist()
similar_list = list(set([item for sublist in similar_list for item in sublist]))

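The cell above stops once the neighbourhood list is built. One way to turn it into ranked recommendations is to score each candidate by its summed similarity to the products the user already likes. This is a sketch under assumptions, not the notebook's method: the helpers `top_neighbours` and `recommend` and the toy similarity values are hypothetical, and unlike the cell above each product is excluded from its own neighbour list.

```python
import pandas as pd

items = ['avocado', 'onion', 'milk', 'eggs']
# Toy symmetric item-item similarity matrix (made-up values)
sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.0],
     [0.6, 1.0, 0.3, 0.2],
     [0.1, 0.3, 1.0, 0.5],
     [0.0, 0.2, 0.5, 1.0]],
    index=items, columns=items,
)


def top_neighbours(sim, k):
    """For each item, list its k most similar other items."""
    rows = {}
    for item in sim.columns:
        ranked = sim[item].drop(item).sort_values(ascending=False)
        rows[item] = ranked.index[:k].tolist()
    return pd.DataFrame.from_dict(rows, orient='index', columns=range(1, k + 1))


def recommend(sim, liked, k=2, n=3):
    """Rank neighbours of the liked items by summed similarity to them."""
    neighbours = top_neighbours(sim, k)
    # Candidate pool: neighbours of anything liked, minus the liked items
    candidates = set(neighbours.loc[liked].to_numpy().ravel()) - set(liked)
    scores = sim.loc[liked, sorted(candidates)].sum(axis=0)
    return scores.sort_values(ascending=False).head(n)


recs = recommend(sim, liked=['avocado', 'eggs'])
print(recs)
```

Note that `.loc` raises a `KeyError` for any liked label that is missing from the matrix, so the names in a list like `known_user_likes` need to match the product labels exactly.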


***

**Source**:

* https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* http://www.moorissatjokro.com/#home
* https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d
* **Possible Algorithm to use**: https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering
* **Similarity Models**: https://surprise.readthedocs.io/en/stable/similarities.html
* **Association Rule Learning**: https://en.wikipedia.org/wiki/Association_rule_learning
* **Item-item collaborative filtering (Medium article)**: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3
* **How to use PySpark and AWS**: https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921
* **Association rule mining with the Apriori algorithm**: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

***
* **Practical Business Python**: https://pbpython.com/market-basket-analysis.html
* **Market Basket Analysis Notebook**: https://github.com/chris1610/pbpython/blob/master/notebooks/Market_Basket_Intro.ipynb

**Memory-based methods**
1. **User-based collaborative filtering**: In this model, products are recommended to a user because similar users have liked them. For example, if Derrick and Dennis like the same movies and a new movie comes out that Derrick likes, then we can recommend that movie to Dennis.
1. **Item-based collaborative filtering**: These systems identify similar items based on users’ previous ratings. For example, if users A, B and C gave a 5-star rating to books X and Y, then when user D buys book Y they also get a recommendation to purchase book X, because the system identifies books X and Y as similar based on the ratings of users A, B and C.
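Both memory-based methods can be sketched from the same ratings matrix: comparing rows gives user-user similarity, and comparing columns gives item-item similarity. The ratings below are made up to mirror the book example (users A to D, books X, Y and a third book Z), and cosine similarity is only one of several possible similarity measures.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are books; 0 means "not rated"
ratings = pd.DataFrame(
    {'X': [5, 5, 5, 0], 'Y': [5, 5, 5, 5], 'Z': [0, 0, 1, 0]},
    index=['A', 'B', 'C', 'D'],
)

# User-based: compare the rows (users) with each other
user_sim = pd.DataFrame(cosine_similarity(ratings),
                        index=ratings.index, columns=ratings.index)

# Item-based: compare the columns (books) by transposing first
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

# Book most similar to Y, excluding Y itself
closest_to_y = item_sim.loc['Y'].drop('Y').idxmax()
# User most similar to A, excluding A
closest_to_a = user_sim.loc['A'].drop('A').idxmax()
print(closest_to_y, closest_to_a)
```

In this toy data, book X comes out as the closest match to Y, so a user who bought only Y (user D) would be recommended X, as in the item-based description above.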