# Product Recommendation Engine Analysis
Product Lead: Jeeth Joseph 
Based on the ideas so elegantly put by Moorissa Tjokro

## Problem statement
The steps below aim to recommend each user the top 10 items to place in their basket. The final output will be a CSV file in the `output` folder, along with a function that looks up the recommendation list for a specified user:
* Input: user - customer ID
* Returns: ranked list of items (product IDs) that the user is most likely to want to put in his/her (empty) "basket"
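
As a rough illustration of the lookup described above, the sketch below filters a table of ranked recommendations for one customer. The function name `recommend_for_user` and the row layout (dicts with `LYBID`, `ITEMID` and `rank` keys) are assumptions made for this example, not the notebook's final implementation.

```python
def recommend_for_user(rows, user):
    """Return the item IDs recommended to `user`, best rank first."""
    user_rows = [r for r in rows if r["LYBID"] == user]
    user_rows.sort(key=lambda r: r["rank"])
    return [r["ITEMID"] for r in user_rows]


# Toy rows in the same shape as the recommendation tables printed later on.
rows = [
    {"LYBID": 10004, "ITEMID": 35928, "rank": 2},
    {"LYBID": 10004, "ITEMID": 44002, "rank": 1},
    {"LYBID": 10009, "ITEMID": 3560, "rank": 1},
]
print(recommend_for_user(rows, 10004))
```

An unknown customer ID simply yields an empty list here; a production version might fall back to the most popular items instead.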

## 1. Import modules
* `pandas` and `numpy` for data manipulation
* `turicreate` for performing model selection and evaluation
* `sklearn` for splitting the data into train and test set
* `xlrd` for excel import
* `sudo apt-get install libatlas-base-dev` to install a missing system package

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

import sys
sys.path.append("..")
import scripts.data_layer as data_layer

## 2. Load data
A single dataset from the database, which can be found in the `data` folder: 
* Lyb data QUEST JAN WITH PURQTY 10k (to avoid memory error)
* XLSX format
* Possible error expected due to the difference between expected purchase frequency and purchase quantity



In [2]:
s=time.time()

data=pd.read_excel('../data/Lyb data QUEST JAN WITH PURQTY original.xlsx')

print("Import time:", round((time.time()-s)/60,2), "minutes")

print(data.shape)
data.head(2)

Import time: 0.49 minutes
(352444, 3)


Unnamed: 0,LYBID,ITEMID,TotalQtyPurchased
0,10004,29009,1.0
1,10004,33815,2.0


## 3. Create dummy
* Dummy for marking whether a customer bought that item or not.
* If a customer buys an item, then `purchase_dummy` is marked as 1
* Why create a dummy instead of normalizing it?
    * Normalizing the purchase count, say by each user, would not work because customers may have different buying frequencies and do not share the same taste
    * However, we can normalize items by purchase frequency across all users, which is done in section 3.3 below.

In [3]:
def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy

In [4]:
data_dummy = create_data_dummy(data)

In [5]:
print(data_dummy.shape)
data_dummy.head()

(352444, 4)


Unnamed: 0,LYBID,ITEMID,TotalQtyPurchased,purchase_dummy
0,10004,29009,1.0,1
1,10004,33815,2.0,1
2,10004,43517,1.0,1
3,10004,43519,2.0,1
4,10004,43598,1.0,1


### 3.3. Normalize item values across users
* To do this, we normalize purchase frequency of each item across users by first creating a user-item matrix as follows

In [6]:
s=time.time()
df_matrix = pd.pivot_table(data, values='TotalQtyPurchased', index='LYBID', columns='ITEMID')
print("Import time:", round((time.time()-s)/60,2), "minutes")
df_matrix.head()

Import time: 2.49 minutes


LYBID,69,198,200,204,208,325,460,481,504,506,...,44257,44270,44271,44273,44278,44280,44282,44283,44289,44304
10004,,,,,,,,,,,...,,,,,,,,,,
10009,,,,,,,,,,,...,,,,,,,,,,
10012,,,,,,,,,,,...,,,,,,,,,,
10022,,,,,,,,,,,...,,,,,,,,,,
10023,,,,,,,,,,,...,,,,,,,,,,


In [7]:
(df_matrix.shape)

(101140, 1467)

In [8]:
s=time.time()
df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
print("Import time:", round((time.time()-s)/60,2), "minutes")
print(df_matrix_norm.shape)
df_matrix_norm.head()

Import time: 5.38 minutes
(101140, 1467)


LYBID,69,198,200,204,208,325,460,481,504,506,...,44257,44270,44271,44273,44278,44280,44282,44283,44289,44304
10004,,,,,,,,,,,...,,,,,,,,,,
10009,,,,,,,,,,,...,,,,,,,,,,
10012,,,,,,,,,,,...,,,,,,,,,,
10022,,,,,,,,,,,...,,,,,,,,,,
10023,,,,,,,,,,,...,,,,,,,,,,


In [9]:
# create a table for input to the modeling
s=time.time()
d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['LYBID'], value_name='scaled_purchase_freq').dropna()
print("Import time:", round((time.time()-s)/60,2), "minutes")
print(data_norm.shape)
data_norm.head()

Import time: 32.99 minutes
(346351, 3)


Unnamed: 0,LYBID,ITEMID,scaled_purchase_freq
1156,30526012,69,0.0
2055,30559273,69,0.0
2535,30572236,69,0.0
5042,30693525,69,0.0
5341,30697710,69,0.0


#### Define a function for normalizing data

In [10]:
def normalize_data(data):
    # Pivot to a user-item matrix, min-max scale each item column, then melt back to long form
    df_matrix = pd.pivot_table(data, values='TotalQtyPurchased', index='LYBID', columns='ITEMID')
    df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    # id_vars must match the pivot index column ('LYBID'), as in cell 9 above
    return pd.melt(d, id_vars=['LYBID'], value_name='scaled_purchase_freq').dropna()

* This normalizes the purchase history of each item to the range 0-1, with 1 being the highest purchase quantity observed for that item and 0 being the lowest observed quantity (customers who never bought the item stay missing and are dropped).
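
To make the per-item min-max scaling concrete, here is a minimal pure-Python sketch on toy data. The function name `min_max_by_item` and the tuple layout are assumptions for illustration only. It also shows an edge case: when every buyer of an item bought the same quantity, the denominator is zero and the scaled value is undefined. In pandas that division yields NaN, which `dropna()` then removes, and this is one likely reason `data_norm` has fewer rows than `data`.

```python
def min_max_by_item(purchases):
    """Scale each quantity to [0, 1] within its item.

    `purchases` is a list of (user, item, qty) tuples. Items whose minimum and
    maximum quantity coincide have no defined scaled value and are skipped,
    mirroring the NaN that pandas produces for 0/0 followed by dropna().
    """
    lo, hi = {}, {}
    for _, item, qty in purchases:
        lo[item] = min(qty, lo.get(item, qty))
        hi[item] = max(qty, hi.get(item, qty))

    scaled = []
    for user, item, qty in purchases:
        span = hi[item] - lo[item]
        if span == 0:
            continue  # undefined: every buyer bought the same quantity
        scaled.append((user, item, (qty - lo[item]) / span))
    return scaled


toy = [
    ("u1", "A", 1.0), ("u2", "A", 3.0), ("u3", "A", 5.0),
    ("u1", "B", 2.0), ("u2", "B", 2.0),  # constant quantity, dropped
]
scaled = min_max_by_item(toy)
print(scaled)
```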

## 4. Split train and test set
* Splitting the data into training and testing sets is an important part of evaluating predictive modeling, in this case a collaborative filtering model. Typically, we use a larger portion of the data for training and a smaller portion for testing. 
* We use 80:20 ratio for our train-test set size.
* The training portion is used to develop a predictive model, while the test portion is used to evaluate the model's performance.
* Now that we have three datasets with purchase counts, purchase dummy, and scaled purchase counts, we would like to split each.
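
Because the split is random, the train and test sets change on every run unless a seed is fixed (scikit-learn's `train_test_split` accepts a `random_state` argument for this). The sketch below shows the idea in plain Python; `split_rows` and its `seed` parameter are hypothetical names, and its rounding of the test size is illustrative rather than identical to scikit-learn's.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle `rows` with a fixed seed and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train, test = split_rows(rows)
print(len(train), len(test))
```

Calling `split_rows(rows)` twice with the same seed returns the same partition, which makes later model comparisons repeatable.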

In [11]:
train, test = train_test_split(data, test_size = .2)
print(train.shape, test.shape)

(281955, 3) (70489, 3)


In [12]:
# Using turicreate library, we convert dataframe to SFrame - this will be useful in the modeling part

train_data = tc.SFrame(train)
test_data = tc.SFrame(test)

In [13]:
train_data

LYBID,ITEMID,TotalQtyPurchased
31701386,43865,1.0
32220468,43549,1.0
32462468,35545,1.0
32462499,4758,1.0
1031342697,43488,1.0
32152251,21226,1.0
32083418,6621,3.0
1030796590,15886,1.0
32481167,28878,1.0
30578858,38028,1.0


In [14]:
test_data

LYBID,ITEMID,TotalQtyPurchased
1032143190,44168,1.0
30988095,43509,1.0
31854038,43545,2.0
1030719440,43285,1.0
32211823,43646,1.0
31258032,38055,1.0
30954318,35540,1.0
32102938,44003,1.0
31511258,29036,1.0
30979588,43506,1.0


#### Define a `split_data` function for splitting data to training and test set

In [15]:
# We can define a function for this step as follows

def split_data(data):
    '''
    Splits dataset into training and test set.
    
    Args:
        data (pandas.DataFrame)
        
    Returns
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size = .2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

In [16]:
# lets try with both dummy table and scaled/normalized purchase table

train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)

## 5. Baseline Model
Before running a more complicated approach such as collaborative filtering, we would like a baseline model against which to compare and evaluate the other models. Since a baseline typically uses a very simple approach, a more complex technique should be chosen only if it shows clearly better accuracy for its added complexity.

### 5.1. Using a Popularity model as a baseline
* The popularity model takes the most popular items for recommendation. These items are the products with the highest number of sales across customers.
* We use the `turicreate` library for running and evaluating both the baseline and the collaborative filtering models below
* Training data is used for model selection
* The math behind the `turicreate` popularity model is yet to be evaluated
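
Since the library's exact scoring has not been checked here, the following is only a sketch of one plausible reading: each item is scored by the mean of the target column over its training rows, and the highest-scoring items are recommended to everyone. This is an assumption, although fractional scores such as 4.25 in the output further down are consistent with a per-item mean. The names `popularity_scores` and `top_k` are hypothetical.

```python
from collections import defaultdict


def popularity_scores(rows):
    """Map each item to the mean target value over all of its training rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, item, target in rows:
        totals[item] += target
        counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}


def top_k(scores, k):
    """Return the k highest-scoring items, best first."""
    return sorted(scores, key=lambda item: scores[item], reverse=True)[:k]


rows = [
    ("u1", "A", 6.0), ("u2", "A", 6.0),
    ("u1", "B", 4.0), ("u3", "B", 4.5),
    ("u2", "C", 1.0),
]
scores = popularity_scores(rows)
print(top_k(scores, 2))
```

Under this reading, a target that is always 1 (such as `purchase_dummy`) gives every item the same score of 1.0, so the ranking carries little information.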

#### Using purchase counts

In [17]:
# variables to define field names
user_id = 'LYBID'
item_id = 'ITEMID'
target = 'TotalQtyPurchased'
users_to_recommend = list(data[user_id].unique())
n_rec = 5 # number of items to recommend
n_display = 30


In [18]:
popularity_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)

In [52]:
# Get recommendations for a list of users to recommend (from data file)
# Printed below is head / top 30 rows for first 6 customers with 5 recommendations each

popularity_recomm = popularity_model.recommend(users=users_to_recommend, k=n_rec)
popularity_recomm.print_rows(n_display)

+-------+--------+--------------------+------+
| LYBID | ITEMID |       score        | rank |
+-------+--------+--------------------+------+
| 10004 | 44002  |        6.0         |  1   |
| 10004 | 35928  |        6.0         |  2   |
| 10004 |  3560  |        5.0         |  3   |
| 10004 |  3557  |        4.25        |  4   |
| 10004 | 33867  | 3.6666666666666665 |  5   |
| 10009 | 44002  |        6.0         |  1   |
| 10009 | 35928  |        6.0         |  2   |
| 10009 |  3560  |        5.0         |  3   |
| 10009 |  3557  |        4.25        |  4   |
| 10009 | 33867  | 3.6666666666666665 |  5   |
| 10012 | 44002  |        6.0         |  1   |
| 10012 | 35928  |        6.0         |  2   |
| 10012 |  3560  |        5.0         |  3   |
| 10012 |  3557  |        4.25        |  4   |
| 10012 | 33867  | 3.6666666666666665 |  5   |
| 10022 | 44002  |        6.0         |  1   |
| 10022 | 35928  |        6.0         |  2   |
| 10022 |  3560  |        5.0         |  3   |
| 10022 |  35

#### Define a `model` function for model selection

In [20]:
# Since turicreate is a very accessible library, we can define a model selection function as below

def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    # Build the recommender selected by `name`
    if name == 'popularity':
        rec_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)
    elif name in ('cosine', 'pearson'):
        # Both similarity measures use the same recommender; only similarity_type differs
        rec_model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type=name)
    else:
        # Fail early instead of hitting an undefined variable below
        raise ValueError("Unknown model name: " + str(name))
        
    # Print the first n_display rows of the top-n_rec recommendations per user
    recom = rec_model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return rec_model

In [21]:
# variables to define field names
# constant variables include:
user_id = 'LYBID'
item_id = 'ITEMID'
users_to_recommend = list(data[user_id].unique())
n_rec = 5 # number of items to recommend
n_display = 30 # to print the head / first few rows in a defined dataset

#### Using purchase dummy

In [22]:
# these variables will change accordingly
name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+-------+------+
| LYBID | ITEMID | score | rank |
+-------+--------+-------+------+
| 10004 | 33819  |  1.0  |  1   |
| 10004 | 42733  |  1.0  |  2   |
| 10004 | 43931  |  1.0  |  3   |
| 10004 | 37910  |  1.0  |  4   |
| 10004 | 37912  |  1.0  |  5   |
| 10009 | 33819  |  1.0  |  1   |
| 10009 | 42733  |  1.0  |  2   |
| 10009 | 43931  |  1.0  |  3   |
| 10009 | 37910  |  1.0  |  4   |
| 10009 | 37912  |  1.0  |  5   |
| 10012 | 33819  |  1.0  |  1   |
| 10012 | 42733  |  1.0  |  2   |
| 10012 | 43931  |  1.0  |  3   |
| 10012 | 37910  |  1.0  |  4   |
| 10012 | 37912  |  1.0  |  5   |
| 10022 | 33819  |  1.0  |  1   |
| 10022 | 42733  |  1.0  |  2   |
| 10022 | 43931  |  1.0  |  3   |
| 10022 | 37910  |  1.0  |  4   |
| 10022 | 37912  |  1.0  |  5   |
| 10023 | 43901  |  1.0  |  1   |
| 10023 | 33819  |  1.0  |  2   |
| 10023 | 42733  |  1.0  |  3   |
| 10023 | 37910  |  1.0  |  4   |
| 10023 | 37912  |  1.0  |  5   |
| 10029 | 33819  |  1.0  |  1   |
| 10029 | 4273

#### Using normalized purchase count

In [23]:
name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+--------------------+------+
| LYBID | ITEMID |       score        | rank |
+-------+--------+--------------------+------+
| 10004 |  5035  |        1.0         |  1   |
| 10004 | 43563  |        1.0         |  2   |
| 10004 |  3557  | 0.8666666666666667 |  3   |
| 10004 | 29031  | 0.5357142857142857 |  4   |
| 10004 | 35859  |        0.5         |  5   |
| 10009 |  5035  |        1.0         |  1   |
| 10009 | 43563  |        1.0         |  2   |
| 10009 |  3557  | 0.8666666666666667 |  3   |
| 10009 | 29031  | 0.5357142857142857 |  4   |
| 10009 | 35859  |        0.5         |  5   |
| 10012 |  5035  |        1.0         |  1   |
| 10012 | 43563  |        1.0         |  2   |
| 10012 |  3557  | 0.8666666666666667 |  3   |
| 10012 | 29031  | 0.5357142857142857 |  4   |
| 10012 | 35859  |        0.5         |  5   |
| 10022 |  5035  |        1.0         |  1   |
| 10022 | 43563  |        1.0         |  2   |
| 10022 |  3557  | 0.8666666666666667 |  3   |
| 10022 | 290

## 6. Collaborative Filtering Model

* In collaborative filtering, we would recommend items based on how similar users purchase items. For instance, if customer 1 and customer 2 bought similar items, e.g. 1 bought X, Y, Z and 2 bought X, Y, we would recommend an item Z to customer 2.

* To define similarity across users, we use the following steps:
    1. Create a user-item matrix, where index values represent unique customer IDs and column values represent unique product IDs
    
    2. Create an item-to-item similarity matrix. The idea is to calculate how similar one product is to another. There are a number of ways of calculating this. In sections 6.1 and 6.2, we use the cosine and Pearson similarity measures, respectively.  
    
        * To calculate similarity between products X and Y, look at all customers who have bought both these items. For example, both X and Y have been bought by customers 1 and 2. 
        * We then create two item-vectors, v1 for item X and v2 for item Y, in the user-space of (1, 2) and compute the `cosine` or `pearson` similarity between these vectors. For cosine, a zero angle (overlapping vectors, cosine value of 1) means total similarity, i.e. every user gave both items the same relative rating, and an angle of 90 degrees means a cosine of 0, or no similarity.
        
    3. For each customer, we then predict the likelihood of buying a product (or the purchase count) for products that the customer has not yet bought. 
    
        * For our example, we will calculate the rating for user 2 in the case of item Z (target item). To calculate this, we take the just-calculated similarity measures between the target item and the other items that the customer has already bought, and weight them by the purchase counts the user has for those items. 
        * We then scale this weighted sum by the sum of the similarity measures so that the calculated rating remains within predefined limits. Thus, the predicted rating for item Z for user 2 is a similarity-weighted average of the purchase counts of the items user 2 already bought.

* While I wrote Python scripts for the whole process, including the similarity computation (they can be found in the `scripts` folder), we can use the `turicreate` library for now to try different measures such as `cosine` and `pearson` similarity, and evaluate the best model.
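
The prediction step described above can be sketched as follows. This follows the textual description (a similarity-weighted average of the customer's purchase counts) and is not claimed to match `turicreate`'s internal computation. The function name `predict_score`, the similarity values and the dictionary layouts are all assumptions for illustration.

```python
def predict_score(target_item, user_counts, similarity):
    """Similarity-weighted average of the user's purchase counts.

    user_counts: {item: purchase count} for items the user already bought.
    similarity:  {(item_a, item_b): similarity}, keys stored in sorted order.
    """
    weighted_sum = 0.0
    similarity_sum = 0.0
    for item, count in user_counts.items():
        sim = similarity.get(tuple(sorted((target_item, item))), 0.0)
        weighted_sum += sim * count
        similarity_sum += abs(sim)
    if similarity_sum == 0:
        return 0.0  # the target item is unrelated to anything the user bought
    return weighted_sum / similarity_sum


# Made-up similarities of items X and Y to the target item Z.
similarity = {("X", "Z"): 0.8, ("Y", "Z"): 0.4}
customer_2 = {"X": 2, "Y": 1}
predicted = predict_score("Z", customer_2, similarity)
print(predicted)
```

Dividing by the sum of similarities keeps the prediction between the smallest and largest purchase counts of the items involved, which is the "predefined limits" property mentioned above.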

### 6.1. `Cosine` similarity
* Similarity is the cosine of the angle between the item vectors of A and B
* It is defined by the following formula
![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnRHSAx1c084UXF2wIHYwaHJLmq2qKtNk_YIv3RjHUO00xwlkt)
* The closer the vectors, the smaller the angle and the larger the cosine
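
For reference, the standard definition is $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$. A minimal sketch of that formula follows; the function name `cosine_similarity` and the toy vectors are illustrative, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Purchase counts of two items by customers 1 and 2.
item_x = [2, 1]
item_y = [4, 2]  # same direction as item_x, so similarity is close to 1
same_direction = cosine_similarity(item_x, item_y)
orthogonal = cosine_similarity([1, 0], [0, 1])
print(same_direction, orthogonal)
```

Note that cosine similarity ignores vector length: `item_y` has twice the counts of `item_x` yet is treated as fully similar.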

#### Using purchase count

In [24]:
# these variables will change accordingly
name = 'cosine'
target = 'TotalQtyPurchased'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+----------------------+------+
| LYBID | ITEMID |        score         | rank |
+-------+--------+----------------------+------+
| 10004 | 44011  | 0.19692957401275635  |  1   |
| 10004 | 37922  | 0.15904393792152405  |  2   |
| 10004 | 43646  |  0.1293746531009674  |  3   |
| 10004 | 43589  | 0.11456459760665894  |  4   |
| 10004 | 43872  | 0.10788850982983907  |  5   |
| 10009 | 43496  |  0.1773882806301117  |  1   |
| 10009 | 43920  | 0.12164735794067383  |  2   |
| 10009 | 43921  | 0.11765480041503906  |  3   |
| 10009 | 43913  | 0.09768617153167725  |  4   |
| 10009 | 43591  |  0.0933111310005188  |  5   |
| 10012 | 42654  | 0.15234637260437012  |  1   |
| 10012 | 42655  | 0.10382649302482605  |  2   |
| 10012 | 37922  | 0.08947828412055969  |  3   |
| 10012 | 44271  | 0.08785995841026306  |  4   |
| 10012 | 28876  | 0.08511623740196228  |  5   |
| 10022 | 43507  | 0.08691252171993255  |  1   |
| 10022 | 43872  | 0.08076946437358856  |  2   |
| 10022 | 43508  | 0

#### Using purchase dummy

In [25]:
# these variables will change accordingly
name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+----------------------+------+
| LYBID | ITEMID |        score         | rank |
+-------+--------+----------------------+------+
| 10004 | 43508  | 0.053184373038155694 |  1   |
| 10004 | 15862  | 0.04502768175942557  |  2   |
| 10004 | 43987  | 0.043013530118124824 |  3   |
| 10004 | 43507  | 0.04183342627116612  |  4   |
| 10004 | 21089  | 0.04172945874077933  |  5   |
| 10009 | 35578  | 0.05308213829994202  |  1   |
| 10009 | 43515  | 0.04503849148750305  |  2   |
| 10009 | 43604  | 0.043797820806503296 |  3   |
| 10009 | 43679  | 0.02784678339958191  |  4   |
| 10009 | 28876  | 0.027010053396224976 |  5   |
| 10012 | 42655  | 0.12548951307932535  |  1   |
| 10012 | 15862  |  0.0521841843922933  |  2   |
| 10012 | 43442  |  0.0407183567682902  |  3   |
| 10012 | 33584  | 0.037069479624430336 |  4   |
| 10012 | 43604  | 0.03675452868143717  |  5   |
| 10022 | 15862  | 0.03857937124040392  |  1   |
| 10022 | 42653  | 0.03554911414782206  |  2   |
| 10022 | 43507  | 0

#### Using normalized purchase count

In [26]:
name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+-----------------------+------+
| LYBID | ITEMID |         score         | rank |
+-------+--------+-----------------------+------+
| 10004 | 43604  |  0.007880640029907227 |  1   |
| 10004 | 44011  |  0.005519402027130127 |  2   |
| 10004 |  6601  | 0.0043921232223510746 |  3   |
| 10004 | 43600  |  0.004345345497131348 |  4   |
| 10004 |  6600  |  0.003797304630279541 |  5   |
| 10009 | 41055  |  0.018518507480621338 |  1   |
| 10009 | 35849  |  0.00967097282409668  |  2   |
| 10009 | 42674  |  0.008649984995524088 |  3   |
| 10009 | 43937  |  0.004211644331614177 |  4   |
| 10009 | 43305  |  0.003130197525024414 |  5   |
| 10012 | 43861  |  0.01790716250737508  |  1   |
| 10012 | 35583  |  0.01790716250737508  |  2   |
| 10012 | 42655  |  0.014899671077728271 |  3   |
| 10012 | 33576  |  0.014077246189117432 |  4   |
| 10012 | 43587  |  0.013362367947896322 |  5   |
| 10022 | 43488  |  0.005943498190711527 |  1   |
| 10022 | 37922  |  0.005595287855933695 |  2   |


### 6.2. `Pearson` similarity
* Similarity is the Pearson correlation coefficient between the two item vectors.
* It is defined by the following formula
![](http://critical-numbers.group.shef.ac.uk/glossary/images/correlationKT1.png)
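
In its standard form, the coefficient is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$. The sketch below implements that formula directly; `pearson_similarity` is an illustrative name, and returning `None` for the undefined case is a choice made for this example.

```python
import math


def pearson_similarity(a, b):
    """Pearson correlation of two equal-length vectors, or None if undefined."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None  # a constant vector has no defined correlation
    return cov / math.sqrt(var_a * var_b)


correlated = pearson_similarity([1, 2, 3], [2, 4, 6])
constant = pearson_similarity([1, 1, 1], [1, 1, 1])
print(correlated, constant)
```

The constant-vector case matters for this notebook: a column such as `purchase_dummy` is 1 everywhere, so it has zero variance. That may be related to the 0.0 scores in the purchase-dummy output further down, though the link has not been verified against the library.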

#### Using purchase count

In [27]:
# these variables will change accordingly
name = 'pearson'
target = 'TotalQtyPurchased'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+--------------------+------+
| LYBID | ITEMID |       score        | rank |
+-------+--------+--------------------+------+
| 10004 | 44002  |        6.0         |  1   |
| 10004 | 35928  |        6.0         |  2   |
| 10004 |  3560  |        5.0         |  3   |
| 10004 |  3557  |        4.25        |  4   |
| 10004 | 33867  | 3.6666666666666625 |  5   |
| 10009 | 44002  |        6.0         |  1   |
| 10009 | 35928  |        6.0         |  2   |
| 10009 |  3560  |        5.0         |  3   |
| 10009 |  3557  |        4.25        |  4   |
| 10009 | 33867  | 3.6666666666666625 |  5   |
| 10012 | 44002  |        6.0         |  1   |
| 10012 | 35928  |        6.0         |  2   |
| 10012 |  3560  |        5.0         |  3   |
| 10012 |  3557  |        4.25        |  4   |
| 10012 | 33867  | 3.6666666666666625 |  5   |
| 10022 | 44002  |        6.0         |  1   |
| 10022 | 35928  |        6.0         |  2   |
| 10022 |  3560  |        5.0         |  3   |
| 10022 |  35

#### Using purchase dummy

In [28]:
# these variables will change accordingly
name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+-------+------+
| LYBID | ITEMID | score | rank |
+-------+--------+-------+------+
| 10004 | 33819  |  0.0  |  1   |
| 10004 | 42733  |  0.0  |  2   |
| 10004 | 43931  |  0.0  |  3   |
| 10004 | 37910  |  0.0  |  4   |
| 10004 | 37912  |  0.0  |  5   |
| 10009 | 33819  |  0.0  |  1   |
| 10009 | 42733  |  0.0  |  2   |
| 10009 | 43931  |  0.0  |  3   |
| 10009 | 37910  |  0.0  |  4   |
| 10009 | 37912  |  0.0  |  5   |
| 10012 | 33819  |  0.0  |  1   |
| 10012 | 42733  |  0.0  |  2   |
| 10012 | 43931  |  0.0  |  3   |
| 10012 | 37910  |  0.0  |  4   |
| 10012 | 37912  |  0.0  |  5   |
| 10022 | 33819  |  0.0  |  1   |
| 10022 | 42733  |  0.0  |  2   |
| 10022 | 43931  |  0.0  |  3   |
| 10022 | 37910  |  0.0  |  4   |
| 10022 | 37912  |  0.0  |  5   |
| 10023 | 43901  |  0.0  |  1   |
| 10023 | 33819  |  0.0  |  2   |
| 10023 | 42733  |  0.0  |  3   |
| 10023 | 37910  |  0.0  |  4   |
| 10023 | 37912  |  0.0  |  5   |
| 10029 | 33819  |  0.0  |  1   |
| 10029 | 4273

#### Using normalized purchase count

In [29]:
name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+--------------------+------+
| LYBID | ITEMID |       score        | rank |
+-------+--------+--------------------+------+
| 10004 |  5035  |        1.0         |  1   |
| 10004 | 43563  |        1.0         |  2   |
| 10004 |  3557  | 0.8666666666666667 |  3   |
| 10004 | 29031  | 0.5357142857142857 |  4   |
| 10004 | 35859  |        0.5         |  5   |
| 10009 |  5035  |        1.0         |  1   |
| 10009 | 43563  |        1.0         |  2   |
| 10009 |  3557  | 0.8666666666666667 |  3   |
| 10009 | 29031  | 0.5357142857142857 |  4   |
| 10009 | 35859  |        0.5         |  5   |
| 10012 |  5035  |        1.0         |  1   |
| 10012 | 43563  |        1.0         |  2   |
| 10012 |  3557  | 0.8666666666666667 |  3   |
| 10012 | 29031  | 0.5357142857142857 |  4   |
| 10012 | 35859  |        0.5         |  5   |
| 10022 |  5035  |        1.0         |  1   |
| 10022 | 43563  |        1.0         |  2   |
| 10022 |  3557  | 0.8666666666666667 |  3   |
| 10022 | 290

#### Note
* In collaborative filtering above, we used two approaches: cosine and Pearson similarity. We applied each of them to three training datasets: raw purchase counts, the purchase dummy, and normalized purchase counts.
* With cosine similarity, the recommendations differ from user to user, which suggests that personalization does exist. In the Pearson outputs shown above, the first few users receive the same items as under the popularity baseline.
* But how good are these models compared to the baseline, and to each other? We need some means of evaluating a recommendation engine. Let's focus on that in the next section.

## 7. Model Evaluation
To evaluate recommendation engines, we can use RMSE together with precision and recall.

* RMSE (Root Mean Squared Errors)
    * Measures the error of predicted values
    * The lower the RMSE value, the better the recommendations
* Recall
    * What percentage of products that a user buys are actually recommended?
    * If a customer buys 5 products and the recommendation decided to show 3 of them, then the recall is 0.6
* Precision
    * Out of all the recommended items, how many did the user actually like?
    * If 5 products were recommended to the customer out of which he buys 4 of them, then precision is 0.8
    
* Why are both recall and precision important?
    * Consider a case where we recommend all products, so the recommendations will surely cover the items that our customers liked and bought. In this case, we have 100% recall! Does this mean our model is good?
    * We have to consider precision. If we recommend 300 items but the user likes and buys only 3 of them, then precision is 1%! This very low precision indicates that the model is not great, despite its excellent recall.
    * So our aim has to be optimizing both recall and precision (to be as close to 1 as possible).
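
The three metrics above can be sketched in a few lines. These are textbook definitions written for illustration, not the code `turicreate` runs internally; the function names and toy item labels are assumptions.

```python
import math


def precision_recall_at_k(recommended, bought, k):
    """Precision and recall of the first k recommendations."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(bought))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(bought) if bought else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# 5 items recommended, 4 of them bought: precision is 4/5.
case_precision = precision_recall_at_k(["a", "b", "c", "d", "e"], ["a", "b", "c", "d"], k=5)
# 5 items bought, 3 of them recommended: recall is 3/5.
case_recall = precision_recall_at_k(["a", "b", "c"], ["a", "b", "c", "x", "y"], k=3)
print(case_precision, case_recall, rmse([1.0, 2.0], [1.0, 4.0]))
```

The two toy cases reproduce the worked numbers from the bullets: precision 0.8 in the first and recall 0.6 in the second.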

Let's compare all the models we have built based on their precision-recall characteristics:

In [30]:
# create initial callable variables

models_w_counts = [popularity_model, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]

names_w_counts = ['Popularity Model on Purchase Counts', 'Cosine Similarity on Purchase Counts', 'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy', 'Cosine Similarity on Purchase Dummy', 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts', 'Cosine Similarity on Scaled Purchase Counts', 'Pearson Similarity on Scaled Purchase Counts']

#### Models on purchase counts

In [31]:
eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)

PROGRESS: Evaluate model Popularity Model on Purchase Counts



Precision and recall summary statistics by cutoff
+--------+------------------------+-----------------------+
| cutoff |     mean_precision     |      mean_recall      |
+--------+------------------------+-----------------------+
|   1    |          0.0           |          0.0          |
|   2    |          0.0           |          0.0          |
|   3    | 1.4254151521630688e-05 | 4.276245456489242e-05 |
|   4    | 1.0690613641223106e-05 | 4.276245456489242e-05 |
|   5    | 0.0004917682274962601  | 0.0016846463350632733 |
|   6    | 0.00041693393200769983 | 0.0017006822555251113 |
|   7    | 0.0004948226885366114  | 0.0023040146725201247 |
|   8    | 0.00047038700021380945 | 0.0024870277965210784 |
|   9    | 0.00046801130829354017 | 0.0027284574879186804 |
|   10   | 0.00047466324567029694 |  0.003046487045845624 |
+--------+------------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 3.344918252784026

Per User RMSE (best)
+----------+------+-------+
|  


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.041971349155441234 | 0.03013302177830709  |
|   2    | 0.04004703870002108  | 0.056595810702327026 |
|   3    | 0.035991732592117597 | 0.07524628750476908  |
|   4    | 0.03357921744708165  | 0.09368728210515946  |
|   5    |  0.0310455420141118  | 0.10752407321815662  |
|   6    | 0.02909985033140889  | 0.12037778884458636  |
|   7    | 0.027264119246158864 | 0.13125680225076675  |
|   8    | 0.025807141329912294 | 0.14140056055611172  |
|   9    | 0.024586035682892787 |  0.1515475160669864  |
|   10   | 0.023427410733375984 | 0.16004945511575297  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 3.563652072483777

Per User RMSE (best)
+----------+---------------------+-------+
|  LYBID   |         rmse     


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 2.1381227282446286e-05 | 5.345306820611571e-06  |
|   2    | 1.0690613641223143e-05 | 5.345306820611571e-06  |
|   3    | 2.1381227282446313e-05 | 4.8107761385503194e-05 |
|   4    | 2.1381227282446482e-05 | 6.948898866795051e-05  |
|   5    | 0.0005003207184092333  |  0.001700682255525091  |
|   6    | 0.00043475162140973995 | 0.0017666077063126639  |
|   7    | 0.0005070405326980125  | 0.0023628130475468337  |
|   8    | 0.00048642292067564507 | 0.0025582985541292118  |
|   9    | 0.0004822654598151698  |  0.00279972824552683   |
|   10   | 0.00048321573658327826 | 0.0030840823704839197  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 3.3398832780987644

Per User RMSE (best)
+----------+----

#### Models on purchase dummy

In [32]:
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)

PROGRESS: Evaluate model Popularity Model on Purchase Dummy



Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 0.0002772919244059546  | 0.00014646701648109324 |
|   2    | 0.00022396655432788563 | 0.0002202337784224231  |
|   3    | 0.00020619143096853147 | 0.0003275955235129346  |
|   4    | 0.00017597372125762633 | 0.00039691850461441927 |
|   5    | 0.0001535770658248355  | 0.00042891372666126323 |
|   6    | 0.00014220098687485035 | 0.00045450990429873693 |
|   7    | 0.0001462638722141318  | 0.0005824907924860955  |
|   8    | 0.00013597969369907428 | 0.0006144860145329414  |
|   9    | 0.00014931103621858996 | 0.0007708563140285332  |
|   10   | 0.0001578430954310815  | 0.0009379424736064813  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.0

Per User RMSE (best)
+------------+------+-------+
|


Precision and recall summary statistics by cutoff
+--------+----------------------+---------------------+
| cutoff |    mean_precision    |     mean_recall     |
+--------+----------------------+---------------------+
|   1    | 0.09112239238940292  | 0.06299956010285963 |
|   2    | 0.07237319226995476  | 0.09842490435035275 |
|   3    | 0.06150903687271666  | 0.12424859625671045 |
|   4    | 0.05408792287018459  | 0.14500307578447863 |
|   5    |  0.0487607183993858  | 0.16259764179157668 |
|   6    | 0.04452312899051525  | 0.17725783853471813 |
|   7    | 0.041109289584184375 | 0.19081146927981893 |
|   8    | 0.038290281984557153 |  0.2029215596371595 |
|   9    | 0.03591522925169074  | 0.21384947363518733 |
|   10   | 0.03385947698477067  |  0.2236334504123605 |
+--------+----------------------+---------------------+
[10 rows x 3 columns]


Overall RMSE: 0.9814789708981958

Per User RMSE (best)
+------------+--------------------+-------+
|   LYBID    |        rmse        | count 


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 0.00010665074015613798 | 5.4036375012442895e-05 |
|   2    | 8.532059212490934e-05  | 9.314164640302634e-05  |
|   3    | 7.821054278116642e-05  | 0.0001429119918092236  |
|   4    | 7.465551810929518e-05  | 0.00018023975086387106 |
|   5    | 6.825647369992793e-05  | 0.0001916158298138589  |
|   6    | 6.754546876555276e-05  | 0.00022538856419663386 |
|   7    | 6.094328008922046e-05  | 0.00022805483270053737 |
|   8    | 5.332537007806865e-05  | 0.00022805483270053737 |
|   9    | 5.925041119785348e-05  | 0.00030182159464186604 |
|   10   | 5.972441448743649e-05  | 0.0003409268660324512  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0

Per User RMSE (best)
+------------+------+-------+
|

#### Models on normalized purchase frequency

In [33]:
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)

PROGRESS: Evaluate model Popularity Model on Scaled Purchase Counts



Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 2.1466137168616598e-05 | 2.1466137168616598e-05 |
|   2    | 2.146613716861659e-05  | 4.293227433723318e-05  |
|   3    | 2.146613716861659e-05  | 5.366534292154137e-05  |
|   4    | 3.7565740045078776e-05 | 0.00010017530678687633 |
|   5    | 3.0052592036063467e-05 | 0.00010017530678687633 |
|   6    | 3.219920575292455e-05  | 0.00013237451253980171 |
|   7    |  2.75993192167929e-05  | 0.00013237451253980171 |
|   8    | 2.4149404314693747e-05 | 0.00013237451253980171 |
|   9    | 2.1466137168616683e-05 | 0.00013237451253980171 |
|   10   | 2.1466137168616693e-05 | 0.00013666773997352637 |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.09174209744262064

Per User RMSE (best)
+------------+-


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.008242996672748776 | 0.006029991260215607 |
|   2    | 0.007255554362992399 | 0.010076279748062519 |
|   3    | 0.008006869163894011 | 0.01741083302229571  |
|   4    | 0.007153590211441343 | 0.02040836689457507  |
|   5    | 0.006701728024042116 | 0.023626038636449777 |
|   6    | 0.006282422811348418 | 0.02644744370509102  |
|   7    | 0.00602278477130899  | 0.029116957387001322 |
|   8    | 0.005801223569818598 | 0.03175096315559997  |
|   9    | 0.00552872288409478  | 0.03394861935078762  |
|   10   | 0.005428786089943147 |  0.0372792020448995  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.09588570371649034

Per User RMSE (best)
+------------+------+-------+
|   LYBID    | rmse | count |
+----------


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 2.1466137168616598e-05 | 2.1466137168616598e-05 |
|   2    | 2.1466137168616642e-05 | 4.2932274337233284e-05 |
|   3    | 2.1466137168616676e-05 | 5.3665342921541215e-05 |
|   4    | 3.756574004507888e-05  | 0.00010017530678687646 |
|   5    | 3.0052592036063433e-05 | 0.00010017530678687646 |
|   6    | 2.8621516224821725e-05 | 0.00011090837537118519 |
|   7    | 2.4532728192704725e-05 | 0.00011090837537118519 |
|   8    | 2.414940431469368e-05  | 0.00013237451253980144 |
|   9    | 2.385126352068513e-05  | 0.00015384064970841655 |
|   10   | 2.3612750885478223e-05 | 0.00015813387714214177 |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.09134139179218768

Per User RMSE (best)
+------------+-
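
The tables above report mean precision and mean recall per cutoff. As a rough illustration of what those columns measure, here is a minimal sketch for a single user, assuming the textbook definitions (hits divided by the cutoff for precision, hits divided by the number of relevant items for recall). The function name `precision_recall_at_k` and the toy item IDs are hypothetical and are not part of the notebook or of `turicreate`.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one user.

    recommended: ranked list of item IDs, best first.
    relevant: set of item IDs the user actually bought in the test set.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: these IDs are for illustration only.
recommended = [43508, 43646, 43507, 21089, 35650]
relevant = {43507, 29009}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(recommended, relevant, k)
    print(k, round(p, 3), round(r, 3))
```

Averaging these per-user values over all test users gives figures comparable in spirit to `mean_precision` and `mean_recall`. Precision tends to fall and recall to rise as the cutoff grows, which is the pattern visible in the tables.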

## 8. Final Output
* In this step, we reshape the recommendation output into a format that can be exported to CSV, and define a function that returns the recommendation list for a given customer ID.
* We first need to refit the model on the whole dataset, because the final model was selected using the training data and evaluated on the test set.

In [44]:
users_to_recommend = list(data[user_id].unique())

# NOTE: the choice of final model has not been evaluated yet
final_model = tc.item_similarity_recommender.create(tc.SFrame(data_dummy), 
                                            user_id=user_id, 
                                            item_id=item_id, 
                                            target='purchase_dummy', 
                                            similarity_type='cosine')

recom = final_model.recommend(users=users_to_recommend, k=n_rec)
recom.print_rows(n_display)

+-------+--------+----------------------+------+
| LYBID | ITEMID |        score         | rank |
+-------+--------+----------------------+------+
| 10004 | 43508  | 0.07162949868610927  |  1   |
| 10004 | 43646  | 0.05709114245006016  |  2   |
| 10004 | 43507  | 0.05089901174817767  |  3   |
| 10004 | 21089  | 0.05017979655947004  |  4   |
| 10004 | 35650  | 0.04822603293827602  |  5   |
| 10009 | 42701  | 0.10445459683736165  |  1   |
| 10009 | 43515  | 0.04533270994822184  |  2   |
| 10009 | 43604  | 0.03413587808609009  |  3   |
| 10009 | 35578  | 0.03187417984008789  |  4   |
| 10009 | 35545  | 0.025119324525197346 |  5   |
| 10012 | 42655  | 0.15089134375254312  |  1   |
| 10012 | 43442  | 0.05090361833572388  |  2   |
| 10012 | 15862  | 0.05022907257080078  |  3   |
| 10012 | 43507  | 0.048831939697265625 |  4   |
| 10012 | 33584  | 0.048500259717305504 |  5   |
| 10022 | 43507  | 0.03914217154184977  |  1   |
| 10022 | 15862  | 0.03719216017496018  |  2   |
| 10022 | 43853  | 0
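
The final model uses `similarity_type='cosine'` on the 0/1 purchase dummy. The sketch below shows the textbook cosine similarity between two items on binary data, as a way to build intuition for the scores. It is not `turicreate`'s implementation, and the function name `cosine_similarity` and the toy buyers are hypothetical.

```python
import math


def cosine_similarity(users_a, users_b):
    """Cosine similarity between two items, given the sets of users who bought each.

    With 0/1 purchase indicators, the dot product is the number of shared
    buyers and each vector norm is the square root of the number of buyers.
    """
    if not users_a or not users_b:
        return 0.0
    shared = len(users_a & users_b)
    return shared / math.sqrt(len(users_a) * len(users_b))


# Hypothetical toy data: item label -> set of customer IDs who bought it.
buyers = {
    "A": {1, 2, 3},
    "B": {2, 3},
    "C": {4},
}

print(cosine_similarity(buyers["A"], buyers["B"]))  # items sharing two buyers
print(cosine_similarity(buyers["A"], buyers["C"]))  # items with no shared buyers
```

Items bought by many of the same customers score close to 1, and items with no buyers in common score 0.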

### 8.1. CSV output file

In [45]:
df_rec = recom.to_dataframe()
print(df_rec.shape)
df_rec.head()

(505700, 4)


Unnamed: 0,LYBID,ITEMID,score,rank
0,10004,43508,0.071629,1
1,10004,43646,0.057091,2
2,10004,43507,0.050899,3
3,10004,21089,0.05018,4
4,10004,35650,0.048226,5


In [46]:
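# Collapse each customer's recommended items into one pipe-delimited string
df_rec['recommendedProducts'] = df_rec.groupby([user_id])[item_id].transform(lambda x: '|'.join(x.astype(str)))
# Keep one row per customer, indexed by customer ID
df_output = df_rec[[user_id, 'recommendedProducts']].drop_duplicates().sort_values(user_id).set_index(user_id)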
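
The reshaping above can be tried on a small table first. The sketch below is one possible variant that uses `groupby(...).agg(...)` instead of `transform` followed by `drop_duplicates`. The frame `toy_rec` and its values are hypothetical and only mimic the long format of `df_rec`.

```python
import pandas as pd

# Hypothetical toy recommendations in the same long format as df_rec.
toy_rec = pd.DataFrame({
    "LYBID": [10004, 10004, 10009, 10009],
    "ITEMID": [43508, 43646, 42701, 43515],
    "rank": [1, 2, 1, 2],
})

# Sort by rank so the joined string preserves the ranking,
# then build one pipe-delimited string per customer.
toy_output = (
    toy_rec.sort_values(["LYBID", "rank"])
    .groupby("LYBID")["ITEMID"]
    .agg(lambda items: "|".join(items.astype(str)))
    .to_frame("recommendedProducts")
)
print(toy_output)
```

Aggregating directly yields one row per customer, so no de-duplication step is needed. Whether this is preferable on the full dataset has not been measured here.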

#### Define a function to create the desired output

In [50]:
def create_output(model, users_to_recommend, n_rec, print_csv=True):
    """Return one row per customer with a pipe-delimited string of recommended item IDs."""
    recommendation = model.recommend(users=users_to_recommend, k=n_rec)
    df_rec = recommendation.to_dataframe()
    # Collapse each customer's recommended items into one pipe-delimited string
    df_rec['recommendedProducts'] = df_rec.groupby([user_id])[item_id] \
        .transform(lambda x: '|'.join(x.astype(str)))
    # Keep one row per customer, indexed by customer ID
    df_output = df_rec[[user_id, 'recommendedProducts']].drop_duplicates() \
        .sort_values(user_id).set_index(user_id)
    if print_csv:
        df_output.to_csv('../output/sos_dummy_recommendation.csv')
        # The message names the file that is actually written above
        print("An output file can be found in 'output' folder with name 'sos_dummy_recommendation.csv'")
    return df_output

In [51]:
# Pass final_model instead of cos_dummy to use the model refit on the full dataset
df_output = create_output(cos_dummy, users_to_recommend, n_rec, print_csv=True)
print(df_output.shape)
df_output.head()

An output file can be found in 'output' folder with name 'sos_dummy_recommendation.csv'
(101140, 1)


Unnamed: 0_level_0,recommendedProducts
LYBID,Unnamed: 1_level_1
10004,43508|15862|43987|43507|21089
10009,35578|43515|43604|43679|28876
10012,42655|15862|43442|33584|43604
10022,15862|42653|43507|43853|43442
10023,43507|28901|43612|43442|15862


### 8.2. Customer recommendation function

In [48]:
def customer_recomendation(LYBID):
    """Return the recommendation row for a customer ID, or None if the ID is unknown."""
    if LYBID not in df_output.index:
        print('Customer not found.')
        return None
    return df_output.loc[LYBID]

In [49]:
customer_recomendation(10012)

recommendedProducts    42655|15862|43442|33584|43604
Name: 10012, dtype: object
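
The lookup returns the recommendations as a single pipe-delimited string. If a ranked Python list of product IDs is needed, as described in the problem statement, the string can be split. This is a minimal sketch, assuming item IDs are integers as in the `ITEMID` column. The helper name `parse_recommendations` is hypothetical and not part of the notebook.

```python
def parse_recommendations(joined, sep="|"):
    """Split a delimited string of product IDs into a ranked list of ints."""
    if not joined:
        return []
    return [int(item) for item in joined.split(sep)]


# The string below is the value shown for customer 10012 above.
ranked_items = parse_recommendations("42655|15862|43442|33584|43604")
print(ranked_items)
```

In the notebook, the input would typically be `customer_recomendation(10012)['recommendedProducts']`.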
