# Product Recommendation Engine Analysis
Product Lead: Jeeth Joseph 
Based on the ideas so elegantly put by Moorissa Tjokro

## Problem statement
The steps below aim to recommend each user the top 10 items to place in their basket. The final output will be a CSV file in the `output` folder, plus a function that looks up the recommendation list for a specified user:
* Input: user - customer ID
* Returns: ranked list of items (product IDs) that the user is most likely to want to put in their (empty) "basket"

## 1. Import modules
* `pandas` and `numpy` for data manipulation
* `turicreate` for performing model selection and evaluation
* `sklearn` for splitting the data into train and test set
* `xlrd` for Excel import
* `sudo apt-get install libatlas-base-dev` if that system package is missing

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

import sys
sys.path.append("..")
import scripts.data_layer as data_layer

## 2. Load data
A single dataset exported from the database, which can be found in the `data` folder:
* Lyb data QUEST JAN WITH PURQTY 10k (10k rows, to avoid a memory error)
* XLSX format
* Some error is expected due to the difference between expected purchase frequency and purchase quantity



In [2]:
s=time.time()

data=pd.read_excel('../data/Lyb data QUEST JAN WITH PURQTY 10k.xlsx')

print("Import time:", round((time.time()-s)/60,2), "minutes")

print(data.shape)
data.head(2)

Import time: 0.11 minutes
(9999, 3)


Unnamed: 0,LYBID,ITEMID,TotalQtyPurchased
0,10004,29009,1
1,10004,33815,2


### 3. Create dummy
* A dummy variable marks whether a customer bought a given item or not.
* If a customer buys an item, then `purchase_dummy` is set to 1.
* Why create a dummy instead of normalizing?
    * Normalizing the purchase count per user would not work well, because customers differ in both buying frequency and taste.
    * However, we can normalize items by purchase frequency across all users, which is done in section 3.3. below.

In [3]:
def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy

In [4]:
data_dummy = create_data_dummy(data)

In [5]:
print(data_dummy.shape)
data_dummy.head()

(9999, 4)


Unnamed: 0,LYBID,ITEMID,TotalQtyPurchased,purchase_dummy
0,10004,29009,1,1
1,10004,33815,2,1
2,10004,43517,1,1
3,10004,43519,2,1
4,10004,43598,1,1


### 3.3. Normalize item values across users
* To do this, we normalize purchase frequency of each item across users by first creating a user-item matrix as follows

In [6]:
s=time.time()
df_matrix = pd.pivot_table(data, values='TotalQtyPurchased', index='LYBID', columns='ITEMID')
print("Import time:", round((time.time()-s)/60,2), "minutes")
df_matrix.head()

Import time: 0.01 minutes


ITEMID,69,200,208,325,481,504,1672,1717,1969,2041,...,44168,44169,44182,44186,44196,44197,44199,44257,44270,44289
LYBID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10004,,,,,,,,,,,...,,,,,,,,,,
10009,,,,,,,,,,,...,,,,,,,,,,
10012,,,,,,,,,,,...,,,,,,,,,,
10022,,,,,,,,,,,...,,,,,,,,,,
10023,,,,,,,,,,,...,,,,,,,,,,


In [7]:
(df_matrix.shape)

(2616, 921)

In [8]:
s=time.time()
df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
print("Import time:", round((time.time()-s)/60,2), "minutes")
print(df_matrix_norm.shape)
df_matrix_norm.head()

Import time: 0.01 minutes
(2616, 921)


ITEMID,69,200,208,325,481,504,1672,1717,1969,2041,...,44168,44169,44182,44186,44196,44197,44199,44257,44270,44289
LYBID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10004,,,,,,,,,,,...,,,,,,,,,,
10009,,,,,,,,,,,...,,,,,,,,,,
10012,,,,,,,,,,,...,,,,,,,,,,
10022,,,,,,,,,,,...,,,,,,,,,,
10023,,,,,,,,,,,...,,,,,,,,,,


In [9]:
# create a table for input to the modeling
s=time.time()
d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['LYBID'], value_name='scaled_purchase_freq').dropna()
data_norm
print("Import time:", round((time.time()-s)/60,2), "minutes")
print(data_norm.shape)
data_norm.head()

Import time: 0.01 minutes
(7637, 3)


Unnamed: 0,LYBID,ITEMID,scaled_purchase_freq
24168,30516806,2041,0.0
24588,30524044,2041,0.0
24605,30524259,2041,0.0
25035,30529803,2041,1.0
25090,30551695,2041,0.0


#### Define a function for normalizing data

In [10]:
def normalize_data(data):
    df_matrix = pd.pivot_table(data, values='TotalQtyPurchased', index='LYBID', columns='ITEMID')
    df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['LYBID'], value_name='scaled_purchase_freq').dropna()

* This scales each item's purchase history to the range 0-1, with 1 being the highest purchase count observed for that item and 0 the lowest.
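The min-max scaling above can be traced on a toy table. This is a minimal sketch that reuses the notebook's column names with made-up values. It also shows one side effect worth knowing: an item whose quantity is the same for every buyer has `max - min = 0`, so its scaled values become `NaN` and are removed by `dropna()`, which may be part of why `data_norm` has fewer rows than `data`.

```python
import pandas as pd

# Toy purchase log using the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "LYBID": [1, 2, 3, 1, 2],
    "ITEMID": [10, 10, 10, 20, 20],
    "TotalQtyPurchased": [1, 3, 5, 2, 2],
})

# User-item matrix: rows are customers, columns are items, NaN = not purchased.
matrix = pd.pivot_table(toy, values="TotalQtyPurchased", index="LYBID", columns="ITEMID")

# Column-wise min-max scaling: each item's quantities are mapped onto [0, 1].
scaled = (matrix - matrix.min()) / (matrix.max() - matrix.min())

# Back to long format; NaN cells (unpurchased or constant-quantity items) are dropped.
long_form = (
    pd.melt(scaled.reset_index(), id_vars=["LYBID"], value_name="scaled_purchase_freq")
    .dropna()
    .reset_index(drop=True)
)
print(long_form)
```

Item 10 keeps its three rows, scaled to 0.0, 0.5 and 1.0, while item 20 (always bought in quantity 2) disappears from the result.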

## 4. Split train and test set
* Splitting the data into training and testing sets is an important part of evaluating a predictive model, in this case a collaborative filtering model. Typically, we use a larger portion of the data for training and a smaller portion for testing.
* We use an 80:20 ratio for our train-test set size.
* The training portion is used to develop the predictive model, while the test portion is used to evaluate the model's performance.
* Now that we have three datasets (purchase counts, purchase dummy, and scaled purchase counts), we split each of them.

In [11]:
train, test = train_test_split(data, test_size = .2)
print(train.shape, test.shape)

(7999, 3) (2000, 3)


In [12]:
# Using turicreate library, we convert dataframe to SFrame - this will be useful in the modeling part

train_data = tc.SFrame(train)
test_data = tc.SFrame(test)

In [13]:
train_data

LYBID,ITEMID,TotalQtyPurchased
30567875,35849,1
30515049,35841,1
30524091,35826,1
30519862,38022,1
30524518,43987,1
30516084,33834,3
30526727,43334,1
81036,15146,1
30556603,42657,1
30529265,41058,1


In [14]:
test_data

LYBID,ITEMID,TotalQtyPurchased
30552847,43511,1
30553343,43442,1
30571834,43581,1
30564442,35652,1
30569504,43507,1
30517195,43669,1
30572315,43929,1
30528062,35650,1
30514917,43914,1
30559992,15082,1


#### Define a `split_data` function for splitting data to training and test set

In [16]:
# We can define a function for this step as follows

def split_data(data):
    '''
    Splits dataset into training and test set.
    
    Args:
        data (pandas.DataFrame)
        
    Returns
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size = .2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

In [17]:
# let's also split the dummy table and the scaled/normalized purchase table

train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)

## 5. Baseline Model
Before running a more complicated approach such as collaborative filtering, we build a baseline model to compare against. Since a baseline uses a very simple approach, more complex techniques should be adopted only if their gain in accuracy justifies the added complexity.

### 5.1. Using a Popularity model as a baseline
* The popularity model recommends the most popular items. These are the products with the highest sales across customers.
* We use the `turicreate` library for running and evaluating both the baseline and the collaborative filtering models below.
* Training data is used for model selection.
* Still to be reviewed: the math behind the `turicreate` popularity model.
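As a rough mental model of the baseline, the sketch below ranks items by their mean target value in plain pandas. This is an assumption about how the popularity score behaves when a target column is supplied, not a description of `turicreate` internals; the helper name `popularity_top_k` and the toy data are hypothetical.

```python
import pandas as pd

def popularity_top_k(df, item_col, target_col, k):
    """Rank items by mean target value and return the top k (hypothetical helper)."""
    scores = df.groupby(item_col)[target_col].mean()
    top = scores.sort_values(ascending=False).head(k)
    return top.reset_index().rename(columns={target_col: "score"})

# Toy purchase log with the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "LYBID": [1, 2, 3, 1, 2, 3],
    "ITEMID": [10, 10, 20, 20, 30, 30],
    "TotalQtyPurchased": [4, 6, 1, 1, 2, 3],
})

# Every user would receive this same list, which is what makes it a baseline.
top_items = popularity_top_k(toy, "ITEMID", "TotalQtyPurchased", k=2)
print(top_items)
```

Under this assumption, a constant target such as `purchase_dummy` gives every item the same mean of 1.0, which would be consistent with the uniform scores in the dummy output further down.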

#### Using purchase counts

In [18]:
# variables to define field names
user_id = 'LYBID'
item_id = 'ITEMID'
target = 'TotalQtyPurchased'
users_to_recommend = list(data[user_id].unique())
n_rec = 5 # number of items to recommend
n_display = 30

In [None]:
#print(list(data[user_id].unique()))  #tester for error raakshsasan20190222

In [19]:
popularity_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)

In [20]:
# Get recommendations for a list of users to recommend (from data file)
# Printed below is head / top 30 rows for first 6 customers with 5 recommendations each

popularity_recomm = popularity_model.recommend(users=users_to_recommend, k=n_rec)
popularity_recomm.print_rows(n_display)

+-------+--------+-------+------+
| LYBID | ITEMID | score | rank |
+-------+--------+-------+------+
| 10004 | 42048  |  19.0 |  1   |
| 10004 | 35912  |  5.5  |  2   |
| 10004 | 35906  |  4.0  |  3   |
| 10004 | 15885  |  3.6  |  4   |
| 10004 | 15834  |  3.2  |  5   |
| 10009 | 42048  |  19.0 |  1   |
| 10009 | 35912  |  5.5  |  2   |
| 10009 | 35906  |  4.0  |  3   |
| 10009 | 15885  |  3.6  |  4   |
| 10009 | 15834  |  3.2  |  5   |
| 10012 | 42048  |  19.0 |  1   |
| 10012 | 35912  |  5.5  |  2   |
| 10012 | 35906  |  4.0  |  3   |
| 10012 | 15885  |  3.6  |  4   |
| 10012 | 15834  |  3.2  |  5   |
| 10022 | 42048  |  19.0 |  1   |
| 10022 | 35912  |  5.5  |  2   |
| 10022 | 35906  |  4.0  |  3   |
| 10022 | 15885  |  3.6  |  4   |
| 10022 | 15834  |  3.2  |  5   |
| 10023 | 42048  |  19.0 |  1   |
| 10023 | 35912  |  5.5  |  2   |
| 10023 | 35906  |  4.0  |  3   |
| 10023 | 15885  |  3.6  |  4   |
| 10023 | 15834  |  3.2  |  5   |
| 10029 | 42048  |  19.0 |  1   |
| 10029 | 3591

#### Define a `model` function for model selection

In [21]:
# Since turicreate is a very accessible library, we can define a model selection function as below

def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    # Train the recommender selected by `name`, print sample recommendations, return the model
    if name == 'popularity':
        model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)
    elif name == 'cosine':
        model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='cosine')
    elif name == 'pearson':
        model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='pearson')
    else:
        # Fail early instead of hitting an undefined `model` below
        raise ValueError("name must be 'popularity', 'cosine' or 'pearson', got %r" % name)
        
    recom = model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return model

In [22]:
# variables to define field names
# constant variables include:
user_id = 'LYBID'
item_id = 'ITEMID'
users_to_recommend = list(data[user_id].unique())
n_rec = 5 # number of items to recommend
n_display = 30 # to print the head / first few rows in a defined dataset

#### Using purchase dummy

In [23]:
# these variables will change accordingly
name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+-------+------+
| LYBID | ITEMID | score | rank |
+-------+--------+-------+------+
| 10004 | 38011  |  1.0  |  1   |
| 10004 | 33586  |  1.0  |  2   |
| 10004 | 29036  |  1.0  |  3   |
| 10004 | 38012  |  1.0  |  4   |
| 10004 | 33819  |  1.0  |  5   |
| 10009 | 38011  |  1.0  |  1   |
| 10009 | 33586  |  1.0  |  2   |
| 10009 | 29036  |  1.0  |  3   |
| 10009 | 38012  |  1.0  |  4   |
| 10009 | 33819  |  1.0  |  5   |
| 10012 | 38011  |  1.0  |  1   |
| 10012 | 33586  |  1.0  |  2   |
| 10012 | 29036  |  1.0  |  3   |
| 10012 | 38012  |  1.0  |  4   |
| 10012 | 33819  |  1.0  |  5   |
| 10022 | 38011  |  1.0  |  1   |
| 10022 | 33586  |  1.0  |  2   |
| 10022 | 29036  |  1.0  |  3   |
| 10022 | 38012  |  1.0  |  4   |
| 10022 | 33819  |  1.0  |  5   |
| 10023 | 38011  |  1.0  |  1   |
| 10023 | 33586  |  1.0  |  2   |
| 10023 | 29036  |  1.0  |  3   |
| 10023 | 38012  |  1.0  |  4   |
| 10023 | 33819  |  1.0  |  5   |
| 10029 | 38011  |  1.0  |  1   |
| 10029 | 3358

#### Using normalized purchase count

In [24]:
name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+-------+------+
| LYBID | ITEMID | score | rank |
+-------+--------+-------+------+
| 10004 | 28884  |  1.0  |  1   |
| 10004 |  4370  |  1.0  |  2   |
| 10004 | 42048  |  1.0  |  3   |
| 10004 |  3848  |  1.0  |  4   |
| 10004 | 35796  |  1.0  |  5   |
| 10009 | 28884  |  1.0  |  1   |
| 10009 |  4370  |  1.0  |  2   |
| 10009 | 42048  |  1.0  |  3   |
| 10009 |  3848  |  1.0  |  4   |
| 10009 | 35796  |  1.0  |  5   |
| 10012 | 28884  |  1.0  |  1   |
| 10012 |  4370  |  1.0  |  2   |
| 10012 | 42048  |  1.0  |  3   |
| 10012 |  3848  |  1.0  |  4   |
| 10012 | 35796  |  1.0  |  5   |
| 10022 | 28884  |  1.0  |  1   |
| 10022 |  4370  |  1.0  |  2   |
| 10022 | 42048  |  1.0  |  3   |
| 10022 |  3848  |  1.0  |  4   |
| 10022 | 35796  |  1.0  |  5   |
| 10023 | 28884  |  1.0  |  1   |
| 10023 |  4370  |  1.0  |  2   |
| 10023 | 42048  |  1.0  |  3   |
| 10023 |  3848  |  1.0  |  4   |
| 10023 | 35796  |  1.0  |  5   |
| 10029 | 28884  |  1.0  |  1   |
| 10029 |  437

## 6. Collaborative Filtering Model

* In collaborative filtering, we recommend items based on what similar users have purchased. For instance, if customer 1 and customer 2 bought similar items, e.g. 1 bought X, Y, Z and 2 bought X, Y, we would recommend item Z to customer 2.

* To define similarity across users, we use the following steps:
    1. Create a user-item matrix, where index values represent unique customer IDs and column values represent unique product IDs
    
    2. Create an item-to-item similarity matrix. The idea is to calculate how similar one product is to another. There are a number of ways of calculating this. In sections 6.1 and 6.2, we use the cosine and Pearson similarity measures, respectively.
    
        * To calculate similarity between products X and Y, look at all customers who have bought both these items. For example, both X and Y have been bought by customers 1 and 2. 
        * We then create two item-vectors, v1 for item X and v2 for item Y, in the user-space of (1, 2) and compute the `cosine` or `pearson` similarity between these vectors. A zero angle (overlapping vectors, cosine of 1) means total similarity: every user gave both items the same rating. An angle of 90 degrees means a cosine of 0, or no similarity.
        
    3. For each customer, we then predict their likelihood to buy a product (or their purchase count) for each product they have not yet bought.
    
        * For our example, we calculate the rating of user 2 for item Z (the target item). To do this, we take the similarity between the target item and each item the customer has already bought, and weight it by the purchase count the customer gave to that item.
        * We then divide this weighted sum by the sum of the similarity measures so that the predicted rating stays within predefined limits. This gives the predicted rating of item Z for user 2.

* While I wrote Python scripts for the whole process, including the similarity calculation (they can be found in the `scripts` folder), we can use the `turicreate` library for now to try different measures such as `cosine` and `pearson`, and evaluate the best model.
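The three steps above can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the code in the `scripts` folder or what `turicreate` does internally: unpurchased items are stored as 0, similarity is computed over all customers rather than only co-purchasers, and the helper names `cosine_item_similarity` and `predict_score` are hypothetical.

```python
import numpy as np

def cosine_item_similarity(ratings):
    """Item-item cosine similarity for a user-by-item matrix (0 = not purchased)."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody bought
    sim = (ratings.T @ ratings) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item should not vote for itself
    return sim

def predict_score(ratings, sim, user, item):
    """Similarity-weighted average of the user's purchase counts."""
    bought = ratings[user] > 0
    weights = sim[item, bought]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ ratings[user, bought] / weights.sum())

# Rows: customers 1 and 2; columns: items X, Y, Z (the example from the text).
ratings = np.array([
    [1.0, 1.0, 1.0],  # customer 1 bought X, Y, Z
    [1.0, 1.0, 0.0],  # customer 2 bought X, Y
])

sim = cosine_item_similarity(ratings)
score_z = predict_score(ratings, sim, user=1, item=2)  # customer 2, item Z
print(sim.round(3))
print(score_z)
```

Dividing by the sum of the similarities keeps the prediction on the same scale as the purchase counts, which is the scaling step described in the last bullet.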

### 6.1. `Cosine` similarity
* Similarity is the cosine of the angle between the item vectors of A and B.
* It is defined by the following formula:
![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnRHSAx1c084UXF2wIHYwaHJLmq2qKtNk_YIv3RjHUO00xwlkt)
* The closer the vectors, the smaller the angle and the larger the cosine.
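In case the image does not render, the standard definition of cosine similarity is written out below. The index u running over customers and the symbols A_u, B_u (the values of customer u for items A and B) are notation assumed here, not taken from the image.

```latex
\[
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{u} A_u B_u}{\sqrt{\sum_{u} A_u^{2}} \; \sqrt{\sum_{u} B_u^{2}}}
\]
```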

#### Using purchase count

In [25]:
# these variables will change accordingly
name = 'cosine'
target = 'TotalQtyPurchased'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+---------------------+------+
| LYBID | ITEMID |        score        | rank |
+-------+--------+---------------------+------+
| 10004 | 43496  |  0.2952265938123067 |  1   |
| 10004 | 43872  | 0.22404481967290243 |  2   |
| 10004 | 41058  | 0.21192516883214316 |  3   |
| 10004 | 43511  | 0.20645380020141602 |  4   |
| 10004 | 43609  | 0.20537441968917847 |  5   |
| 10009 | 43496  |  0.3829347292582194 |  1   |
| 10009 | 35633  |  0.3442651828130086 |  2   |
| 10009 | 35647  | 0.20256135861078897 |  3   |
| 10009 | 15822  | 0.17181557416915894 |  4   |
| 10009 | 43504  | 0.17181557416915894 |  5   |
| 10012 | 42655  | 0.46770358085632324 |  1   |
| 10012 |  4351  | 0.23408228158950806 |  2   |
| 10012 | 33583  | 0.22229552268981934 |  3   |
| 10012 | 43925  | 0.21592989563941956 |  4   |
| 10012 | 35636  |  0.2003745138645172 |  5   |
| 10022 | 43496  |  0.4334707119885613 |  1   |
| 10022 | 43872  |  0.4110976527718937 |  2   |
| 10022 | 43609  | 0.34399348497390747 |

#### Using purchase dummy

In [26]:
# these variables will change accordingly
name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+----------------------+------+
| LYBID | ITEMID |        score         | rank |
+-------+--------+----------------------+------+
| 10004 | 27278  | 0.13114018738269806  |  1   |
| 10004 | 43508  |  0.1029571145772934  |  2   |
| 10004 | 43829  | 0.07702833414077759  |  3   |
| 10004 | 43875  |  0.0738956481218338  |  4   |
| 10004 | 40988  |  0.0738956481218338  |  5   |
| 10009 | 35633  |  0.1178511381149292  |  1   |
| 10009 | 42701  | 0.09855206807454427  |  2   |
| 10009 | 29027  | 0.09042261044184367  |  3   |
| 10009 | 43504  | 0.08087644974390666  |  4   |
| 10009 | 15822  | 0.08087644974390666  |  5   |
| 10012 | 43679  |  0.0769800345102946  |  1   |
| 10012 | 15862  | 0.06513200203577678  |  2   |
| 10012 | 33619  | 0.060858070850372314 |  3   |
| 10012 | 43330  | 0.060858070850372314 |  4   |
| 10012 | 43431  | 0.060858070850372314 |  5   |
| 10022 | 43932  | 0.05661747285297939  |  1   |
| 10022 | 43506  | 0.045780424560819356 |  2   |
| 10022 |  6430  | 0

#### Using normalized purchase count

In [27]:
name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+----------------------+------+
| LYBID | ITEMID |        score         | rank |
+-------+--------+----------------------+------+
| 10004 | 28897  | 0.05773502588272095  |  1   |
| 10004 | 43674  | 0.03779643774032593  |  2   |
| 10004 | 43511  | 0.03771816492080689  |  3   |
| 10004 | 42006  | 0.03764121532440186  |  4   |
| 10004 | 43609  | 0.03718677759170532  |  5   |
| 10009 | 15086  | 0.09622504313786824  |  1   |
| 10009 | 43641  | 0.036810497442881264 |  2   |
| 10009 | 43488  | 0.025269925594329834 |  3   |
| 10009 | 43442  | 0.016325573126475017 |  4   |
| 10009 | 43545  |         0.0          |  5   |
| 10012 | 43327  | 0.22360679507255554  |  1   |
| 10012 | 42655  | 0.22360679507255554  |  2   |
| 10012 |  6421  | 0.18257418274879456  |  3   |
| 10012 | 43925  | 0.15811389684677124  |  4   |
| 10012 | 33583  | 0.12909945845603943  |  5   |
| 10022 | 43936  | 0.04441155837132381  |  1   |
| 10022 | 43327  | 0.03440104539577778  |  2   |
| 10022 | 43909  | 0

### 6.2. `Pearson` similarity
* Similarity is the Pearson correlation coefficient between the two item vectors.
* It is defined by the following formula:
![](http://critical-numbers.group.shef.ac.uk/glossary/images/correlationKT1.png)
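In case the image does not render, the usual textbook form of the Pearson coefficient is written out below. The notation is assumed here (u indexes customers, and the barred symbols are the mean values of items A and B); the linked image may use a different but equivalent form.

```latex
\[
r_{AB}
  = \frac{\sum_{u} (A_u - \bar{A})(B_u - \bar{B})}
         {\sqrt{\sum_{u} (A_u - \bar{A})^{2}} \; \sqrt{\sum_{u} (B_u - \bar{B})^{2}}}
\]
```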

#### Using purchase count

In [28]:
# these variables will change accordingly
name = 'pearson'
target = 'TotalQtyPurchased'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+--------------------+------+
| LYBID | ITEMID |       score        | rank |
+-------+--------+--------------------+------+
| 10004 | 42048  | 18.999803711970646 |  1   |
| 10004 | 35912  |        5.5         |  2   |
| 10004 | 35906  |        4.0         |  3   |
| 10004 | 15885  | 3.5998572170734406 |  4   |
| 10004 | 15834  |        3.2         |  5   |
| 10009 | 42048  |        19.0        |  1   |
| 10009 | 35912  |        5.5         |  2   |
| 10009 | 35906  |        4.0         |  3   |
| 10009 | 15885  | 3.599674399693807  |  4   |
| 10009 | 15834  |        3.2         |  5   |
| 10012 | 42048  |        19.0        |  1   |
| 10012 | 35912  |        5.5         |  2   |
| 10012 | 35906  |        4.0         |  3   |
| 10012 | 15885  |        3.6         |  4   |
| 10012 | 15834  |        3.2         |  5   |
| 10022 | 42048  |        19.0        |  1   |
| 10022 | 35912  |        5.5         |  2   |
| 10022 | 35906  |        4.0         |  3   |
| 10022 | 158

#### Using purchase dummy

In [29]:
# these variables will change accordingly
name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+-------+------+
| LYBID | ITEMID | score | rank |
+-------+--------+-------+------+
| 10004 | 38011  |  0.0  |  1   |
| 10004 | 33586  |  0.0  |  2   |
| 10004 | 29036  |  0.0  |  3   |
| 10004 | 38012  |  0.0  |  4   |
| 10004 | 33819  |  0.0  |  5   |
| 10009 | 38011  |  0.0  |  1   |
| 10009 | 33586  |  0.0  |  2   |
| 10009 | 29036  |  0.0  |  3   |
| 10009 | 38012  |  0.0  |  4   |
| 10009 | 33819  |  0.0  |  5   |
| 10012 | 38011  |  0.0  |  1   |
| 10012 | 33586  |  0.0  |  2   |
| 10012 | 29036  |  0.0  |  3   |
| 10012 | 38012  |  0.0  |  4   |
| 10012 | 33819  |  0.0  |  5   |
| 10022 | 38011  |  0.0  |  1   |
| 10022 | 33586  |  0.0  |  2   |
| 10022 | 29036  |  0.0  |  3   |
| 10022 | 38012  |  0.0  |  4   |
| 10022 | 33819  |  0.0  |  5   |
| 10023 | 38011  |  0.0  |  1   |
| 10023 | 33586  |  0.0  |  2   |
| 10023 | 29036  |  0.0  |  3   |
| 10023 | 38012  |  0.0  |  4   |
| 10023 | 33819  |  0.0  |  5   |
| 10029 | 38011  |  0.0  |  1   |
| 10029 | 3358

#### Using normalized purchase count

In [30]:
name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+-------+--------+-------+------+
| LYBID | ITEMID | score | rank |
+-------+--------+-------+------+
| 10004 | 28884  |  1.0  |  1   |
| 10004 |  4370  |  1.0  |  2   |
| 10004 | 42048  |  1.0  |  3   |
| 10004 |  3848  |  1.0  |  4   |
| 10004 | 35796  |  1.0  |  5   |
| 10009 | 28884  |  1.0  |  1   |
| 10009 |  4370  |  1.0  |  2   |
| 10009 | 42048  |  1.0  |  3   |
| 10009 |  3848  |  1.0  |  4   |
| 10009 | 35796  |  1.0  |  5   |
| 10012 | 28884  |  1.0  |  1   |
| 10012 |  4370  |  1.0  |  2   |
| 10012 | 42048  |  1.0  |  3   |
| 10012 |  3848  |  1.0  |  4   |
| 10012 | 35796  |  1.0  |  5   |
| 10022 | 28884  |  1.0  |  1   |
| 10022 |  4370  |  1.0  |  2   |
| 10022 | 42048  |  1.0  |  3   |
| 10022 |  3848  |  1.0  |  4   |
| 10022 | 35796  |  1.0  |  5   |
| 10023 | 28884  |  1.0  |  1   |
| 10023 |  4370  |  1.0  |  2   |
| 10023 | 42048  |  1.0  |  3   |
| 10023 |  3848  |  1.0  |  4   |
| 10023 | 35796  |  1.0  |  5   |
| 10029 | 28884  |  1.0  |  1   |
| 10029 |  437

#### Note
* In collaborative filtering above, we used two similarity measures: cosine and Pearson. We applied each to three training datasets: raw purchase counts, purchase dummy, and normalized purchase counts.
* With the cosine models, the recommendations differ from user to user, which suggests that personalization does exist. The Pearson models, by contrast, return the same items as the popularity baseline for the users displayed.
* But how good is each model compared to the baseline, and to the others? We need some means of evaluating a recommendation engine. Let's focus on that in the next section.

## 7. Model Evaluation
For evaluating recommendation engines, we can use RMSE together with precision and recall.

* RMSE (Root Mean Squared Error)
    * Measures the error of predicted values
    * The lower the RMSE value, the better the recommendations
* Recall
    * What percentage of the products that a user buys are actually recommended?
    * If a customer buys 5 products and the recommender showed 3 of them, then the recall is 0.6
* Precision
    * Out of all the recommended items, how many did the user actually like?
    * If 5 products were recommended to the customer and they buy 4 of them, then precision is 0.8
    
* Why are both recall and precision important?
    * Consider a case where we recommend all products, so the recommendations surely cover the items that customers liked and bought. In this case, we have 100% recall! Does this mean our model is good?
    * We have to consider precision. If we recommend 300 items but the user likes and buys only 3 of them, then precision is 1%! This very low precision indicates that the model is not great, despite its excellent recall.
    * So our aim has to be optimizing both recall and precision (to be as close to 1 as possible).
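The three metrics can be computed by hand for a single customer. The sketch below is a minimal illustration with made-up item IDs; the helper names `precision_recall_at_k` and `rmse` are hypothetical and are not part of `turicreate`.

```python
import math

def precision_recall_at_k(recommended, purchased, k):
    """Precision and recall of the top-k recommendations for one customer."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(purchased))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(purchased) if purchased else 0.0
    return precision, recall

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# 5 items recommended, the customer buys 4 of them.
recommended = ["A", "B", "C", "D", "E"]
purchased = ["A", "B", "C", "D"]
precision, recall = precision_recall_at_k(recommended, purchased, k=5)

# Recommending the whole 300-item catalogue to a customer who buys 3 items.
catalogue = ["item%d" % i for i in range(300)]
bought = ["item5", "item42", "item250"]
precision_all, recall_all = precision_recall_at_k(catalogue, bought, k=300)

error = rmse([1, 2, 3], [1, 2, 5])
print(precision, recall, precision_all, recall_all, error)
```

The second case shows the trade-off in numbers: recommending everything gives a recall of 1.0 but a precision of only 0.01.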

Let's compare all the models we have built based on their precision-recall characteristics:

In [31]:
# create initial callable variables

models_w_counts = [popularity_model, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]

names_w_counts = ['Popularity Model on Purchase Counts', 'Cosine Similarity on Purchase Counts', 'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy', 'Cosine Similarity on Purchase Dummy', 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts', 'Cosine Similarity on Scaled Purchase Counts', 'Pearson Similarity on Scaled Purchase Counts']

#### Models on purchase counts

In [32]:
eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)

PROGRESS: Evaluate model Popularity Model on Purchase Counts



Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    |          0.0           |          0.0           |
|   2    |          0.0           |          0.0           |
|   3    |          0.0           |          0.0           |
|   4    | 0.00019546520719312008 | 0.00039093041438624016 |
|   5    | 0.0003127443315089923  | 0.0007818608287724801  |
|   6    | 0.00026062027625749315 | 0.0007818608287724801  |
|   7    | 0.00022338880822070866 | 0.0007818608287724801  |
|   8    | 0.00019546520719312003 | 0.0007818608287724801  |
|   9    | 0.0001737468508383287  | 0.0007818608287724801  |
|   10   | 0.0003127443315089918  |  0.001759186864738078  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.8169896621803668

Per User RMSE (best)
+----------+----


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.01876465989053949  | 0.01329163408913214  |
|   2    | 0.01798279906176703  | 0.026127182694813667 |
|   3    | 0.01641907740422207  | 0.033906697941099816 |
|   4    | 0.016419077404222073 | 0.044899288878960567 |
|   5    | 0.015324472243940583 | 0.05144085781302352  |
|   6    | 0.01472504560854834  | 0.059220373059309765 |
|   7    | 0.013961800513794259 | 0.06521463941323208  |
|   8    | 0.013389366692728694 | 0.06965821512342238  |
|   9    | 0.013031013812874657 | 0.07662980751331035  |
|   10   | 0.012978889757623126 |  0.0821493726497635  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.4180401672720027

Per User RMSE (best)
+-------+---------------------+-------+
| LYBID |         rmse        | 


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    |          0.0           |          0.0           |
|   2    |          0.0           |          0.0           |
|   3    |          0.0           |          0.0           |
|   4    | 0.00019546520719312003 | 0.00039093041438624006 |
|   5    | 0.0003127443315089923  | 0.0007818608287724801  |
|   6    | 0.00026062027625749315 | 0.0007818608287724801  |
|   7    | 0.00022338880822070866 | 0.0007818608287724801  |
|   8    | 0.00019546520719312003 | 0.0007818608287724801  |
|   9    | 0.0001737468508383287  | 0.0007818608287724801  |
|   10   | 0.00023455824863174383 | 0.0015637216575449587  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.8436356486668803

Per User RMSE (best)
+----------+----

#### Models on purchase dummy

In [None]:
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)

#### Models on normalized purchase frequency

In [None]:
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)

## 8. Final Output
* In this step, we reshape the recommendation output into a format we can export to CSV, and define a function that returns the recommendation list for a given customer ID.
* We first need to retrain the chosen model on the whole dataset, since model selection used only the training data and evaluation used the test set.

In [None]:
users_to_recommend = list(data[user_id].unique())

##have to make the choice of the model that works for us --Raakshasan
final_model = tc.item_similarity_recommender.create(tc.SFrame(data_dummy), 
                                            user_id=user_id, 
                                            item_id=item_id, 
                                            target='purchase_dummy', 
                                            similarity_type='cosine')

recom = final_model.recommend(users=users_to_recommend, k=n_rec)
recom.print_rows(n_display)

### 8.1. CSV output file

In [None]:
df_rec = recom.to_dataframe()
print(df_rec.shape)
df_rec.head()

In [None]:
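# Join each customer's recommended item IDs into one pipe-separated string
df_rec['recommendedProducts'] = df_rec.groupby([user_id])[item_id].transform(lambda x: '|'.join(x.astype(str)))
# Use the user_id column ('LYBID'); this dataset has no 'customerId' column
df_output = df_rec[[user_id, 'recommendedProducts']].drop_duplicates().sort_values(user_id).set_index(user_id)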

#### Define a function to create a desired output

In [None]:
def create_output(model, users_to_recommend, n_rec, print_csv=True):
    # Relies on the notebook-level user_id and item_id variables defined earlier
    recommendation = model.recommend(users=users_to_recommend, k=n_rec)
    df_rec = recommendation.to_dataframe()
    df_rec['recommendedProducts'] = df_rec.groupby([user_id])[item_id] \
        .transform(lambda x: '|'.join(x.astype(str)))
    # Use the user_id column ('LYBID'); this dataset has no 'customerId' column
    df_output = df_rec[[user_id, 'recommendedProducts']].drop_duplicates() \
        .sort_values(user_id).set_index(user_id)
    if print_csv:
        df_output.to_csv('../output/option1_recommendation.csv')
        print("An output file can be found in 'output' folder with name 'option1_recommendation.csv'")
    return df_output

In [None]:
df_output = create_output(final_model, users_to_recommend, n_rec, print_csv=True)
print(df_output.shape)
df_output.head()

### 8.2. Customer recommendation function

In [None]:
def customer_recomendation(customer_id):
    if customer_id not in df_output.index:
        print('Customer not found.')
        return customer_id
    return df_output.loc[customer_id]

In [None]:
customer_recomendation(4)

In [None]:
customer_recomendation(21)

## Summary
In this exercise, we walked through a step-by-step process for making recommendations to customers. We used collaborative filtering approaches with the `cosine` and `pearson` measures and compared the models with our baseline popularity model. We also prepared three sets of data, using the regular purchase count, the purchase dummy, and the normalized purchase frequency as the target variable. Using RMSE, precision and recall, we evaluated our models and observed the impact of personalization. Finally, we selected the cosine approach on the purchase dummy data.