### Adventure Works Group - Recommendation Model

The Goal -
We will build collaborative filtering models that recommend products to customers based on their purchase data.

## The tool will also be able to search for a recommendation list based on a specified user, such that:
Input: customer ID


Returns: a ranked list of items (product IDs) that the user is most likely to want to put in their (empty) “basket”

#### 1. Import modules
pandas and numpy for data manipulation

turicreate for performing model selection and evaluation

sklearn for splitting the data into train and test set


In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

import sys
sys.path.append("..")



#### 2. Load data

In [2]:
customers = pd.read_excel('/Users/joeldias/Desktop/Customers only.xlsx')

In [3]:
data = pd.read_excel('/Users/joeldias/Desktop/CustomerPurchaseData.xlsx')

In [4]:
data.head()

Unnamed: 0,CustomerID,ProductName,PurchaseCount
0,11000,"Touring-1000 Blue, 46",1
1,11000,"Mountain-200 Silver, 38",1
2,11000,Touring Tire,1
3,11000,"Mountain-100 Silver, 38",1
4,11000,Touring Tire Tube,1


#### 3.1. Create dummy
A dummy variable marks whether a customer bought an item.
If a customer bought an item, purchase_dummy is set to 1.

Why create a dummy instead of normalizing? Normalizing the purchase count per user would not work, because customers with different buying frequencies do not necessarily have the same taste. However, we can normalize each item's purchase frequency across all users, which is done in the next section.

In [5]:
def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy
data_dummy = create_data_dummy(data)

#### 3.3. Normalize item values across users

To do this, we first create a user-item matrix and then normalize the purchase frequency of each item across users:

In [6]:
df_matrix = pd.pivot_table(data, values='PurchaseCount', index='CustomerID', columns='ProductName')

In [7]:
df_matrix.head()

CustomerID,AWC Logo Cap,All-Purpose Bike Stand,Bike Wash - Dissolver,"Classic Vest, L","Classic Vest, M","Classic Vest, S",Fender Set - Mountain,HL Mountain Tire,HL Road Tire,"Half-Finger Gloves, L",...,"Touring-3000 Blue, 62","Touring-3000 Yellow, 44","Touring-3000 Yellow, 50","Touring-3000 Yellow, 54","Touring-3000 Yellow, 58","Touring-3000 Yellow, 62",Water Bottle - 30 oz.,"Women's Mountain Shorts, L","Women's Mountain Shorts, M","Women's Mountain Shorts, S"
11000,,,,,,,1.0,,,,...,,,,,,,,,,
11001,1.0,,,,,,1.0,,,,...,,,,,,,2.0,,,
11002,,,,,,,,,,,...,,,,,,,,,,
11003,1.0,,,,,,,,,,...,,,,,,,1.0,,,
11004,,,,,,,1.0,,,,...,,,,,,,,,,


In [8]:
df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())

In [9]:
df_matrix_norm.head()

CustomerID,AWC Logo Cap,All-Purpose Bike Stand,Bike Wash - Dissolver,"Classic Vest, L","Classic Vest, M","Classic Vest, S",Fender Set - Mountain,HL Mountain Tire,HL Road Tire,"Half-Finger Gloves, L",...,"Touring-3000 Blue, 62","Touring-3000 Yellow, 44","Touring-3000 Yellow, 50","Touring-3000 Yellow, 54","Touring-3000 Yellow, 58","Touring-3000 Yellow, 62",Water Bottle - 30 oz.,"Women's Mountain Shorts, L","Women's Mountain Shorts, M","Women's Mountain Shorts, S"
11000,,,,,,,0.0,,,,...,,,,,,,,,,
11001,0.0,,,,,,0.0,,,,...,,,,,,,0.25,,,
11002,,,,,,,,,,,...,,,,,,,,,,
11003,0.0,,,,,,,,,,...,,,,,,,0.0,,,
11004,,,,,,,0.0,,,,...,,,,,,,,,,
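
To make the scaling concrete, here is a minimal pure-Python sketch of the same per-item min-max formula, scaled = (x - min) / (max - min), applied to a small made-up set of purchases. The product names, counts, and the helper name `min_max_by_item` are hypothetical. It also shows an edge case: when every buyer of an item has the same purchase count, max equals min and the ratio is 0/0. The sketch marks those values as NaN, which is also what the pandas expression above yields for such a column.

```python
import math

# Hypothetical purchase counts: {customer: {product: count}}; a missing key means "not bought".
counts = {
    1: {"Tire": 1, "Cap": 1},
    2: {"Tire": 3},
    3: {"Tire": 2, "Cap": 1},
}

def min_max_by_item(counts):
    """Scale each product's counts to [0, 1] across the customers who bought it."""
    per_item = {}
    for bought in counts.values():
        for product, n in bought.items():
            per_item.setdefault(product, []).append(n)
    scaled = {}
    for customer, bought in counts.items():
        scaled[customer] = {}
        for product, n in bought.items():
            lo, hi = min(per_item[product]), max(per_item[product])
            # When max == min the ratio is 0/0, so mark the value as NaN.
            scaled[customer][product] = (n - lo) / (hi - lo) if hi > lo else math.nan
    return scaled

scaled = min_max_by_item(counts)
print(scaled)
```

Here "Tire" is scaled to 0.0, 1.0, and 0.5 for customers 1, 2, and 3, while every "Cap" value becomes NaN because all of its buyers purchased exactly one.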


#### Create a table for input to the modeling

In [10]:

d = df_matrix_norm.reset_index() 
d.index.names = ['scaled_purchase_freq'] 
data_norm = pd.melt(d, id_vars=['CustomerID'], value_name='scaled_purchase_freq').dropna()
print(data_norm.shape)
data_norm.head()


#### The above steps can be combined to a single function as defined below:

In [11]:
def normalize_data(data):
    # Build the user-item matrix of purchase counts
    df_matrix = pd.pivot_table(data, values='PurchaseCount', index='CustomerID', columns='ProductName')
    # Min-max scale each item (column) across users
    df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
    # Convert back to long format and drop user-item pairs with no value
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['CustomerID'], value_name='scaled_purchase_freq').dropna()

#### 4. Split train and test set
Splitting the data into training and testing sets is an important part of evaluating predictive modeling, in this case a collaborative filtering model. Typically, we use a larger portion of the data for training and a smaller portion for testing.

We use 80:20 ratio for our train-test set size.

The training portion is used to develop the predictive model, and the test portion is used to evaluate the model’s performance.
Let’s define a splitting function below.

In [12]:
def split_data(data):
    '''
    Splits dataset into training and test set.
    
    Args:
        data (pandas.DataFrame)
        
    Returns:
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size = .2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data
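
As an illustration of what an 80:20 split does, the following sketch shuffles row indices with a fixed seed and slices them. It uses only the standard library. The function name `split_indices` and the seed value are hypothetical, and scikit-learn's `train_test_split` may round the test size differently.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) for a random split of range(n_rows)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # a fixed seed makes the split repeatable
    n_test = int(round(n_rows * test_size))
    return idx[n_test:], idx[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

With scikit-learn, passing `random_state` to `train_test_split` serves the same purpose of making the split repeatable.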

#### Now that we have three datasets with purchase counts, purchase dummy, and scaled purchase counts, we would like to split each for modeling.

In [13]:
train_data, test_data = split_data(data)
train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)

#### 5. Define Models using Turicreate library

Before running a more complicated approach such as collaborative filtering, we should run a baseline model against which the other models can be compared. Since a baseline typically uses a very simple approach, a more complex technique should be chosen only if it shows clearly better accuracy for its added complexity. In this case, we will use the popularity model as the baseline.

A more complicated but common approach to predicting purchase items is collaborative filtering. We will discuss the popularity model and collaborative filtering in more detail in later sections. For now, let’s first define the variables to use in the models:

In [14]:
# constant variables to define field names include:
user_id = 'CustomerID'
item_id = 'ProductName'
users_to_recommend = list(customers[user_id])
n_rec = 10 # number of items to recommend
n_display = 30 # to display the first few rows in an output dataset

# Note: customers must be loaded from the customers Excel sheet (step 2) before this cell runs


#### Turicreate has made it easy for us to call a modeling technique, so let’s define our function for all models as follows:

In [15]:
def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    if name == 'popularity':
        model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)
    elif name == 'cosine':
        model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='cosine')
    elif name == 'pearson':
        model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='pearson')
        
    else:
        raise ValueError('Unknown model name: {}'.format(name))

    recom = model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return model

#### 6. Popularity Model as Baseline
The popularity model recommends the most popular items. These are the products with the highest number of sales across customers.

Training data is used for model selection
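
A minimal sketch of the idea, taking popularity to mean total units sold per item as described above: it ranks items by total purchase count and skips items the customer already bought. The sample data and the function name `recommend_popular` are hypothetical, skipping already-bought items is an assumption of this sketch, and Turi Create's own scoring may differ.

```python
from collections import Counter

# Hypothetical purchase records: (customer_id, product, purchase_count)
purchases = [
    (1, "Tire", 2), (1, "Cap", 1),
    (2, "Tire", 1), (2, "Bottle", 3),
    (3, "Bottle", 1),
]

def recommend_popular(purchases, customer_id, k=2):
    """Rank items by total units sold, skipping items the customer already bought."""
    totals = Counter()
    for _, product, n in purchases:
        totals[product] += n
    owned = {p for c, p, _ in purchases if c == customer_id}
    # Sort by descending total, breaking ties alphabetically for a stable order.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, _ in ranked if p not in owned][:k]

print(recommend_popular(purchases, customer_id=3))
```

Because the ranking is computed once over all customers, two customers receive different lists only when they already own different items.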


#### i. Using purchase count

In [16]:
name = 'popularity'
target = 'PurchaseCount'
popularity = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+------------------------+--------------------+------+
| CustomerID |      ProductName       |       score        | rank |
+------------+------------------------+--------------------+------+
|   11000    |  Patch Kit/8 Patches   | 1.084774356811472  |  1   |
|   11000    |     Road Tire Tube     | 1.0689265536723165 |  2   |
|   11000    |      HL Road Tire      | 1.0578386605783867 |  3   |
|   11000    | Water Bottle - 30 oz.  | 1.0411084043848964 |  4   |
|   11000    | Bike Wash - Dissolver  | 1.0410764872521245 |  5   |
|   11000    |    LL Mountain Tire    | 1.0408472012102874 |  6   |
|   11000    |   Mountain Tire Tube   | 1.0406330196749358 |  7   |
|   11000    |      ML Road Tire      | 1.0383522727272727 |  8   |
|   11000    | Mountain-200 Black, 46 | 1.0359408033826638 |  9   |
|   11000    | Sport-100 Helmet, Blue | 1.0347137637028014 |  10  |
|   11001    |  Patch Kit/8 Patches   | 1.084774356811472  |  1   |
|   11001    |     Road Tire Tube     | 1.068926

#### ii. Using purchase dummy

In [17]:

name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+--------------------------------+-------+------+
| CustomerID |          ProductName           | score | rank |
+------------+--------------------------------+-------+------+
|   11000    |     Road-550-W Yellow, 44      |  1.0  |  1   |
|   11000    |     Water Bottle - 30 oz.      |  1.0  |  2   |
|   11000    |    Mountain-200 Silver, 42     |  1.0  |  3   |
|   11000    | Short-Sleeve Classic Jersey, L |  1.0  |  4   |
|   11000    |       Road-750 Black, 48       |  1.0  |  5   |
|   11000    |     Sport-100 Helmet, Blue     |  1.0  |  6   |
|   11000    |   Women's Mountain Shorts, M   |  1.0  |  7   |
|   11000    |         Road Tire Tube         |  1.0  |  8   |
|   11000    |          LL Road Tire          |  1.0  |  9   |
|   11000    |     Mountain-500 Black, 42     |  1.0  |  10  |
|   11001    |        HL Mountain Tire        |  1.0  |  1   |
|   11001    |     Road-550-W Yellow, 44      |  1.0  |  2   |
|   11001    |    Mountain-200 Silver, 42     |  1.0  |

#### iii. Using scaled purchase count

In [18]:
name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-------------------------+----------------------+------+
| CustomerID |       ProductName       |        score         | rank |
+------------+-------------------------+----------------------+------+
|   11000    |  Mountain-200 Black, 46 | 0.033126293995859216 |  1   |
|   11000    |  Mountain-200 Black, 38 | 0.029545454545454545 |  2   |
|   11000    |    Road-250 Black, 52   | 0.02880658436213992  |  3   |
|   11000    |    Road-250 Black, 48   | 0.02390438247011952  |  4   |
|   11000    |  Half-Finger Gloves, S  | 0.020833333333333332 |  5   |
|   11000    | Mountain-200 Silver, 42 | 0.020833333333333332 |  6   |
|   11000    |  All-Purpose Bike Stand | 0.02040816326530612  |  7   |
|   11000    |     Road-250 Red, 58    | 0.02040816326530612  |  8   |
|   11000    |    Road-250 Black, 58   | 0.019417475728155338 |  9   |
|   11000    |    Road-250 Black, 44   | 0.018779342723004695 |  10  |
|   11001    |  Mountain-200 Black, 46 | 0.033126293995859216 |  1   |
|   11

#### 6.1. Baseline Summary

Once the models were created, we generated recommendations ranked by popularity score. In each result above, the rows show the first 30 records, with 10 recommendations per user. These 30 records cover 3 users and their recommended items, along with each item's score and its rank (1 is the top recommendation).

Although the different targets produce different recommendation lists, within each model the users are recommended largely the same 10 items. This is because popularity is calculated across all users rather than for each user individually.

#### 7. Collaborative Filtering Model
In collaborative filtering, we recommend items based on how similarly users purchase items. For instance, if customers 1 and 2 bought similar items, e.g. customer 1 bought X, Y, and Z and customer 2 bought X and Y, we would recommend item Z to customer 2.

#### 7.1. Methodology
To define similarity, we use the following steps:

1. Create a user-item matrix, where index values represent unique customer IDs and column values represent unique product IDs.

2. Create an item-to-item similarity matrix. The idea is to calculate how similar one product is to another. There are a number of ways of calculating this. In sections 7.2 and 7.3, we use the cosine and Pearson similarity measures, respectively.

To calculate the similarity between products X and Y, look at all customers who have purchased both items. For example, suppose both X and Y have been purchased by customers 1 and 2.

We then create two item vectors, v1 for item X and v2 for item Y, in the user space of (1, 2), and compute the cosine of the angle (or the Pearson correlation) between these vectors. A zero angle, i.e. overlapping vectors with a cosine of 1, means total similarity (the users' ratings of the two items are proportional), and an angle of 90 degrees means a cosine of 0, or no similarity.

3. For each customer, predict their likelihood of buying (or their purchase count for) each product they have not yet bought.

In our example, we calculate the rating of customer 2 for item Z (the target item). To do this, we take a weighted sum of the customer's purchase counts for the items they have already bought, where each weight is the just-calculated similarity between the target item and that item.

We then divide this weighted sum by the sum of the similarity measures, so that the predicted rating stays within the range of the customer's existing ratings.
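
Step 3 can be sketched as a similarity-weighted average. This is the textbook item-based formula, score(u, i) = Σ_j sim(i, j) · r(u, j) / Σ_j sim(i, j), where j runs over the items customer u has already bought. It is not necessarily what Turi Create computes internally, and the similarity values, purchase counts, and the function name `predict_score` are made up for illustration.

```python
def predict_score(target, bought, sim):
    """Similarity-weighted average of one customer's purchase counts.

    target: the item to score
    bought: {item: purchase_count} for the customer
    sim:    {(target, item): similarity}
    """
    num = sum(sim[(target, j)] * r for j, r in bought.items())
    den = sum(sim[(target, j)] for j in bought)
    return num / den if den else 0.0

# Hypothetical similarities of target item Z to items X and Y
sim = {("Z", "X"): 0.5, ("Z", "Y"): 0.25}
# Hypothetical history: customer 2 bought X twice and Y once
score = predict_score("Z", {"X": 2, "Y": 1}, sim)
print(score)
```

Because the result is a weighted average, it stays between the smallest and largest purchase count the customer has recorded (here between 1 and 2), as long as the similarities are non-negative.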


#### 7.2. Cosine similarity
Similarity is the cosine of the angle between the two item vectors of items A and B.
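
As a minimal sketch of this measure (not Turi Create's implementation), the function below computes cos(θ) = (a · b) / (‖a‖ ‖b‖) for two item vectors whose entries are purchase counts from the same customers in the same order. The vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

item_a = [1, 2, 0]  # counts from customers 1, 2, 3 (hypothetical)
item_b = [2, 4, 0]  # proportional to item_a, so the angle is zero
item_c = [0, 0, 3]  # bought only by customer 3, so orthogonal to item_a
print(cosine_similarity(item_a, item_b), cosine_similarity(item_a, item_c))
```

Proportional vectors give a similarity of 1 (up to floating-point rounding), and vectors with no customers in common give 0.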

#### i. Using purchase count

In [19]:
name = 'cosine'
target = 'PurchaseCount'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-------------------------+---------------------+------+
| CustomerID |       ProductName       |        score        | rank |
+------------+-------------------------+---------------------+------+
|   11000    |   Patch Kit/8 Patches   | 0.10993940383195877 |  1   |
|   11000    |  Water Bottle - 30 oz.  | 0.09440415352582932 |  2   |
|   11000    |      Road Tire Tube     |  0.0883534848690033 |  3   |
|   11000    |   Mountain Bottle Cage  | 0.08455412834882736 |  4   |
|   11000    |       AWC Logo Cap      | 0.08284298330545425 |  5   |
|   11000    |  Sport-100 Helmet, Blue | 0.07985440641641617 |  6   |
|   11000    |    Mountain Tire Tube   | 0.07619590312242508 |  7   |
|   11000    |     HL Mountain Tire    |  0.0709715336561203 |  8   |
|   11000    | Sport-100 Helmet, Black | 0.07075894623994827 |  9   |
|   11000    |       HL Road Tire      | 0.07018416374921799 |  10  |
|   11001    |  Sport-100 Helmet, Blue | 0.12026970585187276 |  1   |
|   11001    |  Spor

#### ii. Using purchase dummy

In [20]:
name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+--------------------------------+----------------------+------+
| CustomerID |          ProductName           |        score         | rank |
+------------+--------------------------------+----------------------+------+
|   11000    |     Water Bottle - 30 oz.      | 0.08419644832611084  |  1   |
|   11000    |          AWC Logo Cap          | 0.08019295760563441  |  2   |
|   11000    |      Mountain Bottle Cage      | 0.07766308954783849  |  3   |
|   11000    |        HL Mountain Tire        | 0.07653546333312988  |  4   |
|   11000    |      Patch Kit/8 Patches       | 0.07565557956695557  |  5   |
|   11000    |       Mountain Tire Tube       | 0.07179322413035802  |  6   |
|   11000    |     Sport-100 Helmet, Blue     |  0.0672506434576852  |  7   |
|   11000    |    Sport-100 Helmet, Black     |  0.0637678929737636  |  8   |
|   11000    |        Road Bottle Cage        | 0.05836418696812221  |  9   |
|   11000    |    Hydration Pack - 70 oz.     | 0.05376683814185

#### iii. Using scaled purchase count

In [21]:
name = 'cosine' 
target = 'scaled_purchase_freq' 
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+--------------------------------+----------------------+------+
| CustomerID |          ProductName           |        score         | rank |
+------------+--------------------------------+----------------------+------+
|   11000    | Short-Sleeve Classic Jersey, M |         0.0          |  1   |
|   11000    |     Mountain-200 Black, 46     |         0.0          |  2   |
|   11000    |          AWC Logo Cap          |         0.0          |  3   |
|   11000    |    Touring-1000 Yellow, 46     |         0.0          |  4   |
|   11000    |          ML Road Tire          |         0.0          |  5   |
|   11000    |        Road Bottle Cage        |         0.0          |  6   |
|   11000    |     Sport-100 Helmet, Blue     |         0.0          |  7   |
|   11000    |     Mountain-200 Black, 42     |         0.0          |  8   |
|   11000    |    Hydration Pack - 70 oz.     |         0.0          |  9   |
|   11000    |     Water Bottle - 30 oz.      |         0.0     

#### 7.3. Pearson similarity
Similarity is the Pearson correlation coefficient between the two item vectors.
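
A minimal sketch of the Pearson coefficient on hypothetical item vectors follows. It equals the cosine of the mean-centered vectors, so it ignores differences in the average purchase level of the two items. When a vector is constant its variance is zero and the coefficient is undefined; returning 0.0 in that case is an assumption of this sketch, not a documented Turi Create behavior.

```python
import math

def pearson_similarity(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    cov = sum(x * y for x, y in zip(da, db))
    var_a = sum(x * x for x in da)
    var_b = sum(y * y for y in db)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: undefined correlation is reported as 0.0
    return cov / math.sqrt(var_a * var_b)

print(pearson_similarity([1, 2, 3], [2, 4, 6]))  # counts rise together
print(pearson_similarity([1, 2, 3], [3, 2, 1]))  # counts move in opposite directions
print(pearson_similarity([1, 1, 1], [1, 2, 3]))  # constant vector
```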

#### i. Using purchase count

In [22]:
name = 'pearson'
target = 'PurchaseCount'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-------------------------+--------------------+------+
| CustomerID |       ProductName       |       score        | rank |
+------------+-------------------------+--------------------+------+
|   11000    |   Patch Kit/8 Patches   | 1.0770100769690956 |  1   |
|   11000    |      Road Tire Tube     | 1.0611391546783484 |  2   |
|   11000    |       HL Road Tire      | 1.050259952346937  |  3   |
|   11000    |  Water Bottle - 30 oz.  | 1.0383609580862965 |  4   |
|   11000    |    Mountain Tire Tube   | 1.0363986535560263 |  5   |
|   11000    |  Bike Wash - Dissolver  | 1.0362966939300378 |  6   |
|   11000    |  Mountain-200 Black, 46 | 1.0359359754064354 |  7   |
|   11000    |     LL Mountain Tire    | 1.0343510846913135 |  8   |
|   11000    |  Mountain-200 Black, 38 | 1.0330347599962206 |  9   |
|   11000    |       ML Road Tire      | 1.0327007689259269 |  10  |
|   11001    |   Patch Kit/8 Patches   | 1.1255128798179115 |  1   |
|   11001    |      Road Tire Tube

#### ii. Using purchase dummy

In [23]:
name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+--------------------------------+-------+------+
| CustomerID |          ProductName           | score | rank |
+------------+--------------------------------+-------+------+
|   11000    |     Road-550-W Yellow, 44      |  0.0  |  1   |
|   11000    |     Water Bottle - 30 oz.      |  0.0  |  2   |
|   11000    |    Mountain-200 Silver, 42     |  0.0  |  3   |
|   11000    | Short-Sleeve Classic Jersey, L |  0.0  |  4   |
|   11000    |       Road-750 Black, 48       |  0.0  |  5   |
|   11000    |     Sport-100 Helmet, Blue     |  0.0  |  6   |
|   11000    |   Women's Mountain Shorts, M   |  0.0  |  7   |
|   11000    |         Road Tire Tube         |  0.0  |  8   |
|   11000    |          LL Road Tire          |  0.0  |  9   |
|   11000    |     Mountain-500 Black, 42     |  0.0  |  10  |
|   11001    |        HL Mountain Tire        |  0.0  |  1   |
|   11001    |     Road-550-W Yellow, 44      |  0.0  |  2   |
|   11001    |    Mountain-200 Silver, 42     |  0.0  |

#### iii. Using scaled purchase count

In [24]:
name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-------------------------+----------------------+------+
| CustomerID |       ProductName       |        score         | rank |
+------------+-------------------------+----------------------+------+
|   11000    |  Mountain-200 Black, 46 | 0.03312016465155482  |  1   |
|   11000    |  Mountain-200 Black, 38 | 0.029540398084756104 |  2   |
|   11000    |    Road-250 Black, 52   | 0.02880497503673098  |  3   |
|   11000    |    Road-250 Black, 48   | 0.023902882419892666 |  4   |
|   11000    | Mountain-200 Silver, 42 |  0.0208300252755483  |  5   |
|   11000    |  Half-Finger Gloves, S  | 0.020581831534703582 |  6   |
|   11000    |     Road-250 Red, 58    | 0.02040610690506137  |  7   |
|   11000    |  All-Purpose Bike Stand | 0.02034255841962334  |  8   |
|   11000    |    Road-250 Black, 58   | 0.01941655185616133  |  9   |
|   11000    |    Road-250 Black, 44   | 0.018778289707613662 |  10  |
|   11001    |  Mountain-200 Black, 46 | 0.03312132978044436  |  1   |
|   11

#### 8. Model Evaluation
For evaluating recommendation engines, we can use RMSE and precision-recall.

i. RMSE (Root Mean Squared Error)
Measures the error of the predicted values
The lower the RMSE, the better the recommendations

ii. Recall
What percentage of the products that a user buys are actually recommended?
If a customer buys 5 products and 3 of them were recommended, then the recall is 0.6

iii. Precision
Out of all the recommended items, how many did the user actually buy?
If 5 products were recommended to the customer and they buy 4 of them, then the precision is 0.8
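
The three metrics can be sketched in a few lines of plain Python. The helper names and the sample data are hypothetical; the two `precision_recall` calls mirror the recall and precision examples given above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def precision_recall(recommended, bought):
    """Precision and recall of a recommendation list against actual purchases."""
    hits = len(set(recommended) & set(bought))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(bought) if bought else 0.0
    return precision, recall

# Recall example: the customer buys 5 products, 3 of which were recommended.
_, recall = precision_recall(["a", "b", "c", "x"], ["a", "b", "c", "d", "e"])
# Precision example: 5 products are recommended and the customer buys 4 of them.
precision, _ = precision_recall(["a", "b", "c", "d", "z"], ["a", "b", "c", "d"])
print(recall, precision, rmse([1, 2, 3], [1, 2, 5]))
```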

#### Let’s first create initial callable variables for model evaluation:

In [25]:
models_w_counts = [popularity, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]
names_w_counts = ['Popularity Model on Purchase Counts', 'Cosine Similarity on Purchase Counts', 'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy', 'Cosine Similarity on Purchase Dummy', 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts', 'Cosine Similarity on Scaled Purchase Counts', 'Pearson Similarity on Scaled Purchase Counts']

#### Let’s compare all the models we have built based on RMSE and precision-recall characteristics:

In [26]:
eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)

PROGRESS: Evaluate model Popularity Model on Purchase Counts



Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.07213449960240828  | 0.054686353317501615 |
|   2    | 0.04799500170396433  | 0.07218273351977761  |
|   3    | 0.04244765042220469  | 0.09473673296801807  |
|   4    | 0.055748040440758935 |  0.1671836260726413  |
|   5    | 0.049028740202203665 | 0.18285726006105413  |
|   6    | 0.047143019425195824 | 0.21253477800677567  |
|   7    | 0.048522419305107006 |  0.2545575191718545  |
|   8    | 0.04698682267408819  | 0.28113463303455183  |
|   9    | 0.04439143221376553  |  0.2990773798254198  |
|   10   |  0.0428944677950698  |  0.3208568868116759  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.25038784622738475

Per User RMSE (best)
+------------+------+-------+
| CustomerID | rmse | count |
+----------


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.024082699079859216 | 0.017706535831038792 |
|   2    | 0.029876178575485662 | 0.04479434319829639  |
|   3    | 0.04347002915672688  | 0.09835702655480037  |
|   4    | 0.04117914347381591  | 0.12482653821118005  |
|   5    | 0.041008747018061974 | 0.15381016392319405  |
|   6    | 0.039797038888257835 | 0.18027192209005286  |
|   7    | 0.03964557537203228  | 0.20901541501905024  |
|   8    |  0.038921390435079   |  0.2333489755476571  |
|   9    |  0.0384212452825427  |  0.2581042447650422  |
|   10   | 0.038884471202998934 | 0.28794003306953475  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.9513775649965248

Per User RMSE (best)
+------------+---------------------+-------+
| CustomerID |         rmse


Precision and recall summary statistics by cutoff
+--------+----------------------+---------------------+
| cutoff |    mean_precision    |     mean_recall     |
+--------+----------------------+---------------------+
|   1    | 0.06304668862887612  | 0.04894575081727179 |
|   2    | 0.03811200727024863  | 0.05862994938594165 |
|   3    | 0.029989776212654914 | 0.06843721206154503 |
|   4    | 0.025474270135181134 | 0.07661529529074668 |
|   5    | 0.021538112007270276 | 0.08066694434977974 |
|   6    | 0.019122268923473074 | 0.08589243565956063 |
|   7    | 0.017818600801674786 | 0.09379504461862763 |
|   8    | 0.01749403612404858  | 0.10469473790500713 |
|   9    | 0.017127999293170267 | 0.11417067413886671 |
|   10   | 0.018016585255026748 | 0.13213424716321465 |
+--------+----------------------+---------------------+
[10 rows x 3 columns]


Overall RMSE: 0.2321305854961837

Per User RMSE (best)
+------------+------+-------+
| CustomerID | rmse | count |
+------------+------+-----


Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    | 0.0010331764435770874 | 0.0006983507442696967 |
|   2    |  0.018195385145218654 |  0.02866139874233099  |
|   3    |  0.026097271648873044 |  0.06012098368601651  |
|   4    |  0.023590862128343368 |   0.0729955101468129  |
|   5    |  0.027367696016530836 |  0.10496084133726223  |
|   6    |  0.023935254276202547 |  0.10947429176392598  |
|   7    |  0.021745904193384347 |  0.11525351999752177  |
|   8    |   0.020735277235679   |   0.1255576783960434  |
|   9    |  0.019745149810584298 |  0.13403068187823136  |
|   10   |  0.02003214326713352  |  0.14974211587975844  |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.0

Per User RMSE (best)
+------------+------+-------+
| CustomerID | rmse | count |


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.048214900700263665 | 0.032368534219641976 |
|   2    | 0.04861669153943305  | 0.06603988206846677  |
|   3    | 0.04779397696399203  | 0.10038375125047184  |
|   4    | 0.04428309034554006  | 0.12462321859061554  |
|   5    | 0.04626334519572936  | 0.16622592671553757  |
|   6    | 0.04695212949144756  | 0.20301083468630257  |
|   7    | 0.046132148187021035 | 0.23378035980779616  |
|   8    | 0.04491447594994829  |  0.2594988000940235  |
|   9    | 0.042513297363486906 |  0.2755685203710681  |
|   10   | 0.04037423946734032  |  0.2913506094283268  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.9130893431659411

Per User RMSE (best)
+------------+---------------------+-------+
| CustomerID |         rmse


Precision and recall summary statistics by cutoff
+--------+----------------------+-----------------------+
| cutoff |    mean_precision    |      mean_recall      |
+--------+----------------------+-----------------------+
|   1    | 0.001033176443577088 | 0.0006983507442696965 |
|   2    | 0.018195385145218707 |  0.028661398742330976 |
|   3    | 0.02609727164887309  |  0.06012098368601632  |
|   4    | 0.02359086212834341  |  0.07299551014681324  |
|   5    | 0.027367696016530912 |  0.10496084133726206  |
|   6    | 0.023935254276202463 |  0.10947429176392569  |
|   7    | 0.021745904193384322 |   0.1152535199975221  |
|   8    | 0.020735277235679028 |  0.12555767839604395  |
|   9    | 0.019745149810584398 |  0.13403068187823156  |
|   10   | 0.020032143267133404 |   0.149742115879758   |
+--------+----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0

Per User RMSE (best)
+------------+------+-------+
| CustomerID | rmse | count |
+------------


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.016706111652512835 | 0.012436772007981779 |
|   2    | 0.017054155645273555 | 0.025627639333611636 |
|   3    | 0.014664253561650204 | 0.03366745556638355  |
|   4    | 0.01252958373938465  | 0.03849366559933196  |
|   5    | 0.013169984686064347 |  0.0504315745510234  |
|   6    | 0.013295280523458202 | 0.06071682698788377  |
|   7    | 0.01344444223464129  |  0.0719818508869047  |
|   8    | 0.012616594737574871 | 0.07691634127315664  |
|   9    | 0.012142868191872786 | 0.08366839473271312  |
|   10   |  0.0117221216761799  | 0.09000279540095786  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.09042432260902869

Per User RMSE (best)
+------------+----------------------+-------+
| CustomerID |         rm


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.021717945148266624 | 0.016087919227276117 |
|   2    | 0.020116942781567546 | 0.029990343160327837 |
|   3    |  0.0339690936934429  | 0.07760773342702892  |
|   4    | 0.036370597243491495 |  0.1122742409878702  |
|   5    | 0.03399693721286379  |  0.1320365103672909  |
|   6    | 0.03111513295280543  | 0.14390829096035926  |
|   7    |  0.0353811578926433  | 0.18868818351973005  |
|   8    | 0.03711889182792713  |  0.2274127183009261  |
|   9    | 0.03495908549507321  |  0.2406457376212358  |
|   10   | 0.03463733815954328  | 0.26658794243020956  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.08785621326090001

Per User RMSE (best)
+------------+------+-------+
| CustomerID | rmse | count |
+----------


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.016706111652512932 | 0.012436772007981807 |
|   2    | 0.017054155645273572 | 0.025627639333611747 |
|   3    | 0.014664253561650143 | 0.03366745556638357  |
|   4    | 0.012529583739384674 |  0.0384936655993317  |
|   5    | 0.013225671724906043 | 0.05080282147663453  |
|   6    | 0.01352730985196529  | 0.06270592602904973  |
|   7    | 0.012668801336488933 | 0.06879669590236222  |
|   8    | 0.012007517750243646 |  0.074009621482822   |
|   9    | 0.011709746778659469 |  0.0812663387318823  |
|   10   | 0.011708199916469441 | 0.08993318660240593  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.08715169228835211

Per User RMSE (best)
+------------+-----------------------+-------+
| CustomerID |          

#### 8.1. Evaluation Output
Overall RMSE for each model:

1. Popularity on purchase counts: 0.25038784622738475
2. Cosine similarity on purchase counts: 0.9513775649965248
3. Pearson similarity on purchase counts: 0.2321305854961837

4. Popularity on purchase dummy: 0.0
5. Cosine similarity on purchase dummy: 0.9130893431659411
6. Pearson similarity on purchase dummy: 1.0

7. Popularity on scaled purchase counts: 0.09042432260902869
8. Cosine similarity on scaled purchase counts: 0.08785621326090001
9. Pearson similarity on scaled purchase counts: 0.08715169228835211
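
For reference, RMSE is the square root of the mean squared difference between the observed target and the score the model predicts for it. The following is a minimal sketch of that calculation in plain Python, not Turi Create's implementation; the `rmse` helper and the toy values are assumptions made for illustration.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out targets and model scores (illustrative values only).
actual = [1.0, 1.0, 1.0, 1.0]
predicted = [0.5, 1.0, 0.0, 1.0]
print(rmse(actual, predicted))
```

On the dummy target every observed value is 1, so a model that always predicts 1 would score an RMSE of 0.0 and a model that always predicts 0 would score 1.0. That may be why two of the dummy models sit exactly at those extremes.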



#### 8.2. Evaluation Summary
Popularity vs. Collaborative Filtering: The collaborative filtering algorithms outperform the popularity model on purchase counts. This is expected, because the popularity model offers no personalization: it recommends the same list of items to every user.
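
To make the lack of personalization concrete, here is a minimal sketch of a popularity ranking in plain Python. The `popularity_top_k` helper and the toy purchase list are hypothetical, and the sketch ignores details such as excluding items a customer already owns.

```python
from collections import Counter

def popularity_top_k(purchases, k):
    """Return the k most frequently purchased items across all users."""
    counts = Counter(item for _user, item in purchases)
    return [item for item, _count in counts.most_common(k)]

# Toy (customer, product) pairs, for illustration only.
purchases = [
    (11000, "Touring Tire"), (11000, "Water Bottle"),
    (11001, "Water Bottle"), (11001, "AWC Logo Cap"),
    (11002, "Water Bottle"), (11002, "Touring Tire"),
]

# The ranking never looks at the user, so every user gets the same list.
users = {user for user, _item in purchases}
recs = {user: popularity_top_k(purchases, 2) for user in users}
print(recs)
```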

Below is the evaluation summary of the 6 Collaborative Filtering models

1. Recommendation scores: The recommendation scores for 'cosine similarity on scaled purchase counts' and for 'Pearson similarity on purchase dummy' are zero and constant. These two models therefore cannot be used, because the recommended item list would be identical for multiple users.

2. Precision: 'cosine similarity on purchase dummy' > 'cosine similarity on purchase counts' > 'Pearson similarity on purchase counts' > 'Pearson similarity on scaled purchase counts'

3. Recall: 'cosine similarity on purchase dummy' > 'Pearson similarity on purchase counts' > 'cosine similarity on purchase counts' > 'Pearson similarity on scaled purchase counts'

4. RMSE: The RMSE for 'cosine similarity on purchase dummy' (0.913) is lower than for 'cosine similarity on purchase counts' (0.951), so the dummy variant also does better on this measure.
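
The precision and recall figures compared above are averaged over users at each cutoff k. A minimal sketch of the per-user calculation could look like the following, assuming the usual definitions: hits in the top k divided by k for precision, and hits divided by the number of items the user actually bought in the test set for recall. The function name and the toy item labels are illustrative only.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of item IDs produced by the model
    relevant:    set of item IDs the user actually bought in the test set
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked list and held-out purchases for a single user.
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "F"}

for k in (1, 3, 5):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

Recall can only grow as k increases, while precision usually falls once the list gets longer than the number of items a user buys, which is the general pattern in the tables above.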

# Therefore, we select the Cosine similarity on Purchase Dummy approach as our final model. 
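
For intuition about the selected approach: on binary purchase data, the cosine similarity between two items reduces to the number of customers who bought both, divided by the square root of the product of each item's buyer count. The sketch below shows only that item-to-item similarity under this assumption; how a recommender aggregates the similarities into a final score is not shown, and the helper name and toy data are hypothetical.

```python
import math

def cosine_binary(buyers_a, buyers_b):
    """Cosine similarity of two items given the sets of customers who bought them."""
    if not buyers_a or not buyers_b:
        return 0.0
    shared = len(buyers_a & buyers_b)
    return shared / math.sqrt(len(buyers_a) * len(buyers_b))

# Toy item -> set of customer IDs, for illustration only.
buyers = {
    "Water Bottle": {11000, 11001, 11002},
    "Bottle Cage": {11000, 11001},
    "Helmet": {11003},
}

print(cosine_binary(buyers["Water Bottle"], buyers["Bottle Cage"]))
print(cosine_binary(buyers["Water Bottle"], buyers["Helmet"]))
```

Items that tend to be bought by the same customers score close to 1, and items with no shared buyers score 0.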

#### 9. Final Output
We want to reshape the recommendation output into one row per customer so that it can be exported to a CSV file, and to provide a function that returns the recommendation list for a given customer ID.


We first need to retrain the model on the whole dataset, because the final model was selected using the training data and evaluated on the test set.

In [27]:
users_to_recommend = list(customers[user_id])

final_model = tc.item_similarity_recommender.create(tc.SFrame(data_dummy), 
                                            user_id=user_id, 
                                            item_id=item_id, 
                                            target='purchase_dummy', 
                                            similarity_type='cosine')

recom = final_model.recommend(users=users_to_recommend, k=n_rec)
recom.print_rows(n_display)

+------------+-------------------------+----------------------+------+
| CustomerID |       ProductName       |        score         | rank |
+------------+-------------------------+----------------------+------+
|   11000    |   Mountain Bottle Cage  | 0.10170374810695648  |  1   |
|   11000    |  Water Bottle - 30 oz.  | 0.09968318790197372  |  2   |
|   11000    |       AWC Logo Cap      | 0.08919692039489746  |  3   |
|   11000    |   Patch Kit/8 Patches   | 0.08583327382802963  |  4   |
|   11000    |     HL Mountain Tire    | 0.08068738132715225  |  5   |
|   11000    |    Mountain Tire Tube   | 0.07670731097459793  |  6   |
|   11000    |  Sport-100 Helmet, Blue | 0.07248257100582123  |  7   |
|   11000    | Sport-100 Helmet, Black | 0.07234588265419006  |  8   |
|   11000    |     Road Bottle Cage    |  0.0666889101266861  |  9   |
|   11000    | Hydration Pack - 70 oz. | 0.06438815593719482  |  10  |
|   11001    |  Sport-100 Helmet, Red  |  0.090486341714859   |  1   |
|   11

####  9.1. CSV output file
Here we convert the recommendations to a pandas DataFrame so that they can be written to a CSV file:

In [28]:
df_rec = recom.to_dataframe()
print(df_rec.shape)
df_rec.head()

(184840, 4)


Unnamed: 0,CustomerID,ProductName,score,rank
0,11000,Mountain Bottle Cage,0.101704,1
1,11000,Water Bottle - 30 oz.,0.099683,2
2,11000,AWC Logo Cap,0.089197,3
3,11000,Patch Kit/8 Patches,0.085833,4
4,11000,HL Mountain Tire,0.080687,5


#### Let’s define a function to create a desired output:

In [35]:
def create_output(model, users_to_recommend, n_rec, print_csv=True):
    # Get the top n_rec recommendations for every requested user.
    recommendation = model.recommend(users=users_to_recommend, k=n_rec)
    df_rec = recommendation.to_dataframe()
    # Join each user's recommended items into one '|'-separated string.
    df_rec['RecommendedProducts'] = df_rec.groupby([user_id])[item_id] \
        .transform(lambda x: '|'.join(x.astype(str)))
    # Keep one row per user, indexed by the user ID column.
    df_output = df_rec[[user_id, 'RecommendedProducts']].drop_duplicates() \
        .sort_values(user_id).set_index(user_id)
    if print_csv:
        df_output.to_csv('/Users/joeldias/Desktop/option1_recommendation.csv')
        print("An output file can be found in 'output' folder with name 'option1_recommendation.csv'")
    return df_output
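
The collapsing step inside `create_output` can be tried on its own with a toy frame. The sketch below is an equivalent formulation using `agg` instead of `transform` plus `drop_duplicates`. The column names match the recommendation output above, while the rows themselves are made up for illustration.

```python
import pandas as pd

# Toy recommendation rows in the same layout as the model output.
toy = pd.DataFrame({
    "CustomerID": [11000, 11000, 11001, 11001],
    "ProductName": ["Water Bottle - 30 oz.", "AWC Logo Cap",
                    "Sport-100 Helmet, Red", "HL Mountain Tire"],
    "rank": [1, 2, 1, 2],
})

# Sort by rank so the joined string keeps the ranking order,
# then join each customer's products with '|'.
collapsed = (
    toy.sort_values(["CustomerID", "rank"])
       .groupby("CustomerID")["ProductName"]
       .agg("|".join)
       .to_frame("RecommendedProducts")
)
print(collapsed)
```

The `|` separator is a sensible choice here because product names such as "Sport-100 Helmet, Red" already contain commas.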

#### Let's print the output below with print_csv set to True, so that the output is also written to a CSV file

In [36]:
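# Note: this call uses cos_dummy; pass final_model instead to use the model
# retrained on the full dataset in the previous step.
df_output = create_output(cos_dummy, users_to_recommend, n_rec, print_csv=True)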
print(df_output.shape)
df_output.head()

An output file can be found in 'output' folder with name 'option1_recommendation.csv'
(18484, 1)


Unnamed: 0_level_0,RecommendedProducts
CustomerID,Unnamed: 1_level_1
11000,Water Bottle - 30 oz.|AWC Logo Cap|Mountain Bo...
11001,"Sport-100 Helmet, Red|HL Mountain Tire|Sport-1..."
11002,Water Bottle - 30 oz.|Mountain Bottle Cage|HL ...
11003,Touring Tire Tube|Mountain Bottle Cage|Road Bo...
11004,Mountain Tire Tube|Water Bottle - 30 oz.|HL Mo...


#### 9.2. Customer recommendation function
Let’s define a function that returns the recommendation list for a given customer ID:

In [37]:
def customer_recomendation(customer_id):
    """Return the recommendation row for customer_id, or None if the ID is unknown."""
    if customer_id not in df_output.index:
        print('Customer not found.')
        return None
    return df_output.loc[customer_id]

In [38]:
customer_recomendation(11001)

RecommendedProducts    Sport-100 Helmet, Red|HL Mountain Tire|Sport-1...
Name: 11001, dtype: object
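
Because the recommendations are stored as a single `|`-separated string, a caller that needs the ranked list back has to split it. A minimal sketch follows, with a hypothetical helper name and a made-up input string (the real values are truncated in the output above):

```python
def split_recommendations(joined, sep="|"):
    """Turn a '|'-joined recommendation string back into a ranked list."""
    return [name for name in joined.split(sep) if name]

# Hypothetical value of the RecommendedProducts column for one customer.
joined = "Sport-100 Helmet, Red|HL Mountain Tire|Water Bottle - 30 oz."
ranked = split_recommendations(joined)
print(ranked[0])   # highest-ranked product
print(len(ranked))
```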

#### End of Code