# Building a baseline model

To judge whether our recommendation system performs well, we need a baseline model to compare it against.

* Here, we adopt a popularity model that recommends the top-5 most purchased products to users.

- Our plan for building the popularity baseline model is as follows.
    * Load the train and validation data we split earlier and preprocess them.
    * Build a train matrix of User_ID by the top-5 Product_ID.
    * Build a matrix with the same shape as the train matrix from step 2, but with every entry set to 1.
    * Compute the similarity between two 5-dimensional vectors per user:
        - one from the recommendation matrix, which is the step 3 matrix minus the step 2 matrix
            * an entry is 1 only when the user did not buy the product in the train data and therefore gets it recommended.
        - the other from the validation matrix, where an entry is 1 only when the user bought the product in the validation set
            * this acts as the answer key.
        - cf. we exclude the test data and keep it unexposed, just in case!
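
The plan above can be written compactly. As a sketch, using notation introduced here purely for illustration (it does not appear in the original notebook): let $T$ be the 0/1 train matrix over users and the top-5 items, $J$ the all-ones matrix of the same shape, and $V$ the 0/1 validation matrix. For a user $u$ present in both matrices, with row vectors $R_u$ and $V_u$:

$$
R = J - T, \qquad \mathrm{sim}(u) = \frac{R_u \cdot V_u}{\lVert R_u \rVert \, \lVert V_u \rVert}
$$

Because every entry is 0 or 1, $R_u \cdot V_u$ simply counts the recommended items that the user went on to buy in the validation set.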

### 1. Load the train and validation data
- These pre-split train and validation sets are exactly the same data we used to build our own recommendation model.
    * The train and validation split covers users who bought more than 300 items.

In [125]:
import pandas as pd
# load the train data
train = pd.read_csv("train_data.csv")

In [126]:
train.head()

,Unnamed: 0,index,User_ID,Product_ID,countProduct,purchased
0,20,151,1000048,P00078742,337,1
1,21,187,1000048,P00058042,337,1
2,22,106,1000048,P00259342,337,1
3,23,160,1000048,P00052642,337,1
4,24,173,1000048,P00212042,337,1


In [127]:
# load the validation data
val = pd.read_csv("val_data.csv")

In [128]:
val.head()

,Unnamed: 0,index,User_ID,Product_ID,countProduct,purchased
0,10,80,1000048,P00103042,337,1
1,11,181,1000048,P00265442,337,1
2,12,212,1000048,P00223142,337,1
3,13,61,1000048,P00147942,337,1
4,14,104,1000048,P00265742,337,1


Since we only need User_ID, Product_ID, and purchased to build the baseline model, we drop all other columns here.

In [129]:
# drop the columns we don't need from the train data
train = train.drop(['Unnamed: 0', 'index', 'countProduct'], axis=1)
train.head()

,User_ID,Product_ID,purchased
0,1000048,P00078742,1
1,1000048,P00058042,1
2,1000048,P00259342,1
3,1000048,P00052642,1
4,1000048,P00212042,1


In [130]:
# drop the columns we don't need from the validation data
val = val.drop(['Unnamed: 0', 'index', 'countProduct'], axis=1)
val.head()

,User_ID,Product_ID,purchased
0,1000048,P00103042,1
1,1000048,P00265442,1
2,1000048,P00223142,1
3,1000048,P00147942,1
4,1000048,P00265742,1
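
The column selection above could also happen while the file is being read. The following is a minimal sketch, assuming a small in-memory CSV (hypothetical, only mimicking the column layout of `train_data.csv`); the names `csv_text`, `needed`, and `toy_train` are illustrative.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics the columns of train_data.csv.
# The empty first header is read by pandas as "Unnamed: 0".
csv_text = """,index,User_ID,Product_ID,countProduct,purchased
20,151,1000048,P00078742,337,1
21,187,1000048,P00058042,337,1
"""

needed = ["User_ID", "Product_ID", "purchased"]

# usecols keeps only the listed columns while parsing,
# so no separate drop() call is needed afterwards.
toy_train = pd.read_csv(io.StringIO(csv_text), usecols=needed)

print(toy_train.columns.tolist())
```

Selecting by the columns we want, rather than by the ones we don't, also keeps working if the saved file later gains extra bookkeeping columns.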


In the introduction to this baseline model, we planned to recommend the top 5 most frequently purchased items to all users. This means we don't need any rows whose Product_ID is not among the top-5 Product_IDs. Before deleting those rows, we first need to find out which items are the top 5.

### 2. Top-5 most frequently-purchased items & preprocessing
Here, we find the 5 items that were purchased most frequently. To do this, we reuse the code we wrote in the very first step of our data pipeline, exploratory data analysis. The only difference is that we take the top 5 here, whereas before we took the top 10.

In [131]:
# load the original data set
origin = pd.read_csv("BlackFriday.csv")

In [132]:
# top-5 products sold
origin["Product_ID"].value_counts(sort=True)[:5]

P00265242    1858
P00110742    1591
P00025442    1586
P00112142    1539
P00057642    1430
Name: Product_ID, dtype: int64

These 5 Product_IDs come from the original data set.
Now it's time to delete all the rows with non-top-5 items!

In [133]:
new_train_1 = train[(train.Product_ID == 'P00265242')]
new_train_2 = train[(train.Product_ID == 'P00110742')]
new_train_3 = train[(train.Product_ID == 'P00025442')]
new_train_4 = train[(train.Product_ID == 'P00112142')]
new_train_5 = train[(train.Product_ID == 'P00057642')]

In [134]:
new_train_1.head()

,User_ID,Product_ID,purchased
27,1000048,P00265242,1
965,1000123,P00265242,1
1831,1000169,P00265242,1
2370,1000195,P00265242,1
2952,1000202,P00265242,1


In [135]:
new_train_2.head()

,User_ID,Product_ID,purchased
49,1000048,P00110742,1
606,1000053,P00110742,1
718,1000123,P00110742,1
1563,1000149,P00110742,1
4297,1000308,P00110742,1


In [136]:
new_train = pd.concat([new_train_1, new_train_2, new_train_3, new_train_4, new_train_5])

In [137]:
new_train.head()

,User_ID,Product_ID,purchased
27,1000048,P00265242,1
965,1000123,P00265242,1
1831,1000169,P00265242,1
2370,1000195,P00265242,1
2952,1000202,P00265242,1


In [138]:
new_val_1 = val[(val.Product_ID == 'P00265242')]
new_val_2 = val[(val.Product_ID == 'P00110742')]
new_val_3 = val[(val.Product_ID == 'P00025442')]
new_val_4 = val[(val.Product_ID == 'P00112142')]
new_val_5 = val[(val.Product_ID == 'P00057642')]

In [139]:
new_val = pd.concat([new_val_1, new_val_2, new_val_3, new_val_4, new_val_5])

In [140]:
new_val.head()

,User_ID,Product_ID,purchased
611,1001242,P00265242,1
682,1001303,P00265242,1
1406,1002453,P00265242,1
1838,1003476,P00265242,1
1877,1003519,P00265242,1
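
The per-product filters followed by `pd.concat` can be collapsed into a single step. The sketch below runs on hypothetical toy data with `k = 2` instead of 5; the names `toy`, `top_k`, and `toy_top` are made up for illustration. It also derives the top-k IDs from `value_counts()` instead of typing them in by hand.

```python
import pandas as pd

# Hypothetical purchase log; the notebook would use `origin`, `train`, and `val`.
toy = pd.DataFrame({
    "User_ID":    [1, 1, 2, 2, 3, 3, 3],
    "Product_ID": ["A", "B", "A", "C", "A", "B", "D"],
    "purchased":  [1, 1, 1, 1, 1, 1, 1],
})

k = 2  # the notebook uses 5

# value_counts() sorts by frequency in descending order,
# so head(k) yields the k most purchased Product_IDs.
top_k = toy["Product_ID"].value_counts().head(k).index.tolist()

# One boolean mask replaces the per-product filters and the concat.
toy_top = toy[toy["Product_ID"].isin(top_k)]

print(top_k)
print(toy_top)
```

Unlike the concat version, the filtered rows keep their original order instead of being grouped by product. That should not matter for the pivot in the next step, which arranges rows by User_ID anyway.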


Now we have dataframes of the same form, with User_ID and the top-5 items, for both the train and validation data.

### 3. Build a matrix
Here, we transform the dataframes above into matrices of User_ID by the 5 Product_IDs with entries of 0 or 1. 0 means the user didn't purchase the item or doesn't get it recommended; 1 means the user bought the item or gets it recommended.
- Matrices we need & their meaning
    * train matrix : shows the purchase history of users for the 5 items.
    * recommendation matrix : the all-1 matrix minus the train matrix
        - recommends the top-5 items to all users in the train matrix
        - 1 means the item is recommended; 0 means it is not, since the user already bought it.
    * validation matrix : acts as an answer matrix.
- Compute the cosine similarity between each user's vector in the recommendation matrix and in the validation matrix, if the user exists in both matrices.
- That similarity acts as the accuracy of our baseline model!

In [141]:
#building a train matrix
train = pd.pivot_table(new_train, values='purchased', index='User_ID', columns='Product_ID', fill_value=0)

In [142]:
train.head()

Product_ID,P00025442,P00057642,P00110742,P00112142,P00265242
User_ID,,,,,
1000048,0,0,1,1,1
1000053,0,1,1,1,0
1000123,0,1,1,0,1
1000148,0,1,0,1,0
1000149,0,0,1,1,0


In [143]:
#building a validation matrix
val = pd.pivot_table(new_val, values='purchased', index='User_ID', columns='Product_ID', fill_value=0)

In [144]:
val.head()

Product_ID,P00025442,P00057642,P00110742,P00112142,P00265242
User_ID,,,,,
1000329,0,1,0,0,0
1001068,0,0,0,1,0
1001224,0,0,1,0,0
1001242,0,0,0,0,1
1001266,0,0,1,0,0


In [145]:
#building a matrix with all entries 1
recom = pd.pivot_table(new_train, index='User_ID', columns='Product_ID', fill_value=1)

In [146]:
recom.head()

,purchased,purchased,purchased,purchased,purchased
Product_ID,P00025442,P00057642,P00110742,P00112142,P00265242
User_ID,,,,,
1000048,1,1,1,1,1
1000053,1,1,1,1,1
1000123,1,1,1,1,1
1000148,1,1,1,1,1
1000149,1,1,1,1,1


In [147]:
# derive the recommendation matrix: all-ones matrix minus the train matrix
recom = recom - train

In [148]:
recom.head()

,purchased,purchased,purchased,purchased,purchased
Product_ID,P00025442,P00057642,P00110742,P00112142,P00265242
User_ID,,,,,
1000048,1,1,0,0,0
1000053,1,0,0,0,1
1000123,1,0,0,1,0
1000148,1,0,1,0,1
1000149,1,1,0,0,1


Here, we have the recommendation matrix!
* entry of 1 : recommend the item to the user, since the user didn't purchase it in the train matrix.
* entry of 0 : do not recommend the item to the user, since the user already bought it in the train matrix.
- To sum up, we only recommend an item to a user if the user didn't buy it before (in the train set).
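
Since the train matrix only holds 0s and 1s, the same flip can be sketched without building a separate all-ones matrix. The example below rebuilds the first three rows of the train matrix shown earlier; `train_toy` and `recom_toy` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# First three rows of the train matrix shown in the output above.
train_toy = pd.DataFrame(
    [[0, 0, 1, 1, 1],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 0, 1]],
    index=pd.Index([1000048, 1000053, 1000123], name="User_ID"),
    columns=pd.Index(
        ["P00025442", "P00057642", "P00110742", "P00112142", "P00265242"],
        name="Product_ID",
    ),
)

# Subtracting a 0/1 matrix from 1 flips every entry:
# bought (1) becomes "do not recommend" (0), not bought (0) becomes "recommend" (1).
recom_toy = 1 - train_toy

print(recom_toy.loc[1000048].tolist())
```

The row for user 1000048 comes out as 1, 1, 0, 0, 0, matching the recommendation matrix above. A side effect of this variant is that the columns stay single-level, without the extra `purchased` level.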

In [162]:
#keep the User_IDs of the train matrix & the top-5 items' Product_IDs as dataframes
users = pd.DataFrame(new_train.User_ID.unique())
users.columns = ['User_ID']
users_length = len(users)

top5products = pd.DataFrame(new_val.Product_ID.unique())
top5products.columns = ['Product_ID']
top5products_length = len(top5products)

### 4. Implement a cosine similarity function & compute similarity

In [163]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
    return dot(A,B)/(norm(A)*norm(B))
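
One edge case is worth keeping in mind: a user who already bought all five items in the train set would have an all-zero recommendation vector, and dividing by a zero norm gives `nan`. A minimal sketch of a guarded variant follows. The name `cos_sim_safe` is hypothetical, and returning 0.0 for that case is an assumption made here, not something the notebook specifies.

```python
import numpy as np

def cos_sim_safe(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        # Assumption: treat "nothing to compare" as zero similarity.
        return 0.0
    return float(np.dot(a, b) / denom)

# Hypothetical vectors: three items recommended, one of them bought in validation.
recommended = [1, 1, 0, 0, 1]
bought = [0, 1, 0, 0, 0]
print(cos_sim_safe(recommended, bought))      # 1 / sqrt(3), about 0.577

# All-zero recommendation vector: the guard avoids a division by zero.
print(cos_sim_safe([0, 0, 0, 0, 0], bought))  # 0.0
```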

In [164]:
recom.loc[1000048] #just for checking

           Product_ID
purchased  P00025442     1
           P00057642     1
           P00110742     0
           P00112142     0
           P00265242     0
Name: 1000048, dtype: int64

In [165]:
type(val.index)

pandas.core.indexes.numeric.Int64Index

In [166]:
#keep the User_ID of validation matrix
val_user_info = pd.DataFrame(val.index)
val_user_info.columns = ['User_ID'] 

In [167]:
val_user_info

,User_ID
0,1000329
1,1001068
2,1001224
3,1001242
4,1001266
5,1001303
6,1001599
7,1001605
8,1001889
9,1002092


In [168]:
val_users_length = len(val_user_info)
similarity_matrix = np.zeros(shape=(val_users_length, 1))

# compute the similarity between the 5-dimensional user vectors
# for each user who exists in both the recommendation & validation matrix
for i, user in enumerate(val_user_info['User_ID']):
    # a membership test on the index replaces the inner loop over recom.index
    if user in recom.index:
        user_recom = np.array(recom.loc[user])
        user_val = np.array(val.loc[user])

        similarity_matrix[i] = cos_sim(user_recom, user_val)

        print(i, user)

0 1000329
2 1001224
3 1001242
4 1001266
5 1001303
6 1001599
7 1001605
8 1001889
9 1002092
10 1002109
11 1002453
12 1003391
13 1003476
14 1003507
15 1003519
16 1003648
17 1003683
18 1003693
19 1004028
20 1005111
21 1005306
22 1005312


We had 23 users in the validation matrix and a lot more users in the recommendation matrix, so at most 23 users could get a similarity score between their recommendation and validation vectors. However, we noticed that the user at index 1 (User_ID 1001068) was never matched. Our interpretation is as follows. The user exists in the validation matrix because they bought at least one of the top-5 items in the validation set. However, the user purchased none of the top-5 items in the train set, so all of their rows were deleted when we filtered the train data, and they never entered the train or recommendation matrix. That's why one validation user got no recommendation and no similarity result!
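
The matched and unmatched users can also be listed directly with index set operations. The sketch below uses the first five validation users from the output above and a small, partly made-up subset of recommendation-matrix users; `val_users`, `recom_users`, `shared`, and `missing` are illustrative names.

```python
import pandas as pd

# First five validation users from the output above.
val_users = pd.Index(
    [1000329, 1001068, 1001224, 1001242, 1001266], name="User_ID"
)
# Hypothetical subset of recommendation-matrix users, chosen to be
# consistent with the matches printed by the loop above.
recom_users = pd.Index(
    [1000048, 1000329, 1001224, 1001242, 1001266], name="User_ID"
)

shared = val_users.intersection(recom_users)  # users that can be scored
missing = val_users.difference(recom_users)   # validation users with no train rows

print(shared.tolist())
print(missing.tolist())
```

In the notebook, the same calls on `val.index` and `recom.index` would show which validation users have no counterpart without scanning the printed loop output.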

In [169]:
similarity_matrix

array([[0.57735027],
       [0.        ],
       [0.5       ],
       [0.5       ],
       [0.57735027],
       [0.57735027],
       [1.        ],
       [0.5       ],
       [0.57735027],
       [0.5       ],
       [0.70710678],
       [0.57735027],
       [0.57735027],
       [0.70710678],
       [0.57735027],
       [0.5       ],
       [0.5       ],
       [0.70710678],
       [0.70710678],
       [0.5       ],
       [0.5       ],
       [0.57735027],
       [0.57735027]])

- These values act as the baseline model's accuracy, to be compared with the accuracy of our new recommendation model. The second row is 0 because, as mentioned just before, we sized the similarity matrix by the number of users in the validation matrix, but one validation user didn't exist in the recommendation matrix, so that row was never filled in.
- The similarity values cover a range. Among the 22 results, 1 reached 1, the highest value obtainable, and 13 were above 0.5 (9 at about 0.577 and 4 at about 0.707). All the remaining 8 values were 0.5.
- We think this is already a fairly strong baseline, judging by the similarity matrix that serves as its accuracy. However, since the recommendation and validation matrices differ in size and the baseline only covers the top-5 items, it could fall behind the model we build in the next step, or it could even do better than it with far more data.
- The comparison between this baseline model and the new recommendation model will come after we build the new one!
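
For the later comparison, it may help to condense the per-user similarities into a few summary numbers. The sketch below copies the values from the `similarity_matrix` output above. Taking the mean over matched users is only one possible summary and an assumption made here, not a metric the notebook itself defines.

```python
import numpy as np

# Similarity values copied from the similarity_matrix output above.
sims = np.array([
    0.57735027, 0.0, 0.5, 0.5, 0.57735027, 0.57735027, 1.0, 0.5,
    0.57735027, 0.5, 0.70710678, 0.57735027, 0.57735027, 0.70710678,
    0.57735027, 0.5, 0.5, 0.70710678, 0.70710678, 0.5, 0.5,
    0.57735027, 0.57735027,
])

# Row 1 belongs to the validation user without a recommendation vector,
# so it is excluded rather than counted as a genuine zero score.
matched = np.delete(sims, 1)

# Count how often each distinct similarity value occurs.
values, counts = np.unique(np.round(matched, 3), return_counts=True)
for value, count in zip(values, counts):
    print(value, count)

print("mean over matched users:", matched.mean())
```

Excluding the unmatched row keeps a placeholder zero from dragging the summary down. Whether that user should instead count as a miss is a design choice to settle before comparing the two models.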