# Building a baseline model

To judge whether our recommendation system performs well on the data, we need a baseline model to compare it against.

* Here, we adopt a popularity model that recommends the top-5 most purchased products to users.

- Our plan for building the popularity baseline model is as follows.
    * Load the train and validation data and preprocess them.
    * Build a train matrix of User_ID by the top-5 Product_ID.
    * Build a matrix with the same shape as the train matrix, but with all entries set to 1.
    * Compute the similarity between two 5-dimensional vectors per user:
        - one from the recommendation matrix, which is the all-ones matrix minus the train matrix
            * an entry is 1 only when the user did not buy the product in the train data and is therefore recommended it.
        - the other from the validation matrix, where an entry is 1 only when the user bought the product in the validation set
            * this acts as the answer key.
        - cf. we keep our test data excluded and unexposed, just in case!

### 1. Load the train and validation data
- These pre-split train and validation sets are exactly the same data we used to build our own recommendation model.
    * The train/validation split was made for users who bought more than 300 items.

In [1]:
import pandas as pd
import numpy as np
from numpy import dot
from numpy.linalg import norm

In [2]:
# load the train data
train = pd.read_csv("train_data.csv")
# load the validation data
val = pd.read_csv("val_data.csv")

Since we only need User_ID, Product_ID, and purchased to build the baseline model, we drop all other columns here.

In [3]:
#drop the unneeded columns of the train data
train = train.drop(['Unnamed: 0', 'index', 'countProduct'], axis=1)
#drop the unneeded columns of the validation data
val = val.drop(['Unnamed: 0', 'index', 'countProduct'], axis=1)
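
An alternative to dropping unwanted columns is to select the columns to keep, which does not break if the CSV later gains extra columns. The following is only a minimal sketch on a toy frame: the values are made up, `keep_cols` and `toy_small` are illustrative names, and the column names follow the ones used above.

```python
import pandas as pd

# Toy stand-in for train_data.csv; the values are illustrative only.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "index": [10, 11],
    "User_ID": [1000001, 1000002],
    "Product_ID": ["P00265242", "P00110742"],
    "purchased": [1, 1],
    "countProduct": [350, 420],
})

# Keep only the columns the baseline needs.
keep_cols = ["User_ID", "Product_ID", "purchased"]
toy_small = toy[keep_cols]
```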

In the introduction to this baseline model, we planned to recommend the top 5 most frequently purchased items to all users. This means we do not need any rows whose Product_ID is not in the list of top-5 Product_IDs. Before deleting those rows, we first need to know which items are the top 5.

### 2. Top-5 most frequently-purchased items & preprocessing
Here, we find the 5 items that were purchased most frequently. To do this, we reuse the code we wrote in the very first step of our data pipeline, exploratory data analysis. The only difference is that we use the top 5 here, whereas before we used the top 10.

In [4]:
# load the original data set
origin = pd.read_csv("BlackFriday.csv")
# top-5 products sold
origin["Product_ID"].value_counts(sort=True)[:5]

P00265242    1858
P00110742    1591
P00025442    1586
P00112142    1539
P00057642    1430
Name: Product_ID, dtype: int64

These 5 Product_IDs come from the original data set.
Now, it's time to delete all the rows with non-top-5 items!

In [5]:
# Product_IDs of the top-5 items found above
top5_ids = ['P00265242', 'P00110742', 'P00025442', 'P00112142', 'P00057642']

# keep only the rows for the top-5 items, grouped item by item
new_train = pd.concat([train[train.Product_ID == pid] for pid in top5_ids])
new_val = pd.concat([val[val.Product_ID == pid] for pid in top5_ids])

Now the train and validation data are both dataframes of User_ID and top-5 items with the same structure.
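
The product IDs could also be derived from the counts instead of being copied by hand. The sketch below shows one way to do that with `value_counts` and `isin` on a toy purchase log. The helper `top_k_ids` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy purchase log; IDs and counts are made up for illustration.
toy_log = pd.DataFrame({
    "User_ID":    [1, 2, 3, 1, 2, 3],
    "Product_ID": ["A", "A", "A", "B", "B", "C"],
    "purchased":  [1, 1, 1, 1, 1, 1],
})

def top_k_ids(df, k):
    """Return the k most frequent Product_ID values as a list."""
    return df["Product_ID"].value_counts().head(k).index.tolist()

top_ids = top_k_ids(toy_log, k=2)

# Keep only the rows whose product is in the top-k list.
toy_filtered = toy_log[toy_log["Product_ID"].isin(top_ids)]
```

Unlike concatenating one filter per item, `isin` keeps the rows in their original order. This should not matter for the pivot tables built in the next step, since they group by User_ID and Product_ID.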

### 3. Build a matrix
Here, we are going to transform the dataframes above into matrices of User_ID by the 5 Product_IDs with entries of 0 or 1. A 0 means the user did not purchase the item (or does not get it recommended), while a 1 means the user bought the item (or gets it recommended).
- Matrices we need & meaning
    * train matrix : shows the purchase history of users for the 5 items.
    * recommendation matrix : the all-ones matrix minus the train matrix, i.e. recommending the top-5 items to all users in the train matrix.
        - 1 means the item is recommended; 0 means it is not, since the user already bought it.
    * validation matrix : a kind of answer matrix.
- Compute the cosine similarity between each user's vector in the recommendation matrix and in the validation matrix, if the user exists in both matrices.
- That acts as the accuracy of our baseline model!

In [12]:
#building a train matrix
train_matrix = pd.pivot_table(new_train, values='purchased', index='User_ID', columns='Product_ID', fill_value=0)
#building a validation matrix
val_matrix = pd.pivot_table(new_val, values='purchased', index='User_ID', columns='Product_ID', fill_value=0)
#building a matrix with all entries 1, with the same users and products as the train matrix
ones_matrix = pd.DataFrame(1, index=train_matrix.index, columns=train_matrix.columns)

#gain a recommendation matrix: 1 where the user has not bought the item in train
recom_matrix = ones_matrix - train_matrix

Now we have a recommendation matrix!
* entry of 1 : recommend the item to the user, since the user did not purchase it in the train matrix.
* entry of 0 : do not recommend the item to the user, since the user already bought it in the train matrix.
- So, to sum up, we only recommend an item to a user if the user did not buy it before (in the train set).
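
To make the 0/1 flip concrete, here is a small sketch on toy data. The users, products, and variable names are made up, and subtracting the train matrix from the scalar 1 is used as shorthand for "all-ones matrix minus train matrix".

```python
import pandas as pd

# Toy filtered train data; users and products are made up.
toy_train = pd.DataFrame({
    "User_ID":    [1, 1, 2, 3],
    "Product_ID": ["A", "B", "A", "C"],
    "purchased":  [1, 1, 1, 1],
})

# Rows are users, columns are products, 1 marks a purchase.
toy_train_matrix = pd.pivot_table(
    toy_train, values="purchased", index="User_ID",
    columns="Product_ID", fill_value=0,
)

# Subtracting from 1 flips the entries: 1 now marks an item not yet bought.
toy_recom_matrix = 1 - toy_train_matrix
```

In this toy case user 1 bought A and B, so only C is recommended. User 3 bought only C, so A and B are recommended.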

### 4. Implement a cosine similarity function & compute similarity

In [13]:
def cos_sim(A, B):
    return dot(A,B)/(norm(A)*norm(B))
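
With 0/1 vectors, the cosine similarity takes only a few distinct values, and it is undefined when a vector is all zeros (for example, a user who already bought all five items in train). The sketch below works through a few cases. `safe_cos_sim` is a hypothetical variant, and returning 0.0 for a zero vector is an assumption rather than something the notebook does.

```python
import numpy as np

def safe_cos_sim(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

# Four items recommended, one of them bought in validation: 1 / (2 * 1)
four_recs = safe_cos_sim([1, 1, 1, 1, 0], [1, 0, 0, 0, 0])
# Three items recommended, one of them bought: 1 / sqrt(3)
three_recs = safe_cos_sim([1, 1, 1, 0, 0], [1, 0, 0, 0, 0])
# Nothing left to recommend: the guard avoids a division by zero
no_recs = safe_cos_sim([0, 0, 0, 0, 0], [1, 0, 0, 0, 0])
```

The first two cases evaluate to 0.5 and roughly 0.57735, which shows that the score drops as more items are recommended for a single validation purchase.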

In [32]:
#keep the User_ID of train matrix & top-5 items' Product_ID as a dataframe
users = pd.DataFrame(new_train.User_ID.unique())
users.columns = ['User_ID']
users_length = len(users)

top5products = pd.DataFrame(new_val.Product_ID.unique())
top5products.columns = ['Product_ID']
top5products_length = len(top5products)

#keep the User_ID of validation matrix
val_user_info = pd.DataFrame(new_val.User_ID)
val_user_info.columns = ['User_ID']

val_users_length = len(val_user_info)
similarity_matrix = np.zeros(shape=(val_users_length, 1))

In [38]:
#compute the similarity between the 5-dimensional user vectors for users who exist in both the recommendation & validation matrices
for i, user in enumerate(val_user_info['User_ID']):
    if user in recom_matrix.index:
        user_recom = np.array(recom_matrix.loc[user])
        user_val = np.array(val_matrix.loc[user])

        similarity_matrix[i] = cos_sim(user_recom, user_val)
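
The same per-user comparison can be sketched without an explicit loop by intersecting the two user indexes and computing row-wise cosine similarities. The toy matrices and all names below are made up for illustration, and this is one possible formulation rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy matrices; users, products and entries are made up.
toy_recom = pd.DataFrame(
    [[1, 1, 0], [0, 1, 1]], index=[1, 2], columns=["A", "B", "C"]
)
toy_val = pd.DataFrame(
    [[0, 0, 1], [1, 0, 0]], index=[2, 3], columns=["A", "B", "C"]
)

# Users present in both matrices, and a shared column order.
shared_users = toy_recom.index.intersection(toy_val.index)
cols = toy_recom.columns
R = toy_recom.loc[shared_users, cols].to_numpy(dtype=float)
V = toy_val.loc[shared_users, cols].to_numpy(dtype=float)

# Row-wise cosine similarity; rows with a zero norm get similarity 0.
dots = (R * V).sum(axis=1)
norms = np.linalg.norm(R, axis=1) * np.linalg.norm(V, axis=1)
sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
toy_similarity = pd.Series(sims, index=shared_users)
```

Note the design difference: this yields one score per shared user and omits users missing from either matrix, whereas the loop above fills one slot per validation row and leaves unmatched rows at 0.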

In [42]:
pd.DataFrame(similarity_matrix).describe()

               0
count  18.000000
mean    0.585168
std     0.122014
min     0.500000
25%     0.500000
50%     0.577350
75%     0.577350
max     1.000000


- This serves as the baseline model's accuracy, to be compared with the accuracy of our new recommendation model.
- The similarity ranges from a minimum of 0.5 to a maximum of 1, the highest value obtainable.
- Judging by these similarity scores, this already looks like a fairly strong baseline. However, the recommendation and validation matrices differ in size and the baseline only covers the top-5 items, so it could fall behind the model we build in the next step, or, with far more data, even do better than it.
- The comparison between this baseline and the new recommendation model will come after building the new one!