# Building a baseline model

To judge whether our recommendation system performs well, we need a baseline model to compare it against.

* Here, we adopt a popularity model that recommends the top-5 most purchased products to every user.

- Our plan for building the popularity baseline model is as follows.
    * 1. Build a train matrix with the top-5 Product_ID values as rows and the User_ID values as columns.
    * 2. Build a recommendation matrix with the same shape as the step 1 matrix, but with every entry set to 1.
    * 3. Compute the similarity between the two matrices below.
        - The recommendation result matrix: the step 2 matrix minus the step 1 matrix.
            * An entry is 1 only when the user did not buy the product in the train data but is now recommended it.
        - The actual result matrix, whose entries are 1 only when the user bought the product in the validation set but not in the train set.
            * This acts as the answer key.
        - cf. We set the test data aside and do not touch it here.
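The three steps above can be sketched on toy data. This is only an illustration: the IDs are made up, the helper `purchase_matrix` is hypothetical, and because the plan does not name a similarity measure, cosine similarity over the flattened matrices is used here as an assumption.

```python
import numpy as np
import pandas as pd

# Toy purchase logs (made-up IDs, for illustration only).
train = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "Product_ID": ["A", "B", "A", "C"],
})
val = pd.DataFrame({
    "User_ID": [1, 2, 2, 3],
    "Product_ID": ["C", "B", "C", "A"],
})
top_items = ["A", "B", "C"]  # stands in for the top-5 list
users = sorted(set(train["User_ID"]) | set(val["User_ID"]))


def purchase_matrix(df, items, users):
    """Binary matrix: rows are products, columns are users, 1 = purchased."""
    df = df[df["Product_ID"].isin(items)]
    counts = pd.crosstab(df["Product_ID"], df["User_ID"])
    counts = counts.reindex(index=items, columns=users, fill_value=0)
    return (counts > 0).astype(int)


# Step 1: what each user already bought in the train data.
train_matrix = purchase_matrix(train, top_items, users)

# Step 2: the popularity model recommends every top item to every user.
all_ones = pd.DataFrame(1, index=top_items, columns=users)

# Step 3a: keep only recommendations for items the user has not bought yet.
recommended = all_ones - train_matrix

# Step 3b: the answer matrix: bought in validation but not in train.
val_matrix = purchase_matrix(val, top_items, users)
answer = ((val_matrix == 1) & (train_matrix == 0)).astype(int)

# Similarity (assumed measure): cosine similarity of the flattened matrices.
r = recommended.to_numpy().ravel()
a = answer.to_numpy().ravel()
cosine = float(r @ a / (np.linalg.norm(r) * np.linalg.norm(a)))
print(round(cosine, 4))
```

Other measures, such as precision of the recommended entries against the answer matrix, would fit the same two matrices equally well.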

#### 1. Load the train and validation data
- These pre-split train and validation sets are exactly the same data we use to build our own recommendation model.
    * The split covers only users who bought more than 100 items; each such user's purchases are divided between train and validation.
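The split itself is done elsewhere and is not shown here. As a rough sketch of what such a per-user split could look like, the hypothetical function below keeps users above a purchase threshold and holds out a random share of each user's rows. The function name, the `val_frac` share, the random holdout, and the seed are all assumptions, not the procedure actually used.

```python
import pandas as pd


def split_heavy_users(df, min_items=100, val_frac=0.2, seed=0):
    """Keep users with more than `min_items` purchases, then hold out a
    random `val_frac` of each user's rows as validation (assumed scheme)."""
    # Number of purchase rows per user, aligned to every row of df.
    counts = df.groupby("User_ID")["Product_ID"].transform("size")
    heavy = df[counts > min_items]
    # Sample the validation rows within each user's group.
    val = heavy.groupby("User_ID").sample(frac=val_frac, random_state=seed)
    train = heavy.drop(val.index)
    return train, val


# Toy example with a low threshold so the effect is visible.
demo = pd.DataFrame({
    "User_ID": [1] * 5 + [2] * 2,
    "Product_ID": [f"P{i}" for i in range(7)],
})
train_demo, val_demo = split_heavy_users(demo, min_items=2, val_frac=0.4)
```

In the toy example, user 2 is dropped because they have only two purchases, and user 1's five rows are divided between the two sets.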

In [4]:
import pandas as pd
# load the train data
train = pd.read_csv("./desktop/train_data.csv")

In [5]:
train

Unnamed: 0.1,Unnamed: 0,index,User_ID,Product_ID,countProduct,purchased
0,20,11,1000005,P00059442,106,1
1,21,20,1000005,P00249642,106,1
2,22,45,1000005,P00151442,106,1
3,23,32,1000005,P00003642,106,1
4,24,91,1000005,P00213142,106,1
5,25,58,1000005,P00001642,106,1
6,26,78,1000005,P00304342,106,1
7,27,79,1000005,P00197942,106,1
8,28,35,1000005,P00140742,106,1
9,29,26,1000005,P00110342,106,1


In [6]:
# load the validation data
val = pd.read_csv("./desktop/val_data.csv")

In [7]:
val

Unnamed: 0.1,Unnamed: 0,index,User_ID,Product_ID,countProduct,purchased
0,10,70,1000005,P00304442,106,1
1,11,36,1000005,P00294842,106,1
2,12,100,1000005,P00189042,106,1
3,13,50,1000005,P00183442,106,1
4,14,69,1000005,P00240142,106,1
5,15,55,1000005,P00289842,106,1
6,16,87,1000005,P00255842,106,1
7,17,2,1000005,P00286542,106,1
8,18,52,1000005,P00093242,106,1
9,19,76,1000005,P00034842,106,1


Since we only need User_ID and Product_ID to build a baseline model, we'll drop all other columns here.

In [12]:
# drop the columns we don't need from the train data
train = train.drop(['Unnamed: 0', 'index', 'countProduct', 'purchased'], axis=1)
train

Unnamed: 0,User_ID,Product_ID
0,1000005,P00059442
1,1000005,P00249642
2,1000005,P00151442
3,1000005,P00003642
4,1000005,P00213142
5,1000005,P00001642
6,1000005,P00304342
7,1000005,P00197942
8,1000005,P00140742
9,1000005,P00110342


In [13]:
# drop the columns we don't need from the validation data
val = val.drop(['Unnamed: 0', 'index', 'countProduct', 'purchased'], axis=1)
val

Unnamed: 0,User_ID,Product_ID
0,1000005,P00304442
1,1000005,P00294842
2,1000005,P00189042
3,1000005,P00183442
4,1000005,P00240142
5,1000005,P00289842
6,1000005,P00255842
7,1000005,P00286542
8,1000005,P00093242
9,1000005,P00034842


In the introduction to this baseline model, we planned to recommend the top-5 most frequently purchased items to all users. This means we don't need any rows whose Product_ID is outside the top-5 list. Before deleting those rows, we first need to find out which items are the top 5.

#### 2. Top-5 most frequently-purchased items
Here, we find the 5 items that were purchased most frequently. To do this, we reuse the code we wrote in the very first step of our data pipeline, exploratory data analysis. The only difference is that we take the top 5 here, whereas before we took the top 10.

In [15]:
# load the original data set
origin = pd.read_csv("./desktop/BlackFriday.csv")

In [18]:
# top-5 products sold
origin["Product_ID"].value_counts(sort=True)[:5]

P00265242    1858
P00110742    1591
P00025442    1586
P00112142    1539
P00057642    1430
Name: Product_ID, dtype: int64
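To reuse these IDs in later steps without typing them by hand, the index of the `value_counts` result can be turned into a list. A minimal sketch on made-up rows; `origin_demo`, `top_n`, and `top_products` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the original data set (made-up product IDs).
origin_demo = pd.DataFrame({
    "Product_ID": ["A", "A", "A", "B", "B", "C", "D", "D", "D", "D"],
})

top_n = 2  # the notebook uses 5

# value_counts sorts by frequency (descending), so the first
# top_n index labels are the most frequently purchased products.
top_products = origin_demo["Product_ID"].value_counts().head(top_n).index.tolist()
print(top_products)
```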

Now, it's time to delete all the rows with non-top-5 items!

In [31]:
# keep only the rows for each top-5 product, one DataFrame per product
new_train_1 = train[(train.Product_ID == 'P00265242')]
new_train_2 = train[(train.Product_ID == 'P00110742')]
new_train_3 = train[(train.Product_ID == 'P00025442')]
new_train_4 = train[(train.Product_ID == 'P00112142')]
new_train_5 = train[(train.Product_ID == 'P00057642')]
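An equivalent way to do the same filtering in one pass is `Series.isin`, which keeps every row whose Product_ID appears in the top-5 list. The sketch below uses a small stand-in frame (`train_demo`) with a few rows copied from the outputs in this notebook; `top5`, `top5_train`, and `per_product` are illustrative names.

```python
import pandas as pd

# The five product IDs found in the previous step.
top5 = ["P00265242", "P00110742", "P00025442", "P00112142", "P00057642"]

# Small stand-in with the same columns as `train`.
train_demo = pd.DataFrame({
    "User_ID": [1000005, 1000005, 1000018, 1000018],
    "Product_ID": ["P00265242", "P00059442", "P00110742", "P00265242"],
})

# Keep only rows whose product is in the top-5 list.
top5_train = train_demo[train_demo["Product_ID"].isin(top5)]

# If one frame per product is still wanted, group the filtered rows.
per_product = {pid: grp for pid, grp in top5_train.groupby("Product_ID")}
```

This keeps the filter tied to a single list, so switching to a different N only requires rebuilding `top5`.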

In [32]:
new_train_1

Unnamed: 0,User_ID,Product_ID
21,1000005,P00265242
568,1000018,P00265242
698,1000019,P00265242
893,1000022,P00265242
1992,1000045,P00265242
2079,1000048,P00265242
3077,1000059,P00265242
3607,1000090,P00265242
4259,1000118,P00265242
4541,1000123,P00265242


In [33]:
new_train_2

Unnamed: 0,User_ID,Product_ID
610,1000018,P00110742
887,1000022,P00110742
916,1000023,P00110742
1081,1000026,P00110742
2111,1000048,P00110742
2452,1000053,P00110742
2977,1000058,P00110742
3252,1000062,P00110742
3503,1000075,P00110742
3884,1000093,P00110742
