Michael D'Acampora
Data612 - Project 1
Global Baseline Predictors and RMSE
Summer 2019

-----------

This recommender system recommends music genres to users, based on their ratings of eight genres on a 1-to-5 scale.





In [1]:
import numpy as np
import pandas as pd
from random import randint

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

After loading our imports we create a 10x8 dataframe with random integer scores from 1 to 5.

In [2]:
users = ['user1', 'user2', 'user3', 'user4', 'user5', 'user6', 'user7', 'user8', 'user9', 'user10']
ratings = {
    "Users": [i for i in users],
    "Rock": [randint(1,5) for i in range(10)],
    "Jazz": [randint(1,5) for i in range(10)],
    "Pop": [randint(1,5) for i in range(10)],
    "HipHop": [randint(1,5) for i in range(10)],
    "Classical": [randint(1,5) for i in range(10)],
    "Blues": [randint(1,5) for i in range(10)],
    "Country": [randint(1,5) for i in range(10)],
    "Folk": [randint(1,5) for i in range(10)],
}

The output of the dataframe is shown below.

In [3]:
df = pd.DataFrame(ratings).set_index('Users')
df

Users,Rock,Jazz,Pop,HipHop,Classical,Blues,Country,Folk
user1,1,4,4,2,1,2,5,5
user2,5,2,4,4,3,4,2,4
user3,1,5,2,2,4,5,4,2
user4,5,1,2,2,1,2,5,5
user5,1,5,3,5,1,1,4,4
user6,1,2,1,5,1,5,4,4
user7,4,4,2,2,3,5,3,3
user8,5,2,3,5,5,1,4,4
user9,5,4,1,2,5,3,2,5
user10,5,1,1,2,5,5,5,5
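
Because `randint` is unseeded, rerunning the cell above produces a different table each time. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` with an arbitrary seed (the seed 612 and the name `ratings_df` are illustrative, not from the original notebook, and the resulting numbers will differ from the table shown):

```python
import numpy as np
import pandas as pd

genres = ["Rock", "Jazz", "Pop", "HipHop", "Classical", "Blues", "Country", "Folk"]
users = [f"user{i}" for i in range(1, 11)]

rng = np.random.default_rng(612)  # arbitrary seed, chosen for illustration
# integers() excludes the upper bound by default, so high=6 yields ratings 1..5
scores = rng.integers(low=1, high=6, size=(len(users), len(genres)))

ratings_df = pd.DataFrame(scores, index=pd.Index(users, name="Users"), columns=genres)
print(ratings_df.shape)
```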


To represent a more realistic scenario, we set each rating to NaN independently with probability 0.15 and output the revised dataframe below.

In [4]:
np.random.seed(111)
mask = np.random.choice([True, False], size=df.shape, p=[0.15, 0.85])
df = df.mask(mask)    

In [5]:
df

Users,Rock,Jazz,Pop,HipHop,Classical,Blues,Country,Folk
user1,1.0,4.0,4.0,2.0,1.0,,,5.0
user2,5.0,2.0,4.0,4.0,,4.0,2.0,4.0
user3,1.0,,,2.0,4.0,5.0,4.0,2.0
user4,5.0,1.0,2.0,,1.0,,5.0,5.0
user5,1.0,5.0,3.0,,1.0,1.0,4.0,4.0
user6,1.0,2.0,1.0,5.0,1.0,5.0,,4.0
user7,4.0,4.0,2.0,2.0,3.0,5.0,,
user8,,,,5.0,5.0,1.0,4.0,
user9,5.0,4.0,1.0,2.0,5.0,3.0,2.0,5.0
user10,,1.0,1.0,2.0,5.0,5.0,5.0,5.0
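
Each cell is masked independently, so the realized share of missing values varies around 15% rather than matching it exactly; in the table above, 16 of the 80 cells ended up missing (20%). A sketch of checking that share on a stand-in frame (the frame `demo` and the seed are illustrative assumptions):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.ones((10, 8)))  # stand-in for the ratings frame

rng = np.random.default_rng(111)  # illustrative seed
# True marks a cell to blank out; each cell is drawn independently
mask = rng.random(demo.shape) < 0.15
masked = demo.mask(mask)

# Fraction of cells that are NaN after masking
missing_share = masked.isna().to_numpy().mean()
print(f"{missing_share:.1%} of cells are missing")
```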


From here the dataframe is split into training and test sets on an 80/20 basis.

In [6]:
train, test = train_test_split(df, test_size=0.2)
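
`train_test_split` shuffles the rows before splitting, so the users that land in each set change between runs unless a seed is fixed. A hedged sketch, assuming an arbitrary `random_state` of 42 and a stand-in frame (neither is from the original notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

users = [f"user{i}" for i in range(1, 11)]
demo = pd.DataFrame(np.arange(80).reshape(10, 8), index=users)

# Rows (users) are split, not individual ratings: 8 users train, 2 users test
train_a, test_a = train_test_split(demo, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(demo, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(list(test_a.index) == list(test_b.index))
```

With the seed fixed, both calls return the same users in the test set.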

The next step is to calculate the raw averages of the training and test sets. First we will work on the training set.

In [7]:
train

Users,Rock,Jazz,Pop,HipHop,Classical,Blues,Country,Folk
user1,1.0,4.0,4.0,2.0,1.0,,,5.0
user8,,,,5.0,5.0,1.0,4.0,
user5,1.0,5.0,3.0,,1.0,1.0,4.0,4.0
user2,5.0,2.0,4.0,4.0,,4.0,2.0,4.0
user4,5.0,1.0,2.0,,1.0,,5.0,5.0
user3,1.0,,,2.0,4.0,5.0,4.0,2.0
user6,1.0,2.0,1.0,5.0,1.0,5.0,,4.0
user7,4.0,4.0,2.0,2.0,3.0,5.0,,


In [8]:
# sum the rows, then sum the row sums,
# then divide the total by the non-NaN count to obtain the raw avg
train_row_sum = train.sum(axis=1)
df_train_sum = train_row_sum.sum()
train_raw_avg = df_train_sum / train.count().sum()
print(f'The raw average of the training set is: {"%.4f" % train_raw_avg}')

The raw average of the training set is: 3.1020
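
The same figure can be reached in one step by averaging only the observed cells. A sketch that rebuilds the training ratings from the table above (the names `train_demo`, `raw_avg`, and `raw_avg_check` are illustrative):

```python
import numpy as np
import pandas as pd

nan = np.nan
genres = ["Rock", "Jazz", "Pop", "HipHop", "Classical", "Blues", "Country", "Folk"]
# Training ratings copied from the table above
rows = {
    "user1": [1, 4, 4, 2, 1, nan, nan, 5],
    "user8": [nan, nan, nan, 5, 5, 1, 4, nan],
    "user5": [1, 5, 3, nan, 1, 1, 4, 4],
    "user2": [5, 2, 4, 4, nan, 4, 2, 4],
    "user4": [5, 1, 2, nan, 1, nan, 5, 5],
    "user3": [1, nan, nan, 2, 4, 5, 4, 2],
    "user6": [1, 2, 1, 5, 1, 5, nan, 4],
    "user7": [4, 4, 2, 2, 3, 5, nan, nan],
}
train_demo = pd.DataFrame.from_dict(rows, orient="index", columns=genres)

# Mean over observed cells only; np.nanmean ignores NaN entries
raw_avg = np.nanmean(train_demo.to_numpy())
# Same value via the sum / non-NaN count route used in the cell above
raw_avg_check = train_demo.sum().sum() / train_demo.count().sum()

print(f"{raw_avg:.4f}")
```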


Next we impute (replace) any NaN values in the training set with the raw average; the output is shown below.

In [9]:
train_imputed = train.fillna(train_raw_avg)
train_imputed

Users,Rock,Jazz,Pop,HipHop,Classical,Blues,Country,Folk
user1,1.0,4.0,4.0,2.0,1.0,3.102041,3.102041,5.0
user8,3.102041,3.102041,3.102041,5.0,5.0,1.0,4.0,3.102041
user5,1.0,5.0,3.0,3.102041,1.0,1.0,4.0,4.0
user2,5.0,2.0,4.0,4.0,3.102041,4.0,2.0,4.0
user4,5.0,1.0,2.0,3.102041,1.0,3.102041,5.0,5.0
user3,1.0,3.102041,3.102041,2.0,4.0,5.0,4.0,2.0
user6,1.0,2.0,1.0,5.0,1.0,5.0,3.102041,4.0
user7,4.0,4.0,2.0,2.0,3.0,5.0,3.102041,3.102041


Lastly, a dataframe of the same shape is created and filled with the raw average so we can calculate the root mean square error (RMSE).

In [10]:
train_avg = {
    "col1": [train_raw_avg for i in range(8)],
    "col2": [train_raw_avg for i in range(8)],
    "col3": [train_raw_avg for i in range(8)],
    "col4": [train_raw_avg for i in range(8)],
    "col5": [train_raw_avg for i in range(8)],
    "col6": [train_raw_avg for i in range(8)],
    "col7": [train_raw_avg for i in range(8)],
    "col8": [train_raw_avg for i in range(8)],
}

train_raw_avg_df = pd.DataFrame(train_avg)
train_raw_avg_df

,col1,col2,col3,col4,col5,col6,col7,col8
0,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
1,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
2,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
3,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
4,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
5,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
6,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
7,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041,3.102041
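
A constant-valued frame like this can also be built without a per-column dictionary, since the `pd.DataFrame` constructor accepts a scalar together with an index and columns. A sketch, with the raw average hard-coded from the output above and the name `raw_avg_frame` chosen for illustration:

```python
import pandas as pd

train_raw_avg = 3.102041  # value taken from the output above
users = ["user1", "user8", "user5", "user2", "user4", "user3", "user6", "user7"]
genres = ["Rock", "Jazz", "Pop", "HipHop", "Classical", "Blues", "Country", "Folk"]

# A scalar plus index/columns fills every cell with the same value
raw_avg_frame = pd.DataFrame(train_raw_avg, index=users, columns=genres)
print(raw_avg_frame.shape)
```

Keeping the user and genre labels makes the frame line up with the training set by name as well as by shape.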


Now that the imputed training set and the raw average dataframe have the same shape, we can calculate the RMSE below:

In [11]:
training_rmse = sqrt(mean_squared_error(train_imputed, train_raw_avg_df))
print(f'The training set RMSE is: {"%.4f" % training_rmse}')

The training set RMSE is: 1.3607
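
Here the RMSE is the square root of the mean squared difference between each cell and the raw average, taken over all 64 cells. The 15 imputed cells equal the raw average exactly, so each contributes zero error. A sketch contrasting that with an RMSE over observed ratings only, on a four-value toy example (the toy values and variable names are illustrative assumptions):

```python
import numpy as np

ratings = np.array([1.0, 5.0, np.nan, 3.0])  # toy ratings, one missing
mu = np.nanmean(ratings)                     # raw average of the observed values

# Variant 1: impute NaN with the mean, then score every cell
imputed = np.where(np.isnan(ratings), mu, ratings)
rmse_imputed = np.sqrt(np.mean((imputed - mu) ** 2))

# Variant 2: score only the cells that hold a real rating
observed = ratings[~np.isnan(ratings)]
rmse_observed = np.sqrt(np.mean((observed - mu) ** 2))

print(rmse_imputed, rmse_observed)
```

The imputed variant is the smaller of the two because the filled-in cell adds to the cell count without adding any error.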


*----------------*

Now let's go to work on the test set...

In [12]:
test

Users,Rock,Jazz,Pop,HipHop,Classical,Blues,Country,Folk
user10,,1.0,1.0,2.0,5.0,5.0,5.0,5.0
user9,5.0,4.0,1.0,2.0,5.0,3.0,2.0,5.0


In [13]:
# sum the rows, then sum the row sums, and divide by the non-NaN count
test_row_sum = test.sum(axis=1)
df_test_sum = test_row_sum.sum()
test_raw_avg = df_test_sum / test.count().sum()
print(f'The raw average for the test set is: {"%.4f" % test_raw_avg}')

The raw average for the test set is: 3.4000


As we did with the training set, we impute the raw average to replace the NaNs in the test set.

In [14]:
test_imputed = test.fillna(test_raw_avg)
test_imputed

Users,Rock,Jazz,Pop,HipHop,Classical,Blues,Country,Folk
user10,3.4,1.0,1.0,2.0,5.0,5.0,5.0,5.0
user9,5.0,4.0,1.0,2.0,5.0,3.0,2.0,5.0


Just as with the training set, we create a new dataframe full of raw averages for the test set in order to calculate the test set RMSE.

In [15]:
test_avg = {
    "col1": [test_raw_avg for i in range(2)],
    "col2": [test_raw_avg for i in range(2)],
    "col3": [test_raw_avg for i in range(2)],
    "col4": [test_raw_avg for i in range(2)],
    "col5": [test_raw_avg for i in range(2)],
    "col6": [test_raw_avg for i in range(2)],
    "col7": [test_raw_avg for i in range(2)],
    "col8": [test_raw_avg for i in range(2)],
}

test_raw_avg_df = pd.DataFrame(test_avg)
test_raw_avg_df

,col1,col2,col3,col4,col5,col6,col7,col8
0,3.4,3.4,3.4,3.4,3.4,3.4,3.4,3.4
1,3.4,3.4,3.4,3.4,3.4,3.4,3.4,3.4


Lastly, the RMSE is calculated between the imputed test set and the equally sized raw average dataframe, with the output below.

In [16]:
test_rmse = sqrt(mean_squared_error(test_imputed, test_raw_avg_df))
print(f'The test set RMSE is: {"%.4f" % test_rmse}')

The test set RMSE is: 1.6125


*----------------*
The next part is to calculate the bias for each user and each item, and calculate the baseline predictors for every user-item combination.

In [17]:
train

Users,Rock,Jazz,Pop,HipHop,Classical,Blues,Country,Folk
user1,1.0,4.0,4.0,2.0,1.0,,,5.0
user8,,,,5.0,5.0,1.0,4.0,
user5,1.0,5.0,3.0,,1.0,1.0,4.0,4.0
user2,5.0,2.0,4.0,4.0,,4.0,2.0,4.0
user4,5.0,1.0,2.0,,1.0,,5.0,5.0
user3,1.0,,,2.0,4.0,5.0,4.0,2.0
user6,1.0,2.0,1.0,5.0,1.0,5.0,,4.0
user7,4.0,4.0,2.0,2.0,3.0,5.0,,


After viewing the training set we grab the column sums and non-NaN counts to make sure our math is correct before we calculate row and column bias values.

In [18]:
print(train.sum(axis=0), train.count(axis=0))

Rock         18.0
Jazz         18.0
Pop          16.0
HipHop       20.0
Classical    16.0
Blues        21.0
Country      19.0
Folk         24.0
dtype: float64 Rock         7
Jazz         6
Pop          6
HipHop       6
Classical    7
Blues        6
Country      5
Folk         6
dtype: int64


In [19]:
train_col_bias = (train.sum(axis=0) / train.count(axis=0)) - train_raw_avg
train_col_bias

Rock        -0.530612
Jazz        -0.102041
Pop         -0.435374
HipHop       0.231293
Classical   -0.816327
Blues        0.397959
Country      0.697959
Folk         0.897959
dtype: float64

In [20]:
train_row_bias = (train.sum(axis=1) / train.count(axis=1)) - test_raw_avg
train_row_bias

Users
user1   -0.566667
user8    0.350000
user5   -0.685714
user2    0.171429
user4   -0.233333
user3   -0.400000
user6   -0.685714
user7   -0.066667
dtype: float64
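
The cell above subtracts `test_raw_avg` from the training row means. A user bias for the training set is normally measured against the training set's own raw average, so the sketch below recomputes both biases with `train_raw_avg`. Its user bias values differ from the output above, and the baseline predictors that follow in this notebook are built from the original output. The frame is rebuilt from the training table, and the names ending in `_demo` are illustrative:

```python
import numpy as np
import pandas as pd

nan = np.nan
genres = ["Rock", "Jazz", "Pop", "HipHop", "Classical", "Blues", "Country", "Folk"]
# Training ratings copied from the table above
rows = {
    "user1": [1, 4, 4, 2, 1, nan, nan, 5],
    "user8": [nan, nan, nan, 5, 5, 1, 4, nan],
    "user5": [1, 5, 3, nan, 1, 1, 4, 4],
    "user2": [5, 2, 4, 4, nan, 4, 2, 4],
    "user4": [5, 1, 2, nan, 1, nan, 5, 5],
    "user3": [1, nan, nan, 2, 4, 5, 4, 2],
    "user6": [1, 2, 1, 5, 1, 5, nan, 4],
    "user7": [4, 4, 2, 2, 3, 5, nan, nan],
}
train_demo = pd.DataFrame.from_dict(rows, orient="index", columns=genres)

train_raw_avg = np.nanmean(train_demo.to_numpy())

# DataFrame.mean skips NaN by default, so no explicit sum / count is needed
user_bias_demo = train_demo.mean(axis=1) - train_raw_avg  # one value per user
item_bias_demo = train_demo.mean(axis=0) - train_raw_avg  # one value per genre

print(user_bias_demo.round(6))
```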

The next step is to create a new matrix that gives us baseline predictors for every user-item combination and afterwards calculate the RMSE.

In [21]:
# .iloc picks each item bias by position; plain [0] on a label-indexed Series is deprecated
baseline_train = {
    "col1":[(train_col_bias.iloc[0] + i + train_raw_avg) for i in train_row_bias],
    "col2":[(train_col_bias.iloc[1] + i + train_raw_avg) for i in train_row_bias],
    "col3":[(train_col_bias.iloc[2] + i + train_raw_avg) for i in train_row_bias],
    "col4":[(train_col_bias.iloc[3] + i + train_raw_avg) for i in train_row_bias],
    "col5":[(train_col_bias.iloc[4] + i + train_raw_avg) for i in train_row_bias],
    "col6":[(train_col_bias.iloc[5] + i + train_raw_avg) for i in train_row_bias],
    "col7":[(train_col_bias.iloc[6] + i + train_raw_avg) for i in train_row_bias],
    "col8":[(train_col_bias.iloc[7] + i + train_raw_avg) for i in train_row_bias],
}

baseline_train_df = pd.DataFrame(baseline_train)
baseline_train_df

,col1,col2,col3,col4,col5,col6,col7,col8
0,2.004762,2.433333,2.1,2.766667,1.719048,2.933333,3.233333,3.433333
1,2.921429,3.35,3.016667,3.683333,2.635714,3.85,4.15,4.35
2,1.885714,2.314286,1.980952,2.647619,1.6,2.814286,3.114286,3.314286
3,2.742857,3.171429,2.838095,3.504762,2.457143,3.671429,3.971429,4.171429
4,2.338095,2.766667,2.433333,3.1,2.052381,3.266667,3.566667,3.766667
5,2.171429,2.6,2.266667,2.933333,1.885714,3.1,3.4,3.6
6,1.885714,2.314286,1.980952,2.647619,1.6,2.814286,3.114286,3.314286
7,2.504762,2.933333,2.6,3.266667,2.219048,3.433333,3.733333,3.933333


In [22]:
baseline_train_rmse = sqrt(mean_squared_error(baseline_train_df, train_raw_avg_df))
print(f'The RMSE for the baseline predictors for the training data is: {"%.4f" % baseline_train_rmse}')

The RMSE for the baseline predictors for the training data is: 0.7163
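
The cell above measures how far the baseline predictors sit from the raw average, not how far they sit from the ratings users actually gave. A sketch of the latter comparison on a small toy matrix, scoring only observed cells (the toy data and the helper `rmse_observed` are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"A": [5.0, 3.0, 4.0], "B": [3.0, 1.0, np.nan]},
    index=["u1", "u2", "u3"],
)

mu = np.nanmean(toy.to_numpy())    # raw average over observed ratings
user_bias = toy.mean(axis=1) - mu  # one value per row
item_bias = toy.mean(axis=0) - mu  # one value per column

# Baseline predictor for every user-item pair: mu + user bias + item bias
baseline = pd.DataFrame(
    mu + user_bias.to_numpy()[:, None] + item_bias.to_numpy()[None, :],
    index=toy.index,
    columns=toy.columns,
)

def rmse_observed(actual, predicted):
    """RMSE over cells where `actual` holds a rating (NaN cells are skipped)."""
    a = actual.to_numpy()
    p = np.broadcast_to(np.asarray(predicted, dtype=float), a.shape)
    seen = ~np.isnan(a)
    return float(np.sqrt(np.mean((a[seen] - p[seen]) ** 2)))

raw_rmse = rmse_observed(toy, mu)             # predict mu everywhere
baseline_rmse = rmse_observed(toy, baseline)  # predict mu + biases
print(raw_rmse, baseline_rmse)
```

On this toy matrix the baseline predictors track the observed ratings more closely than the raw average does; the same helper could be pointed at the notebook's `train` frame and its baseline predictions.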


And now we perform the same operations on the test data: calculate the row and column biases, build the baseline predictors, and find the RMSE.

In [23]:
test_col_bias = (test.sum(axis=0) / test.count(axis=0)) - test_raw_avg
test_col_bias

Rock         1.6
Jazz        -0.9
Pop         -2.4
HipHop      -1.4
Classical    1.6
Blues        0.6
Country      0.1
Folk         1.6
dtype: float64

In [24]:
test_row_bias = (test.sum(axis=1) / test.count(axis=1) - test_raw_avg)
test_row_bias

Users
user10    0.028571
user9    -0.025000
dtype: float64

In [25]:
# .iloc picks each item bias by position; plain [0] on a label-indexed Series is deprecated
baseline_test = {
    "col1":[(test_col_bias.iloc[0] + i + test_raw_avg) for i in test_row_bias],
    "col2":[(test_col_bias.iloc[1] + i + test_raw_avg) for i in test_row_bias],
    "col3":[(test_col_bias.iloc[2] + i + test_raw_avg) for i in test_row_bias],
    "col4":[(test_col_bias.iloc[3] + i + test_raw_avg) for i in test_row_bias],
    "col5":[(test_col_bias.iloc[4] + i + test_raw_avg) for i in test_row_bias],
    "col6":[(test_col_bias.iloc[5] + i + test_raw_avg) for i in test_row_bias],
    "col7":[(test_col_bias.iloc[6] + i + test_raw_avg) for i in test_row_bias],
    "col8":[(test_col_bias.iloc[7] + i + test_raw_avg) for i in test_row_bias],
}


baseline_test_df = pd.DataFrame(baseline_test)
baseline_test_df

,col1,col2,col3,col4,col5,col6,col7,col8
0,5.028571,2.528571,1.028571,2.028571,5.028571,4.028571,3.528571,5.028571
1,4.975,2.475,0.975,1.975,4.975,3.975,3.475,4.975
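
Some of the predictions above fall outside the 1-to-5 rating scale (5.028571 in the first row and 0.975 in the second). One common remedy, not used in this notebook, is to clip predictions to the valid range. A sketch with values copied from `col1` and `col3` of the table above (the names `preds` and `clipped` are illustrative):

```python
import pandas as pd

# Two columns of baseline predictions copied from the table above
preds = pd.DataFrame({"col1": [5.028571, 4.975], "col3": [1.028571, 0.975]})

# Values below 1 become 1, values above 5 become 5, the rest are unchanged
clipped = preds.clip(lower=1, upper=5)
print(clipped)
```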


In [26]:
baseline_test_rmse = sqrt(mean_squared_error(baseline_test_df, test_raw_avg_df))
print(f'The RMSE for the baseline predictors for the test data is: {"%.4f" % baseline_test_rmse}')

The RMSE for the baseline predictors for the test data is: 1.4400


Lastly, we compare the original RMSE vs. the RMSE of the baseline predictors for the training and test data.

In [27]:
print(f'Original training set RMSE: {"%.4f" % training_rmse}',
     f'\nBaseline predictors training set RMSE: {"%.4f" % baseline_train_rmse}')

Original training set RMSE: 1.3607 
Baseline predictors training set RMSE: 0.7163


In [28]:
print(f'Original test set RMSE: {"%.4f" % test_rmse}',
     f'\nBaseline predictors test set RMSE: {"%.4f" % baseline_test_rmse}')

Original test set RMSE: 1.6125 
Baseline predictors test set RMSE: 1.4400


In [29]:
pct_improve_testdata = (1 - baseline_test_rmse / test_rmse) * 100
print(f'There was a {"%.1f" % pct_improve_testdata}% improvement with baseline predictors over the original test data RMSE')

There was a 10.7% improvement with baseline predictors over the original test data RMSE
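
The same formula can be applied to the training figures for a side-by-side view. A small sketch using the rounded RMSE values printed above, so the last digit may differ slightly from a calculation on the unrounded values (the helper name `pct_improvement` is an assumption for illustration):

```python
def pct_improvement(rmse_before, rmse_after):
    """Percent reduction in RMSE relative to the starting value."""
    return (1 - rmse_after / rmse_before) * 100

# Rounded RMSE values as printed in the cells above
test_gain = pct_improvement(1.6125, 1.4400)
train_gain = pct_improvement(1.3607, 0.7163)
print(f"test: {test_gain:.1f}%  train: {train_gain:.1f}%")
```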


In summary, the RMSE dropped in both sets when baseline predictors were built by adding row and column biases to the raw average: from 1.3607 to 0.7163 on the training set and from 1.6125 to 1.4400 (a 10.7% improvement) on the test set. In theory, this should make any model we build on these predictors a bit more accurate.