# DATA 612 Project 1 - Recommender System Validation

By Mike Silva

## Introduction

This is a validation of the calculations used in project 1.  For this validation we will be using the same dataset featured in the [Network20Q's YouTube videos](https://www.youtube.com/watch?v=q97VFt56vRs&list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9) explaining the mathematics behind recommendation systems.

## Training and Test Sets

We will create the training and test sets to match what was used in the videos.

In [1]:
import pandas as pd
import data612

train_df = pd.DataFrame([
    [5, None, 4, None, 4],
    [4, 3, 5, None, 4],
    [None, 2, None, None, 3],
    [2, None, 3, 1, 2],
    [4, None, None, 4, 5],
    [4, 2, 5, 4, None]
], columns=["I", "II", "III", "IV", "V"], index=["A", "B", "C", "D", "E", "F"])
train_df

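|   | I | II | III | IV | V |
|---|---|---|---|---|---|
| A | 5.0 | NaN | 4.0 | NaN | 4.0 |
| B | 4.0 | 3.0 | 5.0 | NaN | 4.0 |
| C | NaN | 2.0 | NaN | NaN | 3.0 |
| D | 2.0 | NaN | 3.0 | 1.0 | 2.0 |
| E | 4.0 | NaN | NaN | 4.0 | 5.0 |
| F | 4.0 | 2.0 | 5.0 | 4.0 | NaN |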


In [2]:
test_df = pd.DataFrame([
    [None, None, None, None, None],
    [None, None, None, 3, None],
    [4, None, None, None, None],
    [None, 2, None, None, None],
    [None, None, 5, None, None],
    [None, None, None, None, 4]
], columns=["I", "II", "III", "IV", "V"], index=["A", "B", "C", "D", "E", "F"])
test_df

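|   | I | II | III | IV | V |
|---|---|---|---|---|---|
| A | NaN | NaN | NaN | NaN | NaN |
| B | NaN | NaN | NaN | 3.0 | NaN |
| C | 4.0 | NaN | NaN | NaN | NaN |
| D | NaN | 2.0 | NaN | NaN | NaN |
| E | NaN | NaN | 5.0 | NaN | NaN |
| F | NaN | NaN | NaN | NaN | 4.0 |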


## Calculate the Average Rating

Now that we have a training set, we need to calculate the raw average (mean) of all observed ratings, which serves as the initial prediction for every user-item combination.  According to the video it should be 3.5.

In [3]:
raw_avg = train_df.sum(numeric_only=True).sum() / train_df.count().sum(axis = 0)
raw_avg

3.5
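
As a cross-check that does not depend on pandas, the same average can be recomputed in plain Python.  This is only a minimal sketch: `train_rows` and `observed` are names introduced here for illustration, and `None` stands in for a missing rating.

```python
# Training ratings copied from the notebook; None marks a missing rating.
# train_rows and observed are illustrative names, not part of data612.
train_rows = {
    "A": [5, None, 4, None, 4],
    "B": [4, 3, 5, None, 4],
    "C": [None, 2, None, None, 3],
    "D": [2, None, 3, 1, 2],
    "E": [4, None, None, 4, 5],
    "F": [4, 2, 5, 4, None],
}

# Keep only the ratings that were actually given.
observed = [r for row in train_rows.values() for r in row if r is not None]

# Raw average = sum of observed ratings / number of observed ratings.
raw_avg = sum(observed) / len(observed)
print(len(observed), sum(observed), raw_avg)
```

The training set holds 20 observed ratings that sum to 70, which gives the 3.5 quoted in the video.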

## Validate the RMSE Calculations 

Now I will calculate the RMSE for the test data, using the raw average as the prediction for every rating.  The [video](https://www.youtube.com/watch?v=prVRuPezW3Q&list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9&index=12) arrives at an RMSE of 1.0247.  Here's the derivation:

In [4]:
(((4 - 3.5)**2 + (2 - 3.5)**2 + (5 - 3.5)**2 + (3 - 3.5)**2 + (4 - 3.5)**2)/5)**(1/2)

1.02469507659596

This matches the video.  Let's validate the get_RMSE() function:

In [5]:
test_df_RMSE = data612.get_RMSE(test_df, raw_avg)
test_df_RMSE

1.02469507659596

This checks out so everything is awesome!
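
The source of data612.get_RMSE() is not shown in this notebook, so the following is only a sketch of what such a function presumably does: take the square root of the mean squared error over the observed cells.  The names `rmse` and `test_rows` are hypothetical, and the ability to pass either a single number or a full table of predictions is an assumption based on how get_RMSE() is called here.

```python
import math

def rmse(actual, predicted):
    """RMSE over the observed (non-None) cells of `actual`.

    `predicted` is either one number used for every cell or a dict
    of rows shaped like `actual`. Hypothetical stand-in for get_RMSE().
    """
    squared_errors = []
    for user, row in actual.items():
        for col, rating in enumerate(row):
            if rating is None:
                continue  # skip cells without a rating
            guess = predicted[user][col] if isinstance(predicted, dict) else predicted
            squared_errors.append((rating - guess) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Test ratings copied from the notebook; None marks a missing rating.
test_rows = {
    "A": [None, None, None, None, None],
    "B": [None, None, None, 3, None],
    "C": [4, None, None, None, None],
    "D": [None, 2, None, None, None],
    "E": [None, None, 5, None, None],
    "F": [None, None, None, None, 4],
}

print(rmse(test_rows, 3.5))
```

With the raw average of 3.5 as the only prediction, the five squared errors sum to 5.25, so the result is the square root of 1.05.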

## Calculate the Biases

We can now calculate the user and item biases.  The [video](https://www.youtube.com/watch?v=dGM4bNQcVKI&list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9&index=14) calculates the bias for user D and comes up with -1.5.  Here's the derivation:

In [6]:
((2 + 3 + 1 + 2) / 4) - 3.5

-1.5

Now we will check that against our function's value:

In [7]:
user_bias_train_df, item_bias_train_df = data612.get_biases(train_df, raw_avg)
user_bias_train_df["D"]

-1.5

This matches.  We will also check the item bias for movie III.  The video gets 0.75.  Here's their math:

In [8]:
((4 + 5 + 3 + 5) / 4) - 3.5

0.75

And here's what I got:

In [9]:
item_bias_train_df["III"]

0.75

At the end of the video he gives all the biases.  Here's what he gives for the users:
* A = 0.83
* B = 0.5
* C = -1.0
* D = -1.5
* E = 0.83
* F = 0.25

Here's what I got:

In [10]:
user_bias_train_df

A    0.833333
B    0.500000
C   -1.000000
D   -1.500000
E    0.833333
F    0.250000
dtype: float64

Sweet.  It's a match.  For the items he gives the following values:
* I = 0.3
* II = -1.17
* III = 0.75
* IV = -0.5
* V = 0.1

Here's what I got:

In [11]:
item_bias_train_df

I      0.300000
II    -1.166667
III    0.750000
IV    -0.500000
V      0.100000
dtype: float64
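
The bias calculation itself is short enough to sketch without data612.  The code below assumes, as in the hand derivations above, that a user's bias is that user's mean rating minus the raw average, and likewise for items.  `mean_of_observed`, `user_bias`, and `item_bias` are illustrative names, and this is not necessarily how get_biases() is implemented.

```python
ITEMS = ["I", "II", "III", "IV", "V"]

# Training ratings copied from the notebook; None marks a missing rating.
train_rows = {
    "A": [5, None, 4, None, 4],
    "B": [4, 3, 5, None, 4],
    "C": [None, 2, None, None, 3],
    "D": [2, None, 3, 1, 2],
    "E": [4, None, None, 4, 5],
    "F": [4, 2, 5, 4, None],
}
RAW_AVG = 3.5  # raw average computed earlier in the notebook

def mean_of_observed(values):
    """Mean of the entries that are not None."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

# User bias: the user's mean rating minus the raw average.
user_bias = {user: mean_of_observed(row) - RAW_AVG for user, row in train_rows.items()}

# Item bias: the item's mean rating minus the raw average.
item_bias = {
    item: mean_of_observed([row[col] for row in train_rows.values()]) - RAW_AVG
    for col, item in enumerate(ITEMS)
}

# Print every bias rounded to two decimals for comparison with the video.
for name, bias in {**user_bias, **item_bias}.items():
    print(name, round(bias, 2))
```

For example, user E rated three movies (4, 4, and 5), so the mean is about 4.33 and the bias is about 0.83 above the raw average.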

## Baseline Predictors

Now we can make our baseline predictions.  In the [video](https://www.youtube.com/watch?v=4RSigTais8o&list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9&index=15) he works out user D and movie III and comes up with 2.75.  He also works through user A and movie II and gets 3.16.  The unrounded value is 3.1667, so it appears as 3.17 once rounded to two decimals.

In [12]:
baseline_predictions_df = data612.get_baseline_predictions(raw_avg, user_bias_train_df, item_bias_train_df)

def between_1_and_5(x):
    if x > 5:
        return 5
    elif x < 1:
        return 1
    return x

# Round the predictions and clip them to the valid 1-5 rating range
baseline_predictions_df = baseline_predictions_df.round(2).applymap(between_1_and_5)
baseline_predictions_df

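|   | I | II | III | IV | V |
|---|---|---|---|---|---|
| A | 4.63 | 3.17 | 5.0 | 3.83 | 4.43 |
| B | 4.3 | 2.83 | 4.75 | 3.5 | 4.1 |
| C | 2.8 | 1.33 | 3.25 | 2.0 | 2.6 |
| D | 2.3 | 1.0 | 2.75 | 1.5 | 2.1 |
| E | 4.63 | 3.17 | 5.0 | 3.83 | 4.43 |
| F | 4.05 | 2.58 | 4.5 | 3.25 | 3.85 |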
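
Each cell in this table is the raw average plus the user's bias plus the item's bias, rounded and clipped to the 1-5 scale.  The sketch below rebuilds the table from the bias values printed earlier.  It assumes that get_baseline_predictions() simply adds the three terms, which its source would need to confirm, and `clip_rating` and `predictions` are names made up for this example.

```python
RAW_AVG = 3.5

# Biases as printed in the notebook output above.
user_bias = {"A": 0.833333, "B": 0.5, "C": -1.0, "D": -1.5, "E": 0.833333, "F": 0.25}
item_bias = {"I": 0.3, "II": -1.166667, "III": 0.75, "IV": -0.5, "V": 0.1}

def clip_rating(x, low=1, high=5):
    """Force a prediction into the valid rating range."""
    return max(low, min(high, x))

# Baseline prediction = raw average + user bias + item bias,
# rounded to two decimals and then clipped to 1-5.
predictions = {
    user: {
        item: clip_rating(round(RAW_AVG + b_user + b_item, 2))
        for item, b_item in item_bias.items()
    }
    for user, b_user in user_bias.items()
}

print(predictions["D"]["III"], predictions["A"]["II"])
print(predictions["A"]["III"], predictions["D"]["II"])
```

The second line shows the clipping at work: user A with movie III would be 5.08 and user D with movie II would be 0.83 without it.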


## RMSE Part II

The [video](https://www.youtube.com/watch?v=lppNpLFelOc&list=PLuKhJYywjDe96T2L0-zXFU5Up2jqXlWI9&index=16) works out the RMSE on the test set using the baseline predictions.  He worked it out to be 0.7365 as shown below:

In [13]:
(((4 - 2.8)**2 + (2 - 1)**2 + (5 - 5)**2 + (3 - 3.5)**2 + (4 - 3.85)**2) / 5)**(1/2)

0.7365459931328119

Now to see what our function returns:

In [14]:
data612.get_RMSE(test_df, baseline_predictions_df)

0.7365459931328119

It matches.  In the video he says the training data has an RMSE of 0.4709.  He doesn't show the derivation because of the number of terms.  We'll take his word for it and check our function:

In [15]:
data612.get_RMSE(train_df, baseline_predictions_df)

0.47099363053018034

It matches.  I can conclude that the functions reproduce the video's results on these small datasets, which gives me reasonable confidence in using them on larger datasets.
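
Since the video skips the training-set derivation, here is a sketch that spells out all 20 terms in plain Python.  The prediction table is copied from the rounded, clipped output above, and `predicted_rows`, `squared_errors`, and `train_rmse` are illustrative names only.

```python
import math

# Training ratings copied from the notebook; None marks a missing rating.
train_rows = {
    "A": [5, None, 4, None, 4],
    "B": [4, 3, 5, None, 4],
    "C": [None, 2, None, None, 3],
    "D": [2, None, 3, 1, 2],
    "E": [4, None, None, 4, 5],
    "F": [4, 2, 5, 4, None],
}

# Rounded, clipped baseline predictions copied from the notebook output.
predicted_rows = {
    "A": [4.63, 3.17, 5.0, 3.83, 4.43],
    "B": [4.3, 2.83, 4.75, 3.5, 4.1],
    "C": [2.8, 1.33, 3.25, 2.0, 2.6],
    "D": [2.3, 1.0, 2.75, 1.5, 2.1],
    "E": [4.63, 3.17, 5.0, 3.83, 4.43],
    "F": [4.05, 2.58, 4.5, 3.25, 3.85],
}

# One squared error per observed training rating.
squared_errors = [
    (actual - predicted) ** 2
    for user, row in train_rows.items()
    for actual, predicted in zip(row, predicted_rows[user])
    if actual is not None
]

train_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(len(squared_errors), train_rmse)
```

The squared errors sum to about 4.4367 over 20 ratings, which agrees with the function's result and with the 0.4709 quoted in the video.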