# DATA 612 Project 1 - Joke Recommender System Validation

By Mike Silva

## Introduction

This recommender system aims to provide users with jokes they are likely to find funny.  Serving content that matches their tastes should keep users engaged longer.

### About the Jester Dataset

For this project I will be using the [Jester dataset](http://eigentaste.berkeley.edu/dataset/).  It was created by Ken Goldberg at UC Berkeley (Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001).

The data files are distributed in .zip format; when unzipped, they are in Excel (.xls) format.  The ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null", meaning "not rated").  Each row is a user.  The first column gives the number of jokes rated by the user.  The next 100 columns give the ratings for jokes 1 to 100.  I will use only the first data set, which has data for users who have rated 36 or more jokes.

The researchers note that the sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense: almost all users have rated those jokes.  I will use a small subset of this matrix to verify the calculations performed on the whole.

In [1]:
import os
import requests
import zipfile
import pandas as pd
import data612

# STEP 1 - DOWNLOAD THE DATA SET
if not os.path.exists("jester_dataset_1_1.zip"):
    # We need to download it
    response = requests.get("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip")
    if response.status_code == 200:
        with open("jester_dataset_1_1.zip", "wb") as f:
            f.write(response.content)
# STEP 2 - EXTRACT THE DATA SET
if not os.path.exists("jester-data-1.xls"):
    with zipfile.ZipFile("jester_dataset_1_1.zip","r") as z:
        z.extract("jester-data-1.xls")
# STEP 3 - READ IN THE DATA
# The data is a continuous rating scale from -10 to 10.  99 is used if a user hasn't rated a joke.
# The data does not have a header.  Using the column numbers works great.
df = pd.read_excel("jester-data-1.xls",  header=None, na_values = 99)
# We should have a 24,983 X 101 data frame
df.shape

(24983, 101)
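
The effect of `na_values=99` is easiest to see on a tiny example. The following is a minimal sketch that reads two made-up rows from an in-memory CSV with `pd.read_csv` instead of the Excel file; the rows are hypothetical and are not taken from the Jester data.

```python
import io
import pandas as pd

# Two hypothetical rows in the Jester layout: a rating count, then the ratings.
# The value 99 marks a joke the user did not rate.
raw = "2,-7.82,99,4.17\n3,4.08,-0.73,-5.34\n"

sample = pd.read_csv(io.StringIO(raw), header=None, na_values=99)

# The sentinel has become NaN, so pandas skips it in sums, counts and means.
print(sample)
print("missing cells:", sample.isna().sum().sum())
```

Because a scalar `na_values` applies to every column, a rating count of exactly 99 in column 0 would also be read as missing. That is harmless here, since column 0 is not used in the calculations.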

Since the first column (0) is the number of jokes rated by the user, we can drop it so that only the ratings remain.

Now I will create the small subset using the first 10 users and 10 jokes.  Nine of the ten jokes are identified as having dense ratings data.  The first joke is added to the set to ensure some missing values.

In [2]:
df = df[[1, 7, 8, 13, 15, 16, 17, 18, 19, 20]].head(10)
df

Unnamed: 0,1,7,8,13,15,16,17,18,19,20
0,-7.82,-9.85,4.17,-7.18,-7.18,-7.52,-7.43,-9.81,-9.85,-9.85
1,4.08,-0.73,-5.34,4.42,4.56,-0.97,4.66,-0.68,3.3,-1.21
2,,9.03,9.27,9.37,-6.36,-6.89,-7.86,9.03,9.03,9.03
3,,-2.82,6.21,6.31,-7.23,-6.65,1.17,-6.6,-3.64,-2.09
4,8.5,7.04,4.61,-3.93,-2.33,-9.66,2.72,-1.36,2.57,4.51
5,-6.17,-8.69,-0.87,-5.0,0.49,-8.93,-3.69,-2.18,-2.28,-6.12
6,,7.72,8.79,-6.26,6.07,-3.5,-2.09,6.17,5.15,4.42
7,6.84,9.27,1.41,-6.94,0.29,-9.9,-7.09,-7.18,1.02,-0.29
8,-3.79,-5.29,-8.93,-4.85,-8.74,-6.99,-8.74,-2.91,-3.35,-0.29
9,3.01,8.93,2.52,4.47,-4.66,-0.97,-0.44,1.55,0.49,4.37


## Break into Training and Test Sets

Now that we have data we will break it into training and test sets.

In [3]:
train_df, test_df = data612.train_test_split(df)
train_df

Unnamed: 0,1,7,8,13,15,16,17,18,19,20
0,-7.82,-9.85,4.17,-7.18,-7.18,-7.52,-7.43,-9.81,-9.85,-9.85
1,4.08,-0.73,-5.34,,4.56,-0.97,4.66,,3.3,-1.21
2,,9.03,9.27,9.37,-6.36,-6.89,,9.03,9.03,9.03
3,,-2.82,6.21,6.31,-7.23,-6.65,1.17,,-3.64,-2.09
4,,7.04,,-3.93,-2.33,-9.66,2.72,-1.36,2.57,4.51
5,-6.17,,-0.87,-5.0,,-8.93,-3.69,-2.18,-2.28,-6.12
6,,7.72,8.79,,,-3.5,-2.09,6.17,5.15,
7,6.84,9.27,,-6.94,,-9.9,-7.09,-7.18,1.02,-0.29
8,,,,-4.85,-8.74,-6.99,,-2.91,-3.35,-0.29
9,3.01,8.93,,4.47,-4.66,,-0.44,1.55,0.49,4.37


In [4]:
test_df

Unnamed: 0,1,7,8,13,15,16,17,18,19,20
0,,,,,,,,,,
1,,,,4.42,,,,-0.68,,
2,,,,,,,-7.86,,,
3,,,,,,,,-6.6,,
4,8.5,,4.61,,,,,,,
5,,-8.69,,,0.49,,,,,
6,,,,-6.26,6.07,,,,,4.42
7,,,1.41,,0.29,,,,,
8,-3.79,-5.29,-8.93,,,,-8.74,,,
9,,,2.52,,,-0.97,,,,
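
The `data612` module is not shown in this notebook, so how `train_test_split` chooses the held-out ratings is not known. One plausible approach, sketched here under that assumption, is to move a random share of the observed ratings into the test frame and blank them out in the training frame. The function name `split_ratings` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def split_ratings(ratings, test_fraction=0.2, seed=0):
    """Hypothetical helper: hold out a random share of the observed ratings."""
    rng = np.random.default_rng(seed)
    observed = ratings.notna().to_numpy()
    # Only cells that actually hold a rating can be moved to the test set.
    held_out = observed & (rng.random(ratings.shape) < test_fraction)
    train = ratings.mask(held_out)    # held-out cells become NaN in training
    test = ratings.where(held_out)    # only held-out cells survive in test
    return train, test

toy = pd.DataFrame([[1.0, np.nan, 3.0],
                    [4.0, 5.0, np.nan],
                    [7.0, 8.0, 9.0]])
train, test = split_ratings(toy, test_fraction=0.3, seed=42)
print(train)
print(test)
```

With this design every observed rating lands in exactly one of the two frames, which matches the pattern visible in the `train_df` and `test_df` outputs above.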


## Calculate the Average Rating

Now that we have a training set, we need to calculate the raw average (mean) rating, which serves as the prediction for every user-item combination.

In [5]:
raw_avg = train_df.sum(numeric_only=True).sum() / train_df.count().sum(axis = 0)
raw_avg

-0.8758974358974362

With a mean rating of almost -1, the users in this small sample did not find most of these jokes very funny.
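
The raw average is the sum of all observed ratings divided by the number of observed ratings. As a quick illustration on a made-up frame (the `toy` data is hypothetical), three common ways of writing this in pandas and NumPy agree:

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 2 ratings frame with one missing rating.
toy = pd.DataFrame([[4.0, np.nan],
                    [-2.0, 1.0]])

by_sum_and_count = toy.sum().sum() / toy.count().sum()  # same form as In [5]
by_stack = toy.stack().mean()                           # mean() skips missing cells
by_numpy = np.nanmean(toy.to_numpy())                   # NaN-aware mean over all cells

print(by_sum_and_count, by_stack, by_numpy)
```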

### Validation

This may seem silly, but we will validate the above calculation just to make sure we did it right.

In [6]:
the_avg = (-7.82 - 9.85 + 4.17 - 7.18 - 7.18 - 7.52 - 7.43 - 9.81 - 9.85 - 9.85 + 4.08 - 0.73 - 5.34 + 4.56 - 0.97 + 4.66 + 3.30 - 1.21 + 9.03 + 9.27 + 9.37 - 6.36 - 6.89 + 9.03 + 9.03 + 9.03 - 2.82 + 6.21 + 6.31 - 7.23 - 6.65 + 1.17 - 3.64 - 2.09 + 7.04 - 3.93 - 2.33 - 9.66 + 2.72 - 1.36 + 2.57 + 4.51 - 6.17 - 0.87 - 5.00 - 8.93 - 3.69 - 2.18 - 2.28 - 6.12 + 7.72 + 8.79 - 3.50 - 2.09 + 6.17 + 5.15 + 6.84 + 9.27 - 6.94 - 9.90 - 7.09 - 7.18 + 1.02 - 0.29 - 4.85 - 8.74 - 6.99 - 2.91 - 3.35 - 0.29 + 3.01 + 8.93 + 4.47 - 4.66 - 0.44 + 1.55 + 0.49 + 4.37) / 78
the_avg

-0.8758974358974354

The two results differ only at the 15th decimal place, which is floating-point rounding error.  I'm not going to sweat that.
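
The likely cause is the order of addition: `train_df.sum().sum()` totals each column first and then adds the column totals, while the hand-written expression adds the ratings row by row. Floating-point addition is not associative, so a different order can change the last digits. A small standalone illustration:

```python
import math

values = [0.1, 0.2, 0.3]

left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The same three numbers, grouped differently, give different floats.
print(left_to_right == right_to_left)   # False
print(left_to_right, right_to_left)

# math.fsum tracks the lost low-order bits and is less sensitive to order.
print(math.fsum(values))
```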

## Calculate the RMSE 

Now I will calculate the RMSE of the raw average as a predictor, for both the training and test data.

In [7]:
train_df_RMSE = data612.get_RMSE(train_df, raw_avg)
train_df_RMSE

6.0313466276882775

In [8]:
test_df_RMSE = data612.get_RMSE(test_df, raw_avg)
test_df_RMSE

5.4707428277558865

### Validation

I will verify the above calculations.  I will do so on the training data.

#### Initial Data Set

In [9]:
train_df

Unnamed: 0,1,7,8,13,15,16,17,18,19,20
0,-7.82,-9.85,4.17,-7.18,-7.18,-7.52,-7.43,-9.81,-9.85,-9.85
1,4.08,-0.73,-5.34,,4.56,-0.97,4.66,,3.3,-1.21
2,,9.03,9.27,9.37,-6.36,-6.89,,9.03,9.03,9.03
3,,-2.82,6.21,6.31,-7.23,-6.65,1.17,,-3.64,-2.09
4,,7.04,,-3.93,-2.33,-9.66,2.72,-1.36,2.57,4.51
5,-6.17,,-0.87,-5.0,,-8.93,-3.69,-2.18,-2.28,-6.12
6,,7.72,8.79,,,-3.5,-2.09,6.17,5.15,
7,6.84,9.27,,-6.94,,-9.9,-7.09,-7.18,1.02,-0.29
8,,,,-4.85,-8.74,-6.99,,-2.91,-3.35,-0.29
9,3.01,8.93,,4.47,-4.66,,-0.44,1.55,0.49,4.37


#### Prediction Error

Subtracting the prediction (the raw average) from the training ratings yields the error.

In [10]:
error_df = train_df - raw_avg
error_df

Unnamed: 0,1,7,8,13,15,16,17,18,19,20
0,-6.944103,-8.974103,5.045897,-6.304103,-6.304103,-6.644103,-6.554103,-8.934103,-8.974103,-8.974103
1,4.955897,0.145897,-4.464103,,5.435897,-0.094103,5.535897,,4.175897,-0.334103
2,,9.905897,10.145897,10.245897,-5.484103,-6.014103,,9.905897,9.905897,9.905897
3,,-1.944103,7.085897,7.185897,-6.354103,-5.774103,2.045897,,-2.764103,-1.214103
4,,7.915897,,-3.054103,-1.454103,-8.784103,3.595897,-0.484103,3.445897,5.385897
5,-5.294103,,0.005897,-4.124103,,-8.054103,-2.814103,-1.304103,-1.404103,-5.244103
6,,8.595897,9.665897,,,-2.624103,-1.214103,7.045897,6.025897,
7,7.715897,10.145897,,-6.064103,,-9.024103,-6.214103,-6.304103,1.895897,0.585897
8,,,,-3.974103,-7.864103,-6.114103,,-2.034103,-2.474103,0.585897
9,3.885897,9.805897,,5.345897,-3.784103,,0.435897,2.425897,1.365897,5.245897


#### Squared Prediction Error

Now we square the prediction error.

In [11]:
squared_error_df = error_df ** 2
squared_error_df

Unnamed: 0,1,7,8,13,15,16,17,18,19,20
0,48.22056,80.534517,25.461081,39.741709,39.741709,44.144099,42.95626,79.818189,80.534517,80.534517
1,24.560919,0.021286,19.928212,,29.548981,0.008855,30.64616,,17.438119,0.111625
2,,98.126804,102.939235,104.978414,30.075381,36.16943,,98.126804,98.126804,98.126804
3,,3.779535,50.209942,51.637122,40.374619,33.34026,4.185696,,7.640263,1.474045
4,,62.661432,,9.327542,2.114414,77.160458,12.930478,0.234355,11.874209,29.007891
5,28.027522,,3.5e-05,17.008222,,64.868568,7.919173,1.700683,1.971504,27.500612
6,,73.889453,93.429573,,,6.885914,1.474045,49.644671,36.31144,
7,59.535073,102.939235,,36.77334,,81.434427,38.615071,39.741709,3.594427,0.343276
8,,,,15.793491,61.844109,37.38225,,4.137573,6.121183,0.343276
9,15.100199,96.155625,,28.578619,14.319432,,0.190007,5.884978,1.865676,27.51944


#### Average the Square Errors

Now we will get the mean of the squared errors.

In [12]:
mean_squared_errors = squared_error_df.stack().mean()
mean_squared_errors

36.37714214332676

#### Root Mean Square Errors

Finally, we will take the square root of the mean squared error.

In [13]:
RMSE = mean_squared_errors ** (1/2)
RMSE

6.0313466276882775
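
The four steps above (error, squared error, mean, square root) can be folded into one small function. This is a minimal sketch, not the actual `data612.get_RMSE`, whose source is not shown here; the name `rmse` and the `toy` data are hypothetical.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error over the observed (non-missing) ratings.

    `predicted` may be a single number or a DataFrame with the same labels.
    """
    squared_error = (actual - predicted) ** 2
    # nanmean ignores cells where the user did not rate the joke.
    return float(np.sqrt(np.nanmean(squared_error.to_numpy())))

toy = pd.DataFrame([[1.0, np.nan],
                    [3.0, 5.0]])
print(rmse(toy, 3.0))   # errors are -2, 0 and 2
```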

#### Check Results

Now we check these results against what we got previously.

In [14]:
if RMSE == train_df_RMSE:
    print("Everything is awesome")
else:
    print("Something bad happened")

Everything is awesome
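
The exact `==` comparison passes here because both values come from the same sequence of operations. When two floats are computed along different paths, as with the two means earlier in this notebook, a tolerance-based comparison is more robust. A short sketch using the standard library's `math.isclose` on those two reported means:

```python
import math

pandas_mean = -0.8758974358974362   # value reported by pandas in Out [5]
manual_mean = -0.8758974358974354   # value from the hand-written sum in Out [6]

print(pandas_mean == manual_mean)               # False
print(math.isclose(pandas_mean, manual_mean))   # True
```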


## Calculate the Biases

We can now calculate the user and item biases.

In [15]:
user_bias_train_df, item_bias_train_df = data612.get_biases(train_df, raw_avg)
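
The source of `data612.get_biases` is not shown, but the validation that follows computes a bias as a user's (or joke's) mean rating minus the raw average. Assuming that definition, a minimal sketch could look like this; the function name `biases` and the `toy` data are hypothetical.

```python
import numpy as np
import pandas as pd

def biases(ratings, raw_avg):
    """Return (user_bias, item_bias) as mean rating minus the raw average."""
    user_bias = ratings.mean(axis=1) - raw_avg   # one value per row (user)
    item_bias = ratings.mean(axis=0) - raw_avg   # one value per column (joke)
    return user_bias, item_bias

toy = pd.DataFrame([[2.0, 4.0],
                    [np.nan, 0.0]],
                   index=["u1", "u2"], columns=["j1", "j2"])
toy_raw_avg = np.nanmean(toy.to_numpy())
user_bias, item_bias = biases(toy, toy_raw_avg)
print(user_bias)
print(item_bias)
```

Missing ratings are skipped by `mean`, so each bias is based only on the ratings that user or joke actually has.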

### Validation

Now we validate the bias calculations.

#### User Bias

As a spot check, we will compute the bias for the first user by hand.  Every user goes through the same calculation, so a match here is good evidence that the rest are right too.

In [16]:
train_df.loc[0]

1    -7.82
7    -9.85
8     4.17
13   -7.18
15   -7.18
16   -7.52
17   -7.43
18   -9.81
19   -9.85
20   -9.85
Name: 0, dtype: float64

In [17]:
user_0_avg = (-7.82 -9.85 + 4.17 - 7.18 - 7.18 -7.52 - 7.43 - 9.81 - 9.85 - 9.85) / 10
user_0_bias = user_0_avg - raw_avg
user_0_bias

-6.356102564102563

Looks like the first user was not very impressed by the jokes.  Let's see if the calculations match:

In [18]:
if user_bias_train_df[0] == user_0_bias:
    print("Everything is awesome")
else:
    print("Something bad happened")

Everything is awesome


It matches.

#### Item Bias

Now let's see how the first joke fared.

In [19]:
train_df.iloc[:,0]

0   -7.82
1    4.08
2     NaN
3     NaN
4     NaN
5   -6.17
6     NaN
7    6.84
8     NaN
9    3.01
Name: 1, dtype: float64

In [20]:
item_1_avg = (-7.82 + 4.08 - 6.17 + 6.84 + 3.01) / 5
item_1_bias = item_1_avg - raw_avg
item_1_bias

0.8638974358974361

This joke did slightly better than the mean.

In [21]:
if item_bias_train_df[1] == item_1_bias:
    print("Everything is awesome")
else:
    print("Something bad happened")

Everything is awesome


## Baseline Predictors

Now we can make our baseline predictions.

In [22]:
baseline_predictions_df = data612.get_baseline_predictions(raw_avg, user_bias_train_df, item_bias_train_df)
baseline_predictions_df

Unnamed: 0,1,7,8,13,15,16,17,18,19,20
0,-6,-3,-3,-7,-10,-10,-8,-7,-6,-7
1,2,5,6,1,-3,-5,0,1,2,2
2,6,10,10,5,2,-1,5,5,6,6
3,0,3,3,-1,-5,-7,-2,-1,0,0
4,1,4,5,0,-4,-6,-1,0,1,1
5,-4,0,0,-4,-8,-10,-5,-4,-3,-4
6,5,8,8,4,0,-2,3,4,5,4
7,-1,3,3,-2,-5,-8,-2,-2,-1,-1
8,-4,0,0,-5,-8,-10,-5,-4,-3,-4
9,3,7,7,2,-1,-4,2,2,3,3
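
The table is consistent with the usual baseline formula: raw average plus user bias plus item bias, kept inside the rating range of -10 to 10. For example, the first cell is roughly -0.88 - 6.36 + 0.86, or about -6.4, which is shown as -6. Whether `data612.get_baseline_predictions` does exactly this is an assumption; the sketch below uses a hypothetical function name and made-up biases, and it does not round, although the integers shown above suggest the real function may.

```python
import numpy as np
import pandas as pd

def baseline_predictions(raw_avg, user_bias, item_bias, low=-10.0, high=10.0):
    """Predict raw_avg + user bias + item bias for every user-item pair."""
    # Broadcasting a column of user biases against a row of item biases
    # produces one prediction per (user, item) cell.
    values = raw_avg + user_bias.to_numpy()[:, None] + item_bias.to_numpy()[None, :]
    clipped = np.clip(values, low, high)   # stay inside the rating scale
    return pd.DataFrame(clipped, index=user_bias.index, columns=item_bias.index)

toy_user_bias = pd.Series([2.0, -3.0], index=["u1", "u2"])
toy_item_bias = pd.Series([0.5, 9.0], index=["j1", "j2"])
toy_predictions = baseline_predictions(1.0, toy_user_bias, toy_item_bias)
print(toy_predictions)
```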


In [23]:
data612.get_RMSE(train_df, baseline_predictions_df)

3.722589393805311

In [24]:
data612.get_RMSE(test_df, baseline_predictions_df)

6.236859027292166
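
To put the four RMSE values side by side, the relative change from the raw-average predictor to the baseline predictor can be computed directly from the numbers reported above. The helper name `improvement` is hypothetical.

```python
# RMSE values copied from the outputs above.
raw_train, raw_test = 6.0313466276882775, 5.4707428277558865
base_train, base_test = 3.722589393805311, 6.236859027292166

def improvement(before, after):
    """Fractional reduction in RMSE; negative means the error grew."""
    return 1 - after / before

train_change = improvement(raw_train, base_train)
test_change = improvement(raw_test, base_test)
print(f"train: {train_change:.1%}, test: {test_change:.1%}")
```

On this 10-user, 10-joke sample the baseline lowers the training RMSE by roughly 38% but raises the test RMSE by roughly 14%. With so few ratings per user and per joke, the biases may simply be fitting the training data too closely, so this comparison should be repeated on the full data set before drawing conclusions.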