# DATA 612 Project 1 - Joke Recommender System

By Mike Silva

## Introduction

This recommender system provides our users with jokes that they will find funny.  By providing this content we will keep users engaged longer.

## Project Requirements

• Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”

• Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.

• Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists, (or another data structure of your choosing). From there, create a user-item matrix.

• If you choose to work with a large dataset, you’re encouraged to also create a small, relatively dense “user-item” matrix as a subset so that you can hand-verify your calculations.

• Break your ratings into separate training and test datasets.

• Using your training data, calculate the raw average (mean) rating for every user-item combination.

• Calculate the RMSE for raw average for both your training data and your test data.

• Using your training data, calculate the bias for each user and each item.

• From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.

• Calculate the RMSE for the baseline predictors for both your training data and your test data.

• Summarize your results

### About the Jester Dataset

For this project I will be using the [Jester dataset](http://eigentaste.berkeley.edu/dataset/).  It was created by Ken Goldberg at UC Berkeley (Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001).

The data files are distributed in .zip format and contain Excel (.xls) files when unzipped.  The ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null", meaning "not rated").  Each row is a user.  The first column gives the number of jokes rated by the user.  The next 100 columns give the ratings for jokes 1 to 100.  I will only be using the first data set, which has data for users that have rated 36 or more jokes.

The researchers note that the sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes.  I will be using a small subset of this matrix to verify the calculations performed on the whole.
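
The density claim can be checked directly once ratings are in a data frame. The sketch below uses a made-up toy matrix (the values and the 75% threshold are assumptions for illustration, not Jester data) to show one way of measuring how many users rated each joke.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (made-up ratings); NaN marks "not rated"
toy = pd.DataFrame({
    5: [1.0, -3.0, 7.5, 2.0],
    6: [np.nan, np.nan, 4.0, np.nan],
    7: [0.5, 9.0, np.nan, -6.0],
})

# Share of users who rated each joke (1.0 means every user rated it)
density = toy.notna().mean()
print(density)

# Keep only the jokes rated by at least 75% of users (hypothetical threshold)
dense_jokes = density[density >= 0.75].index.tolist()
print(dense_jokes)  # [5, 7]
```

Running the same `notna().mean()` on the full ratings frame would show which columns are dense enough for hand verification.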

In [1]:
from os import path
import requests
import zipfile
import numpy as np
import pandas as pd

# STEP 1 - DOWNLOAD THE DATA SET
if not path.exists("jester_dataset_1_1.zip"):
    # We need to download it
    response = requests.get("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip")
    if response.status_code == 200:
        with open("jester_dataset_1_1.zip", "wb") as f:
            f.write(response.content)
# STEP 2 - EXTRACT THE DATA SET
if not path.exists("jester-data-1.xls"):
    with zipfile.ZipFile("jester_dataset_1_1.zip","r") as z:
        z.extract("jester-data-1.xls")
# STEP 3 - READ IN THE DATA
# The data is a continuous rating scale from -10 to 10.  99 is used if a user hasn't rated a joke.
# The data does not have a header.  Using the column numbers works great.
df = pd.read_excel("jester-data-1.xls",  header=None, na_values = 99)
# We should have a 24,983 X 101 data frame
df.shape

(24983, 101)
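
To illustrate what `na_values = 99` does, here is a minimal sketch on two made-up users laid out like the Jester file. It uses `pd.read_csv` on an in-memory string as a stand-in for `pd.read_excel`, which is an assumption made so the example runs without the spreadsheet. Note that `na_values` applies to every column, so a user whose rating count in column 0 is exactly 99 would also be read as missing there; that is harmless here because column 0 is dropped.

```python
import io
import pandas as pd

# Two made-up users in the Jester layout: rating count first, then ratings (99 = not rated)
raw = "3,1.5,99,-2.25,10\n2,99,99,4.0,-7.5\n"

# read_csv stands in for read_excel here; both accept header and na_values
toy = pd.read_csv(io.StringIO(raw), header=None, na_values=99)

# Drop the count column so only ratings remain
ratings = toy.drop(columns=0)
rated_per_user = ratings.notna().sum(axis=1)

# The stored count in column 0 should match the number of non-missing ratings
print((rated_per_user == toy[0]).all())  # True
```

The same comparison on the real frame is a cheap sanity check that the missing-value marker was parsed as intended.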

Since the first column (0) is the number of jokes rated by the user we can drop that so we are only left with the ratings.

In [2]:
df = df.drop([0], axis=1)

The rating scale was continuous from -10 to 10.  We will simplify this data set by transforming all values to the nearest whole number.

In [3]:
df = df.round(0)
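
One detail worth knowing about `round(0)`: pandas delegates to NumPy, which rounds exact halves to the nearest even number rather than always away from zero. The sketch below uses made-up values to show this, and it also explains the `-0.0` entries that appear in the tables: small negative ratings round to negative zero.

```python
import pandas as pd

# Made-up ratings chosen to hit the tie cases
ratings = pd.Series([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 8.51])

# Halves go to the nearest even integer: 0.5 -> 0.0, 1.5 -> 2.0, 2.5 -> 2.0
rounded = ratings.round(0)
print(rounded.tolist())  # [-2.0, -2.0, -0.0, 0.0, 2.0, 2.0, 9.0]
```

For this project the distinction rarely matters, since `-0.0` compares equal to `0.0` in every calculation that follows.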

Now I will create the small subset using the first 10 users and 10 jokes.  Nine of the ten jokes are identified as having dense ratings data.  The first joke is added to the set to ensure some missing values.

In [4]:
small_df = df[[1, 7, 8, 13, 15, 16, 17, 18, 19, 20]].head(10)
small_df

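| User | 1 | 7 | 8 | 13 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -8.0 | -10.0 | 4.0 | -7.0 | -7.0 | -8.0 | -7.0 | -10.0 | -10.0 | -10.0 |
| 1 | 4.0 | -1.0 | -5.0 | 4.0 | 5.0 | -1.0 | 5.0 | -1.0 | 3.0 | -1.0 |
| 2 | NaN | 9.0 | 9.0 | 9.0 | -6.0 | -7.0 | -8.0 | 9.0 | 9.0 | 9.0 |
| 3 | NaN | -3.0 | 6.0 | 6.0 | -7.0 | -7.0 | 1.0 | -7.0 | -4.0 | -2.0 |
| 4 | 8.0 | 7.0 | 5.0 | -4.0 | -2.0 | -10.0 | 3.0 | -1.0 | 3.0 | 5.0 |
| 5 | -6.0 | -9.0 | -1.0 | -5.0 | 0.0 | -9.0 | -4.0 | -2.0 | -2.0 | -6.0 |
| 6 | NaN | 8.0 | 9.0 | -6.0 | 6.0 | -4.0 | -2.0 | 6.0 | 5.0 | 4.0 |
| 7 | 7.0 | 9.0 | 1.0 | -7.0 | 0.0 | -10.0 | -7.0 | -7.0 | 1.0 | -0.0 |
| 8 | -4.0 | -5.0 | -9.0 | -5.0 | -9.0 | -7.0 | -9.0 | -3.0 | -3.0 | -0.0 |
| 9 | 3.0 | 9.0 | 3.0 | 4.0 | -5.0 | -1.0 | -0.0 | 2.0 | 0.0 | 4.0 |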


## Break into Training and Test Sets

Now that we have data we will break it into training and test sets.  We will begin with the smaller data set to verify the function is working properly.  The smaller data set has 97 non N/A values.  If we want 20% to be in the test set there will be 19 non N/A data points in the test set.
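
The test set size follows from the same rounding rule the split function uses: round the training count first, then give the remainder to the test set. A quick check of that arithmetic:

```python
n_ratings = 97            # non-missing ratings in the small data frame
train_proportion = 0.8

# 97 * 0.8 = 77.6, which rounds to 78 training ratings
n_train = int(round(n_ratings * train_proportion))
n_test = n_ratings - n_train
print(n_train, n_test)  # 78 19
```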

In [5]:
def train_test_split(user_item_df, train_proportion = 0.8, random_seed = 42):
    """Splits a data frame into two data frames.
    Args:
        user_item_df (DataFrame): the pandas dataframe that is a user-item matrix.
        train_proportion (float): the proportion of the non N/A data in the training set (Optional - 80% default)
        random_seed (int): the random number seed (Optional - 42 default).
    Returns:
        train_df (DataFrame): The training set dataframe.
        test_df (DataFrame): The testing set dataframe.
    """
    train_df = user_item_df.copy()
    np.random.seed(random_seed)
    # Count how many non N/A values are in the data frame
    has_data = train_df.count().sum(axis = 0)
    # Determine how many non N/A values we need in the test set
    n_test = has_data - int(round(has_data * train_proportion, 0))
    # Create an empty test data frame
    test_df = pd.DataFrame(np.nan, index = train_df.index, columns = train_df.columns)
    # Fill it with randomly selected values
    while test_df.count().sum(axis = 0) < n_test:
        # Randomly select a row & column label (take the scalar out of the 1-element array)
        row_id = np.random.choice(list(train_df.index), 1)[0]
        col_id = np.random.choice(list(train_df.columns), 1)[0]
        # Get the data at that location
        val = train_df.at[row_id, col_id]
        # Check to see if it is not N/A
        if not np.isnan(val):
            # Remove it from the training set
            train_df.at[row_id, col_id] = np.nan
            # Save it to the test set
            test_df.at[row_id, col_id] = val
    return(train_df, test_df)

train_small_df, test_small_df = train_test_split(small_df)
test_small_df

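| User | 1 | 7 | 8 | 13 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | -1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | -8.0 | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -7.0 | NaN | NaN |
| 4 | 8.0 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | -9.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | -6.0 | 6.0 | NaN | NaN | NaN | NaN | 4.0 |
| 7 | NaN | NaN | 1.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 8 | -4.0 | -5.0 | -9.0 | NaN | NaN | NaN | -9.0 | NaN | NaN | NaN |
| 9 | NaN | NaN | 3.0 | NaN | NaN | -1.0 | NaN | NaN | NaN | NaN |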


We see the above data frame does indeed have 19 non N/A values. We can apply this to all of the data. There are 1,810,455 ratings in the data set so this will take some time.  I am going to "cache" the output of this cell by saving it to disk and reading it in if it is present for future runs.

In [6]:
if not path.exists("project_1_train_df.csv") or not path.exists("project_1_test_df.csv"):
    train_df, test_df = train_test_split(df)
    train_df.to_csv("project_1_train_df.csv", index = False)
    test_df.to_csv("project_1_test_df.csv", index = False)
else:
    train_df = pd.read_csv("project_1_train_df.csv")
    test_df = pd.read_csv("project_1_test_df.csv")
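
The loop in `train_test_split` redraws random cells until it has collected enough test ratings, and recounts the test frame on every pass, which is why the full split is slow. As a possible alternative, the sketch below picks all test cells in one step. The function name `train_test_split_vectorized` is hypothetical, it is not the method used in this notebook, and it will not reproduce the same split as the cell above because it draws random numbers differently.

```python
import numpy as np
import pandas as pd

def train_test_split_vectorized(user_item_df, train_proportion=0.8, random_seed=42):
    """Hypothetical one-shot variant of train_test_split (same sizes, different draw)."""
    rng = np.random.default_rng(random_seed)
    values = user_item_df.to_numpy(dtype=float)
    # Row and column positions of every observed (non-NaN) rating
    rows, cols = np.nonzero(~np.isnan(values))
    n_obs = len(rows)
    n_test = n_obs - int(round(n_obs * train_proportion))
    # Choose n_test observed positions without replacement
    picked = rng.choice(n_obs, size=n_test, replace=False)
    train = values.copy()
    test = np.full_like(values, np.nan)
    # Move the chosen ratings from the training matrix to the test matrix
    test[rows[picked], cols[picked]] = values[rows[picked], cols[picked]]
    train[rows[picked], cols[picked]] = np.nan
    train_df = pd.DataFrame(train, index=user_item_df.index, columns=user_item_df.columns)
    test_df = pd.DataFrame(test, index=user_item_df.index, columns=user_item_df.columns)
    return train_df, test_df

# Usage on a made-up matrix with 10 observed ratings
toy = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
                    [5.0, 6.0, np.nan, 8.0],
                    [9.0, 10.0, 11.0, 12.0]])
toy_train, toy_test = train_test_split_vectorized(toy)
print(toy_train.count().sum(), toy_test.count().sum())  # 8 2
```

Because every test cell is drawn from the list of observed positions, no draw is wasted on a missing value and the run time no longer depends on how sparse the matrix is.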

## Calculate the Average Rating

Now that we have a training set we need to calculate the raw average (mean) rating for every user-item combination. Let's begin with the small data frame.

In [7]:
small_df_raw_avg = train_small_df.sum(numeric_only=True).sum() / train_small_df.count().sum(axis = 0)
small_df_raw_avg

-0.8846153846153846

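If this was calculated properly, it indicates that the average rating in this small sample sits slightly below zero, the midpoint of the -10 to 10 scale, so on average these jokes get a lukewarm reception.  Let's verify that the calculation is done properly.  Here's the training data: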

In [8]:
train_small_df

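| User | 1 | 7 | 8 | 13 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -8.0 | -10.0 | 4.0 | -7.0 | -7.0 | -8.0 | -7.0 | -10.0 | -10.0 | -10.0 |
| 1 | 4.0 | -1.0 | -5.0 | NaN | 5.0 | -1.0 | 5.0 | NaN | 3.0 | -1.0 |
| 2 | NaN | 9.0 | 9.0 | 9.0 | -6.0 | -7.0 | NaN | 9.0 | 9.0 | 9.0 |
| 3 | NaN | -3.0 | 6.0 | 6.0 | -7.0 | -7.0 | 1.0 | NaN | -4.0 | -2.0 |
| 4 | NaN | 7.0 | NaN | -4.0 | -2.0 | -10.0 | 3.0 | -1.0 | 3.0 | 5.0 |
| 5 | -6.0 | NaN | -1.0 | -5.0 | NaN | -9.0 | -4.0 | -2.0 | -2.0 | -6.0 |
| 6 | NaN | 8.0 | 9.0 | NaN | NaN | -4.0 | -2.0 | 6.0 | 5.0 | NaN |
| 7 | 7.0 | 9.0 | NaN | -7.0 | NaN | -10.0 | -7.0 | -7.0 | 1.0 | -0.0 |
| 8 | NaN | NaN | NaN | -5.0 | -9.0 | -7.0 | NaN | -3.0 | -3.0 | -0.0 |
| 9 | 3.0 | 9.0 | NaN | 4.0 | -5.0 | NaN | -0.0 | 2.0 | 0.0 | 4.0 |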


We're going to verify this by a series of calculations:

In [9]:
# Column 1
sum_of_ratings = -8 + 4 - 6 + 7 + 3
# Column 7
sum_of_ratings = sum_of_ratings - 10 - 1 + 9 - 3 + 7 + 8 + 9 + 9
# Column 8
sum_of_ratings = sum_of_ratings + 4 - 5 + 9 + 6 - 1 + 9
# Column 13
sum_of_ratings = sum_of_ratings - 7 + 9 + 6 - 4 - 5 - 7 - 5 + 4
# Column 15
sum_of_ratings = sum_of_ratings - 7 + 5 - 6 - 7 - 2 - 9 - 5
# Column 16
sum_of_ratings = sum_of_ratings - 8 - 1 - 7 - 7 - 10 - 9 - 4 - 10 - 7
# Column 17
sum_of_ratings = sum_of_ratings - 7 + 5 + 1 + 3 - 4 - 2 - 7 - 0
# Column 18
sum_of_ratings = sum_of_ratings - 10 + 9 - 1 -2 + 6 - 7 - 3 + 2
# Column 19
sum_of_ratings = sum_of_ratings - 10 + 3 + 9 - 4 + 3 - 2 + 5 + 1 - 3 + 0
# Column 20
sum_of_ratings = sum_of_ratings - 10 - 1 + 9 - 2 + 5 - 6 - 0 - 0 + 4
# Remember there were 97 non N/A values and 19 were moved into the test set
n_ratings = 97 - 19
# Validate the average rating
if (sum_of_ratings / n_ratings) == small_df_raw_avg:
    print("Everything is awesome!")
else:
    print("Something bad happened")

Everything is awesome!
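
The exact `==` comparison succeeds here because both sides reduce to the same division, -69 / 78 (the ten column sums in the cell above add up to -69). In general, two floating point results that are mathematically equal can differ in their last digits, so a tolerance-based comparison is a safer habit. A minimal sketch using the standard library:

```python
import math

hand_sum = -69     # total of the 78 training ratings summed by hand above
n_ratings = 78

hand_avg = hand_sum / n_ratings
print(hand_avg)

# Exact equality can fail when a quantity is accumulated in a different order
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True

# Tolerance-based check of the hand average against the value reported earlier
print(math.isclose(hand_avg, -0.8846153846153846))  # True
```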


This checks out with the average calculated at the beginning of this section.  We can replicate the process with the full dataset.

In [10]:
df_raw_avg = train_df.sum(numeric_only=True).sum() / train_df.count().sum(axis = 0)
df_raw_avg

0.8789979590765857
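
The raw average divides the total of all ratings by the number of ratings, so every rating carries equal weight. That is not the same as averaging the per-joke means, which gives sparsely rated jokes more influence. The toy example below (made-up ratings, hypothetical column names) shows the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"joke_a": [1.0, 3.0], "joke_b": [np.nan, 5.0]})

# Raw average as computed in the notebook: total of ratings / number of ratings
raw_avg = toy.sum(numeric_only=True).sum() / toy.count().sum()

# Equivalent shortcut: NumPy's nanmean over every cell, skipping NaN
nan_avg = np.nanmean(toy.to_numpy())

# Not equivalent: the mean of the per-joke means, (2.0 + 5.0) / 2
mean_of_means = toy.mean().mean()

print(raw_avg, nan_avg, mean_of_means)  # 3.0 3.0 3.5
```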

## Calculate the RMSE 

Now I will calculate the root mean squared error (RMSE) of the raw average for both the training and test data.
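
For reference, the quantity being computed can be written as follows. The notation is an assumption introduced here for illustration: K is the set of (user, joke) pairs with an observed rating in the data being evaluated, r_ui is the rating user u gave joke i, and mu is the raw average from the training data.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - \mu \right)^2}
```

Missing ratings are simply left out of both the sum and the count, which matches how the function below ignores N/A cells.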

In [11]:
def get_RMSE(user_item_df, predictor):
    """Calculates the RMSE for the predictor for the dataframe
    Args:
        user_item_df (DataFrame): the pandas dataframe that is a user-item matrix.
        predictor (float): the predicted value.
    Returns:
        RMSE (float): the root mean squared error over all non N/A ratings.
    """
    errors = user_item_df - predictor
    squared_errors = errors ** 2
    mean_squared_errors = squared_errors.stack().mean()
    RMSE = mean_squared_errors ** (1/2)
    return(RMSE)

small_df_RMSE = get_RMSE(small_df, small_df_raw_avg)
small_df_RMSE

5.933733217640245

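Note that this value is computed over all 97 ratings in `small_df` (training and test combined), using the training average as the predictor.  Now we need to verify the calculation step by step: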

In [12]:
errors = small_df - small_df_raw_avg
errors

| User | 1 | 7 | 8 | 13 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -7.115385 | -9.115385 | 4.884615 | -6.115385 | -6.115385 | -7.115385 | -6.115385 | -9.115385 | -9.115385 | -9.115385 |
| 1 | 4.884615 | -0.115385 | -4.115385 | 4.884615 | 5.884615 | -0.115385 | 5.884615 | -0.115385 | 3.884615 | -0.115385 |
| 2 | NaN | 9.884615 | 9.884615 | 9.884615 | -5.115385 | -6.115385 | -7.115385 | 9.884615 | 9.884615 | 9.884615 |
| 3 | NaN | -2.115385 | 6.884615 | 6.884615 | -6.115385 | -6.115385 | 1.884615 | -6.115385 | -3.115385 | -1.115385 |
| 4 | 8.884615 | 7.884615 | 5.884615 | -3.115385 | -1.115385 | -9.115385 | 3.884615 | -0.115385 | 3.884615 | 5.884615 |
| 5 | -5.115385 | -8.115385 | -0.115385 | -4.115385 | 0.884615 | -8.115385 | -3.115385 | -1.115385 | -1.115385 | -5.115385 |
| 6 | NaN | 8.884615 | 9.884615 | -5.115385 | 6.884615 | -3.115385 | -1.115385 | 6.884615 | 5.884615 | 4.884615 |
| 7 | 7.884615 | 9.884615 | 1.884615 | -6.115385 | 0.884615 | -9.115385 | -6.115385 | -6.115385 | 1.884615 | 0.884615 |
| 8 | -3.115385 | -4.115385 | -8.115385 | -4.115385 | -8.115385 | -6.115385 | -8.115385 | -2.115385 | -2.115385 | 0.884615 |
| 9 | 3.884615 | 9.884615 | 3.884615 | 4.884615 | -4.115385 | -0.115385 | 0.884615 | 2.884615 | 0.884615 | 4.884615 |


In [13]:
squared_errors = errors**2
squared_errors

| User | 1 | 7 | 8 | 13 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.628698 | 83.090237 | 23.859467 | 37.397929 | 37.397929 | 50.628698 | 37.397929 | 83.090237 | 83.090237 | 83.090237 |
| 1 | 23.859467 | 0.013314 | 16.936391 | 23.859467 | 34.628698 | 0.013314 | 34.628698 | 0.013314 | 15.090237 | 0.013314 |
| 2 | NaN | 97.705621 | 97.705621 | 97.705621 | 26.16716 | 37.397929 | 50.628698 | 97.705621 | 97.705621 | 97.705621 |
| 3 | NaN | 4.474852 | 47.397929 | 47.397929 | 37.397929 | 37.397929 | 3.551775 | 37.397929 | 9.705621 | 1.244083 |
| 4 | 78.936391 | 62.16716 | 34.628698 | 9.705621 | 1.244083 | 83.090237 | 15.090237 | 0.013314 | 15.090237 | 34.628698 |
| 5 | 26.16716 | 65.859467 | 0.013314 | 16.936391 | 0.782544 | 65.859467 | 9.705621 | 1.244083 | 1.244083 | 26.16716 |
| 6 | NaN | 78.936391 | 97.705621 | 26.16716 | 47.397929 | 9.705621 | 1.244083 | 47.397929 | 34.628698 | 23.859467 |
| 7 | 62.16716 | 97.705621 | 3.551775 | 37.397929 | 0.782544 | 83.090237 | 37.397929 | 37.397929 | 3.551775 | 0.782544 |
| 8 | 9.705621 | 16.936391 | 65.859467 | 16.936391 | 65.859467 | 37.397929 | 65.859467 | 4.474852 | 4.474852 | 0.782544 |
| 9 | 15.090237 | 97.705621 | 15.090237 | 23.859467 | 16.936391 | 0.013314 | 0.782544 | 8.321006 | 0.782544 | 23.859467 |


In [14]:
mean_squared_errors = squared_errors.sum(numeric_only=True).sum() / squared_errors.count().sum(axis = 0)
mean_squared_errors

35.20918989812725

In [15]:
mean_squared_errors ** (1/2)

5.933733217640245

The step-by-step result matches the output of `get_RMSE`, so we can apply the function to the full training and test sets.

In [16]:
train_df_RMSE = get_RMSE(train_df, df_raw_avg)
train_df_RMSE

5.243578588345168

In [17]:
test_df_RMSE = get_RMSE(test_df, df_raw_avg)
test_df_RMSE

5.242529667597847
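
One way to read these two numbers: for any constant predictor, the mean squared error splits into the variance of the ratings plus the squared distance between the predictor and the mean of those ratings. With the training mean as the predictor, the training RMSE is therefore exactly the population standard deviation of the training ratings, and a test RMSE this close to it suggests the random split left both sets with a similar mean and spread. The sketch below checks the decomposition on made-up numbers; the variable names are illustrative only.

```python
import numpy as np

ratings = np.array([[1.0, np.nan], [3.0, 5.0]])  # made-up ratings, NaN = not rated
predictor = 2.0                                   # any constant prediction

# RMSE of the constant predictor over the observed ratings
rmse = np.sqrt(np.nanmean((ratings - predictor) ** 2))

# Decomposition: MSE = variance of the ratings + squared offset of the predictor
variance = np.nanvar(ratings)               # population variance (ddof=0)
offset = np.nanmean(ratings) - predictor
print(np.isclose(rmse ** 2, variance + offset ** 2))  # True
```

This also means no single constant can beat the mean of the data it is scored on, which is the motivation for adding user and item biases to the prediction.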