# DATA 612 Project 1 - Joke Recommender System

By Mike Silva

## Introduction

This recommender system provides our users with jokes that they will find funny.  By providing this content we will keep users engaged longer.

### About the Jester Dataset

For this project I will be using the [Jester dataset](http://eigentaste.berkeley.edu/dataset/).  It was created by Ken Goldberg at UC Berkeley (Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001).

The data files are distributed as .zip archives; when unzipped, they are in Excel (.xls) format.  The ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null", meaning "not rated").  Each row is a user.  The first column gives the number of jokes rated by the user.  The next 100 columns give the ratings for jokes 1 to 100.  I will only be using the first data set, which has data for users who have rated 36 or more jokes.

In [1]:
import os
import requests
import zipfile
import pandas as pd
import data612

# STEP 1 - DOWNLOAD THE DATA SET
if not os.path.exists("jester_dataset_1_1.zip"):
    # We need to download it
    response = requests.get("http://eigentaste.berkeley.edu/dataset/jester_dataset_1_1.zip")
    # Stop here on a failed download instead of failing later on a missing file
    response.raise_for_status()
    with open("jester_dataset_1_1.zip", "wb") as f:
        f.write(response.content)
# STEP 2 - EXTRACT THE DATA SET
if not os.path.exists("jester-data-1.xls"):
    with zipfile.ZipFile("jester_dataset_1_1.zip","r") as z:
        z.extract("jester-data-1.xls")
# STEP 3 - READ IN THE DATA
# The data is a continuous rating scale from -10 to 10.  99 is used if a user hasn't rated a joke.
# The data does not have a header.  Using the column numbers works great.
df = pd.read_excel("jester-data-1.xls",  header=None, na_values = 99)
# We should have a 24,983 X 101 data frame
df.shape

(24983, 101)

Since the first column (0) is the number of jokes rated by the user we can drop that so we are only left with the ratings.

In [2]:
df = df.drop([0], axis=1)

## Break into Training and Test Sets

Now that we have the data, we will break it into training and test sets.  There are 1,810,455 ratings in the data set, so this will take some time.  I am going to "cache" the output of this cell by saving it to disk and reading it back in on future runs if the files are present.

In [3]:
if not os.path.exists("project_1_train_df.csv") or not os.path.exists("project_1_test_df.csv"):
    train_df, test_df = data612.train_test_split(df)
    train_df.to_csv("project_1_train_df.csv", index = False)
    test_df.to_csv("project_1_test_df.csv", index = False)
else:
    train_df = pd.read_csv("project_1_train_df.csv")
    test_df = pd.read_csv("project_1_test_df.csv")
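
The `data612` module is not shown in this notebook, so the following is only a minimal sketch of one way a split of a user-item ratings matrix could work.  It is not the actual `data612.train_test_split`.  The helper name `split_ratings`, the `test_fraction` and `seed` parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_ratings(ratings, test_fraction=0.2, seed=42):
    """Hypothetical helper: split the observed ratings into train and test frames.

    Both outputs keep the shape of `ratings`.  Each observed rating lands in
    exactly one of them and is NaN in the other.
    """
    rng = np.random.default_rng(seed)
    observed = ratings.notna().to_numpy()
    # One uniform draw per cell; observed cells below the threshold go to test
    to_test = observed & (rng.random(ratings.shape) < test_fraction)
    train = ratings.mask(to_test)   # blank out the test cells
    test = ratings.where(to_test)   # keep only the test cells
    return train, test


# Toy ratings matrix: rows are users, columns are jokes, NaN means "not rated"
toy = pd.DataFrame(
    [[5.0, np.nan, -3.0], [2.0, 7.5, np.nan], [np.nan, -1.0, 4.0]],
    index=["u1", "u2", "u3"],
    columns=["j1", "j2", "j3"],
)
toy_train, toy_test = split_ratings(toy, test_fraction=0.3)
```

Keeping both frames the same shape as the original means the later steps can treat them as ordinary ratings matrices with missing values.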

## Calculate the Average Rating

Now that we have a training set, we need to calculate the raw average (mean) rating across every observed user-item combination.

In [4]:
raw_avg = train_df.sum(numeric_only=True).sum() / train_df.count().sum(axis = 0)
raw_avg

0.8801402409891428

A mean rating of about 0.88 on a scale from -10 to +10 is close to neutral, which suggests that, on average, users found the jokes only mildly funny.
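
To make the raw average concrete, here is a small worked example on a made-up 3 x 3 ratings matrix (the toy values are an assumption, not Jester data).  It applies the same sum-over-count expression used above and checks it against `numpy.nanmean`, which ignores missing ratings.

```python
import numpy as np
import pandas as pd

# Toy training matrix: six observed ratings, three missing
toy_train = pd.DataFrame(
    [[5.0, np.nan, -3.0], [2.0, 7.0, np.nan], [np.nan, -1.0, 4.0]],
    columns=["j1", "j2", "j3"],
)

# Same expression as the notebook cell: total of all ratings / number of ratings
toy_raw_avg = toy_train.sum(numeric_only=True).sum() / toy_train.count().sum()

# Equivalent one-liner: mean over all cells, skipping NaN
toy_raw_avg_alt = np.nanmean(toy_train.to_numpy())

# The six ratings sum to 14, so both values equal 14 / 6
print(toy_raw_avg, toy_raw_avg_alt)
```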

## Calculate the RMSE 

Now I will calculate the RMSE of the raw average, used as the prediction for every rating, on both the training and test data.

In [5]:
train_df_RMSE = data612.get_RMSE(train_df, raw_avg)
train_df_RMSE

5.236313400381647

In [6]:
test_df_RMSE = data612.get_RMSE(test_df, raw_avg)
test_df_RMSE

5.234060383407136
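
For reference, the RMSE is the square root of the mean squared difference between each observed rating and its prediction.  The sketch below shows one way to compute it while skipping unrated cells.  The function name `rmse` and the toy data are assumptions, and the real `data612.get_RMSE` may be implemented differently.

```python
import numpy as np
import pandas as pd


def rmse(actual, predicted):
    """Hypothetical helper: RMSE over the observed (non-NaN) ratings only.

    `predicted` may be a single number or a DataFrame labeled like `actual`.
    """
    errors = (actual - predicted).to_numpy(dtype=float)
    return float(np.sqrt(np.nanmean(errors ** 2)))


toy = pd.DataFrame(
    [[5.0, np.nan, -3.0], [2.0, 7.0, np.nan], [np.nan, -1.0, 4.0]],
    columns=["j1", "j2", "j3"],
)

# Predicting 2.0 everywhere gives errors 3, -5, 0, 5, -3, 2
# Squared errors sum to 72 over 6 ratings, so the RMSE is sqrt(12)
toy_rmse = rmse(toy, 2.0)
```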

## Calculate the Biases

We can now calculate the user and item biases.

In [7]:
user_bias_train_df, item_bias_train_df = data612.get_biases(train_df, raw_avg)
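
A common definition, which I assume here, is that a user's bias is that user's mean rating minus the raw average, and a joke's bias is that joke's mean rating minus the raw average.  The sketch below follows that definition on toy data.  The helper name `biases` is hypothetical, and `data612.get_biases` may differ in its details.

```python
import numpy as np
import pandas as pd


def biases(ratings, raw_avg):
    """Hypothetical helper: per-user and per-item deviations from the raw average."""
    user_bias = ratings.mean(axis=1) - raw_avg  # row means skip NaN by default
    item_bias = ratings.mean(axis=0) - raw_avg  # column means skip NaN by default
    return user_bias, item_bias


toy = pd.DataFrame(
    [[5.0, np.nan, -3.0], [2.0, 7.0, np.nan], [np.nan, -1.0, 4.0]],
    index=["u1", "u2", "u3"],
    columns=["j1", "j2", "j3"],
)
toy_raw_avg = np.nanmean(toy.to_numpy())  # 14 / 6
toy_user_bias, toy_item_bias = biases(toy, toy_raw_avg)
```

A positive user bias marks a generous rater, and a positive item bias marks a joke that is rated above average.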

## Baseline Predictors

Now we can make our baseline predictions.  We will test them on the small data frame.  After verifying that the calculations are correct, we will apply them to the training set.

In [8]:
baseline_predictions_df = data612.get_baseline_predictions(raw_avg, user_bias_train_df, item_bias_train_df)

In [9]:
data612.get_RMSE(train_df, baseline_predictions_df)

4.280588684895016

In [10]:
data612.get_RMSE(test_df, baseline_predictions_df)

4.3586750167649
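
As a sketch of the idea behind these numbers, a baseline predictor adds the user bias and the item bias to the raw average for every user-joke pair.  The version below also clips the result to the valid rating range of -10 to +10.  The clipping, the helper name `baseline_predictions`, and the toy inputs are my assumptions; the source does not show whether `data612.get_baseline_predictions` does the same.

```python
import numpy as np
import pandas as pd


def baseline_predictions(raw_avg, user_bias, item_bias, low=-10.0, high=10.0):
    """Hypothetical helper: raw average + user bias + item bias, clipped to the scale."""
    # Broadcast a column of user biases against a row of item biases
    grid = raw_avg + user_bias.to_numpy()[:, None] + item_bias.to_numpy()[None, :]
    grid = np.clip(grid, low, high)  # assumption: keep predictions on the rating scale
    return pd.DataFrame(grid, index=user_bias.index, columns=item_bias.index)


toy_user_bias = pd.Series([1.0, -2.0], index=["u1", "u2"])
toy_item_bias = pd.Series([0.5, 9.5], index=["j1", "j2"])
toy_predictions = baseline_predictions(1.0, toy_user_bias, toy_item_bias)

# u1 / j1: 1.0 + 1.0 + 0.5 = 2.5
# u1 / j2: 1.0 + 1.0 + 9.5 = 11.5, clipped to 10.0
print(toy_predictions)
```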