# Recommender Systems
## Assignment 1

This assignment is for CUNY's DATA 643 Recommender Systems.

---

###### Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”

This project's goal is to build a joke recommender system to make us laugh in
these trying times.

To improve a user's mood, we want them to encounter as many high-quality jokes and as few low-quality jokes as possible. We will use basic mean imputation, together with joke and user biases, to predict ratings for jokes a user has not yet rated. The predictions will be validated on a test set.

---

###### Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.

I downloaded the dataset from [this](http://eigentaste.berkeley.edu/dataset/) site.
The dataset is quite wide at 101 columns.

This is known as a user-item matrix because each row is a user and each column is an item, in this case a joke.

In [2]:
import pandas as pd
import numpy as np
df = pd.read_excel('jester-data-1.xls', header=None)


In [2]:
df.iloc[0:5, 0:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98
1,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88
2,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0
3,48,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,99.0
4,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44


In [10]:
df.shape


(24983, 101)

The dataset contains ratings for 100 jokes, plus a first column holding the number of jokes each user rated. The user ID is given by the pandas index.

In [3]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()

In [4]:
# Drop the count column and treat the sentinel value 99 as a missing rating.
jokes = df.iloc[:, 1:].replace(99, np.nan)
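
As a sanity check on the layout described above, the count column should equal the number of non-missing ratings in each row once 99 is treated as missing. Here is a minimal sketch on a hypothetical three-user toy frame (the values are made up and are not taken from the Jester file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same layout as the raw file:
# column 0 = number of jokes rated, remaining columns = ratings, 99 = not rated.
toy = pd.DataFrame([
    [3, 1.5, -2.0, 4.0],
    [1, 99.0, 99.0, 7.5],
    [2, -3.0, 99.0, 0.5],
])

toy_ratings = toy.iloc[:, 1:].replace(99, np.nan)

# Count the observed ratings per user and compare with the stored count.
observed = toy_ratings.notna().sum(axis=1)
counts_match = bool((observed == toy[0]).all())
print(counts_match)
```

Running the same comparison on `df` and `jokes` would show whether the first column really is the per-user rating count.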


In [5]:
def remover(p):
    # Make the axis lines effectively invisible
    p.xaxis.axis_line_width = 0.00001
    p.yaxis.axis_line_width = 0.00001
    # Fonts
    p.title.text_font = "times"
    p.title.text_font_style = "normal"
    p.xaxis.axis_label_text_font = 'times'
    p.xaxis.axis_label_text_font_style = 'normal'
    p.yaxis.axis_label_text_font = 'times'
    p.yaxis.axis_label_text_font_style = 'normal'
    # This removes the outline of the graph.
    p.outline_line_color = None
    p.toolbar.logo = None
    p.toolbar_location = None
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p

In [7]:
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show, output_file

p1 = figure(title="Frequency of Number of Jokes Rated")

hist, edges = np.histogram(df[0], density=True, bins=30)

#x = np.linspace(-2, 2, 1000)
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
    fill_color="skyblue", line_color="grey")
p1.xaxis.axis_label = 'Total Jokes Rated'
p1.yaxis.axis_label = 'Proportion of Responses'
p1 = remover(p1)
show(p1)

We have some complete cases here. However, even if being a complete case were not correlated with joke preferences, restricting ourselves to them would throw out a large amount of training data.

Since we must train on users with missing ratings, and must also predict the NA values for those same users, we need a way to validate the predictions.

To do this we can select random ratings to be our test set.

I removed the first column because it holds the number of jokes rated, which we do not need for predicting ratings.

In [6]:
p1 = figure(title="Rating Distribution")
allHist = pd.melt(jokes)
hist, edges = np.histogram(allHist['value'].dropna(), density=True, bins=30)
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
    fill_color="skyblue", line_color="grey")
p1 = remover(p1)
show(p1)

The ratings appear to favor values just above the midpoint of the scale. Also,
more people seem willing to give the maximum negative value, indicating they hated
the joke, than the maximum positive value. Conversely, more ratings overall are
positive than negative.

---

## Simple Predictions

We are tasked with predicting ratings for jokes that users have not yet
rated. There is no way to verify what they would actually rate, so instead we
validate with a training and test set.

Once we have a strategy, we can apply it to the `NA` values, which are spread among all but a few responders.

---

###### Break your ratings into separate training and test datasets.


In [9]:
np.random.seed(101)
trainTestMask = np.random.choice([True, False], size=jokes.shape, p=[0.7, 0.3])
trainTestMask = pd.DataFrame(trainTestMask)
trainTestMask.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,True,True,True,True,True,False,True,False,False,True,...,True,True,True,True,True,False,True,True,False,True
1,True,True,True,True,False,True,True,True,False,True,...,True,True,True,True,True,False,True,False,True,True
2,True,True,True,True,False,True,True,True,True,True,...,True,True,False,False,False,True,True,False,True,True
3,True,True,True,True,True,True,False,True,True,True,...,True,True,True,True,True,False,False,True,True,True
4,True,True,True,True,True,True,True,True,True,True,...,False,True,True,True,False,True,False,True,True,True


This gives us a way to randomly select training and test sets, but we still need to account for the NA values already present in the data.

In [10]:
# Cells where the mask is True go to training; the remaining cells go to test.
train = pd.DataFrame(np.where(trainTestMask, jokes, np.nan))
test = pd.DataFrame(np.where(np.invert(trainTestMask), jokes, np.nan))
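
A mask-based split should leave no rating in both sets, and the two sets together should recover every observed rating. The following is a minimal sketch of that check on hypothetical toy data. It uses `DataFrame.where` and `np.random.default_rng` rather than the `np.where` and `np.random.seed` calls above, and the seed and sizes are arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the sketch

# Hypothetical toy ratings with some values already missing.
toy = pd.DataFrame(rng.uniform(-10, 10, size=(6, 4)))
toy.iloc[0, 1] = np.nan
toy.iloc[3, 2] = np.nan

mask = rng.random(toy.shape) < 0.7   # True -> training cell
toy_train = toy.where(mask)          # keep training cells, NaN elsewhere
toy_test = toy.where(~mask)          # keep test cells, NaN elsewhere

# No cell is observed in both sets ...
overlap = int((toy_train.notna() & toy_test.notna()).sum().sum())
# ... and together they recover every originally observed rating.
recovered = toy_train.fillna(toy_test)
print(overlap, recovered.equals(toy))
```

Cells that were already missing stay missing in both sets, which is why the later RMSE calculations skip them.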

We have our training and test data. We first need to calculate the mean for the total training set and calculate the RMSE on the test set. 

In [11]:
rawMean = train.stack().mean()
rawMean

0.8783613614046262

The raw mean is close to zero. We will use it as the prediction for every NA value and calculate the RMSE.

###### Using your training data, calculate the raw average (mean) rating for every user-item combination.
- Calculate the RMSE for raw average for both your training data and your test data.

In [12]:
testRaw = train.fillna(value=rawMean)

In [13]:
np.sqrt(((pd.melt(test)['value'] - rawMean)**2).mean()) #  This ignores NA values that were present before too.

5.2324418349094142

In [14]:
np.sqrt(((pd.melt(train)['value'] - rawMean)**2).mean()) #  This ignores NA values that were present before too.

5.2373261451264863

So we have an RMSE of 5.23 for the test set and, surprisingly, a slightly higher one (5.24) for the training set. This is a bit odd because we would expect some overfitting. However, given that we are just taking the mean, it is unlikely that the difference is statistically meaningful.
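
The RMSE expression is repeated several times in this notebook, so it could be wrapped in a small helper. This is a minimal sketch, where `rmse` is a hypothetical name and not a function from the cells above. It skips cells where the actual rating is missing, matching the behavior noted in the comments:

```python
import numpy as np
import pandas as pd

def rmse(pred, actual):
    """RMSE over cells where `actual` is observed; NaN cells are skipped."""
    actual = np.asarray(actual, dtype=float)
    # Allow a scalar prediction (e.g. the raw mean) or a full matrix.
    pred = np.broadcast_to(np.asarray(pred, dtype=float), actual.shape)
    observed = ~np.isnan(actual)
    return float(np.sqrt(np.mean((pred[observed] - actual[observed]) ** 2)))

# Toy example: three observed ratings (1, 3, 5) predicted by the constant 3.
toy_actual = pd.DataFrame([[1.0, np.nan], [3.0, 5.0]])
print(rmse(3.0, toy_actual))
```

With such a helper, the two cells above would reduce to `rmse(rawMean, test)` and `rmse(rawMean, train)`.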

## More accuracy

To get more accuracy, we will need to add the column means and row means to the raw mean.

The formula should look like:

$$
pred = \text{RawMean} + \text{SpecificColumnMean} + \text{SpecificRowMean}
$$

Another way to think of it, for user (row) $i$ and joke (column) $j$, is:

$$
pred_{i,j} = \text{RawMean} + \text{RowMean}_i + \text{ColMean}_j
$$

###### Using your training data, calculate the bias for each user and each item.


In [15]:
colMean = train.mean()
rowMean = train.mean(axis=1)

In [16]:
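preds = np.full(jokes.shape, rawMean)
# Add each joke's mean to every row (broadcast along columns).
preds += colMean.to_numpy()
# Add each user's mean to every column (broadcast along rows).
preds = preds + rowMean.to_numpy()[:, np.newaxis]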
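
The cell above relies on NumPy broadcasting: a length-`n_jokes` vector is added to every row, and a column vector of shape `(n_users, 1)` is added to every column. Here is a minimal sketch with made-up numbers for two users and three jokes:

```python
import numpy as np

raw_mean = 1.0
col_mean = np.array([0.5, -0.5, 2.0])   # one value per joke (column)
row_mean = np.array([1.0, -1.0])        # one value per user (row)

# Shape (3,) broadcasts across rows; shape (2, 1) broadcasts across columns.
toy_preds = raw_mean + col_mean + row_mean[:, np.newaxis]
print(toy_preds)
# [[ 2.5  1.5  4. ]
#  [ 0.5 -0.5  2. ]]
```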

In [17]:
preds = pd.DataFrame(preds)
preds.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-1.885926,-2.646217,-2.469234,-4.251214,-2.334262,-1.162556,-3.202825,-3.425682,-3.344824,-1.531391,...,-0.757232,-1.621067,-0.324863,-1.583458,-1.836889,-1.352553,-1.123196,-2.061757,-2.840812,-1.467379
1,3.9275,3.167209,3.344193,1.562212,3.479165,4.65087,2.610602,2.387745,2.468603,4.282036,...,5.056195,4.192359,5.488563,4.229969,3.976538,4.460873,4.69023,3.75167,2.972614,4.346047
2,8.84025,8.079959,8.256942,6.474962,8.391914,9.56362,7.523351,7.300494,7.381352,9.194785,...,9.968944,9.105109,10.401313,9.142718,8.889287,9.373623,9.60298,8.664419,7.885364,9.258797
3,5.250582,4.490291,4.667274,2.885294,4.802246,5.973952,3.933683,3.710827,3.791684,5.605117,...,6.379276,5.515441,6.811645,5.55305,5.299619,5.783955,6.013312,5.074751,4.295696,5.669129
4,5.170784,4.410493,4.587476,2.805496,4.722449,5.894154,3.853885,3.631029,3.711887,5.525319,...,6.299478,5.435643,6.731847,5.473252,5.219822,5.704157,5.933514,4.994953,4.215898,5.589331


In [18]:
np.sqrt(((preds - test)**2).stack().mean())

4.68875588784186

In [19]:
np.sqrt(((preds - train)**2).stack().mean())

4.6173021500924856

Adding the column and row biases led to a significant improvement in the RMSE, from about 5.23 to 4.69 on the test set. We now see the model slightly overfitting: the training RMSE (4.62) is lower than the test RMSE (4.69).
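
Baseline predictors are often written with the biases defined as deviations from the global mean, that is, user bias = user mean − raw mean and item bias = item mean − raw mean, with predictions clipped to the rating scale. That differs from the cells above, which add the row and column means themselves. The following is a minimal sketch of the deviation-based variant on hypothetical toy data. It assumes a rating scale of -10 to 10, and no results for the real dataset are implied:

```python
import numpy as np
import pandas as pd

# Hypothetical toy training matrix: 3 users x 3 jokes, NaN = held out or unrated.
toy_train = pd.DataFrame([
    [8.0, np.nan, 6.0],
    [-2.0, -4.0, np.nan],
    [np.nan, 1.0, 3.0],
])

mu = toy_train.stack().mean()                          # global mean of observed ratings
user_bias = (toy_train.mean(axis=1) - mu).to_numpy()   # deviation per user (row)
item_bias = (toy_train.mean(axis=0) - mu).to_numpy()   # deviation per joke (column)

baseline = mu + user_bias[:, np.newaxis] + item_bias
baseline = np.clip(baseline, -10, 10)                  # keep predictions on the scale
print(pd.DataFrame(baseline).round(2))
```

In this toy case the global mean is 2.0, the first user's bias is +5.0 and the first joke's bias is +1.0, so the first cell is predicted as 8.0.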

## Predictions
We have already made predictions for the entire data frame; now we just need to add them to the main dataset.

In [20]:
isNA = jokes.isnull()

In [21]:
fullDF = pd.DataFrame(np.where(isNA, preds, jokes))
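
`np.where` fills by position rather than by label. That matters here because `jokes` keeps the column labels 1 to 100 from the raw file, while `preds` is labelled 0 to 99. Here is a minimal sketch on hypothetical toy frames that does the same positional fill but keeps the original row and column labels on the result:

```python
import numpy as np
import pandas as pd

# Toy frames with deliberately different column labels, as in the notebook.
toy_jokes = pd.DataFrame([[1.0, np.nan], [np.nan, 4.0]], columns=[1, 2])
toy_preds = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], columns=[0, 1])

# Fill missing cells by position, then restore the original labels.
filled = pd.DataFrame(
    np.where(toy_jokes.isna().to_numpy(), toy_preds.to_numpy(), toy_jokes.to_numpy()),
    index=toy_jokes.index,
    columns=toy_jokes.columns,
)
print(filled)
```

Observed ratings are left untouched, and only the missing cells take the predicted values.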

In [22]:
fullDF.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,-4.76,...,2.82,-1.621067,-0.324863,-1.583458,-1.836889,-1.352553,-5.63,-2.061757,-2.840812,-1.467379
1,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,9.22,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,8.84025,8.079959,8.256942,6.474962,9.03,9.27,9.03,9.27,7.381352,9.194785,...,9.968944,9.105109,10.401313,9.08,8.889287,9.373623,9.60298,8.664419,7.885364,9.258797
3,5.250582,8.35,4.667274,2.885294,1.8,8.16,-2.82,6.21,3.791684,1.84,...,6.379276,5.515441,6.811645,0.53,5.299619,5.783955,6.013312,5.074751,4.295696,5.669129
4,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,5.73,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6
5,-6.17,-3.54,0.44,-8.5,-7.09,-4.32,-8.69,-0.87,-6.65,-1.8,...,-3.54,-6.89,-0.68,-2.96,-2.18,-3.35,0.05,-9.08,-5.05,-3.45
6,6.109847,5.349555,5.526539,3.744558,8.59,-9.85,7.72,8.79,4.650949,6.464382,...,7.238541,6.374705,7.670909,6.412315,6.158884,2.33,6.872576,5.934016,5.15496,6.528393
7,6.84,3.16,9.17,-6.21,-8.16,-1.7,9.27,1.41,-5.19,-4.42,...,7.23,-1.12,-0.1,-5.68,-3.16,-3.35,2.14,-0.05,1.31,0.0
8,-3.79,-3.54,-9.42,-6.89,-8.74,-0.29,-5.29,-8.93,-7.86,-1.6,...,4.37,-0.29,4.17,-0.29,-0.29,-0.29,-0.29,-0.29,-3.4,-4.95
9,3.01,5.15,5.15,3.01,6.41,5.15,8.93,2.52,3.01,8.16,...,7.604834,4.47,8.037203,6.778608,6.525177,7.009512,7.23887,6.300309,5.521254,6.894686


We can see that the more bias we account for, the lower the RMSE.

---