# Data 612 - Project 1 : Global Baseline Predictors and RMSE

#### Team Info 
+ Christina Valore, 
+ Juliann McEachern, 
+ Rajwant Mishra

#### Date: June 11, 2019


## Overview

In this first assignment, we'll attempt to predict ratings with very little information.  We'll first look at just raw averages across all (training dataset) users.  We'll then account for "bias" by normalizing across users and across items.   You'll be working with ratings in a user-item matrix, where each rating may be (1) assigned to a training dataset, (2) assigned to a test dataset, or (3) missing. 

Please code as much of your work as possible in R or Python.  You may use standard functions (e.g. from base R and the tidyverse).  Your project should be delivered in an R Markdown or a Jupyter notebook, then the notebook should be saved into a GitHub repository.  You should include a link to your GitHub repository in your assignment submission link. 

## Recommender System Data Selection

We built a recommender system that suggests movies based on IMBD rating scores from 1 - 10. The following movies were selected and scraped from IMBD using R: 

![Example](Data/Movie_Titles.JPG)

The Rmd file and output datasets are stored in the `Data` file within this repository. The dataset includes missing ratings across all movies as not all users rated all selected titles. 

#### Data Preparation

Data was imported using panda and cleated to remove all blank and unnessary columns from webscrapping. 

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#load data from csv in github repository into pandas dataframe
data = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-1/Data/User%20_Mov%20.csv')
movie = pd.read_csv('https://raw.githubusercontent.com/jemceach/612-group/master/project-1/Data/Mov%20.csv')

#select relevant columns
data = data[['Movie_ID','User_Name', 'Rating']]

#preview data
data.head()

Unnamed: 0,Movie_ID,User_Name,Rating
0,1,TheTopDawgCritic,8
1,1,celestinoavilajr,8
2,1,dave-mcclain,8
3,1,neener3707,7
4,1,krice23,8


#### User-Item Matrix

In [74]:
user_matrix = data.pivot_table(index='User_Name', columns='Movie_ID',values='Rating')

#replace column numbers with movie names
new_cols = list(movie.Movie_Name) 
user_matrix.rename(columns=dict(zip(user_matrix.columns, new_cols)),inplace=True)

user_matrix.head()

Movie_ID,Bad Samaritan,Avengers: InfinityWar,Rampage,Truth or Dare,Incredibles 2,A Star Is Born,KungFu Panda,Shrek,Shrek 2,Love Actually,...,The Nightingale,Blue Ruin,Hidden Figures,The Help,Pretty Ugly People,Green Room,A Long Way Down,28 Weeks Later,Captain Fantastic,The Road
User_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3xHCCH,,,,,,,10.0,,,,...,,,,,,,,,,
A_Different_Drummer,,,,,,,,,,,...,,10.0,9.0,,,,,,,
AdrenalinDragon,,,,,,,,,,,...,,,,,,,,,,
AirBourne_Bds,,,,,,,,,,10.0,...,,,,,,,,,,
Ajk2386,,,,,6.0,,,,,,...,,,,,,,,,,


#### Split Data 
The train/test datasets were created from data and split 80/20 using a random, binary vector.

In [75]:
# set seed
np.random.seed(30)

# split train/test data
ratio = 0.80
split = np.random.rand(len(user_matrix)) < ratio
train = user_matrix[split]  
test = user_matrix[~split] 

## RMSE 

#### Raw Mean Calculations

Raw average (mean) ratings were created for all training, user-item combination. We found that the raw mean average for the training set was 6.59. 

In [76]:
import warnings
warnings.simplefilter('ignore')

# Revert train/test dataset to long format 
train_long = pd.melt(train, 
            id_vars='User_ID', 
            value_vars=list(train.columns[1:]), 
            var_name='Movie_ID', 
            value_name='Rating')

# Revert train/test dataset to long format 
test_long = pd.melt(test, 
            id_vars='User_ID', 
            value_vars=list(test.columns[1:]), 
            var_name='Movie_ID', 
            value_name='Rating')

# Calculate raw average ratings 
train_raw_avg = train_long.Rating.mean(skipna = True)

print("Raw avg for test: ", train_raw_avg)

Raw avg for test:  6.591324200913242


#### RMSE

The `rmse` function finds the square root of the mean from the squared difference of all actual and predicted data points for the training data and test data. Using the raw avaergage only, the RMSE for the train was slightly higher at 3.14 than the test at 3.08.

In [77]:
def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

train_rmse = rmse(train_long.Rating, train_raw_avg)
test_rmse = rmse(test_long.Rating, train_raw_avg)

print("RMSE for train: ", train_rmse.round(2))
print("RMSE for test: ", test_rmse.round(2))

RMSE for train:  3.14
RMSE for test:  3.08


## Baseline Predictors

#### Calculate Bias

Bias calculations for `movie` from the raw average:

In [78]:
# bias calculation
movie_bias = train.mean(axis=0)-train_raw_avg

# iterate to assign output value to index
for i in range(movie_bias.shape[0]):
    movie_bias
    
# preview data
movie_bias.head(5).round(2)

Movie_ID
Bad Samaritan            1.09
Avengers: InfinityWar    0.15
Rampage                  1.11
Truth or Dare           -1.59
Incredibles 2           -2.67
dtype: float64

Bias calculations for `user` from the raw average:

In [79]:
# bias calculation
user_bias = train.mean(axis=1)- train_raw_avg

# iterate to assign output value to index
for i in range(user_bias.shape[0]):
    user_bias[i] = user_bias[i] 

# preview data
user_bias.head(5).round(2)

User_Name
3xHCCH                 3.41
A_Different_Drummer    2.91
AdrenalinDragon       -1.59
AirBourne_Bds          3.41
AlsExGal               2.41
dtype: float64

#### Baseline Predictors

Raw Avg + Bias User + Bias of Movie

In [80]:
# create empty dataframe for loop calculations
baseline = pd.DataFrame(index=train.index, columns=movie_bias.index).fillna(0)  

# iterate baseline calculations
for r in range(baseline.shape[0]):
    for c in range(movie_bias.shape[0]):
        baseline.iloc[r] = train_raw_avg + user_bias[r] + movie_bias

# set ceiling and floor for ratings greater than 10 and less than 1
baseline[baseline < 1] = 1
baseline[baseline > 10] = 10

# preview data
baseline.head(5).round(2)

Movie_ID,Bad Samaritan,Avengers: InfinityWar,Rampage,Truth or Dare,Incredibles 2,A Star Is Born,KungFu Panda,Shrek,Shrek 2,Love Actually,...,The Nightingale,Blue Ruin,Hidden Figures,The Help,Pretty Ugly People,Green Room,A Long Way Down,28 Weeks Later,Captain Fantastic,The Road
User_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3xHCCH,10.0,10.0,10.0,8.41,7.33,8.28,10.0,10.0,10.0,8.41,...,10.0,10.0,9.61,9.1,9.23,9.51,10.0,7.3,10.0,8.53
A_Different_Drummer,10.0,9.65,10.0,7.91,6.83,7.78,10.0,10.0,10.0,7.91,...,10.0,10.0,9.11,8.6,8.73,9.01,9.59,6.8,10.0,8.03
AdrenalinDragon,6.09,5.15,6.11,3.41,2.33,3.28,7.72,7.11,7.02,3.41,...,6.21,7.41,4.61,4.1,4.23,4.51,5.09,2.3,7.48,3.53
AirBourne_Bds,10.0,10.0,10.0,8.41,7.33,8.28,10.0,10.0,10.0,8.41,...,10.0,10.0,9.61,9.1,9.23,9.51,10.0,7.3,10.0,8.53
AlsExGal,10.0,9.15,10.0,7.41,6.33,7.28,10.0,10.0,10.0,7.41,...,10.0,10.0,8.61,8.1,8.23,8.51,9.09,6.3,10.0,7.53


#### RMSE for the Baseline Predictors 

RMSE was calculated from baseline predictors for both training data and test data. Using the baseline predictor, we saw that the RMSE for the training data lowered signifigantly to 1.18, while the testing RMSE increased to 4.18.

In [81]:
# convert baseline to long format 
baseline_long = pd.melt(baseline, 
            id_vars='User_Name', 
            value_vars=list(baseline.columns[1:]), 
            var_name='Movie_ID', 
            value_name='Rating')

# calculate rmse
trainb_rmse = rmse(train_long.Rating,baseline_long.Rating)
testb_rmse = rmse(test_long.Rating,baseline_long.Rating)

print("Baseline RMSE for train: ", trainb_rmse.round(2))
print("Baseline RMSE for test: ", testb_rmse.round(2))

Baseline RMSE for train:  1.18
Baseline RMSE for test:  4.18


## Test Calculations with Subset Data

In [53]:
#create small subset of the data for manual calculations, 5 rows to be exact
sub_train_mtx=train.loc[['Platypuschow', 'claudio_carvalho', 'TheTopDawg']].iloc[:,:5]
sub_test_mtx=test.loc[['TheTopDawgCritic', 'neener3707', 'Ajk2386']].iloc[:,:5]

# view sub_train
sub_train_mtx

Movie_ID,Bad Samaritan,Avengers: InfinityWar,Rampage,Truth or Dare,Incredibles 2
User_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Platypuschow,7.0,9.0,6.0,,
claudio_carvalho,8.0,,6.0,3.0,
TheTopDawg,8.0,,8.0,,


In [54]:
# view sub_test
sub_test_mtx

Movie_ID,Bad Samaritan,Avengers: InfinityWar,Rampage,Truth or Dare,Incredibles 2
User_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TheTopDawgCritic,8.0,,8.0,8.0,
neener3707,7.0,,,3.0,
Ajk2386,,,,,6.0


In [55]:
# RMSE manual calculations
sub_train = np.array([7,8,8,9,6,6,8,3])
sub_train_avg = sub_train.mean()
sub_train_se = (sub_train - sub_train_avg)**2
sub_train_mean = sub_train_se.mean()
sub_train_rmse = np.sqrt(sub_train_mean).round(3)
print('sub_train RMSE : ', sub_train_rmse)

sub_test = np.array([8,7,8,8,3,6])
sub_test_avg = sub_test.mean()
sub_test_se = (sub_test - sub_test_avg)**2
sub_test_mean = sub_test_se.mean()
sub_test_rmse = np.sqrt(sub_test_mean).round(3)
print('sub_test RMSE : ', sub_test_rmse)

sub_train RMSE :  1.763
sub_test RMSE :  1.795



## Summary 

Our initiali RMSE using the raw average only was train: 3.14 and test: 3.08. After calculating the baseline predictor and attempting to refit the model, we saw that while the train RMSE lowered to 1.18, the test RMSE increased to 4.18 which indicates that our model worsened the likelihood that our recommender system would be accurate in predicting movies a user would want to watch next. 

The higher RMSE for the test could be due to the fact that some movies were rated much less than others or that the movie genres were very different from eachother making it harder to predict which movie the user will watch next. 

## Appendix : Data Scrapping from IMDB using 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Parsing of HTML/XML files  
library(rvest)    

# String manipulation
library(stringr)   

# Verbose regular expressions
#install.packages("rebus")
library(rebus)     

# Eases DateTime manipulation
library(lubridate)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

R Scrapping Function
------------------

-   Reading only Rating , Title , Username and content of comments