# Recommender Systems: Naive Approaches
## Introduction
In this task, we work with the MovieLens 1M dataset, which can be fetched from MovieLens. The set contains about 1,000,000 ratings given to about 4,000 movies by about 6,000 users. We implement naive approaches in Python and estimate their accuracy with the root mean squared error (RMSE) and the mean absolute error (MAE).

To make the results reliable, 5-fold cross-validation is used.
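
To illustrate what 5-fold cross-validation does with the rating rows, here is a minimal hand-rolled sketch in numpy. The helper `k_fold_indices` is hypothetical and is not scikit-learn's `KFold`, which the notebook itself uses. The 10 row positions are a toy stand-in for the real data.

```python
import numpy as np

def k_fold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs. Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)          # shuffle row positions once
    folds = np.array_split(shuffled, n_folds)   # cut into n_folds chunks
    for i in range(n_folds):
        test_idx = folds[i]                     # chunk i is held out
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

fold_sizes = []
tested_rows = []
for train_idx, test_idx in k_fold_indices(n_rows=10, n_folds=5):
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested_rows.extend(test_idx.tolist())
```

Each row ends up in the test set of exactly one fold, so every rating contributes to the test error once and to the training error four times.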

In [1]:
# Import packages
import numpy as np
import pandas as pd
import sklearn
import os
from sklearn.model_selection import KFold
from sklearn import linear_model

# Read the Ratings File
DATA_DIR = '../data'
RATING_FILE = 'ratings.csv'

ratings_raw = pd.read_csv(os.path.join(DATA_DIR, RATING_FILE),sep=',')

In [2]:
ratings = ratings_raw.copy()
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
# Data Wrangling
print ("The max rating is:",ratings.rating.max(),". And the min rating is:",ratings.rating.min())

The max rating is: 5.0 . And the min rating is: 0.5


In [4]:
# Raise ratings below 1 to 1 so that all ratings lie in [1, 5].
# Assigning through .loc avoids pandas' chained-assignment warning.
ratings.loc[ratings.rating < 1, 'rating'] = 1


In [None]:
# TODO: further cleaning (currently disabled)
'''
# Fill NaN values in the userId and movieId columns with 0
ratings['userId'] = ratings['userId'].fillna(0)
ratings['movieId'] = ratings['movieId'].fillna(0)

# Replace NaN values in the rating column with the average of all values
ratings['rating'] = ratings['rating'].fillna(ratings['rating'].mean())
'''

## Naive Approaches

We implement the first 4 formulas from slide 17: the global average rating, the average rating per item, the average rating per user, and an “optimal” linear combination of the 3 averages. (The global average does not depend on the specific user or item, so it is constant; we therefore model its contribution with a single parameter γ.)

The “average rating” recommender requires no further explanation. However, when building models for “average user rating” or “average movie rating” you must take into account that during the sampling process some users or some movies might disappear from the training sets – all their ratings will enter the test set. To handle such cases, use the “global average rating” as a fall-back value.
Additionally, improve predictions by rounding values bigger than 5 to 5 and smaller than 1 to 1 (valid ratings are always between 1 and 5).
Thus there are 4 naive approaches: global average, user average, movie average and a linear combination of the three averages (with fall-back rules).
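
The fall-back and clipping rules described above can be sketched on a few invented rows. The data below is illustrative, not taken from MovieLens, and user 3 is deliberately absent from the training rows.

```python
import pandas as pd

# Toy data, invented for illustration (not MovieLens rows).
train = pd.DataFrame({"userId": [1, 1, 2],
                      "movieId": [10, 20, 10],
                      "rating": [4.0, 5.0, 2.0]})
test = pd.DataFrame({"userId": [1, 3],
                     "movieId": [20, 10],
                     "rating": [5.0, 3.0]})

global_mean = train["rating"].mean()
user_mean = train.groupby("userId")["rating"].mean()

# Users absent from the training rows map to NaN,
# then fall back to the global mean.
pred = test["userId"].map(user_mean).fillna(global_mean)

# Keep predictions inside the valid rating range [1, 5].
pred = pred.clip(lower=1, upper=5)
```

Here user 1 receives their own training average, while user 3 receives the global training average. The same pattern applies to the movie average by grouping on `movieId`.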

In [19]:
nfolds=5

# set up the empty result list
err_train=[]
err_test=[]

In [17]:
kf = KFold(n_splits=nfolds, random_state = 0, shuffle = True)

In [27]:
for train_index, test_index in kf.split(ratings.userId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]
    # Global mean rating, computed on the training fold only
    gmr = train.rating.mean()
    # RMSE of predicting gmr for every rating. Computing it directly,
    # without adding a column to the slices, avoids the SettingWithCopy warning.
    err_train.append(np.sqrt(((train.rating - gmr) ** 2).mean()))
    err_test.append(np.sqrt(((test.rating - gmr) ** 2).mean()))


In [29]:
#print the final conclusion:
print("For the global average rating ")
print("Mean error on TRAIN: " + str(np.mean(err_train)))
print("Mean error on  TEST: " + str(np.mean(err_test)))

For the global average rating 
Mean error on TRAIN: 1.03607933603437
Mean error on  TEST: 1.036079372847295
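
The values printed in this notebook are RMSE values. The introduction also names MAE, so here is a minimal sketch of both metrics. The helper names `rmse` and `mae` are hypothetical, and the ratings are invented for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Invented ratings and a constant prediction of 3.0
actual = [4.0, 2.0, 5.0, 1.0]
predicted = [3.0, 3.0, 3.0, 3.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

Because RMSE squares the errors before averaging, it penalises large misses more heavily than MAE does, so RMSE is never smaller than MAE on the same predictions.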


## Average user rating

In [31]:
# set up the empty result list
err_train=[]
err_test=[]

for train_index, test_index in kf.split(ratings.userId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]
    ave_user = train.groupby(['userId'])['rating'].mean().to_frame().reset_index()
    ave_user.columns = ['userId', 'pred'] 
    
    train_pred = train.merge(ave_user,on=['userId'], how='left')
    test_pred = test.merge(ave_user,on=['userId'], how='left')
    
    # Fall back to the global mean for users missing from the training fold
    gmr = train.rating.mean()
    train_pred['pred'] = train_pred['pred'].fillna(gmr)
    test_pred['pred'] = test_pred['pred'].fillna(gmr)

    train_pred['diff'] = (train_pred.rating - train_pred.pred)**2
    err_train.append(np.sqrt(train_pred['diff'].mean()))
    test_pred['diff'] = (test_pred.rating - test_pred.pred)**2
    err_test.append(np.sqrt(test_pred['diff'].mean()))

In [32]:
#print the final conclusion:
print("For the user average rating ")
print("Mean error on TRAIN: " + str(np.mean(err_train)))
print("Mean error on  TEST: " + str(np.mean(err_test)))

For the user average rating 
Mean error on TRAIN: 0.9414273189579973
Mean error on  TEST: 0.9500774977254043


## Average movie rating

In [33]:
# set up the empty result list
err_train=[]
err_test=[]

for train_index, test_index in kf.split(ratings.movieId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]
    ave_movie = train.groupby(['movieId'])['rating'].mean().to_frame().reset_index()
    ave_movie.columns = ['movieId', 'pred'] 
    
    train_pred = train.merge(ave_movie,on=['movieId'], how='left')
    test_pred = test.merge(ave_movie,on=['movieId'], how='left')
    
    # Fall back to the global mean for movies missing from the training fold
    gmr = train.rating.mean()
    train_pred['pred'] = train_pred['pred'].fillna(gmr)
    test_pred['pred'] = test_pred['pred'].fillna(gmr)

    train_pred['diff'] = (train_pred.rating - train_pred.pred)**2
    err_train.append(np.sqrt(train_pred['diff'].mean()))
    test_pred['diff'] = (test_pred.rating - test_pred.pred)**2
    err_test.append(np.sqrt(test_pred['diff'].mean()))

In [34]:
#print the final conclusion:
print("For the movie average rating ")
print("Mean error on TRAIN: " + str(np.mean(err_train)))
print("Mean error on  TEST: " + str(np.mean(err_test)))

For the movie average rating 
Mean error on TRAIN: 0.9249538388169952
Mean error on  TEST: 0.9265259641995804


## Linear regression on user and movie averages

In [39]:
# set up the empty result list
err_train=[]
err_test=[]

for train_index, test_index in kf.split(ratings.userId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]

    ave_user = train.groupby(['userId'])['rating'].mean().to_frame().reset_index()
    ave_user.columns = ['userId', 'pred_user'] 
    
    ave_movie = train.groupby(['movieId'])['rating'].mean().to_frame().reset_index()
    ave_movie.columns = ['movieId', 'pred_movie'] 
    
   
    train_pred = train.merge(ave_user,on=['userId'], how='left')
    test_pred = test.merge(ave_user,on=['userId'], how='left')
    train_pred = train_pred.merge(ave_movie,on=['movieId'], how='left')
    test_pred = test_pred.merge(ave_movie,on=['movieId'], how='left')
    
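    # Fit rating ≈ a1 * pred_user + a2 * pred_movie + a0 on the training fold
    reg = linear_model.LinearRegression()

    # Use only rows where both averages and the rating are present
    mask = train_pred[['pred_user', 'pred_movie', 'rating']].notna().all(axis=1)
    reg.fit(train_pred.loc[mask, ['pred_user', 'pred_movie']], train_pred.loc[mask, 'rating'])

    # a1 and a2 weight the user and movie averages; the intercept a0 plays the role of γ
    a1, a2 = reg.coef_
    a0 = reg.intercept_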
    
    train_pred['pred']=train_pred['pred_user']*a1+train_pred['pred_movie']*a2+a0
    test_pred['pred']=test_pred['pred_user']*a1+test_pred['pred_movie']*a2+a0
    
    # Clip predictions to the valid rating range [1, 5]; NaN values pass through
    train_pred['pred'] = train_pred['pred'].clip(lower=1, upper=5)
    test_pred['pred'] = test_pred['pred'].clip(lower=1, upper=5)

    # Fall back to the global mean when the user or the movie is missing from the training fold
    gmr = train.rating.mean()
    train_pred['pred'] = train_pred['pred'].fillna(gmr)
    test_pred['pred'] = test_pred['pred'].fillna(gmr)

    train_pred['diff'] = (train_pred.rating - train_pred.pred)**2
    err_train.append(np.sqrt(train_pred['diff'].mean()))
    test_pred['diff'] = (test_pred.rating - test_pred.pred)**2
    err_test.append(np.sqrt(test_pred['diff'].mean()))

In [40]:
#print the final conclusion:
print("For the user movie linear regression rating ")
print("Mean error on TRAIN: " + str(np.mean(err_train)))
print("Mean error on  TEST: " + str(np.mean(err_test)))

For the user movie linear regression rating 
Mean error on TRAIN: 0.8511163694734373
Mean error on  TEST: 0.8592044774314213
