# Recommender Systems: Naive Approaches
## Introduction
In this task, we work with the MovieLens 1M dataset, which can be downloaded from MovieLens. The set contains about 1,000,000 ratings given to about 4,000 movies by about 6,000 users. We implement several naive approaches in Python and estimate their accuracy with the root mean squared error (RMSE), the square root of the mean squared difference between predicted and actual ratings.

To make sure that the results are reliable, 5-fold cross-validation is used.
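
As a rough illustration of this evaluation scheme, the sketch below averages a test RMSE over 5 folds. The toy ratings, the `rmse` helper and the variable names are assumptions made for this example only; they are not part of the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Hypothetical toy ratings, used only for illustration.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.0, 1.0, 3.0, 5.0, 2.0, 4.0],
})

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(toy):
    train, test = toy.iloc[train_idx], toy.iloc[test_idx]
    # Predict the training mean for every held-out rating.
    prediction = np.full(len(test), train["rating"].mean())
    fold_errors.append(rmse(test["rating"], prediction))

# The reported score is the mean of the per-fold errors.
cv_rmse = float(np.mean(fold_errors))
print(f"{len(fold_errors)} folds, mean test RMSE = {cv_rmse:.3f}")
```

Each rating lands in the test fold exactly once, so the averaged error does not depend on a single lucky or unlucky split.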

In [80]:
# Import the packages
import numpy as np
import pandas as pd
import sklearn
import os
from sklearn.model_selection import KFold
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore') #ignore the warnings

# Read the ratings file
DATA_DIR = '../data'
RATING_FILE = 'ratings.csv'
ratings_raw = pd.read_csv(os.path.join(DATA_DIR, RATING_FILE), sep=',')

In [81]:
ratings = ratings_raw.copy()
ratings.head()

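|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |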


In [82]:
# Data Wrangling
print ("The max rating is:",ratings.rating.max(),". And the min rating is:",ratings.rating.min())

The max rating is: 5.0 . And the min rating is: 0.5


In [83]:
# Set the lower bound to 1 instead of 0.5
ratings.loc[ratings.rating < 1, 'rating'] = 1

# Check whether there are NaN values
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

There are no NaN values in any column, including userId and movieId.

## Naive Approaches

The “average rating” recommender requires no further explanation. We will deploy four naive approaches:
- the global average rating
- the average rating per item
- the average rating per user
- an “optimal” linear combination of the averages (fitted here on the per-user and per-item averages plus an intercept)
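
As a compact summary, the four predictors can be written as formulas. The notation is an assumption introduced here for illustration: $T$ is the training set of (user, item) pairs, $T_u$ and $T_i$ are the training ratings of user $u$ and item $i$, and $a_0, a_1, a_2$ correspond to the intercept and the two coefficients fitted by the regression in section 4.

$$
\begin{aligned}
\hat r_{ui}^{\text{global}} &= \mu = \frac{1}{|T|} \sum_{(u,i) \in T} r_{ui} \\
\hat r_{ui}^{\text{item}} &= \bar r_i = \frac{1}{|T_i|} \sum_{(u,i) \in T_i} r_{ui} \\
\hat r_{ui}^{\text{user}} &= \bar r_u = \frac{1}{|T_u|} \sum_{(u,i) \in T_u} r_{ui} \\
\hat r_{ui}^{\text{lin}} &= a_0 + a_1 \, \bar r_u + a_2 \, \bar r_i
\end{aligned}
$$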


### 1. The global average rating

In [84]:
# set up the empty result list
err_train1=[]
err_test1=[]

# set up the cross validation with 5 folds
nfolds=5
kf = KFold(n_splits=nfolds, random_state = 0, shuffle = True)
for train_index, test_index in kf.split(ratings.userId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]
    # calculate the global mean rating on the training fold
    gmr = train.rating.mean()
    # calculate the RMSE (square root of the mean squared error)
    err_train1.append(np.sqrt(((train.rating - gmr) ** 2).mean()))
    err_test1.append(np.sqrt(((test.rating - gmr) ** 2).mean()))

In [85]:
#print the final conclusion:
print("For the global average rating ")
print("RMSE on TRAIN: " + str(np.mean(err_train1)))
print("RMSE on  TEST: " + str(np.mean(err_test1)))

For the global average rating 
RMSE on TRAIN: 1.0244140710185898
RMSE on  TEST: 1.0244173138607935


### 2. The average user rating
When building models for “average user rating” or “average movie rating”, we must take into account that some users or movies may be missing from a training fold because all of their ratings ended up in the test fold. For such cases no per-user or per-movie average exists, so we use the “global average rating” as a fallback value.
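
The fallback can be sketched on a tiny, made-up split in which one user appears only in the test set. The data and names below are hypothetical; the pattern (left merge, then fill the gaps with the global training mean) mirrors what the notebook does.

```python
import pandas as pd

# Hypothetical toy split: user 3 appears only in the test set.
train = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "rating": [4.0, 2.0, 5.0, 3.0],
})
test = pd.DataFrame({
    "userId": [1, 2, 3],
    "rating": [3.0, 4.0, 5.0],
})

# Per-user mean rating, computed on the training set only.
user_mean = (
    train.groupby("userId")["rating"].mean().rename("pred").reset_index()
)

# A left merge keeps every test row; unseen users get NaN as prediction.
test_pred = test.merge(user_mean, on="userId", how="left")

# Replace the missing predictions with the global training mean.
global_mean = train["rating"].mean()
test_pred["pred"] = test_pred["pred"].fillna(global_mean)

print(test_pred)
```

Users 1 and 2 keep their own training averages, while user 3 receives the global training mean of 3.5.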

In [86]:
# set up the empty result list
err_train2=[]
err_test2=[]

# set up the cross validation with 5 folds
for train_index, test_index in kf.split(ratings.userId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]
    # calculate the average rating per user using groupby
    ave_user = train.groupby(['userId'])['rating'].mean().to_frame().reset_index()
    ave_user.columns = ['userId', 'pred']

    # assign the average user rating as the prediction
    train_pred = train.merge(ave_user, on=['userId'], how='left')
    test_pred = test.merge(ave_user, on=['userId'], how='left')

    # use the global average rating as a fallback value
    gmr = train.rating.mean()
    test_pred['pred'] = test_pred['pred'].fillna(gmr)

    # calculate the RMSE
    train_pred['diff'] = (train_pred.rating - train_pred.pred)**2
    err_train2.append(np.sqrt(train_pred['diff'].mean()))
    test_pred['diff'] = (test_pred.rating - test_pred.pred)**2
    err_test2.append(np.sqrt(test_pred['diff'].mean()))

In [87]:
#print the final conclusion:
print("For the user average rating ")
print("Mean RMSE on TRAIN: " + str(np.mean(err_train2)))
print("Mean RMSE on  TEST: " + str(np.mean(err_test2)))

For the user average rating 
Mean RMSE on TRAIN: 0.9170435071734993
Mean RMSE on  TEST: 0.9243405869451401


### 3. The average movie rating

In [88]:
# set up the empty result list
err_train3=[]
err_test3=[]

for train_index, test_index in kf.split(ratings.movieId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]
    # calculate the average rating per movie using groupby
    ave_movie = train.groupby(['movieId'])['rating'].mean().to_frame().reset_index()
    ave_movie.columns = ['movieId', 'pred']

    # assign the average movie rating as the prediction
    train_pred = train.merge(ave_movie, on=['movieId'], how='left')
    test_pred = test.merge(ave_movie, on=['movieId'], how='left')

    # use the global average rating as a fallback value
    gmr = train.rating.mean()
    test_pred['pred'] = test_pred['pred'].fillna(gmr)

    # calculate the RMSE
    train_pred['diff'] = (train_pred.rating - train_pred.pred)**2
    err_train3.append(np.sqrt(train_pred['diff'].mean()))
    test_pred['diff'] = (test_pred.rating - test_pred.pred)**2
    err_test3.append(np.sqrt(test_pred['diff'].mean()))

In [89]:
#print the final conclusion:
print("For the movie average rating: ")
print("Mean RMSE on TRAIN: " + str(np.mean(err_train3)))
print("Mean RMSE on  TEST: " + str(np.mean(err_test3)))

For the movie average rating: 
Mean RMSE on TRAIN: 0.8532443161602952
Mean RMSE on  TEST: 0.9572987289292613


### 4. The user movie linear regression rating
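We will use a linear regression model to predict the ratings from the per-user and per-movie averages. We then improve the predictions by clipping them to the valid range: values above 5 are set to 5 and values below 1 are set to 1 (ratings below 1 were raised to 1 during data wrangling, so valid ratings here always lie between 1 and 5).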
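
The fit-then-clip idea can be sketched on a handful of made-up feature rows. The arrays below are hypothetical and only stand in for the per-user and per-movie averages; `np.clip` is used here as one straightforward way to enforce the rating bounds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: column 0 is the user's mean rating,
# column 1 is the movie's mean rating.
X_train = np.array([[3.0, 4.0], [4.5, 4.8], [2.0, 1.5], [3.5, 3.0], [1.5, 2.5]])
y_train = np.array([4.0, 5.0, 1.0, 3.0, 2.0])

reg = LinearRegression().fit(X_train, y_train)

# New rows may push the linear prediction outside the rating scale.
X_new = np.array([[5.0, 5.0], [1.0, 1.0], [3.0, 3.5]])
raw = reg.predict(X_new)

# Clip to the valid range [1, 5]; in-range predictions are left unchanged.
clipped = np.clip(raw, 1.0, 5.0)
print("raw:    ", np.round(raw, 3))
print("clipped:", np.round(clipped, 3))
```

Clipping can only move a prediction closer to a valid true rating (or leave it unchanged), so it cannot increase the squared error.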

In [90]:
# set up the empty result list
err_train4=[]
err_test4=[]

for train_index, test_index in kf.split(ratings.userId):
    train, test = ratings.iloc[train_index], ratings.iloc[test_index]

    # calculate the average user rating
    ave_user = train.groupby(['userId'])['rating'].mean().to_frame().reset_index()
    ave_user.columns = ['userId', 'pred_user']
    # calculate the average movie rating
    ave_movie = train.groupby(['movieId'])['rating'].mean().to_frame().reset_index()
    ave_movie.columns = ['movieId', 'pred_movie']

    # assign the average values based on user and movie IDs
    train_pred = train.merge(ave_user, on=['userId'], how='left')
    test_pred = test.merge(ave_user, on=['userId'], how='left')
    train_pred = train_pred.merge(ave_movie, on=['movieId'], how='left')
    test_pred = test_pred.merge(ave_movie, on=['movieId'], how='left')

    # set up the linear regression model
    reg = linear_model.LinearRegression()
    # A mask to hide NaN values is not necessary in this case, because every
    # user and movie in the training fold has a training average:
    # mask = ~np.isnan(train_pred['pred_user']) & ~np.isnan(train_pred['pred_movie']) & ~np.isnan(train_pred['rating'])
    # reg.fit(train_pred[['pred_user','pred_movie']][mask], train_pred['rating'][mask])
    reg.fit(train_pred[['pred_user', 'pred_movie']], train_pred['rating'])

    # get model parameters for prediction
    a1, a2 = reg.coef_
    a0 = reg.intercept_
    
    # make predictions
    train_pred['pred'] = train_pred['pred_user']*a1 + train_pred['pred_movie']*a2 + a0
    test_pred['pred'] = test_pred['pred_user']*a1 + test_pred['pred_movie']*a2 + a0

    # clip predictions to the valid rating range [1, 5] (NaN values are kept)
    train_pred['pred'] = train_pred['pred'].clip(lower=1, upper=5)
    test_pred['pred'] = test_pred['pred'].clip(lower=1, upper=5)

    # use the global average rating as a fallback value
    gmr = train.rating.mean()
    test_pred['pred'] = test_pred['pred'].fillna(gmr)

    # calculate the RMSE
    train_pred['diff'] = (train_pred.rating - train_pred.pred)**2
    err_train4.append(np.sqrt(train_pred['diff'].mean()))
    test_pred['diff'] = (test_pred.rating - test_pred.pred)**2
    err_test4.append(np.sqrt(test_pred['diff'].mean()))

In [91]:
#print the final conclusion:
print("For the user movie linear regression rating: ")
print("Mean RMSE on TRAIN: " + str(np.mean(err_train4)))
print("Mean RMSE on  TEST: " + str(np.mean(err_test4)))

For the user movie linear regression rating: 
Mean RMSE on TRAIN: 0.7867146084287131
Mean RMSE on  TEST: 0.8795261369743201


In [92]:
# Collect the per-fold errors of all four approaches
err_train = pd.DataFrame({'Train1':err_train1, 'Train2':err_train2,'Train3':err_train3, 'Train4':err_train4})
err_test = pd.DataFrame({'Test1':err_test1, 'Test2':err_test2,'Test3':err_test3, 'Test4':err_test4})

print("Training Error And Test Error:")
print(err_train.mean(),'\n\n',err_test.mean())

Training Error And Test Error:
Train1    1.024414
Train2    0.917044
Train3    0.853244
Train4    0.786715
dtype: float64 

 Test1    1.024417
Test2    0.924341
Test3    0.957299
Test4    0.879526
dtype: float64


### Conclusion
**Based on the RMSE, the linear model works best among the four approaches. Not surprisingly, the user-based and movie-based averages both perform better than the global average alone.**
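
To put the comparison in relative terms, the short sketch below expresses each mean test error as a percentage reduction against the global-average baseline. The numbers are copied from the summary output above; the dictionary keys and variable names are illustrative only.

```python
# Mean test errors copied from the summary output above.
test_error = {
    "global average": 1.024417,
    "user average": 0.924341,
    "movie average": 0.957299,
    "linear regression": 0.879526,
}

baseline = test_error["global average"]
for name, err in test_error.items():
    # Relative reduction of the test error compared with the baseline.
    gain = 100 * (baseline - err) / baseline
    print(f"{name:>17}: {err:.4f} ({gain:.1f}% below the global average)")

best = min(test_error, key=test_error.get)
print("Lowest test error:", best)
```

On these figures the linear model lowers the test error by roughly 14% relative to the global average.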

The matrix factorization method is discussed in a separate Python notebook.