## Data643 Project 4 - Part 2 - Cross Validation of Recommendation


Deliverables 
 
1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity. 
3. Compare and report on any change in accuracy before and after you’ve made the change in #2.
4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible.  Also, briefly propose how you would design a reasonable online evaluation environment. 

####  <font color='blue'> Introduction </font>

In Part 2 of Project 4, we build on our prior recommender systems and use cross-validation to check model performance. Specifically, we use the `GridSearchCV` class from the Surprise library on the matrix factorization recommender (SVD). In addition to RMSE and MAE, we add FCP (Fraction of Concordant Pairs) as a performance metric. FCP measures how often the predicted ratings put a pair of items in the same order as the user's actual ratings. If two users rate the same 10 movies, FCP judges the predictions by the relative ordering of those ratings rather than by their absolute values.

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

In [2]:
#Load datasets
ratings = pd.read_csv('https://raw.githubusercontent.com/akulapa/Data643-Week02/master/Data/ratings.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/akulapa/Data643-Week02/master/Data/movies.csv')

In [3]:
#Convert Users as Rows and Movies as Columns 
M_df = ratings.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)

#Convert Movies as Rows and Users as Columns 
U_df = ratings.pivot(index = 'movieId', columns ='userId', values = 'rating').fillna(0)

M_df.head(10)

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
#Movies rated by user 1
ratings[(ratings['userId'] == 1)].head(10)

R = M_df.values
R[10:]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 5.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 4.,  0.,  0., ...,  0.,  0.,  0.],
       [ 5.,  0.,  0., ...,  0.,  0.,  0.]])

In [5]:
#Calculate mean for each user
user_ratings_mean = M_df.mean(axis=1)
user_ratings_mean.head(10)

#Calculate mean for each user
user_ratings_mean = np.mean(R, axis = 1)

#Apply mean for each user
#user_bias = user rating - user over all average 
R_bias = R - user_ratings_mean.reshape(-1, 1)

R_bias[10:]


#Lets start with entire dataset as is
#Get number of rows and columns 
r, c = R.shape

#get min of row or column size, it acts as starting value for k
k = min(r, c)
k = k - 1

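#Get SVD of the mean-centred ratings; the user means are added back when ratings are recalculated
#(k is only needed by scipy's svds, which requires k smaller than both matrix dimensions)
U, sigma, Vt = np.linalg.svd(R_bias, full_matrices=False)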

#user features
U[10:]

#diagonal matrix of singular values
sigma[:10]

#movie features
Vt[:10]

#Get diagonal sigma
sigma_diag = np.diag(sigma)

#Recalculate ratings
predicted_ratings = np.dot(np.dot(U, sigma_diag), Vt) + user_ratings_mean.reshape(-1, 1)
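
Because the decomposition above keeps every singular value, the product simply reproduces the matrix it was given. A recommender usually keeps only a few latent factors. The following is a minimal sketch of that truncation using `scipy.sparse.linalg.svds`; the toy matrix `R_toy` and the choice `k = 2` are assumptions made for illustration and are not part of the project data.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical 4-user x 5-movie ratings matrix (0 = not rated)
R_toy = np.array([
    [5.0, 3.0, 0.0, 1.0, 0.0],
    [4.0, 0.0, 0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
])

# Centre each user's row on that user's mean rating
user_mean = R_toy.mean(axis=1, keepdims=True)
R_centered = R_toy - user_mean

# Keep only k latent factors; svds requires 1 <= k < min(R_toy.shape)
k = 2
U_k, sigma_k, Vt_k = svds(R_centered, k=k)

# Rank-k approximation, with the user means added back
R_hat = U_k @ np.diag(sigma_k) @ Vt_k + user_mean

# Reconstruction error of the rank-k approximation over all cells
rmse_k = np.sqrt(np.mean((R_toy - R_hat) ** 2))
print(R_hat.round(2))
print("rank-%d reconstruction RMSE: %.3f" % (k, rmse_k))
```

A smaller `k` gives a coarser approximation with a larger reconstruction error, while `k` close to the matrix rank approaches the exact reconstruction shown in the cell above.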

In [6]:
#Convert to dataframe
predicted_ratings_T = pd.DataFrame(predicted_ratings.T, columns=U_df.columns)
predicted_ratings_T['movieId'] = U_df.index
predicted_ratings_T = pd.melt(predicted_ratings_T, id_vars = 'movieId')
userMoviesId = list(ratings['movieId'])
userUserId = list(ratings['userId'])

In [7]:

#Surprise algorithms and data helpers (the unused `evaluate` import is dropped)
from surprise import SVD, KNNBasic, Dataset, similarities, Reader, BaselineOnly
from surprise.model_selection import GridSearchCV, cross_validate

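#GridSearchCV takes the algorithm class, not an instance
algoSVD = SVD
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }

#The item-based options above are defined but not applied; the class keeps its defaults
algo2 = KNNBasic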

reader = Reader(rating_scale = (1,5))

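#Candidate hyperparameters for SVD
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
#Empty grid: a single run with each algorithm's default parameters (this is the grid used below)
param_grid1 = {}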

In [8]:
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo3 = BaselineOnly#(bsl_options=bsl_options)

In [9]:
df2 = Dataset.load_from_df(ratings[['userId','movieId','rating']], reader)
gs = GridSearchCV(algoSVD, param_grid1,
                  measures=['rmse', 'mae', 'fcp'], cv=3, n_jobs = -2)

gs2 = GridSearchCV(algo3, param_grid1,
                  measures=['rmse', 'mae', 'fcp'], cv=3, n_jobs = -2)

In [10]:
#Fit model by data
gs.fit(df2)
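
To make the `cv=3` setting concrete, the following is a rough sketch of what three-fold cross-validation does with the RMSE and MAE measures. It does not use Surprise: the synthetic ratings, the helper `kfold_indices`, and the global-mean stand-in model are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 300 ratings on a 1-5 scale
n_ratings = 300
ratings_toy = rng.integers(1, 6, size=n_ratings).astype(float)

def kfold_indices(n, n_splits, rng):
    """Shuffle 0..n-1 and yield (train_idx, test_idx) for each fold."""
    folds = np.array_split(rng.permutation(n), n_splits)
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

fold_scores = []
for train_idx, test_idx in kfold_indices(n_ratings, 3, rng):
    # Stand-in model: predict the mean rating of the training folds
    prediction = ratings_toy[train_idx].mean()
    errors = ratings_toy[test_idx] - prediction
    rmse = np.sqrt(np.mean(errors ** 2))   # penalises large errors more
    mae = np.mean(np.abs(errors))          # average absolute error
    fold_scores.append((rmse, mae))

for i, (rmse, mae) in enumerate(fold_scores, start=1):
    print("fold %d: RMSE=%.3f MAE=%.3f" % (i, rmse, mae))
```

Each rating is used for testing exactly once, so the spread of the per-fold scores indicates how sensitive the model is to the particular train/test split.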

##### <font color='blue'> Concordant Pair </font>

The following excerpt from Wikipedia gives the technical definition of a concordant pair, the building block of the FCP performance measure.

From Wikipedia:

A concordant pair is a pair of observations, each on two variables, $\{X_1, Y_1\}$ and $\{X_2, Y_2\}$, having the property that

$$ \operatorname{sgn}(X_{2}-X_{1}) = \operatorname{sgn}(Y_{2}-Y_{1}) $$
where "sgn" refers to whether a number is positive, zero, or negative (its sign). Specifically, the sign function, often represented as sgn, is defined as:

$$ \operatorname{sgn} x = \begin{cases}
-1 & \text{if } x<0\\
0 & \text{if } x=0\\
1 & \text{if } x>0
\end{cases} $$
That is, in a concordant pair, both elements of one pair are either greater than, equal to, or less than the corresponding elements of the other pair.

In contrast, a discordant pair is a pair of two-variable observations such that

$$ \operatorname{sgn}(X_{2}-X_{1}) = -\operatorname{sgn}(Y_{2}-Y_{1}) $$
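
To connect this definition to ratings, take X as a user's true ratings and Y as the predicted ratings for the same items. The sketch below counts concordant and discordant item pairs within each user and pools them into one fraction. It is a simplified illustration, not the Surprise implementation, which may treat ties and per-user averaging differently; the function name and the toy ratings are hypothetical.

```python
from itertools import combinations
import numpy as np

def fraction_concordant_pairs(true_by_user, pred_by_user):
    """Pooled FCP: concordant / (concordant + discordant) over all users.

    Item pairs are only compared within the same user. Pairs whose true
    ratings are tied are skipped; a tie in the predictions for an untied
    pair counts as discordant (both are simplifying assumptions).
    """
    concordant = 0
    discordant = 0
    for user, true_ratings in true_by_user.items():
        preds = pred_by_user[user]
        for i, j in combinations(range(len(true_ratings)), 2):
            true_sign = np.sign(true_ratings[j] - true_ratings[i])
            pred_sign = np.sign(preds[j] - preds[i])
            if true_sign == 0:
                continue  # tied true ratings carry no ordering information
            if pred_sign == true_sign:
                concordant += 1
            else:
                discordant += 1
    return concordant / (concordant + discordant)

# Hypothetical true and predicted ratings for two users
true_by_user = {"u1": [5.0, 3.0, 1.0], "u2": [2.0, 4.0]}
pred_by_user = {"u1": [4.2, 3.9, 4.0], "u2": [3.1, 3.5]}

fcp = fraction_concordant_pairs(true_by_user, pred_by_user)
print("FCP = %.2f" % fcp)
```

In this toy case the predictions for `u1` are far from the true values in absolute terms, yet two of the three item pairs are still ordered correctly, which is the property FCP rewards and RMSE ignores.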

In [11]:
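#gs.cv_results['mean_test_rmse']
#ALGO 1 Results
metrixdf = pd.DataFrame.from_dict(gs.cv_results)
#Select the per-fold test scores by name (split0_test_fcp, ...) instead of by column position
split_cols = sorted(c for c in metrixdf.columns if c.startswith('split'))
#Copy so that later in-place changes do not act on a slice of metrixdf
metrixdf1 = metrixdf[split_cols].copy()
#Sorted order is fcp, mae, rmse within each fold
columnsname = ['fcp1','mae1','rmse1','fcp2','mae2','rmse2','fcp3','mae3','rmse3']
metrixdf1.columns=columnsname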

### <font color='blue'> Cross-Validation Results </font>

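The three folds below give nearly identical scores for the SVD recommender: RMSE stays between 0.900 and 0.904, MAE between 0.694 and 0.697, and FCP between 0.633 and 0.637. These metrics are only meaningful relative to another model, and since no baseline results are reported here, the absolute quality of the system remains unknown. Cross-validation with more folds should be done in the future.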

In [12]:
#Group columns by metric; reassigning avoids modifying a slice in place
metrixdf1 = metrixdf1.sort_index(axis=1)
metrixdf1



| | fcp1 | fcp2 | fcp3 | mae1 | mae2 | mae3 | rmse1 | rmse2 | rmse3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.63251 | 0.634728 | 0.636845 | 0.695701 | 0.694482 | 0.696785 | 0.902563 | 0.900004 | 0.903829 |
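
The claim of low variance across folds can be checked directly from the reported numbers. The sketch below re-enters the nine per-fold scores from the table above by hand and summarises them; the frame name `fold_scores` and the fold labels are only illustrative.

```python
import pandas as pd

# Per-fold test scores copied from the table above
fold_scores = pd.DataFrame(
    {
        "fcp": [0.63251, 0.634728, 0.636845],
        "mae": [0.695701, 0.694482, 0.696785],
        "rmse": [0.902563, 0.900004, 0.903829],
    },
    index=["fold1", "fold2", "fold3"],
)

# Mean and sample standard deviation of each metric across the three folds
summary = fold_scores.agg(["mean", "std"]).T
print(summary.round(4))
```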


####  <font color='blue'> Conclusion & Future Research </font>
In examining the diversity of our recommendation systems, we explored the extensive literature on the topic and found few examples written in Python. Several articles describe methods that re-rank recommendations in a post-processing step, including a two-pass method, Bayes' rule, and PC reranking. The two-pass method was shown to outperform the other methods while maintaining accuracy. In general, diversity is shown to have an inverse relationship with the accuracy of a recommendation system.

The two-pass method works by first calculating the discrepancy between item pairs after ratings are predicted. The items or users with the least discrepancy form a cluster. From this cluster, the items with the highest ratings are recommended to the users. The discrepancy measure is based on the graph-theoretic in-degree and out-degree of nodes that are connected by users.
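
The graph view behind this measure can be illustrated without implementing the two-pass method itself. In the sketch below, users and items are nodes of a bipartite graph with one edge per recommendation, so a user's out-degree is the length of their list and an item's in-degree is the number of users who receive it. The function names, the toy top-3 lists, and the catalog size are assumptions for illustration only.

```python
from collections import Counter

def item_in_degrees(recommendations):
    """Count how many users each item is recommended to (its in-degree)."""
    return Counter(item for items in recommendations.values() for item in items)

def aggregate_diversity(recommendations, catalog_size):
    """Fraction of the catalog that appears in at least one user's list."""
    return len(item_in_degrees(recommendations)) / catalog_size

# Hypothetical top-3 lists for four users over a 10-movie catalog
recommendations = {
    "u1": [1, 2, 3],
    "u2": [1, 2, 4],
    "u3": [1, 3, 5],
    "u4": [1, 2, 3],
}

in_degree = item_in_degrees(recommendations)
print(in_degree.most_common())
print("aggregate diversity:", aggregate_diversity(recommendations, catalog_size=10))
```

Computing these quantities before and after a re-ranking step, alongside RMSE or FCP, would be one way to report the trade-off between diversity and accuracy.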

# Reference:

1. Concordant pair. (2018). Retrieved from https://en.wikipedia.org/wiki/Concordant_pair

2. Antikacioglu, A., & Ravi, R. (2017). Post Processing Recommender Systems for Diversity. Retrieved from https://dl.acm.org/
