# Igor Balagula
# Project 3

The purpose of this project is to gain an understanding of the singular value decomposition (SVD) technique and to compare predictions obtained via SVD with those from other methods.

In [153]:
import numpy as np
import pandas as pd
from sklearn import cross_validation as cv
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt
import scipy.sparse as sp
from scipy.sparse.linalg import svds

In [154]:
# Source of the "predict" and "rmse" functions:
# https://github.com/ongxuanhong/data-science-works/blob/master/python/recommender/song_recommender.py

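def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Mean rating per user (zeros for unrated movies are included in the mean)
        mean_user_rating = ratings.mean(axis=1)
        # Center each user's ratings on that user's mean
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        # User mean plus the weighted average of the other users' deviations
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of the user's own ratings over the other items
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred


def rmse(prediction, ground_truth):
    # Score only the cells where the ground truth holds a rating (nonzero)
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))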
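
To make the masking in `rmse` concrete, here is a minimal sketch on a made-up 2x3 matrix. The helper name `rmse_nonzero` and the toy arrays are hypothetical, and the mean squared error is computed with NumPy directly rather than with `mean_squared_error`, so the example needs no scikit-learn. Only cells where the ground truth is nonzero contribute to the error.

```python
import numpy as np

def rmse_nonzero(prediction, ground_truth):
    # Same idea as rmse above: score only cells that hold an actual rating.
    mask = ground_truth.nonzero()
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy data: 0 means "not rated".
toy_truth = np.array([[5.0, 0.0, 3.0],
                      [0.0, 4.0, 0.0]])
toy_pred = np.array([[4.0, 2.5, 3.0],
                     [1.0, 2.0, 9.9]])

# Only the three rated cells are compared: errors are -1, 0 and -2.
toy_error = rmse_nonzero(toy_pred, toy_truth)
print(toy_error)
```

The predictions 2.5, 1.0 and 9.9 sit in unrated cells, so they have no effect on the result.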

We will use ratings data from the MovieLens dataset.

In [155]:
header = ['userid', 'itemid', 'rating', 'timestamp']
df = pd.read_csv('E:\\Igor\\CUNY\\DATA 643 - Recommender Systems\\Projects\\Project_03\\Data\\ratings_943_1682.data', sep='\t', names=header)
print 'Dimensions of the dataset: '+str(df.userid.unique().shape[0])+' users and '+str(df.itemid.unique().shape[0])+' movies'

Dimensions of the dataset: 943 users and 1682 movies


In [156]:
print (df.head())

   userid  itemid  rating  timestamp
0     196     242       3  881250949
1     186     302       3  891717742
2      22     377       1  878887116
3     244      51       2  880606923
4     166     346       1  886397596

Split the dataset into a 75% training set and a 25% test set.

In [157]:
train_df, test_df = cv.train_test_split(df, test_size=0.25)

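Create the training matrix: rows are users, columns are movies, and unrated cells stay 0. User and movie IDs start at 1, so we subtract 1 to get array indices.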

In [158]:
train = np.zeros((943, 1682))
for line in train_df.itertuples():
    train[line[1]-1, line[2]-1] = line[3]

Create the test matrix in the same way: rows are users, columns are movies.

In [159]:
test = np.zeros((943, 1682))
for line in test_df.itertuples():
    test[line[1]-1, line[2]-1] = line[3]
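
The two loops above can also be written with NumPy integer-array indexing. This is a minimal sketch, assuming IDs are 1-based as in this dataset and that each (user, movie) pair appears at most once. The helper `build_matrix` and the toy arrays are hypothetical names used only for illustration.

```python
import numpy as np

def build_matrix(user_ids, item_ids, ratings, n_users, n_items):
    # Rows are users, columns are items; unrated cells stay 0.
    matrix = np.zeros((n_users, n_items))
    # IDs are assumed to be 1-based, so shift them to 0-based indices.
    matrix[np.asarray(user_ids) - 1, np.asarray(item_ids) - 1] = ratings
    return matrix

# Hypothetical toy ratings given as parallel arrays (userid, itemid, rating).
toy_users = np.array([1, 1, 3])
toy_items = np.array([2, 4, 1])
toy_ratings = np.array([5, 3, 4])

toy_matrix = build_matrix(toy_users, toy_items, toy_ratings, n_users=3, n_items=4)
print(toy_matrix)
```

With the DataFrames above, the call would look like `build_matrix(train_df.userid.values, train_df.itemid.values, train_df.rating.values, 943, 1682)`.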

In [160]:
print (train)

[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]


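We will use the cosine metric to compare users in the training matrix. Note that pairwise_distances returns cosine distance (1 minus cosine similarity), so a value of 0 means two users rated in exactly the same proportions and larger values mean less similar users.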

In [161]:
user_dist = pairwise_distances(train, metric='cosine')
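
Because `pairwise_distances(..., metric='cosine')` returns a distance, a similarity matrix can be obtained as one minus that distance. The sketch below computes cosine similarity directly with NumPy on a made-up matrix, so the relationship is visible without scikit-learn. `cosine_similarity_rows` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    # Normalize each row to unit length; all-zero rows are left as zeros.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = matrix / norms
    # The dot product of unit vectors is the cosine of the angle between them.
    return unit.dot(unit.T)

# Hypothetical toy data: users 0 and 1 rate identically, user 2 shares no movies.
toy = np.array([[5.0, 3.0, 0.0],
                [5.0, 3.0, 0.0],
                [0.0, 0.0, 4.0]])

toy_sim = cosine_similarity_rows(toy)
toy_dist = 1.0 - toy_sim  # cosine distance, the kind of value stored in user_dist
print(toy_sim)
print(toy_dist)
```

Identical users get similarity 1 and distance 0. A function that expects similarities as weights, such as `predict`, would therefore give the most weight to the least similar users if it were handed distances.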

Generate predictions with user-based collaborative filtering.

In [162]:
user_pred = predict(train, user_dist, type='user')

In [163]:
print 'User-based cosine-similarity method RMSE: ' + str(rmse(user_pred, test))

User-based cosine-similarity method RMSE: 3.11489515319


Now we will use the SVD method. SVD reduces the dimensionality of our dataset and captures latent "features" that we can use to compare users. The parameter k is the number of singular values (features) we want to retain.

In [164]:

u, s, vt = svds(train, k = 10)


After this decomposition we get u, s and vt, with shapes (number of users, k), (k,) and (k, number of movies) respectively.
Matrix u can be interpreted as the user feature matrix, matrix vt as the movie feature matrix, and vector s holds the k largest singular values of the original matrix. Let's see what u, s and vt look like:

In [165]:
u

array([[-0.01364545, -0.06902873,  0.0402673 , ...,  0.00341819,
        -0.00344214,  0.06878603],
       [ 0.02153813, -0.03270731,  0.01214677, ..., -0.05160203,
        -0.04863139,  0.01469435],
       [ 0.01105923,  0.02334423,  0.00629211, ..., -0.02477766,
        -0.02525216,  0.00629738],
       ..., 
       [ 0.00074534,  0.00634517, -0.0124235 , ..., -0.0085306 ,
        -0.02434453,  0.00665855],
       [ 0.04213112, -0.0028665 ,  0.02357201, ..., -0.02321055,
         0.00415819,  0.02533531],
       [ 0.04538209, -0.02871444,  0.04151322, ...,  0.07234217,
        -0.00234888,  0.04229639]])

In [166]:
s

array([  82.90786394,   87.89137894,   99.36099134,  101.7108859 ,
        115.12407029,  122.21470942,  123.71478626,  167.73570397,
        187.25802917,  481.73297612])

In [167]:
vt

array([[ -2.45354782e-03,   1.59058680e-02,   2.98187309e-02, ...,
          4.20551640e-05,  -1.47186940e-03,  -2.51527945e-03],
       [  6.49952014e-02,  -5.97849163e-02,   2.24021842e-02, ...,
          8.79436671e-04,  -2.73906965e-03,  -1.41777666e-03],
       [ -1.75130509e-02,   5.67452654e-03,  -1.65840022e-02, ...,
          1.57731636e-04,   2.76739142e-04,   9.39407982e-04],
       ..., 
       [  1.91589896e-03,   5.59998688e-02,   5.43945548e-03, ...,
         -6.06483952e-04,   5.51023228e-04,   3.72943749e-04],
       [ -9.90379251e-02,  -2.17641008e-03,  -3.26974206e-02, ...,
         -5.79558052e-04,   5.89535282e-05,   4.13665833e-04],
       [  9.35021376e-02,   3.27287050e-02,   1.98053094e-02, ...,
          4.11272764e-05,   4.51157978e-04,   4.11299314e-04]])

We need to convert vector s into a diagonal matrix.

In [168]:
s_diag=np.diag(s)

Now we can obtain the prediction matrix by multiplying the three SVD components: u, s_diag and vt.

In [169]:
SVD_pred = np.dot(np.dot(u, s_diag), vt)
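
The reconstruction step can be checked on a small dense matrix. This sketch uses `np.linalg.svd` rather than `svds` (a substitution made so the example needs only NumPy) and keeps the k largest singular values. The toy matrix and the helper name `rank_k_approx` are hypothetical.

```python
import numpy as np

def rank_k_approx(matrix, k):
    # Thin SVD; np.linalg.svd returns singular values in descending order.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values and the matching vectors.
    u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]
    # Multiply the components back together to get the rank-k prediction matrix.
    pred = u_k.dot(np.diag(s_k)).dot(vt_k)
    return u_k, s_k, vt_k, pred

# Hypothetical toy ratings: 3 users, 4 movies.
toy = np.array([[5.0, 3.0, 0.0, 1.0],
                [4.0, 0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0, 5.0]])

u_k, s_k, vt_k, toy_pred = rank_k_approx(toy, k=2)
print(u_k.shape)
print(s_k.shape)
print(vt_k.shape)

# Look up the predicted rating for a 1-based (user, movie) pair.
user_id, item_id = 1, 2
print(toy_pred[user_id - 1, item_id - 1])
```

The prediction matrix has the same shape as the input, so a predicted rating is read with the same 0-based indices used to fill the training matrix.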

Once we have SVD_pred, we can predict a rating by simply looking up the entry for the appropriate user/movie pair in the matrix.

In [170]:
print 'SVD method RMSE for k=10: ' + str(rmse(SVD_pred, test))

SVD method RMSE for k=10: 2.6692919561


With k=10, the SVD method gives a lower RMSE (2.67) than the user-based cosine method (3.11).

Next we will try the SVD method with different values of k.

In [171]:
u, s, vt = svds(train, k = 20)
s_diag=np.diag(s)
SVD_pred = np.dot(np.dot(u, s_diag), vt)
print 'SVD method RMSE for k=20: ' + str(rmse(SVD_pred, test))

SVD method RMSE for k=20: 2.71123108372


In [172]:
u, s, vt = svds(train, k = 50)
s_diag=np.diag(s)
SVD_pred = np.dot(np.dot(u, s_diag), vt)
print 'SVD method RMSE for k=50: ' + str(rmse(SVD_pred, test))

SVD method RMSE for k=50: 2.94352839031


In [173]:
u, s, vt = svds(train, k = 100)
s_diag=np.diag(s)
SVD_pred = np.dot(np.dot(u, s_diag), vt)
print 'SVD method RMSE for k=100: ' + str(rmse(SVD_pred, test))

SVD method RMSE for k=100: 3.21190696035
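
The repeated cells above can be folded into a loop. This is a minimal sketch under stated assumptions: it calls `np.linalg.svd` once and truncates the result for each k instead of calling `svds` repeatedly, it runs on randomly generated toy data rather than MovieLens, and the helper names are hypothetical. The numbers it prints say nothing about the results reported above.

```python
import numpy as np

def rmse_nonzero(prediction, ground_truth):
    # Score only cells where the ground truth holds a rating (nonzero).
    mask = ground_truth.nonzero()
    return float(np.sqrt(np.mean((prediction[mask] - ground_truth[mask]) ** 2)))

def rmse_by_k(train, test, ks):
    # Decompose once, then truncate to the k largest singular values for each k.
    u, s, vt = np.linalg.svd(train, full_matrices=False)
    results = {}
    for k in ks:
        pred = (u[:, :k] * s[:k]).dot(vt[:k, :])
        results[k] = rmse_nonzero(pred, test)
    return results

# Randomly generated toy ratings (not MovieLens): 30 users, 20 movies.
rng = np.random.RandomState(0)
ratings = rng.randint(1, 6, size=(30, 20)).astype(float)
observed = rng.rand(30, 20) < 0.3                 # about 30% of cells are rated
held_out = observed & (rng.rand(30, 20) < 0.25)   # about a quarter go to the test set
toy_train = np.where(observed & ~held_out, ratings, 0.0)
toy_test = np.where(held_out, ratings, 0.0)

scores = rmse_by_k(toy_train, toy_test, ks=[2, 5, 10, 20])
for k in sorted(scores):
    print("k=%d RMSE=%.3f" % (k, scores[k]))
```

At full rank (k=20 here) the product reproduces the training matrix, including the zeros stored in the held-out cells. This is one plausible reason the test RMSE can grow with k when missing ratings are stored as 0.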


Among the values tested, RMSE rises steadily with k: 2.67 for k=10, 2.71 for k=20, 2.94 for k=50 and 3.21 for k=100. The SVD method beats the user-based cosine method (3.11) for k=10, 20 and 50 but not for k=100, and k=10 gives the lowest error. I need to read more about how to choose the optimal k and how to evaluate models built with different values of k.