# Assignment 3

In this assignment we are tasked with using SVD or ALS to create a recomender system.

Given the sparse nature of this matrix the `np.sparse.linalg.sparse()` function will not work.

We have a few options. most easily, we can just fill the sparse values with 0. This will introduce a lot of bias. Instead, it can be effective to replace the `na` values with the row/column means.

Finally, we can update the sparse matrix using a more sophsiticated algorithm. For example, we could use ALS to make a dense matrix and then do normal SVD to reduce the dimensions. However, we are still using imputed values to make it dense. A more interesting way of going about it is to use stochastic gradient decent to create two matrixies, P, Q that have dimensions of users/movies by `n_factors`. This is probably most elegant way to do it.

---

## Dataset

Here we are goign to use the `MovieLens` dataset 

In [1]:
import pandas as pd
import numpy as np
movies = pd.read_csv('~/DATA643/Project2/ml-latest-small/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [2]:
ratings = pd.read_csv('~/DATA643/Project2/ml-latest-small/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


We need to make this into a user/item dataframe first.

In [3]:
ratingsPiv = ratings.pivot(index='userId', columns='movieId', values='rating')

ratingsPiv.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,


This is a very sparse matrix. We need to set the train/test split first before imputing values.

In [4]:
np.random.seed(101)
trainTestMask = np.random.choice([True, False], size=ratingsPiv.shape, p=[0.7, 0.3])
trainTestMask = pd.DataFrame(trainTestMask)
trainTestMask.index = ratingsPiv.index
trainTestMask.columns = ratingsPiv.columns
trainTestMask.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,True,True,True,True,True,False,True,False,False,True,...,True,True,False,True,True,True,False,True,True,True
2,False,True,True,True,False,False,True,True,False,True,...,True,True,False,True,False,True,True,True,True,True
3,False,True,True,True,False,True,True,True,False,True,...,True,True,True,True,False,True,False,True,False,False
4,True,False,False,True,True,False,False,True,True,False,...,False,True,True,True,False,True,False,True,True,True
5,False,True,True,False,True,False,True,True,True,True,...,False,True,True,True,True,True,True,True,True,True


In [5]:
train = pd.DataFrame(np.where(trainTestMask, ratingsPiv, np.NAN))
train.index = ratingsPiv.index
train.columns = ratingsPiv.columns

test = pd.DataFrame(np.where(np.invert(trainTestMask), ratingsPiv, np.NAN))
test.index = ratingsPiv.index
test.columns = ratingsPiv.columns

In [6]:
train.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,


First Let's see the naieve RMSE:

In [7]:
testLong = test.copy()
testLong['userId'] = test.index
testLong = pd.melt(testLong, id_vars=['userId'])
testLong.head()

Unnamed: 0,userId,movieId,value
0,1,1,
1,2,1,
2,3,1,
3,4,1,
4,5,1,


In [8]:
trainLong = train.copy()
trainLong['userId'] = train.index
trainLong = pd.melt(trainLong, id_vars=['userId'])
trainLong.head()

Unnamed: 0,userId,movieId,value
0,1,1,
1,2,1,
2,3,1,
3,4,1,
4,5,1,


## Mean imputation

Here is implement an naieve algoritm that replaces na values with the mean of the dataset. 

In [9]:
colMean = trainLong.value.mean()
testLong['ratingMean'] = colMean

In [10]:
testLongMean = testLong[np.isfinite(testLong['value'])]
testLongMean.head()

Unnamed: 0,userId,movieId,value,ratingMean
6,7,1,3.0,3.542611
19,20,1,3.5,3.542611
43,44,1,4.0,3.542611
62,63,1,5.0,3.542611
68,69,1,5.0,3.542611


In [11]:
from sklearn.metrics import mean_squared_error
mean_squared_error(testLongMean.value, testLongMean.ratingMean)

1.1166519634936165

This uses the test set to validate the RMSE of just subing in the mean for all missing values. 

Over all it's not horrible.

This is the most baisic, implementation. It produces a reasonable RMSE.

To make this more complex we will use SVD on the matrix and set all `NA` values to the mean.

This matrix has zeros for `na` values and will produce a biaed result.

In [13]:
import scipy
from scipy.sparse.linalg import svds
svdMean = train.fillna(colMean)
svdMean = svdMean.values

In [14]:
u,s,v = svds(svdMean, k=100)

In [15]:
print(u.shape, s.shape, v.shape)

(671, 100) (100,) (100, 9066)


In [16]:
S = np.diag(s) # To make it a diagonal matrix

In [17]:
matMean = u.dot(S).dot(v)
#pd.DataFrame(matZeros).head()
matMean.shape

(671, 9066)

In [18]:
dfMean = pd.DataFrame(matMean)
dfMean.index = ratingsPiv.index
dfMean.columns = ratingsPiv.columns
dfMean['userId'] = dfMean.index
dfMean.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161155,161594,161830,161918,161944,162376,162542,162672,163949,userId
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,3.719375,3.52119,3.558313,3.534645,3.545526,3.539794,3.526117,3.529424,3.54176,3.572198,...,3.544811,3.546439,3.548524,3.547312,3.543764,3.535162,3.542359,3.540578,3.540815,1
2,3.582762,3.528446,3.585822,3.543436,3.593353,3.626788,3.438101,3.548813,3.528265,3.575843,...,3.54808,3.545952,3.550197,3.54883,3.525873,3.538473,3.543247,3.542746,3.548786,2
3,3.531427,3.507829,3.512722,3.512156,3.542782,3.605956,3.570752,3.551656,3.535466,3.528208,...,3.531653,3.543158,3.549639,3.54819,3.526202,3.540696,3.542268,3.539899,3.538938,3
4,3.738463,3.36927,3.606657,3.561323,3.399356,3.290927,3.596625,3.498572,3.477923,3.738675,...,3.53644,3.540866,3.546232,3.545798,3.568692,3.549601,3.544026,3.545135,3.540273,4
5,3.601069,3.695016,3.659219,3.447812,3.548668,3.488521,3.60083,3.58402,3.526897,3.662955,...,3.538419,3.541772,3.53698,3.537978,3.540289,3.542556,3.542056,3.539241,3.544245,5


In [19]:
meanSeries = pd.melt(dfMean, id_vars=['userId'])
meanSeries.head()

Unnamed: 0,userId,movieId,value
0,1,1,3.719375
1,2,1,3.582762
2,3,1,3.531427
3,4,1,3.738463
4,5,1,3.601069


In [20]:
meanSvdEval = pd.merge(testLong, meanSeries,  
                  how='left', 
                  left_on=['userId','movieId'], 
                  right_on = ['userId','movieId'])
meanSvdEval = meanSvdEval[np.isfinite(meanSvdEval['value_x'])]

meanSvdEval.head()

Unnamed: 0,userId,movieId,value_x,ratingMean,value_y
6,7,1,3.0,3.542611,3.771932
19,20,1,3.5,3.542611,3.455594
43,44,1,4.0,3.542611,3.477622
62,63,1,5.0,3.542611,4.041213
68,69,1,5.0,3.542611,3.941167


In [21]:
mean_squared_error(meanSvdEval['value_x'], meanSvdEval['value_y'])

1.0729859682025993

This produces a better RMSE on the test set. Infact, it's down right impressive. This is actually a bit alarming since it performs better than the cross validated surprise package below. 

### Suprpise Package
Using the surprise package we can do SVD with a sparse matrix:

In [23]:
import surprise
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import cross_validate

df = pd.read_csv('~/DATA643/Project2/ml-latest-small/ratings.csv')
reader = Reader(rating_scale=(1, 5))


data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
algo = SVD()
mod = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)



Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9033  0.9015  0.8928  0.8892  0.8968  0.8967  0.0053  
Fit time          8.88    8.77    8.40    8.53    8.46    8.61    0.18    
Test time         0.21    0.21    0.20    0.28    0.28    0.24    0.04    


This uses stochastic gradient decent to figure out the ratings and then computes SVD on that. The mean RMSE is 0.893 which is worrying because I would have presumed that this package would perform better than my mean imputed SVD algorithm.

Let's do a quick sanity check to see what we are recomending. UserID of 6 is the first person in our test set.

In [24]:
user6 = ratings[ratings['userId']==6][['userId', 'movieId', 'rating']]
user6test = pd.merge(user6, movies, how='left',
                    left_on='movieId',
                    right_on='movieId')

user6test.sort_values(by=['rating'], ascending=False).iloc[0:10,:]

Unnamed: 0,userId,movieId,rating,title,genres
3,6,293,5.0,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller
36,6,5952,5.0,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy
38,6,7153,5.0,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy
6,6,1204,5.0,Lawrence of Arabia (1962),Adventure|Drama|War
43,6,8874,4.5,Shaun of the Dead (2004),Comedy|Horror
8,6,1259,4.5,Stand by Me (1986),Adventure|Drama
28,6,2761,4.5,"Iron Giant, The (1999)",Adventure|Animation|Children|Drama|Sci-Fi
10,6,1285,4.5,Heathers (1989),Comedy
9,6,1276,4.5,Cool Hand Luke (1967),Drama
7,6,1250,4.5,"Bridge on the River Kwai, The (1957)",Adventure|Drama|War


In [55]:
predFavMovies6 = meanSeries[meanSeries['userId']==6].sort_values(by='value', ascending=False)
predFavMovies6['movieId'] = predFavMovies6.movieId.astype('int64')
user6meanSVD = pd.merge(predFavMovies6, movies, on='movieId')
user6meanSVD.head(10)

Unnamed: 0,userId,movieId,value,title,genres
0,6,5952,4.20336,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy
1,6,4993,4.048155,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
2,6,7153,3.871526,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy
3,6,1204,3.793453,Lawrence of Arabia (1962),Adventure|Drama|War
4,6,293,3.739414,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller
5,6,3578,3.72983,Gladiator (2000),Action|Adventure|Drama
6,6,1035,3.72042,"Sound of Music, The (1965)",Musical|Romance
7,6,1230,3.702941,Annie Hall (1977),Comedy|Romance
8,6,47,3.699305,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
9,6,1954,3.694587,Rocky (1976),Drama


This is a pretty valid prediction for user 6.

All told, the movies don't seem out of line with what the user rated on their own.

using surprise:

In [28]:
PU = algo.pu
QI = algo.qi
surprisePreds = PU.dot(QI.T)

In [29]:
QI.shape

(8402, 100)

In [30]:
ratings.movieId.unique().size

9066

This is a smaller dataset which warents invistigation, unfortunetly, I am running out of time, so I'm just going to show the predictions.

I believe the problem stems from the training/test set only having a subset of the movies, thus making a smaller matrix. This is a bug in an otherwise great library. If I had more time, I'd investigate further.

In [31]:
8401/9066

0.9266490183101699

In [32]:
surpriseDF = pd.DataFrame(surprisePreds)
surpriseDF.index = ratingsPiv.index


In [33]:
algo.predict(iid='2', uid='1')

Prediction(uid='1', iid='2', r_ui=None, est=3.5440977951102446, details={'was_impossible': False})

In [34]:
import implicit

In [42]:
ratingsMat = ratingsPiv.values
ratingsMatZero = ratingsPiv.fillna(0).values
scipy.sparse.coo_matrix(ratingsMatZero).toarray().shape

(671, 9066)

In [43]:
import implicit
model = implicit.als.AlternatingLeastSquares(factors=100)
#model.fit(scipy.sparse.coo_matrix(ratingsMat))

item_user_data = scipy.sparse.coo_matrix(ratingsMatZero.T)
model.fit(item_user_data)

100%|██████████| 15.0/15 [00:04<00:00,  2.64it/s]


In [44]:
UF = model.user_factors
IF = model.item_factors


In [45]:
IF.shape

(9066, 100)

In [46]:
implicitPreds = UF.dot(IF.T)

In [47]:
implicitPreds.shape

(671, 9066)

This is correct, but there are some very weird answers including negatives and numbers well above 5. I'll fix these and then evaluate.

In [48]:
implicitDF = pd.DataFrame(implicitPreds)
implicitDF.index = ratingsPiv.index
implicitDF.columns = ratingsPiv.columns
implicitDF.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,-0.032774,-0.029995,0.02906,-0.013481,-0.092303,-0.084305,0.045118,-0.001605,0.040405,0.08563,...,-0.003263,0.000677,0.018946,-0.003177,-0.003223,-0.005373,0.019088,0.007336,0.004742,-0.003276
2,0.285013,0.464333,-0.069242,0.116647,0.192452,0.129119,0.094481,0.01803,-0.06653,0.859904,...,0.001126,0.004936,-0.000719,0.004219,0.004277,-0.010127,-0.000727,0.006584,0.004257,0.001126
3,0.413208,-0.015325,-0.066846,0.04175,-0.010735,0.013584,-0.033316,0.032475,0.041022,0.071464,...,0.001028,-0.004528,-0.002518,0.003129,0.003176,-0.037496,-0.002527,0.049524,0.032011,0.001036
4,-0.18944,0.093337,0.043589,-0.086198,-0.254178,-0.177212,0.00319,-0.134745,0.044859,0.926112,...,0.008421,0.000661,0.025461,0.001428,0.001448,0.035394,0.025651,0.007514,0.004857,0.008454
5,0.505168,0.420504,0.473125,-0.009577,0.19612,0.129349,0.087806,0.040132,-0.04332,0.230948,...,-0.00133,-0.005679,0.002395,0.006264,0.006352,-0.049338,0.002418,0.058432,0.037765,-0.001339


In [49]:
implicitDF[implicitDF < 0] = 0
implicitDF[implicitDF > 5] = 5

In [50]:
implicitDF.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.02906,0.0,0.0,0.0,0.045118,0.0,0.040405,0.08563,...,0.0,0.000677,0.018946,0.0,0.0,0.0,0.019088,0.007336,0.004742,0.0
2,0.285013,0.464333,0.0,0.116647,0.192452,0.129119,0.094481,0.01803,0.0,0.859904,...,0.001126,0.004936,0.0,0.004219,0.004277,0.0,0.0,0.006584,0.004257,0.001126
3,0.413208,0.0,0.0,0.04175,0.0,0.013584,0.0,0.032475,0.041022,0.071464,...,0.001028,0.0,0.0,0.003129,0.003176,0.0,0.0,0.049524,0.032011,0.001036
4,0.0,0.093337,0.043589,0.0,0.0,0.0,0.00319,0.0,0.044859,0.926112,...,0.008421,0.000661,0.025461,0.001428,0.001448,0.035394,0.025651,0.007514,0.004857,0.008454
5,0.505168,0.420504,0.473125,0.0,0.19612,0.129349,0.087806,0.040132,0.0,0.230948,...,0.0,0.0,0.002395,0.006264,0.006352,0.0,0.002418,0.058432,0.037765,0.0


In [51]:

implicitDF['userId'] = implicitDF.index
implicitSeries = pd.melt(implicitDF, id_vars=['userId'])
implicitSeries.head()


Unnamed: 0,userId,movieId,value
0,1,1,0.0
1,2,1,0.285013
2,3,1,0.413208
3,4,1,0.0
4,5,1,0.505168


In [52]:
implicitEval = pd.merge(testLong, implicitSeries,  
                  how='left', 
                  left_on=['userId','movieId'], 
                  right_on = ['userId','movieId'])
implicitEval = meanSvdEval[np.isfinite(meanSvdEval['value_x'])]

implicitEval.head()

Unnamed: 0,userId,movieId,value_x,ratingMean,value_y
6,7,1,3.0,3.542611,3.771932
19,20,1,3.5,3.542611,3.455594
43,44,1,4.0,3.542611,3.477622
62,63,1,5.0,3.542611,4.041213
68,69,1,5.0,3.542611,3.941167


In [53]:
mean_squared_error(implicitEval['value_x'], implicitEval['value_y'])

1.0729859682025993

This is actually a pretty good algorithm too. However, I ran out of time, so I didn't have the chance to use the train/test split. 

In [56]:
user6meanSVD = pd.merge(predFavMovies6, movies, on='movieId')
user6meanSVD.head(10)

Unnamed: 0,userId,movieId,value,title,genres
0,6,5952,4.20336,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy
1,6,4993,4.048155,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
2,6,7153,3.871526,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy
3,6,1204,3.793453,Lawrence of Arabia (1962),Adventure|Drama|War
4,6,293,3.739414,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller
5,6,3578,3.72983,Gladiator (2000),Action|Adventure|Drama
6,6,1035,3.72042,"Sound of Music, The (1965)",Musical|Romance
7,6,1230,3.702941,Annie Hall (1977),Comedy|Romance
8,6,47,3.699305,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
9,6,1954,3.694587,Rocky (1976),Drama


These predictions are different, but I wouldn't say worse.

## Further Notes

I was time constrainted on this assignmet. Idealy one should set the SVD to only lose 80-90% of the energy of the original. I'm pretty confident that I achieved this anyways, given the larege number or movies, but if I did lose more, it would be preferable to reduce the number of factors.