## MSc Data Science: collaborative filtering assignment

We begin by installing and loading a few packages.

In [1]:
install.packages("dslabs")
install.packages("softImpute")


The downloaded binary packages are in
	/var/folders/n2/vcmhff3d5b370hsbg6c1w4bh00m7lx/T//RtmpMeqZ8F/downloaded_packages


In [1]:
library("dslabs")
library("softImpute")
library("Matrix")

Loading required package: Matrix

Loaded softImpute 1.4




We will use the MovieLens collaborative filtering data set.

In [2]:
data("movielens")

movielens = na.omit(movielens) # drop rows with any missing field (e.g. movies with no name)

We create a vector of user IDs, a vector of movie titles, a vector of movie IDs, and a vector of all observed ratings.

In [3]:
users = movielens$userId
movies_titles = as.factor(movielens$title)
movies_IDs = as.numeric(movies_titles)
ratings  = movielens$rating

In [4]:
head(movies_titles)

In [5]:
mean_rating = mean(ratings)
cat("The average observed rating is ", mean_rating, " out of 5.", sep = "")

The average observed rating is 3.543591 out of 5.

In collaborative filtering, rating matrices are huge but very sparse: most users rate only a small fraction of the movies. It is therefore much easier to store the ratings in a sparse format, which the softImpute package can build with `Incomplete`:

In [40]:
X = Incomplete(i = users, j = movies_IDs, x = ratings)

In [47]:
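# ratings lost when building X; a nonzero count means some (user, movie) pairs occur more than once
length(users) - length(which(X != 0))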

In [53]:
# users who rated the same title more than once
users[which(duplicated(cbind(users, movies_IDs)))]
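
The two checks above look for repeated (user, movie) pairs. The sketch below shows why they matter, on hypothetical toy triplets and assuming that `Incomplete` builds its matrix with `Matrix::sparseMatrix` (worth checking in the package source): `sparseMatrix` adds up the values of repeated pairs by default, so two ratings of the same title by the same user would become a single, inflated rating. Averaging the duplicates first is one possible fix among others.

```r
library(Matrix)

# Hypothetical toy triplets: user 1 rated movie 1 twice (3 and 4), user 2 rated movie 2 once
i = c(1, 1, 2)
j = c(1, 1, 2)
x = c(3, 4, 5)

M = sparseMatrix(i = i, j = j, x = x)
M[1, 1]     # the two ratings of the repeated pair are added together
nnzero(M)   # number of stored entries, smaller than the number of input ratings

# One possible fix (an assumption, not the notebook's method):
# average repeated (user, movie) pairs before building the matrix
dedup = aggregate(x ~ i + j, data = data.frame(i, j, x), FUN = mean)
M_dedup = sparseMatrix(i = dedup$i, j = dedup$j, x = dedup$x)
M_dedup[1, 1]
```

Applied to the notebook's data, the same `aggregate` step would be run on `users`, `movies_IDs` and `ratings` before calling `Incomplete`.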

We can also easily go back to the usual matrix format this way:

In [8]:
as.matrix(X)[1:6,1:6]

| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
|  |  |  |  |  |  |
|  |  |  |  |  |  |
|  |  |  |  |  |  |
|  |  |  |  |  |  |
|  |  |  |  |  |  |
|  |  |  | 4.0 |  |  |


softImpute imputes the missing values by nuclear-norm-penalised matrix completion, as seen in class. For example:

In [9]:
res = softImpute(X)
Xhat = complete(X,res)

“Convergence not achieved by 100 iterations”
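
For reference, the nuclear-norm-penalised problem mentioned above is usually written as follows, where $\Omega$ denotes the set of observed (user, movie) pairs, $\lambda \ge 0$ is the penalty weight and $\|M\|_*$ is the sum of the singular values of $M$ (this notation is ours, not taken from the notebook):

$$
\min_{M} \; \frac{1}{2} \sum_{(i,j) \in \Omega} \left( X_{ij} - M_{ij} \right)^2 + \lambda \, \|M\|_*
$$

A larger $\lambda$ shrinks more singular values to zero and hence gives a lower-rank completed matrix, which makes $\lambda$ a natural hyperparameter to tune.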


In [10]:
Xhat[1:6,1:6] # that's the complete (imputed) matrix

| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1.258584 | 0.0333785 | 0.9431008 | 1.147716 | 0.2131112 | 0.008206607 |
| 2.65827 | 0.19796135 | 0.6565478 | 2.220429 | 0.3899051 | 0.032224479 |
| 2.71192 | 0.19258368 | 0.7679956 | 2.280219 | 0.4022016 | 0.03177981 |
| 3.841426 | 0.05550154 | 3.3643799 | 3.57714 | 0.6723604 | 0.019630053 |
| 3.129796 | 0.21495594 | 0.962844 | 2.643245 | 0.467626 | 0.035823556 |
| 2.528855 | 0.14349134 | 1.0942811 | 4.0 | 0.3921004 | 0.025417957 |
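
To make the idea behind this imputation concrete, here is a minimal base R sketch of the soft-thresholded SVD iteration on a tiny matrix. It is an illustration only, not the package's implementation (which, as far as we know, relies on faster alternating least squares updates); the function `soft_impute_sketch` and the matrix `X_toy` are hypothetical names introduced for this example.

```r
# Illustrative soft-thresholded SVD imputation (hypothetical helper, not the package's code)
soft_impute_sketch = function(X, lambda, n_iter = 200) {
  observed = !is.na(X)       # mask of observed entries
  Z = X
  Z[!observed] = 0           # start with zeros in the missing entries
  M = Z
  for (iter in seq_len(n_iter)) {
    s = svd(Z)
    d = pmax(s$d - lambda, 0)                          # soft-threshold the singular values
    M = s$u %*% diag(d, nrow = length(d)) %*% t(s$v)   # low-rank reconstruction
    Z[!observed] = M[!observed]                        # keep observed values, refresh the imputed ones
  }
  M
}

# Hypothetical 4 x 3 rating matrix with missing entries
X_toy = matrix(c(5, 4, NA,
                 4, NA, 2,
                 NA, 3, 1,
                 5, 4, NA), nrow = 4, byrow = TRUE)

M_small = soft_impute_sketch(X_toy, lambda = 0.5)   # mild shrinkage
M_large = soft_impute_sketch(X_toy, lambda = 1e6)   # every singular value is thresholded to zero
```

With a very large `lambda` the fitted matrix is identically zero, and with a small `lambda` it stays close to the observed ratings; the interesting values lie in between and are best chosen on a validation set.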


# Assignment

Play around with softImpute, and implement very simple baselines as competitors (e.g. using constant imputations). In particular, you should try to create a validation set and a test set to assess the quality of the imputation schemes. What is your best model? What are the best hyperparameters? Write a report as a notebook or as a PDF (with accompanying code).

In [41]:
n_tot = length(users)
ind_train = sample(n_tot, floor(n_tot*0.95)) # a random 95% of the observed ratings is used for training

In [42]:
Xtrain = Incomplete(i = users[ind_train], j = movies_IDs[ind_train], x = ratings[ind_train])

In [77]:
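# more iterations than the 100 that did not suffice above; lambda is the nuclear norm penalty weight
res = softImpute(Xtrain, maxit = 300, lambda = 0.1)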
Xhat = complete(Xtrain,res)

In [78]:
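# linear indices of the ratings observed in X but held out of Xtrain;
# the comparison assumes that X and Xtrain have the same dimensions
ind_test = which(Xtrain != X)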

In [79]:
mean((X[ind_test] - Xhat[ind_test])^2) # test MSE of the softImpute predictions

In [72]:
mean_rating

In [73]:
# test MSE of the constant baseline (mean_rating was computed on all ratings, test set included)
mean((X[ind_test] - mean_rating)^2)
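
The cells above use a single train/test split and the global mean as the only baseline. As a possible next step towards the validation set the assignment asks for, the sketch below splits hypothetical toy data (standing in for `users`, `movies_IDs` and `ratings`) into train, validation and test indices, and compares two constant baselines on the validation set. The 80/10/10 proportions, the seed and all `toy_*` names are arbitrary choices made for this example.

```r
set.seed(1)  # arbitrary seed, only to make the toy example reproducible

# Hypothetical toy data standing in for users, movies_IDs and ratings
n = 200
toy_users   = sample(1:20, n, replace = TRUE)
toy_movies  = sample(1:15, n, replace = TRUE)
toy_ratings = sample(seq(0.5, 5, by = 0.5), n, replace = TRUE)

# 80% train, 10% validation, 10% test (proportions are an arbitrary choice)
perm      = sample(n)
ind_train = perm[1:160]
ind_val   = perm[161:180]
ind_test  = perm[181:200]

mse = function(truth, pred) mean((truth - pred)^2)

# Baseline 1: global mean, computed on the training ratings only
global_mean = mean(toy_ratings[ind_train])

# Baseline 2: per-movie training mean, falling back to the global mean for unseen movies
movie_means = tapply(toy_ratings[ind_train], toy_movies[ind_train], mean)
predict_movie_mean = function(movie_ids) {
  pred = as.numeric(movie_means)[match(as.character(movie_ids), names(movie_means))]
  pred[is.na(pred)] = global_mean
  pred
}

# Compare models and hyperparameters on the validation set ...
val_mse_global = mse(toy_ratings[ind_val], global_mean)
val_mse_movie  = mse(toy_ratings[ind_val], predict_movie_mean(toy_movies[ind_val]))

# ... and touch the test set only once, for the model finally selected
test_mse_global = mse(toy_ratings[ind_test], global_mean)
```

Computing the means on the training ratings only avoids leaking validation or test information into the baselines. The same validation set can then be used to choose `lambda` (and the rank) for softImpute before reporting a single test error.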