# The MovieLens Dataset

[MovieLens](https://movielens.org/) is a non-commercial web-based movie recommender system, created in 1997 by GroupLens, a research lab at the University of Minnesota, in order to gather movie rating data for research purposes.


## Getting the Data


The MovieLens dataset is hosted on the [GroupLens](https://grouplens.org/datasets/movielens/) website. Several versions are available. We will use the latest small dataset, [ml-latest-small](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip).

## Custom Code

The custom packages soft_impute and functionsCF will need to be installed.

In [1]:
# Install the standard packages
%pip install numpy
%pip install pandas
%pip install fancyimpute

Note: you may need to restart the kernel to use updated packages.


Google Colab connection to Google Drive. See "External data: Local Files, Drive, Sheets, and Cloud Storage":
https://colab.research.google.com/notebooks/io.ipynb

In [2]:
# mount drive
#from google.colab import drive
#drive.mount('/content/drive/')

In [3]:
# location of custom packages: soft_impute , functionsCF, and dataset ratings.csv
# CollaborativeFiltering folder in google drive
import sys
sys.path.append('/Users/eneas/Desktop/MIT/MODULE-8/CollaborativeFiltering/')

In [4]:
# change the working directory
import os
os.chdir("/Users/eneas/Desktop/MIT/MODULE-8/CollaborativeFiltering/")

In [5]:
# Import necessary packages
import numpy as np # type: ignore
import pandas as pd # type: ignore
from fancyimpute import BiScaler # type: ignore
from soft_impute import SoftImpute
from functionsCF import GenerateTrainingSet

## Create the incomplete matrices for training and testing

In [6]:
# Read the MovieLens ratings from ratings.csv (the small MovieLens dataset).
# 100836 rows; columns: userId, movieId, rating, timestamp.
# We use the small dataset rather than the full one to speed up computation.
# Your results may vary depending on which MovieLens dataset is used; several are available online.
# Read in the values only
rating = pd.read_csv('ratings.csv', sep=',').values

In [7]:
# Here we only care about the ratings, so we keep only the first three columns, which contain user IDs, movie IDs, and ratings.
rating = rating[:,0:3]

In [8]:
# Show the top 5 rows
print(rating[:5, :])

[[ 1.  1.  4.]
 [ 1.  3.  4.]
 [ 1.  6.  4.]
 [ 1. 47.  5.]
 [ 1. 50.  5.]]


In [9]:
# Use all known information to create the incomplete matrix

# First, create an empty matrix
matrix_incomplete = np.zeros((len(np.unique(rating[:,0])), len(np.unique(rating[:,1]))))

# Second, since some movie IDs have no ratings, we only use the movies that have ratings.
# We renumber the movie IDs accordingly so that every column contains at least one rating.
# Create a sorted array of all movie IDs that appear in the data
usedID = np.unique(rating[:, 1])
# Replace each movie ID by its (1-based) position in the array we just created
for i in range(len(rating[:,1])):
    rating[:,1][i] = np.where(usedID==rating[:,1][i])[0][0] + 1

# Finally, we construct the incomplete matrix, in which the missing entries are NaN.
# Set all entries to NaN first
matrix_incomplete[:] = np.nan
# create the index pair of the components with ratings
indices = np.array(rating[:,0] - 1).astype(int), np.array(rating[:,1] - 1).astype(int)
# change the values in the corresponding positions to the known rating information
matrix_incomplete[indices] = rating[:,2]
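
The row-by-row loop above scans the ID array once per rating, which can be slow on larger datasets. As a rough sketch of an alternative (not part of the original notebook), `np.unique` with `return_inverse=True` yields the same positions in one pass. The toy ratings below are made up for illustration, and unlike the cell above this sketch also remaps the user IDs rather than assuming they are consecutive.

```python
import numpy as np

# Toy ratings (hypothetical values): columns are userId, movieId, rating
toy_rating = np.array([
    [1, 10, 4.0],
    [1, 50, 5.0],
    [2, 10, 3.0],
    [3, 70, 2.0],
])

# return_inverse gives, for every row, the 0-based position of its ID
# among the sorted unique IDs
user_ids, user_pos = np.unique(toy_rating[:, 0], return_inverse=True)
movie_ids, movie_pos = np.unique(toy_rating[:, 1], return_inverse=True)

# Start from an all-NaN matrix and fill in the known ratings
toy_matrix = np.full((len(user_ids), len(movie_ids)), np.nan)
toy_matrix[user_pos, movie_pos] = toy_rating[:, 2]
print(toy_matrix)
```

Here the positions are 0-based, so no `- 1` offset is needed when indexing the matrix.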

In [10]:
# Obtain the index pairs of the training set and the validation set, with 90% of the ratings used for training
train_indices, validation_indices = GenerateTrainingSet(rating[:,0], rating[:,1], 0.90)
# Then use the index pairs to create the incomplete training set
matrix_train = matrix_incomplete.copy()
matrix_train[:] = np.nan
matrix_train[train_indices] = matrix_incomplete[train_indices]
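
The source of `GenerateTrainingSet` is not shown in this notebook. Purely as an illustration of the idea, a random split of (row, column) index pairs might look like the sketch below. The function name `split_index_pairs`, its `seed` argument, and the use of 0-based indices are all assumptions, not the custom package's actual behavior.

```python
import numpy as np

def split_index_pairs(rows, cols, train_fraction, seed=0):
    """Hypothetical stand-in: randomly split 0-based (row, col) pairs."""
    rows = np.asarray(rows, dtype=int)
    cols = np.asarray(cols, dtype=int)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(rows))           # shuffle the rating positions
    n_train = int(round(train_fraction * len(rows)))
    train, valid = order[:n_train], order[n_train:]
    # Each result is a (row_indices, col_indices) tuple usable as matrix[...]
    return (rows[train], cols[train]), (rows[valid], cols[valid])

# Ten toy index pairs, all distinct
rows = np.arange(10) % 5
cols = np.arange(10) % 2
train_idx, valid_idx = split_index_pairs(rows, cols, 0.9)
print(len(train_idx[0]), len(valid_idx[0]))
```

Returning tuples of index arrays means the result can be used directly for NumPy fancy indexing, as `matrix_train[train_indices]` is above.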

## Run the softImpute model for collaborative filtering

In [11]:
# Create the BiScaler model with row and column scaling turned off
biscaler = BiScaler(scale_rows=False, scale_columns=False, max_iters=50, verbose=False)
# Center both rows and columns so that their observed entries have zero mean
matrix_train_normalized = biscaler.fit_transform(matrix_train)
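
To build intuition for what centering rows and columns of an incomplete matrix involves, here is a minimal NumPy sketch that alternates between subtracting row means and column means of the observed entries. It is not fancyimpute's implementation; the function name `center_rows_and_columns` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def center_rows_and_columns(X, n_iters=50):
    """Alternately remove row and column means, ignoring NaN entries."""
    X = X.copy()
    row_shift = np.zeros(X.shape[0])
    col_shift = np.zeros(X.shape[1])
    for _ in range(n_iters):
        r = np.nanmean(X, axis=1)      # mean of observed entries per row
        X -= r[:, None]
        row_shift += r
        c = np.nanmean(X, axis=0)      # mean of observed entries per column
        X -= c[None, :]
        col_shift += c
    return X, row_shift, col_shift

# Toy matrix in which every row and column has at least one observed entry
toy = np.array([[5.0, 3.0, np.nan],
                [4.0, np.nan, 1.0],
                [np.nan, 2.0, 2.0]])
centered, row_shift, col_shift = center_rows_and_columns(toy)

# Adding the shifts back undoes the centering, analogous to inverse_transform
restored = centered + row_shift[:, None] + col_shift[None, :]
```

Keeping the accumulated shifts is what makes the transformation invertible, which is why the predictions can later be mapped back to the original rating scale.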

In [12]:
# Use softImpute to complete the matrix. J is the number of archetypes, random_seed is the
# seed for the internal random number generator, and verbose controls whether algorithm logs are printed.
softImpute = SoftImpute(J = 4, maxit = 200, random_seed = 1, verbose = False)

In [13]:
# Run the softImpute model on the normalized training set
matrix_train_softImpute = softImpute.fit(matrix_train_normalized)
# Use the softImpute model to create the predicted matrix. If copyto is set to True, the
# values of matrix_train_normalized are overwritten directly
matrix_train_filled_normalized = matrix_train_softImpute.predict(matrix_train_normalized, copyto = False)
# Inverse transformation to undo the scaling
matrix_train_filled = biscaler.inverse_transform(matrix_train_filled_normalized)
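
The internals of the custom `soft_impute` package are not shown here. As a rough sketch of the general soft-impute idea, the code below repeatedly fills the missing entries with the current low-rank estimate, takes an SVD, and shrinks the singular values. The function name `soft_impute_sketch`, the `shrinkage` parameter, and the synthetic data are assumptions for illustration; the custom package may use a different routine.

```python
import numpy as np

def soft_impute_sketch(X, rank, shrinkage, n_iters=200):
    """Fill NaN entries via a repeated soft-thresholded, truncated SVD."""
    observed = ~np.isnan(X)
    filled = np.where(observed, X, 0.0)          # start with zeros in the gaps
    for _ in range(n_iters):
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        d = np.maximum(d[:rank] - shrinkage, 0.0)  # soft-threshold singular values
        low_rank = (U[:, :rank] * d) @ Vt[:rank]
        # Keep the known entries, replace the gaps with the current estimate
        filled = np.where(observed, X, low_rank)
    return low_rank

# Synthetic rank-2 matrix with roughly 30% of its entries hidden
rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
X = truth.copy()
X[rng.random(truth.shape) < 0.3] = np.nan

completed = soft_impute_sketch(X, rank=2, shrinkage=0.1)
print(completed.shape)
```

The `rank` argument plays a role similar to `J` above: it caps the number of archetypes in the completed matrix.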

## Analysis of the predicted ratings

### Out-of-sample R^2

In [14]:
# Create the baseline prediction: the average of all ratings in the training set
train_average = np.average(matrix_train[train_indices])

In [15]:
# Calculate out-of-sample R2 and in-sample R2
# Your results may vary from the lesson due to the dataset size and the training/validation split.
validation_mse = ((matrix_train_filled[validation_indices] - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse = ((matrix_train_filled[train_indices] - matrix_incomplete[train_indices]) ** 2).mean()
validation_mse_baseline = ((train_average - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse_baseline = ((train_average - matrix_incomplete[train_indices]) ** 2).mean()
print("out-of-sample R2: %.4f, in-sample R2: %.4f." % (1 - validation_mse / validation_mse_baseline, 1 - training_mse / training_mse_baseline))

out-of-sample R2: 0.2040, in-sample R2: 0.6140.
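
The R2 values above compare the model's mean squared error with that of the baseline that always predicts the average training rating: R2 = 1 - MSE_model / MSE_baseline. A small worked example with made-up numbers (the helper name `r_squared` is hypothetical):

```python
import numpy as np

def r_squared(predicted, actual, baseline):
    """R2 relative to a constant baseline prediction."""
    mse = np.mean((predicted - actual) ** 2)
    mse_baseline = np.mean((baseline - actual) ** 2)
    return 1 - mse / mse_baseline

actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.5, 2.5])
baseline = 3.5  # e.g. the average training rating

# Model MSE = 0.1875, baseline MSE = 1.25, so R2 = 1 - 0.15 = 0.85
r2 = r_squared(predicted, actual, baseline)
print("R2: %.4f" % r2)
```

An R2 of 0 means the model does no better than the baseline, and a negative value means it does worse.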


### Get low-rank factors

In [16]:
# Obtain the ratings of each archetype
# Each row of this matrix corresponds to a movie and each column corresponds to an archetype
softImpute.v

array([[-0.01059553,  0.00738432,  0.00095827,  0.00763057],
       [ 0.00084638,  0.00592296, -0.00052429, -0.00275899],
       [-0.00217362,  0.01406614, -0.00250328,  0.01682713],
       ...,
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]],
      shape=(9724, 4))

In [17]:
softImpute.v.shape

(9724, 4)

In [18]:
# (Optional)
# Obtain the weights of archetypes of each user
# each row of this matrix corresponds to a user and each column corresponds to an archetype
weights = np.dot(softImpute.u, np.diagflat(softImpute.d).T)
weights

array([[ -9.28902628,  11.99637881,   3.96105647, -10.51264863],
       [  4.60952834,  14.17893705,  10.13304446,  -5.63247093],
       [ 67.66498595,  51.10879891,  19.09411836,  -5.31394527],
       ...,
       [ 42.60045155, -27.45724096,   3.70374048,  19.47398224],
       [ -0.17512687,   2.86434723,   1.7493947 ,  -4.97185285],
       [ -2.62538395,   8.48671393,  14.19970535,   7.42396582]],
      shape=(610, 4))

In [19]:
weights.shape

(610, 4)

In [20]:
# And then the predicted matrix is computed by the product of two low-rank matrices
new_prediction = np.dot(weights, softImpute.v.T)

In [21]:
# We can see that it matches the output of the code in the previous section
np.sum(np.abs(new_prediction - matrix_train_filled_normalized))

np.float64(7.019952411338923e-11)
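
The same identity can be checked on a small toy matrix with NumPy's SVD. This sketch assumes that `u`, `d`, and `v` play the role of left singular vectors, singular values, and right singular vectors; the variable names mirror the attributes above, but the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-2 toy matrix: 6 "users" by 5 "movies"
M = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
u, d, v = U[:, :2], s[:2], Vt[:2].T    # keep the two leading components

weights = u @ np.diag(d)               # per-user archetype weights, shape (6, 2)
reconstruction = weights @ v.T         # product of the two low-rank factors

# Total absolute difference should be at floating-point round-off level
print(np.abs(reconstruction - M).sum())
```

As in the notebook, the residual is not exactly zero, only negligibly small.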
