# The MovieLens Dataset

[MovieLens](https://movielens.org/) is a non-commercial web-based movie recommender system, created in 1997 by GroupLens, a research lab at the University of Minnesota, in order to gather movie rating data for research purposes.


## Getting the Data


The MovieLens dataset is hosted on the [GroupLens](https://grouplens.org/datasets/movielens/) website. Several versions are available. We will use the latest small release, [ml-latest-small.zip](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip).
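As a minimal sketch, the archive can be fetched and read with the Python standard library alone. The URL is the one linked above. The helper names `download_archive` and `read_ratings` are hypothetical, and the code assumes the ratings are stored in a member whose name ends in `ratings.csv`.

```python
import csv
import io
import urllib.request
import zipfile

URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"


def download_archive(url=URL):
    """Download the zip archive and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_ratings(archive_bytes):
    """Return the rows of the first member named ratings.csv as a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # Assumption: the ratings file is stored as <folder>/ratings.csv.
        member = next(n for n in archive.namelist() if n.endswith("ratings.csv"))
        with archive.open(member) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Usage (commented out because it requires network access):
# ratings = read_ratings(download_archive())
# print(len(ratings), ratings[0])
```

All values come back as strings here; `pd.read_csv` on the extracted file, as used later in this notebook, converts them to numbers directly.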

## Custom Code

The custom modules `soft_impute` and `functionsCF` need to be installed, that is, made importable from this notebook, before the cells below will run.

In [69]:
# Install the standard packages
!pip install fancyimpute
!pip install numpy
!pip install pandas
# soft_impute is a custom module, not a PyPI package: `pip install SoftImpute`
# fails with "No matching distribution found for SoftImpute".

In [77]:
import numpy as np
import pandas as pd
from fancyimpute import BiScaler
from soft_impute import SoftImpute
from functionsCF import GenerateTrainingSet

Google Colab connection to Google Drive: see [External data: Local Files, Drive, Sheets, and Cloud Storage](https://colab.research.google.com/notebooks/io.ipynb).

## Create the incomplete matrices for training and testing

In [40]:
# Read the ratings file; point the path to where the data is stored.
# This run reads MusicRatings.csv, whose columns are userID, songID and rating
# (plus an unnamed index column). For comparison, the small MovieLens ratings file
# has 100836 rows and the columns userId, movieId, rating, timestamp.
# A small dataset is used rather than a full one to keep the run time short.
# Your results may vary depending on which dataset is used; several are available online.
rating_df = pd.read_csv('MusicRatings.csv', sep=',')
# Keep the values of the three data columns only, dropping the unnamed index column
rating = rating_df[['userID', 'songID', 'rating']].values
rating_df.head()


Unnamed: 0,userID,songID,rating
0,526,80,1.477121
1,1403,54,2.20412
2,556,80,1.30103
3,1036,54,1.477121
4,2352,80,1.30103


In [15]:
# Show the first 5 rows
print(rating[:5, :])

[[5.26000000e+02 8.00000000e+01 1.47712125e+00]
 [1.40300000e+03 5.40000000e+01 2.20411998e+00]
 [5.56000000e+02 8.00000000e+01 1.30103000e+00]
 [1.03600000e+03 5.40000000e+01 1.47712125e+00]
 [2.35200000e+03 8.00000000e+01 1.30103000e+00]]


In [19]:
# Use all known information to create the incomplete matrix

# First, create an empty matrix
matrix_incomplete = np.zeros((len(np.unique(rating[:,0])), len(np.unique(rating[:,1]))))
matrix_incomplete







array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [30]:
# Second, since some songs have no ratings, use only the songs that do.
# Renumber the song IDs so that every column of the matrix contains at least one rating.
# Create a sorted array of all song IDs that occur in the data
usedID = np.unique(rating[:, 1])
# Replace each song ID by its 1-based position in that array
rating[:, 1] = np.searchsorted(usedID, rating[:, 1]) + 1
print(rating)

[[5.26000000e+02 8.00000000e+01 1.47712125e+00]
 [1.40300000e+03 5.40000000e+01 2.20411998e+00]
 [5.56000000e+02 8.00000000e+01 1.30103000e+00]
 ...
 [3.84000000e+02 6.57000000e+02 1.00000000e+00]
 [2.30000000e+02 6.57000000e+02 1.00000000e+00]
 [1.38500000e+03 6.15000000e+02 1.00000000e+00]]


In [31]:
# Finally, construct the incomplete matrix: every entry starts as NaN (missing),
# and the known ratings are filled in afterwards.
matrix_incomplete[:] = np.nan
matrix_incomplete




array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])

In [41]:

# create the index pair of the components with ratings
indices = np.array(rating[:,0] - 1).astype(int), np.array(rating[:,1] - 1).astype(int)
print (indices)

(array([ 525, 1402,  555, ...,  383,  229, 1384]), array([ 79,  53,  79, ..., 656, 656, 614]))


In [42]:

# change the values in the corresponding positions to the known rating information
matrix_incomplete[indices] = rating[:,2]
matrix_incomplete

array([[       nan,        nan,        nan, ...,        nan,        nan,
               nan],
       [       nan, 1.60205999,        nan, ...,        nan,        nan,
               nan],
       [       nan,        nan,        nan, ..., 1.        ,        nan,
               nan],
       ...,
       [       nan,        nan,        nan, ...,        nan, 1.        ,
               nan],
       [       nan,        nan,        nan, ...,        nan,        nan,
               nan],
       [       nan, 1.30103   , 1.30103   , ...,        nan,        nan,
               nan]])
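The construction above can be illustrated on a toy example. This is a sketch rather than the notebook's own code: the triples are made up, and `np.unique(..., return_inverse=True)` is used to renumber both user and item IDs in one step.

```python
import numpy as np

# Toy (user ID, item ID, rating) triples; IDs are arbitrary and non-contiguous.
triples = np.array([
    [10, 500, 4.0],
    [10, 700, 3.0],
    [42, 500, 5.0],
    [77, 900, 1.0],
])

# return_inverse gives, for every triple, the 0-based position of its ID
# in the sorted array of unique IDs, which serves as the row or column index.
user_ids, row_idx = np.unique(triples[:, 0], return_inverse=True)
item_ids, col_idx = np.unique(triples[:, 1], return_inverse=True)

# Start from an all-NaN matrix so that "missing" differs from a rating of 0.
toy_matrix = np.full((len(user_ids), len(item_ids)), np.nan)
toy_matrix[row_idx, col_idx] = triples[:, 2]

print(toy_matrix)
```

Indexing with a pair of integer arrays, as in `toy_matrix[row_idx, col_idx]`, assigns one value per (row, column) pair, which is the same mechanism as `matrix_incomplete[indices]` above.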

## Split the ratings into training and validation sets

In [54]:
# Obtain the index pairs of the training set and the validation set, with ratio 70%
train_indices, validation_indices = GenerateTrainingSet(rating[:,0], rating[:,1], 0.70)
print (validation_indices)



(array([1582,  695,  581, ..., 1679,  484, 1384]), array([ 53,  79,  79, ..., 656, 209, 614]))


In [58]:
# Then use the index pairs to create the incomplete training set
matrix_train = np.full_like(matrix_incomplete, np.nan)
matrix_train[train_indices] = matrix_incomplete[train_indices]
print(train_indices)
print(matrix_train)


(array([   0,    0,    0, ..., 2420, 2420, 2420]), array([  9,  24,  25, ..., 790, 798, 803]))
[[       nan        nan        nan ...        nan        nan        nan]
 [       nan 1.60205999        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 ...
 [       nan        nan        nan ...        nan 1.                nan]
 [       nan        nan        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]]
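The source of `GenerateTrainingSet` is not shown in this notebook. The sketch below is a hypothetical stand-in, named `split_index_pairs`, that assumes the function shuffles the observed entries, takes 1-based IDs and returns two disjoint tuples of 0-based index arrays. The toy data is made up.

```python
import numpy as np


def split_index_pairs(user_ids, item_ids, train_fraction, seed=0):
    """Hypothetical stand-in for GenerateTrainingSet (not the original code).

    Takes 1-based user and item IDs, shuffles the observed entries and
    returns two (rows, cols) tuples of 0-based indices.
    """
    rows = np.asarray(user_ids).astype(int) - 1
    cols = np.asarray(item_ids).astype(int) - 1
    order = np.random.default_rng(seed).permutation(len(rows))
    n_train = int(round(train_fraction * len(rows)))
    train, valid = order[:n_train], order[n_train:]
    return (rows[train], cols[train]), (rows[valid], cols[valid])


# Toy data: 10 observed ratings in a 4 x 5 matrix.
users = np.array([1, 1, 1, 2, 2, 3, 3, 4, 4, 4])
items = np.array([1, 2, 5, 2, 3, 1, 4, 3, 4, 5])
full = np.full((4, 5), np.nan)
full[users - 1, items - 1] = np.arange(1.0, 11.0)

train_idx, valid_idx = split_index_pairs(users, items, 0.70)

# Keep only the training entries; validation entries stay NaN.
train = np.full_like(full, np.nan)
train[train_idx] = full[train_idx]
```

Because the validation entries are NaN in `train`, a model fitted on it never sees their ratings, which is what makes the later out-of-sample comparison meaningful.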


## Run the softImpute model for collaborative filtering

In [61]:
# Create the BiScaler model: center rows and columns, but do not rescale them
biscaler = BiScaler(scale_rows=False, scale_columns=False, max_iters=50, verbose=False)
# Center both rows and columns so that they have zero mean
matrix_train_normalized = biscaler.fit_transform(matrix_train)
print(matrix_train_normalized)

[[        nan         nan         nan ...         nan         nan
          nan]
 [        nan  0.1864106          nan ...         nan         nan
          nan]
 [        nan         nan         nan ...         nan         nan
          nan]
 ...
 [        nan         nan         nan ...         nan -0.07178043
          nan]
 [        nan         nan         nan ...         nan         nan
          nan]
 [        nan         nan         nan ...         nan         nan
          nan]]
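To make the centering step concrete, here is a small NumPy sketch that alternately removes row and column means over the observed entries. It only illustrates the idea and is not fancyimpute's implementation; the function name `center_rows_and_columns` and the toy matrix are assumptions.

```python
import numpy as np


def center_rows_and_columns(X, n_iters=50):
    """Alternately remove row and column means computed over observed entries.

    Illustrative only; this is not fancyimpute's implementation.
    Assumes every row and every column has at least one observed entry.
    """
    centered = X.astype(float).copy()
    row_means = np.zeros(X.shape[0])
    col_means = np.zeros(X.shape[1])
    for _ in range(n_iters):
        r = np.nanmean(centered, axis=1)   # mean of observed entries per row
        centered -= r[:, None]
        row_means += r
        c = np.nanmean(centered, axis=0)   # mean of observed entries per column
        centered -= c[None, :]
        col_means += c
    return centered, row_means, col_means


X = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, 1.0],
    [np.nan, 2.0, 4.0],
])
centered, row_means, col_means = center_rows_and_columns(X)

# Undo the centering, analogous to applying inverse_transform afterwards.
restored = centered + row_means[:, None] + col_means[None, :]
```

Missing entries stay NaN throughout, so the centered matrix can be passed on to a completion method, and the stored means are added back once the gaps are filled.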


In [79]:
# Use softImpute to complete the matrix. J is the number of archetypes, random_seed is the
# seed for the internal random number generator, and verbose controls whether the algorithm
# prints its logs.
softImpute = SoftImpute(J = 6, maxit = 200, random_seed = 2033, verbose = False)
print (SoftImpute)

<class 'soft_impute.SoftImpute'>


In [81]:
# Fit the softImpute model on the normalized training set
matrix_train_softImpute = softImpute.fit(matrix_train_normalized)
# Use the fitted model to create the predicted matrix. With copyto=True, predict would
# directly overwrite the values of matrix_train_normalized.
matrix_train_filled_normalized = matrix_train_softImpute.predict(matrix_train_normalized, copyto = False)
# Apply the inverse transformation to undo the centering
matrix_train_filled = biscaler.inverse_transform(matrix_train_filled_normalized)
print(matrix_train_filled)

[[1.21600715 1.25536518 1.47212232 ... 1.31485753 1.18089782 1.2353703 ]
 [1.24899258 1.57515599 1.21975061 ... 1.41316361 1.49086655 1.42795362]
 [1.03132676 1.05681022 1.16823105 ... 1.12959944 1.04616418 1.02319846]
 ...
 [0.97755768 1.10467214 1.13896602 ... 1.13068744 1.13790166 1.13589428]
 [1.02558976 1.54378419 1.03825976 ... 1.08339341 1.05776299 1.02531032]
 [1.10799479 0.95611052 1.37710849 ... 1.16695131 0.9672538  0.98262133]]
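The custom `soft_impute` module is not listed here, so the following is only a generic sketch of the soft-impute idea: repeatedly fill the gaps with the current estimate, take an SVD, shrink the singular values and keep at most `rank` of them. The function name `soft_impute_sketch`, the shrinkage parameter `lam` and the toy data are assumptions and do not describe the module's actual interface.

```python
import numpy as np


def soft_impute_sketch(X, rank, lam, n_iters=100):
    """Fill NaNs by repeated rank-limited SVD with soft-thresholded singular values.

    A generic sketch of the soft-impute idea, not the custom soft_impute module.
    """
    observed = ~np.isnan(X)
    filled = np.where(observed, X, 0.0)  # start with zeros in the gaps
    for _ in range(n_iters):
        u, d, vt = np.linalg.svd(filled, full_matrices=False)
        d = np.maximum(d[:rank] - lam, 0.0)  # shrink and truncate
        low_rank = (u[:, :rank] * d) @ vt[:rank, :]
        # Keep the observed ratings, replace the gaps by the current estimate.
        filled = np.where(observed, X, low_rank)
    return low_rank


rng = np.random.default_rng(0)
truth = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))  # exact rank 2
X = truth.copy()
X[rng.random(truth.shape) < 0.3] = np.nan  # hide roughly 30% of the entries

completed = soft_impute_sketch(X, rank=2, lam=0.1)
```

The returned matrix has a value in every position and rank at most `rank`, which plays the role of `J` in the notebook's model.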


## Analysis of the predicted ratings

### Out-of-sample R^2

In [82]:
# Baseline method: predict the average training rating for every entry
train_average = np.average(matrix_train[train_indices])

In [83]:
# Calculate the out-of-sample R2 and the in-sample R2
# Your results may differ from the lesson because of the dataset size and the training/validation split.
validation_mse = ((matrix_train_filled[validation_indices] - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse = ((matrix_train_filled[train_indices] - matrix_incomplete[train_indices]) ** 2).mean()
validation_mse_baseline = ((train_average - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse_baseline = ((train_average - matrix_incomplete[train_indices]) ** 2).mean()
print("out-of-sample R2: %.4f, in-sample R2: %.4f." % (1 - validation_mse / validation_mse_baseline, 1 - training_mse / training_mse_baseline))

out-of-sample R2: 0.2150, in-sample R2: 0.5005.
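The R2 used above compares the model's mean squared error with that of a constant baseline prediction. A small worked example, with made-up numbers and a hypothetical helper `r_squared`, shows how the value behaves:

```python
import numpy as np


def r_squared(actual, predicted, baseline):
    """1 - MSE(model) / MSE(baseline), where baseline is a constant prediction."""
    actual = np.asarray(actual, dtype=float)
    mse_model = np.mean((np.asarray(predicted, dtype=float) - actual) ** 2)
    mse_baseline = np.mean((baseline - actual) ** 2)
    return 1.0 - mse_model / mse_baseline


actual = np.array([1.0, 2.0, 3.0, 4.0])
baseline = 2.0  # for example, the mean rating of a training set

perfect = r_squared(actual, actual, baseline)                  # model is exact
no_gain = r_squared(actual, np.full(4, baseline), baseline)    # model equals baseline
partial = r_squared(actual, [1.5, 2.5, 2.5, 3.5], baseline)    # every error is 0.5

print(perfect, no_gain, partial)
```

A value of 1 means perfect predictions, 0 means no improvement over the baseline, and a negative value means the model does worse than the baseline on that set.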


### Get low-rank factors

In [84]:
# Obtain the ratings of each archetype
# Each row of this matrix corresponds to a song and each column corresponds to an archetype
softImpute.v

array([[ 0.01130268, -0.01256651, -0.01176304, -0.00257681, -0.03984453,
         0.00707466],
       [ 0.01865882,  0.05297988,  0.07543221, -0.06974131,  0.02040987,
         0.00126844],
       [-0.00107772, -0.03433519,  0.03403756,  0.06043389, -0.00542998,
         0.01011918],
       ...,
       [-0.01658398, -0.02744602, -0.04068087, -0.0075534 , -0.01531666,
         0.01506819],
       [ 0.02519679,  0.02045499, -0.01701742, -0.03988662,  0.02134648,
         0.02375562],
       [ 0.00631929,  0.01641358, -0.00362012, -0.0124506 ,  0.02893854,
         0.00803139]])

In [None]:
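# Note: the saved output below, (9724, 4), appears to come from an earlier run on a different
# ratings file with J = 4. For the run shown here, the second dimension is 6 (J = 6).
softImpute.v.shape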

(9724, 4)

In [85]:
# (Optional)
# Obtain the weights of archetypes of each user
# each row of this matrix corresponds to a user and each column corresponds to an archetype
weights = np.dot(softImpute.u, np.diagflat(softImpute.d).T)
weights

array([[-1.082249  ,  1.95312619, -0.13735242,  1.85026885, -1.0425069 ,
         0.16457664],
       [ 0.41676406,  1.64198837, -1.67650686, -2.45048942,  1.1105772 ,
        -1.86570826],
       [ 0.83833701,  0.39334791, -0.77961523,  0.15377589, -0.80815855,
        -0.2096459 ],
       ...,
       [ 0.69027674,  1.70456985, -1.71644108,  0.56458211,  1.22690352,
        -0.80059785],
       [-2.76818097,  2.07274528,  2.43900735, -3.3481815 , -2.55072249,
         0.0824649 ],
       [ 1.49001277, -1.61773043,  0.12825277,  1.42889455, -1.54180159,
        -0.26881193]])

In [86]:
weights.shape

(2421, 6)

In [88]:
# The predicted matrix is then the product of the two low-rank matrices
new_prediction = np.dot(weights, softImpute.v.T)
print (new_prediction)

[[ 0.00277411 -0.07718673  0.04857538 ... -0.02559816 -0.07712584
  -0.02616783]
 [-0.04733793  0.15950659 -0.28689381 ... -0.01038956  0.14974541
   0.08331801]
 [ 0.04402436 -0.04981106 -0.02938525 ...  0.01507439  0.01407116
  -0.01240903]
 ...
 [-0.04943213 -0.04163655 -0.0983377  ... -0.02352502  0.06612123
   0.06059938]
 [ 0.02481821  0.42369377 -0.1728257  ... -0.04460079  0.01220082
  -0.02376632]
 [ 0.09151034 -0.17969281  0.15031013 ...  0.0232442  -0.09402128
  -0.08216822]]


In [89]:
# This matches the output of the code in the previous section: the total absolute
# difference is at the level of floating-point rounding error
np.sum(np.abs(new_prediction - matrix_train_filled_normalized))

1.1596555477798443e-11
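The same relationship between the factors can be checked on a toy matrix with NumPy's SVD. This is a sketch under the assumption that `u`, `d` and `v` play the roles of left singular vectors, singular values and right singular vectors; the matrix is random and the names mirror the notebook only for readability.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 4))  # 5 users x 4 items, rank 3

# Thin SVD: M = u @ diag(d) @ vt, with v = vt.T holding one row per item.
u, d, vt = np.linalg.svd(M, full_matrices=False)
J = 3
u, d, v = u[:, :J], d[:J], vt[:J, :].T

weights = u @ np.diag(d)   # one row per user, one column per archetype
rebuilt = weights @ v.T    # product of the two low-rank factors

print(np.abs(rebuilt - M).sum())
```

Because the toy matrix has rank 3, keeping `J = 3` components reproduces it up to rounding error, just as the difference printed above is close to zero.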

end of the note