# The objectives of the lab
This practical session shows how to build a simple recommender system on the MovieLens dataset.

Packages

In [6]:
# Import required libraries
import pandas as pd
import numpy as np
import scipy
import scipy.sparse

> ## Exercise 1 (SVD and weighted SVD on MovieLens)

### 1. The MovieLens Dataset

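(a) What is the MovieLens dataset? You can look at https://grouplens.org/datasets/movielens/

MovieLens is a family of movie-rating datasets collected and published by the GroupLens research group at the University of Minnesota. Each record links a user, a movie, and the rating that the user gave to the movie. The datasets are widely used for building and evaluating recommender systems.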


(b) Why is it preferable to begin with the MovieLens 100K Dataset?

The MovieLens 100K Dataset is one of the smallest releases: 100,000 ratings from 943 users on 1,682 movies. It fits easily in memory, so experiments run quickly, which makes it a good starting point for learning.

(c) Download the MovieLens 100K dataset (the archive ml-100k.zip).

In [7]:
# Define data directory and shape
data_dir = "ml-100k/"
data_shape = (943, 1682)

# Load the MovieLens 100K dataset into a DataFrame
df = pd.read_csv(data_dir + "u.data", sep="\t", header=None)
values = df.values

df.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   0       100000 non-null  int64
 1   1       100000 non-null  int64
 2   2       100000 non-null  int64
 3   3       100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


(d) What are the differences between the table df and the rating matrix M?

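'df' is a DataFrame in long format: one row per rating, with columns for the user ID, the movie ID, the rating, and the timestamp (100,000 rows in total). 'M' is the user-by-movie rating matrix: one row per user, one column per movie, and entry (u, i) holds the rating user u gave to movie i. Most user-movie pairs have no rating (100,000 ratings for 943 × 1,682 possible pairs, about 6%), so 'M' is stored as a sparse matrix.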
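
To make the difference concrete, here is a minimal sketch on a made-up table of four ratings. The names `toy_df` and `toy_M` and the 0-based IDs are illustrative assumptions, not part of the lab data.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy long-format table (hypothetical data): one row per rating, 0-based IDs
toy_df = pd.DataFrame(
    {"user": [0, 0, 1, 2], "item": [0, 2, 1, 2], "rating": [5, 3, 4, 1]}
)

# The same information as a 3 x 3 user-by-item matrix; unrated pairs are not stored
toy_M = sp.csr_matrix(
    (
        toy_df["rating"].to_numpy(dtype=np.float64),
        (toy_df["user"].to_numpy(), toy_df["item"].to_numpy()),
    ),
    shape=(3, 3),
)

# Dense view, only sensible here because the toy matrix is tiny
dense_view = toy_M.toarray()
print(dense_view)
```

The table has as many rows as there are ratings, while the matrix has one cell per user-item pair, whether or not a rating exists.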


### 2. Data preprocessing

(a) Use scipy sparse type to obtain a (sparse) rating matrix M

In [9]:
import scipy.sparse as sp

# 'values' holds one row per rating: user ID, movie ID, rating, timestamp

# User and movie IDs in u.data start at 1. They are used directly as row and
# column indices, so the matrix needs (max ID + 1) rows and columns;
# row 0 and column 0 stay empty.
num_users = np.max(values[:, 0]) + 1
num_movies = np.max(values[:, 1]) + 1

# Create a new 'data_shape' based on the largest user and movie IDs
data_shape = (num_users, num_movies)

# Create a CSR matrix of ratings (np.float64 replaces the alias np.float,
# which was deprecated in NumPy 1.20 and later removed)
M = sp.csr_matrix((values[:, 2], (values[:, 0], values[:, 1])), dtype=np.float64, shape=data_shape)


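(b) How is the missing data coded in the sparse matrix?

Missing ratings are simply not stored. The sparse matrix keeps only the observed entries, and any other position reads back as an implicit zero. Since valid ratings range from 1 to 5, a zero can only mean "no rating".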
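
The sketch below illustrates this on a hypothetical 2 × 3 matrix with three ratings (all names and values are made up for illustration): only the observed entries are stored, and the count of missing entries follows from `nnz`.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy ratings: (row, col) -> rating
ratings = np.array([4.0, 2.0, 5.0])
rows = np.array([0, 1, 1])
cols = np.array([1, 0, 2])
toy = sp.csr_matrix((ratings, (rows, cols)), shape=(2, 3))

stored = toy.nnz                       # number of explicitly stored ratings
total = toy.shape[0] * toy.shape[1]    # number of user-movie pairs
missing = total - stored               # pairs without a rating

# Reading a position that was never stored gives an implicit zero
unrated_value = toy[0, 0]

# Boolean mask of the observed entries (dense, fine for a toy matrix)
observed_mask = toy.toarray() != 0
```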


(c) Split the data into two matrices. Use 90% for training and 10% for testing.

In [10]:
# Define the percentage for training data
train_percentage = 0.9

# Get the number of ratings (rows of 'values')
num_ratings = len(values)

# Calculate the number of training samples based on the percentage
num_train_samples = int(train_percentage * num_ratings)

# Shuffle the rating indices; the seed (an arbitrary choice) makes the split reproducible
np.random.seed(0)
indices = np.arange(num_ratings)
np.random.shuffle(indices)

# Split the shuffled indices into training and testing indices
train_indices = indices[:num_train_samples]
test_indices = indices[num_train_samples:]

# Use the indices to split the ratings into training and testing sets
training_data = values[train_indices]
testing_data = values[test_indices]

# Build the two sparse rating matrices, with the same shape and indexing as M
M_train = sp.csr_matrix((training_data[:, 2], (training_data[:, 0], training_data[:, 1])), dtype=np.float64, shape=M.shape)
M_test = sp.csr_matrix((testing_data[:, 2], (testing_data[:, 0], testing_data[:, 1])), dtype=np.float64, shape=M.shape)


(d) Compute the global mean of the ratings given by the users to the movies

In [11]:
global_mean = M.sum() / M.nnz
global_mean

3.52986
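
The division by `M.nnz` rather than by the total number of cells matters because unrated cells read as zero. A minimal sketch on a hypothetical 2 × 3 matrix (values made up for illustration) shows the difference:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy matrix: three observed ratings, three missing entries
toy = sp.csr_matrix(np.array([[5.0, 0.0, 3.0],
                              [0.0, 4.0, 0.0]]))

# Mean over the observed ratings only: (5 + 3 + 4) / 3
mean_observed = toy.sum() / toy.nnz

# Mean over every cell, counting missing entries as zeros: 12 / 6
mean_all_cells = toy.mean()

print(mean_observed, mean_all_cells)
```

Averaging over all cells pulls the result towards zero, so it does not describe how users actually rate.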

(e) Center the data and compute the test error when predicting missing values by the mean.

In [12]:

from sklearn.metrics import mean_squared_error

# Center the ratings by subtracting the global mean.
# 'values' is an int64 array, so cast to float first; subtracting a float
# in place from an integer array raises a UFuncTypeError.
centered_data = values.astype(np.float64)
centered_data[:, 2] -= global_mean

# Extract the test data (assuming 'testing_data' contains your testing data)
test_users = testing_data[:, 0].astype(int)
test_movies = testing_data[:, 1].astype(int)
test_ratings = testing_data[:, 2]

# Predict missing values using the global mean
predicted_ratings = np.full(len(test_ratings), global_mean)

# Calculate the mean squared error (MSE) for the predictions
mse = mean_squared_error(test_ratings, predicted_ratings)

# Print the MSE as the test error
print(f"Test Error (MSE) when predicting missing values by the mean: {mse:.4f}")

### 3. Recommending using SVD

(a) Compute the first 20 factors of the SVD of the centered training data
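
One possible approach, sketched here on synthetic data rather than on the lab's training matrix, is a truncated SVD with `scipy.sparse.linalg.svds`. The matrix sizes, the random ratings, and the names `M_train_toy` and `M_centered` are assumptions made for illustration. Only the observed ratings are centered, so the missing entries remain zero, which after centering amounts to predicting the mean.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Synthetic stand-in for the training matrix: 50 users, 40 movies, ~20% observed
n_users, n_movies, n_factors = 50, 40, 20
mask = rng.random((n_users, n_movies)) < 0.2
random_ratings = rng.integers(1, 6, size=(n_users, n_movies))  # ratings in 1..5
M_train_toy = sp.csr_matrix(np.where(mask, random_ratings, 0).astype(np.float64))

# Center only the stored (observed) ratings; missing entries stay at zero
train_mean = M_train_toy.sum() / M_train_toy.nnz
M_centered = M_train_toy.copy()
M_centered.data -= train_mean

# svds computes the k largest singular triplets (k must be < min(matrix shape));
# sort them explicitly so that singular values are in decreasing order
U, s, Vt = svds(M_centered, k=n_factors)
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

print(U.shape, s.shape, Vt.shape)
```

Column `f` of `U` and row `f` of `Vt` hold the user and movie factors associated with the singular value `s[f]`.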

(b) Predict the missing test values using the SVD with an increasing number of components (up to 20).
Evaluate the performance of this approach on the test matrix and plot the resulting performance
as a function of the number of SVD factors used to perform the reconstruction.
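
A hedged sketch of the evaluation loop follows, again on synthetic random ratings, so the resulting error curve is not informative about MovieLens. The sizes, the 30% density, and all variable names are illustrative assumptions. For each number of factors k, a test rating is predicted as the training mean plus the rank-k reconstruction at that position.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
n_users, n_movies, max_factors = 60, 50, 20

# Synthetic observed (user, movie) pairs and their ratings in 1..5
observed = np.argwhere(rng.random((n_users, n_movies)) < 0.3)
ratings = rng.integers(1, 6, size=len(observed)).astype(np.float64)

# 90% / 10% split of the observed ratings
perm = rng.permutation(len(observed))
n_train = int(0.9 * len(observed))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Centered sparse training matrix (mean computed on training ratings only)
train_mean = ratings[train_idx].mean()
M_train_centered = sp.csr_matrix(
    (ratings[train_idx] - train_mean,
     (observed[train_idx, 0], observed[train_idx, 1])),
    shape=(n_users, n_movies),
)

# Truncated SVD, sorted by decreasing singular value
U, s, Vt = svds(M_train_centered, k=max_factors)
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

test_u, test_i = observed[test_idx, 0], observed[test_idx, 1]
test_mse = []
for k in range(1, max_factors + 1):
    # Rank-k prediction per test pair: mean + sum over f < k of U[u, f] * s[f] * Vt[f, i]
    pred = train_mean + np.sum(U[test_u, :k] * s[:k] * Vt[:k, test_i].T, axis=1)
    test_mse.append(np.mean((ratings[test_idx] - pred) ** 2))

print(test_mse)
```

The list `test_mse` can then be plotted against `range(1, max_factors + 1)`, for example with matplotlib, to obtain the requested curve.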