----

# **Model Based Collaborative Filtering**

## **Author**   :  **Muhammad Adil Naeem**

## **Contact**   :   **madilnaeem0@gmail.com**
<br>

----



### **Import Libraries**

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
from math import sqrt
import random
import scipy.sparse as sp
from scipy.sparse.linalg import svds

from sklearn.model_selection import  train_test_split
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

### **Load the Dataset**

In [23]:
df = pd.read_csv('/content/user_data_for_collaborative_filtring.csv.csv')

### **First 5 Rows of Dataset**

In [24]:
df.head()

| Unnamed: 0 | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


### **Counting Unique Users and Items**

This code counts the unique users and items (movies) in the DataFrame `df`:

- `u_users` stores the number of unique user IDs.
- `u_items` stores the number of unique item IDs.

These counts show the scale of the dataset and set the dimensions of the user-item matrix built later.

In [25]:
u_users = df.user_id.nunique()
u_items = df.item_id.nunique()

print('Num. of Users: '+ str(u_users))
print('Num of Items: '+str(u_items))

Num. of Users: 651
Num of Items: 1546
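The unique counts say how many users and items exist, but not whether the raw IDs run contiguously from 1. A minimal sketch of that check follows, using a hypothetical toy DataFrame (`toy` and its values are illustrative, not the notebook's data):

```python
import pandas as pd

# Hypothetical toy ratings, not the notebook's dataset
toy = pd.DataFrame({
    "user_id": [1, 1, 5, 9],
    "item_id": [10, 20, 10, 40],
    "rating":  [3, 4, 2, 5],
})

n_unique_users = toy.user_id.nunique()   # number of distinct users
max_user_id = toy.user_id.max()          # largest raw user ID

# If the largest ID exceeds the unique count, the IDs have gaps
ids_have_gaps = max_user_id > n_unique_users
print(n_unique_users, max_user_id, ids_have_gaps)
```

When the IDs have gaps, using `user_id - 1` directly as a row index into a matrix sized by the unique count can run out of bounds.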


### **Split Data into Train and Test Sets**

In [26]:
train_data, test_data = train_test_split(df, test_size=0.2)
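The split above is random, so the reported numbers can change between runs. A small sketch of a reproducible split, assuming a fixed seed is acceptable (the seed `42` and the toy DataFrame are arbitrary choices for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 ratings
toy = pd.DataFrame({
    "user_id": range(1, 11),
    "item_id": range(101, 111),
    "rating": [3] * 10,
})

# random_state fixes the shuffle, so both calls give the same split
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```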

### **Creating User-Item Interaction Matrices with Bounds Checking**

- This code initializes two user-item interaction matrices, `train_data_matrix` and `test_data_matrix`, filled with zeros, then fills them with the ratings from `train_data` and `test_data`. Each row is read by column name (`user_id`, `item_id`, `rating`) rather than by tuple position, because `itertuples()` places the DataFrame index first and the `Unnamed: 0` column second, so positional access would pick up the wrong fields. A rating is written only when both IDs fit inside the matrix, which prevents index errors; rows whose IDs exceed `u_users` or `u_items` are skipped.

In [27]:
train_data_matrix = np.zeros((u_users, u_items))
for line in train_data.itertuples():
    # Write the rating only if both 1-based IDs fall inside the matrix
    if line.item_id <= u_items and line.user_id <= u_users:
        train_data_matrix[line.user_id - 1, line.item_id - 1] = line.rating

test_data_matrix = np.zeros((u_users, u_items))
for line in test_data.itertuples():
    # Write the rating only if both 1-based IDs fall inside the matrix
    if line.item_id <= u_items and line.user_id <= u_users:
        test_data_matrix[line.user_id - 1, line.item_id - 1] = line.rating
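A bounds check skips any row whose ID is larger than the matrix dimension, which can discard ratings when the raw IDs are not contiguous. One possible alternative, sketched here on hypothetical toy data rather than the notebook's DataFrame, is to map each raw ID to a 0-based index with `pd.factorize` so that every rating has a valid cell:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with gaps in the raw IDs
toy = pd.DataFrame({
    "user_id": [1, 1, 5, 9],
    "item_id": [10, 20, 10, 40],
    "rating":  [3, 4, 2, 5],
})

# factorize assigns each distinct ID a 0-based code in order of first appearance
user_codes, user_ids = pd.factorize(toy.user_id)
item_codes, item_ids = pd.factorize(toy.item_id)

# One row per distinct user, one column per distinct item
matrix = np.zeros((len(user_ids), len(item_ids)))
matrix[user_codes, item_codes] = toy.rating.to_numpy()
print(matrix)
```

In a train/test setting the codes would need to be computed on the full DataFrame before splitting, so that both matrices share the same row and column mapping.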

### **Calculating Root Mean Squared Error (RMSE)**

This function, `rmse`, computes the Root Mean Squared Error between predicted ratings and ground truth ratings.

1. It keeps only the positions where `ground_truth` is non-zero, i.e. where a rating actually exists, and selects those same positions from `prediction`.
2. The `flatten()` call ensures both selections are 1D arrays.
3. It then computes the mean squared error with `mean_squared_error` from `sklearn.metrics` and returns its square root.

This function is useful for evaluating the accuracy of rating predictions in recommendation systems.

In [29]:
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
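To make the masking concrete, here is a small worked example. It restates the same idea with NumPy only (the name `rmse_numpy` and the toy arrays are illustrative assumptions, not part of the notebook):

```python
import numpy as np

def rmse_numpy(prediction, ground_truth):
    # Score only the cells where a true rating exists (non-zero)
    mask = ground_truth.nonzero()
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Two real ratings (5 and 3); the zeros mean "not rated"
ground_truth = np.array([[5.0, 0.0],
                         [0.0, 3.0]])
prediction = np.array([[4.0, 2.0],
                       [1.0, 5.0]])

# Errors on rated cells are -1 and 2, so MSE = (1 + 4) / 2 = 2.5
score = rmse_numpy(prediction, ground_truth)
print(score)
```

The predictions of 2.0 and 1.0 at unrated positions do not affect the score.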

### **Calculating Sparsity Level of the Dataset**

This code calculates the sparsity of the MovieLens dataset:

1. It computes sparsity as 1 minus the ratio of observed ratings (`len(df)`) to the total number of possible ratings (the product of unique users `u_users` and unique items `u_items`).
2. The result is rounded to three decimal places.
3. Finally, it prints the sparsity level as a percentage.

This information helps understand how much of the rating matrix is filled versus empty, which is important for evaluating the effectiveness of recommendation algorithms.

In [30]:
sparsity = round(1.0 - len(df) / float(u_users * u_items), 3)
print('The sparsity level of MovieLens is ' +  str(sparsity * 100) + '%')

The sparsity level of MovieLens is 96.1%
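The same quantity can also be read directly off a rating matrix by counting its non-zero cells. A minimal sketch on a hypothetical 3x4 matrix:

```python
import numpy as np

# Hypothetical rating matrix: 0 means "not rated"
ratings = np.array([
    [5, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 4, 0, 0],
])

observed = np.count_nonzero(ratings)   # cells that hold a rating
total = ratings.size                   # users * items
sparsity = 1.0 - observed / total

# The .1% format multiplies by 100 and rounds to one decimal place
print(f"Sparsity: {sparsity:.1%}")
```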


### **Performing Singular Value Decomposition (SVD) for Collaborative Filtering**

This code snippet uses Singular Value Decomposition (SVD) to generate predictions for user ratings:

1. It factorizes `train_data_matrix` with a truncated SVD (`svds`), which returns the left singular vectors `u`, the singular values `s` (a 1D array), and the right singular vectors `vt`. Setting `k = 20` keeps the 20 largest singular values, i.e. 20 latent features.
2. A diagonal matrix `s_diag_matrix` is created from the singular values `s`.
3. The predicted ratings `X_pred` are calculated as the matrix product of `u`, `s_diag_matrix`, and `vt`.
4. Finally, it computes and prints the RMSE between the predicted ratings `X_pred` and the actual ratings in `test_data_matrix`.

This process helps evaluate the accuracy of the SVD-based collaborative filtering model.
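In matrix notation, the steps above amount to a rank-$k$ approximation of the training matrix. The symbols $m$ (number of users) and $n$ (number of items) are notation introduced here for illustration; $X$ stands for `train_data_matrix` and $\hat{X}$ for `X_pred`:

$$
X \approx \hat{X} = U_k \, \Sigma_k \, V_k^{\top},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(s_1, \dots, s_k) \in \mathbb{R}^{k \times k},\quad
V_k^{\top} \in \mathbb{R}^{k \times n},
\qquad k = 20 .
$$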

In [32]:
u,s,vt = svds(train_data_matrix, k = 20)
s_diag_matrix = np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 3.1509271755010344
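Because unrated cells are stored as zeros, a plain SVD treats them as ratings of 0, which pulls predictions toward zero. One common refinement, not used in this notebook, is to subtract each user's mean rating before factorizing and add it back afterwards. The sketch below illustrates the idea on a hypothetical 4x4 matrix with `k=2`; the data and the choice of `k` are assumptions for the example:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical ratings: 0 means "not rated"; every user has at least one rating
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

rated = ratings > 0
# Per-user mean computed over rated cells only
user_means = ratings.sum(axis=1) / rated.sum(axis=1)

# Center rated cells; unrated cells stay 0, i.e. "at the user's average"
centered = np.where(rated, ratings - user_means[:, None], 0.0)

# Truncated SVD on the centered matrix (k must be below min(centered.shape))
u, s, vt = svds(centered, k=2)
low_rank = u @ np.diag(s) @ vt

# Add the user means back to return to the rating scale
predicted = low_rank + user_means[:, None]
print(predicted.round(2))
```

With this variant an unrated cell defaults to the user's average rating rather than 0. Whether it lowers the RMSE on this dataset would need to be checked by running it.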
