# **Implementing a Matrix Factorization-based Recommender System**

## **Represent user and item by Matrix Factorization**
 - Users and items are represented through matrix factorization.
  - A user-item interaction matrix $R \in \mathbb{R}^{n \times m}$ is approximated as the product of two low-rank matrices: $R \approx P Q^T$, where $P \in \mathbb{R}^{n \times d}$ and $Q \in \mathbb{R}^{m \times d}$.
  - $ n $ is the number of users, $ m $ is the number of items, and $ d $ is the dimension of the embedding vectors.

**How to do Matrix Factorization**:
   - The goal is to find a good representation for users and items.
   - The objective is to minimize the squared differences between the predicted and actual interaction values: $ \min_{P,Q} \sum_{(u,i) \in R'} (r_{ui} - P_u Q_i^T)^2 $.
   - Not all elements in $ R $ are known; $ R' $ is the set of known elements in $ R $.
   - $ r_{ui} $ is the interaction record of user $ u $ and item $ i $.
   - $ P_u $ is the embedding vector for user $ u $, and $ Q_i $ is the embedding vector for item $ i $.
   - The predicted interaction score between user $ u $ and item $ i $ is $ \hat{r}_{ui} = P_u Q_i^T $.
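
To make the notation concrete, here is a minimal sketch on a toy problem. The sizes, the random seed, and the dictionary `known` are illustrative assumptions and not part of the assignment data. It computes the predicted matrix $P Q^T$ and the squared-error objective over the known entries only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the toy example is repeatable
n, m, d = 3, 4, 2                # toy sizes: 3 users, 4 items, 2-dim embeddings

P = rng.normal(scale=0.1, size=(n, d))  # one row per user
Q = rng.normal(scale=0.1, size=(m, d))  # one row per item

R_hat = P @ Q.T                  # predicted interactions, shape (n, m)

# Known entries R' as (user, item) -> observed value (illustrative values)
known = {(0, 1): 1.0, (1, 3): 1.0, (2, 0): 0.0}

# Squared-error objective summed over the known entries only
loss = sum((r - P[u] @ Q[i]) ** 2 for (u, i), r in known.items())
print(R_hat.shape, loss)
```

Note that `R_hat[u, i]` equals `P[u] @ Q[i]`, so a single prediction never requires building the full matrix.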

## **Requirements:**
In this practice, you will implement a recommender system using **Matrix Factorization**.
You should:
   - Construct a matrix factorization-based recommender system using the positive data `train_pos.npy` provided in project 3.
   - For each user-item pair $ u, i $ in `train_pos.npy`, $ R_{ui} = 1 $.
   - If a user-item pair $ u^*, i^* $ is not in `train_pos.npy`, $ R_{u^*i^*} = 0 $.
   - The task is to find a good embedding representation for each user and item.
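
The labeling rule above can be sketched as follows. The array `toy_pos` is a made-up stand-in for `train_pos.npy` (same three-column layout of user, item, label), so the sizes here are assumptions for illustration only.

```python
import numpy as np

# Toy stand-in for train_pos.npy: rows are (user, item, label)
toy_pos = np.array([[0, 2, 1], [1, 0, 1], [2, 1, 1], [2, 2, 1]])

n_user = toy_pos[:, 0].max() + 1
n_item = toy_pos[:, 1].max() + 1

# R[u, i] = 1 for observed pairs, 0 everywhere else
R = np.zeros((n_user, n_item), dtype=np.int8)
R[toy_pos[:, 0], toy_pos[:, 1]] = 1
print(R)
```

For the real dataset a dense matrix of this kind is still small enough to hold in memory, but most of its entries are zeros, which is why the workflow below suggests sampling negatives rather than iterating over all of them.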


## **Reference Workflow**:
   1. Load the data and construct an interaction matrix.
   2. Obtain the embedding representation for each user and item.
      - **Use the objective function above and optimize the embeddings via gradient descent.**
      - **Note: The number of negative samples is much larger than that of positive samples. You can sample some negative samples in each iteration instead of using all negative samples.**
   3. Validate the effectiveness of the model.
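
The negative-sampling note in step 2 could be implemented in several ways. One simple option, sketched here under the assumption of uniform rejection sampling, draws random (user, item) pairs and discards any that appear among the positives. The helper name `sample_negatives` and the toy positives are hypothetical.

```python
import numpy as np

def sample_negatives(pos_set, n_user, n_item, n_samples, rng):
    """Draw (user, item) pairs uniformly at random, rejecting observed pairs."""
    negatives = []
    while len(negatives) < n_samples:
        u = int(rng.integers(n_user))
        i = int(rng.integers(n_item))
        if (u, i) not in pos_set:
            negatives.append((u, i))
    return negatives

rng = np.random.default_rng(0)
pos_set = {(0, 2), (1, 0), (2, 1), (2, 2)}   # toy positives
negs = sample_negatives(pos_set, n_user=3, n_item=3, n_samples=4, rng=rng)
print(negs)
```

The same negative pair may be drawn more than once. With a sparse interaction matrix, rejections are rare, so the loop finishes quickly.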

### **Deadline:** 22:00, Dec. 20th

The practice will be checked in this lab class or the next lab class (before **Dec. 20th**) by teachers or SAs.

### **Grading:**
* Submissions in this lab class: 1.1 points.
* Submissions on time: 1 point.
* Late submissions within 2 weeks after the deadline: 0.8 points.

## **1. Load and Explore the Dataset**

In [86]:
import numpy as np
import random

# Load the dataset
train_pos = np.load("train_pos.npy")  # Each row is (user, item, label)
users, items = set(train_pos[:, 0]), set(train_pos[:, 1])
len(train_pos),train_pos[:5],train_pos[-5:]

(26638,
 array([[   0, 1113,    1],
        [   0,  736,    1],
        [   0,  888,    1],
        [   0,  636,    1],
        [   1,  374,    1]], dtype=int64),
 array([[6014,  934,    1],
        [6014, 1960,    1],
        [6014,  937,    1],
        [6014, 1963,    1],
        [6014, 1485,    1]], dtype=int64))

In [87]:
n_user, n_item = max(users) + 1, max(items) + 1
n_user, n_item

(6015, 2347)

## **2. Initialize Parameters**

Initialize the embedding matrices $P$ for users and $Q$ for items. These matrices represent the user and item embeddings.

**Fill in the missing parts:**


In [88]:
# Define the embedding dimension
dim = 60 # 60 ~ 100

# Initialize user and item embeddings with random values
P = np.random.rand(n_user, dim)  
Q = np.random.rand(n_item, dim) 
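
As an alternative to uniform values in $[0, 1)$, a common choice is a small-scale normal initialization with a fixed seed, so that initial predictions start near 0 and runs are repeatable. The sketch below is one possible variant, not the assignment's required initialization. The seed and the scale `0.1` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)       # seed chosen arbitrarily for repeatability
n_user, n_item, dim = 6015, 2347, 60  # sizes reported above

# Small-scale normal initialization: initial dot products start close to 0
P = rng.normal(loc=0.0, scale=0.1, size=(n_user, dim))
Q = rng.normal(loc=0.0, scale=0.1, size=(n_item, dim))

print(P.shape, Q.shape)
```

With `np.random.rand` and `dim = 60`, the expected initial dot product is about $60 \times 0.25 = 15$, far from the targets 0 and 1, so the first updates are large.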

## **3. Optimize the embeddings via gradient descent**

The loss function to optimize is the sum of squared errors over the known entries (minimizing it is equivalent to minimizing the Mean Squared Error, MSE):
$$
\text{Loss} = \sum_{(u, i) \in R'} (r_{ui} - P_u Q_i^T)^2
$$

Alternatively, add an L2 regularization term for each known entry:

$$
\text{Loss} = \sum_{(u, i) \in R'} \left[ (r_{ui} - P_u Q_i^T)^2 + \lambda (\|P_u\|^2 + \|Q_i\|^2) \right]
$$

Here:
- $ R' $ is the set of known elements in $ R $.
- $ r_{ui} $ is 1 for positive samples and 0 for negative samples.
- $ \lambda $ is the regularization coefficient, which controls how strongly large embeddings are penalized to prevent overfitting.
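
For reference, the stochastic gradient descent update for a single sample $(u, i)$ follows from the regularized loss. Writing $e_{ui} = r_{ui} - P_u Q_i^T$ and $L_{ui} = e_{ui}^2 + \lambda (\|P_u\|^2 + \|Q_i\|^2)$, the gradients are:

$$
\frac{\partial L_{ui}}{\partial P_u} = -2 e_{ui} Q_i + 2 \lambda P_u, \qquad
\frac{\partial L_{ui}}{\partial Q_i} = -2 e_{ui} P_u + 2 \lambda Q_i
$$

Stepping against the gradient with learning rate $\alpha$, and absorbing the constant factor 2 into $\alpha$, gives:

$$
P_u \leftarrow P_u + \alpha (e_{ui} Q_i - \lambda P_u), \qquad
Q_i \leftarrow Q_i + \alpha (e_{ui} P_u - \lambda Q_i)
$$

Both right-hand sides use the values of $P_u$ and $Q_i$ from before the step.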


In [89]:
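alpha = 0.01       # learning rate
lambda_reg = 0.1   # L2 regularization coefficient
iterations = 40    # number of passes over the training pairs

# Set of observed (user, item) pairs; not used in the loop below,
# but handy for rejecting positives when sampling negatives
train_pos_set = set((int(u), int(i)) for u, i in train_pos[:, :2])

# Note: this loop visits positive pairs only; no negative samples are drawn
for iterate in range(iterations):
    for u, i, _ in train_pos:
        prediction = np.dot(P[u], Q[i])
        error = 1 - prediction  # r_ui = 1 for every pair in train_pos

        # Copy P[u] so that the Q[i] update uses the pre-update user vector
        p_u = P[u].copy()
        P[u] += alpha * (error * Q[i] - lambda_reg * P[u])
        Q[i] += alpha * (error * p_u - lambda_reg * Q[i])

    if (iterate + 1) % 10 == 0:
        print(f"Epoch {iterate + 1}/{iterations} completed.")

print("Training completed.")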

Epoch 10/40 completed.
Epoch 20/40 completed.
Epoch 30/40 completed.
Epoch 40/40 completed.
Training completed.
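
The loop above updates on positive pairs only. A possible extension, sketched here on toy data, draws one unobserved pair with target 0 for each positive pair, as the workflow note suggests. The function name `train_mf`, the toy pairs, and all hyperparameter values are assumptions for illustration, not tuned settings for the assignment data.

```python
import numpy as np

def train_mf(pos_pairs, n_user, n_item, dim=8, alpha=0.05, lambda_reg=0.01,
             epochs=200, seed=0):
    """SGD for matrix factorization with one sampled negative per positive."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_user, dim))
    Q = rng.normal(scale=0.1, size=(n_item, dim))
    pos_set = set(pos_pairs)

    def sgd_step(u, i, r):
        p_u, q_i = P[u].copy(), Q[i].copy()   # pre-update values for both steps
        error = r - p_u @ q_i
        P[u] += alpha * (error * q_i - lambda_reg * p_u)
        Q[i] += alpha * (error * p_u - lambda_reg * q_i)

    for _ in range(epochs):
        for u, i in pos_pairs:
            sgd_step(u, i, 1.0)               # observed pair, target 1
            # Rejection-sample one unobserved pair, target 0
            while True:
                nu, ni = int(rng.integers(n_user)), int(rng.integers(n_item))
                if (nu, ni) not in pos_set:
                    break
            sgd_step(nu, ni, 0.0)
    return P, Q

pos_pairs = [(0, 0), (0, 1), (1, 1), (2, 2)]   # toy positives
P, Q = train_mf(pos_pairs, n_user=3, n_item=3)
scores = P @ Q.T
print(np.round(scores, 2))
```

Without negatives, nothing in the objective pushes scores for unobserved pairs toward 0, so the model can lower the loss by predicting values near 1 everywhere.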


## **4. Verification**

Choose an appropriate metric to evaluate the results.

In [90]:
def rmse(predictions, true_values):
    return np.sqrt(np.mean((predictions - true_values) ** 2))

# Evaluate on the training positives only, so the target is always 1
predictions = []
true_values = []
for u, i,_ in train_pos:
    pred = np.dot(P[u], Q[i].T)
    predictions.append(pred)
    true_values.append(1)  

print("RMSE: ", rmse(np.array(predictions), np.array(true_values)))


RMSE:  0.11236577838865587
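
RMSE on positive pairs alone cannot distinguish a useful model from one that predicts 1 for every pair. The sketch below illustrates this with a deliberately degenerate toy model. The helper `rmse_on_pairs` and the toy pairs are hypothetical. Reporting the error on a sample of unobserved pairs alongside the positives is one way to catch this case.

```python
import numpy as np

def rmse_on_pairs(P, Q, pairs, target):
    """RMSE between P[u] @ Q[i] and a constant target over the given pairs."""
    users = np.array([u for u, _ in pairs])
    items = np.array([i for _, i in pairs])
    preds = np.sum(P[users] * Q[items], axis=1)   # row-wise dot products
    return float(np.sqrt(np.mean((preds - target) ** 2)))

# Degenerate toy model that predicts 1.0 for every (user, item) pair
P = np.ones((3, 1))
Q = np.ones((3, 1))

pos_pairs = [(0, 0), (1, 1), (2, 2)]   # toy observed pairs
neg_pairs = [(0, 1), (1, 2), (2, 0)]   # toy unobserved pairs

print(rmse_on_pairs(P, Q, pos_pairs, 1.0))   # 0.0: looks perfect on positives
print(rmse_on_pairs(P, Q, neg_pairs, 0.0))   # 1.0: wrong on every negative
```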
