# Advanced Recommendation System: Matrix Factorization with Blending

## Introduction

Recommendation systems are a cornerstone of modern applications, driving personalized experiences in e-commerce, streaming services, and more. This notebook explores **Matrix Factorization with Blending**, a hybrid approach that combines **Collaborative Filtering (CF)** and **Content-Based Filtering (CBF)** to create a robust and flexible recommendation system.

---

## Concepts Behind This Approach

### 1. **Collaborative Filtering (CF)**
Collaborative filtering uses patterns in user-item interactions to make recommendations. It operates on the principle that users with similar preferences will like similar items. A popular method within CF is **Matrix Factorization**, where the user-item interaction matrix is decomposed into latent feature matrices:
- **User Latent Matrix (U)**: Represents user preferences in a latent feature space (denoted $ P $ in the implementation sections below).
- **Item Latent Matrix (V)**: Represents item characteristics in the same space (denoted $ Q $ below).

By learning these latent representations, CF predicts unknown interactions. However, CF struggles with **cold-start problems**, where new users or items lack interaction data.
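As a minimal sketch of how latent factors turn into predictions, the snippet below multiplies a small user matrix by a small item matrix. The values of `U` and `V` are hypothetical and chosen only for illustration; in practice they are learned from the observed interactions.

```python
import numpy as np

# Hypothetical latent factors: 2 users and 3 items in a 2-dimensional latent space.
U = np.array([[1.0, 0.5],
              [0.2, 1.5]])   # shape (num_users, k)
V = np.array([[2.0, 1.0],
              [0.5, 2.0],
              [1.0, 1.0]])   # shape (num_items, k)

# Each predicted interaction is the dot product of one user row and one item row.
predicted = U @ V.T          # shape (num_users, num_items)

print(predicted)
```

Every entry of `predicted` is defined, including pairs that were never observed, which is how the factorization fills in the unknown cells of the interaction matrix.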

---

### 2. **Content-Based Filtering (CBF)**
Content-based filtering leverages item metadata (e.g., genres, descriptions, tags) or user attributes to recommend items. It computes the similarity between users and items based on these features. While effective for cold-start items, it can lead to a lack of diversity and serendipity in recommendations.
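The idea can be sketched with cosine similarity between a user profile and item feature vectors. The feature vectors, movie names, and the choice of building the profile as the mean of liked items are assumptions made for illustration, not the method used later in this notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

# Hypothetical binary genre features: [action, comedy, sci-fi]
item_features = {
    "movie_a": np.array([1.0, 0.0, 1.0]),
    "movie_b": np.array([0.0, 1.0, 0.0]),
    "movie_c": np.array([1.0, 0.0, 0.0]),
}

# A simple user profile: the mean feature vector of the items the user liked.
liked = ["movie_a"]
user_profile = np.mean([item_features[m] for m in liked], axis=0)

# Score every item the user has not interacted with yet.
scores = {m: cosine_similarity(user_profile, f)
          for m, f in item_features.items() if m not in liked}
print(scores)
```

Because the score depends only on item features, `movie_c` can be ranked even if nobody has rated it, which is why this approach handles cold-start items. The same property explains the diversity problem: items unlike anything the user already liked always score low.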

---

### 3. **Blending CF and CBF**
The hybrid approach addresses the limitations of standalone methods:
- **Matrix Factorization (CF)** excels at learning implicit patterns but struggles with sparse data.
- **Content-Based Filtering (CBF)** provides strong cold-start support but lacks diversity.

By blending these methods, we create a system that:
1. Predicts user-item interactions using **latent factors** (CF).
2. Incorporates **item content features** to improve accuracy and handle cold-start scenarios.
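One simple way to blend the two signals is a weighted average with a fallback for cold-start items. This is a sketch under assumptions: the function name, the weight `alpha`, and the fallback rule are hypothetical and not necessarily the scheme implemented later in this notebook.

```python
def blend_predictions(cf_score, cbf_score, alpha=0.7, is_cold_start=False):
    """Blend a CF score and a CBF score into a single prediction.

    alpha is a hypothetical blending weight (1.0 = pure CF, 0.0 = pure CBF).
    For cold-start items there is no reliable CF score, so this sketch
    falls back to the content-based score alone.
    """
    if is_cold_start or cf_score is None:
        return cbf_score
    return alpha * cf_score + (1.0 - alpha) * cbf_score

print(blend_predictions(4.0, 3.0))                       # item with interaction history
print(blend_predictions(None, 3.5, is_cold_start=True))  # cold-start item
```

In practice `alpha` would be tuned on the validation set rather than fixed by hand.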

---

## Objective of This Notebook

1. **Prepare Data**:
   - Construct the user-item interaction matrix.
   - Extract item content features (e.g., genres, descriptions).

2. **Implement Matrix Factorization**:
   - Train a collaborative filtering model using user-item interactions.

3. **Incorporate Content-Based Features**:
   - Blend content-based predictions with collaborative filtering predictions.

4. **Evaluate the System**:
   - Assess the blended model's performance using metrics like RMSE and Precision@K.
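The two metrics named above can be sketched in a few lines. The function names and sample values are hypothetical; what counts as a "relevant" item (for example, a held-out rating of 4.0 or higher) is an assumption that has to be fixed before computing Precision@K.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

print(rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0]))
print(precision_at_k(["m1", "m2", "m3", "m4"], {"m2", "m4", "m9"}, k=3))
```

RMSE measures how close predicted ratings are to the true ones, while Precision@K only looks at the ordering of the top of the recommendation list, so the two can disagree about which model is better.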

---

By combining the strengths of collaborative and content-based filtering, this notebook aims to build a recommendation system that is both accurate and versatile, capable of handling sparse data and cold-start challenges effectively.


# Step 1: Prepare Data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Step 1: Load Data
ratings = pd.read_csv('data/ml-latest-small/ratings.csv')  # User-item interactions
movies = pd.read_csv('data/ml-latest-small/movies.csv')  # Movie metadata (title, genres)
tags = pd.read_csv('data/ml-latest-small/tags.csv')  # User-defined tags for movies

# Step 2: Preprocess and Join Data
# Merge ratings with movies to include genres
ratings = ratings.merge(movies, on='movieId', how='left')

# Normalize tags to lowercase before grouping
tags['tag'] = tags['tag'].str.lower()

# Group tags by movieId and concatenate into a single string
tags_grouped = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(x)).reset_index()

# Merge tags into the main dataset
ratings = ratings.merge(tags_grouped, on='movieId', how='left')

# Fill missing tags with empty strings
ratings['tag'] = ratings['tag'].fillna('')

# Combine genres and tags into a single column for content features
ratings['content'] = ratings['genres'] + ' ' + ratings['tag']

# Step 3: Split Data into Training, Validation, and Test Sets
# Sort by userId and timestamp to maintain temporal consistency
ratings = ratings.sort_values(by=['userId', 'timestamp'])

# Define split ratios
train_ratio = 0.7
val_ratio = 0.15

# Function to split data for each user
def split_user_data(group):
    n = len(group)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    
    train = group.iloc[:train_end]
    val = group.iloc[train_end:val_end]
    test = group.iloc[val_end:]
    
    return train, val, test

# Apply splitting logic
train_data, val_data, test_data = [], [], []
for _, group in ratings.groupby('userId'):
    train, val, test = split_user_data(group)
    train_data.append(train)
    val_data.append(val)
    test_data.append(test)

# Concatenate results
train_data = pd.concat(train_data)
val_data = pd.concat(val_data)
test_data = pd.concat(test_data)

print(f"Training data size: {len(train_data)}")
print(f"Validation data size: {len(val_data)}")
print(f"Test data size: {len(test_data)}")


Training data size: 70312
Validation data size: 15102
Test data size: 15422


Let's validate the data:

In [2]:
# check for missing values
print("Missing values in train_data:")
print(train_data.isnull().sum())
print()
print("Missing values in val_data:")
print(val_data.isnull().sum())
print()
print("Missing values in test_data:")
print(test_data.isnull().sum())

print()

# check for negative or zero ratings
print("Negative or zero ratings in train_data:", (train_data['rating'] <= 0).sum())
print("Negative or zero ratings in val_data:", (val_data['rating'] <= 0).sum())
print("Negative or zero ratings in test_data:", (test_data['rating'] <= 0).sum())

print()

# check for rating ranges
print("Train data ranges:")
print(train_data.describe())
print("Validation data ranges:")
print(val_data.describe())
print("Test data ranges:")
print(test_data.describe())

print()

total_rows = len(ratings)
split_rows = len(train_data) + len(val_data) + len(test_data)

print(f"Total rows in original data: {total_rows}")
print(f"Total rows in splits: {split_rows}")

if total_rows == split_rows:
    print("Splits are consistent with the original data.")
else:
    print("Inconsistency in splits!")

print()

print("Sample rows from train_data:")
print(train_data.head())
print()
print("Sample rows from val_data:")
print(val_data.head())
print()
print("Sample rows from test_data:")
print(test_data.head())

Missing values in train_data:
userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
tag          0
content      0
dtype: int64

Missing values in val_data:
userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
tag          0
content      0
dtype: int64

Missing values in test_data:
userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
tag          0
content      0
dtype: int64

Negative or zero ratings in train_data: 0
Negative or zero ratings in val_data: 0
Negative or zero ratings in test_data: 0

Train data ranges:
             userId        movieId        rating     timestamp
count  70312.000000   70312.000000  70312.000000  7.031200e+04
mean     326.205712   16105.237029      3.521625  1.195010e+09
std      182.652155   31564.404509      1.036593  2.152768e+08
min        1.000000       1.000000      0.500000  8.281246e+08
25%      177.000000    1097.000000      3.000000  9.9

## Explanations:

## What is TF-IDF?

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is widely used in information retrieval and text mining to represent textual data in a way that highlights relevant terms while reducing the impact of common but less informative words.

---

### Components of TF-IDF

1. **Term Frequency (TF)**:
   Measures how often a term appears in a document.
   $$
   TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
   $$

2. **Inverse Document Frequency (IDF)**:
   Reduces the weight of terms that appear in many documents (common words).
   $$
   IDF(t, D) = \log\left(\frac{\text{Total number of documents in the corpus } D}{\text{Number of documents containing term } t}\right)
   $$

3. **TF-IDF Score**:
   Combines TF and IDF to calculate the importance of a term in a document:
   $$
   TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)
   $$
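The three formulas above translate directly into code. The sketch below uses a hypothetical toy corpus of pre-tokenized documents and the unsmoothed IDF exactly as defined here; the helper names are illustrative.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Unsmoothed inverse document frequency; the term must occur in at least one document."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF score of a term in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["action", "adventure", "epic"],
    ["comedy", "drama"],
    ["action", "space"],
]

print(tf_idf("epic", corpus[0], corpus))    # term found in one document
print(tf_idf("action", corpus[0], corpus))  # term found in two documents
```

Both terms occur once in the first document, so they share the same TF; "epic" scores higher only because it is rarer across the corpus. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values differ from this textbook version.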

---

### Why Use TF-IDF?

- **Highlights Important Words**:
  Words that are frequent in a document but rare in the corpus receive higher scores.
- **Down-weights Common Words**:
  Words like "the," "is," or "and" receive low scores because they appear in most documents.
- **Sparse Representation**:
  TF-IDF produces sparse vectors, ideal for efficient storage and computation in machine learning tasks.

---

### Application in Recommendation Systems

In recommendation systems, TF-IDF is used to extract features from item metadata (e.g., genres, tags, descriptions). These features can then be integrated with collaborative filtering methods to enhance recommendations by incorporating content-based insights.


Let's create sparse matrices (scipy) so that we can use efficient data structures for training.

### Example of a Simplified TF-IDF Matrix

#### Example Scenario:
We have a small dataset of 3 movies with genres and tags:

| **Movie ID** | **Genres**         | **Tags**           |
|--------------|--------------------|--------------------|
| 1            | Action Adventure  | epic, battle       |
| 2            | Comedy Drama       | funny, heartwarming|
| 3            | Action Sci-Fi      | space, futuristic  |

---

#### Combined Content (Genres + Tags):
We combine genres and tags into a single string for each movie (as in the `ratings['content']` column):
1. Movie 1: `Action Adventure epic battle`
2. Movie 2: `Comedy Drama funny heartwarming`
3. Movie 3: `Action Sci-Fi space futuristic`

---

#### Step 1: Vocabulary Creation
The **vocabulary** consists of all unique words across the combined content:
$$
\text{Vocabulary} = \{\text{Action, Adventure, epic, battle, Comedy, Drama, funny, heartwarming, Sci-Fi, space, futuristic}\}
$$

---

#### Step 2: Document-Term Matrix
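Create a matrix where each row represents a movie, and each column corresponds to a word from the vocabulary. Each value is the raw count of that word in the movie's content, that is, an unnormalized **Term Frequency (TF)**. The length-normalized TF defined earlier would give 0.25 for every non-zero entry here, since each movie has four terms: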

| Movie ID | Action | Adventure | epic | battle | Comedy | Drama | funny | heartwarming | Sci-Fi | space | futuristic |
|----------|--------|-----------|------|--------|--------|-------|-------|--------------|--------|-------|------------|
| 1        | 1      | 1         | 1    | 1      | 0      | 0     | 0     | 0            | 0      | 0     | 0          |
| 2        | 0      | 0         | 0    | 0      | 1      | 1     | 1     | 1            | 0      | 0     | 0          |
| 3        | 1      | 0         | 0    | 0      | 0      | 0     | 0     | 0            | 1      | 1     | 1          |

---

#### Step 3: Compute Inverse Document Frequency (IDF)
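The **IDF** for each word is computed with a smoothed variant of the earlier definition, which adds 1 to the document frequency in the denominator and uses the natural logarithm: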
$$
\text{IDF}(t) = \log \left( \frac{N}{1 + \text{DF}(t)} \right)
$$
Where:
- $ N = 3 $: Total number of documents (movies).
- $ \text{DF}(t) $: Number of documents containing the term $ t $.

| Term         | DF  | IDF                  |
|--------------|-----|----------------------|
| Action       | 2   | $ \log\left(\frac{3}{1+2}\right) = 0.0 $ |
| Adventure    | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| epic         | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| battle       | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| Comedy       | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| Drama        | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| funny        | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| heartwarming | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| Sci-Fi       | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| space        | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |
| futuristic   | 1   | $ \log\left(\frac{3}{1+1}\right) = 0.405 $ |

---

#### Step 4: Compute TF-IDF Matrix
Each value is computed as:
$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot \text{IDF}(t)
$$

| Movie ID | Action | Adventure | epic | battle | Comedy | Drama | funny | heartwarming | Sci-Fi | space | futuristic |
|----------|--------|-----------|------|--------|--------|-------|-------|--------------|--------|-------|------------|
| 1        | 0.0    | 0.405     | 0.405| 0.405  | 0.0    | 0.0   | 0.0   | 0.0          | 0.0    | 0.0   | 0.0        |
| 2        | 0.0    | 0.0       | 0.0  | 0.0    | 0.405  | 0.405 | 0.405 | 0.405        | 0.0    | 0.0   | 0.0        |
| 3        | 0.0    | 0.0       | 0.0  | 0.0    | 0.0    | 0.0   | 0.0   | 0.0          | 0.405  | 0.405 | 0.405      |
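The table above can be reproduced with a few lines of NumPy. This is a sketch that follows the worked example literally (whitespace tokenization, raw counts, and the smoothed IDF from Step 3); it is not what `TfidfVectorizer` computes internally.

```python
import numpy as np

vocabulary = ["Action", "Adventure", "epic", "battle", "Comedy", "Drama",
              "funny", "heartwarming", "Sci-Fi", "space", "futuristic"]
documents = [
    "Action Adventure epic battle",
    "Comedy Drama funny heartwarming",
    "Action Sci-Fi space futuristic",
]

# Step 2: document-term matrix of raw counts, using whitespace tokenization.
counts = np.array([[doc.split().count(term) for term in vocabulary]
                   for doc in documents], dtype=float)

# Step 3: smoothed IDF, log(N / (1 + DF)), one value per vocabulary term.
n_docs = len(documents)
doc_freq = (counts > 0).sum(axis=0)
idf_values = np.log(n_docs / (1 + doc_freq))

# Step 4: element-wise product; the IDF vector is broadcast over the rows.
tfidf = counts * idf_values

print(np.round(tfidf, 3))
```

With only three documents, "Action" appears in two of them and receives an IDF of exactly 0, so Movies 1 and 3 end up with no shared non-zero feature even though both are action films. This is an artifact of the tiny corpus and of this particular smoothing.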

---

#### Final Notes:
1. **Sparse Representation**: TF-IDF matrices are usually sparse, as most words don't appear in every document.
2. **Use in Recommendations**: This matrix can be used to compute the similarity between movies or blended with collaborative filtering for hybrid recommendations.


In [3]:
from scipy.sparse import csr_matrix

# Step 4: Create Sparse Interaction Matrices
def create_sparse_matrix(data, user_mapping, movie_mapping, rating_col='rating'):
    row = data['userId'].map(user_mapping).values
    col = data['movieId'].map(movie_mapping).values
    values = data[rating_col].values
    return csr_matrix((values, (row, col)), shape=(len(user_mapping), len(movie_mapping)))

# Create mappings for userId and movieId to indices
user_mapping = {user_id: idx for idx, user_id in enumerate(ratings['userId'].unique())}
movie_mapping = {movie_id: idx for idx, movie_id in enumerate(ratings['movieId'].unique())}

print(f"Shape of idx mappings: ({len(user_mapping)}, {len(movie_mapping)})")

train_sparse = create_sparse_matrix(train_data, user_mapping, movie_mapping)
val_sparse = create_sparse_matrix(val_data, user_mapping, movie_mapping)
test_sparse = create_sparse_matrix(test_data, user_mapping, movie_mapping)

print(f"Sparse train matrix shape: {train_sparse.shape}")
print(f"Sparse validation matrix shape: {val_sparse.shape}")
print(f"Sparse test matrix shape: {test_sparse.shape}")

print()

# Step 5: Extract Content-Based Features as Sparse Matrix
# Note: fitting on ratings['content'] yields one TF-IDF row per rating, not one row per movie
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
content_matrix = tfidf_vectorizer.fit_transform(ratings['content'])

print(f"Content matrix shape (TF-IDF): {content_matrix.shape}")

# Final Outputs
print("Data preparation complete.")

Shape of idx mappings: (610, 9724)
Sparse train matrix shape: (610, 9724)
Sparse validation matrix shape: (610, 9724)
Sparse test matrix shape: (610, 9724)

Content matrix shape (TF-IDF): (100836, 1675)
Data preparation complete.


## Test data for correctness

In [4]:
# Ensure all user IDs in train, val, and test data exist in user_mapping
missing_users_train = train_data[~train_data['userId'].isin(user_mapping.keys())]
missing_users_val = val_data[~val_data['userId'].isin(user_mapping.keys())]
missing_users_test = test_data[~test_data['userId'].isin(user_mapping.keys())]

print("Missing users in train_data:", len(missing_users_train))
print("Missing users in val_data:", len(missing_users_val))
print("Missing users in test_data:", len(missing_users_test))

# Ensure all movie IDs in train, val, and test data exist in movie_mapping
missing_movies_train = train_data[~train_data['movieId'].isin(movie_mapping.keys())]
missing_movies_val = val_data[~val_data['movieId'].isin(movie_mapping.keys())]
missing_movies_test = test_data[~test_data['movieId'].isin(movie_mapping.keys())]

print("Missing movies in train_data:", len(missing_movies_train))
print("Missing movies in val_data:", len(missing_movies_val))
print("Missing movies in test_data:", len(missing_movies_test))

print()

print("Train sparse shape:", train_sparse.shape)
print("Validation sparse shape:", val_sparse.shape)
print("Test sparse shape:", test_sparse.shape)
print(f"Expected shape: ({len(user_mapping)}, {len(movie_mapping)})")

print()

train_rows, train_cols = train_sparse.nonzero()
val_rows, val_cols = val_sparse.nonzero()
test_rows, test_cols = test_sparse.nonzero()

print("Non-zero train row indices:", train_rows[:10])
print("Non-zero validation row indices:", val_rows[:10])
print("Non-zero test row indices:", test_rows[:10])

print()

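# Compute per-user mean ratings as a sanity check
train_sums = np.array(train_sparse.sum(axis=1)).flatten()
train_counts = np.diff(train_sparse.indptr)
# Supply `out` so that users with no training ratings get 0 rather than uninitialized memory
train_means = np.divide(train_sums, train_counts, out=np.zeros_like(train_sums), where=train_counts != 0)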

print("Mean rating per user (train):", train_means[:10])

Missing users in train_data: 0
Missing users in val_data: 0
Missing users in test_data: 0
Missing movies in train_data: 0
Missing movies in val_data: 0
Missing movies in test_data: 0

Train sparse shape: (610, 9724)
Validation sparse shape: (610, 9724)
Test sparse shape: (610, 9724)
Expected shape: (610, 9724)

Non-zero train row indices: [0 0 0 0 0 0 0 0 0 0]
Non-zero validation row indices: [0 0 0 0 0 0 0 0 0 0]
Non-zero test row indices: [0 0 0 0 0 0 0 0 0 0]

Mean rating per user (train): [4.38888889 3.85       3.14814815 3.70860927 3.63333333 3.57990868
 3.47169811 3.625      3.59375    3.40816327]


# Step 2: Implementing Matrix Factorization

## Outline for Implementing Matrix Factorization

1. **Understand the Matrix Factorization Objective**  
   - Decompose the user-item interaction matrix (e.g., ratings) into two lower-dimensional matrices:
     - **User matrix (P)**: Represents user preferences in a latent feature space.
     - **Item matrix (Q)**: Represents item attributes in the same latent feature space.  
   - The product of these two matrices approximates the original user-item interaction matrix.

2. **Set Up the Mathematical Framework**  
   - Define the reconstruction loss function to minimize:
     - Mean Squared Error (MSE) or similar loss.
     - Regularization terms to prevent overfitting.
   - Represent the optimization as:  
     $$
     \min_{P, Q} \sum_{(u, i) \in \text{observed}} (R_{ui} - P_u^T Q_i)^2 + \lambda (||P||^2 + ||Q||^2)
     $$  
     where:
     - $ R_{ui} $ is the actual rating or interaction.
     - $ P_u $ and $ Q_i $ are the latent feature vectors for the user $ u $ and item $ i $.
     - $ \lambda $ is the regularization parameter.

3. **Incorporating User and Item Biases**  
   - **What are User Biases?**  
     User biases ($ b_u $) account for individual user tendencies to rate items higher or lower than average. For example, some users consistently give higher ratings regardless of the item.  
   - **What are Item Biases?**  
     Item biases ($ b_i $) play the same role for items: some items are consistently rated above or below average regardless of who rates them.  
   - **Effect of Biases**:  
     - Captures systematic rating tendencies of individual users and items, which improves prediction accuracy.  
     - A user who tends to rate all movies lower can be modeled with a negative bias, while a generous rater would have a positive bias.  
   - **Mathematical Representation**:  
     To include biases, the prediction formula becomes:  
     $$
     \hat{R}_{ui} = \mu + b_u + b_i + P_u^T Q_i
     $$  
     where:
     - $ \mu $ is the global average rating.
     - $ b_u $ is the user bias for user $ u $.
     - $ b_i $ is the item bias for item $ i $.
     - $ P_u^T Q_i $ is the dot product of user and item latent vectors.  
   - **Effect on the Loss Function**:  
     The loss function is updated to include the biases:  
     $$
     \min_{P, Q, b_u, b_i} \sum_{(u, i) \in \text{observed}} \left( R_{ui} - (\mu + b_u + b_i + P_u^T Q_i) \right)^2 + \lambda \left( ||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2 \right)
     $$  
     Biases are updated alongside the latent matrices during optimization.

4. **Initialize Latent Matrices (P and Q)**  
   - Randomly initialize the user and item latent feature matrices with small values.
   - Define the number of latent features (dimensionality).
   - Initialize biases ($ b_u $ and $ b_i $) to zero and calculate the global bias ($ \mu $) as the mean of observed ratings.

5. **Implement the Optimization Algorithm**  
   - Use **Stochastic Gradient Descent (SGD)** or a similar optimization method:
     - Loop through observed user-item pairs.
     - Calculate prediction error.
     - Update latent factors $ P $ and $ Q $ based on the gradient of the loss function.
     - Update user and item biases $ b_u $ and $ b_i $.
   - Optionally, implement batch or mini-batch optimization.

6. **Regularization**  
   - Include regularization terms in the gradient updates.
   - Prevents overfitting by penalizing large weights in the matrices and biases.

7. **Implement the Training Loop**  
   - Define stopping criteria:
     - Maximum number of iterations/epochs.
     - Convergence threshold based on loss improvement.
   - Monitor training loss and validation loss.

8. **Evaluate the Model**  
   - Compute reconstruction accuracy on a validation set (e.g., Root Mean Squared Error).
   - Compare predicted values with actual values.

9. **Optional Enhancements**  
   - Experiment with different numbers of latent features.
   - Add content-based features to the factorization model for better performance.
   - Evaluate ranking metrics like Precision@K and Recall@K for top-$ K $ recommendations.

10. **Save and Reuse the Model**  
    - Save the learned matrices $ P $ and $ Q $, user biases $ b_u $, item biases $ b_i $, and global bias $ \mu $ for future use.
    - Provide functions to predict ratings for any user-item pair.

11. **Test the Implementation**  
    - Use synthetic or small real datasets to test and debug the implementation.
    - Ensure the model trains without exploding/vanishing gradients.
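Steps 3 to 6 of the outline can be sketched as a single SGD update for one observed rating. The function name `sgd_step` and the default values of `lr` and `reg` are hypothetical; the update rules follow the regularized loss with biases given above, with the constant factor 2 of the gradient absorbed into the learning rate.

```python
import numpy as np

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.01, reg=0.02):
    """One SGD update for a single observed rating r_ui.

    Returns the updated (b_u, b_i, p_u, q_i) and the prediction error
    measured before the update. lr and reg are hypothetical defaults.
    """
    error = r_ui - (mu + b_u + b_i + p_u @ q_i)

    # Each parameter moves along the error signal and is shrunk by the L2 penalty.
    new_b_u = b_u + lr * (error - reg * b_u)
    new_b_i = b_i + lr * (error - reg * b_i)
    new_p_u = p_u + lr * (error * q_i - reg * p_u)
    new_q_i = q_i + lr * (error * p_u - reg * q_i)
    return new_b_u, new_b_i, new_p_u, new_q_i, error

# Small random latent vectors, zero biases, and one rating above the global mean.
rng = np.random.default_rng(0)
p_u = rng.normal(scale=0.01, size=4)
q_i = rng.normal(scale=0.01, size=4)
mu, b_u, b_i, rating = 3.5, 0.0, 0.0, 5.0

for _ in range(200):
    b_u, b_i, p_u, q_i, error = sgd_step(rating, mu, b_u, b_i, p_u, q_i, lr=0.05)

print(f"remaining error: {error:.4f}")
```

Repeating the step on the same observation drives the error toward a small residual rather than exactly zero, because the regularization term keeps pulling the parameters back toward zero.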

---

This framework ensures a structured approach to implementing matrix factorization for recommendation systems, with a focus on incorporating **user and item biases** for improved prediction accuracy.


## Initialize latent matrices

In [64]:
import numpy as np

def initialize_latent_matrices(num_users, num_items, latent_dim):
    """
    Initialize the user and item latent matrices with small random values.
    
    Parameters:
        num_users (int): Number of users.
        num_items (int): Number of items.
        latent_dim (int): Number of latent dimensions (features).
        
    Returns:
        P (numpy.ndarray): User latent matrix of shape (num_users, latent_dim).
        Q (numpy.ndarray): Item latent matrix of shape (num_items, latent_dim).
    """
    # Random initialization of user and item matrices
    P = np.random.normal(scale=0.01, size=(num_users, latent_dim))
    Q = np.random.normal(scale=0.01, size=(num_items, latent_dim))
    
    return P, Q

def compute_sparse_loss(rows, cols, values, P, Q, b_u, b_i, mu, regularization, normalize=True):
    """
    Compute the loss for a given sparse interaction matrix, including bias terms.

    Parameters:
        rows (np.ndarray): Row indices of the non-zero entries.
        cols (np.ndarray): Column indices of the non-zero entries.
        values (np.ndarray): Corresponding non-zero values.
        P (np.ndarray): User latent matrix.
        Q (np.ndarray): Item latent matrix, transposed to shape (latent_dim, num_items).
        b_u (np.ndarray): User biases.
        b_i (np.ndarray): Item biases.
        mu (float): Global bias.
        regularization (float): Regularization parameter.
        normalize (bool): Whether to divide the total loss (squared error plus regularization) by the number of observations.

    Returns:
        float: Computed loss value.
    """

    assert len(rows) == len(cols) == len(values), f"Invalid sizes: {len(rows)}, {len(cols)}, {len(values)}"

    loss = 0
    for idx in range(len(values)):
        u = rows[idx]
        i = cols[idx]

        assert 0 <= u < len(P), f"Invalid user index: {u}"
        assert 0 <= i < len(Q.T), f"Invalid item index: {i}"

        rating = values[idx]

        # Compute prediction including biases
        prediction = mu + b_u[u] + b_i[i] + np.dot(P[u, :], Q[:, i])
        error = rating - prediction
        loss += error**2

    # Add regularization term (includes biases)
    loss += regularization * (
        np.linalg.norm(P)**2 + np.linalg.norm(Q)**2 + np.linalg.norm(b_u)**2 + np.linalg.norm(b_i)**2
    )

    # Normalize the loss if requested
    if normalize:
        loss /= len(values)

    return loss

## The training loop

In [61]:
import numpy as np

def matrix_factorization(train_sparse, val_sparse, num_users, num_items, latent_dim,
                         epochs, learning_rate, regularization, patience=5, clip_value=5.0):
    """
    Perform matrix factorization using SGD with regularization, early stopping, and gradient clipping.
    Utilizes sparse matrices for efficient computation.

    Parameters:
        train_sparse (csr_matrix): Training interaction matrix (sparse).
        val_sparse (csr_matrix): Validation interaction matrix (sparse).
        num_users (int): Total number of users.
        num_items (int): Total number of items.
        latent_dim (int): Number of latent features.
        epochs (int): Number of training epochs.
        learning_rate (float): Learning rate for gradient descent.
        regularization (float): Regularization parameter.
        patience (int): Early stopping patience.
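        clip_value (float): Maximum absolute value of each per-step parameter update.

    Returns:
        P (np.ndarray): Learned user latent matrix of shape (num_users, latent_dim).
        Q (np.ndarray): Learned item latent matrix of shape (num_items, latent_dim).
        train_losses (list): Training losses over epochs.
        val_losses (list): Validation losses over epochs.

    Note:
        The learned biases (b_u, b_i) and the global mean (mu) are not returned.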
    """

    # Initialize latent matrices
    P, Q = initialize_latent_matrices(num_users, num_items, latent_dim)

    # Convert item latent matrix for matrix operations
    Q = Q.T

    # Initialize biases
    mu = train_sparse.data.mean()  # Global bias
    b_u = np.zeros(num_users)  # User biases
    b_i = np.zeros(num_items)  # Item biases

    # Track losses
    train_losses = []
    val_losses = []
    best_val_loss = float('inf')
    patience_counter = 0

    # Get non-zero training indices for sparse matrix
    train_rows, train_cols = train_sparse.nonzero()
    train_values = train_sparse.data

    assert len(train_rows) == len(train_cols) == len(train_values), f"Invalid sizes: {len(train_rows)}, {len(train_cols)}, {len(train_values)}"
    
    # Get non-zero validation indices for sparse matrix
    val_rows, val_cols = val_sparse.nonzero()
    val_values = val_sparse.data

    assert len(val_rows) == len(val_cols) == len(val_values), f"Invalid sizes: {len(val_rows)}, {len(val_cols)}, {len(val_values)}"

    for epoch in range(epochs):
        # Shuffle training indices
        shuffle_indices = np.random.permutation(len(train_rows))
        train_rows = train_rows[shuffle_indices]
        train_cols = train_cols[shuffle_indices]
        train_values = train_values[shuffle_indices]

        # SGD for each non-zero entry in the sparse matrix
        for idx in range(len(train_rows)):
            u = train_rows[idx]
            i = train_cols[idx]

            assert 0 <= u < len(P), f"Invalid user index: {u} at index {idx}"
            assert 0 <= i < len(Q.T), f"Invalid item index: {i} at index {idx}"

            rating = train_values[idx]

            # Compute prediction and error
            prediction = mu + b_u[u] + b_i[i] + np.dot(P[u, :], Q[:, i])
            error = rating - prediction

            # Update biases with gradient clipping
            delta_b_u = learning_rate * (error - regularization * b_u[u])
            delta_b_i = learning_rate * (error - regularization * b_i[i])
            b_u[u] += np.clip(delta_b_u, -clip_value, clip_value)
            b_i[i] += np.clip(delta_b_i, -clip_value, clip_value)

            # Update user and item latent vectors with gradient clipping
            delta_P = learning_rate * (error * Q[:, i] - regularization * P[u, :])
            delta_Q = learning_rate * (error * P[u, :] - regularization * Q[:, i])
            P[u, :] += np.clip(delta_P, -clip_value, clip_value)
            Q[:, i] += np.clip(delta_Q, -clip_value, clip_value)

        # Compute training and validation losses
        train_loss = compute_sparse_loss(train_rows, train_cols, train_values, P, Q, b_u, b_i, mu, regularization)
        val_loss = compute_sparse_loss(val_rows, val_cols, val_values, P, Q, b_u, b_i, mu, regularization)

        train_losses.append(train_loss)
        val_losses.append(val_loss)

        print(f"Epoch {epoch + 1}/{epochs}, Train MSE: {train_loss:.4f}, Val MSE: {val_loss:.4f}")

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered.")
                break

    return P, Q.T, train_losses, val_losses

### Loss Calculation for One Epoch (Without Content Blending)

To calculate the total loss for one epoch, we sum the **squared errors** for all observed user-item interactions and add a **regularization term** to prevent overfitting.

#### Total Loss Formula

$$
\text{Loss} = \sum_{(u, i) \in \mathcal{D}} \left( R_{ui} - \hat{R}_{ui} \right)^2 + \lambda \cdot \left( ||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2 \right)
$$

Where:
1. $ \mathcal{D} $: Set of all observed user-item interactions in the training data.
2. $ R_{ui} $: Actual rating for user $ u $ and item $ i $.
3. $ \hat{R}_{ui} $: Predicted rating for user $ u $ and item $ i $, calculated as:
   $$
   \hat{R}_{ui} = \mu + b_u[u] + b_i[i] + P[u, :] \cdot Q[i, :]^T
   $$
   - $ \mu $: Global bias (average rating across all training interactions).
   - $ b_u[u] $: User-specific bias.
   - $ b_i[i] $: Item-specific bias.
   - $ P[u, :] $: Latent vector for user $ u $.
   - $ Q[i, :] $: Latent vector for item $ i $.

4. $ \lambda $: Regularization parameter (controls the strength of the regularization).
5. $ ||P||^2 $: Squared Frobenius norm of the user latent matrix $ P $.
6. $ ||Q||^2 $: Squared Frobenius norm of the item latent matrix $ Q $.
7. $ ||b_u||^2 $: Squared norm of the user biases.
8. $ ||b_i||^2 $: Squared norm of the item biases.

---

#### Explanation

1. **Error Term**:
   - For each observed interaction $ (u, i) $, compute the squared difference between the actual rating ($ R_{ui} $) and the predicted rating ($ \hat{R}_{ui} $):
     $$
     \left( R_{ui} - \hat{R}_{ui} \right)^2
     $$

2. **Regularization Term**:
   - Add penalties for the magnitudes of $ P $, $ Q $, $ b_u $, and $ b_i $ to prevent overfitting:
     $$
     \lambda \cdot \left( ||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2 \right)
     $$

3. **Sum Over All Observations**:
   - Calculate the error term for all observed user-item pairs in the training data $ \mathcal{D} $ and add the regularization penalty.

---

#### Example for a Single Epoch
For a dataset with 3 user-item interactions:
$$
\mathcal{D} = \{(u=0, i=1, R_{ui}=4.0), (u=1, i=2, R_{ui}=3.5), (u=2, i=0, R_{ui}=5.0)\}
$$

The total loss would be:
$$
\text{Loss} = \sum_{(u, i) \in \mathcal{D}} \left( R_{ui} - (\mu + b_u[u] + b_i[i] + P[u, :] \cdot Q[i, :]^T) \right)^2 + \lambda \cdot \left( ||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2 \right)
$$
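To make the formula concrete, the sketch below evaluates it for the three interactions in $ \mathcal{D} $. The parameter values (`toy_P`, `toy_Q`, zero biases, `lam=0.01`) are hypothetical and chosen only for illustration, and `regularized_loss` is an illustrative helper rather than the notebook's `compute_sparse_loss`. Here `Q` holds one row per item, matching the $ Q[i, :] $ notation above.

```python
import numpy as np

def regularized_loss(users, items, values, P, Q, b_u, b_i, mu, lam):
    """Total (unnormalized) loss: squared error over observed pairs plus L2 penalty.

    Q holds one row per item here, i.e. shape (num_items, latent_dim).
    """
    # Row-wise dot products give the latent term for every observed pair at once.
    predictions = mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1)
    squared_error = np.sum((values - predictions) ** 2)
    penalty = lam * (np.sum(P**2) + np.sum(Q**2) + np.sum(b_u**2) + np.sum(b_i**2))
    return squared_error + penalty

# The three observed interactions from the example: (u, i, R_ui).
toy_users = np.array([0, 1, 2])
toy_items = np.array([1, 2, 0])
toy_values = np.array([4.0, 3.5, 5.0])

# Hypothetical parameter values, chosen only for illustration.
toy_P = np.full((3, 2), 0.1)
toy_Q = np.full((3, 2), 0.1)
toy_loss = regularized_loss(
    toy_users, toy_items, toy_values, toy_P, toy_Q,
    b_u=np.zeros(3), b_i=np.zeros(3), mu=toy_values.mean(), lam=0.01,
)
print(f"toy loss: {toy_loss:.4f}")
```

Indexing the parameter arrays with the `users` and `items` index vectors replaces the per-observation Python loop with array operations, which gives the same total but scales much better with the number of ratings.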


In [65]:
# Integration with Data Preparation Code
# Sparse matrices should already be prepared (train_sparse, val_sparse, test_sparse).

# {'latent_dim': 42, 'learning_rate': 0.012376565252378615, 'regularization': 3.677000997278053e-05}

# Define parameters
latent_dim = 42
learning_rate = 0.012376565252378615
regularization = 3.677000997278053e-05
epochs = 50
patience = 5

# Get the number of users and items from mappings
num_users = train_sparse.shape[0]
num_items = train_sparse.shape[1]

# Perform matrix factorization with SGD on sparse matrices
P, Q, train_losses, val_losses = matrix_factorization(
    train_sparse, val_sparse, num_users, num_items, latent_dim, epochs, learning_rate, regularization, patience
)

# Output results
print("Training complete.")
print(f"Final user latent matrix (P): {P.shape}")
print(f"Final item latent matrix (Q): {Q.shape}")


Epoch 1/50, Train MSE: 0.7944, Val MSE: 0.8509
Epoch 2/50, Train MSE: 0.7525, Val MSE: 0.8201
Epoch 3/50, Train MSE: 0.7301, Val MSE: 0.8170
Epoch 4/50, Train MSE: 0.7149, Val MSE: 0.8099
Epoch 5/50, Train MSE: 0.7026, Val MSE: 0.7987
Epoch 6/50, Train MSE: 0.6930, Val MSE: 0.7977
Epoch 7/50, Train MSE: 0.6838, Val MSE: 0.7887
Epoch 8/50, Train MSE: 0.6727, Val MSE: 0.7904
Epoch 9/50, Train MSE: 0.6590, Val MSE: 0.7910
Epoch 10/50, Train MSE: 0.6405, Val MSE: 0.7862
Epoch 11/50, Train MSE: 0.6180, Val MSE: 0.7845
Epoch 12/50, Train MSE: 0.5904, Val MSE: 0.7819
Epoch 13/50, Train MSE: 0.5610, Val MSE: 0.7805
Epoch 14/50, Train MSE: 0.5282, Val MSE: 0.7758
Epoch 15/50, Train MSE: 0.4947, Val MSE: 0.7717
Epoch 16/50, Train MSE: 0.4607, Val MSE: 0.7745
Epoch 17/50, Train MSE: 0.4272, Val MSE: 0.7712
Epoch 18/50, Train MSE: 0.3957, Val MSE: 0.7716
Epoch 19/50, Train MSE: 0.3659, Val MSE: 0.7757
Epoch 20/50, Train MSE: 0.3384, Val MSE: 0.7768
Epoch 21/50, Train MSE: 0.3127, Val MSE: 0.7792
E

### Matrix Factorization Training Loop Example (Without Content Blending)
To see what the training loop does, we walk through a simplified numerical example with small matrices.

#### Initial Setup
We use small matrices for simplicity:

1. **User Latent Matrix ($ P $)**:
   $$
   P =
   \begin{bmatrix}
   0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\
   0.2 & 0.3 & 0.1 & 0.5 & 0.4 \\
   0.3 & 0.1 & 0.2 & 0.4 & 0.3
   \end{bmatrix}
   $$
   Shape: $ (3, 5) $ (3 users, 5 latent dimensions).

2. **Item Latent Matrix ($ Q $)**:
   $$
   Q =
   \begin{bmatrix}
   0.4 & 0.3 & 0.5 & 0.2 & 0.1 \\
   0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\
   0.5 & 0.4 & 0.3 & 0.2 & 0.1
   \end{bmatrix}
   $$
   Shape: $ (3, 5) $ (3 items, 5 latent dimensions).

3. **Biases**:
   - Global bias ($ \mu $): $ 3.0 $
   - User biases ($ b_u $): $ [0.1, 0.2, 0.3] $
   - Item biases ($ b_i $): $ [0.2, 0.3, 0.4] $

4. **Known Interaction**:
   - User $ u = 0 $, Item $ i = 1 $, Rating $ R_{ui} = 4.0 $.

5. **Learning Parameters**:
   - Learning rate ($ \eta $): $ 0.01 $
   - Regularization ($ \lambda $): $ 0.1 $

---

#### Step-by-Step Calculation

1. **Compute Prediction ($ \hat{R}_{ui} $)**:
   $$
   \hat{R}_{ui} = \mu + b_u[u] + b_i[i] + P[u, :] \cdot Q[i, :]^T
   $$
   Substituting values:
   $$
   \hat{R}_{ui} = 3.0 + 0.1 + 0.3 + \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} \cdot
   \begin{bmatrix} 0.1 \\ 0.2 \\ 0.3 \\ 0.4 \\ 0.5 \end{bmatrix}
   $$
   Dot product:
   $$
   \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} \cdot
   \begin{bmatrix} 0.1 \\ 0.2 \\ 0.3 \\ 0.4 \\ 0.5 \end{bmatrix} = (0.1 \cdot 0.1) + (0.2 \cdot 0.2) + (0.3 \cdot 0.3) + (0.4 \cdot 0.4) + (0.5 \cdot 0.5) = 0.55
   $$
   Prediction:
   $$
   \hat{R}_{ui} = 3.0 + 0.1 + 0.3 + 0.55 = 3.95
   $$

2. **Compute Error ($ E_{ui} $)**:
   $$
   E_{ui} = R_{ui} - \hat{R}_{ui} = 4.0 - 3.95 = 0.05
   $$

3. **Update Biases**:
   $$
   b_u[u] \leftarrow b_u[u] + \eta \cdot (E_{ui} - \lambda \cdot b_u[u])
   $$
   Substituting values:
   $$
   b_u[0] \leftarrow 0.1 + 0.01 \cdot (0.05 - 0.1 \cdot 0.1) = 0.1 + 0.01 \cdot (0.05 - 0.01) = 0.1 + 0.0004 = 0.1004
   $$
   Similarly, for $ b_i[i] $:
   $$
   b_i[1] \leftarrow 0.3 + 0.01 \cdot (0.05 - 0.1 \cdot 0.3) = 0.3 + 0.01 \cdot (0.05 - 0.03) = 0.3 + 0.0002 = 0.3002
   $$

4. **Update Latent Matrices ($ P $ and $ Q $)**:
   $$
   P[u, :] \leftarrow P[u, :] + \eta \cdot (E_{ui} \cdot Q[i, :] - \lambda \cdot P[u, :])
   $$
   Substituting values:
   $$
   P[0, :] \leftarrow \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} +
   0.01 \cdot (0.05 \cdot \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} -
   0.1 \cdot \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix})
   $$
   Compute intermediate terms:
   $$
   0.05 \cdot \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.005 & 0.01 & 0.015 & 0.02 & 0.025 \end{bmatrix}
   $$
   $$
   0.1 \cdot \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.01 & 0.02 & 0.03 & 0.04 & 0.05 \end{bmatrix}
   $$
   $$
   \begin{bmatrix} 0.005 & 0.01 & 0.015 & 0.02 & 0.025 \end{bmatrix} -
   \begin{bmatrix} 0.01 & 0.02 & 0.03 & 0.04 & 0.05 \end{bmatrix} =
   \begin{bmatrix} -0.005 & -0.01 & -0.015 & -0.02 & -0.025 \end{bmatrix}
   $$
   Scale by $ \eta $:
   $$
   0.01 \cdot \begin{bmatrix} -0.005 & -0.01 & -0.015 & -0.02 & -0.025 \end{bmatrix} =
   \begin{bmatrix} -0.00005 & -0.0001 & -0.00015 & -0.0002 & -0.00025 \end{bmatrix}
   $$
   Update $ P[u, :] $:
   $$
   P[0, :] \leftarrow \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} +
   \begin{bmatrix} -0.00005 & -0.0001 & -0.00015 & -0.0002 & -0.00025 \end{bmatrix} =
   \begin{bmatrix} 0.09995 & 0.1999 & 0.29985 & 0.3998 & 0.49975 \end{bmatrix}
   $$
   Similarly, update $ Q[i, :] $:
   $$
   Q[i, :] \leftarrow Q[i, :] + \eta \cdot (E_{ui} \cdot P[u, :] - \lambda \cdot Q[i, :])
   $$

---

#### Summary of Updates
After one loop:
- Updated $ P[0, :] $: $ \begin{bmatrix} 0.09995 & 0.1999 & 0.29985 & 0.3998 & 0.49975 \end{bmatrix} $
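- Updated $ Q[1, :] $: Computed the same way. Because $ P[0, :] $ and $ Q[1, :] $ start out identical in this example, using the pre-update $ P[0, :] $ gives the same values as the updated $ P[0, :] $.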
- Updated $ b_u[0] $: $ 0.1004 $
- Updated $ b_i[1] $: $ 0.3002 $

This process repeats for all observed user-item interactions.
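The worked example can be checked numerically. This sketch reuses the matrices and parameters from the setup above and assumes that all four deltas are computed from the pre-update values before any of them is applied.

```python
import numpy as np

P = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
              [0.2, 0.3, 0.1, 0.5, 0.4],
              [0.3, 0.1, 0.2, 0.4, 0.3]])
Q = np.array([[0.4, 0.3, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.3, 0.4, 0.5],
              [0.5, 0.4, 0.3, 0.2, 0.1]])
b_u = np.array([0.1, 0.2, 0.3])
b_i = np.array([0.2, 0.3, 0.4])
mu, eta, lam = 3.0, 0.01, 0.1
u, i, rating = 0, 1, 4.0

# Prediction and error for the single known interaction
prediction = mu + b_u[u] + b_i[i] + P[u] @ Q[i]
error = rating - prediction

# Assumption: compute every delta from the pre-update parameters
delta_b_u = eta * (error - lam * b_u[u])
delta_b_i = eta * (error - lam * b_i[i])
delta_P = eta * (error * Q[i] - lam * P[u])
delta_Q = eta * (error * P[u] - lam * Q[i])

b_u[u] += delta_b_u
b_i[i] += delta_b_i
P[u] += delta_P
Q[i] += delta_Q

print("prediction:", round(prediction, 4), "error:", round(error, 4))
print("b_u[0]:", round(b_u[u], 4), "b_i[1]:", round(b_i[i], 4))
print("P[0, :]:", np.round(P[u], 5))
print("Q[1, :]:", np.round(Q[i], 5))
```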


## Search hyperparameters with Optuna

In [None]:
import optuna

def objective(trial):
    """
    Objective function for Optuna hyperparameter optimization.

    Parameters:
        trial (optuna.Trial): A trial object for hyperparameter suggestions.

    Returns:
        float: Validation loss after the final epoch of this trial.
    """
    # Suggest hyperparameters
    latent_dim = trial.suggest_int("latent_dim", 5, 50)  # Latent dimensions
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)  # Learning rate
    regularization = trial.suggest_float("regularization", 1e-5, 1e-1, log=True)  # Regularization parameter
    patience = 5  # Fixed patience for early stopping
    epochs = 50  # Fixed number of epochs per trial

    # Train the model with the suggested hyperparameters
    _, _, _, val_losses = matrix_factorization(
        train_sparse=train_sparse,
        val_sparse=val_sparse,
        num_users=len(user_mapping),
        num_items=len(movie_mapping),
        latent_dim=latent_dim,
        epochs=epochs,
        learning_rate=learning_rate,
        regularization=regularization,
        patience=patience,
    )

    # Return the last validation loss as the objective value
    return val_losses[-1]


# Create Optuna study
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)  # Run for 30 trials

# Print the best hyperparameters
print("Best hyperparameters:")
print(study.best_params)

# Best validation loss
print(f"Best validation loss: {study.best_value:.4f}")

# Best trial details
print(f"Best trial: {study.best_trial}")
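The objective returns `val_losses[-1]`. With early stopping, the final epoch usually lies a few epochs past the best one, so the last loss can be worse than the best loss of the trial. The sketch below illustrates the gap using only the validation losses visible in the (truncated) training log of the earlier run. Returning `min(val_losses)` instead is one possible alternative, shown here as an assumption rather than the notebook's method.

```python
# Validation MSE for epochs 1-21, copied from the training log of the earlier run
val_losses = [
    0.8509, 0.8201, 0.8170, 0.8099, 0.7987, 0.7977, 0.7887,
    0.7904, 0.7910, 0.7862, 0.7845, 0.7819, 0.7805, 0.7758,
    0.7717, 0.7745, 0.7712, 0.7716, 0.7757, 0.7768, 0.7792,
]

last_loss = val_losses[-1]                      # what the objective currently returns
best_loss = min(val_losses)                     # alternative objective value
best_epoch = val_losses.index(best_loss) + 1    # 1-based epoch number

print(f"Last: {last_loss:.4f}, best: {best_loss:.4f} (epoch {best_epoch})")
```

Which value to optimize is a design choice: the last loss matches the parameters the function actually returns, while the best loss matches what a checkpointing scheme would keep.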


# Step 3: Incorporating Content-Based Features into Matrix Factorization

## 1. Why Combine Content-Based and Collaborative Filtering?
- Collaborative filtering leverages historical user-item interaction data to recommend items. However, it struggles with:
  - **Cold Start Problem**: Recommending items or users with little or no interaction data.
  - **Sparse Data**: Performance issues when the user-item interaction matrix is sparse.
- Content-based filtering uses item metadata (e.g., descriptions, tags, genres) to extract features and generate recommendations.
- Combining the two approaches enhances the recommendation system by:
  - Enabling predictions for new items/users using metadata.
  - Improving overall recommendation quality by blending collaborative and content-based signals.

---

## 2. Content-Based Feature Extraction
To integrate content-based insights, item metadata must be represented in a machine-readable format. A common method is **TF-IDF (Term Frequency-Inverse Document Frequency)**.

### TF-IDF Overview
- **Term Frequency (TF)**: Measures how often a term appears in an item description.
  $$
  TF(t, d) = \frac{\text{Frequency of term } t \text{ in item } d}{\text{Total terms in item } d}
  $$
- **Inverse Document Frequency (IDF)**: Reduces the weight of common terms across all items.
  $$
  IDF(t, D) = \log \left( \frac{\text{Total items in the dataset}}{\text{Number of items containing term } t} \right)
  $$
- **TF-IDF Score**: Combines TF and IDF to compute term importance in item descriptions.
  $$
  TF\text{-}IDF(t, d, D) = TF(t, d) \cdot IDF(t, D)
  $$

### Implementation in Recommendation Systems
- Use **TF-IDF Vectorizer** to transform item metadata into a sparse matrix (items x features).
- Resulting features represent items in a high-dimensional space, capturing content semantics.
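The formulas above can be implemented in a few lines. This is a minimal sketch with whitespace tokenization and hypothetical genre strings. Library implementations differ in detail: scikit-learn's `TfidfVectorizer`, for instance, by default uses raw term counts, a smoothed IDF, and L2 row normalization, so its numbers will not match this sketch.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using the unsmoothed formulas."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical item metadata, for illustration only
items = ["action adventure", "action comedy", "romance drama"]
scores = tf_idf(items)
for item, item_scores in zip(items, scores):
    print(item, {term: round(score, 4) for term, score in item_scores.items()})
```

"action" appears in two of the three items and therefore scores lower than "adventure", which appears in only one.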

---

## 3. Combining Content-Based Features with Collaborative Filtering
To blend content-based and collaborative filtering:
1. **Feature Matrix Construction**:
   - Create a **content matrix** (e.g., from TF-IDF) for all items.
   - Optionally, reduce dimensions using techniques like **Truncated SVD** for computational efficiency.
2. **Augment Item Representations**:
   - Enhance collaborative filtering's item feature matrix (`Q`) with content-based features.
   - Example: Concatenate or blend the learned item latent features with the content matrix.
3. **Reformulate Prediction Function**:
   - Combine user latent matrix (`P`) with both collaborative and content-based item features.
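   $$
   \hat{R}_{ui} = \mu + b_u + b_i + P_u \cdot [Q_i, \text{Content}_i]^T
   $$
   where:
   - $ P_u $: User latent features, with one weight per collaborative dimension followed by one weight per content feature.
   - $ Q_i $: Collaborative latent item features.
   - $ \text{Content}_i $: Content-based item features.
   - $ [Q_i, \text{Content}_i] $: Concatenation of the two item vectors, so the dot product with $ P_u $ is well defined.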
4. **Optimize**:
   - Train the model with both collaborative and content-based features using the same loss function, ensuring effective joint learning.
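The blended prediction can be sketched as follows. All numbers are illustrative assumptions, and `predict_blended` is a hypothetical helper. The sketch also shows that the concatenated dot product equals the sum of a collaborative part and a content part, each using its own slice of the user vector.

```python
import numpy as np

def predict_blended(mu, b_u, b_i, p_u, q_i, content_i):
    """Predict a rating from biases and the user vector dotted with [q_i, content_i]."""
    augmented_item = np.concatenate([q_i, content_i])
    return mu + b_u + b_i + p_u @ augmented_item

latent_dim = 2
p_u = np.array([0.1, 0.2, 0.05, 0.03, 0.04])   # 2 collaborative + 3 content weights
q_i = np.array([0.3, 0.4])                     # collaborative item factors
content_i = np.array([1.0, 0.0, 0.5])          # static content features of the item

prediction = predict_blended(3.0, 0.1, 0.3, p_u, q_i, content_i)

# Equivalent split form: collaborative part plus content part
collab_part = p_u[:latent_dim] @ q_i
content_part = p_u[latent_dim:] @ content_i
split_prediction = 3.0 + 0.1 + 0.3 + collab_part + content_part

print(round(prediction, 4), round(split_prediction, 4))
```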

---

## 4. Practical Considerations
- **Scalability**: TF-IDF features can lead to high-dimensional matrices; consider dimensionality reduction.
- **Data Integrity**: Ensure metadata is well-prepared, clean, and consistent.
- **Weight Balancing**: Experiment with weight ratios between collaborative and content-based features for optimal results.

By incorporating content-based features into matrix factorization, the recommendation system leverages metadata for robust predictions, addressing cold start and sparsity challenges.
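Dimensionality reduction and weight balancing can be combined in one preprocessing step. The sketch below assumes a small dense feature matrix and uses NumPy's SVD. `reduce_and_scale` and `content_weight` are hypothetical names, and the random matrix stands in for real TF-IDF features. For large sparse matrices, scikit-learn's `TruncatedSVD` is the more practical choice.

```python
import numpy as np

def reduce_and_scale(content, n_components, content_weight=1.0):
    """Project item features onto their top singular directions, then scale them."""
    U, S, Vt = np.linalg.svd(content, full_matrices=False)
    # Keep the n_components strongest directions (truncated SVD without centering)
    reduced = U[:, :n_components] * S[:n_components]
    # A weight below 1 damps the content signal relative to the collaborative factors
    return content_weight * reduced

rng = np.random.default_rng(0)
content = rng.random((6, 10))   # hypothetical features: 6 items x 10 terms
reduced = reduce_and_scale(content, n_components=3, content_weight=0.5)

print(reduced.shape)
```

Treating `content_weight` and `n_components` as additional hyperparameters in the search is one way to tune the balance.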


## Comparison of Loss: Matrix Factorization With and Without Content Blending

### 1. **Matrix Factorization Without Content Blending**
The loss is calculated using only the collaborative filtering components:
$$
\text{Loss}_{\text{no content}} = \sum_{(u, i) \in \mathcal{D}} \left( R_{ui} - (\mu + b_u[u] + b_i[i] + P[u, :] \cdot Q[i, :]^T) \right)^2 + \lambda \cdot \left( ||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2 \right)
$$

- **Error Term**:
  - The predicted rating $ \hat{R}_{ui} $ is derived only from the collaborative latent factors $ P $ and $ Q $.
- **Regularization**:
  - Penalizes the magnitudes of $ P $, $ Q $, user biases $ b_u $, and item biases $ b_i $.

---

### 2. **Matrix Factorization With Content Blending**
The loss incorporates both collaborative and content-based components:
$$
\text{Loss}_{\text{content}} = \sum_{(u, i) \in \mathcal{D}} \left( R_{ui} - (\mu + b_u[u] + b_i[i] + P[u, :] \cdot [Q[i, :], \text{Content}_i]^T) \right)^2 + \lambda \cdot \left( ||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2 \right)
$$

- **Error Term**:
  - The predicted rating $ \hat{R}_{ui} $ now includes the contribution of the augmented item latent vector:
    - $ [Q[i, :], \text{Content}_i] $: Combines collaborative features ($ Q $) and content-based features ($ \text{Content}_i $).
- **Regularization**:
  - Similar to the non-content case but focuses only on $ P $, $ Q $, $ b_u $, and $ b_i $. The content features ($ \text{Content}_i $) are static and not regularized.

---

### Example Comparison
Consider the following:
- Dataset with $ 3 $ users and $ 3 $ items.
- Ratings ($ R_{ui} $):
  $$
  \mathcal{D} = \{(u=0, i=1, R_{ui}=4.0), (u=1, i=2, R_{ui}=3.5), (u=2, i=0, R_{ui}=5.0)\}
  $$
- Squared norms of $ P $, $ Q $, and the biases: $ ||P||^2 = 1.2 $, $ ||Q||^2 = 0.8 $, $ ||b_u||^2 = 0.2 $, $ ||b_i||^2 = 0.3 $.
- Regularization parameter: $ \lambda = 0.1 $.

---

**Without Content Blending:**
1. Predictions:
   - $ \hat{R}_{0,1} = 3.95 $, $ \hat{R}_{1,2} = 3.6 $, $ \hat{R}_{2,0} = 4.8 $.
2. Errors:
   - $ (R_{0,1} - \hat{R}_{0,1})^2 = (4.0 - 3.95)^2 = 0.0025 $,
   - $ (R_{1,2} - \hat{R}_{1,2})^2 = (3.5 - 3.6)^2 = 0.01 $,
   - $ (R_{2,0} - \hat{R}_{2,0})^2 = (5.0 - 4.8)^2 = 0.04 $.
3. Total Error:
   $$
   \text{Error} = 0.0025 + 0.01 + 0.04 = 0.0525
   $$
4. Regularization:
   $$
   \lambda \cdot (||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2) = 0.1 \cdot (1.2 + 0.8 + 0.2 + 0.3) = 0.25
   $$
5. Total Loss:
   $$
   \text{Loss}_{\text{no content}} = 0.0525 + 0.25 = 0.3025
   $$

---

**With Content Blending:**
1. Predictions:
   - Incorporates both $ Q[i, :] $ and $ \text{Content}_i $. Suppose the content-based contribution improves predictions:
     - $ \hat{R}_{0,1} = 4.0 $, $ \hat{R}_{1,2} = 3.55 $, $ \hat{R}_{2,0} = 4.9 $.
2. Errors:
   - $ (R_{0,1} - \hat{R}_{0,1})^2 = (4.0 - 4.0)^2 = 0.0 $,
   - $ (R_{1,2} - \hat{R}_{1,2})^2 = (3.5 - 3.55)^2 = 0.0025 $,
   - $ (R_{2,0} - \hat{R}_{2,0})^2 = (5.0 - 4.9)^2 = 0.01 $.
3. Total Error:
   $$
   \text{Error} = 0.0 + 0.0025 + 0.01 = 0.0125
   $$
4. Regularization:
   - Regularization remains unchanged: $ 0.25 $.
5. Total Loss:
   $$
   \text{Loss}_{\text{content}} = 0.0125 + 0.25 = 0.2625
   $$

---

### Observations:
1. **Without Content Blending**:
   - Loss: $ \text{Loss}_{\text{no content}} = 0.3025 $.
   - Predictions are based only on collaborative filtering.

2. **With Content Blending**:
   - Loss: $ \text{Loss}_{\text{content}} = 0.2625 $.
   - Improved predictions reduce the error term, leading to a lower total loss.

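3. **Impact**:
   - In this constructed example, the content features are assumed to improve the predictions, which lowers the error term and therefore the total loss. On real data, any gain has to be confirmed on the validation loss.

4. **Why the regularization term is not affected by the augmented item vector**:
   - The TF-IDF content matrix is static: training never updates it, so it contributes no learnable parameters to penalize. The extra content columns of $ P $ are learnable and are covered by $ ||P||^2 $.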
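The arithmetic of the comparison can be reproduced in a few lines. The predictions and squared norms are the assumed values from the example, not measured results, and `total_loss` is a hypothetical helper.

```python
ratings = [4.0, 3.5, 5.0]
pred_no_content = [3.95, 3.6, 4.8]   # assumed predictions without content
pred_content = [4.0, 3.55, 4.9]      # assumed predictions with content

lam = 0.1
squared_norms = 1.2 + 0.8 + 0.2 + 0.3   # ||P||^2 + ||Q||^2 + ||b_u||^2 + ||b_i||^2
penalty = lam * squared_norms

def total_loss(ratings, predictions, penalty):
    """Sum of squared errors plus a precomputed regularization penalty."""
    squared_error = sum((r - p) ** 2 for r, p in zip(ratings, predictions))
    return squared_error + penalty

loss_no_content = total_loss(ratings, pred_no_content, penalty)
loss_content = total_loss(ratings, pred_content, penalty)

print(f"Without content: {loss_no_content:.4f}")
print(f"With content:    {loss_content:.4f}")
```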




We augment the training code with the content-based features created in Step 1: Prepare Data.

## First we add the content dimensions to the initialisation of P.

In [5]:
import numpy as np

def initialize_latent_matrices_content(num_users, num_items, latent_dim, content_dim=None):
    """
    Initialize the user and item latent matrices with small random values.

    Parameters:
        num_users (int): Number of users.
        num_items (int): Number of items.
        latent_dim (int): Number of collaborative latent dimensions.
        content_dim (int, optional): Number of content features. Defaults to the
            number of columns of the global `content_matrix`.

    Returns:
        P (numpy.ndarray): User latent matrix of shape (num_users, latent_dim + content_dim).
        Q (numpy.ndarray): Item latent matrix of shape (num_items, latent_dim).
    """
    if content_dim is None:
        # Fall back to the global content matrix prepared in Step 1
        content_dim = content_matrix.shape[1]

    # P gets extra columns that act as per-user weights on the content features
    P = np.random.normal(scale=0.01, size=(num_users, latent_dim + content_dim))
    Q = np.random.normal(scale=0.01, size=(num_items, latent_dim))

    return P, Q

### Comparison of $ P $ With and Without Content Blending

#### 1. **Initialization of $ P $**
- **Without Content Blending**:  
  $ P $ is initialized with shape:
  $$
  P_{\text{no content}} \in \mathbb{R}^{\text{num\_users} \times \text{latent\_dim}}
  $$
- **With Content Blending**:  
  $ P $ includes extra dimensions to account for the content matrix:
  $$
  P_{\text{content}} \in \mathbb{R}^{\text{num\_users} \times (\text{latent\_dim} + \text{content\_dim})}
  $$

---

#### 2. **How $ P $ Changes During Training**

**Without Content Blending**:
- $ P_{\text{no content}} $ is updated using only collaborative filtering gradients:
  $$
  \Delta P[u, :] = \eta \cdot (E_{ui} \cdot Q[i, :] - \lambda \cdot P[u, :])
  $$
  - $ Q $: Item latent factors (collaborative only).
  - $ E_{ui} $: Prediction error.

**With Content Blending**:
- $ P_{\text{content}} $ has additional dimensions to interact with **content-based features**:
  $$
  \Delta P[u, :] = \eta \cdot (E_{ui} \cdot [Q[i, :], \text{Content}_i] - \lambda \cdot P[u, :])
  $$
  - The collaborative part (first $ \text{latent\_dim} $) interacts with $ Q[i, :] $.
  - The added content dimensions interact with $ \text{Content}_i $ (precomputed TF-IDF features).

---

#### 3. **Differences Between $ P_{\text{no content}} $ and $ P_{\text{content}} $**

1. **Dimensionality**:
   - $ P_{\text{no content}} $: Shape is $ (\text{num\_users}, \text{latent\_dim}) $.
   - $ P_{\text{content}} $: Shape is $ (\text{num\_users}, \text{latent\_dim} + \text{content\_dim}) $.

2. **Interpretation**:
   - **Without Content**: Entire $ P $ represents user preferences derived only from collaborative signals.
   - **With Content**:
     - First $ \text{latent\_dim} $ columns represent collaborative filtering preferences.
     - Last $ \text{content\_dim} $ columns encode **user-specific weights** for content features.

3. **Post-Training Values**:
   - Without content blending, $ P $ learns patterns from user-item interactions alone.
   - With content blending:
     - The collaborative part of $ P $ (first $ \text{latent\_dim} $) behaves similarly to the non-content case.
     - The content dimensions of $ P $ adapt to capture user preferences for specific item features (e.g., genres or tags).

---

#### 4. **Example Illustration**
Suppose:
- $ \text{latent\_dim} = 2 $, $ \text{content\_dim} = 3 $.

**Without Content Blending**:
$$
P_{\text{no content}} = 
\begin{bmatrix}
0.1 & 0.2 \\
0.3 & 0.4 \\
0.5 & 0.6
\end{bmatrix}
$$

**With Content Blending**:
$$
P_{\text{content}} = 
\begin{bmatrix}
0.1 & 0.2 & 0.05 & 0.03 & 0.04 \\
0.3 & 0.4 & 0.07 & 0.02 & 0.06 \\
0.5 & 0.6 & 0.01 & 0.09 & 0.05
\end{bmatrix}
$$

- The first two columns (collaborative dimensions) are similar to $ P_{\text{no content}} $.
- The last three columns (content dimensions) represent user-specific preferences for content features.

---

#### 5. **Impact of Content Blending**
- **Without Content Blending**: $ P $ contains only collaborative user preferences.
- **With Content Blending**: $ P $ incorporates both collaborative preferences and content-based preferences.
- The additional dimensions in $ P $ help improve predictions by leveraging content-based metadata.

---

### Summary
1. **Dimensionality**:
   - $ P_{\text{no content}} $: $ \text{num\_users} \times \text{latent\_dim} $.
   - $ P_{\text{content}} $: $ \text{num\_users} \times (\text{latent\_dim} + \text{content\_dim}) $.

2. **Post-Training Behavior**:
   - Collaborative dimensions in $ P_{\text{content}} $ behave similarly to $ P_{\text{no content}} $.
   - Content dimensions in $ P_{\text{content}} $ encode user preferences for item metadata.

3. **Why It Matters**:
   - Content blending improves predictions by integrating metadata, which helps especially in scenarios like the **cold-start problem**.
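A single update step makes the two roles of $ P $ concrete. The sketch uses the first row of the example $ P_{\text{content}} $ above. The item vectors, the error, and the learning parameters are illustrative assumptions.

```python
import numpy as np

latent_dim, content_dim = 2, 3
eta, lam = 0.01, 0.1

p_u = np.array([0.1, 0.2, 0.05, 0.03, 0.04])   # first row of the example P_content
q_i = np.array([0.3, 0.4])                     # assumed collaborative item factors
content_i = np.array([1.0, 0.0, 0.5])          # assumed static content features
error = 0.5                                    # assumed prediction error E_ui

augmented_q_i = np.concatenate([q_i, content_i])

# The user vector is updated against the full augmented item vector
delta_P = eta * (error * augmented_q_i - lam * p_u)

# The item factors only see the collaborative slice of the user vector;
# the content features themselves are never updated
delta_Q = eta * (error * p_u[:latent_dim] - lam * q_i)

print("delta_P (collaborative):", delta_P[:latent_dim])
print("delta_P (content):      ", delta_P[latent_dim:])
print("delta_Q:                ", delta_Q)
```

A content weight whose feature is zero for this item (the fourth entry) receives no error signal and only shrinks slightly through regularization.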


In [15]:
def compute_sparse_loss_content(rows, cols, values, P, Q, b_u, b_i, mu, regularization, content_matrix, normalize=True):
    """
    Compute the loss for a given sparse interaction matrix, including bias terms and content-based features.

    Parameters:
        rows (np.ndarray): Row indices of the non-zero entries.
        cols (np.ndarray): Column indices of the non-zero entries.
        values (np.ndarray): Corresponding non-zero values.
        P (np.ndarray): User latent matrix (includes both collaborative and content dimensions).
        Q (np.ndarray): Collaborative item latent matrix (transposed).
        b_u (np.ndarray): User biases.
        b_i (np.ndarray): Item biases.
        mu (float): Global bias.
        regularization (float): Regularization parameter.
        content_matrix (csr_matrix): Sparse matrix of content-based features.
        normalize (bool): Whether to divide the total loss (squared error plus regularization penalty) by the number of ratings.

    Returns:
        float: Computed loss value.
    """

    assert len(rows) == len(cols) == len(values), f"Invalid sizes: {len(rows)}, {len(cols)}, {len(values)}"

    loss = 0
    for idx in range(len(values)):
        u = rows[idx]
        i = cols[idx]

        assert 0 <= u < len(P), f"Invalid user index: {u}"
        assert 0 <= i < len(Q.T), f"Invalid item index: {i}"

        rating = values[idx]

        # Extract content-based features for item i
        content_features = content_matrix.getrow(i).toarray().flatten()  # Extract sparse row as dense array

        # Augment item latent vector with content features
        augmented_Q_i = np.hstack([Q[:, i], content_features])

        # Compute prediction including biases
        prediction = mu + b_u[u] + b_i[i] + np.dot(P[u, :], augmented_Q_i)
        error = rating - prediction
        loss += error**2

    # Add regularization term (includes biases and collaborative latent vectors)
    loss += regularization * (
        np.linalg.norm(P)**2 + np.linalg.norm(Q)**2 + np.linalg.norm(b_u)**2 + np.linalg.norm(b_i)**2
    )

    # Normalize the loss if requested
    if normalize:
        loss /= len(values)

    return loss

def matrix_factorization_with_content(train_sparse, val_sparse, content_matrix, num_users, num_items, latent_dim, 
                                      epochs, learning_rate, regularization, patience=5, clip_value=5.0):
    """
    Perform matrix factorization using SGD with content-based feature integration.

    Parameters:
        train_sparse (csr_matrix): Training interaction matrix (sparse).
        val_sparse (csr_matrix): Validation interaction matrix (sparse).
        content_matrix (csr_matrix): Content-based feature matrix (items x features, sparse).
        num_users (int): Total number of users.
        num_items (int): Total number of items.
        latent_dim (int): Number of latent features.
        epochs (int): Number of training epochs.
        learning_rate (float): Learning rate for gradient descent.
        regularization (float): Regularization parameter.
        patience (int): Early stopping patience.
        clip_value (float): Maximum absolute size of a single update step (learning rate times gradient).

    Returns:
        P (np.ndarray): Learned user latent matrix.
        Q (np.ndarray): Learned item latent matrix.
        train_losses (list): Training losses over epochs.
        val_losses (list): Validation losses over epochs.
        b_u (np.ndarray): Learned user biases.
        b_i (np.ndarray): Learned item biases.
        mu (float): Global bias (mean of the training ratings; not updated during training).
    """
    # Initialize latent matrices
    P, Q = initialize_latent_matrices_content(num_users, num_items, latent_dim)

    # Convert item latent matrix for matrix operations
    Q = Q.T

    # Initialize biases
    mu = train_sparse.data.mean()  # Global bias
    b_u = np.zeros(num_users)  # User biases
    b_i = np.zeros(num_items)  # Item biases

    # Track losses
    train_losses = []
    val_losses = []
    best_val_loss = float('inf')
    patience_counter = 0

    # Get non-zero training indices for sparse matrix
    train_rows, train_cols = train_sparse.nonzero()
    train_values = train_sparse.data

    assert len(train_rows) == len(train_cols) == len(train_values), f"Invalid sizes: {len(train_rows)}, {len(train_cols)}, {len(train_values)}"
    
    # Get non-zero validation indices for sparse matrix
    val_rows, val_cols = val_sparse.nonzero()
    val_values = val_sparse.data

    assert len(val_rows) == len(val_cols) == len(val_values), f"Invalid sizes: {len(val_rows)}, {len(val_cols)}, {len(val_values)}"

    for epoch in range(epochs):
        # Shuffle training indices
        shuffle_indices = np.random.permutation(len(train_rows))
        train_rows = train_rows[shuffle_indices]
        train_cols = train_cols[shuffle_indices]
        train_values = train_values[shuffle_indices]

        # SGD for each non-zero entry in the sparse matrix
        for idx in range(len(train_rows)):
            u = train_rows[idx]
            i = train_cols[idx]

            assert 0 <= u < len(P), f"Invalid user index: {u} at index {idx}"
            assert 0 <= i < len(Q.T), f"Invalid item index: {i} at index {idx}"

            rating = train_values[idx]

            # Compute content-based features for item `i`
            content_features = content_matrix.getrow(i).toarray().flatten()  # Extract sparse row as dense array

            # Augment item latent vector with content features
            augmented_Q_i = np.hstack([Q[:, i], content_features])

            # Compute prediction and error
            prediction = mu + b_u[u] + b_i[i] + np.dot(P[u, :], augmented_Q_i)
            error = rating - prediction

            # Update biases with gradient clipping
            delta_b_u = learning_rate * (error - regularization * b_u[u])
            delta_b_i = learning_rate * (error - regularization * b_i[i])
            b_u[u] += np.clip(delta_b_u, -clip_value, clip_value)
            b_i[i] += np.clip(delta_b_i, -clip_value, clip_value)

            # Update user and item latent vectors with gradient clipping
            delta_P = learning_rate * (error * augmented_Q_i - regularization * P[u, :])
            delta_Q_collab = learning_rate * (error * P[u, :latent_dim] - regularization * Q[:, i])

            P[u, :] += np.clip(delta_P, -clip_value, clip_value)
            Q[:, i] += np.clip(delta_Q_collab, -clip_value, clip_value)


        # Compute training and validation losses
        train_loss = compute_sparse_loss_content(train_rows, train_cols, train_values, P, Q, b_u, b_i, mu, regularization, content_matrix)
        val_loss = compute_sparse_loss_content(val_rows, val_cols, val_values, P, Q, b_u, b_i, mu, regularization, content_matrix)

        train_losses.append(train_loss)
        val_losses.append(val_loss)

        print(f"Epoch {epoch + 1}/{epochs}, Train MSE: {train_loss:.4f}, Val MSE: {val_loss:.4f}")

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered.")
                break

    return P, Q.T, train_losses, val_losses, b_u, b_i, mu
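
The per-entry Python loop in the loss function can become slow on large rating sets. As a possible alternative, the sketch below computes the plain, unregularized MSE for all entries at once. It assumes the content matrix fits in memory as a dense array and that `Q` has shape (num_items, latent_dim), that is, the orientation returned by the training function. `vectorized_mse` is a hypothetical helper, and the data is made up for illustration.

```python
import numpy as np

def vectorized_mse(rows, cols, values, P, Q, b_u, b_i, mu, content_dense):
    """Unregularized MSE over all observed ratings, computed without a Python loop."""
    # Augmented item matrix: (num_items, latent_dim + content_dim)
    augmented_Q = np.hstack([Q, content_dense])
    # Row-wise dot products between each rating's user and item vectors
    dots = np.einsum("ij,ij->i", P[rows], augmented_Q[cols])
    predictions = mu + b_u[rows] + b_i[cols] + dots
    return np.mean((values - predictions) ** 2)

# Tiny made-up example: 3 users, 3 items, latent_dim = 2, content_dim = 2
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
values = np.array([4.0, 3.5, 5.0])
rng = np.random.default_rng(0)
P = rng.normal(scale=0.01, size=(3, 4))
Q = rng.normal(scale=0.01, size=(3, 2))
content_dense = np.array([[0.6, 0.8], [0.7, 0.9], [0.5, 0.4]])
b_u, b_i, mu = np.zeros(3), np.zeros(3), values.mean()

mse = vectorized_mse(rows, cols, values, P, Q, b_u, b_i, mu, content_dense)
print(f"MSE: {mse:.4f}")
```

Because this version leaves out the regularization penalty, its value is a pure error metric and will be slightly lower than the loss printed during training.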


In [17]:
# Integration with Data Preparation Code
# Sparse matrices should already be prepared (train_sparse, val_sparse, test_sparse).

# 0.8027304317478758 and parameters: {'latent_dim': 28, 'learning_rate': 0.008089574860460297, 'regularization': 1.6073221934120177e-05}

# Define parameters
latent_dim = 28
learning_rate = 0.008089574860460297
regularization = 1.6073221934120177e-05
epochs = 100
patience = 5

# Get the number of users and items from mappings
num_users = train_sparse.shape[0]
num_items = train_sparse.shape[1]

# Perform matrix factorization with SGD on sparse matrices
P, Q, train_losses, val_losses, b_u, b_i, mu = matrix_factorization_with_content(
    train_sparse, val_sparse, content_matrix, num_users, num_items, latent_dim, epochs, learning_rate, regularization, patience
)

# Output results
print("Training complete.")
print(f"Final user latent matrix (P): {P.shape}")
print(f"Final item latent matrix (Q): {Q.shape}")
print(f"User biases shape: {b_u.shape}")
print(f"Item biases shape: {b_i.shape}")
print(f"Global bias (mu): {mu:.4f}")


Epoch 1/100, Train MSE: 0.8070, Val MSE: 0.8697
Epoch 2/100, Train MSE: 0.7551, Val MSE: 0.8487
Epoch 3/100, Train MSE: 0.7234, Val MSE: 0.8275
Epoch 4/100, Train MSE: 0.7002, Val MSE: 0.8224
Epoch 5/100, Train MSE: 0.6810, Val MSE: 0.8198
Epoch 6/100, Train MSE: 0.6654, Val MSE: 0.8124
Epoch 7/100, Train MSE: 0.6510, Val MSE: 0.8096
Epoch 8/100, Train MSE: 0.6385, Val MSE: 0.8099
Epoch 9/100, Train MSE: 0.6272, Val MSE: 0.8126
Epoch 10/100, Train MSE: 0.6162, Val MSE: 0.8074
Epoch 11/100, Train MSE: 0.6053, Val MSE: 0.8062
Epoch 12/100, Train MSE: 0.5945, Val MSE: 0.8088
Epoch 13/100, Train MSE: 0.5836, Val MSE: 0.8076
Epoch 14/100, Train MSE: 0.5717, Val MSE: 0.8081
Epoch 15/100, Train MSE: 0.5586, Val MSE: 0.8059
Epoch 16/100, Train MSE: 0.5440, Val MSE: 0.8030
Epoch 17/100, Train MSE: 0.5282, Val MSE: 0.8081
Epoch 18/100, Train MSE: 0.5116, Val MSE: 0.8054
Epoch 19/100, Train MSE: 0.4946, Val MSE: 0.8035
Epoch 20/100, Train MSE: 0.4766, Val MSE: 0.8044
Epoch 21/100, Train MSE: 0.45

In [18]:
import numpy as np

def save_matrices(P, Q, b_u, b_i, mu, 
                  P_path='P_matrix.npy', Q_path='Q_matrix.npy', 
                  b_u_path='b_u.npy', b_i_path='b_i.npy', mu_path='mu.npy'):
    """
    Save the trained P, Q matrices, user biases, item biases, and global bias to files.

    Args:
    - P (np.ndarray): Trained user matrix.
    - Q (np.ndarray): Trained item matrix.
    - b_u (np.ndarray): User biases.
    - b_i (np.ndarray): Item biases.
    - mu (float): Global bias.
    - P_path (str): File path to save the P matrix.
    - Q_path (str): File path to save the Q matrix.
    - b_u_path (str): File path to save the user biases.
    - b_i_path (str): File path to save the item biases.
    - mu_path (str): File path to save the global bias.
    """
    np.save(P_path, P)
    np.save(Q_path, Q)
    np.save(b_u_path, b_u)
    np.save(b_i_path, b_i)
    np.save(mu_path, np.array([mu]))  # Save mu as a single-element array
    print(f"Saved P matrix to {P_path}")
    print(f"Saved Q matrix to {Q_path}")
    print(f"Saved user biases to {b_u_path}")
    print(f"Saved item biases to {b_i_path}")
    print(f"Saved global bias to {mu_path}")


def load_matrices(P_path='P_matrix.npy', Q_path='Q_matrix.npy', 
                  b_u_path='b_u.npy', b_i_path='b_i.npy', mu_path='mu.npy'):
    """
    Load the trained P, Q matrices, user biases, item biases, and global bias from files.

    Args:
    - P_path (str): File path to load the P matrix.
    - Q_path (str): File path to load the Q matrix.
    - b_u_path (str): File path to load the user biases.
    - b_i_path (str): File path to load the item biases.
    - mu_path (str): File path to load the global bias.

    Returns:
    - P (np.ndarray): Loaded user matrix.
    - Q (np.ndarray): Loaded item matrix.
    - b_u (np.ndarray): Loaded user biases.
    - b_i (np.ndarray): Loaded item biases.
    - mu (float): Loaded global bias.
    """
    P = np.load(P_path)
    Q = np.load(Q_path)
    b_u = np.load(b_u_path)
    b_i = np.load(b_i_path)
    mu = np.load(mu_path)[0]  # Extract the single value for mu
    print(f"Loaded P matrix from {P_path} with shape {P.shape}")
    print(f"Loaded Q matrix from {Q_path} with shape {Q.shape}")
    print(f"Loaded user biases from {b_u_path} with shape {b_u.shape}")
    print(f"Loaded item biases from {b_i_path} with shape {b_i.shape}")
    print(f"Loaded global bias (mu) from {mu_path} with value {mu:.4f}")
    return P, Q, b_u, b_i, mu

save_matrices(P, Q, b_u, b_i, mu, 
              'saved_weights/P_matrix.npy', 'saved_weights/Q_matrix.npy', 
              'saved_weights/b_u.npy', 'saved_weights/b_i.npy', 'saved_weights/mu.npy')

Saved P matrix to saved_weights/P_matrix.npy
Saved Q matrix to saved_weights/Q_matrix.npy
Saved user biases to saved_weights/b_u.npy
Saved item biases to saved_weights/b_i.npy
Saved global bias to saved_weights/mu.npy
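
As an alternative sketch, all parameters can be stored in a single archive with `np.savez`, which keeps them together and avoids managing five separate paths. `save_model` and `load_model` are hypothetical helpers, not part of the notebook, and the demo uses a temporary directory with made-up arrays. Creating the target directory first matters because NumPy does not create missing directories.

```python
import os
import tempfile

import numpy as np

def save_model(path, P, Q, b_u, b_i, mu):
    """Save all model parameters into one .npz archive."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)   # np.savez does not create directories
    np.savez(path, P=P, Q=Q, b_u=b_u, b_i=b_i, mu=mu)

def load_model(path):
    """Load the parameters saved by save_model."""
    with np.load(path) as data:
        return data["P"], data["Q"], data["b_u"], data["b_i"], float(data["mu"])

# Round-trip demo with made-up arrays
P_demo = np.arange(6, dtype=float).reshape(3, 2)
Q_demo = np.ones((4, 2))
b_u_demo, b_i_demo, mu_demo = np.zeros(3), np.zeros(4), 3.5

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "saved_weights", "model.npz")
    save_model(model_path, P_demo, Q_demo, b_u_demo, b_i_demo, mu_demo)
    P_loaded, Q_loaded, b_u_loaded, b_i_loaded, mu_loaded = load_model(model_path)

print(P_loaded.shape, Q_loaded.shape, mu_loaded)
```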


## Matrix Factorization Training Loop Example (With Content Blending)

#### Initial Setup
We use small matrices for simplicity:

1. **User Latent Matrix ($ P $)**:
   $$
   P =
   \begin{bmatrix}
   0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.1 & 0.2 \\
   0.2 & 0.3 & 0.1 & 0.5 & 0.4 & 0.3 & 0.1 \\
   0.3 & 0.1 & 0.2 & 0.4 & 0.3 & 0.4 & 0.3
   \end{bmatrix}
   $$
   Shape: $ (3, 7) $ (3 users, $ 5 $ collaborative dimensions + $ 2 $ content dimensions).

2. **Collaborative Item Latent Matrix ($ Q $)**:
   $$
   Q =
   \begin{bmatrix}
   0.4 & 0.3 & 0.5 & 0.2 & 0.1 \\
   0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\
   0.5 & 0.4 & 0.3 & 0.2 & 0.1
   \end{bmatrix}
   $$
   Shape: $ (3, 5) $ (3 items, $ 5 $ collaborative dimensions).

3. **Content Features Matrix ($ \text{Content} $)**:
   $$
   \text{Content} =
   \begin{bmatrix}
   0.6 & 0.8 \\
   0.7 & 0.9 \\
   0.5 & 0.4
   \end{bmatrix}
   $$
   Shape: $ (3, 2) $ (3 items, $ 2 $ content-based dimensions).

4. **Biases**:
   - Global bias ($ \mu $): $ 3.0 $
   - User biases ($ b_u $): $ [0.1, 0.2, 0.3] $
   - Item biases ($ b_i $): $ [0.2, 0.3, 0.4] $

5. **Known Interaction**:
   - User $ u = 0 $, Item $ i = 1 $, Rating $ R_{ui} = 4.0 $.

6. **Learning Parameters**:
   - Learning rate ($ \eta $): $ 0.01 $
   - Regularization ($ \lambda $): $ 0.1 $
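
The learning rate $ \eta $ and regularization strength $ \lambda $ enter through the update rules used in the step-by-step calculation. Those rules are consistent with stochastic gradient descent on a regularized squared-error loss for a single rating. As a sketch (the factor $ \tfrac{1}{2} $ and the choice to regularize all four parameter groups are assumptions, picked so that the gradients match the update rules exactly):

$$
L_{ui} = \tfrac{1}{2} E_{ui}^2 + \tfrac{\lambda}{2} \left( b_u[u]^2 + b_i[i]^2 + \lVert P[u, :] \rVert^2 + \lVert Q[i, :] \rVert^2 \right),
\qquad E_{ui} = R_{ui} - \hat{R}_{ui}
$$

Differentiating with respect to the user bias and stepping against the gradient gives:

$$
\frac{\partial L_{ui}}{\partial b_u[u]} = -E_{ui} + \lambda \cdot b_u[u]
\quad \Rightarrow \quad
b_u[u] \leftarrow b_u[u] - \eta \cdot \frac{\partial L_{ui}}{\partial b_u[u]} = b_u[u] + \eta \cdot (E_{ui} - \lambda \cdot b_u[u])
$$

The same argument applied to the user vector gives:

$$
\frac{\partial L_{ui}}{\partial P[u, :]} = -E_{ui} \cdot \text{Augmented } Q[i, :] + \lambda \cdot P[u, :]
\quad \Rightarrow \quad
P[u, :] \leftarrow P[u, :] + \eta \cdot (E_{ui} \cdot \text{Augmented } Q[i, :] - \lambda \cdot P[u, :])
$$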

---

#### Step-by-Step Calculation

1. **Augment Item Latent Vector ($ Q[i, :] $)**:
   - $ Q[i, :] $: Collaborative latent vector for item $ i = 1 $:
     $$
     Q[1, :] = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix}
     $$
   - $ \text{Content}[i, :] $: Content-based vector for item $ i = 1 $:
     $$
     \text{Content}[1, :] = \begin{bmatrix} 0.7 & 0.9 \end{bmatrix}
     $$
   - Augmented $ Q[i, :] $:
     $$
     \text{Augmented } Q[i, :] = \begin{bmatrix} Q[i, :], \text{Content}[i, :] \end{bmatrix}
     = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.7 & 0.9 \end{bmatrix}
     $$

2. **Compute Prediction ($ \hat{R}_{ui} $)**:
   $$
   \hat{R}_{ui} = \mu + b_u[u] + b_i[i] + P[u, :] \cdot \text{Augmented } Q[i, :]^T
   $$
   Substituting values:
   $$
   \hat{R}_{ui} = 3.0 + 0.1 + 0.3 +
   \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.1 & 0.2 \end{bmatrix} \cdot
   \begin{bmatrix} 0.1 \\ 0.2 \\ 0.3 \\ 0.4 \\ 0.5 \\ 0.7 \\ 0.9 \end{bmatrix}
   $$
   Dot product:
   $$
   \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.1 & 0.2 \end{bmatrix} \cdot
   \begin{bmatrix} 0.1 \\ 0.2 \\ 0.3 \\ 0.4 \\ 0.5 \\ 0.7 \\ 0.9 \end{bmatrix}
   = (0.1 \cdot 0.1) + (0.2 \cdot 0.2) + (0.3 \cdot 0.3) + (0.4 \cdot 0.4) +
   (0.5 \cdot 0.5) + (0.1 \cdot 0.7) + (0.2 \cdot 0.9)
   $$
   $$
   = 0.01 + 0.04 + 0.09 + 0.16 + 0.25 + 0.07 + 0.18 = 0.80
   $$
   Prediction:
   $$
   \hat{R}_{ui} = 3.0 + 0.1 + 0.3 + 0.80 = 4.20
   $$

3. **Compute Error ($ E_{ui} $)**:
   $$
   E_{ui} = R_{ui} - \hat{R}_{ui} = 4.0 - 4.20 = -0.20
   $$

4. **Update Biases**:
   $$
   b_u[u] \leftarrow b_u[u] + \eta \cdot (E_{ui} - \lambda \cdot b_u[u])
   $$
   Substituting values:
   $$
   b_u[0] \leftarrow 0.1 + 0.01 \cdot (-0.20 - 0.1 \cdot 0.1) = 0.1 + 0.01 \cdot (-0.20 - 0.01) = 0.1 - 0.0021 = 0.0979
   $$
   Similarly, for $ b_i[i] $:
   $$
   b_i[1] \leftarrow 0.3 + 0.01 \cdot (-0.20 - 0.1 \cdot 0.3) = 0.3 + 0.01 \cdot (-0.20 - 0.03) = 0.3 - 0.0023 = 0.2977
   $$

5. **Update Latent Matrices ($ P $ and $ Q $)**:
   $$
   P[u, :] \leftarrow P[u, :] + \eta \cdot (E_{ui} \cdot \text{Augmented } Q[i, :] - \lambda \cdot P[u, :])
   $$
   Substituting values:
   $$
   P[0, :] \leftarrow \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.1 & 0.2 \end{bmatrix} +
   0.01 \cdot (-0.20 \cdot \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.7 & 0.9 \end{bmatrix} -
   0.1 \cdot \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.1 & 0.2 \end{bmatrix})
   $$

   Compute intermediate updates:
   - $ E_{ui} \cdot \text{Augmented } Q[i, :] = -0.20 \cdot \text{Augmented } Q[i, :] $:
     $$
     \begin{bmatrix} -0.020 & -0.040 & -0.060 & -0.080 & -0.100 & -0.140 & -0.180 \end{bmatrix}
     $$
   - $ -\lambda \cdot P[u, :] = -0.1 \cdot P[u, :] $:
     $$
     \begin{bmatrix} -0.01 & -0.02 & -0.03 & -0.04 & -0.05 & -0.01 & -0.02 \end{bmatrix}
     $$
   - Combined:
     $$
     \begin{bmatrix} -0.030 & -0.060 & -0.090 & -0.120 & -0.150 & -0.150 & -0.200 \end{bmatrix}
     $$
   - Scale by $ \eta = 0.01 $:
     $$
     \begin{bmatrix} -0.0003 & -0.0006 & -0.0009 & -0.0012 & -0.0015 & -0.0015 & -0.0020 \end{bmatrix}
     $$
   - Update $ P[u, :] $:
     $$
     \begin{bmatrix} 0.0997 & 0.1994 & 0.2991 & 0.3988 & 0.4985 & 0.0985 & 0.1980 \end{bmatrix}
     $$

   Similarly, update $ Q[i, :] $ using only the collaborative part (the first $ 5 $ components of $ P[u, :] $).

---

#### Summary of Updates
After one update step:
- Updated $ P[0, :] $: $ \begin{bmatrix} 0.0997 & 0.1994 & 0.2991 & 0.3988 & 0.4985 & 0.0985 & 0.1980 \end{bmatrix} $
- Updated $ Q[1, :] $: Collaborative part updated similarly.
- Updated $ b_u[0] $: $ 0.0979 $
- Updated $ b_i[1] $: $ 0.2977 $

This process repeats for all observed user-item interactions.
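
The single update step above can be reproduced in NumPy, which makes it easy to check the hand arithmetic. This is a minimal sketch rather than the notebook's training function: `sgd_step` is a hypothetical helper, and updating `Q[i]` from the pre-update user vector is an assumption, since the walkthrough does not spell out the order.

```python
import numpy as np

# Toy values copied from the Initial Setup section.
P = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.1, 0.2],
              [0.2, 0.3, 0.1, 0.5, 0.4, 0.3, 0.1],
              [0.3, 0.1, 0.2, 0.4, 0.3, 0.4, 0.3]])
Q = np.array([[0.4, 0.3, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.3, 0.4, 0.5],
              [0.5, 0.4, 0.3, 0.2, 0.1]])
content = np.array([[0.6, 0.8],
                    [0.7, 0.9],
                    [0.5, 0.4]])
mu = 3.0
b_u = np.array([0.1, 0.2, 0.3])
b_i = np.array([0.2, 0.3, 0.4])
eta, lam = 0.01, 0.1


def sgd_step(u, i, rating):
    """Hypothetical helper: one in-place update for a single observed rating."""
    k = Q.shape[1]  # number of collaborative dimensions
    # Step 1: augment the item vector with its fixed content features.
    q_aug = np.concatenate([Q[i], content[i]])
    # Steps 2-3: prediction and error.
    pred = mu + b_u[u] + b_i[i] + P[u] @ q_aug
    err = rating - pred
    # Keep the old user vector for the Q update (assumed ordering).
    p_old = P[u].copy()
    # Step 4: bias updates.
    b_u[u] += eta * (err - lam * b_u[u])
    b_i[i] += eta * (err - lam * b_i[i])
    # Step 5: latent updates; only the collaborative part of Q is learned.
    P[u] += eta * (err * q_aug - lam * P[u])
    Q[i] += eta * (err * p_old[:k] - lam * Q[i])
    return pred, err


pred, err = sgd_step(u=0, i=1, rating=4.0)
print(f"prediction={pred:.4f}, error={err:.4f}")
print("P[0] =", np.round(P[0], 5))
print("Q[1] =", np.round(Q[1], 5))
print(f"b_u[0]={b_u[0]:.4f}, b_i[1]={b_i[1]:.4f}")
```

The content columns are concatenated at prediction time but never written back, so the content features stay fixed while `P`, `Q` and the biases are learned.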


## Optuna search for hyperparameters
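
The objective in this section returns the last recorded validation loss of each trial, and training uses early stopping with a fixed patience of 5. The sketch below replays the rounded validation losses logged for Trial 2 to show how the reported value can differ from the best epoch. `run_with_patience` is a hypothetical helper, and its stopping rule (stop once `patience` epochs have passed since the best loss) is an assumption inferred from the logged runs, not taken from the training function.

```python
def run_with_patience(val_losses, patience):
    """Hypothetical helper: replay per-epoch validation losses and stop once
    `patience` epochs have passed without a new best loss."""
    best_loss = float("inf")
    best_epoch = 0
    seen = []
    for epoch, loss in enumerate(val_losses, start=1):
        seen.append(loss)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # early stopping triggered
    return {
        "stopped_at": len(seen),
        "last": seen[-1],
        "best": best_loss,
        "best_epoch": best_epoch,
    }


# Rounded validation MSE values from the Trial 2 log in this section.
trial_2 = [0.8185, 0.8181, 0.8188, 0.8118, 0.8149,
           0.8245, 0.8213, 0.8334, 0.8356]
summary = run_with_patience(trial_2, patience=5)
print(summary)
```

Under this rule the value handed to Optuna is the loss at the stopping epoch, so a trial that overfits quickly is scored by its degraded final loss rather than by its best epoch. Returning `min(val_losses)` instead would be one alternative design, at the cost of changing what the study optimizes.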

In [70]:
import optuna

def objective(trial):
    """
    Objective function for Optuna hyperparameter optimization.

    Parameters:
        trial (optuna.Trial): A trial object for hyperparameter suggestions.

    Returns:
        float: Validation loss from the final completed epoch of this trial.
    """
    # Suggest hyperparameters
    latent_dim = trial.suggest_int("latent_dim", 5, 50)  # Latent dimensions
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)  # Learning rate
    regularization = trial.suggest_float("regularization", 1e-5, 1e-1, log=True)  # Regularization parameter
    patience = 5  # Fixed patience for early stopping
    epochs = 25  # Fixed number of epochs per trial

    # Train the model with the suggested hyperparameters
    _, _, _, val_losses = matrix_factorization_with_content(
        train_sparse=train_sparse,
        val_sparse=val_sparse,
        content_matrix=content_matrix,
        num_users=len(user_mapping),
        num_items=len(movie_mapping),
        latent_dim=latent_dim,
        epochs=epochs,
        learning_rate=learning_rate,
        regularization=regularization,
        patience=patience,
    )

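    # Return the last recorded validation loss as the objective value.
    # Note: when early stopping triggers, this is the loss at the stopping epoch,
    # which can be higher than the best loss seen during the trial.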
    return val_losses[-1]


# Create Optuna study
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)  # Run for 30 trials

# Print the best hyperparameters
print("Best hyperparameters:")
print(study.best_params)

# Best validation loss
print(f"Best validation loss: {study.best_value:.4f}")

# Best trial details
print(f"Best trial: {study.best_trial}")


[I 2024-12-16 19:30:44,579] A new study created in memory with name: no-name-2b3d754b-7b0c-47e9-be22-a04ebdd4b919


Epoch 1/25, Train MSE: 0.9331, Val MSE: 0.9627
Epoch 2/25, Train MSE: 0.8806, Val MSE: 0.9227
Epoch 3/25, Train MSE: 0.8491, Val MSE: 0.8998
Epoch 4/25, Train MSE: 0.8270, Val MSE: 0.8850
Epoch 5/25, Train MSE: 0.8100, Val MSE: 0.8741
Epoch 6/25, Train MSE: 0.7963, Val MSE: 0.8659
Epoch 7/25, Train MSE: 0.7847, Val MSE: 0.8587
Epoch 8/25, Train MSE: 0.7746, Val MSE: 0.8531
Epoch 9/25, Train MSE: 0.7657, Val MSE: 0.8486
Epoch 10/25, Train MSE: 0.7577, Val MSE: 0.8445
Epoch 11/25, Train MSE: 0.7504, Val MSE: 0.8421
Epoch 12/25, Train MSE: 0.7437, Val MSE: 0.8377
Epoch 13/25, Train MSE: 0.7375, Val MSE: 0.8352
Epoch 14/25, Train MSE: 0.7317, Val MSE: 0.8328
Epoch 15/25, Train MSE: 0.7263, Val MSE: 0.8311
Epoch 16/25, Train MSE: 0.7212, Val MSE: 0.8290
Epoch 17/25, Train MSE: 0.7163, Val MSE: 0.8268
Epoch 18/25, Train MSE: 0.7117, Val MSE: 0.8259
Epoch 19/25, Train MSE: 0.7073, Val MSE: 0.8246
Epoch 20/25, Train MSE: 0.7031, Val MSE: 0.8234
Epoch 21/25, Train MSE: 0.6992, Val MSE: 0.8227

[I 2024-12-16 19:34:44,467] Trial 0 finished with value: 0.8182290129765261 and parameters: {'latent_dim': 11, 'learning_rate': 0.0015489638117980062, 'regularization': 2.624625377395399e-05}. Best is trial 0 with value: 0.8182290129765261.


Epoch 25/25, Train MSE: 0.6846, Val MSE: 0.8182
Epoch 1/25, Train MSE: 0.8596, Val MSE: 0.9066
Epoch 2/25, Train MSE: 0.8064, Val MSE: 0.8702
Epoch 3/25, Train MSE: 0.7759, Val MSE: 0.8527
Epoch 4/25, Train MSE: 0.7543, Val MSE: 0.8446
Epoch 5/25, Train MSE: 0.7369, Val MSE: 0.8375
Epoch 6/25, Train MSE: 0.7224, Val MSE: 0.8309
Epoch 7/25, Train MSE: 0.7100, Val MSE: 0.8253
Epoch 8/25, Train MSE: 0.6992, Val MSE: 0.8221
Epoch 9/25, Train MSE: 0.6895, Val MSE: 0.8185
Epoch 10/25, Train MSE: 0.6805, Val MSE: 0.8167
Epoch 11/25, Train MSE: 0.6723, Val MSE: 0.8151
Epoch 12/25, Train MSE: 0.6649, Val MSE: 0.8133
Epoch 13/25, Train MSE: 0.6577, Val MSE: 0.8144
Epoch 14/25, Train MSE: 0.6512, Val MSE: 0.8108
Epoch 15/25, Train MSE: 0.6447, Val MSE: 0.8106
Epoch 16/25, Train MSE: 0.6388, Val MSE: 0.8095
Epoch 17/25, Train MSE: 0.6331, Val MSE: 0.8102
Epoch 18/25, Train MSE: 0.6277, Val MSE: 0.8097
Epoch 19/25, Train MSE: 0.6225, Val MSE: 0.8075
Epoch 20/25, Train MSE: 0.6174, Val MSE: 0.8092

[I 2024-12-16 19:38:32,218] Trial 1 finished with value: 0.8085951979087731 and parameters: {'latent_dim': 9, 'learning_rate': 0.004075751165456095, 'regularization': 0.0007302273953221217}. Best is trial 1 with value: 0.8085951979087731.


Epoch 24/25, Train MSE: 0.5985, Val MSE: 0.8086
Early stopping triggered.
Epoch 1/25, Train MSE: 0.7022, Val MSE: 0.8185
Epoch 2/25, Train MSE: 0.6396, Val MSE: 0.8181
Epoch 3/25, Train MSE: 0.5987, Val MSE: 0.8188
Epoch 4/25, Train MSE: 0.5676, Val MSE: 0.8118
Epoch 5/25, Train MSE: 0.5211, Val MSE: 0.8149
Epoch 6/25, Train MSE: 0.4707, Val MSE: 0.8245
Epoch 7/25, Train MSE: 0.4166, Val MSE: 0.8213
Epoch 8/25, Train MSE: 0.3726, Val MSE: 0.8334


[I 2024-12-16 19:39:56,095] Trial 2 finished with value: 0.8356080977659864 and parameters: {'latent_dim': 6, 'learning_rate': 0.03560349756849463, 'regularization': 0.0002543069369501906}. Best is trial 1 with value: 0.8085951979087731.


Epoch 9/25, Train MSE: 0.3394, Val MSE: 0.8356
Early stopping triggered.
Epoch 1/25, Train MSE: 0.8072, Val MSE: 0.8670
Epoch 2/25, Train MSE: 0.7552, Val MSE: 0.8469
Epoch 3/25, Train MSE: 0.7235, Val MSE: 0.8339
Epoch 4/25, Train MSE: 0.6998, Val MSE: 0.8204
Epoch 5/25, Train MSE: 0.6812, Val MSE: 0.8167
Epoch 6/25, Train MSE: 0.6654, Val MSE: 0.8115
Epoch 7/25, Train MSE: 0.6514, Val MSE: 0.8143
Epoch 8/25, Train MSE: 0.6384, Val MSE: 0.8087
Epoch 9/25, Train MSE: 0.6269, Val MSE: 0.8099
Epoch 10/25, Train MSE: 0.6161, Val MSE: 0.8118
Epoch 11/25, Train MSE: 0.6055, Val MSE: 0.8084
Epoch 12/25, Train MSE: 0.5946, Val MSE: 0.8092
Epoch 13/25, Train MSE: 0.5839, Val MSE: 0.8076
Epoch 14/25, Train MSE: 0.5721, Val MSE: 0.8097
Epoch 15/25, Train MSE: 0.5587, Val MSE: 0.8068
Epoch 16/25, Train MSE: 0.5446, Val MSE: 0.8057
Epoch 17/25, Train MSE: 0.5296, Val MSE: 0.8071
Epoch 18/25, Train MSE: 0.5137, Val MSE: 0.8038
Epoch 19/25, Train MSE: 0.4975, Val MSE: 0.8027
Epoch 20/25, Train MSE: 

[I 2024-12-16 19:43:53,318] Trial 3 finished with value: 0.8027304317478758 and parameters: {'latent_dim': 28, 'learning_rate': 0.008089574860460297, 'regularization': 1.6073221934120177e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.3897, Val MSE: 0.8027
Epoch 1/25, Train MSE: 0.8905, Val MSE: 0.9322
Epoch 2/25, Train MSE: 0.8380, Val MSE: 0.8942
Epoch 3/25, Train MSE: 0.8081, Val MSE: 0.8747
Epoch 4/25, Train MSE: 0.7874, Val MSE: 0.8624
Epoch 5/25, Train MSE: 0.7713, Val MSE: 0.8521
Epoch 6/25, Train MSE: 0.7583, Val MSE: 0.8454
Epoch 7/25, Train MSE: 0.7473, Val MSE: 0.8396
Epoch 8/25, Train MSE: 0.7379, Val MSE: 0.8360
Epoch 9/25, Train MSE: 0.7297, Val MSE: 0.8313
Epoch 10/25, Train MSE: 0.7225, Val MSE: 0.8285
Epoch 11/25, Train MSE: 0.7159, Val MSE: 0.8252
Epoch 12/25, Train MSE: 0.7098, Val MSE: 0.8237
Epoch 13/25, Train MSE: 0.7046, Val MSE: 0.8214
Epoch 14/25, Train MSE: 0.6994, Val MSE: 0.8198
Epoch 15/25, Train MSE: 0.6947, Val MSE: 0.8173
Epoch 16/25, Train MSE: 0.6903, Val MSE: 0.8153
Epoch 17/25, Train MSE: 0.6863, Val MSE: 0.8159
Epoch 18/25, Train MSE: 0.6823, Val MSE: 0.8127
Epoch 19/25, Train MSE: 0.6788, Val MSE: 0.8121
Epoch 20/25, Train MSE: 0.6754, Val MSE: 0.8125

[I 2024-12-16 19:47:50,089] Trial 4 finished with value: 0.8058520326851342 and parameters: {'latent_dim': 48, 'learning_rate': 0.0028197316916477935, 'regularization': 0.089236012824263}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.6608, Val MSE: 0.8059
Epoch 1/25, Train MSE: 1.0311, Val MSE: 1.0507
Epoch 2/25, Train MSE: 1.0011, Val MSE: 1.0219
Epoch 3/25, Train MSE: 0.9785, Val MSE: 1.0013
Epoch 4/25, Train MSE: 0.9604, Val MSE: 0.9856
Epoch 5/25, Train MSE: 0.9453, Val MSE: 0.9730
Epoch 6/25, Train MSE: 0.9324, Val MSE: 0.9624
Epoch 7/25, Train MSE: 0.9211, Val MSE: 0.9535
Epoch 8/25, Train MSE: 0.9111, Val MSE: 0.9458
Epoch 9/25, Train MSE: 0.9022, Val MSE: 0.9390
Epoch 10/25, Train MSE: 0.8941, Val MSE: 0.9329
Epoch 11/25, Train MSE: 0.8868, Val MSE: 0.9274
Epoch 12/25, Train MSE: 0.8800, Val MSE: 0.9224
Epoch 13/25, Train MSE: 0.8738, Val MSE: 0.9179
Epoch 14/25, Train MSE: 0.8681, Val MSE: 0.9136
Epoch 15/25, Train MSE: 0.8627, Val MSE: 0.9097
Epoch 16/25, Train MSE: 0.8577, Val MSE: 0.9062
Epoch 17/25, Train MSE: 0.8530, Val MSE: 0.9028
Epoch 18/25, Train MSE: 0.8485, Val MSE: 0.8997
Epoch 19/25, Train MSE: 0.8444, Val MSE: 0.8968
Epoch 20/25, Train MSE: 0.8404, Val MSE: 0.8940

[I 2024-12-16 19:51:46,576] Trial 5 finished with value: 0.882467561898411 and parameters: {'latent_dim': 44, 'learning_rate': 0.0002604460905842133, 'regularization': 0.00019856119275792037}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.8233, Val MSE: 0.8825
Epoch 1/25, Train MSE: 0.8050, Val MSE: 0.8709
Epoch 2/25, Train MSE: 0.7518, Val MSE: 0.8438
Epoch 3/25, Train MSE: 0.7205, Val MSE: 0.8278
Epoch 4/25, Train MSE: 0.6966, Val MSE: 0.8189
Epoch 5/25, Train MSE: 0.6777, Val MSE: 0.8175
Epoch 6/25, Train MSE: 0.6616, Val MSE: 0.8176
Epoch 7/25, Train MSE: 0.6473, Val MSE: 0.8080
Epoch 8/25, Train MSE: 0.6351, Val MSE: 0.8116
Epoch 9/25, Train MSE: 0.6236, Val MSE: 0.8096
Epoch 10/25, Train MSE: 0.6119, Val MSE: 0.8111
Epoch 11/25, Train MSE: 0.6009, Val MSE: 0.8092
Epoch 12/25, Train MSE: 0.5897, Val MSE: 0.8072
Epoch 13/25, Train MSE: 0.5780, Val MSE: 0.8108
Epoch 14/25, Train MSE: 0.5653, Val MSE: 0.8086
Epoch 15/25, Train MSE: 0.5511, Val MSE: 0.8017
Epoch 16/25, Train MSE: 0.5355, Val MSE: 0.8040
Epoch 17/25, Train MSE: 0.5185, Val MSE: 0.8040
Epoch 18/25, Train MSE: 0.5012, Val MSE: 0.8063
Epoch 19/25, Train MSE: 0.4833, Val MSE: 0.8049


[I 2024-12-16 19:54:54,645] Trial 6 finished with value: 0.803030232453534 and parameters: {'latent_dim': 29, 'learning_rate': 0.008425347072925658, 'regularization': 1.1149060035905202e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 20/25, Train MSE: 0.4650, Val MSE: 0.8030
Early stopping triggered.
Epoch 1/25, Train MSE: 0.6598, Val MSE: 0.8629
Epoch 2/25, Train MSE: 0.5793, Val MSE: 0.8456
Epoch 3/25, Train MSE: 0.4497, Val MSE: 0.8324
Epoch 4/25, Train MSE: 0.2918, Val MSE: 0.8278
Epoch 5/25, Train MSE: 0.1848, Val MSE: 0.8317
Epoch 6/25, Train MSE: 0.1204, Val MSE: 0.8482
Epoch 7/25, Train MSE: 0.0851, Val MSE: 0.8548
Epoch 8/25, Train MSE: 0.0650, Val MSE: 0.8697


[I 2024-12-16 19:56:20,915] Trial 7 finished with value: 0.8792038208009065 and parameters: {'latent_dim': 38, 'learning_rate': 0.08184702253869122, 'regularization': 0.009077600049022558}. Best is trial 3 with value: 0.8027304317478758.


Epoch 9/25, Train MSE: 0.0512, Val MSE: 0.8792
Early stopping triggered.
Epoch 1/25, Train MSE: 0.9823, Val MSE: 1.0041
Epoch 2/25, Train MSE: 0.9370, Val MSE: 0.9655
Epoch 3/25, Train MSE: 0.9071, Val MSE: 0.9420
Epoch 4/25, Train MSE: 0.8851, Val MSE: 0.9258
Epoch 5/25, Train MSE: 0.8678, Val MSE: 0.9129
Epoch 6/25, Train MSE: 0.8536, Val MSE: 0.9026
Epoch 7/25, Train MSE: 0.8417, Val MSE: 0.8947
Epoch 8/25, Train MSE: 0.8314, Val MSE: 0.8877
Epoch 9/25, Train MSE: 0.8224, Val MSE: 0.8816
Epoch 10/25, Train MSE: 0.8144, Val MSE: 0.8766
Epoch 11/25, Train MSE: 0.8072, Val MSE: 0.8716
Epoch 12/25, Train MSE: 0.8007, Val MSE: 0.8678
Epoch 13/25, Train MSE: 0.7947, Val MSE: 0.8643
Epoch 14/25, Train MSE: 0.7891, Val MSE: 0.8609
Epoch 15/25, Train MSE: 0.7839, Val MSE: 0.8580
Epoch 16/25, Train MSE: 0.7791, Val MSE: 0.8551
Epoch 17/25, Train MSE: 0.7745, Val MSE: 0.8525
Epoch 18/25, Train MSE: 0.7702, Val MSE: 0.8502
Epoch 19/25, Train MSE: 0.7661, Val MSE: 0.8483
Epoch 20/25, Train MSE: 

[I 2024-12-16 20:00:17,911] Trial 8 finished with value: 0.8382263229985745 and parameters: {'latent_dim': 8, 'learning_rate': 0.0007346183927711892, 'regularization': 0.003853731670468113}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.7452, Val MSE: 0.8382
Epoch 1/25, Train MSE: 0.9927, Val MSE: 1.0142
Epoch 2/25, Train MSE: 0.9500, Val MSE: 0.9769
Epoch 3/25, Train MSE: 0.9210, Val MSE: 0.9537
Epoch 4/25, Train MSE: 0.8993, Val MSE: 0.9370
Epoch 5/25, Train MSE: 0.8821, Val MSE: 0.9245
Epoch 6/25, Train MSE: 0.8679, Val MSE: 0.9139
Epoch 7/25, Train MSE: 0.8559, Val MSE: 0.9054
Epoch 8/25, Train MSE: 0.8456, Val MSE: 0.8982
Epoch 9/25, Train MSE: 0.8365, Val MSE: 0.8919
Epoch 10/25, Train MSE: 0.8284, Val MSE: 0.8862
Epoch 11/25, Train MSE: 0.8211, Val MSE: 0.8812
Epoch 12/25, Train MSE: 0.8145, Val MSE: 0.8770
Epoch 13/25, Train MSE: 0.8085, Val MSE: 0.8731
Epoch 14/25, Train MSE: 0.8029, Val MSE: 0.8698
Epoch 15/25, Train MSE: 0.7977, Val MSE: 0.8667
Epoch 16/25, Train MSE: 0.7928, Val MSE: 0.8638
Epoch 17/25, Train MSE: 0.7883, Val MSE: 0.8611
Epoch 18/25, Train MSE: 0.7840, Val MSE: 0.8585
Epoch 19/25, Train MSE: 0.7799, Val MSE: 0.8562
Epoch 20/25, Train MSE: 0.7760, Val MSE: 0.8542

[I 2024-12-16 20:04:15,116] Trial 9 finished with value: 0.845200437473686 and parameters: {'latent_dim': 11, 'learning_rate': 0.0006088071805980583, 'regularization': 1.3950149819846142e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.7591, Val MSE: 0.8452
Epoch 1/25, Train MSE: 0.7592, Val MSE: 0.8507
Epoch 2/25, Train MSE: 0.7039, Val MSE: 0.8286
Epoch 3/25, Train MSE: 0.6684, Val MSE: 0.8126
Epoch 4/25, Train MSE: 0.6436, Val MSE: 0.8103
Epoch 5/25, Train MSE: 0.6205, Val MSE: 0.8096
Epoch 6/25, Train MSE: 0.6012, Val MSE: 0.8068
Epoch 7/25, Train MSE: 0.5823, Val MSE: 0.8110
Epoch 8/25, Train MSE: 0.5598, Val MSE: 0.8063
Epoch 9/25, Train MSE: 0.5339, Val MSE: 0.8098
Epoch 10/25, Train MSE: 0.5046, Val MSE: 0.8068
Epoch 11/25, Train MSE: 0.4734, Val MSE: 0.8047
Epoch 12/25, Train MSE: 0.4399, Val MSE: 0.8069
Epoch 13/25, Train MSE: 0.4047, Val MSE: 0.8057
Epoch 14/25, Train MSE: 0.3698, Val MSE: 0.8036
Epoch 15/25, Train MSE: 0.3371, Val MSE: 0.8038
Epoch 16/25, Train MSE: 0.3068, Val MSE: 0.8067
Epoch 17/25, Train MSE: 0.2796, Val MSE: 0.8112
Epoch 18/25, Train MSE: 0.2555, Val MSE: 0.8140


[I 2024-12-16 20:07:13,876] Trial 10 finished with value: 0.8191905892597536 and parameters: {'latent_dim': 23, 'learning_rate': 0.015730564470881538, 'regularization': 7.782764776812118e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 19/25, Train MSE: 0.2344, Val MSE: 0.8192
Early stopping triggered.
Epoch 1/25, Train MSE: 0.7935, Val MSE: 0.8593
Epoch 2/25, Train MSE: 0.7406, Val MSE: 0.8341
Epoch 3/25, Train MSE: 0.7078, Val MSE: 0.8269
Epoch 4/25, Train MSE: 0.6840, Val MSE: 0.8204
Epoch 5/25, Train MSE: 0.6648, Val MSE: 0.8198
Epoch 6/25, Train MSE: 0.6481, Val MSE: 0.8136
Epoch 7/25, Train MSE: 0.6335, Val MSE: 0.8105
Epoch 8/25, Train MSE: 0.6208, Val MSE: 0.8139
Epoch 9/25, Train MSE: 0.6077, Val MSE: 0.8078
Epoch 10/25, Train MSE: 0.5955, Val MSE: 0.8059
Epoch 11/25, Train MSE: 0.5831, Val MSE: 0.8069
Epoch 12/25, Train MSE: 0.5696, Val MSE: 0.8092
Epoch 13/25, Train MSE: 0.5547, Val MSE: 0.8105
Epoch 14/25, Train MSE: 0.5367, Val MSE: 0.8082


[I 2024-12-16 20:09:32,591] Trial 11 finished with value: 0.810213931555411 and parameters: {'latent_dim': 28, 'learning_rate': 0.009786529863124538, 'regularization': 1.0338129895477671e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 15/25, Train MSE: 0.5170, Val MSE: 0.8102
Early stopping triggered.
Epoch 1/25, Train MSE: 0.7952, Val MSE: 0.8658
Epoch 2/25, Train MSE: 0.7428, Val MSE: 0.8407
Epoch 3/25, Train MSE: 0.7106, Val MSE: 0.8281
Epoch 4/25, Train MSE: 0.6861, Val MSE: 0.8238
Epoch 5/25, Train MSE: 0.6669, Val MSE: 0.8161
Epoch 6/25, Train MSE: 0.6509, Val MSE: 0.8135
Epoch 7/25, Train MSE: 0.6358, Val MSE: 0.8119
Epoch 8/25, Train MSE: 0.6225, Val MSE: 0.8091
Epoch 9/25, Train MSE: 0.6094, Val MSE: 0.8063
Epoch 10/25, Train MSE: 0.5971, Val MSE: 0.8031
Epoch 11/25, Train MSE: 0.5841, Val MSE: 0.8091
Epoch 12/25, Train MSE: 0.5699, Val MSE: 0.8044
Epoch 13/25, Train MSE: 0.5531, Val MSE: 0.8094
Epoch 14/25, Train MSE: 0.5360, Val MSE: 0.8068
Epoch 15/25, Train MSE: 0.5169, Val MSE: 0.8030
Epoch 16/25, Train MSE: 0.4975, Val MSE: 0.8034
Epoch 17/25, Train MSE: 0.4771, Val MSE: 0.8049
Epoch 18/25, Train MSE: 0.4562, Val MSE: 0.8039
Epoch 19/25, Train MSE: 0.4344, Val MSE: 0.8045


[I 2024-12-16 20:12:39,773] Trial 12 finished with value: 0.8042096430050315 and parameters: {'latent_dim': 29, 'learning_rate': 0.009522140875606162, 'regularization': 5.105172242413599e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 20/25, Train MSE: 0.4125, Val MSE: 0.8042
Early stopping triggered.
Epoch 1/25, Train MSE: 0.7250, Val MSE: 0.8400
Epoch 2/25, Train MSE: 0.6684, Val MSE: 0.8274
Epoch 3/25, Train MSE: 0.6288, Val MSE: 0.8196
Epoch 4/25, Train MSE: 0.5970, Val MSE: 0.8197
Epoch 5/25, Train MSE: 0.5650, Val MSE: 0.8192
Epoch 6/25, Train MSE: 0.5245, Val MSE: 0.8137
Epoch 7/25, Train MSE: 0.4772, Val MSE: 0.8022
Epoch 8/25, Train MSE: 0.4253, Val MSE: 0.8099
Epoch 9/25, Train MSE: 0.3722, Val MSE: 0.8035
Epoch 10/25, Train MSE: 0.3219, Val MSE: 0.8145
Epoch 11/25, Train MSE: 0.2792, Val MSE: 0.8196


[I 2024-12-16 20:14:31,856] Trial 13 finished with value: 0.8272396471859628 and parameters: {'latent_dim': 22, 'learning_rate': 0.02499567606991659, 'regularization': 9.865533504646962e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 12/25, Train MSE: 0.2436, Val MSE: 0.8272
Early stopping triggered.
Epoch 1/25, Train MSE: 0.8467, Val MSE: 0.9004
Epoch 2/25, Train MSE: 0.7944, Val MSE: 0.8617
Epoch 3/25, Train MSE: 0.7635, Val MSE: 0.8473
Epoch 4/25, Train MSE: 0.7416, Val MSE: 0.8401
Epoch 5/25, Train MSE: 0.7239, Val MSE: 0.8305
Epoch 6/25, Train MSE: 0.7094, Val MSE: 0.8248
Epoch 7/25, Train MSE: 0.6966, Val MSE: 0.8216
Epoch 8/25, Train MSE: 0.6854, Val MSE: 0.8205
Epoch 9/25, Train MSE: 0.6754, Val MSE: 0.8156
Epoch 10/25, Train MSE: 0.6662, Val MSE: 0.8143
Epoch 11/25, Train MSE: 0.6578, Val MSE: 0.8127
Epoch 12/25, Train MSE: 0.6499, Val MSE: 0.8117
Epoch 13/25, Train MSE: 0.6424, Val MSE: 0.8121
Epoch 14/25, Train MSE: 0.6352, Val MSE: 0.8063
Epoch 15/25, Train MSE: 0.6285, Val MSE: 0.8088
Epoch 16/25, Train MSE: 0.6219, Val MSE: 0.8107
Epoch 17/25, Train MSE: 0.6156, Val MSE: 0.8095
Epoch 18/25, Train MSE: 0.6094, Val MSE: 0.8092


[I 2024-12-16 20:17:30,396] Trial 14 finished with value: 0.8064851650754148 and parameters: {'latent_dim': 35, 'learning_rate': 0.004785642879822109, 'regularization': 0.0009269262848247135}. Best is trial 3 with value: 0.8027304317478758.


Epoch 19/25, Train MSE: 0.6031, Val MSE: 0.8065
Early stopping triggered.
Epoch 1/25, Train MSE: 1.0534, Val MSE: 1.0733
Epoch 2/25, Train MSE: 1.0354, Val MSE: 1.0553
Epoch 3/25, Train MSE: 1.0202, Val MSE: 1.0404
Epoch 4/25, Train MSE: 1.0071, Val MSE: 1.0279
Epoch 5/25, Train MSE: 0.9957, Val MSE: 1.0172
Epoch 6/25, Train MSE: 0.9855, Val MSE: 1.0078
Epoch 7/25, Train MSE: 0.9763, Val MSE: 0.9996
Epoch 8/25, Train MSE: 0.9679, Val MSE: 0.9923
Epoch 9/25, Train MSE: 0.9603, Val MSE: 0.9857
Epoch 10/25, Train MSE: 0.9532, Val MSE: 0.9798
Epoch 11/25, Train MSE: 0.9467, Val MSE: 0.9744
Epoch 12/25, Train MSE: 0.9406, Val MSE: 0.9694
Epoch 13/25, Train MSE: 0.9349, Val MSE: 0.9647
Epoch 14/25, Train MSE: 0.9295, Val MSE: 0.9605
Epoch 15/25, Train MSE: 0.9245, Val MSE: 0.9565
Epoch 16/25, Train MSE: 0.9197, Val MSE: 0.9527
Epoch 17/25, Train MSE: 0.9152, Val MSE: 0.9492
Epoch 18/25, Train MSE: 0.9109, Val MSE: 0.9459
Epoch 19/25, Train MSE: 0.9068, Val MSE: 0.9427
Epoch 20/25, Train MSE:

[I 2024-12-16 20:21:25,210] Trial 15 finished with value: 0.9268850476793126 and parameters: {'latent_dim': 19, 'learning_rate': 0.0001165723794724678, 'regularization': 3.494141225407038e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.8857, Val MSE: 0.9269
Epoch 1/25, Train MSE: 0.6692, Val MSE: 0.8365
Epoch 2/25, Train MSE: 0.5684, Val MSE: 0.8393
Epoch 3/25, Train MSE: 0.4337, Val MSE: 0.8296
Epoch 4/25, Train MSE: 0.2878, Val MSE: 0.8198
Epoch 5/25, Train MSE: 0.1838, Val MSE: 0.8317
Epoch 6/25, Train MSE: 0.1232, Val MSE: 0.8523
Epoch 7/25, Train MSE: 0.0853, Val MSE: 0.8717
Epoch 8/25, Train MSE: 0.0622, Val MSE: 0.8890


[I 2024-12-16 20:22:49,521] Trial 16 finished with value: 0.9109075853853101 and parameters: {'latent_dim': 34, 'learning_rate': 0.07118383144680686, 'regularization': 1.0518529155194852e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 9/25, Train MSE: 0.0464, Val MSE: 0.9109
Early stopping triggered.
Epoch 1/25, Train MSE: 0.8122, Val MSE: 0.8771
Epoch 2/25, Train MSE: 0.7599, Val MSE: 0.8476
Epoch 3/25, Train MSE: 0.7286, Val MSE: 0.8303
Epoch 4/25, Train MSE: 0.7054, Val MSE: 0.8230
Epoch 5/25, Train MSE: 0.6869, Val MSE: 0.8220
Epoch 6/25, Train MSE: 0.6711, Val MSE: 0.8177
Epoch 7/25, Train MSE: 0.6580, Val MSE: 0.8138
Epoch 8/25, Train MSE: 0.6457, Val MSE: 0.8136
Epoch 9/25, Train MSE: 0.6348, Val MSE: 0.8096
Epoch 10/25, Train MSE: 0.6246, Val MSE: 0.8103
Epoch 11/25, Train MSE: 0.6150, Val MSE: 0.8111
Epoch 12/25, Train MSE: 0.6055, Val MSE: 0.8092
Epoch 13/25, Train MSE: 0.5966, Val MSE: 0.8117
Epoch 14/25, Train MSE: 0.5871, Val MSE: 0.8082
Epoch 15/25, Train MSE: 0.5776, Val MSE: 0.8095
Epoch 16/25, Train MSE: 0.5674, Val MSE: 0.8049
Epoch 17/25, Train MSE: 0.5563, Val MSE: 0.8084
Epoch 18/25, Train MSE: 0.5443, Val MSE: 0.8054
Epoch 19/25, Train MSE: 0.5315, Val MSE: 0.8092
Epoch 20/25, Train MSE: 

[I 2024-12-16 20:26:06,047] Trial 17 finished with value: 0.8061654499922987 and parameters: {'latent_dim': 17, 'learning_rate': 0.007550015391444936, 'regularization': 0.00028463727782866786}. Best is trial 3 with value: 0.8027304317478758.


Epoch 21/25, Train MSE: 0.5043, Val MSE: 0.8062
Early stopping triggered.
Epoch 1/25, Train MSE: 0.9071, Val MSE: 0.9432
Epoch 2/25, Train MSE: 0.8538, Val MSE: 0.9050
Epoch 3/25, Train MSE: 0.8225, Val MSE: 0.8831
Epoch 4/25, Train MSE: 0.8008, Val MSE: 0.8675
Epoch 5/25, Train MSE: 0.7841, Val MSE: 0.8587
Epoch 6/25, Train MSE: 0.7702, Val MSE: 0.8508
Epoch 7/25, Train MSE: 0.7587, Val MSE: 0.8444
Epoch 8/25, Train MSE: 0.7484, Val MSE: 0.8408
Epoch 9/25, Train MSE: 0.7392, Val MSE: 0.8357
Epoch 10/25, Train MSE: 0.7310, Val MSE: 0.8346
Epoch 11/25, Train MSE: 0.7235, Val MSE: 0.8295
Epoch 12/25, Train MSE: 0.7166, Val MSE: 0.8270
Epoch 13/25, Train MSE: 0.7102, Val MSE: 0.8256
Epoch 14/25, Train MSE: 0.7042, Val MSE: 0.8244
Epoch 15/25, Train MSE: 0.6986, Val MSE: 0.8216
Epoch 16/25, Train MSE: 0.6932, Val MSE: 0.8196
Epoch 17/25, Train MSE: 0.6881, Val MSE: 0.8189
Epoch 18/25, Train MSE: 0.6834, Val MSE: 0.8180
Epoch 19/25, Train MSE: 0.6787, Val MSE: 0.8163
Epoch 20/25, Train MSE:

[I 2024-12-16 20:29:59,082] Trial 18 finished with value: 0.8103232219348379 and parameters: {'latent_dim': 40, 'learning_rate': 0.0021940670010980987, 'regularization': 0.0034063238651260964}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.6548, Val MSE: 0.8103
Epoch 1/25, Train MSE: 0.7122, Val MSE: 0.8382
Epoch 2/25, Train MSE: 0.6555, Val MSE: 0.8118
Epoch 3/25, Train MSE: 0.6210, Val MSE: 0.8171
Epoch 4/25, Train MSE: 0.5936, Val MSE: 0.8099
Epoch 5/25, Train MSE: 0.5657, Val MSE: 0.8045
Epoch 6/25, Train MSE: 0.5311, Val MSE: 0.7996
Epoch 7/25, Train MSE: 0.4858, Val MSE: 0.7948
Epoch 8/25, Train MSE: 0.4298, Val MSE: 0.7972
Epoch 9/25, Train MSE: 0.3732, Val MSE: 0.7945
Epoch 10/25, Train MSE: 0.3197, Val MSE: 0.7995
Epoch 11/25, Train MSE: 0.2721, Val MSE: 0.8001
Epoch 12/25, Train MSE: 0.2336, Val MSE: 0.8000
Epoch 13/25, Train MSE: 0.2023, Val MSE: 0.8050


[I 2024-12-16 20:32:12,024] Trial 19 finished with value: 0.8137382528465514 and parameters: {'latent_dim': 31, 'learning_rate': 0.032795916950287135, 'regularization': 0.01987612327608112}. Best is trial 3 with value: 0.8027304317478758.


Epoch 14/25, Train MSE: 0.1759, Val MSE: 0.8137
Early stopping triggered.
Epoch 1/25, Train MSE: 0.9421, Val MSE: 0.9698
Epoch 2/25, Train MSE: 0.8906, Val MSE: 0.9305
Epoch 3/25, Train MSE: 0.8592, Val MSE: 0.9074
Epoch 4/25, Train MSE: 0.8369, Val MSE: 0.8924
Epoch 5/25, Train MSE: 0.8199, Val MSE: 0.8800
Epoch 6/25, Train MSE: 0.8060, Val MSE: 0.8707
Epoch 7/25, Train MSE: 0.7944, Val MSE: 0.8645
Epoch 8/25, Train MSE: 0.7843, Val MSE: 0.8581
Epoch 9/25, Train MSE: 0.7754, Val MSE: 0.8531
Epoch 10/25, Train MSE: 0.7675, Val MSE: 0.8486
Epoch 11/25, Train MSE: 0.7602, Val MSE: 0.8458
Epoch 12/25, Train MSE: 0.7536, Val MSE: 0.8426
Epoch 13/25, Train MSE: 0.7474, Val MSE: 0.8399
Epoch 14/25, Train MSE: 0.7417, Val MSE: 0.8372
Epoch 15/25, Train MSE: 0.7363, Val MSE: 0.8348
Epoch 16/25, Train MSE: 0.7312, Val MSE: 0.8329
Epoch 17/25, Train MSE: 0.7265, Val MSE: 0.8307
Epoch 18/25, Train MSE: 0.7219, Val MSE: 0.8291
Epoch 19/25, Train MSE: 0.7176, Val MSE: 0.8280
Epoch 20/25, Train MSE:

[I 2024-12-16 20:36:06,713] Trial 20 finished with value: 0.8212480716888857 and parameters: {'latent_dim': 24, 'learning_rate': 0.0013626506961453655, 'regularization': 2.7741186232506962e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.6952, Val MSE: 0.8212
Epoch 1/25, Train MSE: 0.7909, Val MSE: 0.8601
Epoch 2/25, Train MSE: 0.7384, Val MSE: 0.8334
Epoch 3/25, Train MSE: 0.7062, Val MSE: 0.8303
Epoch 4/25, Train MSE: 0.6823, Val MSE: 0.8193
Epoch 5/25, Train MSE: 0.6625, Val MSE: 0.8181
Epoch 6/25, Train MSE: 0.6454, Val MSE: 0.8132
Epoch 7/25, Train MSE: 0.6308, Val MSE: 0.8071
Epoch 8/25, Train MSE: 0.6171, Val MSE: 0.8096
Epoch 9/25, Train MSE: 0.6043, Val MSE: 0.8070
Epoch 10/25, Train MSE: 0.5910, Val MSE: 0.8044
Epoch 11/25, Train MSE: 0.5769, Val MSE: 0.8067
Epoch 12/25, Train MSE: 0.5616, Val MSE: 0.8068
Epoch 13/25, Train MSE: 0.5440, Val MSE: 0.8074
Epoch 14/25, Train MSE: 0.5246, Val MSE: 0.8049


[I 2024-12-16 20:38:28,317] Trial 21 finished with value: 0.8048383961159468 and parameters: {'latent_dim': 29, 'learning_rate': 0.010051568086820725, 'regularization': 6.977777443370939e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 15/25, Train MSE: 0.5038, Val MSE: 0.8048
Early stopping triggered.
Epoch 1/25, Train MSE: 0.7691, Val MSE: 0.8442
Epoch 2/25, Train MSE: 0.7143, Val MSE: 0.8312
Epoch 3/25, Train MSE: 0.6804, Val MSE: 0.8123
Epoch 4/25, Train MSE: 0.6545, Val MSE: 0.8139
Epoch 5/25, Train MSE: 0.6337, Val MSE: 0.8088
Epoch 6/25, Train MSE: 0.6148, Val MSE: 0.8110
Epoch 7/25, Train MSE: 0.5970, Val MSE: 0.8070
Epoch 8/25, Train MSE: 0.5783, Val MSE: 0.8108
Epoch 9/25, Train MSE: 0.5580, Val MSE: 0.8017
Epoch 10/25, Train MSE: 0.5329, Val MSE: 0.8075
Epoch 11/25, Train MSE: 0.5049, Val MSE: 0.8034
Epoch 12/25, Train MSE: 0.4752, Val MSE: 0.8085
Epoch 13/25, Train MSE: 0.4435, Val MSE: 0.8047


[I 2024-12-16 20:40:39,182] Trial 22 finished with value: 0.8030409887372694 and parameters: {'latent_dim': 32, 'learning_rate': 0.013760816418327613, 'regularization': 3.818983867405986e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 14/25, Train MSE: 0.4101, Val MSE: 0.8030
Early stopping triggered.
Epoch 1/25, Train MSE: 0.7477, Val MSE: 0.8403
Epoch 2/25, Train MSE: 0.6936, Val MSE: 0.8209
Epoch 3/25, Train MSE: 0.6577, Val MSE: 0.8182
Epoch 4/25, Train MSE: 0.6291, Val MSE: 0.8125
Epoch 5/25, Train MSE: 0.6050, Val MSE: 0.8118
Epoch 6/25, Train MSE: 0.5809, Val MSE: 0.8165
Epoch 7/25, Train MSE: 0.5519, Val MSE: 0.8090
Epoch 8/25, Train MSE: 0.5168, Val MSE: 0.8088
Epoch 9/25, Train MSE: 0.4773, Val MSE: 0.8026
Epoch 10/25, Train MSE: 0.4348, Val MSE: 0.8103
Epoch 11/25, Train MSE: 0.3906, Val MSE: 0.8025
Epoch 12/25, Train MSE: 0.3495, Val MSE: 0.8064
Epoch 13/25, Train MSE: 0.3111, Val MSE: 0.8076
Epoch 14/25, Train MSE: 0.2771, Val MSE: 0.8103
Epoch 15/25, Train MSE: 0.2470, Val MSE: 0.8144


[I 2024-12-16 20:43:08,967] Trial 23 finished with value: 0.8194814121717171 and parameters: {'latent_dim': 33, 'learning_rate': 0.018167765926627312, 'regularization': 2.80527909155703e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 16/25, Train MSE: 0.2204, Val MSE: 0.8195
Early stopping triggered.
Epoch 1/25, Train MSE: 0.8347, Val MSE: 0.8900
Epoch 2/25, Train MSE: 0.7819, Val MSE: 0.8586
Epoch 3/25, Train MSE: 0.7513, Val MSE: 0.8433
Epoch 4/25, Train MSE: 0.7288, Val MSE: 0.8321
Epoch 5/25, Train MSE: 0.7109, Val MSE: 0.8268
Epoch 6/25, Train MSE: 0.6959, Val MSE: 0.8175
Epoch 7/25, Train MSE: 0.6827, Val MSE: 0.8150
Epoch 8/25, Train MSE: 0.6712, Val MSE: 0.8160
Epoch 9/25, Train MSE: 0.6609, Val MSE: 0.8156
Epoch 10/25, Train MSE: 0.6511, Val MSE: 0.8115
Epoch 11/25, Train MSE: 0.6420, Val MSE: 0.8114
Epoch 12/25, Train MSE: 0.6336, Val MSE: 0.8098
Epoch 13/25, Train MSE: 0.6255, Val MSE: 0.8092
Epoch 14/25, Train MSE: 0.6176, Val MSE: 0.8086
Epoch 15/25, Train MSE: 0.6099, Val MSE: 0.8089
Epoch 16/25, Train MSE: 0.6022, Val MSE: 0.8075
Epoch 17/25, Train MSE: 0.5942, Val MSE: 0.8071
Epoch 18/25, Train MSE: 0.5857, Val MSE: 0.8069
Epoch 19/25, Train MSE: 0.5769, Val MSE: 0.8058
Epoch 20/25, Train MSE:

[I 2024-12-16 20:47:02,146] Trial 24 finished with value: 0.803979635359419 and parameters: {'latent_dim': 38, 'learning_rate': 0.005626641101542891, 'regularization': 0.00015512389276424682}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.5112, Val MSE: 0.8040
Epoch 1/25, Train MSE: 0.7587, Val MSE: 0.8415
Epoch 2/25, Train MSE: 0.7028, Val MSE: 0.8262
Epoch 3/25, Train MSE: 0.6684, Val MSE: 0.8158
Epoch 4/25, Train MSE: 0.6418, Val MSE: 0.8139
Epoch 5/25, Train MSE: 0.6202, Val MSE: 0.8159
Epoch 6/25, Train MSE: 0.6016, Val MSE: 0.8130
Epoch 7/25, Train MSE: 0.5837, Val MSE: 0.8121
Epoch 8/25, Train MSE: 0.5658, Val MSE: 0.8143
Epoch 9/25, Train MSE: 0.5452, Val MSE: 0.8047
Epoch 10/25, Train MSE: 0.5184, Val MSE: 0.8096
Epoch 11/25, Train MSE: 0.4877, Val MSE: 0.8126
Epoch 12/25, Train MSE: 0.4549, Val MSE: 0.8114
Epoch 13/25, Train MSE: 0.4205, Val MSE: 0.8135


[I 2024-12-16 20:49:12,771] Trial 25 finished with value: 0.8077089298794584 and parameters: {'latent_dim': 17, 'learning_rate': 0.015957128119986147, 'regularization': 1.8509618583573706e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 14/25, Train MSE: 0.3873, Val MSE: 0.8077
Early stopping triggered.
Epoch 1/25, Train MSE: 0.7094, Val MSE: 0.8338
Epoch 2/25, Train MSE: 0.6462, Val MSE: 0.8100
Epoch 3/25, Train MSE: 0.6072, Val MSE: 0.8174
Epoch 4/25, Train MSE: 0.5633, Val MSE: 0.8179
Epoch 5/25, Train MSE: 0.5087, Val MSE: 0.8115
Epoch 6/25, Train MSE: 0.4403, Val MSE: 0.8060
Epoch 7/25, Train MSE: 0.3691, Val MSE: 0.8148
Epoch 8/25, Train MSE: 0.3041, Val MSE: 0.8104
Epoch 9/25, Train MSE: 0.2519, Val MSE: 0.8174
Epoch 10/25, Train MSE: 0.2098, Val MSE: 0.8298


[I 2024-12-16 20:50:51,922] Trial 26 finished with value: 0.8420574493357069 and parameters: {'latent_dim': 25, 'learning_rate': 0.032420965854927784, 'regularization': 0.0005751861652980218}. Best is trial 3 with value: 0.8027304317478758.


Epoch 11/25, Train MSE: 0.1771, Val MSE: 0.8421
Early stopping triggered.
Epoch 1/25, Train MSE: 0.6822, Val MSE: 0.8393
Epoch 2/25, Train MSE: 0.6090, Val MSE: 0.8238
Epoch 3/25, Train MSE: 0.5361, Val MSE: 0.8238
Epoch 4/25, Train MSE: 0.4228, Val MSE: 0.8209
Epoch 5/25, Train MSE: 0.3065, Val MSE: 0.8216
Epoch 6/25, Train MSE: 0.2159, Val MSE: 0.8215
Epoch 7/25, Train MSE: 0.1508, Val MSE: 0.8321
Epoch 8/25, Train MSE: 0.1087, Val MSE: 0.8493


[I 2024-12-16 20:52:17,058] Trial 27 finished with value: 0.8630876903529715 and parameters: {'latent_dim': 43, 'learning_rate': 0.050248268974003016, 'regularization': 4.82734028509788e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 9/25, Train MSE: 0.0781, Val MSE: 0.8631
Early stopping triggered.
Epoch 1/25, Train MSE: 0.8300, Val MSE: 0.8853
Epoch 2/25, Train MSE: 0.7770, Val MSE: 0.8526
Epoch 3/25, Train MSE: 0.7460, Val MSE: 0.8391
Epoch 4/25, Train MSE: 0.7237, Val MSE: 0.8255
Epoch 5/25, Train MSE: 0.7055, Val MSE: 0.8248
Epoch 6/25, Train MSE: 0.6902, Val MSE: 0.8165
Epoch 7/25, Train MSE: 0.6772, Val MSE: 0.8180
Epoch 8/25, Train MSE: 0.6654, Val MSE: 0.8142
Epoch 9/25, Train MSE: 0.6549, Val MSE: 0.8125
Epoch 10/25, Train MSE: 0.6452, Val MSE: 0.8107
Epoch 11/25, Train MSE: 0.6361, Val MSE: 0.8101
Epoch 12/25, Train MSE: 0.6277, Val MSE: 0.8092
Epoch 13/25, Train MSE: 0.6195, Val MSE: 0.8059
Epoch 14/25, Train MSE: 0.6117, Val MSE: 0.8103
Epoch 15/25, Train MSE: 0.6037, Val MSE: 0.8075
Epoch 16/25, Train MSE: 0.5960, Val MSE: 0.8109
Epoch 17/25, Train MSE: 0.5879, Val MSE: 0.8063


[I 2024-12-16 20:55:06,573] Trial 28 finished with value: 0.8086963823842489 and parameters: {'latent_dim': 26, 'learning_rate': 0.00602255252575581, 'regularization': 1.889410286726043e-05}. Best is trial 3 with value: 0.8027304317478758.


Epoch 18/25, Train MSE: 0.5796, Val MSE: 0.8087
Early stopping triggered.
Epoch 1/25, Train MSE: 0.9207, Val MSE: 0.9543
Epoch 2/25, Train MSE: 0.8676, Val MSE: 0.9142
Epoch 3/25, Train MSE: 0.8362, Val MSE: 0.8912
Epoch 4/25, Train MSE: 0.8142, Val MSE: 0.8762
Epoch 5/25, Train MSE: 0.7974, Val MSE: 0.8660
Epoch 6/25, Train MSE: 0.7836, Val MSE: 0.8583
Epoch 7/25, Train MSE: 0.7720, Val MSE: 0.8522
Epoch 8/25, Train MSE: 0.7619, Val MSE: 0.8469
Epoch 9/25, Train MSE: 0.7529, Val MSE: 0.8412
Epoch 10/25, Train MSE: 0.7448, Val MSE: 0.8389
Epoch 11/25, Train MSE: 0.7374, Val MSE: 0.8355
Epoch 12/25, Train MSE: 0.7306, Val MSE: 0.8335
Epoch 13/25, Train MSE: 0.7243, Val MSE: 0.8317
Epoch 14/25, Train MSE: 0.7184, Val MSE: 0.8288
Epoch 15/25, Train MSE: 0.7128, Val MSE: 0.8265
Epoch 16/25, Train MSE: 0.7076, Val MSE: 0.8249
Epoch 17/25, Train MSE: 0.7026, Val MSE: 0.8230
Epoch 18/25, Train MSE: 0.6979, Val MSE: 0.8213
Epoch 19/25, Train MSE: 0.6935, Val MSE: 0.8205
Epoch 20/25, Train MSE:

[I 2024-12-16 20:59:02,664] Trial 29 finished with value: 0.8144858486495758 and parameters: {'latent_dim': 20, 'learning_rate': 0.0018307852371425963, 'regularization': 0.00011136113953042562}. Best is trial 3 with value: 0.8027304317478758.


Epoch 25/25, Train MSE: 0.6700, Val MSE: 0.8145
Best hyperparameters:
{'latent_dim': 28, 'learning_rate': 0.008089574860460297, 'regularization': 1.6073221934120177e-05}
Best validation loss: 0.8027
Best trial: FrozenTrial(number=3, state=TrialState.COMPLETE, values=[0.8027304317478758], datetime_start=datetime.datetime(2024, 12, 16, 19, 39, 56, 96262), datetime_complete=datetime.datetime(2024, 12, 16, 19, 43, 53, 318659), params={'latent_dim': 28, 'learning_rate': 0.008089574860460297, 'regularization': 1.6073221934120177e-05}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'latent_dim': IntDistribution(high=50, log=False, low=5, step=1), 'learning_rate': FloatDistribution(high=0.1, log=True, low=0.0001, step=None), 'regularization': FloatDistribution(high=0.1, log=True, low=1e-05, step=None)}, trial_id=3, value=None)


# Step 4: Evaluate the model

## Model Performance Evaluation Concepts

When evaluating a recommendation system, it is crucial to choose appropriate metrics that reflect the quality of recommendations and their relevance to the end user. Two commonly used metrics are **Root Mean Square Error (RMSE)** and **Precision@K**. Below is an overview of these metrics and their relevance:

---

### **1. Root Mean Square Error (RMSE)**
- **Purpose:** RMSE is a regression-based metric that measures how well a model predicts numerical ratings.
- **Formula:** 
  $$
  RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
  $$
  where:
  - $y_i$: True rating for item $i$.
  - $\hat{y}_i$: Predicted rating for item $i$.
  - $n$: Total number of ratings in the test set.
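- **Significance:** RMSE summarizes the typical size of the prediction error. Because each error is squared before averaging, a few large misses raise RMSE more than many small ones. Lower RMSE values indicate that the model's predictions are closer to the actual ratings.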
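
As a quick illustration of the formula, the following sketch computes RMSE with NumPy on a handful of made-up ratings. The values and the helper name `rmse` are illustrative only and are not taken from the dataset or the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings, chosen only to make the arithmetic easy to follow
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]

# Squared errors: 0.25, 0.0, 1.0, 1.0 -> mean 0.5625 -> square root 0.75
print(f"RMSE: {rmse(y_true, y_pred):.4f}")
```

The result is expressed in rating units, so 0.75 here reads as "off by about three quarters of a rating point".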

---

### **2. Precision@K**
- **Purpose:** Precision@K evaluates the quality of top-$K$ recommendations by measuring how many of the recommended items are relevant.
- **Formula:** 
  $$
  \text{Precision@}K = \frac{|\text{Relevant} \cap \text{Recommended@}K|}{K}
  $$
  where:
  - $\text{Relevant}$: Set of items relevant to the user.
  - $\text{Recommended@}K$: Top-$K$ items recommended to the user.
  - $K$: Number of items considered in the evaluation.
- **Significance:** Precision@K focuses on the relevance of the most prominent recommendations. High Precision@K values indicate that the top recommendations align well with user preferences.
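
To make the formula concrete, here is a minimal sketch for a single user. The item IDs and the helper name `precision_at_k_single` are hypothetical and serve only to show the set intersection and the division by $K$.

```python
def precision_at_k_single(recommended, relevant, k):
    """Precision@K for one user: share of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k

# Hypothetical item IDs, ranked from most to least recommended
recommended = [10, 42, 7, 99, 3]
relevant = {42, 3, 55}

print(precision_at_k_single(recommended, relevant, k=5))  # 2 hits / 5 = 0.4
print(precision_at_k_single(recommended, relevant, k=2))  # 1 hit / 2 = 0.5
```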

---

### **Why Use Both Metrics?**
- **RMSE** evaluates the model's overall prediction accuracy, making it suitable for assessing rating prediction tasks.
- **Precision@K** focuses on ranking performance, which is critical for recommendation tasks where the goal is to present the most relevant items to users.

By combining these metrics, we can comprehensively evaluate the recommendation system's ability to predict ratings and prioritize relevant items in recommendations.


In [13]:
import numpy as np
from sklearn.metrics import mean_squared_error

def calculate_rmse_with_content(Q, P, test_sparse, content_matrix, b_u, b_i, mu):
    """
    Calculate the RMSE for the test data using the trained Q and P matrices with content blending.

    Args:
    - Q (np.ndarray): Trained item matrix (num_items x latent_dim) with only collaborative features.
    - P (np.ndarray): Trained user matrix (num_users x (latent_dim + content_dim)).
    - test_sparse (csr_matrix): Sparse matrix of test data.
    - content_matrix (csr_matrix): Sparse matrix of content-based features (num_items x content_dim).
    - b_u (np.ndarray): User biases.
    - b_i (np.ndarray): Item biases.
    - mu (float): Global bias.

    Returns:
    - rmse (float): Root Mean Square Error for the predictions.
    """
    # Ensure correct dimensions for P and Q + content_matrix
    latent_dim = Q.shape[1]
    content_dim = content_matrix.shape[1]
    if P.shape[1] != latent_dim + content_dim:
        raise ValueError(f"Shape mismatch: P has {P.shape[1]} dimensions, but latent_dim + content_dim is {latent_dim + content_dim}.")

    # Convert test_sparse to COO format for efficient row-column iteration
    test_coo = test_sparse.tocoo()

    # Prepare lists for true ratings and predicted ratings
    true_ratings = []
    predicted_ratings = []

    # Iterate through each non-zero entry in the test sparse matrix
    for row, col, true_rating in zip(test_coo.row, test_coo.col, test_coo.data):
        # Extract the content-based features for the current item
        content_features = content_matrix.getrow(col).toarray().flatten()

        # Augment Q[col, :] with the content-based features
        augmented_Q_col = np.hstack([Q[col, :], content_features])

        # Predict the rating using biases and dot product
        predicted_rating = mu + b_u[row] + b_i[col] + np.dot(P[row, :], augmented_Q_col)
        
        # Store the results
        true_ratings.append(true_rating)
        predicted_ratings.append(predicted_rating)
    
    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(true_ratings, predicted_ratings))
    return rmse
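
The loop above handles one rating at a time, which can be slow on large test sets. The following is a sketch of a batched variant, not the notebook's implementation: the function name `rmse_with_content_batched`, the `batch_size` parameter, and the choice to pass the COO triplets (`test_coo.row`, `test_coo.col`, `test_coo.data`) directly are all assumptions made for illustration. It densifies content features one batch at a time to keep memory bounded.

```python
import numpy as np

def rmse_with_content_batched(Q, P, rows, cols, ratings, content_matrix,
                              b_u, b_i, mu, batch_size=4096):
    """Batched equivalent of the per-rating prediction loop (sketch)."""
    sq_err_sum = 0.0
    for start in range(0, len(ratings), batch_size):
        r = rows[start:start + batch_size]
        c = cols[start:start + batch_size]

        # Content features for this batch; sparse input is densified per batch
        C = content_matrix[c]
        C = C.toarray() if hasattr(C, "toarray") else np.asarray(C)

        # Same augmented item vector as the loop: [Q[col], content_features]
        augmented_Q = np.hstack([Q[c], C])
        dot = np.einsum("ij,ij->i", P[r], augmented_Q)  # row-wise dot products

        pred = mu + b_u[r] + b_i[c] + dot
        sq_err_sum += float(np.sum((ratings[start:start + batch_size] - pred) ** 2))
    return float(np.sqrt(sq_err_sum / len(ratings)))

# Small synthetic check against the one-rating-at-a-time formula
rng = np.random.default_rng(0)
num_users, num_items, latent_dim, content_dim = 5, 7, 3, 4
P = rng.normal(size=(num_users, latent_dim + content_dim))
Q = rng.normal(size=(num_items, latent_dim))
content = rng.integers(0, 2, size=(num_items, content_dim)).astype(float)
b_u = rng.normal(scale=0.1, size=num_users)
b_i = rng.normal(scale=0.1, size=num_items)
mu = 3.5
rows = np.array([0, 1, 4, 2])
cols = np.array([6, 0, 3, 3])
ratings = np.array([4.0, 3.0, 5.0, 2.5])

loop_preds = np.array([
    mu + b_u[u] + b_i[i] + np.dot(P[u], np.hstack([Q[i], content[i]]))
    for u, i in zip(rows, cols)
])
loop_rmse = float(np.sqrt(np.mean((ratings - loop_preds) ** 2)))
batched_rmse = rmse_with_content_batched(Q, P, rows, cols, ratings, content,
                                         b_u, b_i, mu, batch_size=2)
print(np.isclose(loop_rmse, batched_rmse))
```

With the notebook's objects, the call would take `test_coo.row`, `test_coo.col`, and `test_coo.data` in place of the synthetic arrays.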

In [20]:
# Load the trained matrices
P, Q, b_u, b_i, mu = load_matrices('saved_weights/P_matrix.npy', 'saved_weights/Q_matrix.npy', 
                                   'saved_weights/b_u.npy', 'saved_weights/b_i.npy', 'saved_weights/mu.npy')

# Example Usage
# Assuming Q, P, test_sparse, and content_matrix are defined with correct shapes
rmse = calculate_rmse_with_content(Q, P, test_sparse, content_matrix, b_u, b_i, mu)
print()
print(f"Test RMSE: {rmse:.4f}")

Loaded P matrix from saved_weights/P_matrix.npy with shape (610, 1703)
Loaded Q matrix from saved_weights/Q_matrix.npy with shape (9724, 28)
Loaded user biases from saved_weights/b_u.npy with shape (610,)
Loaded item biases from saved_weights/b_i.npy with shape (9724,)
Loaded global bias (mu) from saved_weights/mu.npy with value 3.5216

Test RMSE: 0.9171


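A test RMSE of 0.9171 is a **good** result here: predictions miss the true rating by a little under one rating point in the root-mean-square sense. It is also close to the square root of the best validation MSE from the hyperparameter search ($\sqrt{0.8027} \approx 0.896$), so the model generalizes to the test set about as well as it did to the validation set.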

### Precision@K metric measurement

In [22]:
import numpy as np
from collections import defaultdict

def precision_at_k(Q, P, test_sparse, content_matrix, b_u, b_i, mu, K=10, threshold=4.0):
    """
    Calculate Precision@K for the test set.

    Args:
    - Q (np.ndarray): Trained item matrix (num_items x latent_dim).
    - P (np.ndarray): Trained user matrix (num_users x (latent_dim + content_dim)).
    - test_sparse (csr_matrix): Sparse matrix of test data.
    - content_matrix (csr_matrix): Sparse matrix of content-based features (num_items x content_dim).
    - b_u (np.ndarray): User biases.
    - b_i (np.ndarray): Item biases.
    - mu (float): Global bias.
    - K (int): Number of top items to consider for Precision@K.
    - threshold (float): Rating threshold to consider an item as relevant.

    Returns:
    - precision (float): Average Precision@K across all users.
    """
    # Initialize precision sum and user count
    precision_sum = 0
    user_count = 0

    # Convert test_sparse to COO for easy row iteration
    test_coo = test_sparse.tocoo()

    # Build user-item rating dictionary from the test set
    test_ratings = defaultdict(list)
    for row, col, rating in zip(test_coo.row, test_coo.col, test_coo.data):
        test_ratings[row].append((col, rating))

    # Iterate over each user in the test set
    for user_id, items_ratings in test_ratings.items():
        # Predict ratings for the items this user rated in the test set
        predicted_ratings = []
        for item_id, _ in items_ratings:
            # Extract content-based features for the item
            content_features = content_matrix.getrow(item_id).toarray().flatten()
            augmented_Q = np.hstack([Q[item_id, :], content_features])

            # Predict rating
            prediction = mu + b_u[user_id] + b_i[item_id] + np.dot(P[user_id, :], augmented_Q)
            predicted_ratings.append((item_id, prediction))

        # Sort predictions by rating in descending order
        predicted_ratings = sorted(predicted_ratings, key=lambda x: x[1], reverse=True)

        # Select the top-K items
        top_k_items = [item_id for item_id, _ in predicted_ratings[:K]]

        # Determine relevance based on threshold
        relevant_items = [item_id for item_id, rating in items_ratings if rating >= threshold]

        # Calculate Precision@K for this user
        hits = len(set(top_k_items) & set(relevant_items))
        precision_sum += hits / K
        user_count += 1

    # Average Precision@K across all users
    precision = precision_sum / user_count if user_count > 0 else 0
    return precision

# Usage
precision_k = precision_at_k(Q, P, test_sparse, content_matrix, b_u, b_i, mu, K=10, threshold=3.5)
print(f"Precision@10: {precision_k:.4f}")


Precision@10: 0.5636
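
Note that `precision_at_k` above ranks only the items each user rated in the test set, so the score reflects how well the model orders those held-out items rather than the whole catalog. A stricter variant would score every item and exclude the ones the user has already seen. The sketch below shows that ranking step for one user. The function name `top_k_for_user`, the `seen_items` argument, and the use of a dense content array are assumptions for illustration, and the data is synthetic.

```python
import numpy as np

def top_k_for_user(user_id, Q, P, content_dense, b_u, b_i, mu, seen_items, k=10):
    """Return the k highest-scoring unseen item indices for one user, best first."""
    # Augmented item matrix: [collaborative factors | content features]
    augmented_Q = np.hstack([Q, content_dense])
    scores = mu + b_u[user_id] + b_i + augmented_Q @ P[user_id]

    # Exclude items the user has already interacted with
    seen = np.array(sorted(seen_items), dtype=int)
    scores[seen] = -np.inf

    k = min(k, len(scores) - len(seen))
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k
    return top[np.argsort(-scores[top])]        # ordered best-first

# Synthetic data, only to show the call
rng = np.random.default_rng(0)
num_users, num_items, latent_dim, content_dim = 4, 12, 3, 2
Q = rng.normal(size=(num_items, latent_dim))
P = rng.normal(size=(num_users, latent_dim + content_dim))
content_dense = rng.integers(0, 2, size=(num_items, content_dim)).astype(float)
b_u = rng.normal(scale=0.1, size=num_users)
b_i = rng.normal(scale=0.1, size=num_items)
mu = 3.5

seen_items = {0, 5}
recommended = top_k_for_user(1, Q, P, content_dense, b_u, b_i, mu, seen_items, k=3)
print(recommended)
```

Precision@K computed this way is usually much lower than the held-out-only version, because unrated items count as misses, so the two numbers should not be compared directly.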


#### Domain Context
Interpret the results in the **context** of your application:

- **E-commerce:** A Precision@10 of 0.56 means that about 5-6 out of 10 recommended products are relevant, which is generally considered good.
- **Movie Recommendation:** If users see 5-6 relevant movies out of 10 recommendations, the performance is decent, especially in a large catalog.

#### Baseline Models
To determine how well your model performs, compare it against simpler baselines:

**Global Popularity Baseline**:
- Recommend the most popular items (e.g., movies with the highest average rating) to all users.

**Random Recommendations**:
- Recommend random items to users and calculate Precision@K.

In [23]:
# Simple popularity-based Precision@K baseline
def precision_at_k_baseline(test_ratings, K=10, threshold=3.5):
    """
    Calculate Precision@K using a global popularity baseline.
    """
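    # Rank items by their average rating in the ratings DataFrame passed in
    # (the original referenced the global `test_data` instead of the argument)
    item_avg_ratings = test_ratings.groupby('movieId')['rating'].mean().sort_values(ascending=False)
    popular_items = item_avg_ratings.index[:K]  # Top-K items by average rating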

    precision_sum = 0
    user_count = 0

    for user, group in test_ratings.groupby('userId'):
        relevant_items = group[group['rating'] >= threshold]['movieId'].tolist()
        hits = len(set(popular_items) & set(relevant_items))
        precision_sum += hits / K
        user_count += 1

    return precision_sum / user_count if user_count > 0 else 0

baseline_precision = precision_at_k_baseline(test_data, K=10, threshold=3.5)
print(f"Global Popularity Baseline Precision@10: {baseline_precision:.4f}")


Global Popularity Baseline Precision@10: 0.0016
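
The random baseline listed above is not implemented in the notebook. A minimal sketch follows, assuming items are indexed `0..num_items-1` and that each user's relevant test items are available in a dict. The names `precision_at_k_random` and `relevant_by_user` are hypothetical.

```python
import numpy as np

def precision_at_k_random(relevant_by_user, num_items, K=10, seed=0):
    """Average Precision@K when each user gets K items drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    precision_sum = 0.0
    for relevant_items in relevant_by_user.values():
        # Draw K distinct item indices for this user
        recommended = rng.choice(num_items, size=K, replace=False)
        hits = len(set(recommended.tolist()) & set(relevant_items))
        precision_sum += hits / K
    return precision_sum / len(relevant_by_user) if relevant_by_user else 0.0

# Hypothetical relevant items per user (user index -> set of item indices)
relevant_by_user = {0: {1, 2, 3}, 1: {4}}
random_precision = precision_at_k_random(relevant_by_user, num_items=1000, K=10)
print(f"Random Baseline Precision@10: {random_precision:.4f}")
```

In a large catalog the expected value is roughly the fraction of the catalog that is relevant to a user, so the score should be close to zero, which gives a floor against which the other numbers can be read.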


#### Dataset Sparsity
If your dataset is **sparse** (i.e., most user-item pairs have no rating), achieving a high Precision@K is difficult, so a Precision@K above 0.5 is usually a strong result.

To check dataset sparsity:

In [24]:
sparsity = 1.0 - (len(train_data) / (len(user_mapping) * len(movie_mapping)))
print(f"Dataset Sparsity: {sparsity:.4%}")


Dataset Sparsity: 98.8146%


There you have it! **Sparsity** is more than **98%**. 
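
Written out, the quantity computed by the cell above is the following, where $|R_{\text{train}}|$ is the number of training ratings, $|U|$ the number of users, and $|I|$ the number of items (the symbols are introduced here only to restate the code):

$$
\text{Sparsity} = 1 - \frac{|R_{\text{train}}|}{|U| \cdot |I|}
$$

Assuming `user_mapping` and `movie_mapping` have the same sizes as the loaded matrices (610 users and 9724 items), the interaction matrix has $610 \times 9724 = 5{,}931{,}640$ cells, and a sparsity of 98.8146% means that only about 1.19% of them hold a training rating.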

#### Recall at K
And for the **fun** of it, here is a recall metric evaluation:

In [25]:
# Recall@K
def recall_at_k(Q, P, test_sparse, content_matrix, b_u, b_i, mu, K=10, threshold=3.5):
    recall_sum = 0
    user_count = 0

    test_coo = test_sparse.tocoo()
    test_ratings = defaultdict(list)
    for row, col, rating in zip(test_coo.row, test_coo.col, test_coo.data):
        test_ratings[row].append((col, rating))

    for user_id, items_ratings in test_ratings.items():
        predicted_ratings = []
        for item_id, _ in items_ratings:
            content_features = content_matrix.getrow(item_id).toarray().flatten()
            augmented_Q = np.hstack([Q[item_id, :], content_features])
            prediction = mu + b_u[user_id] + b_i[item_id] + np.dot(P[user_id, :], augmented_Q)
            predicted_ratings.append((item_id, prediction))

        predicted_ratings = sorted(predicted_ratings, key=lambda x: x[1], reverse=True)[:K]
        top_k_items = [item_id for item_id, _ in predicted_ratings]
        relevant_items = [item_id for item_id, rating in items_ratings if rating >= threshold]

        hits = len(set(top_k_items) & set(relevant_items))
        recall_sum += hits / len(relevant_items) if relevant_items else 0
        user_count += 1

    return recall_sum / user_count if user_count > 0 else 0

recall_k = recall_at_k(Q, P, test_sparse, content_matrix, b_u, b_i, mu, K=10, threshold=3.5)
print(f"Recall@10: {recall_k:.4f}")


Recall@10: 0.6994


A Recall@10 of 0.6994 means that, on average, your model is able to retrieve 69.94% of all relevant items for a user when recommending the top-10 items.
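
One detail helps when reading this number: both `precision_at_k` and `recall_at_k` rank only the items a user rated in the test set. For a user with at most $K$ held-out items, every one of them lands in the top $K$, so recall is 1 whenever the user has any relevant item, while precision cannot exceed the number of held-out items divided by $K$. The sketch below reproduces that arithmetic for one hypothetical user. The helper `precision_recall_at_k` is not part of the notebook.

```python
def precision_recall_at_k(ranked_items, relevant_items, K=10):
    """Precision@K and Recall@K for one user, using the same formulas as above."""
    top_k = ranked_items[:K]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / K
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Hypothetical user with only 4 held-out items, 3 of them relevant
ranked_items = [8, 2, 5, 9]
relevant_items = [2, 5, 9]

precision, recall = precision_recall_at_k(ranked_items, relevant_items, K=10)
print(precision, recall)  # 0.3 1.0
```

How much this affects the averages depends on how many users have short test lists, which is worth checking before comparing these scores with published results.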

#### Evaluating Your Recall@10

**High Recall@K:**
- **Strength**: Your model retrieves most of the relevant items, meaning the user is likely to see the movies (or items) they care about.
- If Recall@10 is close to 1.0, your model is performing exceptionally well in identifying relevant items.

**Comparing Precision@K and Recall@K:**
- You reported Precision@10 = 0.56 and Recall@10 = 0.6994.
- Precision@10 of 0.56 means 56% of the recommended top-10 items are relevant.
- Recall@10 of 0.6994 means you are covering almost 70% of all relevant items for users.
- The balance indicates that while your model retrieves many relevant items (high recall), some of the top-$K$ recommendations are still not relevant, which lowers precision.

**Dataset Sparsity Impact:**
- In sparse datasets (common in recommendation systems), achieving a Recall@10 near 0.7 is impressive because there are so many items that a user may not have interacted with.
- High recall suggests the model effectively prioritizes relevant items despite sparse data.

**Domain Context:**
- **Movie Recommendations:** A Recall@10 of 0.7 is strong. If a user sees 70% of all the movies they would enjoy in the top 10, they’re likely to be satisfied.
- **E-commerce:** A similar result would indicate that most relevant products are included in the recommendations.
