# Week 2: Recommender Systems

### Table of Contents
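1. [Making Recommendations](#making-recommendations)
2. [Using per-item features](#using-per-item-features)
3. [Collaborative filtering algorithm](#collaborative-filtering-algorithm)
4. [Binary labels: favs, likes and clicks](#binary-labels-favs-likes-and-clicks)
5. [Mean Normalization](#mean-normalization)
6. [TensorFlow implementation of collaborative filtering](#tensorflow-implementation-of-collaborative-filtering)
7. [Finding related items](#finding-related-items)
8. [Vectorized Formulation for Collaborative Filtering in TensorFlow](#vectorized-formulation-for-collaborative-filtering-in-tensorflow)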

---

## Making Recommendations

This section introduces the topic of Recommender Systems, highlighting their significant commercial impact and setting up the basic framework and notation using the example of movie rating prediction.

### Commercial Importance

* **Widespread Use:** Recommender systems are used everywhere online (e.g., shopping sites like Amazon, streaming services like Netflix, food delivery apps).
* **High Value:** For many companies, a large fraction of sales and economic value is directly driven by the success of their recommender systems.
* **Academic vs. Commercial Attention:** The commercial impact of recommender systems is arguably far greater than the attention they receive in academia.

### Core Framework (Movie Rating Example)
The goal is to predict how users would rate movies they haven't yet watched (denoted by '?') to decide what to recommend.

| Item | Notation | Definition/Example |
| :--- | :--- | :--- |
| **Number of Users** | $n_u$ | In the example, $n_u = 4$ (Alice, Bob, Carol, Dave). |
| **Number of Items (Movies)** | $n_m$ | In the example, $n_m = 5$. |
| **Rating Indicator** | $r(i, j)$ | A binary value: $r(i, j) = 1$ if user $j$ has rated movie $i$; $0$ otherwise. |
| **Actual Rating** | $y^{(i, j)}$ | The rating (0 to 5 stars) given by user $j$ to movie $i$. (E.g., $y^{(3, 2)} = 4$). |
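
As a concrete illustration of this notation, the sketch below stores a small rating table in NumPy. The ratings are illustrative values chosen only to agree with $y^{(3, 2)} = 4$; they are not data taken from these notes. `np.nan` stands in for '?', and NumPy indices start at 0 while the notation above starts at 1.

```python
import numpy as np

# Rows are movies (i), columns are users (j): Alice, Bob, Carol, Dave.
# np.nan marks a '?' (not yet rated). Values are illustrative only.
Y = np.array([
    [5.0,    5.0,    0.0,    0.0],
    [5.0,    np.nan, np.nan, 0.0],
    [np.nan, 4.0,    0.0,    np.nan],
    [0.0,    0.0,    5.0,    4.0],
    [0.0,    0.0,    5.0,    np.nan],
])

# r(i, j) = 1 where a rating exists, 0 otherwise.
R = (~np.isnan(Y)).astype(int)

n_m, n_u = Y.shape          # number of movies, number of users
print(n_m, n_u)             # 5 4
print(Y[2, 1], R[2, 1])     # y^(3,2) and r(3,2) in the 1-based notation
```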

### Next Step
The subsequent lesson will begin developing an algorithm to predict the missing ratings. The first model will temporarily assume that features (extra information) about the movies (e.g., whether it is a romance movie or an action movie) are already available. Later in the notes, we will address how to build the system when these explicit movie features are not available.

---

## Using per-item features

This section details the first approach to building a recommender system: using **pre-existing item features** to create a personalized linear regression model for each user.

### Framework and Notation

We have pre-defined features ($X$) for each item (movie), such as $x_1$ (Romance level) and $x_2$ (Action level).
* $n_u$: Number of users (e.g., 4).
* $n_m$: Number of movies/items (e.g., 5).
* $n$: Number of features (e.g., 2).
* $r(i, j) = 1$: User $j$ has rated movie $i$.
* $y^{(i, j)}$: The actual rating given by user $j$ to movie $i$.

### The Model: Personalized Linear Regression

The system fits a separate linear regression model for each user $j$ to predict their rating for any movie $i$.

$$\text{Prediction for } y^{(i, j)} = \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}$$
* $\mathbf{w}^{(j)}$ and $b^{(j)}$ are the unique parameters (weights and bias) learned for user $j$.
* $\mathbf{x}^{(i)}$ is the feature vector for movie $i$.
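
A minimal sketch of this prediction for one user and one movie follows. The parameter and feature values are assumptions picked for illustration (a user who likes romance and is indifferent to action), and `predict_rating` is a hypothetical helper name.

```python
import numpy as np

def predict_rating(w, b, x):
    """Predicted rating of one user (w, b) for one movie with features x."""
    return float(np.dot(w, x) + b)

# Assumed values for illustration only.
w_user = np.array([5.0, 0.0])     # weights for [romance, action]
b_user = 0.0
x_movie = np.array([0.99, 0.0])   # [romance level, action level]

print(predict_rating(w_user, b_user, x_movie))
```

With these values the prediction is close to 5 stars, because the movie scores high on the one feature this user weights heavily.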

### The Cost Function

The objective is to learn the parameters ($\mathbf{w}^{(j)}$ and $b^{(j)}$) for all users simultaneously by minimizing a regularized mean squared error cost function.

* **Cost Function for All Users ($J$):** The cost is the sum of the individual cost functions for every user.
    
    $$J(\mathbf{w}^{(1)}, b^{(1)}, \dots, \mathbf{w}^{(n_u)}, b^{(n_u)}) = \sum_{j=1}^{n_u} J(\mathbf{w}^{(j)}, b^{(j)})$$

* **Individual User Cost ($J(\mathbf{w}^{(j)}, b^{(j)})$):**
    
    $$J(\mathbf{w}^{(j)}, b^{(j)}) = \frac{1}{2} \sum_{i: r(i, j)=1} \left( (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (w_k^{(j)})^2$$

    * The sum $\sum_{i: r(i, j)=1}$ means we only calculate the error for movies that user $j$ has actually rated.
    * The second term is standard regularization to prevent overfitting. (Note: The normalization constant $1/m^{(j)}$, where $m^{(j)}$ is the number of movies rated by user $j$, is omitted for convenience, as it doesn't change the parameters at the minimum; <u>for details, see the Bonus section below</u>.)
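
The individual user cost can be sketched in NumPy as below. This is an illustration under stated assumptions rather than a reference implementation: `user_cost` is a hypothetical name, the data are made up, and unrated entries of `y` hold a placeholder value that the mask `r` removes.

```python
import numpy as np

def user_cost(w, b, X, y, r, lambda_):
    """Regularized squared-error cost for a single user.

    X: (n_m, n) movie features, y: (n_m,) ratings (placeholder where unrated),
    r: (n_m,) 1 if the user rated the movie, else 0.
    """
    errors = (X @ w + b - y) * r          # zero out unrated movies
    return 0.5 * np.sum(errors ** 2) + (lambda_ / 2) * np.sum(w ** 2)

# Illustrative data: three movies, two features, one unrated movie.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
y = np.array([5.0, 0.0, 0.0])   # last entry is a placeholder (unrated)
r = np.array([1, 1, 0])
w = np.array([4.0, 1.0])
b = 0.0

cost = user_cost(w, b, X, y, r, lambda_=1.0)
print(cost)
```

Here the two rated movies have errors of $-1$ and $1$, so the error term is $1$, and the regularization term is $\frac{1}{2}(16 + 1) = 8.5$.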

### Bonus: *Dropping the Normalization Constant in Recommender Systems*

The number of movies rated by user $j$, $m^{(j)}$, is often dropped from the denominator of the recommender-system cost function because $1/m^{(j)}$ is a positive constant scaling factor that does not change which parameters minimize the cost.

#### 1. Scaling Does Not Change the Minimum

* The overall goal is to find the parameters ($\mathbf{w}^{(j)}$ and $b^{(j)}$) that minimize the cost function $J$.
* $1/m^{(j)}$ is a positive constant determined by the training data; it does not depend on the parameters. (The factor $1/2$ is kept in the simplified cost.)
* Multiplying or dividing the entire cost function by a positive constant only scales it vertically; it does not change the location of the minimum point (the optimal parameters). Because user $j$'s parameters appear only in user $j$'s term of the sum, this holds for each user's term separately.
    $$ \arg\min_{\mathbf{w}, b} \left[ J_{\text{original}}(\mathbf{w}, b) \right] = \arg\min_{\mathbf{w}, b} \left[ C \cdot J_{\text{simplified}}(\mathbf{w}, b) \right] = \arg\min_{\mathbf{w}, b} \left[ J_{\text{simplified}}(\mathbf{w}, b) \right] \quad \text{where } C = \frac{1}{m^{(j)}} > 0 \text{ is the constant.}$$

#### 2. Simplifies Optimization

* In Gradient Descent, dropping the constant $\frac{1}{m^{(j)}}$ only scales the magnitude of the gradient. This is compensated for by adjusting the learning rate ($\alpha$).
* The overall cost function $J_{\text{overall}}$ is a sum of individual user costs $J(\mathbf{w}^{(j)}, b^{(j)})$. Using a different division factor ($m^{(j)}$) for every user's loss and regularization terms unnecessarily complicates the algebra for joint optimization.
* Dropping the constant leads to a cleaner, unified cost function primarily focused on minimization.

#### Comparison to Linear Regression (MSE)

The normalization term ($1/m$) is typically retained in standard Linear Regression (Mean Squared Error, MSE) for statistical and practical reasons.

| Context | Purpose of $J$ | Why $1/m$ is Kept/Dropped |
| :--- | :--- | :--- |
| Linear Regression | Evaluation and Comparison (MSE) | Kept, because it defines the average squared error (MSE), making the cost value interpretable and comparable across datasets of different sizes. |
| Recommender System | Optimization | Dropped, because $m^{(j)}$ is a constant that doesn't change the optimal parameters and unnecessarily complicates the joint cost function. |

### Next Challenge

* The current method relies on having pre-defined features ($\mathbf{x}^{(i)}$) for every item.
* The next section will explore a modification of this algorithm—**Collaborative Filtering**—which works even when these detailed item features are not available beforehand.

---

## Collaborative filtering algorithm

This section introduces **Collaborative Filtering**, a powerful technique for recommender systems where the item features ($\mathbf{x}$) are learned from the user ratings rather than being provided in advance.

### 1. The Challenge: Learning Item Features ($\mathbf{x}$)

In the previous model, we assumed movie features ($\mathbf{x}$) were known (e.g., Romance, Action level). In the new approach, when features are unknown, the ratings provided by multiple users on the same item can be leveraged to learn what those item features ($\mathbf{x}$) should be.

**Why it Works:** Having ratings from several users (each with known preference parameters $\mathbf{w}$ and $b$) allows the system to infer the features of an unfeatured movie that best explain those ratings. This relies on the "collaboration" of ratings from multiple users on the same item, which gives the algorithm its name.

### 2. Cost Function for Learning Features ($\mathbf{x}$)

If the user preference parameters ($\mathbf{w}^{(j)}, b^{(j)}$) are temporarily fixed, the features for a single movie $i$ ($\mathbf{x}^{(i)}$) are learned by minimizing the cost function:

$$\min_{\mathbf{x}^{(i)}} J(\mathbf{x}^{(i)}) = \frac{1}{2} \sum_{j: r(i, j)=1} \left( (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (x_k^{(i)})^2$$
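
To make this concrete, the sketch below recovers the features of one movie from the ratings of four users whose parameters are assumed known. Because $J(\mathbf{x}^{(i)})$ is quadratic in $\mathbf{x}^{(i)}$ when $\mathbf{w}$ and $b$ are fixed, setting its gradient to zero gives a linear system; that closed-form solve is used here for brevity in place of the iterative minimization described in these notes. The function name and all numeric values are assumptions.

```python
import numpy as np

def learn_movie_features(W, b, y, lambda_):
    """Minimize J(x) for one movie, given the users who rated it.

    W: (k, n) parameters of the k users who rated the movie,
    b: (k,) their biases, y: (k,) their ratings.
    Setting the gradient of J to zero gives (W^T W + lambda I) x = W^T (y - b).
    """
    n = W.shape[1]
    A = W.T @ W + lambda_ * np.eye(n)
    return np.linalg.solve(A, W.T @ (y - b))

# Assumed user parameters: two romance fans and two action fans.
W = np.array([[5.0, 0.0],
              [5.0, 0.0],
              [0.0, 5.0],
              [0.0, 5.0]])
b = np.zeros(4)
y = np.array([5.0, 5.0, 0.0, 0.0])   # their ratings of one movie

x = learn_movie_features(W, b, y, lambda_=0.0)
print(x)   # features close to [1, 0]: high romance, no action
```

Increasing `lambda_` shrinks the learned features toward zero, which is the effect of the regularization term.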

### 3. The Full Collaborative Filtering Cost Function

The final algorithm combines the objective of learning user preferences ($\mathbf{w}, b$) and learning item features ($\mathbf{x}$) into a single unified cost function ($J$):

* **Minimization:** The algorithm simultaneously minimizes $J$ with respect to all parameters: the user parameters ($\mathbf{w}^{(j)}, b^{(j)}$ for all users $j$) and the movie features ($\mathbf{x}^{(i)}$ for all movies $i$).
* **Unified Cost ($J$):** This combines the prediction error and the regularization terms for both users and movies.

$$J(\mathbf{w}, \mathbf{b}, \mathbf{x}) = \frac{1}{2} \sum_{(i, j): r(i, j)=1} \left( (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (w_k^{(j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$$
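
The unified cost can be written out pair by pair, which mirrors the formula term for term. The sketch below is deliberately unvectorized for readability; `cofi_cost_loops` is a hypothetical name and the tiny data set is made up.

```python
import numpy as np

def cofi_cost_loops(X, W, b, Y, R, lambda_):
    """Collaborative filtering cost J, computed pair by pair.

    X: (n_m, n) movie features, W: (n_u, n) user weights, b: (n_u,) biases,
    Y: (n_m, n_u) ratings, R: (n_m, n_u) with R[i, j] = 1 if rated.
    """
    n_m, n_u = Y.shape
    J = 0.0
    for i in range(n_m):
        for j in range(n_u):
            if R[i, j] == 1:                       # only rated pairs count
                pred = np.dot(W[j], X[i]) + b[j]
                J += 0.5 * (pred - Y[i, j]) ** 2
    J += (lambda_ / 2) * np.sum(W ** 2)            # regularize user weights
    J += (lambda_ / 2) * np.sum(X ** 2)            # regularize movie features
    return J

# Tiny illustrative problem: 2 movies, 2 users, 1 feature.
X = np.array([[1.0], [2.0]])
W = np.array([[1.0], [0.0]])
b = np.array([0.0, 1.0])
Y = np.array([[3.0, 1.0],
              [0.0, 4.0]])
R = np.array([[1, 1],
              [0, 1]])

J = cofi_cost_loops(X, W, b, Y, R, lambda_=2.0)
print(J)
```

In this example the three rated pairs contribute $2 + 0 + 4.5$, and the two regularization terms contribute $1$ and $5$.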

### 4. Optimization

Gradient Descent or other optimization algorithms are used to minimize the cost function $J$. In this full formulation, both the user preferences ($\mathbf{w}, \mathbf{b}$) and the item features ($\mathbf{x}$) are treated as parameters to be learned and are updated iteratively.

$$
\begin{aligned}
w_k^{(j)} &= w_k^{(j)} - \alpha \frac{\partial}{\partial w_k^{(j)}} J(\mathbf{w}, \mathbf{b}, \mathbf{x}) \\
b^{(j)} &= b^{(j)} - \alpha \frac{\partial}{\partial b^{(j)}} J(\mathbf{w}, \mathbf{b}, \mathbf{x}) \\
x_k^{(i)} &= x_k^{(i)} - \alpha \frac{\partial}{\partial x_k^{(i)}} J(\mathbf{w}, \mathbf{b}, \mathbf{x})
\end{aligned}
$$
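
These updates can be sketched in NumPy with gradients derived by hand from $J$. Everything below is an assumption made for illustration: the function names, the rating data, the learning rate, and the number of steps. Unrated entries of `Y` hold 0 and are masked out by `R`.

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lambda_):
    """Collaborative filtering cost J for the current parameters."""
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

def gradient_step(X, W, b, Y, R, lambda_, alpha):
    """One simultaneous gradient descent update of X, W and b."""
    err = (X @ W.T + b - Y) * R              # (n_m, n_u), zero where unrated
    grad_W = err.T @ X + lambda_ * W         # dJ/dW, shape (n_u, n)
    grad_b = err.sum(axis=0)                 # dJ/db, shape (n_u,)
    grad_X = err @ W + lambda_ * X           # dJ/dX, shape (n_m, n)
    return X - alpha * grad_X, W - alpha * grad_W, b - alpha * grad_b

# Illustrative data: 5 movies, 4 users.
Y = np.array([[5, 5, 0, 0],
              [5, 0, 0, 0],
              [0, 4, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 0]])

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(5, 2))   # small random initial features
W = rng.normal(scale=0.1, size=(4, 2))   # small random initial weights
b = np.zeros(4)

cost_before = cofi_cost(X, W, b, Y, R, lambda_=1.0)
for _ in range(200):
    X, W, b = gradient_step(X, W, b, Y, R, lambda_=1.0, alpha=0.01)
cost_after = cofi_cost(X, W, b, Y, R, lambda_=1.0)

print(cost_before, cost_after)   # the cost should go down
```

All three gradients are computed from the same error matrix before any parameter changes, which is what makes the update simultaneous.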

### Next Step
The next section will address a generalization of this model to systems using binary labels (e.g., like/dislike) instead of continuous star ratings.

---

## Binary labels: favs, likes and clicks

This section explains how to adapt the collaborative filtering algorithm from predicting continuous ratings (like 0 to 5 stars) to predicting **binary labels** (like/dislike, purchase/not purchase), using a method analogous to moving from linear regression to logistic regression.

### Binary Label Context

Many recommender systems deal with binary labels (1 or 0) rather than star ratings.

* **Label Meanings (Engagement):**
    * **1 (Engaged):** User liked, purchased, favorited, clicked, or spent a minimum time (e.g., 30 seconds) on an item after exposure.
    * **0 (Not Engaged):** User did not like, did not purchase, or left quickly after being exposed to the item.
    * **? (Question Mark):** The user was not yet exposed to the item (no rating/engagement data).
* **Goal:** Predict the probability that a user will like or engage with a new item (the '?' items) to decide what to recommend.

### The Model: Logistic Regression Analogy

The model shifts from predicting a numerical rating to predicting a probability of engagement. The linear combination of user preferences ($\mathbf{w}^{(j)}$) and item features ($\mathbf{x}^{(i)}$) is passed through the logistic function ($g$) (also known as the sigmoid function).

$$\text{P}(y^{(i, j)}=1) = g(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)})$$

where $g(z) = \frac{1}{1 + e^{-z}}$.
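
A short sketch of this model for a single user-item pair is shown below. The helper names and the numeric values are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def engagement_probability(w, b, x):
    """Estimated P(y = 1) for one user (w, b) and one item with features x."""
    return sigmoid(np.dot(w, x) + b)

# Assumed values for illustration.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 2.0])

p = engagement_probability(w, b, x)
print(p)   # w.x + b = 0, so the output is 0.5
```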

### The Cost Function: Binary Cross-Entropy

Since the output is a probability and the labels are binary, the squared error cost function (used for ratings) is replaced with the Binary Cross-Entropy Loss (or log loss), which is standard for logistic regression.

* **Loss for a Single Example:**
    $$L(f(\mathbf{x}), y) = -y \log(f(\mathbf{x})) - (1-y) \log(1-f(\mathbf{x}))$$
    Here $f(\mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x} + b)$ is the predicted probability of engagement.
* **Overall Binary Collaborative Filtering Cost ($J$):** The total cost function sums this binary cross-entropy loss over all user-item pairs where a label exists ($r(i, j)=1$), with $f(\mathbf{x}^{(i)}) = g(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)})$ for the pair $(i, j)$, plus the regularization terms for $\mathbf{w}$ and $\mathbf{x}$.

$$J(\mathbf{w}, \mathbf{b}, \mathbf{x}) = \sum_{(i, j): r(i, j)=1} L(f(\mathbf{x}^{(i)}), y^{(i,j)}) + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (w_k^{(j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$$
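
One way to compute this cost in NumPy is sketched below. The function name, the small constant `eps` (added only to avoid `log(0)`), and the example data are assumptions for illustration.

```python
import numpy as np

def binary_cofi_cost(X, W, b, Y, R, lambda_):
    """Binary cross-entropy collaborative filtering cost.

    Y holds 0/1 labels; R[i, j] = 1 where the label is observed.
    """
    z = X @ W.T + b                          # (n_m, n_u) logits
    f = 1.0 / (1.0 + np.exp(-z))             # predicted probabilities
    eps = 1e-12                              # guards against log(0)
    loss = -Y * np.log(f + eps) - (1 - Y) * np.log(1 - f + eps)
    data_term = np.sum(loss * R)             # only observed pairs contribute
    reg_term = (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))
    return data_term + reg_term

# Illustrative check: with all parameters at zero every prediction is 0.5,
# so each of the three observed pairs contributes log(2).
X = np.zeros((2, 1))
W = np.zeros((2, 1))
b = np.zeros(2)
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
R = np.array([[1, 1],
              [0, 1]])

cost = binary_cofi_cost(X, W, b, Y, R, lambda_=1.0)
print(cost, 3 * np.log(2))
```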

### Generalization

This generalization significantly opens up the set of applications that can be addressed by collaborative filtering, allowing the algorithm to work with implicit feedback (like clicks or viewing time) rather than requiring explicit user ratings.

---

## Mean Normalization

Mean normalization is a technique used in recommender systems to preprocess movie ratings, making the learning algorithm run more efficiently and, more importantly, providing better initial predictions for new users who haven't rated any items.

### Purpose of Mean Normalization

* **Algorithm Efficiency:** Normalization can help the optimization algorithm (like gradient descent) run faster, similar to feature normalization in linear regression.
* **Improved Predictions for New Users:** It prevents the algorithm from predicting a zero rating for all movies for a brand new user who has not yet provided any ratings.

### The Problem Without Normalization

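* If a new user (like Eve, user 5) has rated no movies, no error term involves her parameters, so the regularization term drives $\mathbf{w}^{(5)}$ to $\mathbf{0}$. The bias $b^{(5)}$ is not regularized and receives no gradient, so it stays at its initial value, assumed here to be $0$.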
* The predicted rating for any movie $i$ would be $\mathbf{w}^{(5)} \cdot \mathbf{x}^{(i)} + b^{(5)} = 0$, leading to the unhelpful prediction that the new user will rate all movies zero stars.

### Mean Normalization Process

* **Calculate Movie Means ($\mu$):** For each movie $i$, compute the average rating $\mu_i$ given by only the users who have rated that movie.
* **Normalize Ratings ($Y$):** Create a new rating matrix where the average rating $\mu_i$ is subtracted from every rating $y^{(i, j)}$ for movie $i$. This new matrix is used as the training target.
    * Example: A 5-star rating for a movie with an average of 2.5 becomes $5 - 2.5 = 2.5$.

<img src="images/mean_norm.png" width=700>

* **Impact on New Users:** With this normalization, the parameters for a new user like Eve will still be $\mathbf{w}^{(5)}=\mathbf{0}$ and $b^{(5)}=0$. However, these parameters now predict a **normalized rating of 0**.

### Making Predictions with Normalization

* To make a final, non-normalized rating prediction for user $j$ on movie $i$, the mean rating ($\mu_i$) must be added back:
$$\text{Prediction} = (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) + \mu_i$$

### Benefits for New Users

* Because $\mathbf{w}^{(5)} \cdot \mathbf{x}^{(i)} + b^{(5)} = 0$ for a new user, the predicted rating simplifies to $\text{Prediction} = \mu_i$.
* This means the algorithm initially guesses that the new user will rate a movie equal to the **average rating** that other users gave that movie, which is a much more reasonable initial guess than zero stars.
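
The normalization and the resulting prediction for a new user can be sketched as follows. The ratings are illustrative (unrated entries hold 0 and are masked by `R`), `mean_normalize` is a hypothetical helper, and the feature matrix is a placeholder because it has no effect when $\mathbf{w} = \mathbf{0}$.

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, using only the users who rated it.

    Assumes every movie has at least one rating (otherwise the mean is undefined).
    """
    mu = (Y * R).sum(axis=1) / R.sum(axis=1)     # (n_m,) per-movie means
    Y_norm = (Y - mu[:, None]) * R               # keep unrated entries at 0
    return Y_norm, mu

# Illustrative ratings; the last column is a new user (Eve) with no ratings.
Y = np.array([[5, 5, 0, 0, 0],
              [5, 0, 0, 0, 0],
              [0, 4, 0, 0, 0],
              [0, 0, 5, 4, 0],
              [0, 0, 5, 0, 0]], dtype=float)
R = np.array([[1, 1, 1, 1, 0],
              [1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0],
              [1, 1, 1, 1, 0],
              [1, 1, 1, 0, 0]])

Y_norm, mu = mean_normalize(Y, R)

# A new user's learned parameters are w = 0, b = 0, so the prediction is mu.
w_eve, b_eve = np.zeros(2), 0.0
X = np.ones((5, 2))                  # placeholder features; irrelevant when w = 0
pred_eve = X @ w_eve + b_eve + mu

print(mu)
print(pred_eve)
```

The first movie has mean 2.5, so a 5-star rating for it becomes 2.5 in `Y_norm`, matching the example above.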

### Row vs. Column Normalization

* **Row Normalization (Normalizing by Movie/Item):** This is the focus of the process described, which helps generate reasonable predictions for a **new user** who has few or no ratings.
* **Column Normalization (Normalizing by User):** This would help if there was a brand **new movie** with no ratings. However, normalizing by movie (row normalization) is considered more important in this application because new users are often served content immediately, whereas new movies are usually held back until they receive some initial ratings.

---

## TensorFlow implementation of collaborative filtering

This section explains how to implement the collaborative filtering algorithm using **TensorFlow's Automatic Differentiation (Auto Diff)** feature, which simplifies optimization by removing the need for manual calculus.

### TensorFlow for Non-Neural Networks

* TensorFlow is not limited to neural networks; it is a versatile tool for implementing various learning algorithms, including collaborative filtering.
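* The primary advantage of using TensorFlow here is its Automatic Differentiation (Auto Diff) feature. It is sometimes loosely called Auto Grad, although Autograd is actually the name of a specific software package for automatic differentiation.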

### Automatic Differentiation (Auto Diff)

* **Goal:** To implement Gradient Descent (or other optimization algorithms such as Adam) without manually computing the partial derivatives of the cost function ($J$).
* **Mechanism (Gradient Tape):** TensorFlow uses a feature called the **Gradient Tape** (`tf.GradientTape`).
    1.  The user defines how to compute the cost function ($J$).
    2.  The Gradient Tape records the sequence of operations (the forward pass) used to calculate $J$.
    3.  TensorFlow then uses this recorded sequence to automatically compute the necessary derivatives (the backward pass).

### Implementing Gradient Descent (Conceptual Example)

The core steps in Python/TensorFlow syntax are:
1.  **Initialize Parameters:** Define parameters (like $w, b, x$) as TensorFlow variables (`tf.Variable`).
2.  **Use Gradient Tape:** Wrap the calculation of the cost function $J$ inside a `with tf.GradientTape() as tape:` block.
3.  **Compute Derivatives:** Use `tape.gradient(J, [parameters])` to get the derivatives (gradients) of $J$ with respect to the specified parameters.
4.  **Update Parameters:** Use the gradients to update the parameters (e.g., $w = w - \alpha \cdot \frac{\partial J}{\partial w}$).

### Collaborative Filtering Implementation

* **Algorithm Choice:** Using Auto Diff allows the use of more powerful algorithms than simple Gradient Descent, such as the Adam optimization algorithm.
* **Cost Function:** The user must provide the code to compute the collaborative filtering cost function $J$ (which takes inputs like $\mathbf{x}, \mathbf{w}, \mathbf{b}$, normalized ratings $Y_{\text{norm}}$, and the regularization parameter $\lambda$).
* **Optimization Steps (Adam):**
    1.  Define the Adam optimizer (`keras.optimizers.Adam`).
    2.  Use the Gradient Tape to compute the cost $J$.
    3.  Compute the derivatives (`grads`).
    4.  Apply the gradients using the optimizer (`optimizer.apply_gradients`).

### Why Not Use Standard Keras (`model.compile`, `model.fit`)?

The collaborative filtering cost function does not fit neatly into the standard, pre-defined neural network layer types (like `Dense` layers) provided by Keras/TensorFlow, so the usual `model.compile` / `model.fit` recipe does not apply directly. Instead, the cost function is written by hand and Auto Diff computes its gradients, which is an effective general way to implement custom learning algorithms.

---

## Finding related items

The collaborative filtering algorithm, which learns item features from user ratings, can be used to identify related items based on the similarity of their learned feature vectors. The algorithm, however, suffers from limitations such as the cold start problem and an inability to easily integrate external side information.

The collaborative filtering algorithm automatically learns a feature vector $\mathbf{x}^{(i)}$ for every item $i$. Although these features (e.g., $x_1, x_2$) are often difficult for a human to interpret (they don't neatly correspond to genres like "action" or "romance"), they collectively capture the essence of the item.

To find items related to item $i$, the algorithm searches for other items $k$ whose feature vectors $\mathbf{x}^{(k)}$ are mathematically close to $\mathbf{x}^{(i)}$. The similarity (or dissimilarity) between two items' feature vectors is typically measured using the **squared distance** between them:

$$\text{Squared Distance} = \sum_{l=1}^{n} (x_l^{(k)} - x_l^{(i)})^2$$

By finding the items (movies, products, etc.) with the smallest squared distance, the system identifies and recommends the most similar items to the user.
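
A minimal NumPy sketch of this lookup is given below. The feature values are made up for illustration, and `most_similar_items` is a hypothetical helper name.

```python
import numpy as np

def most_similar_items(X, i, top_k=3):
    """Indices of the top_k items whose features are closest to item i."""
    sq_dist = np.sum((X - X[i]) ** 2, axis=1)   # squared distance to every item
    sq_dist[i] = np.inf                          # exclude the item itself
    return np.argsort(sq_dist)[:top_k]

# Assumed learned features for five items (illustrative values).
X = np.array([[0.9, 0.1],
              [1.0, 0.0],
              [0.1, 0.9],
              [0.0, 1.0],
              [0.8, 0.3]])

related = most_similar_items(X, i=0, top_k=2)
print(related)   # items 1 and 4 are nearest to item 0
```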

### Limitations of Collaborative Filtering

The collaborative filtering algorithm has two primary weaknesses:

#### 1. The Cold Start Problem

The algorithm struggles when it lacks sufficient data for either a new item or a new user.
* **New Items:** If a new movie or product is added to the catalog and few users have rated it yet, the algorithm cannot accurately determine its features $\mathbf{x}^{(i)}$ or recommend it effectively.
* **New Users:** Similarly, if a new user has rated only a few items, the system cannot accurately determine their preference parameters ($\mathbf{w}^{(j)}, b^{(j)}$) to give personalized predictions (though mean normalization can help provide reasonable initial guesses based on overall average ratings).

#### 2. Inability to Use Side Information

Collaborative filtering does not provide a natural mechanism to incorporate external data (side information) that is known about items or users.
* **Item Side Information:** This includes known facts about a movie like its genre, cast, director, budget, or studio.
* **User Side Information:** This includes user demographics (age, gender, location), stated preferences, or even behavioral cues like their web browser or whether they are using a mobile or desktop device.

Integrating this rich, external side information is necessary to improve accuracy and address the cold start problem more robustly.

### Next
The next step in recommender system development is **Content-Based Filtering**, which is designed to specifically address these limitations by leveraging side information.

---


## Vectorized Formulation for Collaborative Filtering in TensorFlow

To efficiently implement the collaborative filtering cost function using matrix operations in TensorFlow, we organize all learning parameters and ratings into matrix form.

### Notation

| Notation | Description | Dimension | Notes |
| :--- | :--- | :--- | :--- |
| $n_m$ | Number of movies/items | Scalar | |
| $n_u$ | Number of users | Scalar | |
| $n$ | Number of learned features | Scalar | |
| $y^{(i,j)}$ | Rating given by user $j$ on movie $i$ | Scalar | |
| $r^{(i,j)}$ | Binary indicator: 1 if rated, 0 otherwise | Scalar | Written $r(i, j)$ in earlier sections |
| $\mathbf{x}^{(i)}$ | Feature vector for movie $i$ | $n \times 1$ | |
| $\mathbf{w}^{(j)}$ | Parameter vector for user $j$ | $n \times 1$ | |
| $b^{(j)}$ | Bias parameter for user $j$ | Scalar | |
| $\mathbf{X}$ | Matrix of all item feature vectors | $n_m \times n$ | Rows are $(\mathbf{x}^{(i)})^T$ |
| $\mathbf{W}$ | Matrix of all user parameter vectors | $n_u \times n$ | Rows are $(\mathbf{w}^{(j)})^T$ |
| $\mathbf{b}$ | Vector of all user bias parameters | $1 \times n_u$ | Entries are $b^{(j)}$ |
| $\mathbf{Y}$ | Matrix of user ratings (normalized) | $n_m \times n_u$ | Elements are $y^{(i,j)}$ |
| $\mathbf{R}$ | Binary indicator matrix | $n_m \times n_u$ | Elements are $r^{(i,j)}$ |

### Matrix Definitions
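Here is what the learning parameters look like in matrix form. Indices start at 0 here, following Python's convention, whereas the sums in the cost function are written with indices starting at 1: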

$$
\mathbf{X} = 
\begin{bmatrix}
--- (\mathbf{x}^{(0)})^T --- \\
--- (\mathbf{x}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{x}^{(n_m-1)})^T --- \\
\end{bmatrix} , \quad
\mathbf{W} = 
\begin{bmatrix}
--- (\mathbf{w}^{(0)})^T --- \\
--- (\mathbf{w}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{w}^{(n_u-1)})^T --- \\
\end{bmatrix},\quad
\mathbf{ b} = 
\begin{bmatrix}
b^{(0)}  \\
b^{(1)} \\
\vdots \\
b^{(n_u-1)} \\
\end{bmatrix}^T
$$

### Vectorized Cost Function ($J$)

The vectorized collaborative filtering cost function includes the squared error between the predicted rating and the actual rating, along with regularization terms for the feature matrix $\mathbf{X}$ and the parameter matrix $\mathbf{W}$.

$$
J(\mathbf{X}, \mathbf{W}, \mathbf{b}) = \frac{1}{2} \sum_{(i,j): r^{(i,j)}=1} \left( (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (w_k^{(j)})^2
$$

### Python Implementation with TensorFlow

#### 1\. Cost Function Definition

The prediction matrix is calculated as the matrix product of the feature matrix $\mathbf{X}$ and the transpose of the parameter matrix $\mathbf{W}$, plus the bias vector $\mathbf{b}$, which is broadcast across the rows (movies).

```python
import tensorflow as tf

def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the vectorized cost for Collaborative Filtering.
    Args:
      X (tf.Variable (num_movies, num_features)): Matrix of item features
      W (tf.Variable (num_users, num_features)) : Matrix of user parameters
      b (tf.Variable (1, num_users))            : Vector of user bias parameters
      Y (ndarray (num_movies, num_users))       : Matrix of user ratings
      R (ndarray (num_movies, num_users))       : Binary indicator matrix (1 if rated)
      lambda_ (float): Regularization parameter
    Returns:
      J (scalar tf.Tensor): Cost
    """
    # Predictions are X @ W^T + b; W is (num_users, num_features), hence the transpose.
    # b of shape (1, num_users) broadcasts over the rows (movies).
    # Multiplying by R zeroes the error for unrated (movie, user) pairs.
    error_matrix = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y) * R
    
    # Squared Error Term
    squared_error = 0.5 * tf.reduce_sum(error_matrix**2)
    
    # Regularization Term
    regularization = (lambda_ / 2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    
    J = squared_error + regularization
    return J
```
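
To see why the matrix expression matches the per-pair formula, the NumPy sketch below compares `X @ W.T + b` with predictions computed one pair at a time. NumPy is used here only as a lightweight stand-in for the TensorFlow operations, and the random values and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, n = 4, 3, 2
X = rng.normal(size=(n_m, n))
W = rng.normal(size=(n_u, n))
b = rng.normal(size=(1, n_u))

# Vectorized predictions: entry (i, j) is w^(j) . x^(i) + b^(j).
# b has shape (1, n_u), so it is broadcast across all n_m rows.
P = X @ W.T + b

# Same predictions computed one pair at a time.
P_loop = np.zeros((n_m, n_u))
for i in range(n_m):
    for j in range(n_u):
        P_loop[i, j] = np.dot(W[j], X[i]) + b[0, j]

print(P.shape, np.allclose(P, P_loop))   # (4, 3) True
```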

#### 2\. Optimization using GradientTape (Auto Diff)

TensorFlow's `tf.GradientTape` is used to automatically compute the gradients (partial derivatives) of the cost function with respect to the trainable parameters ($\mathbf{X}$, $\mathbf{W}$, and $\mathbf{b}$).

```python
# Assumes X, W, and b are tf.Variables, and that Ynorm (mean-normalized ratings)
# and R (indicator matrix) are arrays of shape (num_movies, num_users).
# The learning rate below is an assumed value, not one prescribed by these notes.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-1)
iterations = 200
lambda_ = 1

for it in range(iterations):  # `it` avoids shadowing the built-in iter()
    # 1. Record the forward pass computation of the cost
    with tf.GradientTape() as tape:
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    # 2. Compute the gradients using Automatic Differentiation
    # Retrieves derivatives of cost_value with respect to the variables [X, W, b]
    grads = tape.gradient(cost_value, [X, W, b])

    # 3. Apply the gradients using the defined optimizer (here, Adam)
    optimizer.apply_gradients(zip(grads, [X, W, b]))

    # Log periodically
    if it % 20 == 0:
        print(f"Training loss at iteration {it}: {cost_value:0.1f}")
```