# Week 2: Recommender Systems
---
## Table of Contents
* [Making Recommendations](#Making-Recommendations)
* [Using per-item features](#Using-per-item-features)
* [Collaborative filtering algorithm](#Collaborative-filtering-algorithm)
* [Binary labels: favs, likes and clicks](#Binary-labels:-favs,-likes-and-clicks)

<br>

## Making Recommendations

This section introduces the topic of Recommender Systems, highlighting their significant commercial impact and setting up the basic framework and notation using the example of movie rating prediction.

---
### Commercial Importance
* **Widespread Use:** Recommender systems are used everywhere online (e.g., shopping sites like Amazon, streaming services like Netflix, food delivery apps).
* **High Value:** For many companies, a large fraction of sales and economic value is directly driven by the success of their recommender systems.
* **Academic vs. Commercial Attention:** The commercial impact of recommender systems is arguably far greater than the attention the topic receives in academia.

### Core Framework (Movie Rating Example)
The goal is to predict how users would rate movies they haven't yet watched (denoted by '?') to decide what to recommend.

| Item | Notation | Definition/Example |
| :--- | :--- | :--- |
| **Number of Users** | $n_u$ | In the example, $n_u = 4$ (Alice, Bob, Carol, Dave). |
| **Number of Items (Movies)** | $n_m$ | In the example, $n_m = 5$. |
| **Rating Indicator** | $r(i, j)$ | A binary value: $r(i, j) = 1$ if user $j$ has rated movie $i$; $0$ otherwise. |
| **Actual Rating** | $y^{(i, j)}$ | The rating (0 to 5 stars) given by user $j$ to movie $i$. (E.g., $y^{(3, 2)} = 4$). |
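
As a rough illustration of how this notation maps onto data, the sketch below stores the ratings as a nested list with `None` standing in for '?'. Only the entry $y^{(3, 2)} = 4$ comes from the table above; every other rating, and the helper names `r` and `y`, are assumptions made up for the example.

```python
# Hypothetical ratings table: rows are movies i, columns are users j.
# None stands for '?' (the user has not rated the movie).
# Only the entry for movie 3, user 2 (value 4) comes from the notes;
# every other number is invented for illustration.
users = ["Alice", "Bob", "Carol", "Dave"]
Y = [
    [5, 5, 0, 0],
    [5, None, None, 0],
    [None, 4, 0, None],
    [0, 0, 5, 4],
    [0, 0, 5, None],
]

n_m = len(Y)       # number of movies
n_u = len(users)   # number of users

def r(i, j):
    """Return 1 if user j has rated movie i, else 0 (1-based indices as in the notes)."""
    return 0 if Y[i - 1][j - 1] is None else 1

def y(i, j):
    """Return the rating user j gave movie i (only meaningful when r(i, j) == 1)."""
    return Y[i - 1][j - 1]

print(n_u, n_m)          # 4 5
print(r(3, 2), y(3, 2))  # 1 4
print(r(3, 1))           # 0
```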

### Next Step
* The subsequent lesson will begin developing an algorithm to predict the missing ratings. The first model will temporarily assume that **features (extra information)** about the movies (e.g., whether it is a romance movie or an action movie) are already available. Later in the notes, we will address how to build the system when these explicit movie features are not available.

<br>

## Using per-item features

This section details the first approach to building a recommender system: **using pre-existing item features** to create a personalized linear regression model for each user.

---

### Key Bullet Points: Recommender Systems with Item Features

#### Framework and Notation
* **Initial Assumption:** We have pre-defined **features ($X$)** for each item (movie), such as $x_1$ (Romance level) and $x_2$ (Action level).
    * $n_u$: Number of users (e.g., 4).
    * $n_m$: Number of movies/items (e.g., 5).
    * $n$: Number of features (e.g., 2).
    * $r(i, j) = 1$: User $j$ has rated movie $i$.
    * $y^{(i, j)}$: The actual rating given by user $j$ to movie $i$.

#### The Model: Personalized Linear Regression
* The system fits a **separate linear regression model for each user $j$** to predict their rating for any movie $i$.
* **Prediction Formula:**
    $$\text{Prediction for } y^{(i, j)} = \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}$$
    * $\mathbf{w}^{(j)}$ and $b^{(j)}$ are the unique parameters (weights and bias) learned for **user $j$**.
    * $\mathbf{x}^{(i)}$ is the feature vector for **movie $i$**.
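
The prediction formula can be sketched in a few lines of plain Python. The parameter values and the feature vector below are hypothetical, chosen only to show the arithmetic; the function name `predict_rating` is likewise an assumption.

```python
def predict_rating(w_j, b_j, x_i):
    """Linear prediction w^(j) . x^(i) + b^(j) for one user and one movie."""
    return sum(w_k * x_k for w_k, x_k in zip(w_j, x_i)) + b_j

# Hypothetical values: a user who likes romance and is indifferent to action.
w_1 = [5.0, 0.0]
b_1 = 0.0
x_3 = [0.99, 0.0]   # made-up features: x_1 = romance level, x_2 = action level

prediction = predict_rating(w_1, b_1, x_3)
print(prediction)   # close to 4.95 stars
```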

### The Cost Function
The objective is to learn the parameters ($\mathbf{w}^{(j)}$ and $b^{(j)}$) for **all users** simultaneously by minimizing a regularized squared-error cost function.

* **Cost Function for All Users ($J$):** The cost is the sum of the individual cost functions for every user.
    $$J(\mathbf{w}^{(1)}, b^{(1)}, \dots, \mathbf{w}^{(n_u)}, b^{(n_u)}) = \sum_{j=1}^{n_u} J(\mathbf{w}^{(j)}, b^{(j)})$$

* **Individual User Cost ($J(\mathbf{w}^{(j)}, b^{(j)})$):**
    $$J(\mathbf{w}^{(j)}, b^{(j)}) = \frac{1}{2} \sum_{i: r(i, j)=1} \left( (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (w_k^{(j)})^2$$
    * The sum $\sum_{i: r(i, j)=1}$ means we only calculate the error for movies that **user $j$ has actually rated**.
    * The second term is standard **regularization** to prevent overfitting. (Note: The normalization constant $1/m^{(j)}$ is omitted for convenience, as it doesn't change the parameters at the minimum; see Bonus section below).
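
A minimal sketch of the two cost functions, assuming ratings are stored per user with `None` marking unrated movies. The data and the function names `user_cost` and `total_cost` are made up for illustration.

```python
def user_cost(w_j, b_j, X, y_j, lam):
    """Regularized squared-error cost for one user.

    X   : list of feature vectors, one per movie
    y_j : this user's ratings, with None for unrated movies (r(i, j) = 0)
    lam : regularization strength lambda
    """
    squared_error = 0.0
    for x_i, y_ij in zip(X, y_j):
        if y_ij is None:          # skip movies the user has not rated
            continue
        prediction = sum(w_k * x_k for w_k, x_k in zip(w_j, x_i)) + b_j
        squared_error += (prediction - y_ij) ** 2
    regularization = sum(w_k ** 2 for w_k in w_j)   # the bias b is not regularized
    return 0.5 * squared_error + 0.5 * lam * regularization

def total_cost(W, b, X, Y_by_user, lam):
    """Sum of the per-user costs over all users."""
    return sum(user_cost(W[j], b[j], X, Y_by_user[j], lam) for j in range(len(W)))

# Made-up data: 3 movies with 2 features each, 2 users.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y_by_user = [[5.0, None, 4.0], [0.0, 5.0, None]]
W = [[4.0, 0.0], [0.0, 5.0]]
b = [0.0, 0.0]

cost_user_1 = user_cost(W[0], b[0], X, Y_by_user[0], lam=1.0)
cost_all = total_cost(W, b, X, Y_by_user, lam=1.0)
print(cost_user_1, cost_all)   # 8.5 21.0
```

For user 1, the only error is on the first movie (predicted 4, rated 5), giving $0.5 \cdot 1 + 0.5 \cdot 16 = 8.5$.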

### Next Challenge
* The current method relies on having **pre-defined features ($\mathbf{x}^{(i)}$)** for every item.
* The next section will explore a modification of this algorithm—**Collaborative Filtering**—which works even when these detailed item features are **not available** beforehand.

## Bonus: Dropping $m^{(j)}$ and model performance
Dropping $m^{(j)}$, the number of movies rated by user $j$ (that user's number of training examples), from the denominator of the cost function **does not change the learned parameters**. The reason is rooted in how optimization works.

---

### The Role of the Denominator in Optimization

The individual cost function for user $j$ (before dropping $m^{(j)}$) typically looks like this:

$$J(\mathbf{w}^{(j)}, b^{(j)}) = \frac{1}{2m^{(j)}} \sum_{i: r(i, j)=1} \left( \text{Prediction} - \text{Actual Rating} \right)^2 + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^{n} (w_k^{(j)})^2$$

### 1. It's Just a Scaling Constant

When optimizing the model, the goal is to find the set of parameters ($\mathbf{w}^{(j)}$ and $b^{(j)}$) that **minimize** the value of $J$.

* In this context, $\frac{1}{m^{(j)}}$ is a **constant scaling factor** (the $\frac{1}{2}$ is kept in the simplified cost). It is a fixed number determined by the training data *before* the optimization process begins.
* Multiplying or dividing the entire cost function by a positive constant only **scales the cost function vertically**; it **does not change the location** of the minimum point.

### 2. The Minimum Remains the Same

Imagine a simple parabolic function, $f(x) = x^2$. The minimum occurs at $x=0$.
If you scale it by a constant $c=5$, the new function is $g(x) = 5x^2$. The minimum still occurs at **$x=0$**.

In the recommender system:

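$$\text{arg min}_{\mathbf{w}, b} \, J_{\text{original}}(\mathbf{w}, b) = \text{arg min}_{\mathbf{w}, b} \left[ C \cdot J_{\text{simplified}}(\mathbf{w}, b) \right] = \text{arg min}_{\mathbf{w}, b} \, J_{\text{simplified}}(\mathbf{w}, b)$$

Where $C = \frac{1}{m^{(j)}} > 0$ is the constant: $J_{\text{original}} = C \cdot J_{\text{simplified}}$, because the factor $\frac{1}{2}$ appears in both versions.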

Since both the squared error term and the regularization term are multiplied by the same constant, the parameter values ($\mathbf{w}^{(j)}$ and $b^{(j)}$) that make the original function minimum are **the exact same values** that make the simplified function minimum.
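
The claim can be checked numerically. The sketch below uses a made-up one-feature dataset and a coarse grid search (chosen for simplicity, not as a recommended optimizer) to find the best $w$ under both versions of the cost. The bias is fixed at 0 to keep the example short.

```python
# Made-up one-feature data for a single user: (feature value, rating) pairs.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
m = len(data)
lam = 0.5

def simplified_cost(w):
    """(1/2) * sum of squared errors + (lambda/2) * w^2, with the bias fixed at 0."""
    return 0.5 * sum((w * x - y) ** 2 for x, y in data) + 0.5 * lam * w ** 2

def original_cost(w):
    """The same cost divided by m, i.e. with 1/(2m) and lambda/(2m)."""
    return simplified_cost(w) / m

# Coarse grid search over w in [0, 5] with step 0.01.
grid = [k / 100 for k in range(501)]
best_simplified = min(grid, key=simplified_cost)
best_original = min(grid, key=original_cost)

print(best_simplified == best_original)   # True
```

The two cost values differ by a factor of $m$ at every grid point, but the grid point with the lowest cost is the same.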

### 3. Impact on Gradient Descent

This simplification is also beneficial when using Gradient Descent:

* The gradients (derivatives) of the cost function are used to determine the step size and direction.
* Dropping the constant $\frac{1}{m^{(j)}}$ just **scales the magnitude of the gradient** by $m^{(j)}$. We compensate for this by simply adjusting the **learning rate ($\alpha$)** used in Gradient Descent. If the cost function is scaled up, we just use a smaller learning rate, and vice versa.
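
To make the learning-rate trade-off concrete, the sketch below runs gradient descent on a made-up dataset twice: once on the summed cost with learning rate $\alpha$, and once on the cost divided by $m$ with learning rate $\alpha \cdot m$. Regularization is left out for brevity, and the data and step counts are assumptions chosen so that the run converges.

```python
# Made-up data generated by rating = 2 * x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
m = len(data)

def grad_simplified(w, b):
    """Gradient of (1/2) * sum of squared errors (regularization omitted for brevity)."""
    dw = sum((w * x + b - y) * x for x, y in data)
    db = sum((w * x + b - y) for x, y in data)
    return dw, db

def grad_original(w, b):
    """Gradient of the same cost divided by m."""
    dw, db = grad_simplified(w, b)
    return dw / m, db / m

def gradient_descent(grad, alpha, steps):
    """Plain gradient descent from w = 0, b = 0 with simultaneous updates."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = grad(w, b)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

alpha = 0.02
w_a, b_a = gradient_descent(grad_simplified, alpha, steps=2000)
w_b, b_b = gradient_descent(grad_original, alpha * m, steps=2000)

print(w_a, b_a)   # approaches w = 2, b = 1
print(w_b, b_b)   # same trajectory, hence the same result
```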

In summary, the normalization constant is often included in academic settings for statistical correctness (to compute an average loss), but for the purely practical goal of finding the optimal parameters, it can be safely dropped to simplify the mathematical expression.

### How is this different from cost function in linear or logistic regression?

While you *could* technically drop the division by $m$ (the number of training examples) in the denominator of the standard Linear Regression cost function, it is almost always kept.

Here is why $1/m$ (or $1/2m$) is typically included in **Linear Regression** but often dropped in **Recommender Systems** (like the individual user cost):

---

### Why the $1/m$ Term is Kept in Linear Regression

The standard cost function for linear regression (Mean Squared Error, MSE) is:

$$J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( (\mathbf{w} \cdot \mathbf{x}^{(i)} + b) - y^{(i)} \right)^2$$

### 1. Statistical Meaning (Averaging)
The most important reason is to define the cost as the **average loss per example**.

* **Interpretability:** By dividing by $m$, the value of $J(\mathbf{w}, b)$ becomes the **average squared error**. This gives the cost function an intuitive meaning: if you add more data, the cost value doesn't automatically skyrocket; it represents the error regardless of the dataset size.
* **Comparison:** It allows you to **compare models and performance across different datasets** of varying sizes. A model trained on 100 examples with an MSE of 5 is directly comparable to a model trained on 1,000 examples with an MSE of 5.
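
A small sketch of why the average is easier to compare than the sum. The data is artificial: every prediction is off by exactly 2, so each squared error is 4 regardless of dataset size.

```python
def sum_and_mean_squared_error(predictions, targets):
    """Return (sum of squared errors, mean squared error)."""
    errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    total = sum(errors)
    return total, total / len(errors)

# Artificial targets; each prediction misses its target by exactly 2.
small_targets = [float(k) for k in range(100)]
large_targets = [float(k) for k in range(1000)]
small_preds = [t + 2.0 for t in small_targets]
large_preds = [t + 2.0 for t in large_targets]

small_sum, small_mean = sum_and_mean_squared_error(small_preds, small_targets)
large_sum, large_mean = sum_and_mean_squared_error(large_preds, large_targets)

print(small_sum, small_mean)   # 400.0 4.0
print(large_sum, large_mean)   # 4000.0 4.0
```

The summed error grows tenfold with the dataset, while the mean stays at 4.0 in both cases.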

### 2. Standard Practice and Consistency
The MSE is the standard, well-established error metric. Keeping the division by $m$ aligns with textbook definitions and ensures that the final reported loss value is the actual MSE, which is important for evaluation and publication.

---

### Why the $1/m^{(j)}$ Term is Dropped in Recommender Systems

In the individual user cost for the recommender system, the term $\frac{1}{2m^{(j)}}$ is often simplified to just $\frac{1}{2}$ (or dropped entirely) for two main reasons:

### 1. Simplification for Joint Optimization
In the recommender system, the overall cost function $J_{\text{overall}}$ is the **sum of all individual user costs** $J(\mathbf{w}^{(j)}, b^{(j)})$:

$$J_{\text{overall}} = \sum_{j=1}^{n_u} \left( \frac{1}{2} \sum_{i: r(i, j)=1} (\dots)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (w_k^{(j)})^2 \right)$$

When calculating the gradient for the whole system, having different division factors ($m^{(j)}$) for every single user's loss term and every single user's regularization term makes the algebra unnecessarily complex. Dropping the $m^{(j)}$ allows for a cleaner, unified cost function for the entire system, where we are primarily concerned with **minimization**, not calculating a statistically precise average.

### 2. Relative Size of Regularization
In recommender systems, the balance between the error term and the regularization term (controlled by $\lambda$) is crucial.

Removing $1/m^{(j)}$ from only one of the two terms would change their relative weight and therefore move the minimum. Removing it from both terms, as done here, is a pure rescaling of the cost: it simplifies the expression without affecting the minimum.

---

In essence:

| Context | Purpose of $J$ | Why $1/m$ is Kept/Dropped |
| :--- | :--- | :--- |
| **Linear Regression** | **Evaluation and Comparison (MSE)** | Kept, because it defines the **average error (MSE)**, which is the standard, comparable metric. |
| **Recommender System** | **Optimization** | Dropped, because $m^{(j)}$ is a **constant** that unnecessarily complicates the overall cost function when summing across users, and it doesn't change the optimal parameter values. |

## Collaborative filtering algorithm

This section introduces **Collaborative Filtering**, a powerful technique for recommender systems where the item features ($\mathbf{x}$) are **learned from the user ratings** rather than being provided in advance.

---

### Key Bullet Points: Collaborative Filtering Algorithm

#### 1. The Challenge: Learning Item Features ($\mathbf{x}$)
* **Previous Model:** Assumed movie features ($\mathbf{x}$) were known (e.g., Romance, Action level).
* **New Approach:** When features are unknown, the ratings provided by **multiple users** on the same item can be leveraged to learn what those item features ($\mathbf{x}$) should be.
    * **Why it Works:** Having ratings from several users (each with known preference parameters $\mathbf{w}$ and $b$) allows the system to infer the features of an unfeatured movie that best explain those ratings.
    * **Collaborative Filtering Name:** This relies on the "collaboration" of ratings from multiple users on the same item.

#### 2. Cost Function for Learning Features ($\mathbf{x}$)
If the user preference parameters ($\mathbf{w}^{(j)}, b^{(j)}$) are temporarily fixed, the features for a single movie $i$ ($\mathbf{x}^{(i)}$) are learned by minimizing the cost function:

$$\min_{\mathbf{x}^{(i)}} J(\mathbf{x}^{(i)}) = \frac{1}{2} \sum_{j: r(i, j)=1} \left( (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} (x_k^{(i)})^2$$
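
The sketch below evaluates this cost for one movie under two candidate feature vectors, with the user parameters held fixed. The user parameters, ratings, and the name `movie_feature_cost` are hypothetical. The point is only that the feature vector that better explains the ratings gets the lower cost.

```python
def movie_feature_cost(x_i, W, b, y_i, lam):
    """Cost for one movie's features x^(i), with user parameters held fixed.

    W, b : per-user weights and biases
    y_i  : ratings of this movie by each user, None where r(i, j) = 0
    """
    squared_error = 0.0
    for w_j, b_j, y_ij in zip(W, b, y_i):
        if y_ij is None:          # only users who rated the movie contribute
            continue
        prediction = sum(w_k * x_k for w_k, x_k in zip(w_j, x_i)) + b_j
        squared_error += (prediction - y_ij) ** 2
    return 0.5 * squared_error + 0.5 * lam * sum(x_k ** 2 for x_k in x_i)

# Hypothetical users: the first two like romance, the last two like action.
W = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]]
b = [0.0, 0.0, 0.0, 0.0]
y_1 = [5.0, 5.0, 0.0, 0.0]   # made-up ratings of one movie by the four users

cost_romance = movie_feature_cost([1.0, 0.0], W, b, y_1, lam=0.0)
cost_action = movie_feature_cost([0.0, 1.0], W, b, y_1, lam=0.0)
print(cost_romance, cost_action)   # 0.0 50.0
```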

#### 3. The Full Collaborative Filtering Cost Function
The final algorithm combines the objective of learning user preferences ($\mathbf{w}, b$) and learning item features ($\mathbf{x}$) into a single unified cost function ($J$):

* **Minimization:** The algorithm simultaneously minimizes $J$ with respect to all parameters: the user parameters ($\mathbf{w}^{(j)}, b^{(j)}$ for all users $j$) and the movie features ($\mathbf{x}^{(i)}$ for all movies $i$).
* **Unified Cost ($J$):** This combines the prediction error and the regularization terms for both users and movies.

$$J(\mathbf{w}, \mathbf{b}, \mathbf{x}) = \frac{1}{2} \sum_{(i, j): r(i, j)=1} \left( (\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}) - y^{(i, j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (w_k^{(j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$$
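
A direct, unvectorized transcription of this cost into Python might look as follows. The tiny dataset and the name `cofi_cost` are assumptions, and a practical implementation would likely use a numerical library rather than nested loops.

```python
def cofi_cost(X, W, b, Y, lam):
    """Collaborative filtering cost J(w, b, x).

    X : n_m feature vectors (one per movie)
    W : n_u weight vectors (one per user), b : n_u biases
    Y : n_m x n_u ratings, Y[i][j] is None where r(i, j) = 0
    """
    squared_error = 0.0
    for i, x_i in enumerate(X):
        for j, w_j in enumerate(W):
            if Y[i][j] is None:   # sum only over pairs with r(i, j) = 1
                continue
            prediction = sum(w_k * x_k for w_k, x_k in zip(w_j, x_i)) + b[j]
            squared_error += (prediction - Y[i][j]) ** 2
    reg_w = sum(w_k ** 2 for w_j in W for w_k in w_j)
    reg_x = sum(x_k ** 2 for x_i in X for x_k in x_i)
    return 0.5 * squared_error + 0.5 * lam * (reg_w + reg_x)

# Made-up data: 2 movies, 2 users, 2 features.
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[4.0, 0.0], [0.0, 5.0]]
b = [0.0, 0.0]
Y = [[5.0, None], [None, 5.0]]

cost = cofi_cost(X, W, b, Y, lam=1.0)
print(cost)   # 22.0
```

Here the squared-error part is $0.5 \cdot 1$ and the regularization part is $0.5 \cdot (41 + 2)$.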

#### 4. Optimization
* **Method:** Gradient Descent or other optimization algorithms are used to minimize the cost function $J$.
* **Parameters:** In this full formulation, both the user preferences ($\mathbf{w}, \mathbf{b}$) and the item features ($\mathbf{x}$) are treated as parameters to be learned and are updated iteratively.

$$
\begin{aligned}
w_k^{(j)} &= w_k^{(j)} - \alpha \frac{\partial}{\partial w_k^{(j)}}J(\mathbf{w}, \mathbf{b}, \mathbf{x}) \\
b^{(j)} &= b^{(j)} - \alpha \frac{\partial}{\partial b^{(j)}}J(\mathbf{w}, \mathbf{b}, \mathbf{x}) \\
x_k^{(i)} &= x_k^{(i)} - \alpha \frac{\partial}{\partial x_k^{(i)}}J(\mathbf{w}, \mathbf{b}, \mathbf{x})
\end{aligned}
$$
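
The updates can be sketched with explicit gradients derived from the unified cost. This is a minimal, unvectorized illustration under stated assumptions: the ratings, learning rate, number of steps, and random initialization are all made up, and the function names are hypothetical.

```python
import random

def cofi_cost(X, W, b, Y, lam):
    """Collaborative filtering cost J(w, b, x); Y[i][j] is None where r(i, j) = 0."""
    n = len(X[0])
    err_sq = sum(
        (sum(W[j][k] * X[i][k] for k in range(n)) + b[j] - Y[i][j]) ** 2
        for i in range(len(X)) for j in range(len(W)) if Y[i][j] is not None
    )
    reg = sum(v ** 2 for row in W for v in row) + sum(v ** 2 for row in X for v in row)
    return 0.5 * err_sq + 0.5 * lam * reg

def gradient_step(X, W, b, Y, lam, alpha):
    """One simultaneous gradient descent update of all w, b and x."""
    n = len(X[0])
    # Start each gradient with the derivative of its regularization term.
    grad_W = [[lam * v for v in row] for row in W]
    grad_X = [[lam * v for v in row] for row in X]
    grad_b = [0.0] * len(W)
    for i in range(len(X)):
        for j in range(len(W)):
            if Y[i][j] is None:   # only rated pairs contribute
                continue
            err = sum(W[j][k] * X[i][k] for k in range(n)) + b[j] - Y[i][j]
            for k in range(n):
                grad_W[j][k] += err * X[i][k]
                grad_X[i][k] += err * W[j][k]
            grad_b[j] += err
    # All gradients were computed from the old values, so the update is simultaneous.
    new_W = [[w - alpha * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grad_W)]
    new_X = [[x - alpha * g for x, g in zip(xr, gr)] for xr, gr in zip(X, grad_X)]
    new_b = [bj - alpha * g for bj, g in zip(b, grad_b)]
    return new_X, new_W, new_b

# Made-up ratings: 5 movies x 4 users, None marks unrated pairs.
Y = [
    [5, 5, 0, 0],
    [5, None, None, 0],
    [None, 4, 0, None],
    [0, 0, 5, 4],
    [0, 0, 5, None],
]
n_m, n_u, n = len(Y), len(Y[0]), 2

# Small random initial values; all-zero w and x would give zero gradients for both.
rng = random.Random(0)
X = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n_m)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n_u)]
b = [0.0] * n_u

initial_cost = cofi_cost(X, W, b, Y, lam=0.1)
for _ in range(1000):
    X, W, b = gradient_step(X, W, b, Y, lam=0.1, alpha=0.005)
final_cost = cofi_cost(X, W, b, Y, lam=0.1)
print(initial_cost, final_cost)   # the cost should drop substantially
```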

### Next Step
The next discussion will address a generalization of this model to systems using **binary labels** (e.g., like/dislike) instead of continuous star ratings.

## Binary labels: favs, likes and clicks

This section explains how to adapt the collaborative filtering algorithm from predicting numeric star ratings (0 to 5 stars) to predicting **binary labels** (like/dislike, purchase/not purchase), using a method analogous to moving from linear regression to logistic regression.

---

### Key Bullet Points: Collaborative Filtering with Binary Labels

#### 1. Binary Label Context
* **Problem:** Many recommender systems deal with binary labels (1 or 0) rather than star ratings.
* **Label Meanings (Engagement):**
    * **1 (Engaged):** User liked, purchased, favorited, clicked, or spent a minimum time (e.g., 30 seconds) on an item after exposure.
    * **0 (Not Engaged):** User did not like, did not purchase, or left quickly after being exposed to the item.
    * **? (Question Mark):** The user was not yet exposed to the item (no rating/engagement data).
* **Goal:** Predict the probability that a user will like or engage with a new item (the '?' items) to decide what to recommend.

#### 2. The Model: Logistic Regression Analogy
* **Prediction Shift:** The model shifts from predicting a numerical rating to predicting a probability of engagement.
* **Logistic Function:** The linear combination of user preferences ($\mathbf{w}^{(j)}$) and item features ($\mathbf{x}^{(i)}$) is passed through the logistic function ($g$) (also known as the sigmoid function).
    * **Probability Prediction:**
    $$\text{P}(y^{(i, j)}=1) = g(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)})$$
    * Where $g(z) = \frac{1}{1 + e^{-z}}$.
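
A minimal sketch of the probability prediction, using only the standard library. The parameter values and the name `predict_engagement` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_engagement(w_j, b_j, x_i):
    """Probability that user j engages with item i: g(w^(j) . x^(i) + b^(j))."""
    z = sum(w_k * x_k for w_k, x_k in zip(w_j, x_i)) + b_j
    return sigmoid(z)

p_neutral = predict_engagement([0.0, 0.0], 0.0, [1.0, 1.0])   # z = 0
p_likely = predict_engagement([2.0, 0.0], 1.0, [1.0, 0.0])    # z = 3

print(p_neutral)   # 0.5
print(p_likely)    # roughly 0.95
```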

#### 3. The Cost Function: Binary Cross-Entropy
* **Cost Function Modification:** Since the output is a probability and the labels are binary, the squared error cost function (used for ratings) is replaced with the Binary Cross-Entropy Loss (or log loss), which is standard for logistic regression.
* **Loss for a Single Example:** With $f(\mathbf{x}) = g(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)})$ the predicted probability and $y = y^{(i, j)}$ the binary label:
    $$L(f(\mathbf{x}), y) = -y \log(f(\mathbf{x})) - (1-y) \log(1-f(\mathbf{x}))$$
* **Overall Binary Collaborative Filtering Cost ($J$):** The total cost function sums this binary cross-entropy loss over all user-item pairs where a label exists ($r(i, j)=1$), plus the same regularization terms as before for $\mathbf{w}$ and $\mathbf{x}$ (the biases $b^{(j)}$ are not regularized).
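
The overall cost can be sketched as below. The regularization is assumed to take the same form as in the star-rating cost (on $\mathbf{w}$ and $\mathbf{x}$ only). The labels and the name `binary_cofi_cost` are made up for illustration.

```python
import math

def binary_cofi_cost(X, W, b, Y, lam):
    """Binary cross-entropy collaborative filtering cost.

    Y[i][j] is 1 or 0 where a label exists, and None where r(i, j) = 0.
    Regularization on w and x is assumed to match the star-rating cost.
    """
    loss = 0.0
    for i, x_i in enumerate(X):
        for j, w_j in enumerate(W):
            y_ij = Y[i][j]
            if y_ij is None:      # no exposure, so no loss term
                continue
            z = sum(w_k * x_k for w_k, x_k in zip(w_j, x_i)) + b[j]
            f = 1.0 / (1.0 + math.exp(-z))   # predicted probability of engagement
            loss += -y_ij * math.log(f) - (1 - y_ij) * math.log(1.0 - f)
    reg_w = sum(w_k ** 2 for w_j in W for w_k in w_j)
    reg_x = sum(x_k ** 2 for x_i in X for x_k in x_i)
    return loss + 0.5 * lam * (reg_w + reg_x)

# Made-up labels: 2 items, 2 users, 3 observed pairs.
Y = [[1, None], [0, 1]]
X = [[0.0, 0.0], [0.0, 0.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

# With all parameters at zero every prediction is 0.5, so each loss term is log(2).
cost = binary_cofi_cost(X, W, b, Y, lam=1.0)
print(cost, 3 * math.log(2))
```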

#### 4. Generalization
* This generalization significantly opens up the set of applications that can be addressed by collaborative filtering, allowing the algorithm to work with implicit feedback (like clicks or viewing time) rather than requiring explicit user ratings.