Recommendation Systems (Matrix Factorization Recommenders)

* Objectives: 
    * Where are recommenders used?
    * What does our dataset look like?
    * What are the high-level approaches to building a recommender?
        * Content-based
        * Collaborative filtering
        * Matrix factorization
    * How do we evaluate our recommender?
    * How to deal with "cold start"?
    * What are the computational performance concerns?

1) Motivation / Basics
* Where are recommenders used?
    * Amazon item recommendation
    * Pandora music recommendation
    * Coursera course recommendation
    * Netflix show/movie recommendation
* Business goal for recommender systems to answer:
    * Name a business that cares about each of these questions, and indicate why they care:
        * What will the user **like**?
        * What will the user **buy**?
        * What will the user **click**?
* Netflix (Kaggle-style) Competition From Oct. 2006 - July 2009:
    * Goal: Beat Netflix's own recommender by 10%
    * Length: Took almost 3 years
    * The winning team used gradient boosted decision trees over the predictions of 500 other models
    * Netflix never deployed the winning algorithm!
* Lesson for recommenders: learn to build, evaluate, **and** deploy your recommender
* Types of High-Level Approaches to Building Recommenders:
    1. **Popularity** - makes the **same** recommendation to **every** user, based only on the popularity of an item
        * e.g. Twitter "moments"
    2. **Content-based (or Content Filtering)** - predictions are made based on the properties/characteristics of an item. User behavior is **not** considered.
        * e.g. Pandora Radio
    3. **Collaborative Filtering** - only considers past user behavior (**not** content properties)
        * **User-User Similarity**
        * **Item-Item Similarity**
        * e.g. Netflix & Amazon Recommendations
        * e.g. Google Ads
        * e.g. Facebook Ads, Search, Friends Recommendations, News Feed, Trending News, Rank Notifications, Rank Comments
    4. **Matrix Factorization** - finding latent features (or factors)
* What does our dataset look like?
    * Setup For **Sparse Ratings Matrix** (or **Utility Matrix**)
    * Matrix can be very, very sparse (99% of entries unknown)
* Dataset Type: Has **explicit** ratings and mostly missing values    
    
|              |  Movie 1 |  Movie 2 |  Movie 3 | $\cdots$ | Movie $n$ |
|:------------:|:--------:|:--------:|:--------:|:--------:|:---------:|
| **User 1**   | 4        | ?        | ?        | $\cdots$ | 1         |
| **User 2**   | 3        | 3        | 2        | $\cdots$ | 2         |
| **User 3**   | ?        | 3        | ?        | $\cdots$ | ?         |
| $\vdots$     | $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$  |
| **User $m$** | ?        | 5        | 4        | $\cdots$ | 5         |
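
As a rough illustration of how such a matrix might be held in memory, the sketch below stores only the known ratings and treats anything absent as unknown. The triples loosely mirror the table above; the names `ratings`, `build_utility`, and `sparsity` are hypothetical and not part of any library.

```python
# Hypothetical (user, item, rating) triples; only the known ratings are stored.
ratings = [
    ("user1", "movie1", 4), ("user1", "movie4", 1),
    ("user2", "movie1", 3), ("user2", "movie2", 3),
    ("user2", "movie3", 2), ("user2", "movie4", 2),
    ("user3", "movie2", 3),
]


def build_utility(triples):
    """Return {user: {item: rating}}; a missing key means unknown, not zero."""
    utility = {}
    for user, item, rating in triples:
        utility.setdefault(user, {})[item] = rating
    return utility


def sparsity(utility, n_users, n_items):
    """Fraction of the n_users x n_items matrix that is unknown."""
    known = sum(len(row) for row in utility.values())
    return 1 - known / (n_users * n_items)


utility = build_utility(ratings)
print(utility["user1"].get("movie2"))  # None, because user1 never rated movie2
print(round(sparsity(utility, 3, 4), 3))
```

Keeping "unknown" distinct from "zero" matters later: a factorization that treats the gaps as zeros will try to fit them.
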

* Dataset Type: Has **implicit** ratings and mostly missing values
![implicit_ratings](implicit_ratings.png)

2) Collaborative Filtering - only considers past user behavior (**not** content properties)
* **User-User Similarities** - examine all pairs of **users** and calculate the similarity of their row vectors (e.g. Euclidean distance)
![user_user](user_user.png)
* **Item-Item Similarities** - examine all pairs of **items** and calculate the similarity of their column vectors (e.g. cosine similarity)
![item_item](item_item.png)
* Does user-user or item-item algorithm have better efficiency?
    * Let $m$ = # of users and $n$ = # of items
    * Compute the similarities of all pairs
    * Express the cost of each algorithm in Big-$O$ notation
        * Big $O$ describes how the execution time or the space used (e.g. in memory or on disk) by an algorithm grows with the size of the input
        * **User-User** = $O(m^2n)$
        * **Item-Item** = $O(mn^2)$
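
To make the two complexity figures concrete, here is a small counting sketch. It assumes each pairwise comparison costs one operation per vector entry; `pairwise_cost` and the sizes used are hypothetical.

```python
def pairwise_cost(n_vectors, vector_length):
    """Approximate operation count for comparing every pair of vectors once."""
    n_pairs = n_vectors * (n_vectors - 1) // 2
    return n_pairs * vector_length


m, n = 1000, 100  # hypothetical: 1,000 users and 100 items
user_user = pairwise_cost(m, n)  # pairs of users, each vector has n entries
item_item = pairwise_cost(n, m)  # pairs of items, each vector has m entries
print(user_user, item_item)
```

With many more users than items, as in this example, the item-item computation is the cheaper of the two.
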
* Types of Similarity Metrics:
    * **Euclidean Distance**
        * Distance equation: $$dist(a,b)=\Vert a-b \Vert=\sqrt{\sum_{i}(a_i-b_i)^2}$$
        * Similarity equation: $$similarity(a,b)=\frac{1}{1+dist(a,b)}$$
    * **Cosine Similarity**
        * Definition: $$\cos(\theta_{a,b})=\frac{a \cdot b}{\Vert a \Vert \Vert b \Vert}=\frac{\sum_{i}a_i b_i}{\sqrt{\sum_{i}a_i^2}\sqrt{\sum_{i}b_i^2}}$$
        * Standardized similarity equation: $$similarity(a,b)=0.5+0.5 \times \cos(\theta_{a,b})$$
    * **Pearson's Correlation**
        * Definition: $$pearson(a,b)=\frac{cov(a,b)}{std(a) \times std(b)}=\frac{\sum_{i}(a_i-\bar{a})(b_i-\bar{b})}{\sqrt{\sum_{i}(a_i-\bar{a})^2}\sqrt{\sum_{i}(b_i-\bar{b})^2}}$$
        * Standardized similarity equation: $$similarity(a,b)=0.5+0.5 \times pearson(a,b)$$
    * **Jaccard Index**
        * Similarity equation: $$similarity(a,b)=\frac{|U_a \cap U_b|}{|U_a \cup U_b|}$$
        * $U_k$ denotes the set of users who rated item $k$
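
The four metrics above can be sketched in plain Python as follows. This is a minimal version that assumes the two vectors have equal length, contain no missing values, and have non-zero norm and variance; the function names are illustrative only.

```python
import math


def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance), so identical vectors score 1."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1 / (1 + dist)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, rescaled from [-1, 1] to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 + 0.5 * dot / norms


def pearson_similarity(a, b):
    """Pearson correlation, rescaled from [-1, 1] to [0, 1]."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread = math.sqrt(sum((x - mean_a) ** 2 for x in a)) * math.sqrt(
        sum((y - mean_b) ** 2 for y in b)
    )
    return 0.5 + 0.5 * cov / spread


def jaccard_similarity(users_a, users_b):
    """Overlap between the sets of users who rated each item."""
    return len(users_a & users_b) / len(users_a | users_b)


print(pearson_similarity([4, 3, 5], [2, 1, 3]))  # second vector is the first minus 2
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
```

Note that Pearson ignores a constant offset between two raters, which is why it is often preferred when some users rate systematically higher than others.
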
* The Similarity Matrix:
    * Pick a similarity metric, and create the similarity matrix:
    
|            |  Item 1  |  Item 2  |  Item 3  | $\cdots$ |
|:----------:|:--------:|:--------:|:--------:|:--------:|
| **Item 1** | 1        | 0.3      | 0.2      | $\cdots$ |
| **Item 2** | 0.3      | 1        | 0.7      | $\cdots$ |
| **Item 3** | 0.2      | 0.7      | 1        | $\cdots$ |
| $\vdots$   | $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ |

* Making Predictions with Collaborative Filtering:
    * Example: user $u$ hasn't rated item $i$ and we want to predict the rating that this user **would** give this item: $$rating(u,i)=\frac{\sum_{j\in I_u} similarity(i,j) \, r_{u,j}}{\sum_{j\in I_u} similarity(i,j)}$$ $$I_u = \text{set of items rated by user } u$$ $$r_{u,j} = \text{user }u\text{'s rating of item }j$$
    * Order by descending predicted rating for a single user, and recommend the top $k$ items to the user
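
A minimal sketch of this prediction rule, using the values from the example similarity matrix and a hypothetical user who has rated items 2 and 3. The names `sim`, `predict_rating`, and `recommend` are illustrative.

```python
# Hypothetical item-item similarities (same values as the example matrix).
sim = {
    "item1": {"item1": 1.0, "item2": 0.3, "item3": 0.2},
    "item2": {"item1": 0.3, "item2": 1.0, "item3": 0.7},
    "item3": {"item1": 0.2, "item2": 0.7, "item3": 1.0},
}


def predict_rating(user_ratings, sim, item):
    """Similarity-weighted average of the ratings the user has already given."""
    weight_total = sum(sim[item][j] for j in user_ratings)
    if weight_total == 0:
        return None  # no usable similarities, so no prediction
    weighted = sum(sim[item][j] * r for j, r in user_ratings.items())
    return weighted / weight_total


def recommend(user_ratings, sim, k):
    """Top-k unrated items by predicted rating, highest first."""
    candidates = [i for i in sim if i not in user_ratings]
    scored = [(predict_rating(user_ratings, sim, i), i) for i in candidates]
    scored = [(s, i) for s, i in scored if s is not None]
    return [i for s, i in sorted(scored, reverse=True)[:k]]


user = {"item2": 3, "item3": 4}  # hypothetical ratings by one user
print(predict_rating(user, sim, "item1"))  # (0.3*3 + 0.2*4) / (0.3 + 0.2), about 3.4
print(recommend(user, sim, k=1))
```
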
* Making Predictions **Using Neighborhoods** with Collaborative Filtering:
    * This calculation of predicted ratings can be **very costly**
    * To **mitigate** this issue, we will only consider the **$n$ most similar items** to an item when calculating the prediction (here $n$ is the neighborhood size, not the number of items): $$rating(u,i)=\frac{\sum_{j\in I_u \cap N_i} similarity(i,j) \, r_{u,j}}{\sum_{j\in I_u \cap N_i} similarity(i,j)}$$ $$I_u = \text{set of items rated by user } u$$ $$r_{u,j} = \text{user }u\text{'s rating of item }j$$ $$N_i \text{ is the set of } n \text{ items most similar to item }i$$
    * Order by descending predicted rating for a single user, and recommend the top $k$ items to the user
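
The neighborhood variant can be sketched the same way. The assumption here is that neighborhoods are precomputed once per item and then intersected with the user's rated items at prediction time; all names and values are hypothetical.

```python
# Hypothetical item-item similarities (same values as the example matrix).
sim = {
    "item1": {"item1": 1.0, "item2": 0.3, "item3": 0.2},
    "item2": {"item1": 0.3, "item2": 1.0, "item3": 0.7},
    "item3": {"item1": 0.2, "item2": 0.7, "item3": 1.0},
}


def neighborhood(sim, item, n):
    """The n items most similar to `item`, excluding the item itself."""
    others = [j for j in sim[item] if j != item]
    return set(sorted(others, key=lambda j: sim[item][j], reverse=True)[:n])


def predict_with_neighborhood(user_ratings, neighbors, sim, item):
    """Weighted average over rated items that are also neighbors of `item`."""
    shared = [j for j in user_ratings if j in neighbors]
    weight_total = sum(sim[item][j] for j in shared)
    if weight_total == 0:
        return None  # the user has rated none of the neighbors
    return sum(sim[item][j] * user_ratings[j] for j in shared) / weight_total


# Neighborhoods can be computed once, ahead of request time.
neighbors = {i: neighborhood(sim, i, n=1) for i in sim}
user = {"item2": 3, "item3": 4}  # hypothetical ratings by one user
print(neighbors["item1"])
print(predict_with_neighborhood(user, neighbors["item1"], sim, "item1"))
```

With a neighborhood of size 1, only item 2 contributes to the prediction for item 1, so the result is just the user's rating of item 2.
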
* Deploying the CF Recommender:
    * Compute similarities between all pairs of items
    * Compute the neighborhood of each item
    * At request time, predict scores for candidate items and make a recommendation
* Evaluating the Recommenders:
    * Is it possible to do cross-validation like normal?
        * (-) Recommenders are inherently hard to validate
        * (-) There is no "one" answer for all datasets
    * Calculate MSE between targets and our predictions over the holdout set
    ![mse_cv](mse_cv.png)
        * Question marks denote the holdout set values (**not** missing values)
        * K-fold cross-validation is optional
        * Why isn't the method above a true estimate of a recommender's performance in the field?
        * Why would A/B Testing be better?
    * Another validation method: **Splitting dataset by time**
    ![split_data](split_data.png)
        * Why might we prefer doing this instead of the more "normal" cross-validation shown above?
    * Bad validation split: Splitting dataset by movie
    ![split_by_movie](split_by_movie.png)
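
A time-based split like the one described above might look like the following sketch. It assumes each rating carries a timestamp; the event tuples, the cutoff value, and `split_by_time` are hypothetical.

```python
# Hypothetical (user, item, rating, timestamp) events.
events = [
    ("u1", "i1", 4, 100), ("u2", "i1", 3, 120),
    ("u1", "i2", 5, 150), ("u3", "i2", 2, 200),
    ("u2", "i3", 1, 260), ("u3", "i1", 4, 300),
]


def split_by_time(events, cutoff):
    """Train on everything before `cutoff`, test on everything at or after it."""
    train = [e for e in events if e[3] < cutoff]
    test = [e for e in events if e[3] >= cutoff]
    return train, test


train, test = split_by_time(events, cutoff=200)
print(len(train), len(test))
```

Because every test rating happens after every training rating, the model is never evaluated on the past using information from the future.
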
* Dealing with **"Cold Start"** items or users:
    * **Cold Start** - refers to the scenario where a new user or item is introduced into the dataset with no information
    * Example: A new **user** signs up
        * What will our recommender do assuming we're using item-item similarities?
            * One Strategy: Force users to rate 5 items as part of the sign-up process and/or recommend popular items at first
        * What will our recommender do assuming we're YouTube and we're using item popularity to make recommendations?
            * Not much of a problem
    * Example: A new **item** is introduced
        * What will our recommender do assuming we're using item-item similarities?
            * One Strategy: Put it in the "new releases" section until enough users rate it and/or use item metadata if any exists
        * What will our recommender do assuming we're YouTube and we're using item popularity to make recommendations?
            * Don't use **total number of views** as the popularity metric, and use a different strategy
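
One way the new-user strategy could be wired up is sketched below: fall back to popular items until the user has rated enough items for item-item similarities to be usable. The threshold of 5 follows the sign-up example above; `recommend`, `fake_personalized`, and the item names are hypothetical stand-ins.

```python
def recommend(user_ratings, popular_items, personalized, min_ratings=5, k=3):
    """Fall back to popular items until the user has rated enough items."""
    if len(user_ratings) < min_ratings:
        return popular_items[:k]
    return personalized(user_ratings, k)


# Hypothetical inputs for illustration only.
popular = ["item_a", "item_b", "item_c", "item_d"]


def fake_personalized(user_ratings, k):
    """Stand-in for a real item-item recommender."""
    return ["item_x", "item_y", "item_z"][:k]


new_user = {}
active_user = {f"item{i}": 4 for i in range(5)}
print(recommend(new_user, popular, fake_personalized))
print(recommend(active_user, popular, fake_personalized))
```
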
* Downsides of **Collaborative Filtering**
    * **Item-Item Collaborative Filtering**
        * Example: "I like action movies" $\rightarrow$ rate "Top Gun" and "Mission Impossible" $\rightarrow$ 5s
            * (-) Item-Item Recommender: Recommends "Jerry Maguire" even though I won't like it
    * **User-User Collaborative Filtering**
        * Example: "I like Tom Cruise" $\rightarrow$ rate "Top Gun" and "Mission Impossible" $\rightarrow$ 5s 
            * (-) User-User Recommender: Recommends "Transformers" even though I won't like it 
* Movies Have Attributes
    * Genres: Action, Romance, Comedy, etc.
    * Actors: Tom Cruise, Tom Hanks, Megan Fox, etc.
    * Description: Long, Short, Subtitles, Foreign, Happy, Sad, etc.
* What about using Linear Regression for Rating Prediction?
    * Example: Rating Prediction = $$\beta_0 + \beta_1 \times actionness + \dots + \beta_i \times foxiness + \dots + \beta_j \times sadness + \epsilon$$
    * Possible, but we would have to come up with some measure of actionness, etc. This is both subject to error and rather brittle.

3) Using Matrix Factorization to Predict Ratings (or Recommendations)
* Benefits of **Matrix Factorization**:
    * Like linear regression, matrix factorization can account for attributes along these lines, but without us having to define and measure them by hand
    * All of the matrix factorization models that we know can be **interpreted as a linear combination of bases**
    * There is a chance, especially with NMF, that those bases (the latent features) correspond to some of the "attributes" we would use to describe the movies
* Factorization Issue/Requirement:

|                  |                                                              UVD                                                             |                                                                SVD                                                                |                                                                  NMF                                                                  |
|:----------------:|:----------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------:|
| **Equation**     | $X \approx UV$                                                                                                               | $X = USV^T$                                                                                                                       | $V \approx WH$                                                                                                                        |
| **Matrix Shape** | $U$ and $V$ will (likely) not be orthogonal | $U$ is an orthogonal matrix<br/>$S$ is a diagonal matrix of non-negative "singular" values in decreasing order<br/>$V$ is an orthogonal matrix | Same as UVD, but with one extra constraint: **all values of $V, W, H$ must be non-negative**<br/>NMF is a specialization of UVD |
| **Solution**     | Has many approximate non-unique solutions<br/>Non-convex optimization with many local minima<br/>Has a tunable parameter $k$ | Has a unique, exact solution                                                                                                      | Both NMF and UVD are approximate factorizations, and both optimize to reduce the RSS                                              |
* PCA/SVD/NMF vs UVD:
    * Problem: PCA, SVD, and NMF **all** must be computed on a **dense** (fully observed) data matrix, $X$
        * Possible Solution: Impute missing values, naively, with something like the mean of the known values. (note: when scikit-learn factorizes a sparse matrix, the unstored entries are treated as zeros, not as missing values)
    * Missing Values:
        * **SVD** - works **poorly** if $X$ has missing values
            * (-) Forced to fill in missing values
            * (-) Solution fits these fill-values
            * (-) Makes for a much larger memory footprint
            * (-) Slow to compute for large matrices
        * **UVD** - handles missing values when computed via **SGD**
* Factorization Goal:
    * Create a factorization for a sparse data matrix, $X$, into $U V$, such that the reconstruction to $\hat{X}$ serves as a model: $$X_{m\times n}\approx U_{m\times k}V_{k\times n}$$ $$x_{i,j} \approx u_i v_j$$
    * More formally, for a previously unknown entry $X_{i,j}$ in $X$:
        * The corresponding entry $\hat{X}_{i,j}$ in $\hat{X}$ serves as a prediction
    * Since we could easily **overfit** the known values in $X$, we will want to **regularize**
        * Regularization by reducing the inner dimension $k$ of $U$ and $V$
    * Factorization visualized:
    ![factorization](factorization.png)
    * Reconstruction visualized:
    ![reconstruction](reconstruction.png)
* Difference between Collaborative Filtering (CF) and Matrix Factorization (MF):
    * **Collaborative Filtering (Neighborhood Models)** $\rightarrow$ **Memory Based**
        * Just store data so we can query what or whom is most similar when asked to recommend
    * **Factorization Techniques** $\rightarrow$ **Model Based**
        * Creates predictions, from which the most suitable can be recommended
* Computing the Factorization
    * Similar to what we did to find the factorization in NMF, we're going to **minimize a cost function**
    * Now, though, we can't minimize over the **entirety** of $X$, since it is **sparse** (e.g. 99% missing data)
    * However, we can **optimize with respect to the data in $X$** that we do have (1% of known data)
* Factorization Plan
    * **UV Decomposition (UVD)** Steps:
        1. Choose $k$
        2. $UV$ is necessarily only an approximation of $X$ when $k$ is less than the rank of $X$
        3. Usually choose: $k < \min(n,m)$
        4. Compute $U$ and $V$ by least squares: $$\arg\min_{U,V}\sum_{(i,j) \in K}(X_{i,j}-U_i V_j)^2$$
    * For each of the known ratings $X_{i,j}$, we want to minimize the squared error in the prediction that results from $U_i V_j$: $$\min_{U,V}\sum_{(i,j) \in K}(X_{i,j}-U_i V_j)^2$$
    * where $U_i$ is the $i^{th}$ row of $U$
    * $V_j$ is the $j^{th}$ column of $V$
    * $K$ is the set of indices in $X$ that have data
* Reconstructing a Single Entry:
![reconstructing_single](reconstructing_single.png)
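
The reconstruction of a single entry and the cost over the known entries can be written out directly. This is a minimal sketch with hypothetical factors ($k=2$), where `U` is a list of rows and `V` is stored as $k$ rows of length $n$ so that `V[f][j]` is entry $f$ of column $V_j$.

```python
def predict(U, V, i, j):
    """Dot product of row i of U (m x k) with column j of V (k x n)."""
    return sum(U[i][f] * V[f][j] for f in range(len(V)))


def squared_error(known, U, V):
    """Sum of squared errors over the known entries only."""
    return sum((x - predict(U, V, i, j)) ** 2 for (i, j), x in known.items())


# Hypothetical factors with k = 2, and a few known entries of X.
U = [[1.0, 0.0], [0.0, 1.0]]            # 2 users x 2 factors
V = [[4.0, 2.0, 1.0], [3.0, 3.0, 2.0]]  # 2 factors x 3 items
known = {(0, 0): 4.0, (0, 2): 1.0, (1, 1): 3.0, (1, 2): 2.5}

print(predict(U, V, 1, 0))  # entry (1, 0) is unknown in X; this is its prediction
print(squared_error(known, U, V))
```

Only the entries listed in `known` contribute to the cost; unknown entries such as (1, 0) are predicted but never penalized.
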

4) Factorization Algorithms
* Types of Algorithms
    * **Alternating Least Squares (ALS)** - minimization by rotating between fixing the $U_i$ to solve for the $V_j$ and fixing the $V_j$ to solve for the $U_i$
    * **Funk SVD** - a factorization fit by stochastic gradient descent (SGD) over the known ratings only, popularized by Simon Funk during the Netflix Prize
* ALS vs SGD:

|                          |                                          ALS                                          |                            SGD                           |
|:------------------------:|:-------------------------------------------------------------------------------------:|:--------------------------------------------------------:|
| **Speed**                | Parallelizes very well                                                                | Faster (if on single machine)                            |
| **Learning Rate**        | No learning rate to tune | Requires tuning learning rate |
| **Availability/Results** | Available in Spark/MLlib | Anecdotal evidence of better results |
| **Missing Values**       | The plain least-squares form assumes a fully observed (**dense**) matrix; variants that fit only the observed entries (as Spark MLlib's ALS does) also exist | Works with missing values (works with **sparse** matrix) |

* Questions to consider if considering SVD algorithm for recommendations:
    * Would using SVD be good for this sparse utility matrix (we used it previously for finding latent features)?
    * What's the problem with using SVD on this sparse utility matrix?
    * Why does UVD (or NMF) work better than SVD to find latent factors when the utility matrix is sparse?
* UVD (or NMF) + SGD is normally the best option for Recommender Systems:
    * NMF + SGD is "best in class" option for **many** recommender domains:
        * (+) no need to impute missing values
        * (+) use regularization to avoid overfitting
        * (+) optionally include bias terms to communicate prior knowledge
        * (+) can handle time-dynamics (e.g. change in user preference over time)
        * (+) used by the winning entry in the Netflix challenge
* **Funk SVD**
    * Define the error on a particular prediction in $X$: $$e_{i,j}=X_{i,j}-\hat{X}_{i,j}$$
    * Then, we can update row $U_i$ of $U$ and column $V_j$ of $V$ as follows, where $v$ is the learning rate: $$U_i \leftarrow U_i + v(e_{i,j}V_j)$$ $$V_j \leftarrow V_j + v(e_{i,j}U_i)$$
    * Funk SVD Algorithm:
        * Initialize $U$ and $V$ with small random values
        * While error is decreasing:
            * For each user, $i$:
                * For each item rated by that user, $j$:
                    1. Predict rating, $\hat{X}_{i,j}$
                    2. Calculate $e_{i,j}$
                    3. Update $U_i$ and $V_j$
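
The loop above can be sketched in plain Python as follows. This is a simplified version under stated assumptions: it runs a fixed number of epochs rather than stopping when the error stops decreasing, it initializes with small positive random values, and the hyperparameters (`k`, `lr`, `epochs`) and ratings are hypothetical choices for a toy example.

```python
import random


def predict(U, V, i, j):
    """Dot product of row i of U (m x k) with column j of V (k x n)."""
    return sum(U[i][f] * V[f][j] for f in range(len(V)))


def funk_svd(known, m, n, k=2, lr=0.02, epochs=500, seed=0):
    """SGD over the known entries only; returns U (m x k) and V (k x n)."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(m)]
    V = [[rng.uniform(0.0, 0.1) for _ in range(n)] for _ in range(k)]
    for _ in range(epochs):  # fixed epoch count instead of "while error decreases"
        for (i, j), x in known.items():
            err = x - predict(U, V, i, j)
            for f in range(k):
                u_old = U[i][f]  # keep the pre-update value for V's step
                U[i][f] += lr * err * V[f][j]
                V[f][j] += lr * err * u_old
    return U, V


# Hypothetical known ratings keyed by (user index, item index).
known = {(0, 0): 4, (0, 3): 1, (1, 0): 3, (1, 1): 3,
         (1, 2): 2, (1, 3): 2, (2, 1): 3}
U, V = funk_svd(known, m=3, n=4)
sse = sum((x - predict(U, V, i, j)) ** 2 for (i, j), x in known.items())
print(round(sse, 4))
print(round(predict(U, V, 0, 1), 2))  # prediction for an entry that was never observed
```

A production version would typically track error on a holdout set each epoch and stop once it starts rising.
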

5) Factorization Nuances
* **Baseline Predictors (Biases)**
    * (-) Much of the variation in observed ratings is associated with a specific user's personality (user bias) or an item's intrinsic value (item bias), **not an interaction between the item and user**, which is what the factorization captures
        * e.g. Some items (e.g. movies) have a tendency to be rated high, some low
        * e.g. Some users have a tendency to rate high, some low
    * To encapsulate these effects, which do not involve user-item interactions, we introduce **baseline predictors**: $$b_{i,j}=\mu+b_i+b_j$$
        * $b_{i,j} \rightarrow$ overall bias of the rating by user $i$ for item $j$
        * $\mu \rightarrow$ overall average rating in $X$
        * $b_i \rightarrow$ user $i$'s average deviation from the overall average
        * $b_j \rightarrow$ item $j$'s average deviation from the overall average
    * From this, we can describe our predictions with: $$\hat{X}_{i,j}=\mu+b_i+b_j+U_i V_j$$
        * $\hat{X}_{i,j} \rightarrow$ the prediction of user $i$ rating item $j$
        * $\mu \rightarrow$ the average rating
        * $b_i \rightarrow$ user $i$'s tendency to deviate from the average
        * $b_j \rightarrow$ item $j$'s tendency to deviate from the average
        * $U_i V_j \rightarrow$ the prediction of how user $i$ will interact with item $j$
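
The baseline terms can be estimated directly from the known ratings using the "average deviation" definitions above. This is a minimal sketch with hypothetical ratings; in practice the biases are often learned jointly with $U$ and $V$ instead.

```python
def baseline_terms(known):
    """Return (mu, user_bias, item_bias) from known ratings keyed by (i, j)."""
    mu = sum(known.values()) / len(known)
    user_devs, item_devs = {}, {}
    for (i, j), x in known.items():
        user_devs.setdefault(i, []).append(x - mu)
        item_devs.setdefault(j, []).append(x - mu)
    user_bias = {i: sum(d) / len(d) for i, d in user_devs.items()}
    item_bias = {j: sum(d) / len(d) for j, d in item_devs.items()}
    return mu, user_bias, item_bias


known = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 1): 2}  # hypothetical ratings
mu, user_bias, item_bias = baseline_terms(known)
print(mu, user_bias, item_bias)
print(mu + user_bias[0] + item_bias[0])  # baseline for user 0 and item 0
```

In this toy example user 0 rates half a point above average and item 0 is rated a full point above average, so the baseline alone already explains the rating of 5.
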
* **Regularization of Sparse Dataset**
    * Another way to regularize our decomposition and help prevent overfitting to our sparse data is a penalty, weighted by $\lambda$, on the magnitude of $b_i, b_j, U_i, V_j$. The most common choice is the squared $L_2$ norm
    * Such a penalty changes our cost function: $$\min_{b,U,V}\sum_{(i,j) \in K}\left[(X_{i,j}-\mu-b_i-b_j-U_i V_j)^2 + \lambda(b_i^2+b_j^2+\Vert U_i\Vert^2 + \Vert V_j\Vert^2)\right]$$
    * With these considerations, the regularized update rules become (with $e_{i,j}=X_{i,j}-\hat{X}_{i,j}$ and constant factors absorbed into $v$ and $\lambda$): $$b_i \leftarrow b_i + v(e_{i,j}-\lambda b_i)$$ $$b_j \leftarrow b_j + v(e_{i,j}-\lambda b_j)$$ $$U_i \leftarrow U_i + v(e_{i,j}V_j-\lambda U_i)$$ $$V_j \leftarrow V_j + v(e_{i,j}U_i-\lambda V_j)$$
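
A single regularized update for one known rating can be sketched as below. The code calls the learning rate `lr` (written $v$ in the formulas) to avoid clashing with the item vector; `sgd_step`, `user_vec`, `item_vec`, and the numbers are hypothetical.

```python
def sgd_step(x, mu, b_user, b_item, user_vec, item_vec, lr, lam):
    """One regularized update for a single known rating x.

    user_vec is the user's factor vector (row U_i) and item_vec is the
    item's factor vector (column V_j).
    """
    interaction = sum(a * b for a, b in zip(user_vec, item_vec))
    err = x - (mu + b_user + b_item + interaction)
    new_b_user = b_user + lr * (err - lam * b_user)
    new_b_item = b_item + lr * (err - lam * b_item)
    # Both factor updates use the pre-update values of the two vectors.
    new_user_vec = [a + lr * (err * b - lam * a) for a, b in zip(user_vec, item_vec)]
    new_item_vec = [b + lr * (err * a - lam * b) for a, b in zip(user_vec, item_vec)]
    return new_b_user, new_b_item, new_user_vec, new_item_vec


# Hypothetical values: true rating 5, current prediction 3 + 0 + 0 + 1 = 4.
b_user, b_item, user_vec, item_vec = sgd_step(
    x=5, mu=3.0, b_user=0.0, b_item=0.0,
    user_vec=[1.0, 0.0], item_vec=[1.0, 0.0], lr=0.1, lam=0.5,
)
print(b_user, b_item, user_vec, item_vec)
```

The error pushes every term toward the true rating while the $\lambda$ terms pull the biases and factors back toward zero.
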
* **Validation For Recommenders**
    * Validating any recommender is difficult, but it is necessary as we're going to want to tune the hyperparameters that we introduced into our model: the learning rate $v$ and the regularization strength $\lambda$
    * The most frequently used metric is RMSE on the known data: $$RMSE=\sqrt{\frac{1}{|K|}\sum_{(i,j)\in K}(X_{i,j}-\hat{X}_{i,j})^2}$$
    * Example: RMSE over the Netflix dataset using various matrix factorization models
    ![netflix_example](netflix_example.png)
        * Numbers on the chart denote each model's dimensionality, $k$
        * The more refined models perform better and have lower errors
        * **Netflix's in-house model performs at RMSE=0.9514 on this dataset** (the worst score), so even the **simple** matrix factorization models beat it!
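
For reference, RMSE over a set of held-out known ratings can be computed as in this sketch, which divides the summed squared error by the number of entries before taking the root. The dictionaries and their values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over the entries of `actual`, keyed by (i, j)."""
    squared = sum((actual[key] - predicted[key]) ** 2 for key in actual)
    return math.sqrt(squared / len(actual))


actual = {(0, 0): 4, (1, 2): 2}     # hypothetical held-out ratings
predicted = {(0, 0): 3, (1, 2): 4}  # hypothetical model predictions
print(rmse(actual, predicted))      # sqrt((1 + 4) / 2)
```
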

6) Matrix Factorization Pros & Cons
* (+) Decent with sparsity, so long as we regularize
* (+) Prediction is fast, only need to do an inner product
* (+) Can inspect latent features for topical meaning
* (+) Can be extended to include side information
* (-) Need to re-factorize when new data arrives, which is very slow
* (-) Fails in the cold start case
* (-) Not great open source tools for huge matrices
* (-) Difficult to tune directly to the type of recommendation you want to make. Tied to the difficulty of measuring success

7) Advanced Factorization Models
1. **Non-negativity constraint** - more interpretable latent features
2. **SVD++** - uses implicit feedback (e.g. clicks, likes, etc.) to enhance model
3. **Time-aware Factor Model** - accounts for temporal information about data