# Matrix Factorization

### Recommender System

- data: ratings that many users have given to some movies
- skill: predict how a user would rate an unrated movie

For example, N users rate M movies.  
User n gives movie m the rating $ r_{nm} $.

$$
\mathcal{D}_m = \Big\{ \big( \tilde{x}_n = (n), y_n = r_{nm} \big) \Big\}
$$

abstract feature $ \tilde{x}_n = (n) $ : the user is only an ID number. How can preferences be learned from such an abstract feature?

### Binary Vector Encoding of Categorical Feature

Categorical Features: 

- $ \tilde{x}_n = (n) $ : User IDs, such as 2412, 5123, 6872; the number carries no numerical meaning, it is only an identifier
- Blood Type: A, B, AB, O
- Programming Language: C, Golang, Java, Python, ...

Many ML models operate on numerical features:

- linear models
- extended linear models such as NNet

Exception: decision trees can operate on categorical features directly.

NEED: encoding (transform) **from categorical to numerical**

Binary Vector encoding:

$$
A = \begin{bmatrix} 1\\0\\0\\0 \end{bmatrix}, \ \ 
B = \begin{bmatrix} 0\\1\\0\\0 \end{bmatrix}, \ \ 
AB = \begin{bmatrix} 0\\0\\1\\0 \end{bmatrix}, \ \ 
O = \begin{bmatrix} 0\\0\\0\\1 \end{bmatrix}
$$
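A minimal sketch of this encoding in Python with NumPy. The helper name `binary_vector_encode` and the category ordering are assumptions made for illustration; any fixed ordering of the categories works.

```python
import numpy as np

def binary_vector_encode(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(value)] = 1
    return vec

# Assumed ordering of the blood types, matching the vectors above.
blood_types = ["A", "B", "AB", "O"]
encoded_ab = binary_vector_encode("AB", blood_types)
print(encoded_ab)  # [0 0 1 0]
```

The same helper applies to user IDs: with N users, user n becomes an N-dimensional vector with a 1 in position n.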

### Feature Extraction from Encoded Vector

$$
\mathcal{D}_m = \Big\{ \big( \tilde{x}_n = BinaryVectorEncoding(n), y_n = r_{nm} \big) \Big\}
$$

A `?` marks a rating that has not been given.

$$
\mathcal{D}_m = \Big\{ \big( \tilde{x}_n = BinaryVectorEncoding(n), y_n = 
\begin{bmatrix}
r_{n1} \\ ? \\ ? \\ r_{n4}  \\ r_{n5} \\ \vdots \\ r_{nM}
\end{bmatrix}
\big) \Big\}
$$

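idea: try feature extraction using $ N-\tilde{d}-M $ NNet without all $ x_0^{(\ell)} $ (the constant terms are dropped for simplicity)

Because x has exactly one entry equal to 1 and all others 0, the tanh nodes in the NNet can be replaced by linear ones; the network can still express the weights attached to the single active entry of x.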

![img](imgs/c215-matrix-fact-nnet.png)

matrix $ V^T $ of size $ N \times \tilde{d} $

matrix $ W $ of size $ \tilde{d} \times M $

hypothesis: 

$ h(x) = W^T_{M \times \tilde{d}} \ V_{\tilde{d} \times N} \vec{x}_{N \times 1} $

per-user output:

$ h(x_n) = W^T v_n $, where $ v_n $ is the n-th column of V, because $ x_n $ has exactly one entry equal to 1 and all others 0.

linear network for recommender system: learn V and W.
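A small numerical check of the per-user output, as a sketch with made-up sizes and random weights (the values of `N`, `d`, `M` and the seed are illustrative assumptions). Multiplying by the binary-encoded $ x_n $ simply selects the n-th column of V.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 5, 2, 3              # users, hidden dimension, movies (toy sizes)
V = rng.normal(size=(d, N))    # column n is v_n
W = rng.normal(size=(d, M))    # column m is w_m

n = 3
x_n = np.zeros(N)
x_n[n] = 1.0                   # binary vector encoding of user n

full = W.T @ V @ x_n           # h(x_n) through the whole linear network
shortcut = W.T @ V[:, n]       # W^T v_n, using only the n-th column of V
print(np.allclose(full, shortcut))  # True
```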

### Linear Network: Linear Model Per Movie

linear network:

$$
h(x) = W^T \underbrace{\ V \ x}_{\Phi(x)}
$$

for the m-th movie, just a linear model: $ h_m(x) = w_m^T \ \Phi(x) $

for every $ \mathcal{D}_m $, want $ r_{nm} = y_n \approx w_m^T v_n $

$ E_{in} $ over all $ \mathcal{D}_m $ with squared error measure: (user:n rated movie:m)

$$
E_{in} \big( \{ w_m \}, \{ v_n \} \big) =
\frac{1}{\sum_{m=1}^M | D_m | }
\sum_{\text{user n rated movie m}} \Big( r_{nm} - w_m^T v_n \Big)^2
$$
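A sketch of this error measure in NumPy. The function name `squared_error_in`, the boolean mask `known`, and the use of `np.nan` for missing ratings are assumptions for illustration; the point is that only rated (user, movie) pairs enter the sum.

```python
import numpy as np

def squared_error_in(R, known, V, W):
    """Mean squared error over the known ratings only.

    R: (N, M) ratings; known: (N, M) boolean mask of rated pairs;
    V: (d, N) with columns v_n; W: (d, M) with columns w_m.
    """
    pred = V.T @ W                    # entry (n, m) is v_n^T w_m
    residual = (R - pred)[known]      # unrated pairs are dropped here
    return np.sum(residual ** 2) / known.sum()

# Toy data: 2 users, 2 movies, hidden dimension 1, two known ratings.
R = np.array([[5.0, np.nan],
              [np.nan, 1.0]])
known = ~np.isnan(R)
V = np.array([[1.0, 1.0]])
W = np.array([[4.0, 1.0]])
e_in = squared_error_in(R, known, V, W)
print(e_in)  # 0.5
```

Here the residuals on the two known ratings are 1 and 0, so the mean squared error is 0.5.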

linear network: the transform and the linear models are jointly learned from all $ \mathcal{D}_m $

## Matrix Factorization

$$
r_{nm} \approx w_m^T v_n = v_n^T w_m \iff R \approx V^T W
$$

learning flow:

- start from the known ratings
- learn the factors $ v_n $ and $ w_m $
- predict the unknown ratings with $ w_m^T v_n $

similar modeling can be used for other abstract features.

![img](imgs/c215-matrix-r.png)

![img](imgs/c215-model-vis.png)

### Matrix Factorization Learning

$$
\begin{align}
\min_{W,V} E_{in} \big( \{ w_m \}, \{ v_n \} \big) \propto & \sum_{\text{user n rated movie m}} \big( r_{nm} - w_m^T v_n \big)^2 \\
= & \sum_{m=1}^M \Big( \sum_{(x_n, r_{nm}) \in D_m} \big( r_{nm} - w_m^T v_n \big)^2 \Big)
\end{align}
$$

two sets of variables: can consider **alternating minimization**

when $ v_n $ is fixed, minimizing over $ w_m \equiv $ minimizing $ E_{in} $ within $ D_m $

- simply per-movie (per $ D_m $) linear regression without $ w_0 $

when $ w_m $ is fixed, minimizing over $ v_n $

- per-user linear regression without $ v_0 $, by the symmetry between users and movies

This procedure is called the **alternating least squares** algorithm.
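A worked version of the per-movie step, as a sketch. It assumes that $ \sum v_n v_n^T $ over the users who rated movie m is invertible (otherwise a pseudo-inverse would take its place). Setting the gradient with respect to $ w_m $ to zero gives the usual normal equations of linear regression:

$$
\begin{align}
\nabla_{w_m} \sum_{(x_n, r_{nm}) \in D_m} \big( r_{nm} - w_m^T v_n \big)^2 = & -2 \sum_{(x_n, r_{nm}) \in D_m} \big( r_{nm} - w_m^T v_n \big) v_n = 0 \\
\Rightarrow w_m = & \Big( \sum_{(x_n, r_{nm}) \in D_m} v_n v_n^T \Big)^{-1} \sum_{(x_n, r_{nm}) \in D_m} r_{nm} v_n
\end{align}
$$

The per-user step has the same form with the roles of $ v_n $ and $ w_m $ swapped, summing over the movies that user n rated.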

### Alternating Least Squares

STEP 1: initialize the $ \tilde{d} $-dimensional vectors $ \{ w_m \}, \{ v_n \} $

STEP 2: alternating optimization of $ E_{in} $: repeatedly

STEP 2.a: optimize $ w_1, w_2, \cdots, w_M $: update $ w_m $ by m-th-movie linear regression on $ \{ (v_n, r_{nm}) \} $

STEP 2.b: optimize $ v_1, v_2, \cdots, v_N $: update $ v_n $ by n-th-user linear regression on $ \{ (w_m, r_{nm}) \} $

until convergence.

- initialize: usually just randomly.
- converge: guaranteed, because $ E_{in} $ never increases during alternating minimization and is bounded below by 0

Alternating Least Squares: the 'tango' dance between users/movies.
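The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the function name `als`, the boolean mask `known`, and the fixed number of rounds (standing in for a real convergence test) are all choices made here, and `np.linalg.lstsq` is used for each small regression.

```python
import numpy as np

def als(R, known, d, n_rounds=20, seed=0):
    """Alternating least squares on the known entries of R.

    R: (N, M) ratings (values at unknown positions are ignored);
    known: (N, M) boolean mask; d: hidden dimension.
    Returns V of shape (d, N), W of shape (d, M), and E_in after each round.
    """
    rng = np.random.default_rng(seed)
    N, M = R.shape
    V = rng.normal(size=(d, N))    # STEP 1: random initialization
    W = rng.normal(size=(d, M))
    history = []
    for _ in range(n_rounds):      # assumption: fixed rounds instead of a stopping test
        # STEP 2.a: per-movie linear regression on {(v_n, r_nm)}
        for m in range(M):
            users = known[:, m]
            if users.any():
                W[:, m] = np.linalg.lstsq(V[:, users].T, R[users, m], rcond=None)[0]
        # STEP 2.b: per-user linear regression on {(w_m, r_nm)}
        for n in range(N):
            movies = known[n, :]
            if movies.any():
                V[:, n] = np.linalg.lstsq(W[:, movies].T, R[n, movies], rcond=None)[0]
        residual = (R - V.T @ W)[known]
        history.append(np.sum(residual ** 2) / known.sum())
    return V, W, history

# Synthetic example: a rank-2 rating matrix with roughly 70% of entries known.
rng = np.random.default_rng(1)
R = rng.normal(size=(2, 6)).T @ rng.normal(size=(2, 5))
known = rng.random(R.shape) < 0.7
V, W, history = als(R, known, d=2)
print(history[0], history[-1])
```

Each regression step cannot increase the error on the known ratings, so `history` is non-increasing up to floating-point noise. Predictions for the unrated pairs are read off `V.T @ W` at the positions where `known` is `False`.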

### Linear Autoencoder versus Matrix Factorization

|Aspect|Linear Autoencoder|Matrix Factorization|
|-|-|-|
|motivation   | special d-$\tilde{d}$-d NNet | N-$\tilde{d}$-M linear NNet |
|error measure| squared on all $ x_{ni} $ | squared on known $ r_{nm} $ |
|solution     | global optimum at eigenvectors of $ X^T X $ | local optimum via alternating least squares |
|usefulness   | extract dimension-reduced features | extract hidden user/movie features |

Linear Autoencoder $ \equiv $ Special matrix factorization of complete X
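A sketch of the complete-matrix case, using a random X and a reduced dimension `k` chosen only for illustration. When no entries are missing, the factorization can be read off the SVD: the top right singular vectors are eigenvectors of $ X^T X $, and the squared reconstruction error equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))    # complete data matrix, no missing entries
k = 2                          # reduced dimension (plays the role of d-tilde)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
basis = Vt[:k].T               # (4, k): top-k eigenvectors of X^T X
codes = X @ basis              # (8, k): dimension-reduced features
X_hat = codes @ basis.T        # rank-k reconstruction, i.e. a factorization of X

is_eigen = np.allclose(X.T @ X @ basis, basis * s[:k] ** 2)
error = np.sum((X - X_hat) ** 2)
print(is_eigen, np.isclose(error, np.sum(s[k:] ** 2)))  # True True
```

No alternating minimization is needed here because every entry of X is known; the alternating scheme becomes necessary only when ratings are missing.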