Non-Negative Matrix Factorization (Interpretable Topic Modeling)

* Objectives:
    * Motivation of Topic Modeling
    * Thinking About Topics
    * Topic Analysis Assumptions
    * Math For NMF
    * Topics as Latent Feature Bases
    * Identifying Topics
    * NMF Algorithm

1) **Topic Modeling** - attempts to extract an underlying structure from the data as a form of unsupervised learning
* To discover an underlying set of "topics" that describe high level ideas about the data well
* Example: Consider having the term-frequency matrix for a corpus of documents, all coming from some sort of related source (e.g. articles from a newspaper)
    * In trying to discover topics, or latent features, in these data, we might expect to find overarching topics such as "Sports", "International", "Arts and Leisure", etc.
* **Topic Analysis Assumptions**:
    1. The observations are well described by underlying topics
        * e.g. all "Sports" articles will have similar sport-sy words. In math, each topic has a corresponding distribution of words: $$tf(word \mid topic)$$
    2. The words in a document can be represented by an appropriate combination of topics
        * e.g. an article about FIFA could be represented by the topics: "International" and "Sports". Math: $$tf(word \mid doc)=\sum_{t \in T} tf(word \mid t) \times w(t \mid doc)$$ where $T$ is the set of topics and $w$ is some non-negative weight
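
The second assumption can be illustrated with a minimal NumPy sketch. The vocabulary, the two topic distributions, and the weights below are made-up numbers chosen only for illustration; they do not come from any real corpus.

```python
import numpy as np

# Hypothetical vocabulary and topic distributions (illustrative values only).
vocab = ["goal", "match", "treaty", "border"]
tf_word_given_topic = np.array([
    [0.6, 0.4, 0.0, 0.0],   # hypothetical "Sports" topic
    [0.0, 0.1, 0.5, 0.4],   # hypothetical "International" topic
])

# Hypothetical non-negative weights w(t | doc) for a FIFA-style article.
w_topic_given_doc = np.array([0.7, 0.3])

# tf(word | doc) = sum over topics t of tf(word | t) * w(t | doc)
tf_word_given_doc = w_topic_given_doc @ tf_word_given_topic

for word, tf in zip(vocab, tf_word_given_doc):
    print(f"{word}: {tf:.2f}")
```

The document's word profile is a blend of the two topic profiles, with more weight on the sports vocabulary because its topic weight is larger.
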

2) **Non-Negative Matrix Factorization (NMF)** - a group of algorithms that take a high-dimensional matrix and factor it into 2 (or more) lower-dimensional matrices
* Another way to mathematically express the topic modeling assumptions is through the equation: $$W \times H = V$$ $$(\text{m} \times \mathbf{r}) \times (\mathbf{r} \times \text{n}) = (\text{m} \times \text{n})$$ where each entry: $$v_{ij} \geq 0$$ $$w_{ij} \geq 0$$ $$h_{ij} \geq 0$$
![nmf](https://upload.wikimedia.org/wikipedia/commons/f/f9/NMF.png)
    * $V$ can be approximated by the matrix product of two matrices (aka factorization)
    * Cannot be solved analytically, so approximated numerically
    * $\mathbf{r}$ set by user ($\mathbf{r} < \text{min(m,n)}$)
    * When the internal dimension satisfies $\mathbf{r}\geq m$, $V$ can be recreated perfectly (e.g. with $\mathbf{r} = m$, take $W$ to be the identity matrix and $H = V$)
    * Notice that each column of $V$ is a **sum of the columns of $W$ weighted by the entries of the corresponding column of $H$**, $h_i$: $$v_i = W \times h_i$$
    * NMF is a **relatively new way** of reducing dimensionality of data into linear combination of bases
        * Columns of $W$ as basis
        * Weighted by $h_i$
    * **Non-negativity constraint** - unlike other decomposition models (e.g. PCA/SVD)
* What happens when this dimension, $\mathbf{r}<m$?
    * In looking at the dimensions of the $W$ matrix from our decomposition, we notice that the number of rows remains the same as in $V$, as we would expect
    * Thus, the rows of $W$ must represent some information about each row in $V$
    * When we use a smaller number of dimensions to represent our data in $W$ (when $\mathbf{r}<m$), we necessarily find that some sort of data compression is happening
    * Or, in math, we are projecting our data onto a lower-dimensional basis
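
To make the compression concrete, here is a small sketch with hypothetical sizes and random stand-in factors (not fitted ones). It checks that the product $W \times H$ has rank at most $r$ and compares how many numbers each representation stores.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2            # hypothetical sizes with r < min(m, n)
W = rng.random((m, r))       # random stand-in for a fitted W
H = rng.random((r, n))       # random stand-in for a fitted H
V_hat = W @ H                # same shape as V: (m, n)

# The product of an (m x r) and an (r x n) matrix has rank at most r.
rank = np.linalg.matrix_rank(V_hat)

stored_full = m * n              # numbers needed to store V directly
stored_factored = r * (m + n)    # numbers needed to store W and H

print("rank of W @ H:", rank)
print("entries in V:", stored_full, "| entries in W and H:", stored_factored)
```

Each row of `V_hat` is described by only `r` numbers (the matching row of `W`), which is the sense in which the rows are compressed.
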
* **Topics as Latent Feature Bases** - the values of each row in the latent feature space correspond to that row's strength in the associated topic
    * It turns out, somewhat magically, that when we conduct a **factorization** of this nature, each of the bases in the **lower**-dimensional representation of $V$, aka $W$, can be viewed as a **latent feature**
    * These **latent features** are discovered as somewhat of a side-effect of projecting into a smaller number of dimensions, i.e. of performing some sort of compression
    * Example $W$ (m $\times$ r) Matrix:

|  Topic 1 |  Topic 2 | $\cdots$ | Topic $r$ |              |
|:--------:|:--------:|:--------:|:---------:|:------------:|
| 0.3      | 5.1      | $\cdots$ | 1.2       | **Document 1**   |
| 9.76     | 0.04     | $\cdots$ | 2.7       | **Document 2**   |
| $\vdots$ | $\vdots$ | $\cdots$ | $\vdots$  | $\vdots$     |
| $\vdots$ | $\vdots$ | $\cdots$ | $\vdots$  | $\vdots$     |
| 0.06     | 0.3      | $\cdots$ | 0.001     | **Document $m$** |
        
* **Identifying Topics** - figuring out what topics these latent feature bases correspond to
    * This is an unsupervised approach (i.e. we never explicitly tell it to look for "sports" words)
    * To put "labels" on the latent features and identify them as topics we can do one of two things:
        1. Look at the observations that load heavily on each topic and manually inspect them, trying to identify some commonalities
        2. Inspect the $H$ matrix and see what features contribute to each topic
    * Inspection of dimensions of $H$ matrix:
        * The number of columns in $V$, $n$, is the same as in $H$
        * The columns in $H$ represent some information about the columns in $V$
        * More specifically, each row of $H$ expresses a latent topic in terms of the features in $V$, so the features act as the basis for the latent topics
    * Example $H$ (r $\times$ n) Matrix:

| Feature 1 | Feature 2 | $\cdots$ | Feature $n$ |           |
|:---------:|:---------:|:--------:|:-----------:|:---------:|
| 0.3       | 5.1       | $\cdots$ | 4.2         | **Topic 1**   |
| 10.3      | 1.07      | $\cdots$ | 0.08        | **Topic 2**   |
| $\vdots$  | $\vdots$  | $\cdots$ | $\vdots$    | $\vdots$  |
| $\vdots$  | $\vdots$  | $\cdots$ | $\vdots$    | $\vdots$  |
| 2.03      | 0.3       | $\cdots$ | 0.001       | **Topic $r$** |

| "president" | "coach" | $\cdots$ | "team" |           |
|:---------:|:---------:|:--------:|:-----------:|:---------:|
| 0.3       | 5.1       | $\cdots$ | 4.2         | **Topic 1**   |
| 10.3      | 1.07      | $\cdots$ | 0.08        | **Topic 2**   |
| $\vdots$  | $\vdots$  | $\cdots$ | $\vdots$    | $\vdots$  |
| $\vdots$  | $\vdots$  | $\cdots$ | $\vdots$    | $\vdots$  |
| 2.03      | 0.3       | $\cdots$ | 0.001       | **Topic $r$** |
* Predicting Latent Topics (from the example above) from the $H$ matrix:
    * Topic 1 loads heavily on "coach" and "team" $\rightarrow$ "Sports" category?
    * Topic 2 loads heavily on "president" $\rightarrow$ "Politics" category?
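
Inspecting $H$ can be automated by sorting each topic's row. The sketch below uses only the three visible columns and the first two rows of the example $H$ table; the helper name `top_features` is hypothetical.

```python
import numpy as np

# Visible columns and first two rows of the example H table above.
features = np.array(["president", "coach", "team"])
H = np.array([
    [0.3, 5.1, 4.2],     # Topic 1
    [10.3, 1.07, 0.08],  # Topic 2
])

def top_features(H, features, k=2):
    """Return the k highest-weighted feature names for each topic (row of H)."""
    order = np.argsort(H, axis=1)[:, ::-1][:, :k]   # indices sorted by descending weight
    return [features[row].tolist() for row in order]

top = top_features(H, features)
for topic, names in enumerate(top, start=1):
    print(f"Topic {topic}: {names}")
```

A human still has to attach a label to each list; the factorization only supplies the weights.
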
* **Choosing $r$** for NMF
    * Unfortunately, choosing $r$ is more of an art than a science
    * Try examining how "good" the approximation $W \times H$ is for $V$ (e.g. via the reconstruction error) and find the **smallest $r$** that makes the error suitably small
    * However, $r$ is likely going to be chosen based on intuition that is derived from inspecting the topics and possibly from some domain knowledge
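
One way to examine the approximation quality is to fit the factorization for several values of $r$ and compare reconstruction errors. The sketch below assumes synthetic data and a hypothetical helper `fit_nmf` that uses multiplicative updates; any NMF solver could be substituted.

```python
import numpy as np

def fit_nmf(V, r, n_iter=300, seed=0, eps=1e-10):
    """Hypothetical helper: fit V ~ W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # eps guards against division by zero
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return W, H

def relative_error(V, W, H):
    """Frobenius norm of V - WH, relative to the Frobenius norm of V."""
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Synthetic non-negative data whose true inner dimension is 3 (an assumption).
rng = np.random.default_rng(42)
V = rng.random((20, 3)) @ rng.random((3, 12))

errors = {r: relative_error(V, *fit_nmf(V, r)) for r in range(1, 6)}
for r, err in errors.items():
    print(f"r={r}: relative error {err:.4f}")
```

On real data the error curve rarely has a sharp elbow, which is why inspecting the resulting topics remains part of the choice.
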
* Popular Applications of NMF
    * Computer Vision
    ![nmf_vision](nmf_vision.png)
        * identify or classify objects
        * generally reducing feature space of images
    * Document Clustering
    ![nmf_doc_cluster](nmf_doc_cluster.png)
    * Recommender Systems
    ![nmf_rec_sys](nmf_rec_sys.png)
* Document Clustering with NMF
    * Example: 500 documents and 10,000 words
    ![nmf_dim_doc_cluster](nmf_dim_doc_cluster.png)
        * $W$: (words $\times$ latent factors) - think of a column of $W$ as a **document archetype**, where the higher a word's cell value, the higher the word's rank for that latent feature
        * $H$: (latent factors $\times$ documents) - think of a column of $H$ as the **original document**, where each cell value is the document's rank for a particular latent feature
        * $V$: (words $\times$ documents) - think of **reconstituting a particular document** as a linear combination of "document archetypes" weighted by how important they are (note that $V$ is transposed here relative to the document $\times$ feature layout used above)
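
The shapes in this example can be sketched as follows. The factors here are random stand-ins rather than fitted ones, and $r = 5$ is a hypothetical choice; the point is only how a document column is reconstituted and assigned to a cluster.

```python
import numpy as np

n_words, n_docs, r = 10_000, 500, 5     # r = 5 is a hypothetical choice
rng = np.random.default_rng(0)
W = rng.random((n_words, r))            # stand-in: columns are document archetypes
H = rng.random((r, n_docs))             # stand-in: columns describe the original documents

j = 42                                  # an arbitrary document index
doc_j = W @ H[:, j]                     # archetypes weighted by document j's column of H

dominant = int(np.argmax(H[:, j]))      # latent factor with the largest weight for document j
top_words = np.argsort(W[:, dominant])[::-1][:10]   # indices of that archetype's 10 heaviest words

print(doc_j.shape, dominant, top_words)
```

Assigning each document to its dominant latent factor gives a hard clustering; keeping the whole column of `H` keeps the soft view.
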
* Reconstructing $V$ Matrix:
    * $V$ is approximated by the matrix product of $W$ and $H$
    * How do we reconstruct **only one cell of $V$ (or $X$)**?
        * Take the inner product of the matching row of $W$ and the matching column of $H$
        ![reconstructing_v](reconstructing_v.png)
    * How do we find the non-negative matrices ($W$, $H$) that approximate $V$?
        * With PCA/SVD, there is a **closed form solution** for finding those **factorizations**
        * However, with NMF, there is no such closed form solution, but we can use biconvex optimization via **Alternating Least Squares (ALS)**
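
As a small worked instance of reconstructing one cell, assume $r = 2$ (an assumption made only for illustration) and borrow Document 1's first two entries from the example $W$ table and Feature 1's first two entries from the example $H$ table:

$$\hat{v}_{1,1} = \sum_{a=1}^{r} w_{1,a}\,h_{a,1} = (0.3)(0.3) + (5.1)(10.3) = 0.09 + 52.53 = 52.62$$

Most of the reconstructed value comes from the second topic, because both the document and the feature load heavily on it.
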
* NMF Algorithm
    * Minimize: $$\Vert V-WH \Vert^2$$ with respect to $W$ and $H$ and subject to $W, H \geq 0$
    * **Alternating Least Squares (ALS)** - take advantage of the biconvexity by alternating which matrix, $W$ or $H$, is held fixed, solving for the other's optimal values, and then clipping all the negative values in that solution to 0
        * Finding $W$ and $H$:
            * This is a biconvex optimization problem: it is convex in either $W$ or $H$ alone, but not in both jointly.
            * There is a straightforward way to brute force an approximate solution in this case
            * While there is no closed form solution for $W$ and $H$, if we hold one of these matrices constant there **is a closed form optimum** for the **other**
        * ALS Steps:
            1. Randomly initialize $W$ and $H$ to the **appropriate shapes of matrices**
            2. Repeat the following:
                * Holding $W$ fixed, update $H$ by minimizing sum of squared errors (ensure all entries of $H \geq 0$)
                * Holding $H$ fixed, update $W$ by minimizing sum of squared errors (ensure all entries of $W \geq 0$)
            3. Stop when some threshold is met
                * Decrease in RMSE
                * \# of iterations
        * Pseudo-code for ALS:
            1. Initialize $W$ to small, positive random values
            2. For max number of iterations:
                1. Find the least squares solution to $V=W\times H$ with respect to $H$
                2. Clip negative values in $H$ to 0 ($H \leftarrow \max(H, 0)$)
                3. Find the least squares solution to $V=W\times H$ with respect to $W$
                4. Clip negative values in $W$ to 0 ($W \leftarrow \max(W, 0)$)
        * ALS Pros and Cons:
            * (+) Fast algorithm
            * (+) Works well in practice
            * (-) Non-negativity enforced in an ad hoc way
            * (-) Not guaranteed to find a local minimum (much less global)
            * (-) No convergence theory (i.e. no proof that the iterates settle on a limit, the way the function $y = \frac{1}{x}$ converges to zero as $x$ increases)
    * **Multiplicative Update** - solving the optimization problem with gradient descent on a cost function, then simplifying the gradient descent updates by choosing the step sizes cleverly
        * Cost Function - defining the cost function for this optimization problem
            1. Define how much we're missing in our approximation of $V$ (the reconstruction error, $V - WH$)
            2. Use the generalization of Euclidean distance to matrices, known as the **Frobenius norm**, on the reconstruction error to quantify how well we are approximating $V$
        * Gradient Descent For NMF:
            1. Let the Frobenius norm of the reconstruction error be the quantity that we are trying to minimize: $$\min_{W,H} \Vert V-WH \Vert^2$$
            2. From this, with a little bit of matrix calculus, we can determine that the update rules for a single entry in $W$ and $H$ are: $$W_{i,a} \leftarrow W_{i,a}+\eta_{i,a}[(VH^T)_{i,a}-(WHH^T)_{i,a}]$$ $$H_{a,\mu} \leftarrow H_{a,\mu}+\eta_{a,\mu}[(W^TV)_{a,\mu}-(W^TWH)_{a,\mu}]$$
            3. If we are clever about the value that we choose as our step size, $\eta$, we can reduce those gradient descent updates to what are known as the multiplicative update rules: $$\eta_{i,a} = \frac{W_{i,a}}{(WHH^T)_{i,a}}$$ $$\eta_{a,\mu} = \frac{H_{a,\mu}}{(W^TWH)_{a,\mu}}$$
            4. Rewrite the gradient descent updates as: $$W_{i,a} \leftarrow W_{i,a}\frac{(VH^T)_{i,a}}{(WHH^T)_{i,a}}$$ $$H_{a,\mu} \leftarrow H_{a,\mu}\frac{(W^TV)_{a,\mu}}{(W^TWH)_{a,\mu}}$$
            5. These updates will be iteratively performed from $W$ and $H$ initialized with random, small, positive values
        * Multiplicative Update Rules Steps:
            1. Start with random $W$ and $H$
            2. Repeatedly adjust $W$ and $H$ to make RMSE smaller
                * $W_{i,a} \leftarrow W_{i,a}\frac{(VH^T)_{i,a}}{(WHH^T)_{i,a}}$
                * $H_{a,\mu} \leftarrow H_{a,\mu}\frac{(W^TV)_{a,\mu}}{(W^TWH)_{a,\mu}}$
                * Lee and Seung's popular "multiplicative update rules" offer a compromise between speed and ease of implementation
                * Plain gradient descent is simple but can be slow, and its convergence is sensitive to the choice of step size
            3. Stop when some threshold is met
                * Decrease in RMSE
                * \# of iterations
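
The two procedures above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a production solver: the function names, the fixed iteration counts used as the stopping rule, the small `eps` guard against division by zero, and the synthetic data are all choices made for this example.

```python
import numpy as np

def relative_error(V, W, H):
    """Frobenius norm of the reconstruction error, relative to the norm of V."""
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

def nmf_als(V, r, n_iter=100, seed=0):
    """Alternating least squares with clipping (n_iter must be at least 1)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r))                  # small, positive random start
    for _ in range(n_iter):
        # Hold W fixed: least squares solution of W H = V, then clip negatives to 0.
        H = np.clip(np.linalg.lstsq(W, V, rcond=None)[0], 0, None)
        # Hold H fixed: least squares solution of H^T W^T = V^T, then clip negatives to 0.
        W = np.clip(np.linalg.lstsq(H.T, V.T, rcond=None)[0].T, 0, None)
    return W, H

def nmf_multiplicative(V, r, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative update rules; entries stay non-negative by construction."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        W *= (V @ H.T) / (W @ H @ H.T + eps)         # W <- W * (V H^T) / (W H H^T)
        H *= (W.T @ V) / (W.T @ W @ H + eps)         # H <- H * (W^T V) / (W^T W H)
    return W, H

# Synthetic non-negative data with inner dimension 3 (an assumption for the demo).
rng = np.random.default_rng(1)
V = rng.random((15, 3)) @ rng.random((3, 10))
for name, fit in [("ALS", nmf_als), ("multiplicative", nmf_multiplicative)]:
    W, H = fit(V, r=3)
    print(name, round(relative_error(V, W, H), 4))
```

ALS needs an explicit clipping step to keep the factors non-negative, while the multiplicative rules only ever rescale positive entries by positive ratios.
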
* OLS for Transforming New Data
    * Once we have a factorization set up we might want to be able to project a new document into our latent feature space using the same OLS technique
    * What we have is a new row for our $V$ matrix, $V_{new}$, and we're trying to find its representation in the latent feature space, $W_{new}$ (a new row of $W$), with $H$ held fixed. Solve the equation for $W_{new}$: $$V_{new}=W_{new}\times H_{same}$$ $$(\text{1} \times \text{n}) = (\text{1} \times \mathbf{r}) \times (\mathbf{r} \times \text{n})$$
    * After transposing both sides, this looks very similar to the regression problem below, with $H_{same}^T$ in the role of $X$ and $W_{new}^T$ in the role of $\beta$: $$y=X\times \beta$$ $$(\text{n} \times \text{1}) = (\text{n} \times \text{m}) \times (\text{m} \times \text{1})$$
        * But, it's worth noting that this functionality is available to you built into existing NMF algorithms (e.g. sklearn's transform() method)
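
A minimal sketch of this projection, assuming the same least-squares-then-clip approach used in ALS. The helper name `project_new_rows` and the toy $H$ are hypothetical, and library implementations may solve the non-negative problem differently.

```python
import numpy as np

def project_new_rows(V_new, H):
    """Solve V_new ~ W_new @ H for W_new with H fixed, then clip negatives to 0."""
    # Transposed form: H^T (n x r) times W_new^T (r x k) ~ V_new^T (n x k).
    W_new = np.linalg.lstsq(H.T, V_new.T, rcond=None)[0].T
    return np.clip(W_new, 0, None)

rng = np.random.default_rng(0)
H = rng.random((3, 8))                    # stand-in for a fitted H with r = 3, n = 8

# Build a new document from known topic weights so the answer can be checked.
w_true = np.array([[0.5, 0.0, 2.0]])
V_new = w_true @ H                        # shape (1, n)

W_new = project_new_rows(V_new, H)        # shape (1, r)
print(W_new.round(3))
```

Because `V_new` was built exactly from `w_true`, the projection recovers those weights; for a real document the result is only the closest fit in the latent space.
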
* NMF as "soft" clustering:
    * NMF is considered soft clustering because each latent feature can be viewed as a cluster
        * Each observation can be partially in **more than one "cluster"**
        * There are a number of penalties that have been devised to make factorizations more friendly to interpretation
            * e.g. Simple Models (use Ridge/Lasso regularization)
            * e.g. Complex Models (try to enforce sparsity in $W$ and $H$)
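
The soft-clustering view can be sketched by turning each row of $W$ into membership shares. The values below are the visible entries of Documents 1 and 2 in the example $W$ table; normalizing rows to sum to 1 is an illustrative choice, not something NMF itself requires.

```python
import numpy as np

# Visible entries of the example W table: columns are Topic 1, Topic 2, Topic r.
W = np.array([
    [0.3, 5.1, 1.2],    # Document 1
    [9.76, 0.04, 2.7],  # Document 2
])

# Soft memberships: each document's share of its total weight per topic.
memberships = W / W.sum(axis=1, keepdims=True)

# Hard labels, for comparison: the single topic with the largest weight.
hard_labels = memberships.argmax(axis=1)

print(memberships.round(3))
print(hard_labels)
```

Document 1 sits mostly in the second topic but keeps a visible share of the last one, which a hard label would discard.
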
* PCA/SVD vs NMF:
    * PCA and SVD decompose into three matrices and NMF only two
    * The bases in NMF are not orthogonal like in PCA/SVD
    * The main difference is the **non-negativity constraint** for NMF
        * Why do we care about having all the entries in the factorized matrices be positive (non-negative)? Interpretability of the topics
        * How do we interpret negative values in the decomposed matrices?
        ![pca_vs_nmf](pca_vs_nmf.png)
        * Having **only additive components** of a topic is more interpretable (Non-negative)
    * Summary of PCA/SVD vs NMF:
    
|                          |                             PCA/SVD                           |                  NMF                  |
|:------------------------:|:----------------------------------------------------------:|:-------------------------------------:|
| Dimensionality Reduction | Unsupervised dimensionality reduction                      | Unsupervised dimensionality reduction |
| Coefficients             | Orthogonal vectors with positive and negative coefficients | Non-negative coefficients             |
| Interpretation           | "Holistic"; difficult to interpret                         | "Parts-based"; easier to interpret    |
| Algorithm                | Non-iterative                                              | Iterative (the presented algorithm)   |
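
The sign difference in the table can be seen directly with NumPy's SVD on a small, strictly positive matrix (random data, used only as an illustration). Even though every entry of $V$ is positive, the second singular vector mixes positive and negative entries, since it must be orthogonal to the first.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4)) + 0.1          # strictly positive toy data

# SVD returns three factors: U, the singular values s, and V^T.
U, s, Vt = np.linalg.svd(V, full_matrices=False)

print("exact reconstruction:", np.allclose(U @ np.diag(s) @ Vt, V))
print("second left singular vector:", U[:, 1].round(2))
print("has negative entries:", bool((U[:, 1] < 0).any()))
```

Components with mixed signs cancel one another when combined, which is what makes them harder to read as additive "parts" than NMF's non-negative bases.
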

* Questions For Understanding NMF:
    * What parameter choice must you make before performing NMF?
    * When doing document clustering using NMF:
        * What does a column in the $W$ matrix represent?
        * What does a column in the $H$ matrix represent?
        * How do we combine $W$ and $H$ to reconstitute a document in $V$ (column in $V$)?