---
title: 12.1 Low Rank Approximation
subject: Low Rank Approximation
subtitle: 
short_title: 12.1 Low Rank Approximation
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/10_Ch_11_PCA_Apps/121-Apps.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 20 - Low-Rank Matrix Approximations via the SVD with applications to matrix completion and recommender systems.pdf>`

## Reading

Material related to this page can be found in [Lecture 9](https://web.stanford.edu/class/cs168/l/l9.pdf) from Stanford's CS168 course.

## Learning Objectives

By the end of this page, you should know:
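- how a low-rank assumption lets us fill in the missing entries of a partially known matrix,
- how to form a rank $k$ approximation of a matrix from its SVD,
- the definition and basic properties of the Frobenius norm,
- why the truncated SVD is the best rank $k$ approximation in the Frobenius norm,
- how to choose the rank $k$ in practice.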

## What are the Missing Entries?

Suppose that we run a web streaming service for movies for three of our friends, Amy, Bob, and Carol. It's a very specialized web movie service, with only five movie options: The Matrix, Inception, Star Wars: Episode 1, Moana, and Inside Out. After one month, we ask Amy, Bob, and Carol to rate the movies they've watched from one to five. We collect their ratings in the table [below](#table1) (unrated movies are marked with ?):

:::{table} Movie Ratings
:label: table1
:align: center

|     | The Matrix | Inception | Star Wars: Ep.1 | Moana | Inside Out |
| --- | ---        | ---       |---              |---    |---         |
| Amy | 2 | ? | ? | ? | 5 |
| Bob | ? | 3 | 4 | ? | 2 |
| Carol | ? | ? | 2 | 1 | ? |

:::

We are then asked to recommend to Amy, Bob, and Carol which movie they should watch next. Said another way, we are asked to fill in the unknown ? ratings in the table [above](#table1).

This seems a bit unfair! After all, each of the unknown entries could be any value from 1 to 5! But what if I gave you an additional hint: Amy, Bob, and Carol have the same relative preferences for each movie. For example, Amy likes Inside Out $\frac{5}{2}$ times as much as Bob does, and this ratio is the same across all movies. Mathematically, we are making the assumption that _all rows of the table above are multiples of each other_ (equivalently, that all columns are multiples of each other).

Thus we can conclude that Bob's rating of The Matrix is $\frac{2}{5} \cdot (\text{Amy's rating}) = \frac{4}{5}$. Similarly, Carol's rating of Inception is $\frac{1}{2} \cdot (\text{Bob's rating}) = 1.5$, Carol's rating of Inside Out is $\frac{1}{2} \cdot (\text{Bob's rating}) = 1$, and so on. Here's the completed matrix (rows are Amy, Bob, and Carol; columns are the movies in table order):

$$
M = \begin{bmatrix}
2 & 7.5 & 10 & 5 & 5 \\
0.8 & 3 & 4 & 2 & 2 \\
0.4 & 1.5 & 2 & 1 & 1
\end{bmatrix}
$$

The point of this example is that when you know something about the _structure_ of a partially known matrix, it is sometimes possible to intelligently fill in its missing entries. In this example, the assumption that every column is a multiple of every other column means that $\text{rank}\, M = 1$ (since the column space of $M$ has dimension 1), which is pretty extreme! A more natural and useful assumption is that the matrix $M$ has _low rank_. What rank counts as "low" is application dependent, but it typically means that a matrix $M \in \mathbb{R}^{m \times n}$ has $\text{rank}\, M = r \ll \min\{m,n\}$.
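As a quick sanity check, here is a minimal NumPy sketch confirming that the completed ratings matrix has rank 1 and can be written as an outer product. The variable names (`bob`, `scale`, `M_outer`) are illustrative choices, not part of the lecture.

```python
import numpy as np

# Completed ratings matrix from the example (rows: Amy, Bob, Carol).
M = np.array([
    [2.0, 7.5, 10.0, 5.0, 5.0],
    [0.8, 3.0, 4.0, 2.0, 2.0],
    [0.4, 1.5, 2.0, 1.0, 1.0],
])

# A rank 1 matrix is an outer product: (per-viewer scale) times (one reference row).
bob = M[1]                   # Bob's row serves as the reference row
scale = M[:, 4] / M[1, 4]    # each viewer's Inside Out rating relative to Bob's
M_outer = np.outer(scale, bob)

print(np.linalg.matrix_rank(M))   # numerical rank of M
print(np.allclose(M, M_outer))    # does the outer product reproduce M?
```

The scale vector works out to Amy $= \frac{5}{2}$, Bob $= 1$, Carol $= \frac{1}{2}$, matching the ratios used above.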

This lecture will explore how we can use this idea of structure to solve the matrix completion problem by finding the best low-rank approximation to a partially known matrix. The SVD will of course be our main tool.

## Low-Rank Matrix Approximations: Motivation

Before diving into the math, let's highlight some applications of low-rank matrix approximation:

1. **Compression**: We saw this idea last class, but it's worth revisiting through the lens of low-rank approximations. If the original matrix $M \in \mathbb{R}^{m \times n}$ is described by $mn$ numbers, then a rank $k$ approximation requires $k(m+n)$ numbers. To see this, recall that if $M$ has rank $k$, then we can write its SVD as:

    $$
    M &=  \bm U \em_{m \times k} \bm \Sigma \em_{k \times k} \bm V^T \em_{k \times n}
    \quad \left(\Sigma^{\frac{1}{2}} = \text{diag}(\sigma_1^{\frac{1}{2}}, \ldots, \sigma_k^{\frac{1}{2}})\right) \\
    &= \bm U\Sigma^{\frac{1}{2}} = Y \em_{m \times k} \bm \Sigma^{\frac{1}{2}}V^T = Z^T \em_{k \times n}
    $$
   
    i.e., as a product $M = YZ^T$, where $Y \in \mathbb{R}^{m \times k}$ and $Z \in \mathbb{R}^{n \times k}$ together contain $k(m+n)$ numbers. For example, if $M$ represents a grayscale image (with entries = pixel intensities), $m$ and $n$ are typically in the 100s (or 1000s for HD images), and a modest value of $k$ ($\sim$100-150) is usually enough to give a good approximation of the original image.
    
2. **Updating Huge AI Models**: A modern application of low-rank matrix approximation is "fine-tuning" huge AI models. In the setting of large language models (LLMs) like ChatGPT, we are typically given some huge off-the-shelf model with billions (or more) of parameters. Given this large model, which has been trained on an enormous but generic corpus of text from the web, one often performs "fine-tuning". This fine-tuning is a second round of training, typically using a much smaller domain-specific dataset (for example, the lecture notes for this class could be used to fine-tune a "LinearAlgebraGPT"). The difficulty is that because these models are so big, making these updates is extremely demanding. The 2021 paper [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) argued that fine-tuning updates are generally approximately low-rank and that one can learn these updates in their factored $YZ^T$ forms, allowing model fine-tuning with 1000x-10000x fewer parameters.
    
3. **Denoising**: If $M$ is a noisy version of some "true" matrix that is approximately low-rank, then finding a low-rank approximation to $M$ will typically remove a lot of noise (and maybe some signal), resulting in a matrix that is actually more informative than the original.
    
4. **Matrix Completion**: Low-rank approximations offer a way of solving the matrix completion problem we introduced above. Given a matrix $M$ with missing entries, the first step is to obtain a full matrix $\hat{M}$ by filling in the missing entries with "default" values. What these default values should be often requires trial and error, but natural things to try include 0 or the average of the known entries in the same column, the same row, or the entire matrix. The second step is then to find a rank $k$ approximation to $\hat{M}$. This approach works well when the unknown matrix is close to a rank $k$ matrix and there aren't too many missing entries.
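The two-step matrix completion recipe in item 4 can be sketched in NumPy as follows. This is only an illustration under stated assumptions: missing entries are encoded as `NaN`, the default value is the column average of known entries, every column has at least one known entry, and the helper name `complete_low_rank` and the synthetic test data are hypothetical.

```python
import numpy as np

def complete_low_rank(M_obs, k):
    """Fill NaN entries with column means, then truncate the SVD to rank k.

    Assumes every column of M_obs has at least one known (non-NaN) entry.
    """
    known = ~np.isnan(M_obs)
    col_means = np.nanmean(M_obs, axis=0)         # average of known entries per column
    M_filled = np.where(known, M_obs, col_means)  # step 1: fill in default values
    U, s, Vt = np.linalg.svd(M_filled, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]         # step 2: rank k approximation

# Synthetic data (an assumption for illustration): a rank 2 matrix with 3 hidden entries.
rng = np.random.default_rng(0)
M_true = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
M_obs = M_true.copy()
M_obs[0, 1] = M_obs[3, 4] = M_obs[6, 0] = np.nan

M_hat = complete_low_rank(M_obs, k=2)
print(np.abs(M_hat - M_true).max())  # worst-case entrywise error of the completion
```

How well this works depends on the default values, the choice of $k$, and how many entries are missing, so it is worth experimenting with each of these.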

With this motivation in mind, let's see how the SVD can help us find a good rank $k$ approximation of a matrix $M$. Once we've described our procedure and seen some examples of it in action, we'll make precise how our method actually produces the "best" rank $k$ approximation possible.

## Low-Rank Approximations from the SVD

Consider an $m \times n$ matrix $M \in \mathbb{R}^{m \times n}$, which we'll assume has rank $r$. The SVD of $M$ is given by

\begin{equation}
\label{SVD}
M = U \Sigma V^T = \sum_{i=1}^r \sigma_i \vv u_i \vv v_i^T \quad \text{(SVD)}
\end{equation}

for $U = \bm \vv u_1 \cdots \vv u_r\em \in \mathbb{R}^{m \times r}$, $V = \bm \vv v_1 \cdots \vv v_r\em \in \mathbb{R}^{n \times r}$, and $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$ the matrices of left singular vectors, right singular vectors, and singular values, respectively.

The right-most expression in [(SVD)](#SVD) is particularly convenient for our purposes: it expresses $M$ as a sum of rank 1 matrices $\sigma_i \vv u_i \vv v_i^T$ with mutually orthogonal column and row spaces.

This sum expression suggests a very natural way of forming a rank $k$ approximation to $M$: simply truncate the sum to the top $k$ terms, as measured by the singular values $\sigma_i$:

\begin{equation}
\label{SVD-k}
\hat{M}_k = \sum_{i=1}^k \sigma_i \vv u_i \vv v_i^T = U_k \Sigma_k V_k^T \quad \text{(SVD-k)}
\end{equation}

where the right-most expression is defined in terms of the truncated matrices:

$$
U_k = \bm \vv u_1 \cdots \vv u_k\em \in \mathbb{R}^{m \times k}, \quad V_k = \bm \vv v_1 \cdots \vv v_k\em \in \mathbb{R}^{n \times k}, \quad \Sigma_k = \text{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{R}^{k \times k}
$$
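A minimal NumPy sketch of this truncation is shown below. It checks, on a random test matrix, that the sum of the top $k$ rank 1 terms agrees with the product $U_k \Sigma_k V_k^T$. The helper name `truncated_svd` and the test matrix are illustrative assumptions; `np.linalg.svd` returns the singular values in decreasing order, so slicing the first $k$ keeps the largest ones.

```python
import numpy as np

def truncated_svd(M, k):
    """Return U_k, the top-k singular values, and V_k^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))   # arbitrary test matrix
k = 2
U_k, s_k, Vt_k = truncated_svd(M, k)

# Matrix form: U_k Sigma_k V_k^T
M_hat_matrix = U_k @ np.diag(s_k) @ Vt_k

# Sum form: sigma_1 u_1 v_1^T + ... + sigma_k u_k v_k^T
M_hat_sum = sum(s_k[i] * np.outer(U_k[:, i], Vt_k[i, :]) for i in range(k))

print(np.allclose(M_hat_matrix, M_hat_sum))
```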

Before analyzing the properties of $\hat{M}_k = U_k \Sigma_k V_k^T$, let's examine if $\hat{M}_k$ could plausibly address our motivating applications. Storing the matrices $U_k, V_k,$ and $\Sigma_k$ requires storing $km + kn + k^2 \approx k(m+n)$ numbers if $k \ll \min\{m, n\}$, which is much less than the $mn$ numbers needed to store $M \in \mathbb{R}^{m \times n}$ when $m$ and $n$ are relatively large.
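To make the storage count concrete, here is a small sketch that evaluates both counts for an image-sized example. The sizes $m = n = 1000$ and $k = 100$ are assumed values chosen for illustration, and `storage_counts` is a hypothetical helper.

```python
def storage_counts(m, n, k):
    """Numbers stored for the full matrix versus its rank-k factors."""
    full = m * n                       # all entries of M
    factored = k * m + k * n + k * k   # U_k, V_k, and Sigma_k
    return full, factored

full, factored = storage_counts(m=1000, n=1000, k=100)
print(full, factored, full / factored)
```

For these sizes the factored form needs 210,000 numbers instead of 1,000,000. Since $\Sigma_k$ is diagonal, one could store only its $k$ diagonal entries, which lowers the count slightly further.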

It is also natural to interpret [(SVD-k)](#SVD-k) as approximating the raw data $M$ in terms of $k$ "concepts" (e.g., "sci-fi", "romcom", "drama", "classic"), where the singular values $\sigma_1, \ldots, \sigma_k$ express the "prominence" of the concepts, the rows of $V^T$ and columns of $U$ express the "typical row/column" associated with each concept (e.g., a viewer who likes only sci-fi movies, or a movie liked only by romcom viewers), and the rows of $U$ (or columns of $V^T$) approximately express each row (or column) of $M$ as a linear combination (scaled by $\sigma_1,\ldots,\sigma_k$) of the "typical rows" (or "typical columns").

This method of producing a low-rank approximation is beautiful: we interpret the SVD of a matrix $M$ as a list of "ingredients" ordered by "importance", and we retain only the $k$ most important ingredients. But is this elegant procedure any "good"?

## A Matrix Norm

For an $m\times n$ matrix $M\in\mathbb{R}^{m\times n}$, let $\hat{M}$ be a low-rank approximation of $M$, and define the approximation error as $E = M-\hat{M}$. Intuitively, a "good" approximation will lead to "small" error $E$. But we need to quantify the "size" of $E\in\mathbb{R}^{m\times n}$. We know that for vectors $\vv x\in\mathbb{R}^n$, the right way to quantify the size of $\vv x$ is through its norm $\|\vv x\|$, where $\|\cdot\|$ is a function that satisfies the axioms of a norm:

1. $\|a \vv x\| = |a| \|\vv x\|$ for all $\vv x\in\mathbb{R}^n$, $a \in\mathbb{R}$
2. $\|\vv x\| \geq 0$ for all $\vv x\in\mathbb{R}^n$, with $\|\vv x\|=0$ if and only if $\vv x=0$
3. $\|\vv x+ \vv y\| \leq \|\vv x\| + \|\vv y\|$ for all $\vv x,\vv y\in\mathbb{R}^n$

It turns out we can define functions on the vector space of $m\times n$ matrices that satisfy these same properties: these are called _matrix norms_. We'll introduce one of them here that is particularly relevant to low-rank matrix approximations, but be aware that just as for vectors there are many kinds of matrix norms.

## The Frobenius Norm

:::{prf:definition} Frobenius Norm
:label: frob-norm
The _Frobenius norm_ of an $m\times n$ matrix $M\in\mathbb{R}^{m\times n}$ simply computes the Euclidean norm of $M$ as if it were a vector with $mn$ entries:

\begin{equation}
\label{frob-norm_eqn}
\|M\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n m_{ij}^2} \quad \text{(F)}
\end{equation}
:::

:::{prf:example}
:label: eg_fn

$M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ has $\|M\|_F^2 = 1^2 + 2^2 + 3^2 + 4^2 = 30$

which is the same as $\|\text{vec}(M)\|_2^2 = \left\|\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right\|_2^2$.
:::
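The example above can be reproduced numerically. The sketch below computes the Frobenius norm three ways: with NumPy's built-in `np.linalg.norm(M, "fro")`, directly from the definition, and as the Euclidean norm of the flattened matrix.

```python
import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0]])

fro_builtin = np.linalg.norm(M, "fro")     # built-in Frobenius norm
fro_entries = np.sqrt(np.sum(M**2))        # definition: sqrt of sum of squared entries
fro_vec = np.linalg.norm(M.reshape(-1))    # Euclidean norm of vec(M)

print(fro_builtin**2, fro_entries**2, fro_vec**2)  # each should be (approximately) 30
```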

We need a couple of properties of the Frobenius norm before we can connect the SVD to low-rank matrix approximation.

1. **Property 1: For $A \in \mathbb{R}^{n \times n}$ a square matrix, $\|A\|_F = \|A^T\|_F$**

This isn't too hard to check from the definition of [(F)](#frob-norm_eqn): taking the transpose just swaps the role of $(i,j)$ in the sum, but you still end up adding together the square of all entries in $A$, which are the same as the square of all of the entries in $A^T$.

2. **Property 2: If $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $A \in \mathbb{R}^{n \times n}$ is a square matrix, then $\|QA\|_F = \|AQ\|_F = \|A\|_F$, i.e., the Frobenius norm of a matrix $A$ is unchanged by left or right multiplication by an orthogonal matrix.**

To see why this is true, recall that if $\vv a_1, \ldots, \vv a_n$ are the columns of $A$, so that $A = \bm \vv a_1 \cdots \vv a_n\em$, then $QA = \bm Q \vv a_1 \cdots Q\vv a_n\em$. Then, since we can write the Frobenius norm squared of a matrix as the sum of the Euclidean norm squared of its columns, we have:

$$
\|QA\|_F^2 &= \|Q\vv a_1\|^2 + \cdots + \|Q\vv a_n\|^2 \\
&= \|\vv a_1\|^2 + \cdots + \|\vv a_n\|^2 = \|A\|_F^2
$$

Here, the second equality holds because multiplying a vector by an orthogonal matrix does not change its Euclidean norm. Finally we use this and Property 1 to conclude:

$$
\|AQ\|_F = \|Q^T A^T\|_F \underbrace{=}_{\text{since} \ Q^{\top} \ \text{is also orthogonal}} \|A^T\|_F \underbrace{=}_{\text{property 1}} \|A\|_F
$$
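Both properties are easy to check numerically. In the sketch below, the random orthogonal matrix `Q` is obtained from a QR factorization of a random matrix, which is one convenient (assumed) way to generate a test case; the helper `fro` is just shorthand.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # the Q factor of a QR factorization is orthogonal

def fro(X):
    return np.linalg.norm(X, "fro")

print(np.isclose(fro(A), fro(A.T)))     # Property 1
print(np.isclose(fro(Q @ A), fro(A)))   # Property 2, left multiplication
print(np.isclose(fro(A @ Q), fro(A)))   # Property 2, right multiplication
```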

We will measure the quality of our rank $k$ approximation $\hat{M}_k$ from [(SVD-k)](#SVD-k) in terms of the Frobenius norm of its difference from $M$, i.e., $\|M - \hat{M}_k\|_F$.

The following theorem tells us that the SVD-based approximation [(SVD-k)](#SVD-k) is _optimal with respect to the Frobenius norm of the approximation error!_

:::{prf:theorem} 
:label: svd_approx_thm
For every $m \times n$ matrix $M \in \mathbb{R}^{m \times n}$, every rank target $k \geq 1$, and every rank $k$ $m \times n$ matrix $B \in \mathbb{R}^{m \times n}$,

$$
\|M - \hat{M}_k\|_F \leq \|M - B\|_F
$$

where $\hat{M}_k$ is the rank $k$ approximation derived from the SVD $M = U \Sigma V^T$ as in [(SVD-k)](#SVD-k).
:::
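A numerical experiment cannot prove the theorem, but it can illustrate it. The sketch below compares the error of the truncated SVD against 1000 randomly generated rank $k$ competitors $B = YZ^T$. The matrix sizes, the seed, and the way competitors are sampled are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 5, 2
M = rng.normal(size=(m, n))

# Rank-k approximation from the SVD.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_hat_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
svd_error = np.linalg.norm(M - M_hat_k, "fro")

# Errors of random competitors B = Y Z^T with Y (m x k) and Z^T (k x n).
competitor_errors = [
    np.linalg.norm(M - rng.normal(size=(m, k)) @ rng.normal(size=(k, n)), "fro")
    for _ in range(1000)
]

print(svd_error, min(competitor_errors))
```

Random competitors are typically far from optimal, so this is a weak check; it only confirms that none of the sampled matrices beats the SVD-based approximation.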

We won't formally prove [this theorem](#svd_approx_thm), but let's get some intuition as to why this is true. 

### Understanding [](#svd_approx_thm)

To keep things simple, we'll assume $M$ is square and full rank, i.e., $M \in \mathbb{R}^{n \times n}$ with $\text{rank}\, M= n$. Nearly the same argument works for general $M$, but we have to use the non-compact SVD of $M$ (which keeps zero singular values around).

Our goal is to find a rank $k$ matrix $\hat{M}$ which minimizes $\|\hat{M} - M\|_F^2$. Let $M = U \Sigma V^T$ be the SVD of $M$, where $U, \Sigma, V \in \mathbb{R}^{n\times n}$ since $\text{rank}\, M=n$. By Property 2 of the Frobenius norm, we then have the following sequence of equalities:

$$
\|\hat{M} - M\|_F^2 &= \|\hat{M} - U \Sigma V^T\|_F^2 \\
&= \|U^T(\hat{M} - U \Sigma V^T)\|_F^2 \quad (\|A\|_F = \|QA\|_F) \\
&= \|U^T\hat{M} - \Sigma V^T\|_F^2 \quad (U^TU = I) \\
&= \|(U^T\hat{M} - \Sigma V^T)V\|_F^2 \quad (\|A\|_F = \|AQ\|_F) \\
&= \|U^T\hat{M}V - \Sigma\|_F^2 \quad (V^TV = I)
$$

Now notice that since $\Sigma$ is a diagonal matrix, any nonzero off-diagonal entry in $U^T\hat{M}V$ only adds to our approximation error, so $U^T\hat{M}V$ should be diagonal. Let $\hat{M} = UDV^T$ for some diagonal matrix $D$. Then

\begin{equation}
\label{m_hat_svd}
\|\hat{M} - M\|_F^2 = \|U^T(UDV^T)V - \Sigma\|_F^2 = \|D - \Sigma\|_F^2 = \sum_{i=1}^n (d_{ii} - \sigma_{i})^2.
\end{equation}

Therefore, we want to pick the diagonal entries $d_{ii}$ of $D$ to minimize the right-most expression in [](#m_hat_svd). If there were no rank restriction on $\hat{M}$, we would simply set $d_{ii} = \sigma_{i}$ for all $i$.
However, notice that $\hat{M} = UDV^T$ is an SVD of $\hat{M}$! Therefore, for $\hat{M}$ to be rank $k$, only $k$ of the $d_{ii}$ can be nonzero. Since we can only knock off $k$ of the $(d_{ii} - \sigma_{i})^2$ terms in [](#m_hat_svd), we should knock off the $k$ largest ones, which correspond to the top $k$ singular values, i.e., $d_{ii} = \sigma_{i}$ for $i = 1, \ldots, k$ and $d_{ii} = 0$ for $i = k+1, \ldots, n$.

Then, 
$$\hat{M} = \bm \vv u_1 \ldots \vv u_k \vv u_{k+1} \ldots \vv u_n\em \begin{bmatrix}
\sigma_1 & & & \\
& \ddots & & \\
& & \sigma_k & \\
& & & 0 \\
& & & & \ddots \\
& & & & & 0
\end{bmatrix} \begin{bmatrix}
\vv v_1^T \\
\vdots \\
\vv v_k^T \\
\vv v_{k+1}^T \\
\vdots \\
\vv v_n^T
\end{bmatrix} = \sum_{i=1}^k \sigma_i \vv u_i \vv v_i^T = U_k\Sigma_k V_k^T
$$

is exactly the expression in [(SVD-k)](#SVD-k), and the squared approximation error it incurs is

$$
\|E\|_F^2 = \|\hat{M} - M\|_F^2 = \sum_{i=k+1}^n \sigma_i^2,
$$
i.e., the sum of the squares of the "tail" singular values of $M$.
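The tail-sum formula for the error can be checked numerically. The sketch below builds $\hat{M} = UDV^T$ with the last $n-k$ diagonal entries of $D$ set to zero, for a random square test matrix (an assumed example), and compares the squared Frobenius error against $\sum_{i=k+1}^n \sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
M = rng.normal(size=(n, n))   # random square matrix (full rank with probability 1)

U, s, Vt = np.linalg.svd(M)

# D keeps the top k singular values and zeros out the rest.
D = np.diag(np.concatenate([s[:k], np.zeros(n - k)]))
M_hat = U @ D @ Vt

err_sq = np.linalg.norm(M_hat - M, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)

print(err_sq, tail_sq)   # the two values should agree up to rounding
```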




:::{prf:example}
:label: eg_lec18
Recall the matrix $A = \begin{bmatrix}
4 & 11 & 14 \\
8 & 7 & -2
\end{bmatrix}$ from {doc}`Lecture 18 <../lecture_notes/Lecture 18 - Singular Values and the Singular Value Decomposition.pdf>`, whose SVD we computed as:

$$
A =  \begin{bmatrix}
\frac{3}{\sqrt{10}} & \frac{1}{\sqrt{10}} \\
\frac{1}{\sqrt{10}} & -\frac{3}{\sqrt{10}}
\end{bmatrix} \begin{bmatrix}
6\sqrt{10} & 0 \\
0 & 3\sqrt{10}
\end{bmatrix} \begin{bmatrix}
\frac{1}{3} & \frac{2}{3} & \frac{2}{3} \\
-\frac{2}{3} & -\frac{1}{3} & \frac{2}{3}
\end{bmatrix} = U \Sigma V^T.
$$

$A$ is rank 2, and its rank 1 approximation is, according to [(SVD-k)](#SVD-k), given by

$$
\hat{A}_1 = \begin{bmatrix}
\frac{3}{\sqrt{10}} \\
\frac{1}{\sqrt{10}}
\end{bmatrix} 6\sqrt{10} \begin{bmatrix}
\frac{1}{3} & \frac{2}{3} & \frac{2}{3}
\end{bmatrix} = \begin{bmatrix}
6 & 12 & 12 \\
2 & 4 & 4
\end{bmatrix}
$$

If we compute $\|\hat{A}_1 - A\|_F^2$ we get:
$$
\left\|\begin{bmatrix}
2 & 1 & -2 \\
-6 & -3 & 6
\end{bmatrix}\right\|_F^2 &= 2^2 + 1^2 + (-2)^2 + (-6)^2 + (-3)^2 + 6^2 \\
&= 90
$$

which is exactly $\sigma_2^2 = (3\sqrt{10})^2 = 90$.
:::
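This worked example can be verified with a few lines of NumPy. Note that `np.linalg.svd` may return singular vectors with different signs than the hand computation, but the rank 1 term $\sigma_1 \vv u_1 \vv v_1^T$ is unaffected because the signs of $\vv u_1$ and $\vv v_1$ flip together.

```python
import numpy as np

A = np.array([[4.0, 11.0, 14.0],
              [8.0, 7.0, -2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank 1 approximation: keep only the top singular triple.
A_hat_1 = s[0] * np.outer(U[:, 0], Vt[0, :])
err_sq = np.linalg.norm(A_hat_1 - A, "fro") ** 2

print(s)         # approximately [6*sqrt(10), 3*sqrt(10)]
print(A_hat_1)   # approximately [[6, 12, 12], [2, 4, 4]]
print(err_sq)    # approximately 90
```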



Finally, we address an obvious question when applying these ideas in practice: **how should we pick the rank $k$ of our approximation?**

In a perfect world, the singular values of the original data matrix will give strong guidance: if the top few singular values are much larger than the rest, then the obvious solution is to take $k$ equal to the number of large singular values. This was the case in the handset example from the previous lecture: the first singular value was significantly larger than the others, suggesting that a rank 1 approximation would be a good choice (this was [image (d)](./121-Basics_of_Statistics.ipynb#railroad)).

In less clear settings, the rule of thumb is to take $k$ as small as possible while still providing a "useful" approximation of the original data. For example, it is common to choose $k$ so that the sum of the top $k$ singular values is at least $c$ times as large as the sum of the other singular values. The ratio $c$ is typically a domain-dependent constant picked based on the application.
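The ratio rule of thumb can be sketched as a short function. The name `choose_rank`, the example singular values, and the values of $c$ below are hypothetical, chosen only to illustrate the rule.

```python
import numpy as np

def choose_rank(singular_values, c):
    """Smallest k whose top-k singular values sum to at least c times the rest."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]  # decreasing order
    total = s.sum()
    for k in range(1, len(s) + 1):
        top = s[:k].sum()
        rest = total - top
        if top >= c * rest:
            return k
    return len(s)  # fall back to keeping every singular value

s = [10.0, 1.0, 0.5, 0.5]
print(choose_rank(s, c=4))    # 10 >= 4 * 2, so k = 1 suffices
print(choose_rank(s, c=10))   # 10 < 10 * 2, but 11 >= 10 * 1, so k = 2
```

A larger $c$ demands that more of the "mass" of the singular values be captured, and therefore tends to select a larger $k$.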

