---
title: 11.2 Low Rank
subject:  PCA
subtitle: 
short_title: 11.2 Low Rank
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/10_Ch_11_PCA_Apps/121-Apps.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 20 - Low-Rank Matrix Approximations via the SVD with applications to matrix completion and recommender systems.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be

## Learning Objectives

By the end of this page, you should know:
- 

## What are the Missing Entries?

Suppose that I run a movie streaming service for three of my friends: Amy, Bob, and Carol. It's a very special movie service with only five movies: The Matrix, Inception, Star Wars Episode 1, Moana, and Inside Out. After one month, we ask Amy, Bob, and Carol to rate the movies they've watched. We collect their ratings in the table below (we only allowed ratings from 1 to 5):

\begin{tabular}{|c|c|c|c|c|c|}
\hline
 & The Matrix & Inception & Star Wars Ep.1 & Moana & Inside Out \\
\hline
Amy & 9 & ? & ? & ? & 5 \\
Bob & ? & 3 & ? & ? & 2 \\
Carol & ? & ? & 2 & 1 & 1 \\
\hline
\end{tabular}

We are asked to provide recommendations to Amy, Bob, and Carol as to which movie they should watch next. Said another way, we are asked to fill in the unknown ? ratings in the table above.

This seems a bit unfair! Each of the unknown entries could be any value in 1-5, after all! But what if I gave you an additional hint: Amy, Bob, and Carol have the same relative preferences for each movie. For example, Amy likes Inside Out $\frac{5}{2}$ times as much as Bob does, and this ratio is the same across all movies. In other words, we are assuming that all rows of the table are multiples of each other (and hence so are all columns).

Thus we can conclude that Bob's rating of The Matrix is $\frac{2}{5} \cdot (\text{Amy's rating}) = \frac{2}{5} \cdot 9 = 3.6$. Similarly, Carol's rating of Inception is $\frac{1}{2} \cdot (\text{Bob's rating}) = 1.5$, Amy's rating of Moana is $5 \cdot (\text{Carol's rating}) = 5$, and so on. Here's the completed matrix:

\[
M = \begin{bmatrix}
9 & 7.5 & 10 & 5 & 5 \\
3.6 & 3 & 4 & 2 & 2 \\
1.8 & 1.5 & 2 & 1 & 1
\end{bmatrix}
\]
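The fill-in reasoning above can be sketched in a few lines of NumPy. This is only an illustration under the stated assumption that all rows are multiples of each other; the helper name `complete_rank_one` and the use of `np.nan` to mark missing entries are choices made for this sketch, not part of the lecture.

```python
import numpy as np

# Known ratings from the table; np.nan marks a missing "?" entry.
# Rows: Amy, Bob, Carol.
# Columns: The Matrix, Inception, Star Wars Ep.1, Moana, Inside Out.
ratings = np.array([
    [9.0,    np.nan, np.nan, np.nan, 5.0],
    [np.nan, 3.0,    np.nan, np.nan, 2.0],
    [np.nan, np.nan, 2.0,    1.0,    1.0],
])

def complete_rank_one(partial, ref_col):
    """Fill a partially known matrix, assuming every entry is m_ij = u_i * v_j.

    Hypothetical helper: it assumes column `ref_col` is fully known and
    that every column has at least one known entry.
    """
    # Relative preference of each person, read off the fully known column.
    u = partial[:, ref_col]
    # Recover each column's scale v_j from any one known entry in that column.
    v = np.empty(partial.shape[1])
    for j in range(partial.shape[1]):
        i = np.flatnonzero(~np.isnan(partial[:, j]))[0]
        v[j] = partial[i, j] / u[i]
    return np.outer(u, v)

completed = complete_rank_one(ratings, ref_col=4)
print(completed)
```

Using the Inside Out column as the reference works here because it is the only column all three friends rated.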

The point of this example is that when you know something about the structure of a partially known matrix, it is sometimes possible to intelligently fill in the missing entries. In this example, the assumption that every column is a scalar multiple of every other column (i.e., that the matrix has rank 1) is pretty extreme!

A more natural and useful assumption is that the matrix $M$ has low rank. What rank counts as "low" depends on the application, but it typically means that for a matrix $M \in \mathbb{R}^{m \times n}$, $\text{rank}\, M \ll \min\{m,n\}$.

This lecture will explore how we can use this idea of structure to solve the matrix completion problem by finding the best low-rank approximation to a partially known matrix. The SVD will of course be our main tool.

\section*{Low-Rank Matrix Approximations: Motivation}

Before diving into the math, let's highlight some applications of low-rank matrix approximation:

\begin{enumerate}
    \item Compression: We saw this idea last class, but it's worth revisiting through the lens of low-rank approximation. If the original matrix $M \in \mathbb{R}^{m \times n}$ is described by $mn$ numbers, then a rank $k$ approximation requires $k(m+n)$ numbers. To see this, recall that if $M$ has rank $k$, then we can write its SVD as:
    
    \[
    M = \underbrace{U}_{\in \mathbb{R}^{m \times k}} \underbrace{\Sigma}_{\in \mathbb{R}^{k \times k}} \underbrace{V^T}_{\in \mathbb{R}^{k \times n}}
    \]
    
    \[
    (\Sigma^{1/2} = \text{diag}(\sigma_1^{1/2}, \ldots, \sigma_k^{1/2}))
    \]
    
    \[
    = \underbrace{U\Sigma^{1/2}}_{=Y} \underbrace{\Sigma^{1/2}V^T}_{=Z^T}
    \]
    
    that is, as the product $M = YZ^T$ where $Y \in \mathbb{R}^{m \times k}$ and $Z \in \mathbb{R}^{n \times k}$, which together contain $k(m+n)$ numbers. For example, if $M$ represents a grayscale image (with entries = pixel intensities), $m$ and $n$ are typically in the 100s (or 1000s for HD images), and a modest value of $k$ ($\sim$10-150) is usually enough to give a good approximation of the original image.
    
    \item Updating Huge AI Models: A modern application of low-rank matrix approximation is "fine-tuning" huge AI models. In the setting of large language models (LLMs) like ChatGPT, we are typically given some huge off-the-shelf model with billions (or more) of parameters. Given this large model, which has been trained on an enormous but generic corpus of text from the web, one often performs "fine-tuning": a second round of training, typically on a much smaller, task-specific dataset. The challenge of fine-tuning is that because these models are so big, updating all of their parameters is extremely expensive. The paper "LoRA: Low-Rank Adaptation of Large Language Models" argued that fine-tuning updates are generally approximately low-rank, and that one can learn these updates in their factored $YZ^T$ form, allowing model fine-tuning with up to 10,000x fewer trainable parameters.
    
    \item Denoising: If $M$ is a noisy version of some "true" matrix that is low-rank, then finding a low-rank approximation to $M$ will typically remove a lot of noise (and maybe some signal), resulting in a matrix that is actually more informative than the original.
    
    \item Matrix Completion: Low-rank approximation offers a way of solving the matrix completion problem we mentioned above. Given a matrix $M$ with missing entries, the first step is to obtain a full matrix $\hat{M}$ by filling in the missing entries with "default" values. What these default values should be often requires trial and error, but natural things to try include 0, the average of the known entries in the same column or row, or the average of the known entries of the entire matrix. The second step is then to find a rank $k$ approximation to $\hat{M}$. This approach works well when the unknown matrix is close to a rank $k$ matrix and not too many of its $mn$ entries are missing.
\end{enumerate}
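The two-step matrix completion recipe can be sketched as follows. This is a minimal sketch, assuming missing entries are marked with `np.nan` and that column averages are used as the default values; the function name `complete_by_low_rank` and the synthetic rank-1 test matrix are made up for illustration.

```python
import numpy as np

def complete_by_low_rank(partial, k):
    """Fill missing entries with column averages, then keep the top k SVD terms."""
    known = ~np.isnan(partial)
    # Step 1: default value = average of the known entries in the same column.
    col_means = np.nanmean(partial, axis=0)
    filled = np.where(known, partial, col_means)
    # Step 2: rank-k approximation of the filled-in matrix via the SVD.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Synthetic example: a rank-1 "true" matrix with roughly 20% of entries hidden.
rng = np.random.default_rng(0)
truth = np.outer(rng.uniform(1, 5, size=20), rng.uniform(0.2, 1, size=15))
partial = truth.copy()
hidden = rng.random(truth.shape) < 0.2
partial[hidden] = np.nan

estimate = complete_by_low_rank(partial, k=1)
print(np.abs(estimate - truth)[hidden].mean())  # average error on hidden entries
```

Other default values (0, row averages, the overall average) can be swapped into step 1; which works best is problem dependent.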

With this motivation in mind, let's see how the SVD can help us find a good rank $k$ approximation of a matrix $M$. Once we've described our procedure and seen some examples of it in action, we'll make precise the sense in which our method produces the "best" rank $k$ approximation possible.

\section*{Low-Rank Approximation Using The SVD}

Consider an $m \times n$ matrix $M \in \mathbb{R}^{m \times n}$, which we'll assume has rank $r$. The SVD of $M$ is given by

\[
M = U \Sigma V^T = \sum_{i=1}^r \sigma_i u_i v_i^T \quad \text{(SVD)}
\]

where $U = [u_1 \cdots u_r] \in \mathbb{R}^{m \times r}$, $V = [v_1 \cdots v_r] \in \mathbb{R}^{n \times r}$, and $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r)$, with $\sigma_1 \geq \cdots \geq \sigma_r > 0$,
are the matrices of left singular vectors, right singular vectors, and singular values, respectively.

This right-most expression of (SVD) is a particularly convenient expression for our purposes, which expresses $M$ as a sum of rank 1 matrices $\sigma_i u_i v_i^T$ with mutually orthogonal column and row spaces.
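To make the sum-of-rank-1-terms view concrete, here is a small numerical check, assuming NumPy's `np.linalg.svd` (which returns $V^T$ rather than $V$) and a random test matrix chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 4))

# Reduced SVD: U is 5x4, s holds the singular values (largest first), Vt is 4x4.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Rebuild M one rank-1 term at a time: sigma_i * u_i * v_i^T.
terms = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]
M_rebuilt = sum(terms)

print(np.allclose(M, M_rebuilt))                        # the terms add up to M
print([int(np.linalg.matrix_rank(t)) for t in terms])   # rank of each term
```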

This sum expression suggests a very natural way of finding a rank $k$ approximation to $M$: simply truncate the sum to the top $k$ terms, as measured by the singular values $\sigma_i$:

\[
\hat{M}_k = \sum_{i=1}^k \sigma_i u_i v_i^T = U_k \Sigma_k V_k^T \quad \text{(SVD-k)}
\]

where the right-most expression is defined in terms of the truncated matrices:

\[
U_k = [u_1 \cdots u_k] \in \mathbb{R}^{m \times k}, \quad V_k = [v_1 \cdots v_k] \in \mathbb{R}^{n \times k}, \quad \Sigma_k = \text{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{R}^{k \times k}
\]
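The truncation in (SVD-k) amounts to slicing the output of an SVD routine. The sketch below assumes NumPy; the helper name `truncated_svd` is introduced here for illustration only.

```python
import numpy as np

def truncated_svd(M, k):
    """Return U_k, Sigma_k, V_k so that U_k @ Sigma_k @ V_k.T matches (SVD-k)."""
    # np.linalg.svd returns singular values sorted from largest to smallest.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_k = U[:, :k]             # first k left singular vectors,  m x k
    Sigma_k = np.diag(s[:k])   # top k singular values,          k x k
    V_k = Vt[:k, :].T          # first k right singular vectors, n x k
    return U_k, Sigma_k, V_k

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 4))
U_k, Sigma_k, V_k = truncated_svd(M, k=2)
M_hat = U_k @ Sigma_k @ V_k.T
print(M_hat.shape, np.linalg.matrix_rank(M_hat))
```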

Before analyzing the properties of $\hat{M}_k = U_k \Sigma_k V_k^T$, let's examine whether $\hat{M}_k$ could plausibly address our motivating applications. Storing the matrices $U_k$, $V_k$, and $\Sigma_k$ requires storing at most $k(m+n+k)$ numbers (or only $k(m+n+1)$, since $\Sigma_k$ is diagonal), which is much less than the $mn$ numbers needed to store $M \in \mathbb{R}^{m \times n}$ when $k$ is small relative to $m$ and $n$.
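As a quick back-of-the-envelope check of the storage savings, the following sketch counts the numbers involved. The sizes `m = n = 1000` and `k = 50` are hypothetical values picked for illustration.

```python
def storage_counts(m, n, k):
    """Count the numbers stored for M versus its rank-k factors."""
    full = m * n                    # every entry of M
    dense = k * (m + n + k)         # U_k, V_k, and Sigma_k as a full k x k matrix
    diagonal = k * (m + n + 1)      # U_k, V_k, and only the k diagonal entries
    return full, dense, diagonal

full, dense, diagonal = storage_counts(m=1000, n=1000, k=50)
print(full, dense, diagonal)
print(full / dense)  # compression factor when Sigma_k is stored densely
```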

It is also natural to interpret (SVD-k) as approximating the raw data $M$ in terms of $k$ "concepts" (e.g., "sci-fi", "romance", "drama", "classic"), where the singular values $\sigma_1, \ldots, \sigma_k$ express the "importance" of the concepts, the rows of $V_k^T$ and the columns of $U_k$ express the "typical row/column" associated with each concept (e.g., a viewer who likes sci-fi movies, or a movie liked only by romance viewers), and the rows of $U_k$ (or columns of $V_k^T$) approximately express each row (or column) of $M$ as a linear combination (scaled by $\sigma_1,\ldots,\sigma_k$) of the "typical rows" (or "typical columns").

This method of producing a low-rank approximation is beautiful: we interpret the SVD of a matrix $M$ as a list of "ingredients" ordered by "importance", and we retain only the $k$ most important ingredients. But is this elegant procedure any "good"?

\section*{A Matrix Norm}

For an $m\times n$ matrix $M\in\mathbb{R}^{m\times n}$, let $\hat{M}$ be a low-rank approximation of $M$, and define the approximation error as $E = M-\hat{M}$. Intuitively, a "good" approximation will have a "small" error $E$. But we need to quantify the "size" of $E\in\mathbb{R}^{m\times n}$. We know that for vectors $x\in\mathbb{R}^n$, the right way to quantify the size of $x$ was through its norm $\|x\|$, where $\|\cdot\|$ is a function that needs to satisfy the axioms of a norm:

\begin{enumerate}
    \item $\|\alpha x\| = |\alpha| \|x\|$ for all $x\in\mathbb{R}^n$, $\alpha\in\mathbb{R}$
    \item $\|x\| \geq 0$ for all $x\in\mathbb{R}^n$, with $\|x\|=0$ if and only if $x=0$
    \item $\|x+y\| \leq \|x\| + \|y\|$ for all $x,y\in\mathbb{R}^n$
\end{enumerate}

It turns out we can define functions on the vector space of $m\times n$ matrices that satisfy these same properties: these are called matrix norms. We'll introduce one of these here that is particularly relevant to low-rank matrix approximations, but be aware that just as for vectors there are many kinds of matrix norms.

\section*{The Frobenius Norm}

The Frobenius norm of an $m\times n$ matrix $M\in\mathbb{R}^{m\times n}$ simply computes the Euclidean norm of $M$ as if it were a vector in $\mathbb{R}^{mn}$:

\[
\|M\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n m_{ij}^2} \quad \text{(F)}
\]

Example: $M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ has $\|M\|_F^2 = 1^2 + 2^2 + 3^2 + 4^2 = 30$

which is the same as $\|\text{vec}(M)\|_2^2 = \left\|\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}\right\|_2^2$.
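The example above can be reproduced numerically. This sketch assumes NumPy and computes the norm three ways: directly from definition (F), as the Euclidean norm of the flattened matrix, and with NumPy's built-in `np.linalg.norm(M, "fro")`.

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Definition (F): square every entry, add them up, take the square root.
fro_from_definition = np.sqrt(np.sum(M ** 2))
# Same thing: Euclidean norm of M stacked into a vector of length mn.
fro_as_vector = np.linalg.norm(M.reshape(-1))
# NumPy's built-in Frobenius norm.
fro_builtin = np.linalg.norm(M, "fro")

print(fro_from_definition ** 2, fro_as_vector ** 2, fro_builtin ** 2)
```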

We need a couple of properties of the Frobenius norm before we can connect the SVD to low-rank matrix approximation.

\textbf{Property 1:} For $A \in \mathbb{R}^{n \times n}$ a square matrix, $\|A\|_F = \|A^T\|_F$.

This isn't too hard to check from the definition (F): taking the transpose just swaps the roles of $i$ and $j$ in the sum, but you still end up adding up the squares of all entries of $A$, which are the same as the squares of all entries of $A^T$.

\textbf{Property 2:} If $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $A \in \mathbb{R}^{n \times n}$ is a square matrix, then $\|QA\|_F = \|AQ\|_F = \|A\|_F$, i.e., the Frobenius norm of a matrix $A$ is unaltered by left or right multiplication by an orthogonal matrix.

To see why this is true, recall that if $a_1, \ldots, a_n$ are the columns of $A$, so that $A = [a_1 \cdots a_n]$, then $QA = [Qa_1 \cdots Qa_n]$. Then, since we can write the squared Frobenius norm of a matrix as the sum of the squared Euclidean norms of its columns, we have:

\[
\|QA\|_F^2 = \|Qa_1\|^2 + \cdots + \|Qa_n\|^2
\]
\[
= \|a_1\|^2 + \cdots + \|a_n\|^2 = \|A\|_F^2
\]

Here, the second equality holds because multiplying a vector by an orthogonal matrix does not change its Euclidean norm. Finally, since $Q^T$ is also orthogonal, we can use this together with Property 1 to conclude:

\[
\|AQ\|_F = \|Q^T A^T\|_F = \|A^T\|_F = \|A\|_F
\]
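Both properties are easy to sanity-check numerically. The sketch below assumes NumPy and builds a random orthogonal matrix from a QR factorization, which is just one convenient way to get a test matrix $Q$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

# The Q factor of a QR factorization of a square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

def fro(X):
    """Frobenius norm of X."""
    return np.linalg.norm(X, "fro")

print(np.allclose(Q.T @ Q, np.eye(4)))   # Q is orthogonal
print(fro(A), fro(A.T))                  # Property 1
print(fro(Q @ A), fro(A @ Q), fro(A))    # Property 2
```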

We will measure the quality of our rank $k$ approximation $\hat{M}_k$ from (SVD-k) to $M$ in terms of the Frobenius norm of their difference, $\|M - \hat{M}_k\|_F$.

The following theorem tells us that the SVD-based approximation (SVD-k) is optimal with respect to the Frobenius norm of the approximation error!

\textbf{Theorem:} For every $m \times n$ matrix $M \in \mathbb{R}^{m \times n}$, every rank target $k \geq 1$, and every rank $k$ $m \times n$ matrix $D \in \mathbb{R}^{m \times n}$,

\[
\|M - \hat{M}_k\|_F \leq \|M - D\|_F
\]

where $\hat{M}_k$ is the rank $k$ approximation derived from the SVD $M = U \Sigma V^T$ as in (SVD-k).
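The theorem can be probed (though of course not proved) numerically by comparing the SVD-based error against many competing rank $k$ matrices. In this sketch, the competitors $D = YZ^T$ are built from random factors, which is an arbitrary choice for illustration. The last line also checks a standard fact not derived on this page: the squared error of (SVD-k) equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 8, 6, 2
M = rng.standard_normal((m, n))

# Rank-k approximation from the SVD, as in (SVD-k).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_hat_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
svd_error = np.linalg.norm(M - M_hat_k, "fro")

# Competing rank-k matrices D = Y Z^T with random factors.
competitor_errors = []
for _ in range(1000):
    Y = rng.standard_normal((m, k))
    Z = rng.standard_normal((n, k))
    competitor_errors.append(np.linalg.norm(M - Y @ Z.T, "fro"))

print(svd_error, min(competitor_errors))
print(np.isclose(svd_error, np.sqrt(np.sum(s[k:] ** 2))))
```

Random competitors are typically far from optimal, so this only illustrates the inequality; it does not show that no better $D$ exists.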

We won't formally prove this theorem, but let's get some intuition as to why this is true. To keep things simple, we'll assume $M$ is square and full rank, i.e., $M \in \mathbb{R}^{n \times n}$ with $\text{rank}\, M = n$. Nearly the exact same argument works for general $M$, but we have to be

