---
title: 11.1 Applications
subject:  PCA
subtitle: 
short_title: 11.1 Applications
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/10_Ch_11_PCA_Apps/121-Apps.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 19 - Principal Component Analysis with Applications to Imaging and Data Compression.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be

## Learning Objectives

By the end of this page, you should know:
- 

\title{Principal Component Analysis with Applications to Image Processing and Statistics}


We start with a motivating application from satellite imagery analysis. The Landsat satellites are a pair of imaging satellites that record images of terrain and coastlines. These satellites cover almost every square mile of the Earth's surface every 16 days.

Satellite sensors acquire seven simultaneous images of any given region, with each sensor recording energy from separate wavelength bands: three in the visible light spectrum and four in the infrared and thermal bands.

Each image is digitized and stored as a rectangular array of numbers, with each number representing the signal strength for that pixel. Each of the seven images is one channel of a \textit{multispectral image}.

The seven Landsat images of a given region typically contain a lot of redundant information, as some features will appear across most channels. However, other features, because of their color or temperature, may only appear in one or two channels. A goal of multispectral image processing is to find a way that combines all seven channels of information better than studying each image separately.

One approach, called Principal Component Analysis (PCA), seeks a special weighted (linear) combination of all seven images that condenses them into just one or two composite images. Importantly, we want these composite images to preserve as much of the original variance (features) as possible. In particular, features should be more visible in the composite images than in any of the original individual ones.

This idea, which we'll explore in detail today, is illustrated with some Landsat imagery taken over Portland Valley, Nevada.

\begin{figure}[h]
\centering
\begin{subfigure}{.3\textwidth}
  \includegraphics[width=\linewidth]{spectral_band1.jpg}
  \caption{Spectral band 1: Visible blue}
\end{subfigure}
\begin{subfigure}{.3\textwidth}
  \includegraphics[width=\linewidth]{spectral_band4.jpg}
  \caption{Spectral band 4: Near infrared}
\end{subfigure}
\begin{subfigure}{.3\textwidth}
  \includegraphics[width=\linewidth]{spectral_band7.jpg}
  \caption{Spectral band 7: Mid-infrared}
\end{subfigure}

\begin{subfigure}{.3\textwidth}
  \includegraphics[width=\linewidth]{principal_component1.jpg}
  \caption{Principal component 1: 93.5\%}
\end{subfigure}
\begin{subfigure}{.3\textwidth}
  \includegraphics[width=\linewidth]{principal_component2.jpg}
  \caption{Principal component 2: 5.3\%}
\end{subfigure}
\begin{subfigure}{.3\textwidth}
  \includegraphics[width=\linewidth]{principal_component3.jpg}
  \caption{Principal component 3: 1.2\%}
\end{subfigure}
\caption{Landsat imagery of Portland Valley, Nevada}
\end{figure}

Images from three Landsat spectral bands are shown in (a)-(c); their total information is ``projected'' into the three principal components in (d)-(f). The first component, (d), ``explains'' 93.5\% of the scene features (or variance) found in the original data. In this way, we could compress all of the original data to the single image (d) with only a 6.5\% loss of scene variance.

In general, PCA can be applied to any data that consists of lists of measurements made on a collection of objects or individuals. Such data arise in data mining, machine learning, image processing, speech recognition, facial recognition, and health informatics. As we'll see next, these ``special combinations'' of measurements are computed via the singular vectors of an \textbf{observation matrix}.

\section*{Observation Matrix, Mean, and Covariance}

Let $\mathbf{x}_i \in \mathbb{R}^p$ denote the observation vector obtained from measurement $i$, and suppose that $N$ measurements are obtained, indexed by $i=1,\ldots,N$. The observation matrix $X$ is the $p \times N$ matrix whose $i^{th}$ column is the $i^{th}$ measurement vector $\mathbf{x}_i$:

\[
X = [\mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_N] \in \mathbb{R}^{p \times N}
\]

\textbf{Example:} Suppose that $\mathbf{x}_i \in \mathbb{R}^2$ is a two-dimensional observation given by the weight and height of the $i^{th}$ student at Penn: $\mathbf{x}_i = (w_i, h_i) \in \mathbb{R}^2$. If measurements are obtained from $N$ students, the observation matrix $X \in \mathbb{R}^{2 \times N}$ has the form:

\[
X = \begin{bmatrix}
w_1 & w_2 & \cdots & w_N \\
h_1 & h_2 & \cdots & h_N
\end{bmatrix}
\]
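As a minimal sketch of how such an observation matrix might be assembled in NumPy, consider the snippet below. The weight and height values are made-up numbers for $N = 5$ hypothetical students, and the variable names are illustrative only. Each measurement type becomes a row, so that each column is one observation $\mathbf{x}_i = (w_i, h_i)$.

```python
import numpy as np

# Hypothetical measurements for N = 5 students (made-up numbers).
weights = np.array([60.0, 72.5, 81.0, 55.0, 68.0])       # w_1, ..., w_N
heights = np.array([165.0, 178.0, 183.0, 160.0, 171.0])  # h_1, ..., h_N

# Stack the measurements as rows: column i is the observation x_i = (w_i, h_i).
X = np.vstack([weights, heights])

p, N = X.shape
print("p =", p, "N =", N)
print("first observation x_1 =", X[:, 0])
```

Storing observations as columns matches the $p \times N$ convention used here. Many libraries instead expect one observation per row, in which case `X.T` would be passed.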

The set of observation vectors can be visualized as a two-dimensional scatter plot.

\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{scatter_plot.png}
\caption{A scatter plot of observation vectors $\mathbf{x}_1,\ldots,\mathbf{x}_N$.}
\end{figure}

\textbf{Example:} The three images (a)-(c) above can be thought of as one image composed of three spectral components, as each image gives information about the same region. We can capture this mathematically by associating with each pixel a vector in $\mathbb{R}^3$ whose coordinates list the intensity of that pixel in the three spectral bands. Typically the image is 2000 $\times$ 2000 pixels, so there are 4 million pixels in the image. The observation matrix for this data has 3 rows and 4 million columns, and the data can thus be visualized as a scatter plot of 4 million points in $\mathbb{R}^3$ (see the figure below for a synthetic example).

\begin{figure}[h]
\centering
\includegraphics[width=0.6\textwidth]{spectral_scatter_plot.png}
\caption{A scatter plot of spectral data for a satellite image.}
\label{fig:spectral_scatter}
\end{figure}

\section*{Mean and Covariance}

To understand PCA, we need to understand some basic concepts from statistics. We will review the mean and covariance of a set of observations. For our purposes, these will simply be things we can compute from the data, but you should be aware that these are well-motivated quantities from a statistical perspective. You will learn more about this in ESE 2010, STAT 4300 or ESE 5020.

Let's start with an observation matrix $X \in \mathbb{R}^{p\times N}$, with columns $\mathbf{x}_1,\ldots,\mathbf{x}_N \in \mathbb{R}^p$. The sample mean $\mathbf{m}$ of the observation vectors is given by

\[
\mathbf{m} = \frac{1}{N}(\mathbf{x}_1 + \cdots + \mathbf{x}_N) = \frac{1}{N}\sum_{j=1}^N \mathbf{x}_j.
\]

Another name for the sample mean is the centroid of the data, which we encountered when we learned about the k-means algorithm.

Since PCA is interested in directions of (maximal) variation in our data, it makes sense to subtract off the mean $\mathbf{m}$, as it captures the average behavior of our data set. To that end, define the centered observations to be

\[
\hat{\mathbf{x}}_j = \mathbf{x}_j - \mathbf{m}, \quad j=1,\ldots,N,
\]

and the centered or de-meaned observation matrix

\[
\hat{X} = [\hat{\mathbf{x}}_1 \; \hat{\mathbf{x}}_2 \; \cdots \; \hat{\mathbf{x}}_N].
\]
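The centering step can be sketched in a few lines of NumPy. The helper name `center` and the data values below are hypothetical, chosen only for illustration. The mean is taken across columns (observations), and broadcasting subtracts it from every column.

```python
import numpy as np

def center(X):
    """Return the sample mean m (shape (p,)) and the centered matrix X_hat (p x N)."""
    m = X.mean(axis=1)             # average over the N columns (observations)
    X_hat = X - m[:, np.newaxis]   # subtract m from every column via broadcasting
    return m, X_hat

# Made-up weight/height data: 2 measurements on N = 5 individuals.
X = np.array([[60.0, 72.5, 81.0, 55.0, 68.0],
              [165.0, 178.0, 183.0, 160.0, 171.0]])

m, X_hat = center(X)
print("m =", m)
print("row sums of X_hat =", X_hat.sum(axis=1))  # numerically zero after centering
```

A quick sanity check is that every row of $\hat{X}$ sums to zero, up to floating-point error, since the centered observations have zero mean.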

For example, the figure below shows a centered version of the weight/height data from the scatter plot shown earlier:

\begin{figure}[h]
\centering
\includegraphics[width=0.4\textwidth]{weight_height_scatter.png}
\caption{Weight-height data in mean-deviation form.}
\label{fig:weight_height_scatter}
\end{figure}

Finally, we define the sample covariance matrix $S \in \mathbb{R}^{p\times p}$ as

\[
S = \frac{1}{N} \hat{X}\hat{X}^T.
\]

Since any matrix of the form $AA^T$ is positive semidefinite (can you see why?), so is $S$. Note that $\frac{1}{N-1}$ is sometimes used instead of $\frac{1}{N}$ for statistical considerations beyond the scope of this course (it makes $S$ an unbiased estimator of the ``true'' covariance of the data). We will just use $\frac{1}{N}$.
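The following sketch checks these properties numerically on synthetic data; the random matrix and seed are arbitrary choices, not part of the notes. It uses `np.cov`, which treats rows as variables by default, so it matches the $p \times N$ layout: `bias=True` normalizes by $N$, while the default normalizes by $N-1$.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, for reproducibility only
X = rng.normal(size=(3, 50))         # synthetic data: p = 3 measurements, N = 50 observations
N = X.shape[1]

X_hat = X - X.mean(axis=1, keepdims=True)
S = (X_hat @ X_hat.T) / N            # sample covariance with the 1/N convention

# S is symmetric positive semidefinite: its eigenvalues are nonnegative (up to round-off).
eigenvalues = np.linalg.eigvalsh(S)
print("eigenvalues of S:", eigenvalues)

# np.cov with bias=True uses 1/N; the default uses 1/(N-1).
S_biased = np.cov(X, bias=True)
S_unbiased = np.cov(X)
print("matches 1/N convention:", np.allclose(S, S_biased))
print("1/(N-1) version is a rescaling:", np.allclose(S_unbiased, S * N / (N - 1)))
```

The two conventions differ only by the scalar factor $\frac{N}{N-1}$, so they share the same eigenvectors, and the difference becomes negligible as $N$ grows.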

\textbf{Example:} Three measurements are made on each of four individuals in a random sample from a population. The observation vectors are:

\[
\mathbf{x}_1 = \begin{pmatrix} 1 \\ 1 \\ 4 \end{pmatrix}, \quad
\mathbf{x}_2 = \begin{pmatrix} 2 \\ 5 \\ 7 \end{pmatrix}, \quad
\mathbf{x}_3 = \begin{pmatrix} 7 \\ 8 \\ 9 \end{pmatrix}, \quad
\mathbf{x}_4 = \begin{pmatrix} 6 \\ 2 \\ 5 \end{pmatrix}
\]

The sample mean is $\mathbf{m} = \frac{1}{4}(\mathbf{x}_1+\mathbf{x}_2+\mathbf{x}_3+\mathbf{x}_4) = \begin{pmatrix} 4 \\ 4 \\ 6.25 \end{pmatrix}$.

The centered observations $\hat{\mathbf{x}}_j = \mathbf{x}_j - \mathbf{m}$ are then

\[
\hat{\mathbf{x}}_1 = \begin{pmatrix} -3 \\ -3 \\ -2.25 \end{pmatrix}, \quad
\hat{\mathbf{x}}_2 = \begin{pmatrix} -2 \\ 1 \\ 0.75 \end{pmatrix}, \quad
\hat{\mathbf{x}}_3 = \begin{pmatrix} 3 \\ 4 \\ 2.75 \end{pmatrix}, \quad
\hat{\mathbf{x}}_4 = \begin{pmatrix} 2 \\ -2 \\ -1.25 \end{pmatrix},
\]

and the centered observation matrix is

\[
\hat{X} = \begin{pmatrix}
-3 & -2 & 3 & 2 \\
-3 & 1 & 4 & -2 \\
-2.25 & 0.75 & 2.75 & -1.25
\end{pmatrix}.
\]

The sample covariance matrix is

\[
S = \frac{1}{4} \hat{X}\hat{X}^T = \begin{pmatrix}
6.5 & 3.75 & 2.75 \\
3.75 & 7.5 & 5.25 \\
2.75 & 5.25 & 3.6875
\end{pmatrix}.
\]
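The computations in this example can be reproduced with a short NumPy script. This is only a sketch using the four observation vectors given above as columns, with the $\frac{1}{N}$ normalization adopted in these notes.

```python
import numpy as np

# Columns are the observation vectors x_1, ..., x_4 from the example.
X = np.array([[1.0, 2.0, 7.0, 6.0],
              [1.0, 5.0, 8.0, 2.0],
              [4.0, 7.0, 9.0, 5.0]])
N = X.shape[1]

m = X.mean(axis=1, keepdims=True)   # sample mean, kept as a 3 x 1 column
X_hat = X - m                       # centered observation matrix
S = (X_hat @ X_hat.T) / N           # sample covariance matrix (1/N convention)

print("m =", m.ravel())
print("X_hat =\n", X_hat)
print("S =\n", S)
```

Printing `X_hat` and `S` gives a quick way to check hand calculations entry by entry.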

You might be wondering what the entries $s_{ij}$ of the covariance matrix S mean. Let's take a bit of a closer look. We'll consider a small example where the observations $\mathbf{x}_i \in \mathbb{R}^2$ are two dimensional, and assume we have $N=3$ observations. Let the first measurement be $a \in \mathbb{R}$ and the second be $b \in \mathbb{R}$, so that $\mathbf{x}_i = (a_i, b_i) \in \mathbb{R}^2$ and the centered observation is $\hat{\mathbf{x}}_i = (\hat{a}_i, \hat{b}_i) \in \mathbb{R}^2$. Our centered observation matrix is then

\[
\hat{X} = \begin{bmatrix}
\hat{a}_1 & \hat{a}_2 & \hat{a}_3 \\
\hat{b}_1 & \hat{b}_2 & \hat{b}_3
\end{bmatrix}
= \begin{bmatrix}
\hat{\mathbf{a}}^T \\
\hat{\mathbf{b}}^T
\end{bmatrix},
\]

where we defined $\hat{\mathbf{a}} = (\hat{a}_1, \hat{a}_2, \hat{a}_3) \in \mathbb{R}^3$ and $\hat{\mathbf{b}} = (\hat{b}_1, \hat{b}_2, \hat{b}_3)$ as the vectors in $\mathbb{R}^3$ containing all of the centered first and second measurements, respectively.

Then, we can write our sample covariance matrix as:

\[
S = \frac{1}{3} \hat{X}\hat{X}^T = \frac{1}{3} \begin{bmatrix}
\hat{\mathbf{a}}^T \\
\hat{\mathbf{b}}^T
\end{bmatrix}
\begin{bmatrix}
\hat{\mathbf{a}} & \hat{\mathbf{b}}
\end{bmatrix}
= \frac{1}{3}
\begin{bmatrix}
\|\hat{\mathbf{a}}\|^2 & \hat{\mathbf{a}}^T\hat{\mathbf{b}} \\
\hat{\mathbf{a}}^T\hat{\mathbf{b}} & \|\hat{\mathbf{b}}\|^2
\end{bmatrix}.
\]

The diagonal entry $s_{11} = \frac{\|\hat{\mathbf{a}}\|^2}{3}$ is called the variance of measurement 1.

Expanding it out:

\[
s_{11} = \frac{1}{3}\|\hat{\mathbf{a}}\|^2 = \frac{1}{3}(\hat{a}_1^2 + \hat{a}_2^2 + \hat{a}_3^2)
= \frac{1}{3}((a_1-m_1)^2 + (a_2-m_1)^2 + (a_3-m_1)^2)
\]

We see that $s_{11}$ captures how much the first measurement $a_i$ deviates from its mean value $m_1$ (the first entry of the sample mean $\mathbf{m}$), on average, i.e., it measures how much $a_i$ varies relative to its mean. Similarly, $s_{22} = \frac{\|\hat{\mathbf{b}}\|^2}{3}$ is the variance of measurement 2.

Now let's look at the off-diagonal term $s_{12} = s_{21} = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3}$. Recall from our work on inner products that $\hat{\mathbf{a}}^T\hat{\mathbf{b}} = \|\hat{\mathbf{a}}\| \|\hat{\mathbf{b}}\| \cos \theta$, where $\theta$ is the angle between $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$. We can view

\[
\cos \theta = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{\|\hat{\mathbf{a}}\| \|\hat{\mathbf{b}}\|}
\]

as a measure of how well aligned, or correlated, $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are. If $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are parallel, $\cos \theta = 1$ or $-1$, and if $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are perpendicular, $\cos \theta = 0$. This lets us interpret $s_{12} = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3}$, which is proportional to $\cos \theta$, as a measure of how similarly $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ behave. If $\hat{\mathbf{a}}^T\hat{\mathbf{b}}$ is positive, this means $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ tend to move up or down together; if it is negative they tend to move in opposite directions; and if it is small (or zero), $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ tend to move independently of each other. Since $s_{12}$ captures how the 1st and 2nd measurements vary with each other, it is called their covariance.
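To make the interpretation above concrete, here is a small sketch with $N = 3$ made-up observations of two measurements; the numbers are hypothetical. It computes the variances, the covariance, and $\cos\theta$, and compares $\cos\theta$ with `np.corrcoef`, which returns the sample correlation coefficient of the two measurements.

```python
import numpy as np

# Hypothetical data: N = 3 observations of measurements a and b.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 7.0])
N = a.size

# Centered measurement vectors a_hat and b_hat.
a_hat = a - a.mean()
b_hat = b - b.mean()

s11 = (a_hat @ a_hat) / N   # variance of measurement 1
s22 = (b_hat @ b_hat) / N   # variance of measurement 2
s12 = (a_hat @ b_hat) / N   # covariance of measurements 1 and 2

# Cosine of the angle between the centered measurement vectors.
cos_theta = (a_hat @ b_hat) / (np.linalg.norm(a_hat) * np.linalg.norm(b_hat))

print("s11 =", s11, "s22 =", s22, "s12 =", s12)
print("cos(theta) =", cos_theta)
print("np.corrcoef =", np.corrcoef(a, b)[0, 1])
```

Here $b$ increases whenever $a$ does, so the covariance is positive and $\cos\theta$ is close to 1. Unlike $s_{12}$, the value of $\cos\theta$ does not change if either measurement is rescaled by a positive constant (for example, by a change of units).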

