---
title: 11.1 Basics of Statistics
subject:  Principal Component Analysis (PCA)
subtitle: 
short_title: 11.1 Basics of Statistics
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/10_Ch_11_PCA_Apps/121-Basics_of_Statistics.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 19 - Principal Component Analysis with Applications to Imaging and Data Compression.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be found in LAA 7.5 and ALA 8.8.

## Learning Objectives

By the end of this page, you should know:
- 

## Motivation: Satellite Imagery


We start with a motivating application from satellite imagery analysis. The [Landsat](https://landsat.gsfc.nasa.gov/) satellites are a pair of imaging satellites that record images of terrain and coastlines. These satellites cover almost every square mile of the Earth's surface every 16 days.

Satellite sensors acquire seven simultaneous images of any given region, with each sensor recording energy from separate wavelength bands: three in the visible light spectrum and four in the infrared and thermal bands.

Each image is digitized and stored as a rectangular array of numbers, with each number representing the signal intensity at the corresponding _pixel_. Each of the seven images is one channel of a _multichannel or multispectral image_.

The seven Landsat images of a given region typically contain a lot of redundant information, as some features will appear across most channels. However, other features, because of their color or temperature, may only appear in one or two channels. A goal of multispectral image processing is to view the data in a way that extracts information better than studying each image separately.

One approach, called _Principal Component Analysis (PCA)_, seeks a special weighted linear combination of all seven images that condenses them into just one or two composite images. Importantly, we want these one or two composite images, or _principal components_, to capture as much of the scene variance (features) as possible; in particular, features should be more visible in the composite images than in any of the original individual ones.

This idea, which we'll explore in detail today, is illustrated with some Landsat imagery taken over Railroad Valley, Nevada.

:::{figure}../figures/12-railroad.jpg
:label:railroad
:alt:Railroad Satellite Imagery
:width: 600px
:align: center
:::

Images from three Landsat spectral bands are shown in [(a)-(c)](#railroad); the total information in these images is "rearranged" into the three principal components in [(d)-(f)](#railroad). The first component, (d), "explains" 93.5\% of the scene features (or variance) found in the original data. In this way, we could compress all of the original data to the single image (d) with only a 6.5\% loss of scene variance.

PCA can in general be applied to any data that consists of lists of measurements made on a collection of objects or individuals, with applications in data mining, machine learning, image processing, speech recognition, facial recognition, and health informatics. As we'll see next, these "special combinations" of measurements are computed via the singular vectors of an _observation matrix_.



## Observation Matrix

Let $\mathbf{x}_j \in \mathbb{R}^p$ denote the observation vector obtained from measurement $j$, and suppose that $N$ measurements are obtained, indexed by $j=1,\ldots,N$. The _observation matrix $X \in \mathbb{R}^{p \times N}$_ is a $p \times N$ matrix with $j^{th}$ column equal to the $j^{th}$ measurement vector $\mathbf{x}_j$:

\begin{equation}
\label{obs_mat}
X = \bm \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N\em \in \mathbb{R}^{p \times N}
\end{equation}

::::{prf:example}
:label: eg_1
Suppose that $\mathbf{x}_j \in \mathbb{R}^2$ is a two-dimensional data point given by the weight and height of the $j^{th}$ student at Penn: $\mathbf{x}_j = (w_j, h_j) \in \mathbb{R}^2$. Then if measurements are obtained from $N$ students, the observation matrix $X \in \mathbb{R}^{2 \times N}$ has the form:

$$
X = \begin{bmatrix}
w_1 & w_2 & \cdots & w_N \\
\underbrace{h_1}_{\vv x_1} & \underbrace{h_2}_{\vv x_2} & \cdots & \underbrace{h_N}_{\vv x_N}
\end{bmatrix}
$$

The set of observation vectors can be visualized as a two-dimensional _scatter plot_:

:::{figure}../figures/12-scatter.jpg
:label:scatter
:alt:scatter plot
:width: 300px
:align: center
:::
::::
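As a minimal sketch of the example above, the observation matrix can be assembled in NumPy by stacking each observation as a column. The weight/height values below are made up for illustration (they are not real data), and the variable names are our own.

```python
import numpy as np

# Hypothetical (weight in kg, height in cm) measurements for N = 5 students.
observations = [
    (62.0, 165.0),
    (75.5, 180.0),
    (58.2, 160.5),
    (81.0, 185.0),
    (68.4, 172.0),
]

# Stack each observation x_j as a column, so X has shape (p, N) = (2, 5).
X = np.column_stack([np.array(x_j) for x_j in observations])

print(X.shape)   # (2, 5)
print(X[0])      # first row: all weights w_1, ..., w_N
print(X[:, 0])   # first column: the observation x_1 = (w_1, h_1)
```

Storing observations as columns matches the convention $X \in \mathbb{R}^{p \times N}$ used on this page; many data-science libraries use the transposed layout (one observation per row), so it is worth checking which one a given function expects.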

::::{prf:example}
:label: eg_2 
The three images [(a)-(c)](#railroad) above can be thought of as _one image_ composed of _three spectral components_, as each image gives information about the same region. We can capture this mathematically by associating a vector in $\mathbb{R}^3$ to each pixel (one small area of the image) that lists the intensity for that pixel in the three spectral bands. Typically the image is 2000 $\times$ 2000 pixels, so there are 4 million pixels in the image. The observation matrix for this data is a matrix with 3 rows and 4 million columns. The data can thus be visualized as a scatter plot of 4 million points in $\mathbb{R}^3$ (see [Figure below](#scatter_satellite) for a synthetic example).
:::{figure}../figures/12-scatter_satellite.jpg
:label:scatter_satellite
:alt:scatter plot satellite
:width: 300px
:align: center
:::
::::
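The pixel-to-column bookkeeping in the example above can be sketched as follows. We assume the image is stored as a NumPy array of shape `(height, width, bands)`; the tiny random image is a synthetic stand-in, not Landsat data.

```python
import numpy as np

# Synthetic stand-in for a 3-band image stored as height x width x bands.
rng = np.random.default_rng(0)
height, width, bands = 4, 5, 3
image = rng.random((height, width, bands))

# Each pixel becomes one column of X and each spectral band one row,
# so X has shape (bands, height * width) = (3, 20).
X = image.reshape(height * width, bands).T

# With NumPy's default row-major ordering, column k of X holds the
# pixel at (row, col) = divmod(k, width).
k = 7
row, col = divmod(k, width)
print(X.shape)
print(np.array_equal(X[:, k], image[row, col]))
```

For a 2000 $\times$ 2000 image the same reshape would give a matrix with 3 rows and 4 million columns.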

## Mean and Covariance

To understand PCA, we need to understand some basic concepts from statistics. We will review the _mean_ and _covariance_ of a set of observations $\vv x_1, \ldots, \vv x_N$. For our purposes, these will simply be things we can compute from the data, but you should be aware that these are well-motivated quantities from a statistical perspective; you will learn more about this in ESE 3010, STAT 4300, or ESE 4020.

Let's start with an observation matrix $X \in \mathbb{R}^{p\times N}$, with columns $\mathbf{x}_1,\ldots,\mathbf{x}_N \in \mathbb{R}^p$. 

:::{prf:definition} Sample Mean/Centroid
:label: mean_defn
The _sample mean $\mathbf{m}$_ of the observation vectors $\vv x_1, \ldots, \vv x_N$ is given by

$$
\mathbf{m} = \frac{1}{N}\left(\mathbf{x}_1 + \cdots + \mathbf{x}_N\right) = \frac{1}{N}\sum_{j=1}^N \mathbf{x}_j.
$$

Another name for the sample mean is the _centroid_ of the data, which we encountered when we learned about the k-means algorithm.
:::
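A short sketch of the sample mean computation, assuming observations are stored as the columns of a NumPy array. The matrix `X` below is a made-up example.

```python
import numpy as np

# Hypothetical observation matrix with p = 2 measurements and N = 4 observations.
X = np.array([[1.0, 3.0, 5.0, 7.0],
              [2.0, 2.0, 6.0, 10.0]])
N = X.shape[1]

# Add up the columns and divide by N, exactly as in the definition.
m = X.sum(axis=1) / N

# NumPy's built-in mean along the observation axis gives the same vector.
assert np.allclose(m, X.mean(axis=1))
print(m)  # [4. 5.]
```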

Since PCA is interested in directions of (maximal) variation in our data, it makes sense to subtract off the mean $\mathbf{m}$, as it captures the average behavior of our data set. To that end, define the _centered observations_ to be

$$
\hat{\mathbf{x}}_j = \mathbf{x}_j - \mathbf{m}, \quad j=1,\ldots,N,
$$

and the _centered or de-meaned observation matrix_

$$
\hat{X} = \bm \hat{\mathbf{x}}_1 & \hat{\mathbf{x}}_2 & \cdots & \hat{\mathbf{x}}_N\em.
$$

For example, the [figure below](#centered) shows a centered version of the weight/height data illustrated in the [scatter plot above](#scatter):

:::{figure}../figures/12-centered.jpg
:label:centered
:alt:scatter plot centered
:width: 300px
:align: center
:::
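Centering can be sketched in one line using NumPy broadcasting. The data below is hypothetical; the only thing to be careful about is keeping the mean as a $p \times 1$ column so that it is subtracted from every column of $X$.

```python
import numpy as np

# Hypothetical observation matrix (p = 2, N = 4), one observation per column.
X = np.array([[1.0, 3.0, 5.0, 7.0],
              [2.0, 2.0, 6.0, 10.0]])

# keepdims=True keeps m as a (p, 1) column, so X - m subtracts m from each column.
m = X.mean(axis=1, keepdims=True)
X_hat = X - m

print(X_hat)
# After centering, every row of X_hat averages to zero.
print(np.allclose(X_hat.mean(axis=1), 0.0))
```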

:::{prf:definition} Sample Covariance Matrix
:label: var_defn
Finally, we define the _sample covariance matrix $S \in \mathbb{R}^{p\times p}$_ as

$$
S = \frac{1}{N} \hat{X}\hat{X}^T.
$$
:::

Since any matrix of the form $AA^T$ is positive semidefinite (can you see why?), so is $S$. Note that $\frac{1}{N-1}$ is sometimes used as the normalization instead; this is motivated by statistical considerations beyond the scope of this course (it makes $S$ an unbiased estimator of the "true" covariance of the data). We will just use $\frac{1}{N}$.
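The following sketch computes $S$ for randomly generated data and numerically checks the two remarks above: that $S$ is positive semidefinite, and how the $\frac{1}{N}$ and $\frac{1}{N-1}$ normalizations relate. The helper name `sample_covariance` is our own; `np.cov` is NumPy's built-in, which treats rows as variables by default and uses $\frac{1}{N-1}$ unless `bias=True` is passed.

```python
import numpy as np

def sample_covariance(X):
    """Return S = (1/N) * X_hat @ X_hat.T for a p x N observation matrix X."""
    N = X.shape[1]
    X_hat = X - X.mean(axis=1, keepdims=True)
    return X_hat @ X_hat.T / N

# Synthetic data: p = 3 measurements, N = 50 observations.
rng = np.random.default_rng(1)
p, N = 3, 50
X = rng.normal(size=(p, N))
S = sample_covariance(X)

# S is symmetric, and its eigenvalues are nonnegative (up to round-off).
eigenvalues = np.linalg.eigvalsh(S)
print(np.allclose(S, S.T), eigenvalues.min() >= -1e-12)

# bias=True selects the 1/N normalization used on this page.
print(np.allclose(S, np.cov(X, bias=True)))

# The default np.cov uses 1/(N-1), which only rescales S by N/(N-1).
print(np.allclose(S * N / (N - 1), np.cov(X)))
```

For large $N$ the two normalizations differ very little, which is one reason the choice does not matter much for our purposes.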

:::{prf:example}
:label: eg_covariance
Three measurements are made on each of four individuals in a random sample from a population. The observation vectors are:

$$
\mathbf{x}_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} 4 \\ 2 \\ 13 \end{bmatrix}, \quad
\mathbf{x}_3 = \begin{bmatrix} 7 \\ 8 \\ 1 \end{bmatrix}, \quad
\mathbf{x}_4 = \begin{bmatrix} 8 \\ 4 \\ 5 \end{bmatrix}
$$

The sample mean is $\mathbf{m} = \frac{1}{4}\left(\mathbf{x}_1+\mathbf{x}_2+\mathbf{x}_3+\mathbf{x}_4\right) = \begin{bmatrix} 5 \\ 4 \\ 5 \end{bmatrix}$.

The centered observations $\hat{\mathbf{x}}_j = \mathbf{x}_j - \mathbf{m}$ are then

$$
\hat{\mathbf{x}}_1 = \begin{bmatrix} -4 \\ -2 \\ -4 \end{bmatrix}, \quad
\hat{\mathbf{x}}_2 = \begin{bmatrix} -1 \\ -2 \\ 8 \end{bmatrix}, \quad
\hat{\mathbf{x}}_3 = \begin{bmatrix} 2 \\ 4 \\ -4 \end{bmatrix}, \quad
\hat{\mathbf{x}}_4 = \begin{bmatrix} 3 \\ 0 \\ 0 \end{bmatrix},
$$

and the centered observation matrix is

$$
\hat{X} =\begin{bmatrix}
-4 & -1 & 2 & 3 \\
-2 & -2 & 4 & 0 \\
-4 & 8 & -4 & 0
\end{bmatrix}.
$$

The sample covariance matrix is

$$
S = \frac{1}{4} \hat{X}\hat{X}^T = \begin{bmatrix}
7.5 & 4.5 & 0 \\
4.5 & 6 & -6 \\
0 & -6 & 24
\end{bmatrix}.
$$
:::
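The numbers in the example above can be checked with a few lines of NumPy. This is only a verification sketch, using the $\frac{1}{N}$ normalization adopted on this page.

```python
import numpy as np

# Observation vectors x_1, ..., x_4 from the example, stored as columns.
X = np.array([[1, 4, 7, 8],
              [2, 2, 8, 4],
              [1, 13, 1, 5]], dtype=float)
N = X.shape[1]

m = X.mean(axis=1)          # sample mean
X_hat = X - m[:, None]      # centered observation matrix
S = X_hat @ X_hat.T / N     # sample covariance with 1/N normalization

print(m)  # [5. 4. 5.]
print(X_hat)
print(S)
```

The printed `X_hat` and `S` should match the centered observation matrix and covariance matrix worked out by hand.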

You might be wondering what the entries $s_{ij}$ of the covariance matrix $S$ mean. Let's take a closer look. We'll consider a small example where the observations $\mathbf{x}_j \in \mathbb{R}^2$ are two-dimensional, and assume we have $N=3$ observations. Let the first measurement be $a \in \mathbb{R}$ and the second $b \in \mathbb{R}$, so that $\mathbf{x}_i = (a_i, b_i) \in \mathbb{R}^2$ and the centered observation is $\hat{\mathbf{x}}_i = (\hat{a}_i, \hat{b}_i) \in \mathbb{R}^2$. Our centered observation matrix is then

$$
\hat{X} = \bm \hat{a}_1 & \hat{a}_2 & \hat{a}_3 \\ 
           \hat{b}_1 & \hat{b}_2 & \hat{b}_3 \em  = \bm \hat{\mathbf{a}}^T \\ \hat{\mathbf{b}}^T \em,
$$

where we defined $\hat{\mathbf{a}} = (\hat{a}_1, \hat{a}_2, \hat{a}_3) \in \mathbb{R}^3$ and $\hat{\mathbf{b}} = (\hat{b}_1, \hat{b}_2, \hat{b}_3)$ as the vectors in $\mathbb{R}^3$ containing all of the centered first and second measurements, respectively.

Then, we can write our sample covariance matrix as:

$$
S = \frac{1}{3} \hat{X}\hat{X}^T = \frac{1}{3} \bm \hat{\mathbf{a}}^T \\ \hat{\mathbf{b}}^T \em \bm \hat{\mathbf{a}} & \hat{\mathbf{b}} \em  =  
\begin{bmatrix}
\frac{\|\hat{\mathbf{a}}\|^2}{3} & \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3} \\
\frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3} & \frac{\|\hat{\mathbf{b}}\|^2}{3}
\end{bmatrix}.
$$

The diagonal entry $s_{11} = \frac{\|\hat{\mathbf{a}}\|^2}{3}$ is called the variance of measurement 1.

Expanding it out:

$$
\begin{aligned}
s_{11} = \frac{\|\hat{\mathbf{a}}\|^2}{3} &= \frac{1}{3}(\hat{a}_1^2 + \hat{a}_2^2 + \hat{a}_3^2) \\
&= \frac{1}{3}((a_1-m_1)^2 + (a_2-m_1)^2 + (a_3-m_1)^2),
\end{aligned}
$$

we see that $s_{11}$ captures how much the first measurement $a_i$ deviates from its mean value $m_1$ (the first entry of the sample mean $\mathbf{m}$), on average, i.e., it measures how much $a_i$ varies relative to its mean. Similarly, $s_{22} = \frac{\|\hat{\mathbf{b}}\|^2}{3}$ is the variance of measurement 2.
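To make the interpretation of the diagonal entries concrete, here is a small numerical sketch with $N=3$ made-up observations of two measurements. It checks that the diagonal of $S$ agrees with the squared-norm formula and with NumPy's `np.var`, which uses the $\frac{1}{N}$ normalization by default.

```python
import numpy as np

# Hypothetical values of the first (a) and second (b) measurement, N = 3.
a = np.array([2.0, 4.0, 9.0])
b = np.array([1.0, 5.0, 3.0])
X = np.vstack([a, b])       # observation matrix, shape (2, 3)
N = X.shape[1]

X_hat = X - X.mean(axis=1, keepdims=True)
S = X_hat @ X_hat.T / N

# s_11 from the formula ||a_hat||^2 / N.
a_hat = a - a.mean()
s11 = a_hat @ a_hat / N

print(S[0, 0], s11, np.var(a))   # all three agree
print(S[1, 1], np.var(b))        # variance of measurement 2
```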

Now let's look at the off-diagonal term $s_{12} = s_{21} = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3}$. Recall from our work on inner products that $\hat{\mathbf{a}}^T\hat{\mathbf{b}} = \|\hat{\mathbf{a}}\| \|\hat{\mathbf{b}}\| \cos \theta$, where $\theta$ is the angle between $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$. We can view

$$
\cos \theta = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{\|\hat{\mathbf{a}}\| \|\hat{\mathbf{b}}\|}
$$

as a measure of how well aligned, or _correlated_, $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are: if $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are parallel, $\cos \theta = 1$ or $-1$, and if $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are perpendicular, $\cos \theta = 0$. This lets us interpret $s_{12} = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3}$, which is proportional to $\cos \theta$, as a measure of how similarly the two measurements deviate from their means: if $\hat{\mathbf{a}}^T\hat{\mathbf{b}}$ is positive, the two measurements tend to move up or down together; if it is negative, they tend to move in opposite directions; and if it is small (or zero), they tend to move without any consistent (linear) relationship to each other. Since $s_{12}$ captures how the 1st and 2nd measurements vary with each other, it is called their _covariance_.
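The relationship between the covariance $s_{12}$ and $\cos \theta$ can be checked numerically. The sketch below uses made-up data; `np.corrcoef` is NumPy's built-in correlation coefficient, which coincides with $\cos \theta$ between the centered vectors.

```python
import numpy as np

# Hypothetical values of two measurements over N = 3 observations.
a = np.array([2.0, 4.0, 9.0])
b = np.array([1.0, 5.0, 3.0])
N = a.size

a_hat = a - a.mean()
b_hat = b - b.mean()

# Covariance s_12 and the cosine of the angle between the centered vectors.
s12 = a_hat @ b_hat / N
cos_theta = a_hat @ b_hat / (np.linalg.norm(a_hat) * np.linalg.norm(b_hat))

# cos(theta) matches NumPy's correlation coefficient.
print(np.isclose(cos_theta, np.corrcoef(a, b)[0, 1]))

# s_12 equals cos(theta) scaled by the two standard deviations sqrt(s_11), sqrt(s_22).
s11 = a_hat @ a_hat / N
s22 = b_hat @ b_hat / N
print(np.isclose(s12, cos_theta * np.sqrt(s11 * s22)))
```

Unlike $s_{12}$, the value of $\cos \theta$ always lies in $[-1, 1]$ regardless of the units of the measurements, which makes it easier to compare across data sets.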

Finally, although we worked out the concepts for $\vv x_j \in \mathbb{R}^2$ and $j=1,2,3$, these concepts extend naturally to the general setting:

- $s_{ii}$ = variance of measurement $i$ across observations $j=1,\ldots,N$
- $s_{kl}$ = covariance of measurements $k$ and $l$ across observations $j=1,\ldots,N$.

:::{prf:example}  Correlated, anticorrelated, and uncorrelated vectors (Fig 3.8 from VLMS)
:label: basics-ex1

Below, we show three pairs of time series (labelled $a_k$ and $b_k$), exhibiting positive correlation, negative correlation, and little/no correlation, respectively.

![Correlated, anticorrelated, and uncorrelated time series](../figures/12-correlated_vectors.png)

:::
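A rough way to reproduce the three situations in the example above is to generate synthetic series and compute their covariances. The series below are random numbers that we construct ourselves (they are not the data in the figure), and the helper `covariance` is our own name for the $\frac{1}{N}$-normalized formula from this page.

```python
import numpy as np

def covariance(u, v):
    """Covariance of two series using the 1/N normalization."""
    u_hat = u - u.mean()
    v_hat = v - v.mean()
    return u_hat @ v_hat / u.size

# Synthetic series of length T; b is built from a plus a little noise.
rng = np.random.default_rng(0)
T = 500
a = rng.normal(size=T)
noise = 0.3 * rng.normal(size=T)

pairs = {
    "correlated": (a, a + noise),             # b moves with a
    "anticorrelated": (a, -a + noise),        # b moves against a
    "uncorrelated": (a, rng.normal(size=T)),  # b generated independently of a
}

results = {name: covariance(u, v) for name, (u, v) in pairs.items()}
for name, value in results.items():
    print(f"{name:>15s}: covariance = {value:+.3f}")
```

The first covariance should come out clearly positive, the second clearly negative, and the third close to zero.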

