---
title: 4.5 Least Squares
subject:  Orthogonality
subtitle: approximate linear system
short_title: 4.5 Least Squares
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: Orthogonal Projection, Decomposition, Least Squares
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/03_Ch_4_Orthogonality/055-least_squares.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 08 - Orthogonal Projections and Subspaces, Least Squares Problems and Solutions.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be found in LAA 6.5 and VMLS 12.1.

## Learning Objectives

By the end of this page, you should know:
- the least squares problem
- how the least squares problem relates to solving an approximate linear system

## Introduction: Inconsistent Linear Equations

Suppose we are presented with an inconsistent set of linear equations $A \vv x \approx \vv b$. This typically coincides with $A \in \mathbb{R}^{m \times n}$ being a "tall matrix", i.e., $m > n$, which corresponds to an overdetermined system of $m$ linear equations in $n$ unknowns. A typical setting in which this arises is data fitting: we are given feature variables $\vv a_i \in \mathbb{R}^n$ and response variables $b_i \in \mathbb{R}$, and we believe that $\vv a_i^{\top} \vv x \approx b_i$ for measurements $i=1,\ldots, m$, where $\vv x \in \mathbb{R}^n$ are our model parameters. We will revisit this application in detail later.

The question then becomes: if no $\vv x \in \mathbb{R}^n$ exists such that $A \vv x = \vv b$, what should we do? A natural idea is to select an $\vv x$ that makes the error or _residual_ $\vv r = A\vv x - \vv b$ as small as possible, i.e., to find the $\vv x$ that _minimizes_ $\|\vv r\| = \|A\vv x - \vv b\|$. Since the norm is nonnegative, minimizing the norm of the residual or its square gives the same answer, so we may as well minimize
\begin{equation}
\label{residual_eqn}
\|A\vv x - \vv b\|^2 = \|\vv r\|^2 = r_1^2 + \cdots + r_m^2,
\end{equation}

the sum of squares of the residuals. 
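
As a small numerical illustration of the sum of squared residuals, the sketch below evaluates $\|A\vv x - \vv b\|^2$ for two candidate vectors. The matrix `A`, the vector `b`, and the helper `sum_of_squares` are made up for illustration, and the code assumes NumPy is available.

```python
import numpy as np

# Hypothetical overdetermined system: m = 3 equations, n = 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

def sum_of_squares(x):
    """Return ||A x - b||^2 = r_1^2 + ... + r_m^2."""
    r = A @ x - b          # residual vector
    return float(r @ r)    # sum of squared entries

# Two arbitrary candidate vectors (illustrative choices).
x_a = np.array([0.0, 0.0])
x_b = np.array([4.0, -2.0])

print(sum_of_squares(x_a))  # 36.0
print(sum_of_squares(x_b))  # 8.0
```

Neither candidate satisfies $A\vv x = \vv b$ exactly, but the second leaves a smaller sum of squared residuals, which is the quantity we would like to minimize.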


## The Least Squares Problem

:::{prf:definition} Least Squares Problem
:label: least-squares-defn
The problem of finding $\hat{\vv  x} \in \mathbb{R}^n$ that minimizes $\|A \vv x - \vv b\|^2$ over all possible choices of $\vv x \in \mathbb{R}^n$ is called the _least-squares problem_, and is written as:
\begin{equation}
\label{least-squares-eqn}
\textrm{minimize} \|A \vv x - \vv b\|^2 \ \textrm{(LS)}
\end{equation}

over the variable $\vv x$. Any $\hat{\vv  x}$ satisfying $\|A\hat{\vv  x} - \vv b\|^2 \leq \|A\vv x - \vv b\|^2$ for all $\vv x$ is a solution of the least-squares problem (LS), and is also called a _least-squares approximate solution of $A\vv x = \vv b$_.
:::
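
To make the definition concrete, here is a minimal sketch that obtains a least-squares approximate solution numerically with NumPy's `np.linalg.lstsq` and spot-checks the defining inequality against randomly chosen alternatives. The data `A` and `b` are hypothetical, and the random spot-check is only an illustration, not a proof of optimality.

```python
import numpy as np

# Hypothetical data for illustration.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# lstsq returns (solution, residuals, rank, singular values); keep the solution.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
best = np.sum((A @ x_hat - b) ** 2)

# Compare against 1000 randomly perturbed candidates x.
rng = np.random.default_rng(0)
trials = x_hat + rng.normal(size=(1000, 2))
objectives = np.sum((trials @ A.T - b) ** 2, axis=1)

# True if no random candidate beats x_hat (up to rounding).
none_better = bool(np.all(objectives >= best - 1e-9))
print(x_hat, best, none_better)
```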

### Solving by Orthogonal Projection 

There are many ways of deriving the solution to [(LS)](#least-squares-defn): you may have seen a vector calculus-based derivation in Math 1410. Here, we will use our new understanding of orthogonal projections to provide an intuitive and elegant _geometric_ derivation.

Our starting point is a _column interpretation_ of the least squares objective: let $\vv a_1, \ldots, \vv a_n \in \mathbb{R}^m$ be the columns of $A$: then the least squares (LS) problem is the problem of finding a linear combination of the columns that is closest to the vector $\vv b \in \mathbb{R}^m$, with coefficients specified by $\vv x$:
$$
\|A\vv x - \vv b\|^2 = \|(x_1 \vv a_1 + \cdots + x_n \vv a_n) - \vv b\|^2
$$

:::{important}
Another way of stating the above is that we are seeking the vector $A \hat{\vv x} \in$ Col$(A)$ in the column space of $A$ that is as close to $\vv b$ as possible. Perhaps not surprisingly, it turns out this can be computed by taking the _[orthogonal projection](./054-proj_subspace.ipynb#orth-proj) of $\vv b$ onto Col$(A)$._
:::

:::{figure}../figures/05-least_squares.jpg
:label:least_squares_fig
:alt:Least squares
:width: 400px
:align: center
:::

To prove the above geometrically intuitive fact (see [](#least_squares_fig)), we decompose $\vv b$ into its orthogonal projection onto Col$(A)$, which we denote by $\hat{\vv b}$, and its component in the orthogonal complement Col$(A)^{\perp}$, which we denote by $\vv e$, so that $\vv b = \hat{\vv b} + \vv e$. Recall $\vv b, \hat{\vv b}, \vv e\in \mathbb{R}^m$ and Col$(A) \subset \mathbb{R}^m$.

We then have that
$$
\vv r = A \vv x - \vv b = \left(A \vv x - \hat{\vv b}\right) - \vv e.
$$
Since $A \vv x, \hat{\vv b} \in $Col$(A)$, so is $A \vv x -  \hat{\vv b}$ (why?), and thus we have decomposed $\vv r$ into components lying in Col$(A)$ and Col$(A)^{\perp}$. Using our generalized Pythagorean theorem, it then follows that 
$$
\|A \vv x - \vv b\|^2 = \|\vv r\|^2 = \|A \vv x - \hat{\vv b}\|^2 + \|\vv e\|^2.
$$
The above expression can be made as small as possible by choosing $\hat{\vv x}$ such that $A\hat{\vv x} = \hat{\vv b}$, which always has a solution (why?), leaving the residual error
$\|\vv e\|^2 = \|\vv b - \hat{\vv b}\|^2$, i.e., the squared norm of the component of $\vv b$ that is orthogonal to Col$(A)$.
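
The decomposition and the Pythagorean identity above can be checked numerically. The sketch below assumes the columns of a hypothetical `A` are linearly independent and uses NumPy's reduced QR factorization only as a convenient way to get an orthonormal basis `Q` for Col$(A)$, so that $\hat{\vv b} = QQ^{\top}\vv b$. The test vector `x` is an arbitrary choice.

```python
import numpy as np

# Hypothetical data for illustration.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Columns of Q form an orthonormal basis for Col(A) (reduced QR).
Q, _ = np.linalg.qr(A)

b_hat = Q @ (Q.T @ b)   # orthogonal projection of b onto Col(A)
e = b - b_hat           # component of b in Col(A)^perp

# Check ||Ax - b||^2 = ||Ax - b_hat||^2 + ||e||^2 for an arbitrary x.
x = np.array([1.0, 1.0])
lhs = np.sum((A @ x - b) ** 2)
rhs = np.sum((A @ x - b_hat) ** 2) + np.sum(e ** 2)
print(lhs, rhs)
```

Because $\|\vv e\|^2$ does not depend on $\vv x$, only the first term on the right can be reduced, and it vanishes exactly when $A\vv x = \hat{\vv b}$.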

This gives us a nice geometric interpretation of the least squares solution $\hat{\vv x}$, but how should we compute it? We now recall from [here](../03_Ch_4_Orthogonality/054-proj_subspace.ipynb#thm_orth_fund) that Col$(A)^{\perp} = $Null$(A^{\top})$. We therefore have that $\vv e \in $Null$(A^{\top})$. This means that
$$
A^{\top} \vv e = A^{\top} \left(\vv b - \hat{\vv b}\right) = A^{\top} \left(\vv b - A \hat{\vv x}\right) = \vv 0,
$$
or, equivalently, that
\begin{equation}
\label{norm_eqn}
A^{\top} A \hat{\vv x} = A^{\top} \vv b. \ (\textrm{NE})
\end{equation}
The above equations are the _normal equations_ associated with the least squares problem specified by $A$ and $\vv b$. We have just informally argued that the set of least squares solutions $\hat{\vv x}$ coincides with the set of solutions to the [normal equations (NE)](#norm_eqn): this is in fact true, and can be proven (we won't do that here).
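
As a quick sanity check of the normal equations, the following sketch forms $A^{\top}A$ and $A^{\top}\vv b$ for a hypothetical `A` and `b`, solves the resulting square system with `np.linalg.solve`, and confirms that the leftover error is orthogonal to the columns of $A$. It assumes $A^{\top}A$ is invertible for this example.

```python
import numpy as np

# Hypothetical data for illustration.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

AtA = A.T @ A   # here [[3, 3], [3, 5]]
Atb = A.T @ b   # here [6, 0]

# Solve the normal equations (A^T A) x = A^T b.
x_hat = np.linalg.solve(AtA, Atb)

# e = b - A x_hat should lie in Null(A^T), i.e. A^T e = 0.
e = b - A @ x_hat
print(x_hat)      # approximately [ 5. -3.]
print(A.T @ e)    # approximately [0. 0.]
```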

Thus, we have reduced solving a least squares problem to our favorite problem, solving a system of linear equations! One question you might have is when do the normal equations (NE) have a _unique solution_? The answer, perhaps unsurprisingly, is when the columns of $A$
are linearly independent, and hence form a basis for Col$(A)$. The following theorem is a useful summary of our discussion thus far:

:::{prf:theorem}
:label: least_squares_thm
Let $A\in \mathbb{R}^{m \times n}$ be an $m \times n$ matrix. Then the following statements are logically equivalent (i.e., any one being true implies all the others are true):

(i) The least squares problem **minimize $\|A \vv x - \vv b\|^2$** has a unique solution for any $\vv b \in \mathbb{R}^m$;

(ii) The columns of $A$ are linearly independent;

(iii) The matrix $A^{\top}A$ is invertible.

When these are true, the unique least squares solution is given by
\begin{equation}
\label{least_squares_thm_eqn}
 \hat{\vv x} = \left(A^{\top}A\right)^{-1}A^{\top} \vv b. \ (\textrm{XLS})
\end{equation}
:::
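
The equivalences in the theorem can be explored numerically. In the sketch below, `A` (independent columns) and `A_dep` (second column equal to twice the first) are hypothetical examples. For `A` the formula (XLS) applies, while for `A_dep` the matrix $A^{\top}A$ is singular and two different vectors achieve the same fit. Forming the inverse explicitly is done here only to mirror the formula.

```python
import numpy as np

b = np.array([6.0, 0.0, 0.0])

# Case 1: linearly independent columns (hypothetical example).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
rank_A = np.linalg.matrix_rank(A)            # equals n = 2
x_hat = np.linalg.inv(A.T @ A) @ A.T @ b     # formula (XLS)

# Case 2: dependent columns, so A_dep^T A_dep is singular.
A_dep = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [1.0, 2.0]])
rank_dep = np.linalg.matrix_rank(A_dep)      # equals 1 < n
det_dep = np.linalg.det(A_dep.T @ A_dep)     # zero up to rounding

# lstsq still returns one minimizer; adding a null-space vector gives another.
x1, *_ = np.linalg.lstsq(A_dep, b, rcond=None)
x2 = x1 + np.array([2.0, -1.0])              # A_dep @ [2, -1] = 0
same_fit = np.allclose(A_dep @ x1, A_dep @ x2)
print(rank_A, rank_dep, same_fit)
```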

:::{note}
The [formula (XLS)](#least_squares_thm_eqn) is useful mainly for theoretical purposes and for hand calculations when $A^{\top}A$ is a $2 \times 2$ matrix. Computational approaches are typically based on QR factorizations of $A$ (the QR factorization we saw in class for square matrices can be easily extended to tall matrices with more rows than columns).
:::
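
One common way to use a QR factorization here, sketched under the assumption that the columns of $A$ are linearly independent: writing $A = QR$ with $Q^{\top}Q = I$ and $R$ upper triangular and invertible, the normal equations become $R^{\top}R\hat{\vv x} = R^{\top}Q^{\top}\vv b$, which reduces to $R\hat{\vv x} = Q^{\top}\vv b$. The data below are hypothetical, and `np.linalg.solve` is used for brevity even though it does not exploit the triangular structure of `R`.

```python
import numpy as np

# Hypothetical tall matrix with linearly independent columns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Reduced QR: Q is 3x2 with orthonormal columns, R is 2x2 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b (a dedicated triangular solver would be cheaper).
x_qr = np.linalg.solve(R, Q.T @ b)

# Compare with the solution of the normal equations.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(x_qr, x_ne)
```

Both routes should agree up to rounding on this small example. The QR route avoids forming $A^{\top}A$ explicitly.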
