---
title: 4.6 Least Squares and Data Fitting
subject:  Orthogonality
subtitle: model some observed data
short_title: 4.6 Least Squares and Data Fitting
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: Data, 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/03_Ch_4_Orthogonality/056-least_squares_data.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 09 - Least Squares Data Fitting.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be found in VMLS 13.

## Learning Objectives

By the end of this page, you should know:
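- how to set up a data fitting problem using a linear-in-the-parameters model,
- how to cast data fitting as a least squares problem over the model parameters,
- how to fit constant, straight-line, and polynomial models to data.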

## Introduction

We now introduce one of the most important applications of least squares methods: fitting a mathematical model to observed data.

A typical data fitting problem takes the following form: There is some underlying _feature vector or independent variable_ $\vv x \in \mathbb{R}^m$ and a scalar _outcome or response variable_ $y \in \mathbb{R}$ that we believe are (approximately) related by some function $f: \mathbb{R}^m \to \mathbb{R}$ such that

\begin{equation}
\label{y_app_f}
y \approx f(\vv x).  \qquad (M)
\end{equation}

### Data

Our goal is to fit (or learn) a _model_ $f$ given some _data_:

$$
(\vv x^{(1)}, y^{(1)}), (\vv x^{(2)}, y^{(2)}), \ldots, (\vv x^{(N)}, y^{(N)}).
$$

These _data pairs_ $(\vv x^{(i)}, y^{(i)})$ are sometimes also called _observations, examples, samples, or measurements_ depending on context.

:::{note}
The superscript $^{(i)}$ denotes the $i$-th data point. For example, $\vv x^{(i)} \in \mathbb{R}^m$ is the $i^{th}$ independent variable, and $x_j^{(i)}$ is the value of the $j^{th}$ feature for example $i$.
:::

### Model Parameterization

Our goal is to choose a model $\hat{f}: \mathbb{R}^m \to \mathbb{R}$ that approximates the [model](#y_app_f) well, that is, $y \approx \hat{f}(\vv x)$. The hat notation is traditionally used to highlight that $\hat{f}$ is an approximation to $f$. Specifically, we will write $\hat{y} = \hat{f}(\vv x)$ to highlight that $\hat{y}$ is an approximate prediction of the outcome $y$.

In order to efficiently search over candidate model functions $\hat{f}$, we need to _parameterize a model class $\mathcal{F}$_ that is easy to work with. A powerful and commonly used model class is the set of _linear in the parameters_ models of the form

\begin{equation}
\label{LP_eqn}
\hat{f}(\vv x) = \theta_1 f_1(\vv x) + \theta_2 f_2(\vv x) + \cdots + \theta_p f_p(\vv x). \qquad (LP) 
\end{equation}

In [(LP)](#LP_eqn), the functions $f_i: \mathbb{R}^m \to \mathbb{R}$ are _basis functions_ or _features_ that we choose beforehand.

:::{warning}
Note that the term basis here is related to, but different from, our previous use of the term.
:::

When we solve the data fitting problem, we will look for the _parameters_ $\theta_i$ that, among other things, make the model prediction $\hat{y}^{(i)} = \hat{f}(\vv x^{(i)})$ **consistent with the observed data**, i.e., we want $\hat{y}^{(i)} \approx y^{(i)}$.

### Data Fitting

For the $i$-th observation $y^{(i)}$ and the $i^{th}$ prediction $\hat{y}^{(i)}$, we define the _prediction error_ or _residual_ $r^{(i)} = y^{(i)} - \hat{y}^{(i)}$.

The _least squares data fitting problem_ chooses the model parameters $\theta_i$ that minimize the (average of the) sum of the squares of the prediction errors on the data set:

$$
\frac{(r^{(1)})^2 + \cdots + (r^{(N)})^2}{N}
$$

Next we'll show that this problem can be cast as a least squares problem over the model parameters $\theta_i$. Before doing that though, we want to highlight the conceptual shift we are making.

:::{important} DATA-DRIVEN
Rather than handcrafting our function $\hat{f}$ from scratch, we **solve an optimization problem** to identify the parameters $\theta_i$ that best explain the data, i.e., we _learn_ the model from the data. Of course, if we know something about the model structure, we should encode this in our choice of feature functions $f_i$. We'll see examples of such _feature engineering_ later.
:::

## Data Fitting as Least Squares

We start by stacking the outcomes $y^{(i)}$, predictions $\hat{y}^{(i)}$, and residuals $r^{(i)}$ as vectors in $\mathbb{R}^N$:

$$
\vv y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}, \quad 
\hat{\vv y} = \begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \\ \vdots \\ \hat{y}^{(N)} \end{bmatrix}, \quad 
\vv r = \begin{bmatrix} r^{(1)} \\ r^{(2)} \\ \vdots \\ r^{(N)} \end{bmatrix} = \begin{bmatrix} y^{(1)} - \hat{y}^{(1)} \\ y^{(2)} - \hat{y}^{(2)} \\ \vdots \\ y^{(N)} - \hat{y}^{(N)} \end{bmatrix}
$$

Then we can compactly write the _squared prediction error_ as $\|\vv r\|_2^2$. Next, we compile our model parameters into a vector $\vv \theta \in \mathbb{R}^p$, and build our _feature matrix or measurement matrix_ $A \in \mathbb{R}^{N \times p}$ by setting

$$
A_{ij} = f_j(\vv x^{(i)}), \quad i=1,\ldots,N, \quad j=1,\ldots,p.
$$

The $j$-th column of the matrix $A$ is composed of the $j$-th basis function evaluated on each of the data points $\vv x^{(1)},\ldots,\vv x^{(N)}$:

$$
\vv f_1(\vv x) = \begin{bmatrix} f_1(\vv x^{(1)}) \\ f_1(\vv x^{(2)}) \\ \vdots \\ f_1(\vv x^{(N)}) \end{bmatrix}, \cdots,  \vv f_p(\vv x) = \begin{bmatrix} f_p(\vv x^{(1)}) \\ f_p(\vv x^{(2)}) \\ \vdots \\ f_p(\vv x^{(N)}) \end{bmatrix} 
$$

and $A = [\vv f_1(\vv x) \cdots \vv f_p(\vv x)]$. In matrix-vector notation, we then have

$$
\hat{\vv y} = A\vv \theta = \theta_1 \vv f_1(\vv x) + \cdots + \theta_p \vv f_p(\vv x). 
$$
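To make the construction of $A$ concrete, here is a minimal sketch in NumPy. The helper `feature_matrix`, the data `X`, and the three basis functions are all hypothetical choices made for illustration; they are not part of the lecture material. The sketch builds $A_{ij} = f_j(\vv x^{(i)})$ and checks that $A\vv\theta$ agrees with evaluating (LP) one data point at a time.

```python
import numpy as np

def feature_matrix(basis_functions, X):
    """Return A with A[i, j] = f_j(x^{(i)}) for each data point x^{(i)} in X."""
    return np.array([[f(x) for f in basis_functions] for x in X], dtype=float)

# Hypothetical data: N = 4 points in R^2 (so m = 2).
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, -1.0]])

# Hypothetical basis functions f_1, f_2, f_3, chosen only for illustration.
basis_functions = [
    lambda x: 1.0,          # constant feature
    lambda x: x[0],         # first coordinate
    lambda x: x[0] * x[1],  # product of the two coordinates
]

A = feature_matrix(basis_functions, X)  # shape (N, p) = (4, 3)
theta = np.array([0.5, -1.0, 2.0])      # an arbitrary parameter vector

# Stacked predictions via one matrix-vector product.
y_hat = A @ theta

# The same predictions, computed one data point at a time from (LP).
y_hat_direct = np.array(
    [sum(t * f(x) for t, f in zip(theta, basis_functions)) for x in X]
)
```

Each row of `A` corresponds to one data point and each column to one basis function, so changing the model class only changes the list `basis_functions`.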

The least squares data fitting problem then becomes

$$
\text{minimize } \|\vv r\|^2 \Rightarrow \text{minimize } \|\vv y - A\vv \theta\|^2
$$

over the model parameters $\vv \theta$, which we recognize as a [least squares problem](./055-least_squares.ipynb#least-squares-defn)! Assuming we have chosen basis functions $f_i$ such that the columns of $A$ are linearly independent (what would it mean if this wasn't true?), we have that the least squares solution is

$$
\hat{\vv \theta} = (A^TA)^{-1}A^T\vv y. 
$$

The resulting average least squares error $\frac{\|A\hat{\vv \theta} - \vv y\|^2}{N}$ is called the _Minimum Mean-Square Error (MMSE)_.
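As a sanity check on the formula for $\hat{\vv \theta}$, the following sketch fits a model on synthetic data. The data-generating function, the noise level, and the basis $f_1 = 1$, $f_2 = x$, $f_3 = \sin(x)$ are assumptions made purely for illustration. It solves the normal equations $(A^TA)\vv\theta = A^T\vv y$ and compares the result against NumPy's `np.linalg.lstsq`, then computes the MMSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scalar data (an assumption for illustration only).
N = 60
x = rng.uniform(-3.0, 3.0, size=N)
y = 1.0 + 0.5 * x + 2.0 * np.sin(x) + 0.1 * rng.standard_normal(N)

# Feature matrix for the hypothetical basis f_1 = 1, f_2 = x, f_3 = sin(x).
A = np.column_stack([np.ones(N), x, np.sin(x)])

# Closed-form solution: solve (A^T A) theta = A^T y rather than forming an inverse.
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Library routine that minimizes ||y - A theta||^2 directly.
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# Average squared residual at the least squares solution (the MMSE).
residual = y - A @ theta_lstsq
mmse = residual @ residual / N
```

Both approaches should agree up to rounding when the columns of $A$ are linearly independent. In practice a routine such as `np.linalg.lstsq` is usually preferred, since forming $A^TA$ explicitly can amplify rounding errors when $A$ is poorly conditioned.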

## Warm-up: Fitting a Constant Model

We start with the simplest possible model and set the number of features $p=1$ and $f_1(\vv x) = 1$, so that our (admittedly boring) model becomes $\hat{f}(\vv x) = \theta_1$.

First, we construct $A \in \mathbb{R}^{N \times 1}$ by setting $A_{i1} = f_1(\vv x^{(i)}) = 1$. Therefore $A$ is the $N$-dimensional all ones vector $\mathbf{1}_N$. We plug this into our formula for $\hat{\vv \theta}$:

$$
\hat{\vv \theta} = \hat{\theta}_1 = (\vv 1^T\vv 1)^{-1}\vv 1^T\vv y = \frac{1}{N}\sum_{i=1}^N y^{(i)} = \text{average}(\vv y). 
$$

We have just shown that the _mean_ or _average_ of the outcomes $y^{(1)},\ldots,y^{(N)}$ is the best least squares fit of a constant model. In this case, the MMSE is

$$
\frac{1}{N}\sum_{i=1}^N (\text{average}(\vv y) - y^{(i)})^2,
$$

which is called the variance of $\vv y$, and measures how "wiggly" $\vv y$ is.
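This is easy to verify numerically. The sketch below uses a small, made-up vector of outcomes (an assumption for illustration) and checks that the least squares fit of a constant model recovers the mean, and that the resulting MMSE matches the variance as computed by `np.var`, which divides by $N$ by default.

```python
import numpy as np

# Hypothetical outcomes y^{(1)}, ..., y^{(5)}.
y = np.array([2.0, 4.0, 4.0, 5.0, 10.0])
N = y.size

# Constant model: A is the N x 1 all-ones matrix.
A = np.ones((N, 1))

# Least squares fit of the single parameter theta_1.
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# MMSE of the constant model.
mmse = np.sum((A @ theta_hat - y) ** 2) / N

# Reference values: the mean and the (divide-by-N) variance of y.
mean_y = np.mean(y)
var_y = np.var(y)
```

For this data the mean is $5$ and the variance is $36/5 = 7.2$, which `theta_hat[0]` and `mmse` reproduce up to rounding.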

## Univariate Function: Straight Line Fit

We start by considering the univariate function setting where our feature vector $\vv x = x \in \mathbb{R}$ is a scalar, and hence we are looking to approximate a function $f: \mathbb{R} \to \mathbb{R}$. This is a nice way to get intuition because it is easy to plot the data $(x^{(i)}, y^{(i)})$ and the model function $\hat{y} = \hat{f}(x)$.

We'll start with a _straight line fit_ model: we set $p=2$, with $f_1(x) = 1$ and $f_2(x) = x$. In this case our model class is composed of models of the form

$$
\hat{f}(x) = \theta_1 + \theta_2 x. 
$$

Here, we can easily interpret $\theta_1$ as the y-intercept and $\theta_2$ as the slope of the straight line model we are searching for.

In this case, the matrix $A \in \mathbb{R}^{N \times 2}$ takes the form

$$ A = \begin{bmatrix}
1 & x^{(1)} \\
1 & x^{(2)} \\
\vdots & \vdots \\
1 & x^{(N)}
\end{bmatrix} 
$$

Although we can work out formulas for $\hat{\theta}_1$ and $\hat{\theta}_2$, they are not particularly interesting or informative. Instead, we'll focus on some examples of how to use these ideas. A straight-line fit to 50 data points is given [below](#straight_line).

:::{figure}../figures/05-straight_line.jpg
:label:straight_line
:alt:Straight Line fit
:width: 400px
:align: center
:::
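A straight-line fit of this kind can be sketched in a few lines of NumPy. The data below are synthetic (the true intercept, slope, and noise level are assumptions for illustration and are unrelated to the figure above). We build the $N \times 2$ matrix $A$, solve for $\hat{\theta}_1$ and $\hat{\theta}_2$, and cross-check against `np.polyfit`, which returns coefficients from highest degree to lowest.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 50 points scattered around the line y = 1.5 + 0.8 x.
N = 50
x = np.linspace(0.0, 10.0, N)
y = 1.5 + 0.8 * x + rng.standard_normal(N)

# Feature matrix with columns f_1(x) = 1 and f_2(x) = x.
A = np.column_stack([np.ones(N), x])

# Least squares estimates of the y-intercept (theta_1) and slope (theta_2).
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = theta_hat

# Cross-check: np.polyfit with deg=1 returns [slope, intercept].
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

# Prediction of the fitted model at a new input value.
x_new = 12.0
y_new = intercept + slope * x_new
```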

::::{prf:example}Time Series Trend
:label: ex_time_series
In this setting, $y^{(i)}$ is the value of a quantity of interest at time $x^{(i)} = i$. The straight line model $\hat{y}^{(i)} = \hat{\theta}_1 + \hat{\theta}_2 i$ is called a _trend line_, the vector $\vv y - \hat{\vv y}$ is called the _de-trended time series_, and $\hat{\theta}_2$ is the _trend coefficient_.

When the de-trended time series is positive, it means the time series lies above the straight-line fit; when it is negative, it is below the straight-line fit. In the [figures below](#petroleum), we apply this idea to world petroleum consumption. (Can you identify when major geopolitical events occurred based on the de-trended line?)

:::{figure}../figures/05-petroleum.jpg
:label:petroleum
:alt:Petroleum Data fitting
:width: 600px
:align: center
:::

::::
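The de-trending procedure from the example can be sketched as follows. The time series here is synthetic (a linear trend plus an oscillation and noise, all assumed for illustration) and is not the petroleum data shown in the figure. We fit the trend line with $x^{(i)} = i$, subtract it to get the de-trended series, and flag the times at which the series lies above the trend.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic time series observed at times i = 1, ..., N.
N = 40
t = np.arange(1, N + 1, dtype=float)
y = 60.0 + 0.9 * t + 3.0 * np.sin(t / 3.0) + rng.standard_normal(N)

# Straight-line model with features 1 and i.
A = np.column_stack([np.ones(N), t])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

trend = A @ theta_hat            # trend line y_hat
detrended = y - trend            # de-trended time series y - y_hat
trend_coefficient = theta_hat[1]

# True where the time series lies above the straight-line fit.
above_trend = detrended > 0
```

Because the residual of a least squares fit is orthogonal to the columns of $A$, the de-trended series sums to zero and is orthogonal to the vector of time indices.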

## Univariate Function: Polynomial Fit


A simple extension beyond the straight-line fit is a _polynomial fit_ where we set the $j^{th}$ feature to be

$$
f_j(x) = x^{j-1}
$$

for $j = 1,\ldots,p$. This leads to a model class composed of polynomials of degree at most $p-1$:

$$
\hat{f}(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \cdots + \theta_p x^{p-1} 
$$

:::{warning}
Here $x^i$ means a scalar raised to the $i^{th}$ power; $x^{(i)}$ means the $i^{th}$ observed scalar data value.
:::

In this case, our matrix $A \in \mathbb{R}^{N \times p}$ takes the form

$$ A = \begin{bmatrix}
1 & x^{(1)} & \cdots & (x^{(1)})^{p-1} \\
1 & x^{(2)} & \cdots & (x^{(2)})^{p-1} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x^{(N)} & \cdots & (x^{(N)})^{p-1}
\end{bmatrix} 
$$

which you might recognize as a Vandermonde matrix, encountered earlier in the class when we discussed polynomial interpolation. An important property of such matrices is that their columns are linearly independent provided that the numbers $x^{(1)}, \ldots, x^{(N)}$ include at least $p$ different values. The [figures below](#poly_fit) show examples of least squares fits of polynomials of degree 2, 6, 10, and 15 to a set of 100 data points.

:::{figure}../figures/05-poly_fit.jpg
:label:poly_fit
:alt:Polynomial Model Data fitting
:width: 600px
:align: center
:::
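A polynomial fit can be sketched with `np.vander`, which builds the Vandermonde matrix directly; passing `increasing=True` orders the columns as $1, x, \ldots, x^{p-1}$ to match the matrix $A$ above. The data, the choice $p = 4$, and the helper `f_hat` are assumptions for illustration and do not reproduce the figure. The last lines illustrate the linear independence condition: with only two distinct data values, three polynomial columns cannot be independent.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 100 noisy samples of a smooth function.
N, p = 100, 4
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(N)

# Vandermonde feature matrix with columns x^0, x^1, ..., x^(p-1).
A = np.vander(x, N=p, increasing=True)
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def f_hat(x_new):
    """Evaluate the fitted polynomial at scalar or array inputs."""
    powers = np.vander(np.atleast_1d(x_new), N=p, increasing=True)
    return powers @ theta_hat

# Rank check: full column rank when there are at least p distinct values.
rank_full = np.linalg.matrix_rank(A)

# Only two distinct values (0 and 1) but three columns: rank drops to 2.
x_repeated = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
rank_deficient = np.linalg.matrix_rank(np.vander(x_repeated, N=3, increasing=True))
```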

:::{important}
An important observation is that since any polynomial of degree less than $r$ is also a polynomial of degree less than $s$ if $r \leq s$, it follows that the MMSE cannot increase (and typically decreases) as we make the polynomial degree larger. This seems to suggest that we should use the largest degree polynomial possible so as to get a model with the smallest MMSE possible. We will see later that this conclusion is **NOT TRUE**, and you will explore methods for model selection in recitation and the homework.
:::
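The nesting of the polynomial model classes can be checked numerically. In the sketch below, the data and the helper `poly_mmse` are assumptions for illustration; the degrees match those mentioned above. Fitting polynomials of increasing degree to the same data set yields an MMSE that does not increase, up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data: 100 noisy samples of a smooth function.
N = 100
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3.0 * x) + 0.2 * rng.standard_normal(N)

def poly_mmse(x, y, degree):
    """MMSE of the least squares polynomial fit of the given degree."""
    A = np.vander(x, N=degree + 1, increasing=True)  # p = degree + 1 columns
    theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ theta_hat) ** 2)

degrees = [2, 6, 10, 15]
mmse_by_degree = [poly_mmse(x, y, d) for d in degrees]

for d, m in zip(degrees, mmse_by_degree):
    print(f"degree {d:2d}: MMSE on the fitting data = {m:.5f}")
```

Note that this MMSE is measured on the same data used to fit the model, which is exactly why a smaller value here does not by itself justify a larger degree.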
