---
title: 13.1 Unconstrained Optimization
subject:  Optimization
subtitle: 
short_title: 13.1 Unconstrained Optimization
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/12_Ch_13_Optimization/131-Unconstrained.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 21 - An introduction to unconstrained optimization, gradient descent, and Newton’s method.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be found in Chapter 12 of the ROB101 textbook by Jesse Grizzle.

## Learning Objectives

By the end of this page, you should know
- what a mathematical optimization problem is
- the difference between constrained and unconstrained optimization
- an example of an optimization problem: least squares
- what local and global minima are
- what convex functions are, and some intuition on how to minimize them

## A Brief Introduction to Optimization

There were a few times during the semester where we tried to find the "best" vector (or matrix) among a collection of many such vectors or matrices. For example, in least squares, we looked to find the vector $\vv x$ that minimized the expression $\|A\vv x - \vv b\|^2$. In low-rank approximation, we looked to find the matrix $\hat{M}$ that has rank $k$ (or less) that minimized the expression $\|\hat{M} - M\|_F^2$.

These were both specific instances of what is called a _mathematical optimization problem_. We will focus on _unconstrained problems_. You will learn a lot more about optimization problems in ESE 3040, and this lecture is only meant to give you a small taste, and to show how essential linear algebra is in finding solutions.

:::{prf:definition} Optimization
:label: opt_defn
_Optimization_ is the process of finding one (or more) vectors $\vv x \in \mathbb{R}^n$ that minimize a function $f: \mathbb{R}^n \to \mathbb{R}$. This is written as the optimization problem

\begin{equation}
\label{opt_defn_eqn}
\text{minimize} \ f(\vv x)
\end{equation}

In [](#opt_defn_eqn), the variable $\vv x \in \mathbb{R}^n$ is called the _decision variable_,
and the function $f: \mathbb{R}^n \to \mathbb{R}$ is called the _cost function_ or _objective function_.
:::

Optimization problem [](#opt_defn_eqn) is called _unconstrained_ because we are free to pick any
$\vv x \in \mathbb{R}^n$ we like to minimize $f(\vv x)$. A constrained optimization problem has the additional requirement that $\vv x$ must satisfy some added conditions, e.g., live in the solution set
of $A\vv x = \vv b$. We will not consider such problems in this lecture, but you'll see many in ESE 3040.

The goal of optimization is to find a special decision variable $\vv x^*$ for which the cost function $f$ is as small as possible, i.e., such that

\begin{equation}
\label{opt_goal}
f(\vv x^*) \leq f(\vv x) \quad \text{for all } \vv x \in \mathbb{R}^n
\end{equation}

Such an $\vv x^*$ is called an _optimal solution to problem [](#opt_defn_eqn)_, and is defined as
the _arg min of $f$_:

\begin{equation}
\label{arg_min}
\vv x^* = \arg\min_{\vv x \in \mathbb{R}^n} f(\vv x).
\end{equation}

Equation [](#arg_min) simply says in math that $\vv x^*$ satisfies the definition [](#opt_goal) of an optimal point.[^label]

[^label]: Note that if there are multiple optimal points, one instead writes $
\vv x^* \in \arg\min_{\vv x \in \mathbb{R}^n} f(\vv x)
$ to indicate $\vv x^*$ belongs to the set of optimal points.

:::{prf:example} The least-squares problem
:label: ls_eg

$$
\text{minimize} \ \|A\vv x - \vv b\|^2
$$

over $\vv x \in \mathbb{R}^n$ is an unconstrained optimization problem. The objective function is $f(\vv x) = \|A\vv x - \vv b\|^2$, and the optimal solution is

$$
\vv x^* = (A^TA)^{-1}A^T\vv b = A^{\dagger}\vv b
$$

when $(A^TA)^{-1}$ exists. More generally, $\vv x^* \in \arg\min_{\vv x} \|A\vv x - \vv b\|^2$ if and only if $\vv x^*$ satisfies the normal equations $A^TA\vv x^* = A^T\vv b$.
:::
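
To make this concrete, here is a minimal numerical sketch in Python with NumPy. The matrix `A` and vector `b` are made-up data (not from the lecture), chosen so that $A^TA$ is invertible. The sketch solves the normal equations, compares the result against NumPy's built-in least-squares routine, and checks that random perturbations of the solution never lower the cost.

```python
import numpy as np

# Hypothetical data: a tall matrix A with linearly independent columns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

def f(x):
    """Least-squares objective ||Ax - b||^2."""
    return np.sum((A @ x - b) ** 2)

# Solve the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against NumPy's least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print("normal equations:", x_star)
print("agrees with lstsq:", np.allclose(x_star, x_lstsq))

# Moving away from x_star in random directions should not decrease the cost.
rng = np.random.default_rng(0)
perturbations = rng.standard_normal((100, 2))
never_better = all(f(x_star) <= f(x_star + d) for d in perturbations)
print("no perturbation does better:", never_better)
```

Solving the linear system with `np.linalg.solve` rather than forming $(A^TA)^{-1}$ explicitly is a common numerical choice; both describe the same $\vv x^*$ when the inverse exists.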

Despite how simple and innocuous problem [](#opt_defn_eqn) looks, it can be used to encode very rich and very challenging problems. Even for $\vv x \in \mathbb{R}$, we can get ourselves
into trouble. Consider the following two functions that we wish to minimize:

:::{figure}../figures/14-opt_real.jpg
:label:opt_real
:alt:Optimizing scalars
:width: 600px
:align: center
:::

Which do you think is easier to optimize? The left figure, with function $f_1(x)$, is "bowl-shaped" and is smallest at $x^* = 2$. What's "nice" about $f_1$ is that there's also an obvious
algorithm for finding $x^* = 2$: If you imagine yourself as an ant standing on the function $f_1(x)$, all you need to do is "walk downhill" until you eventually find the bottom of the bowl.

In contrast, the right figure with function $f_2(x)$ has many hills and valleys. The optimal point $x^* = 3$ is the one for which $f_2(x)$ is smallest. But now if we again imagine ourselves as an ant standing on the function $f_2(x)$, our strategy of walking downhill will not always work! For example, if we were to start at $x = 1.5$, then walking downhill would bring us to the bottom of the first valley at $x = 1$. Now $x = 1$ is not an optimal point, since $f_2(3) < f_2(1)$, but if we look around at nearby points, then $f_2(1)$ is indeed the smallest value.

:::{prf:definition} Local and Global Minimum
:label: local_global
A point $\tilde{x}$ is called a _local minimum_ of $f$ if $f(\tilde{x}) \leq f(x)$ for all $x$ close to
$\tilde{x}$, i.e., for all $x$ with $|x - \tilde{x}| \leq \epsilon$ for some $\epsilon > 0$. When we wish to emphasize that a point $x^*$ satisfying [](#opt_goal) is indeed the best possible choice, we call it
a _global minimum_.
:::
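
The ant's strategy is easy to simulate. The sketch below is only an illustration: the function `f` is a made-up two-valley function (not the $f_2$ from the figure), and `walk_downhill` is a hypothetical helper that keeps taking small steps left or right for as long as the cost decreases.

```python
def f(x):
    """Hypothetical cost with two valleys of different depths."""
    return (x**2 - 1) ** 2 + 0.3 * x

def walk_downhill(x, step=1e-3, max_steps=100_000):
    """Move to whichever neighbor is lower; stop when neither one is."""
    for _ in range(max_steps):
        best = min((x - step, x + step), key=f)
        if f(best) >= f(x):
            break  # both neighbors are at least as high: bottom of a valley
        x = best
    return x

x_right = walk_downhill(0.5)    # start on the right-hand slope
x_left = walk_downhill(-0.5)    # start on the left-hand slope

print("from  0.5:", round(x_right, 3), "cost", round(f(x_right), 3))
print("from -0.5:", round(x_left, 3), "cost", round(f(x_left), 3))
```

Starting at $0.5$, the ant settles in the shallower valley near $x \approx 0.96$, which is only a local minimum. Starting at $-0.5$, it reaches the deeper valley near $x \approx -1.04$, which is the global minimum of this particular function. Where we start determines which valley we end up in.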

As you may have guessed, we really like "bowl-shaped" functions for which our walking downhill strategy finds a global minimum. Such functions satisfy a geometric property called _convexity_. 

:::{prf:definition} Convex Functions
:label: conv_fn
A _convex function_ $f: \mathbb{R}^n \to \mathbb{R}$ is one which satisfies the following property:
\begin{equation}
\label{cvx}
f(\theta \vv x + (1-\theta)\vv y) \leq \theta f(\vv x) + (1-\theta)f(\vv y) \text{ for all } \theta \in [0,1], \vv x, \vv y \in \mathbb{R}^n  \quad (\text{CVX})
\end{equation}
:::
To understand what [(CVX)](#cvx) is saying, it is best to draw what it means for a scalar
function $f: \mathbb{R} \to \mathbb{R}$.

:::{figure}../figures/14-cvx_fn.jpg
:label:cvx_fn
:alt:Convex Function
:width: 400px
:align: center
:::

[(CVX)](#cvx) says that if I pick any two points $(x, f(x))$ and $(y, f(y))$ on the graph of $f$, and draw the line segment between these two points, then this segment lies on or above the graph. It turns out that this is exactly the right way to mathematically characterize "bowl-shaped" functions, even when $\vv x \in \mathbb{R}^n$. The important feature of convex functions is that "walking
downhill" will always bring us to a global minimum. We won't say much more about convex functions, but you'll see them again in ESE 3040, and there is a graduate level course, ESE 6050, which focuses entirely on convex optimization problems.

:::{prf:example}
:label: eg_affine
The affine function $f(\vv x) = \vv a^T\vv x - b$, with $\vv a \in \mathbb{R}^n$ and $b \in \mathbb{R}$, is convex. To see this, we check that
\begin{equation*}
f(\theta \vv x + (1-\theta)\vv y) \leq \theta f(\vv x) + (1-\theta)f(\vv y) \text{ for all } \vv x, \vv y \in \mathbb{R}^n \text{ and } \theta \in [0,1]
\end{equation*}
But 
$$f(\theta \vv x + (1-\theta)\vv y) = \vv a^T(\theta \vv x + (1-\theta)\vv y) - b = \theta (\vv a^T\vv x - b) + (1-\theta)(\vv a^T\vv y - b) = \theta f(\vv x) + (1-\theta)f(\vv y),
$$
so [(CVX)](#cvx) holds with equality. Affine functions are on the "boundary" of being convex, in the sense that if $f(\vv x)$ is an affine function, then $-f(\vv x)$ is also an affine function and hence is convex. Affine functions are the only functions $f$ for which both $f$ and $-f$ are convex!
:::

:::{prf:example}
:label: eg_ls
The least squares objective $f(\vv x) = \|A\vv x - \vv b\|^2$ is convex. One can check this from the definition [(CVX)](#cvx), but this is very tedious.

To gain some intuition, let's consider the scalar setting $f(x) = \|\vv ax - \vv b\|^2$, where $x \in \mathbb{R}$ and $\vv a$ is a vector. Expanding out $f(x)$ we see
\begin{equation*}
f(x) = x^2 \|\vv a\|^2 - 2\vv a^T\vv bx + \|\vv b\|^2,
\end{equation*}

which is an upward pointing quadratic since $\|\vv a\|^2 > 0$ for any $\vv a \neq \vv 0$. This same intuition extends to the $\vv x \in \mathbb{R}^n$ setting. Expanding out $f(\vv x) = \|A\vv x - \vv b\|^2$, we get

\begin{equation*}
f(\vv x) = \vv x^T A^TA \vv x - 2 \vv x^TA^T \vv b + \|\vv b\|^2.
\end{equation*}

This is a _quadratic_ function, with quadratic term given by the quadratic form $\vv x^T A^TA \vv x$, defined by the positive semidefinite matrix $A^TA$. This means that $f(\vv x)$ is an upward pointing bowl with ellipsoidal level sets (recall from {doc}`Lecture 16 <../lecture_notes/Lecture 16 - Eigenvalues of Symmetric Matrices, Spectral Theorem, Quadratic Forms and Positive Definite Matrices, Optimization Principles for Eigenvalues of Symmetric Matrices.pdf>`), and hence is convex.
:::
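
Checking [(CVX)](#cvx) by hand is tedious, but a computer can spot-check it quickly. The following is a sanity check rather than a proof: `A` and `b` are randomly generated placeholders, and we simply count how often the convexity inequality fails over many random choices of $\vv x$, $\vv y$, and $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem data (any A and b would do).
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

def f(x):
    """Least-squares objective ||Ax - b||^2."""
    return np.sum((A @ x - b) ** 2)

violations = 0
for _ in range(1000):
    x = rng.standard_normal(3)
    y = rng.standard_normal(3)
    theta = rng.uniform(0.0, 1.0)
    lhs = f(theta * x + (1 - theta) * y)       # f at the blended point
    rhs = theta * f(x) + (1 - theta) * f(y)    # blend of the two f values
    if lhs > rhs + 1e-9:                       # small slack for round-off
        violations += 1

print("violations of (CVX):", violations)
```

For this objective the gap between the two sides works out to $\theta(1-\theta)\|A(\vv x - \vv y)\|^2 \geq 0$, so we expect to see no violations.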

A more formal way of making the argument in [](#eg_ls) is to show that the _Hessian $\nabla^2 f(\vv x)$ of $f$_ is positive semidefinite. Here $\nabla^2 f(\vv x) = 2A^TA$, which is indeed positive semidefinite. This is the matrix/vector equivalent of saying that $f(x)=ax^2 + bx + c$ is convex if and only if $a \geq 0$ (and is an upward pointing bowl when $a > 0$).

If you don't remember what the Hessian $\nabla^2f(\vv x)$ of a function is, or how we computed $\nabla^2f(\vv x) = 2A^TA$ in the above, don't worry, we'll review this in a bit.
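
In the meantime, here is a quick numerical look at the claim that $2A^TA$ is positive semidefinite, i.e., that all of its eigenvalues are nonnegative. Both matrices below are made-up examples: a random `A` with independent columns, and a matrix `A_dep` with two identical columns, for which one eigenvalue of the Hessian is zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical A with linearly independent columns.
A = rng.standard_normal((5, 3))
H = 2 * A.T @ A                      # Hessian of ||Ax - b||^2
eigvals = np.linalg.eigvalsh(H)      # eigenvalues of a symmetric matrix, ascending
print("eigenvalues of 2 A^T A:", eigvals)

# Hypothetical A with dependent columns: still PSD, but not positive definite.
A_dep = np.array([[1.0, 1.0],
                  [2.0, 2.0],
                  [3.0, 3.0]])
eigvals_dep = np.linalg.eigvalsh(2 * A_dep.T @ A_dep)
print("eigenvalues with dependent columns:", eigvals_dep)
```

In the second case the bowl has a flat direction: moving along the null space of `A_dep` does not change the cost, which is why the minimizer is no longer unique.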

## Which way is down?

Let's assume that we are either in a "nice" setting where our objective function is convex (bowl shaped), or that we're happy to settle for a local minimum. How can we figure out which way is down, so that we can tell our little ant friend which way they should walk? To get some intuition, we will start with functions with $\vv x\in\mathbb{R}^2$ and look at some _cost function contour plots_.

Let's start with two familiar examples:

1. $f(\vv x) = \|\vv x\|^2 = x_1^2 + x_2^2$, which is nothing but the Euclidean norm squared of $\vv x$, and
2. $f(\vv x) = \|\vv x-\vv b\|^2 = (x_1-b_1)^2+(x_2-b_2)^2$, which is a very simple least squares objective with $A=I$.

We have plotted both of these functions, and their contour plots, below:

:::{figure}../figures/14-level_set.jpg
:label:level_set
:alt:Level Set
:width: 800px
:align: center
:::

The contour plots show the _level sets of $f(\vv x)$_, which we've seen before. These are the sets $C_{\alpha} = \{\vv x\in\mathbb{R}^2 | f(\vv x)=\alpha\}$ for some constant $\alpha$. Here, these end up being circles, since for example $C_1$ is the set of $x_1,x_2$ such that $x_1^2 + x_2^2 = 1$ or $(x_1-b_1)^2 + (x_2-b_2)^2 = 1$.

Can we use these contour plots to identify which way is down if our ant is currently sitting at point $\vv w\in\mathbb{R}^2$?

To answer this question, we'll need the _gradient $\nabla f(\vv x)$_ of our function. Recall from Math 1410 that for a function $f:\mathbb{R}^n\to\mathbb{R}$, its gradient $\nabla f(\vv x)$ is an $n$-vector of the partial derivatives of $f$:

$$
\nabla f(\vv x) = \begin{bmatrix}
\frac{\partial f}{\partial x_1}(\vv x) \\[6pt]
\vdots \\[6pt]
\frac{\partial f}{\partial x_n}(\vv x)
\end{bmatrix} \in \mathbb{R}^n
$$

For our functions on $\mathbb{R}^2$, this reduces to $\nabla f(\vv x) = \left(\frac{\partial f}{\partial x_1}(x_1, x_2), \frac{\partial f}{\partial x_2}(x_1, x_2)\right) \in \mathbb{R}^2$.

In the figure below, we show contour plots for $f_1(\vv x)=\|\vv x-\vv b\|^2=(x_1-1)^2+(x_2-2)^2$ and $f_2(\vv x)=\|A\vv x-\vv b\|^2$, along with the gradients $\nabla f(\vv x)$ (green arrows) at various points.

:::{figure}../figures/14-contour.jpg
:label:contour
:alt:Contours
:width: 800px
:align: center
:::

In both cases, the gradients are pointing in the direction of maximal increase: if we go in the opposite $-\nabla f(\vv x)$ direction, we will move in the direction of maximal decrease!
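
We can test this claim numerically for $f_1(\vv x) = (x_1-1)^2 + (x_2-2)^2$, using the gradient $2(\vv x - \vv b)$ with $\vv b = (1, 2)$ that is derived just below. The starting point `w` and the step length are arbitrary choices made for illustration. The sketch takes one small step of fixed length along $\nabla f$, along $-\nabla f$, and along many random directions, and compares the resulting costs.

```python
import numpy as np

b = np.array([1.0, 2.0])        # center of the bowl for f_1

def f(x):
    return np.sum((x - b) ** 2)

def grad_f(x):
    return 2 * (x - b)

w = np.array([3.0, -1.0])       # hypothetical location of the ant
step = 0.1                      # hypothetical (small) distance to walk

g = grad_f(w)
unit_g = g / np.linalg.norm(g)  # unit vector along the gradient

f_here = f(w)
f_uphill = f(w + step * unit_g)
f_downhill = f(w - step * unit_g)

# Steps of the same length in 200 random directions, for comparison.
rng = np.random.default_rng(0)
directions = rng.standard_normal((200, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
f_random = np.array([f(w + step * d) for d in directions])

print("downhill < here < uphill:", f_downhill < f_here < f_uphill)
print("no random direction beats -grad:", f_downhill <= f_random.min() + 1e-12)
```

For this circular bowl the negative gradient is the best direction for any step length. For a general function the statement is only guaranteed for sufficiently small steps.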

Let's flex our calculus muscles a bit and compute the gradients of $f_1(\vv x)$ and $f_2(\vv x)$[^label1]:

$$
\nabla f_1(\vv x) = \begin{bmatrix}
\frac{\partial}{\partial x_1}((x_1-1)^2+(x_2-2)^2) \\[6pt]
\frac{\partial}{\partial x_2}((x_1-1)^2+(x_2-2)^2)
\end{bmatrix} = 2\begin{bmatrix}
x_1-1 \\[6pt]
x_2-2
\end{bmatrix}
$$

and

[^label1]: If you don't know how to compute this gradient, you may wish to review your Math 1410 notes, but don't worry, we won't ask you to calculate gradients on the homework or exams.

$$
\nabla f_2(\vv x) = \nabla (A\vv x-\vv b)^T(A\vv x-\vv b) = 2A^T(A\vv x-\vv b)
$$
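
A handy way to double-check a gradient formula is to compare it with a finite-difference approximation, which nudges one coordinate at a time. The helper `numerical_gradient`, the matrix `A`, the vector `b`, and the test point `x` below are all hypothetical choices for illustration; `c` plays the role of the vector $(1, 2)$ in $f_1$.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                  # nudge only coordinate i
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Hypothetical problem data.
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
c = np.array([1.0, 2.0])

def f1(x):
    return np.sum((x - c) ** 2)

def f2(x):
    return np.sum((A @ x - b) ** 2)

def grad_f1(x):
    return 2 * (x - c)               # formula from the text

def grad_f2(x):
    return 2 * A.T @ (A @ x - b)     # formula from the text

x = np.array([0.5, -1.5])            # arbitrary test point
match_f1 = np.allclose(numerical_gradient(f1, x), grad_f1(x), atol=1e-5)
match_f2 = np.allclose(numerical_gradient(f2, x), grad_f2(x), atol=1e-5)
print("f1 formula matches:", match_f1)
print("f2 formula matches:", match_f2)
```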

One thing you might notice in the [plots above](#contour) is that the red dots, which we placed on the function minimum, have no green arrows. This is because the gradient at these points is zero. In fact, this is true in general: a point $\vv x$ is a local minimum **only if** $\nabla f(\vv x)=0$.

We won't prove this, but intuitively it makes sense: if $\nabla f(\vv x)\neq \vv 0$, then there is a direction, $-\nabla f(\vv x)$, in which we could walk a little bit downhill to decrease $f(\vv x)$. So $\nabla f(\vv x)$ must be zero if we're at a local minimum. Let's check that this holds for $f_1(\vv x)$ and $f_2(\vv x)$ above:

For $f_1(\vv x)$, $\vv x^* = (1, 2)$, and $\nabla f_1(\vv x^*) = 2\begin{bmatrix} x_1^*-1 \\ x_2^*-2 \end{bmatrix} = \vv 0$, and so that checks out. And for $f_2(\vv x)$, $\vv x^* = (A^TA)^{-1}A^T\vv b$, and

$\nabla f_2(\vv x^*) = 2(A^TA \vv x^* - A^T\vv b) = 2(\underbrace{(A^TA)(A^TA)^{-1}}_{I}A^T\vv b - A^T\vv b) = 2(A^T\vv b - A^T\vv b) = \vv 0$!
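
The same check takes a few lines numerically. The data `A` and `b` are made up for illustration. We evaluate $\nabla f_2$ at the least-squares solution and, for contrast, at a point that is not optimal.

```python
import numpy as np

# Hypothetical problem data with linearly independent columns.
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])

def grad_f2(x):
    """Gradient of ||Ax - b||^2."""
    return 2 * A.T @ (A @ x - b)

x_star = np.linalg.solve(A.T @ A, A.T @ b)   # solves the normal equations
grad_at_star = grad_f2(x_star)
grad_at_origin = grad_f2(np.zeros(2))        # a non-optimal point

print("gradient at x*:", grad_at_star)
print("gradient at 0: ", grad_at_origin)
```

Up to round-off, the gradient at `x_star` is the zero vector, while at the origin it is clearly nonzero, so the ant standing there still has somewhere lower to go.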

In general, while $\nabla f(\vv x^*) = \vv 0$ is necessary for $\vv x^*$ to be a local minimum, it is not sufficient, i.e., there may be certain points with $\nabla f(\vv x^*)=\vv 0$ that are not local minima. For example, all red dots in the plot below have vanishing gradients, but only two are minima:

:::{figure}../figures/14-local_minima.jpg
:label:local_minima
:alt:Local Minima
:width: 400px
:align: center
:::
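
A scalar example shows the same phenomenon. The functions below are hypothetical and are not the one plotted above: $f(x) = x^3 - 3x$ has zero derivative at $x = \pm 1$, and $g(x) = x^3$ has zero derivative at $x = 0$. The helper `classify` is a crude check that only compares the function value with its two nearby neighbors.

```python
def f(x):
    return x**3 - 3 * x

def df(x):
    return 3 * x**2 - 3      # derivative of f

def g(x):
    return x**3

def dg(x):
    return 3 * x**2          # derivative of g

def classify(func, x, eps=1e-3):
    """Compare func(x) with its values a small distance to either side."""
    lower_left = func(x - eps) < func(x)
    lower_right = func(x + eps) < func(x)
    if not lower_left and not lower_right:
        return "local minimum"
    if lower_left and lower_right:
        return "local maximum"
    return "neither"

for func, dfunc, x in [(f, df, 1.0), (f, df, -1.0), (g, dg, 0.0)]:
    print(f"x = {x:+.1f}, derivative = {dfunc(x)}, type: {classify(func, x)}")
```

All three points have a vanishing derivative, yet only $x = 1$ is a local minimum of $f$: the point $x = -1$ is a local maximum, and $x = 0$ is a flat spot of $g$ that is neither.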

However, if our function $f$ is convex (bowl shaped), we have the following theorem which tells us that walking downhill will always find a **global** minimum:

:::{prf:theorem}
:label: bowl_thm
Let $f:\mathbb{R}^n \to \mathbb{R}$ be convex and differentiable. Then $\vv x^*$ is globally optimal, i.e.,
$$
f(\vv x^*) \leq f(\vv x) \quad \forall \vv x\in\mathbb{R}^n
$$

if and only if $\nabla f(\vv x^*) = \vv 0$.
:::

Next, we'll use our newfound insights to define _gradient descent_, a widely used iterative algorithm for finding local minima of optimization problem [](#opt_defn_eqn).

