# Lecture 3: *Multivariate Optimization*

The goal of this lecture is to generalize our theory of optimization over $\mathbb{R}$ and $\mathbb{R}^2$ to optimization over a general Euclidean space, $\mathbb{R}^d$. 

# Part I: Preliminaries and Definitions

We let 
$$
\mathbb{R}^d =\{(x_1,\:x_2,\:\ldots,x_d):x_i\in\mathbb{R}\text{ for all }i=1,\ldots, d\}
$$
denote $d$-dimensional Euclidean space, which consists of all ordered $d$-tuples of real numbers. Many times we will instead identify $\mathbb{R}^d$ with the set of real-valued column vectors of the form
$$
{\bf x}=\begin{pmatrix}
x_1\\
x_2\\
\vdots\\
x_d
\end{pmatrix}
$$
to facilitate linear operations. The value $x_i$ is called the $i$th **entry** or **component** of ${\bf x}$. In particular, we impose a vector space structure on $\mathbb{R}^d$ by defining the vector addition and scalar multiplication:
$$
\begin{pmatrix}
x_1\\
x_2\\
\vdots\\
x_d
\end{pmatrix} + \begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_d
\end{pmatrix}=\begin{pmatrix}
x_1+y_1\\
x_2+y_2\\
\vdots\\
x_d+y_d
\end{pmatrix} \text{ and } a\cdot\begin{pmatrix}
x_1\\
x_2\\
\vdots\\
x_d
\end{pmatrix} = \begin{pmatrix}
ax_1\\
ax_2\\
\vdots\\
ax_d
\end{pmatrix}
$$
for all
$$
{\bf x}=\begin{pmatrix}
x_1\\
x_2\\
\vdots\\
x_d
\end{pmatrix}, {\bf y}=\begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_d
\end{pmatrix}\in\mathbb{R}^d
$$
and all $a\in\mathbb{R}$. In general, we will write ${\bf x}-{\bf y}$ instead of ${\bf x} + ((-1)\cdot {\bf y})$ and $a{\bf x}$ instead of $a\cdot{\bf x}$. One can check that $\mathbb{R}^d$ is a vector space under these operations. We also equip $\mathbb{R}^d$ with the **inner product**

$$
{\bf x}^T{\bf y} = \begin{pmatrix}
x_1&
x_2&
\cdots&
x_d
\end{pmatrix} \begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_d
\end{pmatrix} = \sum_{i=1}^d x_i y_i.
$$

Important properties of the inner product are

1. **Positivity**: ${\bf x}^T{\bf x}=\Vert {\bf x}\Vert^2\geq 0$ for all ${\bf x}\in\mathbb{R}^d$ and ${\bf x}^T{\bf x}=0$ if and only if ${\bf x}={\bf 0}$
2. **Symmetry**: ${\bf x}^T{\bf y} = {\bf y}^T{\bf x}$ for all ${\bf x}, {\bf y}\in\mathbb{R}^d$
3. **Linearity in the first variable**: $\left(a{\bf x}+b{\bf y}\right)^T{\bf z}=a {\bf x}^T {\bf z} + b{\bf y}^T{\bf z}$ for all $a, b\in\mathbb{R}$ and all ${\bf x}, {\bf y}, {\bf z}\in\mathbb{R}^d$.

Note that symmetry and linearity in the first variable imply **linearity in the second variable**, and hence the inner product is called **bilinear**. We also note that the standard Euclidean norm is given by $\Vert {\bf x}\Vert = \sqrt{{\bf x}^T{\bf x}}$.
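
As a quick numerical illustration of these properties, here is a minimal sketch using NumPy (an assumption; the lecture does not prescribe any software). The vectors `x`, `y`, `z` and scalars `a`, `b` are arbitrary choices made only for this example.

```python
import numpy as np

# Arbitrary illustrative vectors in R^3 and scalars (not from the lecture).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -5.0, 6.0])
z = np.array([0.5, 0.0, -1.0])
a, b = 2.0, -3.0

inner_xy = x @ y            # x^T y = sum_i x_i y_i
norm_x = np.sqrt(x @ x)     # Euclidean norm computed from the inner product

# Symmetry: x^T y equals y^T x.
symmetric = np.isclose(x @ y, y @ x)

# Linearity in the first variable: (a x + b y)^T z = a x^T z + b y^T z.
linear = np.isclose((a * x + b * y) @ z, a * (x @ z) + b * (y @ z))

print(inner_xy, norm_x, symmetric, linear)
```

A check on three particular vectors is of course not a proof; it only confirms that the formulas behave as stated for this one choice.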

The **graph** of $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is defined to be the set of points

$$
(x_1, x_2,\ldots, x_d, f(x_1, x_2, \ldots, x_d))\in\mathbb{R}^{d+1}.
$$

For $d=2$, the graph was a surface in $\mathbb{R}^3$, and so we could visualize it by plotting. For $d>2$, we no longer have access to this graph, but it is often helpful to use our intuition from the $d=2$ case.

One intuitive generalization is that the reflection principle also applies when we optimize over $\mathbb{R}^d$; a maximizer of $f({\bf x})$ over $\mathbb{R}^d$ is also a minimizer of $-f({\bf x})$ over $\mathbb{R}^d$. Thus, an unconstrained optimization program involving $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is generally expressed as

$$
(P):\:\:\min_{{\bf x}\in\mathbb{R}^d} f({\bf x}).
$$

A **solution** to this program is any ${\bf x}^\ast\in\mathbb{R}^d$ satisfying $f({\bf x}^\ast)\leq f({\bf x})$ for all ${\bf x}\in\mathbb{R}^d$, and two programs are **equivalent** if they have the same solutions. Note that, if $\phi:\mathbb{R}\rightarrow\mathbb{R}$ is **order preserving**, then

$$
\min_{{\bf x}\in\mathbb{R}^d} \phi(f({\bf x}))
$$

is equivalent to $(P)$. Furthermore, the **minimum value** of $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is any value $p$ such that $p\leq f({\bf x})$ for all ${\bf x}\in\mathbb{R}^d$ and if $q\leq f({\bf x})$ for all ${\bf x}\in\mathbb{R}^d$ then $q\leq p$.

## Constrained Multivariate Optimization

We will often consider some subset $X\subset\mathbb{R}^d$ and a function $f:X\rightarrow\mathbb{R}$, where $X$ is determined by other functions. That is, $X$ will be specified by identifying functions $g_1, g_2, \ldots, g_n:\mathbb{R}^d\rightarrow\mathbb{R}$ and $h_1, h_2,\ldots, h_m:\mathbb{R}^d\rightarrow\mathbb{R}$ so that

$$
X = \{{\bf x}\in\mathbb{R}^d: g_i({\bf x})=0\text{ and }h_j({\bf x})\leq 0\text{ for all }i=1,\ldots,n\text{ and }j=1,\ldots, m\}.
$$

1. We call the constraints $g_i({\bf x})=0$ **equality constraints**
2. We call the constraints $h_j({\bf x})\leq 0$ **inequality constraints**

A constrained optimization program is then expressed as

$$
\min f({\bf x})\text{ subject to } g_i({\bf x})=0\text{ and }h_j({\bf x})\leq 0\text{ for all }i=1,\ldots, n\text{ and }j=1,\ldots, m.
$$

The set $X$ is called the **feasible region**. The program is called **feasible** if $X\not=\emptyset$, and is called **infeasible** otherwise. A point ${\bf x}\in X$ is called a **feasible point** or simply **feasible**. If ${\bf x}\in X$ and $h_j({\bf x})<0$ for all $j=1,\ldots, m$, then ${\bf x}$ is called **strictly feasible**. 

### Example: 
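If $A$ is a $d$ by $d$ matrix (that is, $A\in M_{d, d}$), then the following is a constrained optimization program with one equality constraint and $d$ inequality constraints:
$$
\min \frac{1}{2} {\bf x}^T A {\bf x}\text{ subject to } \Vert {\bf x}\Vert^2=1,\: x_i\geq 0\text{ for all }i=1,2,\ldots,d.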
$$
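
To connect this example to the definitions above, the constraints can be put in standard form as $g({\bf x})=\Vert {\bf x}\Vert^2-1=0$ and $h_i({\bf x})=-x_i\leq 0$. The following minimal sketch (assuming NumPy; the function names, the tolerance `tol`, and the matrix `A` are hypothetical choices for illustration) evaluates the objective and tests feasibility of a given point.

```python
import numpy as np

def objective(A, x):
    """Objective value (1/2) x^T A x."""
    return 0.5 * x @ A @ x

def is_feasible(x, tol=1e-9):
    """Check g(x) = ||x||^2 - 1 = 0 and h_i(x) = -x_i <= 0, up to a tolerance."""
    equality_ok = abs(x @ x - 1.0) <= tol
    inequality_ok = np.all(-x <= tol)
    return bool(equality_ok and inequality_ok)

# Illustrative data (not from the lecture).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
x_feasible = np.array([1.0, 0.0])                      # unit norm, non-negative entries
x_infeasible = np.array([1.0, -1.0]) / np.sqrt(2.0)    # unit norm, but one negative entry

print(is_feasible(x_feasible), objective(A, x_feasible))
print(is_feasible(x_infeasible))
```

Note that `x_feasible` is feasible but not strictly feasible, since its second entry satisfies $h_2({\bf x})=0$ rather than $h_2({\bf x})<0$.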


# Part II: Important considerations

Our concerns for multivariate optimization are the same as in the univariate and bivariate cases.

1. Does $(P)$ have a minimum value?
2. Does $(P)$ have a solution?
3. Does $(P)$ have a **unique** solution?
4. When $(P)$ has a minimum value, how can we find an $\widetilde{\bf x}$ such that $f(\widetilde{\bf x})$ is close to the minimum value?
5. When $(P)$ has a solution ${\bf x}^\ast$, how can we find an $\widetilde{\bf x}$ which is close to ${\bf x}^\ast$?

We will explore how the answers to these questions generalize from the 2D case.




# Part III: Existence of minimum values and minimizers

If there is an $L\in\mathbb{R}$ such that $L\leq f({\bf x})$ for all ${\bf x}\in\mathbb{R}^d$, then $f$ is said to be **bounded below**.

#### Theorem: If $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is bounded below, then $(P)$ has a minimum value.

Our next goal is to generalize the Extreme Value Theorem to general Euclidean spaces. For $\varepsilon>0$, the **open $\varepsilon$-ball** around ${\bf x}\in\mathbb{R}^d$ is the set 

$$
B({\bf x},\varepsilon)=\{{\bf y}\in\mathbb{R}^d: \Vert {\bf x}-{\bf y}\Vert<\varepsilon\}.
$$

Given $X\subset\mathbb{R}^d$, $f:X\rightarrow\mathbb{R}$ is said to be **continuous at** ${\bf x}\in X$ if for every $\varepsilon>0$ there is a $\delta>0$ such that $\vert f({\bf y})-f({\bf x})\vert<\varepsilon$ for all ${\bf y}\in B({\bf x},\delta)\cap X$. $f$ is **continuous** on $X$ (or just **continuous**) if $f$ is continuous at all ${\bf x}\in X$.

For $X\subset\mathbb{R}^d$, a subset $U\subset X$ is called **open** in $X$ if for every ${\bf x}\in U$ there is an $\varepsilon>0$ such that $B({\bf x},\varepsilon)\cap X\subset U$. A subset $Q\subset X$ is called **closed** in $X$ if its complement
$$
\overline{Q}=\{{\bf x}\in X: {\bf x}\not\in Q\}
$$
is open in $X$. The following theorem is very helpful for constrained optimization over $\mathbb{R}^d$.

#### Theorem: If $g_1, g_2,\ldots, g_n:\mathbb{R}^d\rightarrow\mathbb{R}$ and $h_1,h_2,\ldots,h_m:\mathbb{R}^d\rightarrow\mathbb{R}$ are all continuous functions, then the set $X=\{{\bf x}\in\mathbb{R}^d: g_i({\bf x})=0, h_j({\bf x})\leq 0\text{ for all }i,j\}$ is closed.

A set $X\subset\mathbb{R}^d$ is said to be **bounded** if there is an $R\in\mathbb{R}$ such that $X\subset B({\bf 0}, R)$.

A set $X\subset\mathbb{R}^d$ is **compact** if it is both closed and bounded.

#### Theorem (Extreme Value Theorem): If $X\subset\mathbb{R}^d$ is compact and $f:X\rightarrow\mathbb{R}$ is continuous on $X$, then $f$ has a minimizer ${\bf x}^\ast\in X$.

# Part IV: Uniqueness of Solutions and Convexity
Uniqueness of solutions is generally contingent upon convexity of the optimization program. 

A set $X\subset\mathbb{R}^d$ is said to be **convex** if for any ${\bf x}, {\bf y}\in X$ and any $t\in[0,1]$, $(1-t){\bf x} + t{\bf y}\in X$. 

If $X\subset\mathbb{R}^d$ is a convex set:

1. $f:X\rightarrow\mathbb{R}$ is said to be **convex** on $X$ if for every ${\bf x}, {\bf y}\in X$ and every $t\in[0,1]$ we have that
$$
f((1-t){\bf x} + t{\bf y}) \leq (1-t)f({\bf x}) + t f({\bf y}).
$$ 
2. $f:X\rightarrow\mathbb{R}$ is said to be **strictly convex** on $X$ if for every ${\bf x}, {\bf y}\in X$ and every $t\in(0,1)$ we have that
$$
f((1-t){\bf x} + t{\bf y}) < (1-t)f({\bf x}) + t f({\bf y}).
$$ 


We again have that strict convexity implies convexity, which in turn implies continuity on open convex sets.
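
The defining inequality can be probed numerically along a single line segment. The sketch below (assuming NumPy; the helper name `violates_convexity`, the sample count `num_t`, and the test functions are hypothetical) samples values of $t\in[0,1]$ and reports whether the convexity inequality fails at any of them. Sampling like this can only refute convexity; it cannot prove it.

```python
import numpy as np

def violates_convexity(f, x, y, num_t=11, tol=1e-12):
    """Return True if f((1-t)x + ty) > (1-t)f(x) + t f(y) for some sampled t in [0, 1]."""
    for t in np.linspace(0.0, 1.0, num_t):
        lhs = f((1 - t) * x + t * y)
        rhs = (1 - t) * f(x) + t * f(y)
        if lhs > rhs + tol:
            return True
    return False

def sq_norm(v):
    """Squared Euclidean norm, a convex function."""
    return v @ v

def neg_sq_norm(v):
    """Negative squared norm, which is not convex."""
    return -(v @ v)

# Arbitrary endpoints of the segment (illustrative only).
x = np.array([1.0, 0.0])
y = np.array([-1.0, 2.0])

print(violates_convexity(sq_norm, x, y))      # no violation found on this segment
print(violates_convexity(neg_sq_norm, x, y))  # a violation is found
```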

#### Theorem (Convex Functions are Continuous): If $X\subset\mathbb{R}^d$ is convex and open, and $f:X\rightarrow\mathbb{R}$ is convex on $X$, then $f$ is continuous on $X$.

A function $g:\mathbb{R}^d\rightarrow\mathbb{R}^m$ is called **affine** if $g(t{\bf x}+(1-t){\bf y})=tg({\bf x})+(1-t)g({\bf y})$ for all $t\in\mathbb{R}$ and all ${\bf x}, {\bf y}\in\mathbb{R}^d$. It is easy to show that $g$ is affine if and only if there is an $A\in M_{m, d}$ and a ${\bf b}\in \mathbb{R}^m$ such that $g({\bf x})=A{\bf x} + {\bf b}$. 

#### Theorem (Convex Domains): If $g:\mathbb{R}^d\rightarrow\mathbb{R}^m$ is affine and $h_1,\ldots, h_n:\mathbb{R}^d\rightarrow\mathbb{R}$ are all convex functions, then the set $X=\{{\bf x}\in\mathbb{R}^d:g({\bf x})={\bf 0}, h_j({\bf x})\leq 0\text{ for all }j\}$ is convex.

If $f$ is convex, then $(P)$ is called a **convex program**, and a constrained optimization program is called convex if the objective function is convex, the equality constraints are affine functions, and the inequality constraints are convex functions.

Now, a **strict minimizer** (or **unique minimizer**) of $f$ on $X$ is a point ${\bf x}^\ast\in X$ such that $f({\bf x}^\ast)<f({\bf x})$ for all ${\bf x}\in X\setminus\{{\bf x}^\ast\}$.

#### Theorem (Fundamental Theorem of Convex Programming): If $X\subset\mathbb{R}^d$ is convex and compact, and $f:X\rightarrow\mathbb{R}$ is convex on $X$, then the set of minimizers of $f$ on $X$ forms a convex set. Moreover, if $f$ is strictly convex on $X$, then $f$ has a unique minimizer on $X$.

Our first goal is to generalize the first order conditions for convexity. For convenience, we first define the **standard orthonormal basis** of $\mathbb{R}^d$ as $\{{\bf e}^{(i)}\}_{i=1}^d\subset\mathbb{R}^d$ where the $j$th entry of ${\bf e}^{(i)}$ is given by

$$
{\bf e}^{(i)}_j = \left\{\begin{array}{cl}
1 & \text{ if }i=j\\
0 & \text{ if }i\not=j
\end{array}\right.
$$

for all $i, j\in \{1, 2, \ldots, d\}$.

We say that $f\in C^1(\mathbb{R}^d)$ if for each $i=1,\ldots, d$ and each ${\bf x}\in\mathbb{R}^d$,
$$
\frac{\partial f}{\partial x_i}({\bf x}) = \lim_{\Delta x_i\rightarrow 0} \frac{f({\bf x}+\Delta x_i {\bf e}^{(i)})-f({\bf x})}{\Delta x_i}
$$
is defined, and the functions $\frac{\partial f}{\partial x_i}:\mathbb{R}^d\rightarrow\mathbb{R}$ are all continuous.

Now, generalizing from 2D, we have that the first order Taylor expansion of $f\in C^1(\mathbb{R}^d)$ at ${\bf x}^{(0)}\in\mathbb{R}^d$ is $p_1({\bf x})=f({\bf x}^{(0)})+\nabla f({\bf x}^{(0)})^T({\bf x}-{\bf x}^{(0)})$, where the **gradient** of $f$ at ${\bf x}^{(0)}$ is

$$
\nabla f({\bf x}^{(0)}) = \begin{pmatrix}
\partial_1 f({\bf x}^{(0)})\\
\partial_2 f({\bf x}^{(0)})\\
\vdots\\
\partial_d f({\bf x}^{(0)})
\end{pmatrix} = \begin{pmatrix}
\frac{\partial f}{\partial x_1}({\bf x}^{(0)})\\
\frac{\partial f}{\partial x_2}({\bf x}^{(0)})\\
\vdots\\
\frac{\partial f}{\partial x_d}({\bf x}^{(0)})
\end{pmatrix}.
$$
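
The partial derivatives in the gradient can be approximated by difference quotients along the standard basis vectors ${\bf e}^{(i)}$. The sketch below (assuming NumPy) uses central differences, which is a swapped-in approximation rather than the one-sided limit in the definition; the step size `step`, the test function `f`, and its hand-computed gradient `grad_f` are hypothetical choices for illustration.

```python
import numpy as np

def numerical_gradient(f, x, step=1e-6):
    """Approximate the gradient of f at x with central differences along each e^(i)."""
    d = x.size
    grad = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0  # i-th standard basis vector
        grad[i] = (f(x + step * e_i) - f(x - step * e_i)) / (2 * step)
    return grad

def f(x):
    """Illustrative C^1 function on R^3."""
    return x[0] ** 2 + 3.0 * x[0] * x[1] + np.sin(x[2])

def grad_f(x):
    """Gradient of f computed by hand, for comparison."""
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0], np.cos(x[2])])

x0 = np.array([1.0, -2.0, 0.5])

def taylor_p1(x):
    """First order Taylor expansion of f at x0."""
    return f(x0) + grad_f(x0) @ (x - x0)

print(numerical_gradient(f, x0))
print(grad_f(x0))
```

Near `x0`, the values `taylor_p1(x)` and `f(x)` should agree closely, with the gap shrinking faster than $\Vert {\bf x}-{\bf x}^{(0)}\Vert$.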

The first order conditions for convexity in 2D are expressed compactly by $f({\bf x})\geq f({\bf x}^{(0)})+\nabla f({\bf x}^{(0)})^T({\bf x}-{\bf x}^{(0)})$ for all ${\bf x}, {\bf x}^{(0)}$, and therefore the following theorem is a natural generalization.

#### Theorem (First Order Conditions for Convexity): If $X\subset\mathbb{R}^d$ is convex, then $f\in C^1(X)$ is convex if and only if $f({\bf x})\geq f({\bf y}) + \nabla f({\bf y})^T({\bf x}-{\bf y})$ for all ${\bf x}, {\bf y}\in X$. If $f({\bf x})> f({\bf y}) + \nabla f({\bf y})^T({\bf x}-{\bf y})$ for all ${\bf x}, {\bf y}\in X$ with ${\bf x}\not={\bf y}$, then $f$ is strictly convex on $X$.

We now generalize the second order conditions for convexity. We say that $f\in C^2(\mathbb{R}^d)$ if $f\in C^1(\mathbb{R}^d)$ and $\frac{\partial f}{\partial x_i}\in C^1(\mathbb{R}^d)$ for each $i=1, 2,\ldots, d$. The second order conditions require convexity for the second order Taylor approximations 

$$
p_2({\bf x}) = f({\bf x}^{(0)}) + \nabla f({\bf x}^{(0)})^T({\bf x}-{\bf x}^{(0)}) + \frac{1}{2}({\bf x}-{\bf x}^{(0)})^T\nabla^2 f({\bf x}^{(0)})({\bf x}-{\bf x}^{(0)}),
$$

for each ${\bf x}^{(0)}\in\mathbb{R}^d$, and where the **Hessian** of $f$ at ${\bf x}^{(0)}$ is 

$$
\nabla^2 f({\bf x}^{(0)})= \begin{pmatrix}
\partial_{1, 1} f({\bf x}^{(0)}) & \partial_{1, 2} f({\bf x}^{(0)}) & \cdots & \partial_{1, d} f({\bf x}^{(0)})\\
\partial_{1, 2} f({\bf x}^{(0)}) & \partial_{2, 2} f({\bf x}^{(0)}) & \cdots & \partial_{2, d} f({\bf x}^{(0)})\\
\vdots & \vdots & \ddots & \vdots\\
\partial_{1, d} f({\bf x}^{(0)}) & \partial_{2, d} f({\bf x}^{(0)}) & \cdots & \partial_{d, d} f({\bf x}^{(0)})
\end{pmatrix}=\begin{pmatrix}
\frac{\partial^2 f}{\partial x_1\partial x_1}({\bf x}^{(0)}) & \frac{\partial^2 f}{\partial x_1\partial x_2}({\bf x}^{(0)}) & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_d}({\bf x}^{(0)})\\
\frac{\partial^2 f}{\partial x_1\partial x_2}({\bf x}^{(0)}) & \frac{\partial^2 f}{\partial x_2\partial x_2}({\bf x}^{(0)}) & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_d}({\bf x}^{(0)})\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^2 f}{\partial x_1\partial x_d}({\bf x}^{(0)}) & \frac{\partial^2 f}{\partial x_2\partial x_d}({\bf x}^{(0)}) & \cdots & \frac{\partial^2 f}{\partial x_d\partial x_d}({\bf x}^{(0)})
\end{pmatrix}.
$$

Now, the first order Taylor approximation to $p_2$ at ${\bf y}\in\mathbb{R}^d$ is 

\begin{eqnarray}
q_1({\bf x}) &=& f({\bf x}^{(0)}) + \nabla f({\bf x}^{(0)})^T({\bf y}-{\bf x}^{(0)}) + \frac{1}{2}({\bf y}-{\bf x}^{(0)})^T\nabla^2 f({\bf x}^{(0)})({\bf y}-{\bf x}^{(0)}) + \left(\nabla f({\bf x}^{(0)}) + \nabla^2 f({\bf x}^{(0)})({\bf y}-{\bf x}^{(0)})\right)^T({\bf x}-{\bf y})\\
&=& f({\bf x}^{(0)}) + \nabla f({\bf x}^{(0)})^T({\bf x}-{\bf x}^{(0)}) + \frac{1}{2}({\bf y}-{\bf x}^{(0)})^T\nabla^2 f({\bf x}^{(0)})({\bf y}-{\bf x}^{(0)}) + ({\bf y}-{\bf x}^{(0)})^T\nabla^2 f({\bf x}^{(0)})({\bf x}-{\bf y})\\
&=&f({\bf x}^{(0)}) + \nabla f({\bf x}^{(0)})^T({\bf x}-{\bf x}^{(0)}) + \frac{1}{2}({\bf x}-{\bf x}^{(0)})^T\nabla^2 f({\bf x}^{(0)})({\bf x}-{\bf x}^{(0)}) -\frac{1}{2}({\bf x}-{\bf y})^T\nabla^2 f({\bf x}^{(0)})({\bf x}-{\bf y})\\
&=& p_2({\bf x}) -\frac{1}{2}({\bf x}-{\bf y})^T\nabla^2 f({\bf x}^{(0)})({\bf x}-{\bf y})
\end{eqnarray}

The first order conditions for convexity applied to $p_2$ are equivalent to $p_2({\bf x})\geq q_1({\bf x})$ for all ${\bf x},{\bf y}\in\mathbb{R}^d$, and therefore we must have that

$$
\frac{1}{2} ({\bf x}-{\bf y})^T\nabla^2 f({\bf x}^{(0)})({\bf x}-{\bf y})\geq 0
$$

for all ${\bf x},{\bf y}\in\mathbb{R}^d$. Setting ${\bf z}= \frac{1}{\sqrt{2}}({\bf x}-{\bf y})$, and noting that every ${\bf z}\in\mathbb{R}^d$ arises in this way (take ${\bf x} = \sqrt{2}{\bf z}$ and ${\bf y}={\bf 0}$), we have that convexity of the second order Taylor approximation at ${\bf x}^{(0)}$ is equivalent to

$$
{\bf z}^T\nabla^2 f({\bf x}^{(0)}){\bf z}\geq 0
$$

for all ${\bf z}\in\mathbb{R}^d$. Now, we let $M_{n,n}$ denote the set of $n$ by $n$ matrices with real entries. If $A\in M_{n,n}$ is **symmetric** ($A^T=A$), we say that $A$ is **positive semidefinite** if ${\bf z}^T A{\bf z}\geq 0$ for all ${\bf z}\in\mathbb{R}^n$, and **positive definite** if ${\bf z}^T A{\bf z}> 0$ for all nonzero ${\bf z}\in\mathbb{R}^n$. Thus, the second order conditions may be stated in terms of the positive semidefiniteness of the Hessian $\nabla^2 f({\bf x}^{(0)})$ at every ${\bf x}^{(0)}\in\mathbb{R}^d$:

#### Theorem (Second Order Conditions for Convexity): If $X\subset\mathbb{R}^d$ is convex, then $f\in C^2(X)$ is convex if and only if $\nabla^2f({\bf x}^{(0)})$ is positive semidefinite for all ${\bf x}^{(0)}\in X$. If $\nabla^2f({\bf x}^{(0)})$ is positive definite for all ${\bf x}^{(0)}\in X$, then $f$ is strictly convex. 
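
For example (a standard worked case, with $A$, ${\bf b}$, and $r$ introduced here only for illustration), consider the quadratic function $f({\bf x})=\frac{1}{2}{\bf x}^TA{\bf x}+{\bf b}^T{\bf x}+r$, where $A\in M_{d,d}$ is symmetric, ${\bf b}\in\mathbb{R}^d$, and $r\in\mathbb{R}$. Differentiating entrywise gives

$$
\nabla f({\bf x}) = A{\bf x}+{\bf b}\quad\text{ and }\quad\nabla^2 f({\bf x}) = A
$$

for all ${\bf x}\in\mathbb{R}^d$. The Hessian does not depend on ${\bf x}$, so by the theorem $f$ is convex on $\mathbb{R}^d$ if and only if $A$ is positive semidefinite, and $f$ is strictly convex whenever $A$ is positive definite.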

If $X\subset\mathbb{R}^d$ is convex, we say that $f\in C^2(X)$ is **strongly convex** if there is a $c>0$ such that ${\bf u}^T\nabla^2 f({\bf x}){\bf u}\geq c$ for all ${\bf x}\in X$ and all ${\bf u}\in\mathbb{R}^d$ with $\Vert {\bf u}\Vert=1$.

Finally, we need some techniques to determine positive definiteness of a matrix.


## Positive definite matrices

We say that $U\in M_{d, d}$ is **orthogonal** if $U^TU=I$, where $I$ is the $d$ by $d$ identity matrix (all diagonal entries are $1$'s and all other entries are $0$'s).

#### Theorem : For a matrix $U\in M_{d, d}$, the following are equivalent:
1. $U$ is orthogonal
2. the columns of $U$ form an orthonormal basis of $\mathbb{R}^d$
3. the rows of $U$ form an orthonormal basis of $\mathbb{R}^d$
4. $\Vert U{\bf x} \Vert = \Vert {\bf x}\Vert$ for all ${\bf x}\in\mathbb{R}^d$

For a vector ${\bf v}\in\mathbb{R}^d$, we define

$$
\text{diag}({\bf v}) = \begin{pmatrix}
v_1 & 0 &\cdots & 0 & 0\\
0 & v_2 &\cdots & 0 & 0\\
\vdots & \vdots &\ddots & \vdots &\vdots\\
0 & 0 & \cdots & v_{d-1} & 0\\
0 & 0 & \cdots & 0 & v_d
\end{pmatrix}.
$$

#### Theorem (Spectral Theorem): If $A\in M_{d, d}$ is symmetric, then there exists ${\bf v}\in\mathbb{R}^d$ and an orthogonal matrix $U\in M_{d, d}$ such that $ A = U\text{diag}({\bf v}) U^T$. Moreover, the entries of ${\bf v}$ list the eigenvalues of $A$ and the columns of $U$ are an orthonormal basis of eigenvectors for $A$.
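
The factorization in the Spectral Theorem can be computed numerically. The sketch below assumes NumPy and uses its `np.linalg.eigh` routine for symmetric matrices; the matrix `A` is an illustrative choice, not one prescribed at this point in the lecture.

```python
import numpy as np

# Illustrative symmetric matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh returns the eigenvalues in ascending order and the eigenvectors as columns of U.
v, U = np.linalg.eigh(A)

is_orthogonal = np.allclose(U.T @ U, np.eye(3))      # U^T U = I
reconstructs = np.allclose(U @ np.diag(v) @ U.T, A)  # A = U diag(v) U^T

print(v)
print(is_orthogonal, reconstructs)
```

Each column `U[:, i]` is an eigenvector for the eigenvalue `v[i]`, which can be confirmed by comparing `A @ U[:, i]` with `v[i] * U[:, i]`.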


#### Theorem (Eigenvalue Characterization of Positive Definiteness): A symmetric matrix $A$ is positive semidefinite if and only if all of its eigenvalues are non-negative, and it is positive definite if and only if all of its eigenvalues are strictly positive.
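
This characterization translates directly into a numerical test. The following minimal sketch (assuming NumPy; the function name `classify_definiteness` and the tolerance `tol` are hypothetical, and the tolerance is needed because computed eigenvalues carry rounding error) classifies a symmetric matrix by the signs of its eigenvalues.

```python
import numpy as np

def classify_definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues, up to a tolerance."""
    eigenvalues = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
    if np.all(eigenvalues > tol):
        return "positive definite"
    if np.all(eigenvalues >= -tol):
        return "positive semidefinite"
    return "not positive semidefinite"

# Illustrative 2 by 2 examples with eigenvalues {1, 3}, {0, 2}, and {-1, 3}.
P = np.array([[2.0, 1.0], [1.0, 2.0]])
S = np.array([[1.0, 1.0], [1.0, 1.0]])
N = np.array([[1.0, 2.0], [2.0, 1.0]])

for M in (P, S, N):
    print(classify_definiteness(M))
```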

Computing the eigenvalues of $A$ is much more difficult when $d>2$. To establish an effective computational tool, we need a way to ensure positive definiteness. We will generalize Sylvester's criterion, but we must first introduce notation. 

We know how to compute the determinant of a $2$ by $2$ matrix, and the **Laplace expansion** expresses the determinant of a $d$ by $d$ matrix $A$ as

$$
\det\begin{pmatrix}
a_{1, 1} & a_{1, 2} & \cdots & a_{1, d}\\
a_{2, 1} & a_{2, 2} & \cdots & a_{2, d}\\
\vdots & \vdots & \ddots & \vdots\\
a_{d, 1} & a_{d, 2} & \cdots & a_{d, d}
\end{pmatrix} = (-1)^{1+1}a_{1, 1} \det\begin{pmatrix}
a_{2, 2} & a_{2, 3} & \cdots & a_{2, d}\\
a_{3, 2} & a_{3, 3} & \cdots & a_{3, d}\\
\vdots & \vdots & \ddots & \vdots\\
a_{d, 2} & a_{d, 3} & \cdots & a_{d, d}
\end{pmatrix} + (-1)^{1+2} a_{1, 2} \det\begin{pmatrix}
a_{2, 1} & a_{2, 3} & \cdots & a_{2, d}\\
a_{3, 1} & a_{3, 3} & \cdots & a_{3, d}\\
\vdots & \vdots & \ddots & \vdots\\
a_{d, 1} & a_{d, 3} & \cdots & a_{d, d}
\end{pmatrix} +\cdots + (-1)^{1+d} a_{1, d} \det\begin{pmatrix}
a_{2, 1} & a_{2, 2} & \cdots & a_{2, (d-1)}\\
a_{3, 1} & a_{3, 2} & \cdots & a_{3, (d-1)}\\
\vdots & \vdots & \ddots & \vdots\\
a_{d, 1} & a_{d, 2} & \cdots & a_{d, (d-1)}
\end{pmatrix}
$$

That is, we expand along the *top row* of $A$ in a sum with alternating signs, where the $j$th term is the entry $a_{1, j}$ times the determinant of the $(d-1)$ by $(d-1)$ submatrix of $A$ obtained by removing the first row and $j$th column of $A$. Now, if $A\in M_{d, d}$, a **minor** of $A$ is the determinant of any $k$ by $k$ submatrix of $A$ for $1\leq k\leq d$. In particular, we let $A_{i, j}$ denote the determinant of the $(d-1)$ by $(d-1)$ submatrix obtained by removing the $i$th row and $j$th column of $A$. Therefore the Laplace expansion may be written as

$$
\det A = (-1)^{1+1} a_{1, 1} A_{1, 1} + (-1)^{1+2} a_{1, 2} A_{1, 2}+\cdots + (-1)^{1+d}a_{1, d} A_{1, d}.
$$

It also makes sense to generalize the Laplace expansion to expansion along the $i$th row:

$$
\det A = \sum_{j=1}^d (-1)^{i+j} a_{i, j} A_{i, j}
$$

or, the $j$th column:

$$
\det A = \sum_{i=1}^d (-1)^{i+j} a_{i, j} A_{i, j}.
$$
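
The row expansion is naturally recursive, since each term involves a determinant of a smaller matrix. The sketch below (assuming NumPy; the function name `laplace_det` is hypothetical) implements expansion along a chosen row, using 0-based indices, which leaves the sign $(-1)^{i+j}$ unchanged. It is meant only to mirror the formula: the number of operations grows like $d!$, so it is not a practical way to compute determinants of large matrices.

```python
import numpy as np

def laplace_det(A, row=0):
    """Determinant by Laplace expansion along the given row (0-based index)."""
    A = np.asarray(A, dtype=float)
    d = A.shape[0]
    if d == 1:
        return A[0, 0]
    total = 0.0
    for j in range(d):
        # Submatrix obtained by deleting the chosen row and the j-th column.
        sub = np.delete(np.delete(A, row, axis=0), j, axis=1)
        total += (-1) ** (row + j) * A[row, j] * laplace_det(sub)
    return total

# Illustrative matrix; expansion along any row should give the same value.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

print([laplace_det(A, row=i) for i in range(3)])
print(np.linalg.det(A))  # NumPy's determinant, for comparison
```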

Now, if a $k$ by $k$ minor of $A$ includes $k$ diagonal entries of $A$ (that is, the submatrix is formed from the same set of $k$ rows and columns), the minor is called a **principal minor**. The $k$ by $k$ principal minor of $A$ formed from the first $k$ rows and columns of $A$ is called a **leading principal minor**.

### Example: 
$$
A = \begin{pmatrix}
2 & 1 & 0\\
1 & 2 & 1\\
0 & 1 & 2
\end{pmatrix}
$$
has $1$ by $1$ principal minors $2$, $2$, $2$; $2$ by $2$ principal minors
$$
\det\begin{pmatrix}
2 & 1\\
1 & 2
\end{pmatrix}=3, \det\begin{pmatrix}
2 & 0\\
0 & 2
\end{pmatrix}=4, \text{ and } \det\begin{pmatrix}
2 & 1\\
1 & 2
\end{pmatrix}=3;
$$
and $3$ by $3$ principal minor
$$
\det\begin{pmatrix}
2 & 1 & 0\\
1 & 2 & 1\\
0 & 1 & 2
\end{pmatrix} = 2 (3) + (-1)(1)(2) + (0)(1)=4.
$$
The leading principal minors of $A$ are then $2$, $3$, and $4$.
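
Principal minors can be enumerated mechanically by choosing the same index set for rows and columns. The sketch below (assuming NumPy and the standard library's `itertools.combinations`; the function names are hypothetical) lists the $k$ by $k$ principal minors and the leading principal minors of the matrix $A$ from this example.

```python
from itertools import combinations

import numpy as np

def principal_minors(A, k):
    """All k by k principal minors: the same index set is used for rows and columns."""
    d = A.shape[0]
    return [np.linalg.det(A[np.ix_(idx, idx)]) for idx in combinations(range(d), k)]

def leading_principal_minors(A):
    """Determinants of the top-left k by k blocks for k = 1, ..., d."""
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

for k in (1, 2, 3):
    print(k, np.round(principal_minors(A, k), 6))
print(np.round(leading_principal_minors(A), 6))
```

Up to floating point rounding, the last line reproduces the leading principal minors $2$, $3$, and $4$.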

#### Theorem (Sylvester's Criterion): If $A\in M_{d, d}$ is symmetric, then $A$ is positive definite if and only if the leading principal minors of $A$ are strictly positive. Additionally, $A$ is positive semidefinite if and only if all of its principal minors are non-negative.
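
The two halves of the criterion use different sets of minors, and the difference matters. The sketch below (assuming NumPy; the function names, the tolerance `tol`, and the matrices are hypothetical choices for illustration) implements both tests and includes a matrix `B` whose leading principal minors are all zero, hence non-negative, even though `B` is not positive semidefinite.

```python
from itertools import combinations

import numpy as np

def sylvester_positive_definite(A, tol=1e-12):
    """Positive definiteness test: every leading principal minor is strictly positive."""
    return all(np.linalg.det(A[:k, :k]) > tol for k in range(1, A.shape[0] + 1))

def sylvester_positive_semidefinite(A, tol=1e-12):
    """Positive semidefiniteness test: every principal minor is non-negative."""
    d = A.shape[0]
    return all(
        np.linalg.det(A[np.ix_(idx, idx)]) >= -tol
        for k in range(1, d + 1)
        for idx in combinations(range(d), k)
    )

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

# B has leading principal minors 0 and 0, but its principal minor from index {2} is -1.
B = np.array([[0.0, 0.0],
              [0.0, -1.0]])
leading_minors_B = [np.linalg.det(B[:k, :k]) for k in (1, 2)]

print(sylvester_positive_definite(A))
print(leading_minors_B, sylvester_positive_semidefinite(B))
```

Indeed, ${\bf z}^TB{\bf z}=-z_2^2<0$ for ${\bf z}={\bf e}^{(2)}$, so checking only the leading principal minors is not enough for semidefiniteness.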

Thus, to ensure convexity via the second order conditions, we compute the Hessian of $f$ at each ${\bf x}\in\mathbb{R}^d$ and then check Sylvester's criterion. This is still quite difficult because we have to check a matrix for positive definiteness for all ${\bf x}\in\mathbb{R}^d$. Thus, we often rely upon other methods for determining convexity.


## Operations that preserve convexity

#### Theorem (Positive Weighted Sum of Convex is Convex): If $X\subset\mathbb{R}^d$ is convex, $f,g:X\rightarrow\mathbb{R}$ are convex on $X$, and $a, b\geq 0$, then $h:X\rightarrow\mathbb{R}$ defined by $h({\bf x}) = af({\bf x}) + bg({\bf x})$ for all ${\bf x}\in X$ is convex on $X$.

#### Theorem (Pointwise Maximum of Convex is Convex): If $X\subset\mathbb{R}^d$ is convex and $f, g:X\rightarrow\mathbb{R}$ are convex on $X$, then $h:X\rightarrow\mathbb{R}$ defined by $h({\bf x}) = \max(f({\bf x}), g({\bf x}))$ for all ${\bf x}\in X$ is also convex on $X$.

Recall that any function of the form $\phi({\bf x}) = A{\bf x} + {\bf b}$ for all ${\bf x}\in\mathbb{R}^d$ where $A\in M_{k, d}$ and ${\bf b}\in\mathbb{R}^k$ is called **affine**.

#### Theorem (Convexity Preservation under Affine Precomposition): Suppose $X\subset\mathbb{R}^k$ is convex, $f:X\rightarrow\mathbb{R}$ is convex on $X$, $A\in M_{k, d}$, ${\bf b}\in\mathbb{R}^k$, and set $Y = \{{\bf y}\in\mathbb{R}^d: A{\bf y}+{\bf b}\in X\}$. Then $Y$ is convex and $g:Y\rightarrow\mathbb{R}$ defined by $g({\bf y}) = f(A{\bf y}+{\bf b})$ for all ${\bf y}\in Y$ is convex on $Y$.

#### Theorem (Convexity Preservation under Convex Monotone Transformation): Suppose $X\subset\mathbb{R}^d$ is convex, $f:X\rightarrow\mathbb{R}$ is convex on $X$, and $g:f(X)\rightarrow \mathbb{R}$ is convex and non-decreasing. Then $g\circ f: X\rightarrow\mathbb{R}$ is convex on $X$.
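
As an illustration of how these rules combine (the functions below are chosen here only as an example), take $A\in M_{k, d}$ and ${\bf b}\in\mathbb{R}^k$. The function $f({\bf u})=\Vert {\bf u}\Vert^2$ is convex on $\mathbb{R}^k$, since its Hessian is $2I$ at every point, so affine precomposition shows that

$$
g({\bf y})=\Vert A{\bf y}+{\bf b}\Vert^2
$$

is convex on $\mathbb{R}^d$. Since $t\mapsto e^t$ is convex and non-decreasing on $\mathbb{R}$, the monotone transformation theorem then shows that ${\bf y}\mapsto e^{g({\bf y})}$ is convex, and the weighted sum theorem shows that ${\bf y}\mapsto g({\bf y})+3e^{g({\bf y})}$ is convex as well. No Hessian of the composite function needs to be computed.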

## Group Questions

1. Explain why ${\bf x}^\ast$ is a solution to $\min_{{\bf x}\in\mathbb{R}^d} f({\bf x})$ implies $0$ is a solution to $\min_{t\in\mathbb{R}} f({\bf x}^\ast + t {\bf e}^{(i)})$ for all $i=1,2,\dots, d$.
2. If $f, h_j:\mathbb{R}^d\rightarrow\mathbb{R}$ are all convex functions for $j=1,\ldots, n$, explain why $\phi({\bf x})=f({\bf x})-\sum_{j=1}^n\log(-h_j({\bf x}))$ is convex over the set $\{{\bf x}\in\mathbb{R}^d: h_j({\bf x})<0\text{ for all } j\}$.
3. If $C\in M_{d, d}$ is symmetric, prove the generalized difference-between-two-squares formula: ${\bf x}^TC{\bf x}-{\bf y}^TC{\bf y}=({\bf x}+{\bf y})^T C({\bf x}-{\bf y})$ for all ${\bf x},{\bf y}\in\mathbb{R}^d$. 
4. The **Legendre-Fenchel transform** of a function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is the function $f^\ast({\bf y}) = \max_{{\bf x}\in\mathbb{R}^d} {\bf y}^T{\bf x} - f({\bf x})$. Explain why $f^\ast$ is always a convex function. **Interesting fact:** it can be shown that $f^{\ast\ast}$ is the **convex envelope** of $f$ (that is, it is the largest convex function bounded above by $f$). 
5. Show that $\displaystyle\begin{pmatrix} 2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2\end{pmatrix}$ is positive semidefinite but not positive definite.
6. Show that $\displaystyle\begin{pmatrix} 3 & -1 & -1\\ -1 & 3 & -1\\ -1 & -1 & 3\end{pmatrix}$ is positive definite.
7. Find a diagonalization of $\displaystyle\begin{pmatrix} 2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2\end{pmatrix}$.
8. For ${\bf x}\not={\bf 0}$, show that $\max_{(a,b)\in\mathbb{R}^2} {\bf x}^T(a{\bf x} + b{\bf y})$ subject to $\Vert a{\bf x}+b{\bf y}\Vert^2=\Vert {\bf y}\Vert^2$ has the solution $a=\Vert {\bf y}\Vert/\Vert {\bf x}\Vert$ and $b=0$. Conclude that ${\bf x}^T{\bf y}\leq \Vert {\bf x}\Vert \Vert {\bf y}\Vert$, and explain why the **Cauchy-Schwarz Inequality**, $\vert {\bf x}^T{\bf y}\vert\leq \Vert {\bf x}\Vert \Vert {\bf y}\Vert$ for all ${\bf x}, {\bf y}\in\mathbb{R}^d$ follows.