# <center>Crash course 2: Convexity</center>
### <center>Alfred Galichon (NYU & Sciences Po) and Giovanni Montanari (NYU)</center>
## <center>'math+econ+code' masterclass on optimal transport and economic applications</center>
<center>© 2018-2022 by Alfred Galichon. Past and present support from NSF grant DMS-1716489, ERC grant CoG-866274, and contributions by Jules Baudet, Pauline Corblet, Gregory Dannay, and James Nesbit are acknowledged.</center>

#### <center>With python code examples</center>

# Convex analysis

## References
* [OTME], Ch. 6
* Rockafellar (1970). Convex analysis. Princeton.

    

A fundamental tool in convex analysis is called the Legendre-Fenchel transform, which is defined in general as follows.

**Definition**. The Legendre-Fenchel transform of $u$ is defined by

$$ u^{\ast}\left(  y\right)  =\sup_{x\in\mathbb{R}^{d}}\left\{  x^{\intercal}y-u\left(  x\right)  \right\}  . $$



**Proposition**. The following hold:<br>
(i) $u^{\ast}$ is convex.<br>
(ii) $u_1\leq u_2$ implies $u^{\ast}_1\geq u^{\ast}_2$.<br>
(iii) (Fenchel's inequality): $u(x)+u^{\ast}(y) \geq x^{\intercal}y$.<br>
(iv) $u^{\ast\ast} \leq u$, with equality iff $u$ is convex and lower semicontinuous.

    
As an immediate corollary of (iv), we get the fundamental result:

**Proposition**. If $u$ is convex and lower semicontinuous, then $u=(u^{\ast})^{\ast}$. The converse holds true.
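The biconjugate property can be illustrated numerically. The sketch below is only an approximation on a grid: the helper `legendre_grid`, the grids, and the two test functions are illustrative choices, not part of the original notes. It computes a discrete Legendre-Fenchel transform twice, once for a non-convex double-well function and once for a convex quadratic.

```python
import numpy as np

def legendre_grid(values, grid_in, grid_out):
    """Discrete Legendre-Fenchel transform.

    For each point y of grid_out, returns max over x in grid_in of x*y - values(x).
    """
    # rows index the output variable, columns index the input variable
    return np.max(np.outer(grid_out, grid_in) - values[None, :], axis=1)

x = np.linspace(-2.0, 2.0, 401)     # grid for x
y = np.linspace(-30.0, 30.0, 1201)  # grid for y, wide enough to cover the slopes of u on [-2, 2]

# A non-convex (double-well) function: its biconjugate is its convex envelope
u = (x**2 - 1.0) ** 2
u_star = legendre_grid(u, x, y)
u_bistar = legendre_grid(u_star, y, x)

# A convex function: its biconjugate coincides with it, up to grid error
v = x**2 / 2.0
v_bistar = legendre_grid(legendre_grid(v, x, y), y, x)

print("max of u** - u (should be <= 0):", np.max(u_bistar - u))
print("u(0) and u**(0):", u[np.argmin(np.abs(x))], u_bistar[np.argmin(np.abs(x))])
print("max |v** - v|:", np.max(np.abs(v_bistar - v)))
```

On the double well, $u^{\ast\ast}$ stays below $u$ and flattens the bump between the two wells, whereas on the quadratic the two curves agree up to discretization error.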

    

    

Here are some examples of Legendre-Fenchel transforms:

(i) For $u\left(  x\right)  =\left\vert x\right\vert ^{2}/2$, one gets
$u^{\ast}\left(  y\right)  =\left\vert y\right\vert ^{2}/2$.

(ii) For $u\left(  x\right)  =\sum_{i}\lambda_{i}x_{i}^{2}/2$, $\lambda_{i}>0$, one gets $u^{\ast}\left(  y\right)  =\sum_{i}\lambda_{i}^{-1}y_{i}^{2}/2$.

(iii) The entropy function

$$u\left(  x\right)  =
\begin{cases}
\sum_{i=1}^{d}x_{i}\ln x_{i}\text{ for }x\geq0\text{, }\sum_{i=1}^{d}x_{i}=1 \\
+\infty\text{ otherwise}
\end{cases}$$

has a Legendre transform which is the log-partition function, a.k.a. logit
function

$$u^{\ast}\left(  y\right)  =\ln\left(  \sum_{i=1}^{d}e^{y_{i}}\right).$$
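Example (iii) can be checked numerically. The sketch below uses illustrative function names and a random vector $y$. It relies on the standard fact that the supremum defining $u^{\ast}(y)$ is attained at the softmax of $y$, and it also checks Fenchel's inequality on random points of the simplex.

```python
import numpy as np

def neg_entropy(x):
    """u(x) = sum_i x_i ln x_i on the simplex, with the convention 0 ln 0 = 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x > 0, x, 1.0)  # ln(1) = 0 takes care of the 0 ln 0 terms
    return float(np.sum(x * np.log(safe)))

def log_sum_exp(y):
    """Numerically stable ln(sum_i exp(y_i))."""
    m = np.max(y)
    return float(m + np.log(np.sum(np.exp(y - m))))

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

rng = np.random.default_rng(0)
y = rng.normal(size=5)

# At x = softmax(y), x'y - u(x) should equal the log-partition function
x_opt = softmax(y)
gap_at_opt = log_sum_exp(y) - (x_opt @ y - neg_entropy(x_opt))

# Fenchel's inequality u(x) + u*(y) >= x'y on random points of the simplex
draws = rng.dirichlet(np.ones(5), size=1000)
gaps = np.array([neg_entropy(x) + log_sum_exp(y) - x @ y for x in draws])

print("gap at the softmax point:", gap_at_opt)
print("smallest Fenchel gap over random draws:", gaps.min())
```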

## Subdifferentials


We now restate the demand sets of workers and firms in terms of subdifferentials of convex functions. For this, let us recall the basic economic interpretation of duality: $u^*(y)$ captures the problem of a firm of type $y$, which hires a worker $x$ who offers the best trade-off between production if hired by $y$ (that is, $\Phi\left(  x,y\right) =x^{\intercal}y$) and wage $u\left(  x\right)  $. Thus, firm $y$ will be willing to match with any worker $x$ within the set of maximizers of $x^\top y -u(x)$, while worker $x$ will be willing to match with any firm $y$ within the set of maximizers of $x^\top y -u^*(y)$. 

These sets of maximizers are called the *subdifferentials* of $u^*$ (at $y$) and of $u$ (at $x$), respectively.



**Definition.** Let $u:\mathbb{R}^{d}\rightarrow \mathbb{R}$. The subdifferential of $u$ at $x$, denoted $\partial u(x)$, is the set of $y \in \mathbb{R}^{d}$ such that $\forall \tilde{x} \in \mathbb{R}^{d}, u(\tilde{x})\geq u(x)+y^{\top}(\tilde{x}-x)$.  

The definition does _not_ require $u$ to be convex; however, if $u$ is convex, the previous Definition immediately implies that

$$\partial u(x) = \arg\max_{y} \{ x^{\intercal}y-u^{\ast}(y) \}$$

hence the subdifferential of a convex function is always nonempty (while the subdifferential of a non-convex function can be empty in general).  

When $u$ is differentiable and convex, then

$$\partial u(x) = \{\nabla u(x)\}.$$ 

As an example, when $u\left(  x\right)  =\left\vert x\right\vert $, one has  $\partial u\left(  x\right)  =\left\{  -1\right\}  $ if $x<0$, $\left\{  +1\right\} $ if $x>0$, and $\left[  -1,+1\right]  $ if $x=0$.
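The absolute-value example can be recovered by brute force from the definition. The sketch below is an approximation: it tests the subgradient inequality only on a finite grid of points $\tilde{x}$ and a finite grid of candidate slopes $y$, and the helper name `subgradients_on_grid` is hypothetical.

```python
import numpy as np

def subgradients_on_grid(u, x, slopes, test_points, tol=1e-12):
    """Candidate slopes y with u(x~) >= u(x) + y * (x~ - x) at every test point x~."""
    lhs = u(test_points)[None, :]
    rhs = u(x) + slopes[:, None] * (test_points[None, :] - x)
    ok = np.all(lhs >= rhs - tol, axis=1)
    return slopes[ok]

slopes = np.linspace(-2.0, 2.0, 81)         # candidate slopes, step 0.05
test_points = np.linspace(-5.0, 5.0, 1001)  # points x~ where the inequality is tested

at_zero = subgradients_on_grid(np.abs, 0.0, slopes, test_points)
at_pos = subgradients_on_grid(np.abs, 1.5, slopes, test_points)
at_neg = subgradients_on_grid(np.abs, -1.5, slopes, test_points)

print("x = 0   :", at_zero.min(), "to", at_zero.max())
print("x = 1.5 :", at_pos)
print("x = -1.5:", at_neg)
```

At the kink the surviving slopes fill the interval $[-1,+1]$, while away from it a single slope remains.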

        

### Subdifferentials: first properties

It also follows that if $u$ is a convex function, the following statements are
equivalent:

$$\text{(i)}   \text{ }u\left(  x\right)  +u^{\ast}\left(  y\right)
=x^{\intercal}y$$
$$\text{(ii)}   \quad y\in\partial u\left(  x\right) $$
$$\text{(iii)}  \quad x\in\partial u^{\ast}\left(  y\right).$$


Going back to our worker-firm example, this has a straightforward economic
interpretation. If worker $x$ chooses firm $y$, then $y$ maximizes
$x^{\intercal}\tilde{y}-u^{\ast}\left(  \tilde{y}\right)  $ over $\tilde{y}$,
thus $y\in\partial u\left(  x\right)  $. This means that while worker $x$'s
equilibrium wage $u\left(  x\right)  $ is in general greater than or equal to the
value $x^{\intercal}y-u^{\ast}\left(  y\right)  $ she can extract from firm
$y$, those two values necessarily coincide if $x$ and $y$ are willing to
match, in which case $u\left(  x\right)  +u^{\ast}\left(  y\right)
=x^{\intercal}y$.


### Subdifferentials and complementary slackness

These considerations allow us to relate the solutions to the primal and dual
problems. Recall that in the finite-dimensional case, the primal and the
dual problems are related by a complementary slackness condition. In the
present case, let $\left(  X,Y\right)  \sim\pi$ be a solution to the primal
problem, and $\left(  u,u^{\ast}\right)  $ be a solution to the dual problem.
Then almost surely $X$ and $Y$ are willing to match, which, by the previous
discussion, implies that

$$
u\left(  X\right)  +u^{\ast}\left(  Y\right)  =X^{\intercal}Y,
$$

or equivalently $Y\in\partial u\left(  X\right)  $ or in turn $X\in\partial
u^{\ast}\left(  Y\right)  $. In other words, the support of $\pi$ is included
in the set $\left\{  \left(  x,y\right)  :u\left(  x\right)  +u^{\ast}\left(
y\right)  =x^{\intercal}y\right\}  $. This condition appears as the correct
generalization of the complementary slackness condition in the
finite-dimensional case. Unsurprisingly, taking the expectation with respect
to $\pi$ of the equality right above yields the equality between the value of
the dual problem on the left-hand side, and the value of the primal problem on
the right-hand side.

### Gradient of convex functions

More can be said when $u$ is differentiable at $x$. In that case, it is not
hard to show that $\partial u(x)=\left\{ \nabla u\left(x\right)\right\}$, i.e. the subdifferential contains only one point, which is $\nabla u\left(x\right)  =\left(  \partial u\left(  x\right)  /\partial x_{i}\right)  _{i}$, the vector of partial derivatives of $u$, or gradient of $u$. Similarly, if $u^{\ast}$ is differentiable at $y$, then $\partial u^{\ast}\left(  y\right)=\left\{  \nabla u^{\ast}\left(  y\right)  \right\}  $. Hence, if $u$ and $u^{\ast}$ are differentiable, then the equivalence stated in the first properties implies that $y=\nabla u\left(  x\right)  $ if and only if $x=\nabla u^{\ast}\left(  y\right)  $, that is

$$
\left(  \nabla u\right)  ^{-1}=\nabla u^{\ast}.
$$


Alternatively, this can be seen as a duality between first-order conditions and the envelope theorem. First-order conditions in the firm's problem imply that if worker $x$ is chosen by firm $y$, then $\nabla u\left(  x\right)  =y$, while the envelope theorem implies that the gradient in $y$ of the firm's indirect profit $u^{\ast}\left( y\right)  $ is given by $\nabla u^{\ast}\left(  y\right)  =x$, where $x$ is the worker chosen by $y$. Thus the first-order conditions and the envelope theorem are conjugate in the sense of convex analysis.

As an example, when $u\left(  x\right)  =\sum_{i}\lambda_{i}x_{i}^{2}/2$, $\lambda_{i}>0$, recall that $u^{\ast}\left(  y\right)  =\sum_{i}\lambda_{i}^{-1}y_{i}^{2}/2$. Define $\Lambda=\mathrm{diag}\left(  \lambda\right)  $. One has $\nabla u\left(x\right)  =\Lambda x$ and $\nabla u^{\ast}\left(  y\right)  =\Lambda^{-1}y$.
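The quadratic example lends itself to a quick numerical check. The weights and random points below are arbitrary illustrative values. The sketch checks that $\nabla u^{\ast}$ undoes $\nabla u$, and that Fenchel's inequality holds with equality exactly when $y=\nabla u(x)$.

```python
import numpy as np

lam = np.array([0.5, 2.0, 3.0])  # illustrative positive weights lambda_i

def u(x):
    return 0.5 * np.sum(lam * x**2)

def u_star(y):
    return 0.5 * np.sum(y**2 / lam)

def grad_u(x):
    return lam * x   # Lambda x

def grad_u_star(y):
    return y / lam   # Lambda^{-1} y

rng = np.random.default_rng(1)
x = rng.normal(size=3)

y = grad_u(x)            # the firm matched with worker x
x_back = grad_u_star(y)  # should recover x

fenchel_gap = u(x) + u_star(y) - x @ y            # zero when y = grad u(x)
y_other = rng.normal(size=3)
gap_other = u(x) + u_star(y_other) - x @ y_other  # strictly positive otherwise

print("x recovered:", np.allclose(x_back, x))
print("gap at y = grad u(x):", fenchel_gap)
print("gap at another y:", gap_other)
```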

### Hessians of convex functions

Assume both $u$ and $u^{\ast}$ are strictly convex and twice differentiable. Then it
can be shown that their Hessians are invertible at all points, and that if
$y=\nabla u\left( x\right) $, then 
$$D^{2}u^{\ast}\left(  y\right)  =\left(  D^{2}u\left(  x\right)  \right)
^{-1}.$$

This can be obtained by differentiating the relationship $\nabla u^{\ast}\left(  y\right)  =\left(  \nabla u\right)  ^{-1}\left(  y\right)  $.
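As a numerical illustration, consider the pair $u(x)=\sum_i e^{x_i}$ and $u^{\ast}(y)=\sum_i (y_i\ln y_i-y_i)$ for $y>0$. This example is not taken from the notes above and is used here only as an assumption-laden sketch. The Hessians are approximated by central finite differences with a hypothetical helper `numerical_hessian`, so the identity is verified only up to discretization error.

```python
import numpy as np

def u(x):
    return np.sum(np.exp(x))

def u_star(y):
    # Legendre-Fenchel transform of u, valid for y > 0
    return np.sum(y * np.log(y) - y)

def numerical_hessian(f, z, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at z."""
    d = z.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = h
            ej[j] = h
            H[i, j] = (f(z + ei + ej) - f(z + ei - ej)
                       - f(z - ei + ej) + f(z - ei - ej)) / (4.0 * h**2)
    return H

x = np.array([0.3, -0.5, 1.0])
y = np.exp(x)  # y = grad u(x)

H_u = numerical_hessian(u, x)
H_u_star = numerical_hessian(u_star, y)
product = H_u_star @ H_u  # should be close to the identity matrix

print(np.round(product, 4))
```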

## Exercises

**Exercise 1**  

Compute the Legendre-Fenchel transforms of the following functions:

(i) $u\left(  x\right)  =x^{\intercal}\Sigma x/2$, where $\Sigma$ is a positive definite matrix. Show that $u^{\ast}\left(  y\right)  =y^{\intercal}\Sigma^{-1}y/2$.  

(ii) Let $p>1$ and $u\left(x\right)=\frac{1}{p}\left\Vert x\right\Vert^{p}$, where $\left\Vert .\right\Vert $ is the Euclidean norm. Show that $u^{\ast}\left(  y\right)  =\frac{1}{q}\left\Vert y\right\Vert ^{q}$, where $q>1$ is such that $1/p+1/q=1$.  

(iii) $u\left(x\right)  =1\left\{  x\in\left[  0,1\right]  \right\}$.

------

**Exercise 2**  

Give the subdifferentials of the following functions from $\mathbb{R}$ to
$\mathbb{R}$:

(a) $u\left(  x\right)  =\max\left(  x,0\right)$.  

(b) $u\left(  x\right)  =\max\left(  f\left(  x\right)  ,g\left(  x\right)\right)  $, where both $f$ and $g$ are convex and differentiable.  

(c) $u\left(  x\right)  =\max_{1\leq i\leq n}\left\{  a_{i}x+b_{i}\right\}  $ where $a_{1}<a_{2}<...<a_{n}$.  

(d) $u\left(  x\right)  =-x^{2}$.

---

#### More on the entropy function

Consider the entropy function
$$
u\left(  x\right) =\begin{cases}
\sum_{i=1}^{d}x_{i}\ln x_{i}\text{ for }x\geq0\text{, }\sum_{i=1}^{d}x_{i}=1\\
+\infty\text{ otherwise}
\end{cases}
$$

As it is defined on the simplex, it is not a differentiable function from
$\mathbb{R}^{d}$ to $\mathbb{R}$. Instead, let us take $x_{d}=1-\sum_{i=1}^{d-1}x_{i}$, and let us view $u$ as a function $\tilde{u}$ from $\mathbb{R}^{d-1}$ to $\mathbb{R}$. We define

$$
\tilde{u}\left(  x\right)  =\sum_{i=1}^{d-1}x_{i}\ln x_{i}+\left(1-\sum_{i=1}^{d-1}x_{i}\right)  \ln\left(  1-\sum_{i=1}^{d-1}x_{i}\right)
$$

if $x\geq0$, $\sum_{i=1}^{d-1}x_{i}\leq1$, $\tilde{u}\left(  x\right)=+\infty$ otherwise.

**Exercise 3**

Show that:

(a) The Legendre transform of $\tilde{u}$ is a function from $\mathbb{R}^{d-1}$
to $\mathbb{R}$ given by
$$
\tilde{u}^{\ast}\left(  y\right)  =\ln\left(  \sum_{i=1}^{d-1}e^{y_{i}}+1\right)  .
$$

(b) The gradient of $\tilde{u}$ is a vector in $\mathbb{R}^{d-1}$ given by
$$
\nabla\tilde{u}\left(  x\right)  =\left(  \ln\left(  \frac{x_{i}}{1-\sum_{j=1}^{d-1}x_{j}}\right)  \right)  _{1\leq i\leq d-1}
$$

(c) The gradient of $\tilde{u}^{\ast}$ is a vector in $\mathbb{R}^{d-1}$ given
by
$$
\nabla\tilde{u}^{\ast}\left(  y\right)  =\left(  \frac{e^{y_{i}}}{\sum_{j=1}^{d-1}e^{y_{j}}+1}\right)  _{1\leq i\leq d-1}
$$

(d) Compute $D^{2}\tilde{u}$ and $D^{2}\tilde{u}^{\ast}$.


## Elements of constrained optimization

### The Karush-Kuhn-Tucker theorem

Source: Gordon and Tibshirani's lectures: https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf


- Consider the "primal" problem

$$\begin{align*}
\min_{x\in\mathbb{R}^{n}}  &  f\left(  x\right) \\
s.t.~  &  h_{i}\left(  x\right)  \leq0,i=1,...,m\\
&  l_{j}\left(  x\right)  =0,j=1,...,r
\end{align*}$$ 
where the functions $f$ and $h_{i}$ ($1\leq i\leq m$) are convex, and $l_{j}$ ($1\leq j\leq r$) are affine.

- This problem can be written as
$$\min_{x\in\mathbb{R}^{n}}\max_{u_{i}\geq0,v_{j}}\underset{L\left( x,u,v\right)  }{\underbrace{f\left(  x\right)  +\sum_{i=1}^{m}u_{i} h_{i}\left(  x\right)  +\sum_{j=1}^{r}v_{j}l_{j}\left(  x\right)  }}
$$ 
where $L\left(  x,u,v\right)  $ is the Lagrangian.


### Duality

- In general the weak duality inequality
$$
\min_{x}\max_{w}L\left(  x,w\right)  \geq\max_{w}\min_{x}L\left(  x,w\right)
$$
holds, where $w=\left(  u,v\right)  $ with $u\geq0$. Equality (strong duality) holds under Slater's condition: there is $x_{0}$
with $h_{i}\left(  x_{0}\right)  <0$ for all $i$ and $l_{j}\left(x_{0}\right)  =0$ for all $j$.

- The "dual" problem is thus
$$
\max_{u_{i}\geq0,v_{j}}g\left(  u,v\right)
$$
where $g\left(  u,v\right)  :=\min_{x}L\left(  x,u,v\right)$.
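To make the dual function concrete, here is a minimal sketch on a toy problem that is not part of the notes: minimize $f(x)=x_{1}^{2}+x_{2}^{2}$ subject to $h(x)=1-x_{1}-x_{2}\leq0$, with no equality constraint. The inner minimization of the Lagrangian is solved in closed form, and the primal problem is solved by brute force on a grid, so all values are approximate.

```python
import numpy as np

def f(x1, x2):
    return x1**2 + x2**2

def h(x1, x2):
    return 1.0 - x1 - x2

def dual_function(u):
    """g(u) = min_x L(x, u) for the toy problem.

    L is a strictly convex quadratic in x, and the first-order condition
    2 x_i - u = 0 gives the minimizer x_1 = x_2 = u / 2.
    """
    x1 = x2 = u / 2.0
    return f(x1, x2) + u * h(x1, x2)

# Dual problem: maximize g over a grid of multipliers u >= 0
u_grid = np.linspace(0.0, 3.0, 301)
g_vals = dual_function(u_grid)
dual_value = g_vals.max()
u_best = u_grid[g_vals.argmax()]

# Primal problem: brute force over a grid of feasible points
grid = np.linspace(-1.0, 2.0, 301)
X1, X2 = np.meshgrid(grid, grid)
feasible = h(X1, X2) <= 1e-12
primal_value = f(X1, X2)[feasible].min()

print("dual value  :", dual_value, "attained at u =", u_best)
print("primal value:", primal_value)
```

Every value of $g$ lies below the primal value (weak duality). The two optimal values coincide here because Slater's condition holds, for instance at $x_{0}=(1,1)$.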

### The Karush-Kuhn-Tucker (KKT) conditions

- A triple $\left(  x,u,v\right)  $ satisfies the KKT conditions if and
only if

$$
\begin{align*}
&  0\in\partial f\left(  x\right)  +\sum_{i=1}^{m}u_{i}\partial h_{i}\left(
x\right)  +\sum_{j=1}^{r}v_{j}\partial l_{j}\left(  x\right) \\
&  u_{i}h_{i}\left(  x\right)  =0~\forall i\\
&  h_{i}\left(  x\right)  \leq0~\forall i,~l_{j}\left(  x\right)  =0~\forall
j\\
&  u_{i}\geq0~\forall i
\end{align*}
$$

- In practice, we shall use these conditions when $f$ and all the $h_{i}$ are smooth, in which case the first condition (stationarity) can be rewritten as 

$$
\nabla f\left(  x\right)  +\sum_{i=1}^{m}u_{i}\nabla h_{i}\left(  x\right)
+\sum_{j=1}^{r}v_{j}\nabla l_{j}\left(  x\right)  =0.
$$

This leads to the **KKT Theorem**.

In the setting above, and provided strong duality holds (for instance under Slater's condition), the following two statements are equivalent:


- $x^{\ast}$ is optimal for the primal problem, and $\left(  u^{\ast},v^{\ast}\right)  $ is optimal for the dual problem,
- $\left(  x^{\ast},u^{\ast},v^{\ast}\right)  $ satisfies the KKT conditions.


Note that the KKT theorem can be stated under more general assumptions (relaxing convexity assumptions on $f$ and $h$).
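The KKT conditions can be checked mechanically on a small smooth problem. The sketch below uses a toy problem chosen for illustration, not taken from the notes: minimize $x_{1}^{2}+x_{2}^{2}$ subject to $1-x_{1}-x_{2}\leq0$. The helper `kkt_residuals` is hypothetical; it returns one residual per condition, each of which vanishes when the condition holds.

```python
import numpy as np

def grad_f(x):
    return 2.0 * x                # gradient of f(x) = x_1^2 + x_2^2

def h(x):
    return 1.0 - x.sum()          # inequality constraint h(x) <= 0

grad_h = np.array([-1.0, -1.0])   # gradient of h (constant, since h is affine)

def kkt_residuals(x, u):
    """One residual per KKT condition; all are zero iff (x, u) satisfies KKT."""
    return {
        "stationarity": float(np.linalg.norm(grad_f(x) + u * grad_h)),
        "complementary_slackness": abs(u * h(x)),
        "primal_feasibility": max(h(x), 0.0),
        "dual_feasibility": max(-u, 0.0),
    }

# Candidate solution: x* = (1/2, 1/2) with multiplier u* = 1
candidate = kkt_residuals(np.array([0.5, 0.5]), 1.0)

# A feasible but non-optimal point with u = 0 fails stationarity
non_optimal = kkt_residuals(np.array([1.0, 1.0]), 0.0)

print("candidate  :", candidate)
print("non-optimal:", non_optimal)
```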



# Convex optimization

### Gradient descent

### Newton descent

### Coordinate update