# The Alternating Direction Method of Multipliers

From the past, with vengeance

## Operator Splitting Quadratic Programming

**We will tackle our QP using the [OSQP](https://osqp.org/) solver by Oxford University**

OSQP is a modern solver for Quadratic Programs in the form:
$$
\arg\min_x \left\{\frac{1}{2} x^T P x + q^T x \mid l \leq Ax \leq u \right\}
$$
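
To make the notation concrete, here is a minimal NumPy sketch that evaluates the objective and checks the constraints of a QP in this form. The helper names and the toy data are assumptions made for illustration; they are not part of the OSQP API.

```python
import numpy as np

def qp_objective(P, q, x):
    """Value of 1/2 x^T P x + q^T x."""
    return 0.5 * x @ P @ x + q @ x

def qp_is_feasible(A, l, u, x, tol=1e-9):
    """True when l <= A x <= u holds componentwise (up to tol)."""
    Ax = A @ x
    return bool(np.all(Ax >= l - tol) and np.all(Ax <= u + tol))

# Hypothetical toy data: 2 variables, 2 two-sided constraints
P = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0], [1.0, 0.0]])
l = np.array([0.0, 0.0])
u = np.array([2.0, 1.0])

x = np.array([1.0, 1.0])
value = qp_objective(P, q, x)         # 0.5 * 4 - 7 = -5.0
feasible = qp_is_feasible(A, l, u, x)  # A x = [2, 1], inside the bounds
```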

The solver:

* is [very fast](https://pypi.org/project/qpsolvers/), especially for problems with sparse matrices
* is available under a (very permissive) Apache 2.0 license
* has APIs for many programming languages

**The solver relies on the Alternating Direction Method of Multipliers (ADMM)**

* ...Plus [a bunch of clever "tricks"](https://osqp.org/citing/) to improve speed
* Here we will discuss only the basic ADMM, to provide _an intuition_

## The Alternating Direction Method of Multipliers

**The [ADMM](https://dl.acm.org/doi/abs/10.1561/2200000016) solves numerical constrained optimization problems in the form:**

$$\begin{align}
\text{argmin} \ & f(x) + g(z) \\
\text{subject to: } & Ax + Bz = c
\end{align}$$

* Where $f$ and $g$ are assumed to be convex

**The method relies on a so-called _augmented Lagrangian_**

This is a reformulation where the constraints are turned into _penalty terms_:

$$
\mathcal{L}_{\rho}(x, z, \lambda) = f(x) + g(z) + \lambda^T(Ax+Bz-c) + \frac{1}{2}\rho \|Ax+Bz-c\|_2^2
$$

* The algorithm idea is to _optimize the augmented Lagrangian_
* ...And to encourage constraint satisfaction via the penalty terms
* In practice, this is done by adjusting the _multiplier vector $\lambda$_
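
The formula above translates almost literally into code. This is a minimal sketch, assuming NumPy arrays for all vectors and matrices; the function name and the toy instance are made up for illustration.

```python
import numpy as np

def augmented_lagrangian(f, g, A, B, c, x, z, lam, rho):
    """Evaluate L_rho(x, z, lam) for: min f(x) + g(z) s.t. Ax + Bz = c."""
    r = A @ x + B @ z - c                       # constraint violation
    return f(x) + g(z) + lam @ r + 0.5 * rho * (r @ r)

# Hypothetical toy instance: squared norms, with the constraint x - z = 0
def f(x):
    return float(x @ x)

def g(z):
    return float(z @ z)

A, B, c = np.eye(2), -np.eye(2), np.zeros(2)
x = np.array([1.0, 2.0])
z = np.array([1.0, 1.0])
lam = np.array([0.5, -1.0])

value = augmented_lagrangian(f, g, A, B, c, x, z, lam, rho=2.0)
# violation r = [0, 1], so value = 5 + 2 + (-1) + 0.5 * 2 * 1 = 7.0
```

When the constraint is satisfied, both penalty terms vanish and the value reduces to $f(x) + g(z)$.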

## The Alternating Direction Method of Multipliers

**The ADMM operates as follows**

We start from an initial assignment $x^{0}, z^{0}, \lambda^{0}$, then we iterate:

$$\begin{align}
x^{k+1} & = \text{argmin}_x \mathcal{L}_{\rho}(x, z^k, \lambda^k) \\
z^{k+1} & = \text{argmin}_z \mathcal{L}_{\rho}(x^{k+1}, z, \lambda^k) \\
\lambda^{k+1} & = \lambda^k + \rho (Ax^{k+1}+Bz^{k+1}-c)
\end{align}$$

In other words:

* We keep $z^k$ and $\lambda^k$ fixed and optimize over $x$ to obtain $x^{k+1}$
* We replace $x^{k}$ with $x^{k+1}$, keep $\lambda^k$ fixed, and optimize over $z$ to obtain $z^{k+1}$
* Finally, we update the multiplier vector

**The switch between $x$ and $z$ optimization is the "alternating" part**

...While the use of the multipliers $\lambda$ explains the rest of the name
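
The three updates can be sketched on a tiny problem where both minimizations have a closed form. The toy problem is an assumption chosen for illustration: $f(x) = (x-a)^2$, $g(z) = (z-b)^2$, with the constraint $x - z = 0$ (i.e. $A = 1$, $B = -1$, $c = 0$).

```python
def admm_scalar(a, b, rho=1.0, iters=200):
    """ADMM for: min (x - a)^2 + (z - b)^2  s.t.  x - z = 0."""
    x, z, lam = 0.0, 0.0, 0.0
    for _ in range(iters):
        # x step: 2(x - a) + lam + rho (x - z) = 0, with z and lam fixed
        x = (2 * a - lam + rho * z) / (2 + rho)
        # z step: 2(z - b) - lam - rho (x - z) = 0, using the new x
        z = (2 * b + lam + rho * x) / (2 + rho)
        # multiplier update with the current constraint violation
        lam = lam + rho * (x - z)
    return x, z, lam

x, z, lam = admm_scalar(a=1.0, b=3.0)
# x and z both approach (a + b) / 2 = 2, and lam approaches -2
```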

## Multiplier Update

**Let's try to understand better the multiplier update**

...Which consists in the rule:
$$
\lambda^{k+1} = \lambda^k + \rho (Ax^{k+1}+Bz^{k+1}-c)
$$

* The term $Ax^{k+1}+Bz^{k+1}-c$ is just the current constraint violation
* ...In particular both its _amount_ and _direction_

**If $(Ax^{k+1}+Bz^{k+1})_i > c_i$ for some constraint $i$:**

* Then we _increase_ the corresponding multiplier $\lambda_i$
* So that the penalty term $\lambda_i (Ax^{k+1}+Bz^{k+1}-c)_i$ grows
* This will push the next iteration to reduce the degree of violation

## Multiplier Update

**Let's try to understand better the multiplier update**

...Which consists in the rule:
$$
\lambda^{k+1} = \lambda^k + \rho (Ax^{k+1}+Bz^{k+1}-c)
$$

* The term $Ax^{k+1}+Bz^{k+1}-c$ is just the current constraint violation
* ...In particular both its _amount_ and _direction_

**If $(Ax^{k+1}+Bz^{k+1})_i < c_i$ for some constraint $i$:**

* Then we _decrease_ the corresponding multiplier $\lambda_i$
* So that the penalty term $\lambda_i (Ax^{k+1}+Bz^{k+1}-c)_i$ grows (again)
* This will push the next iteration to reduce the degree of violation (again)

## Multiplier Update

**Let's try to understand better the multiplier update**

...Which consists in the rule:
$$
\lambda^{k+1} = \lambda^k + \rho (Ax^{k+1}+Bz^{k+1}-c)
$$

* The term $Ax^{k+1}+Bz^{k+1}-c$ is just the current constraint violation
* ...In particular both its _amount_ and _direction_

**If $(Ax^{k+1}+Bz^{k+1})_i = c_i$ for some constraint $i$:**

* Then we _keep_ the corresponding multiplier $\lambda_i$ _as it is_
* The constraint is not violated, so there is nothing to do
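
The three cases can be checked on a small numeric example. This is only an illustrative sketch: the function name and the numbers are made up.

```python
def update_multipliers(lam, violation, rho):
    """Apply lam_i <- lam_i + rho * violation_i to every constraint."""
    return [lam_i + rho * v_i for lam_i, v_i in zip(lam, violation)]

lam = [1.0, 1.0, 1.0]
# (Ax + Bz - c)_i for three constraints: above c_i, below c_i, exactly c_i
violation = [0.5, -0.5, 0.0]

new_lam = update_multipliers(lam, violation, rho=2.0)
# new_lam == [2.0, 0.0, 1.0]: increased, decreased, unchanged
```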

## Main Advantages of the Method

The method has _two major advantages_:

**1) The $x$ and $z$ variables can be handled _in isolation_**

* This results in simpler problems
* ...And in some cases enables massive parallelization

**2) The ADMM converges under relatively _mild conditions_**

* In the classical formulation, $f$ and $g$ need to be _closed_, _proper_, _convex_ functions
  - They do _not_ need to be differentiable
  - They _can_ take the value $+\infty$
  - We will see why that matters in the next slides
* The second condition is that $\mathcal{L}_{0}(x, z, \lambda)$ should have a saddle point
  - This one is way trickier to check...

The full convergence proof can be found e.g. [here](https://dl.acm.org/doi/abs/10.1561/2200000016)

# The ADMM and QP

That was the whole point, right?

## QP Reformulation

**Let's see these advantages at work on Quadratic Programs**

We need to solve:
$$
\text{argmin}_x \left\{\frac{1}{2}x^T P x + q^T x \mid l \leq Ax \leq u \right\}
$$

...Which we reformulate to:
$$\begin{align}
\text{argmin } & \frac{1}{2}x^TPx + q^Tx \\
\text{subject to: } & z = Ax \\
& l \leq z \leq u
\end{align}$$

* We have introduced _a new variable $z$_
* ...And posted the inequality constraints over that

## QP Reformulation

**Then, we turn the inequality constraints into a function**

$$\begin{align}
\text{argmin } & \frac{1}{2}x^TPx + q^Tx + \chi_{l \leq z \leq u}(z) \\
\text{subject to: } & z = Ax
\end{align}$$

Where:

* $\chi_{l \leq z \leq u}$ is the _characteristic function_ of $l \leq z \leq u$
  - Its value is $+\infty$ when the constraint is violated and 0 elsewhere
  - In this case, it is non-differentiable, but closed, proper, and convex!
* $\frac{1}{2}x^TPx + q^Tx$ is our usual cost term
  - It is differentiable
  - ...And closed, proper, and convex if $P$ is positive semi-definite
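
As a quick illustration, the characteristic function of the bounds can be written in a few lines. The function name is an assumption, and vectors are plain Python lists here.

```python
import math

def chi_box(z, l, u):
    """Characteristic function of l <= z <= u: 0 inside, +inf outside."""
    inside = all(l_j <= z_j <= u_j for z_j, l_j, u_j in zip(z, l, u))
    return 0.0 if inside else math.inf

inside_value = chi_box([0.5, 1.0], [0.0, 0.0], [1.0, 1.0])   # all bounds hold
outside_value = chi_box([1.5, 0.0], [0.0, 0.0], [1.0, 1.0])  # first bound violated
```

Adding this term to the cost leaves feasible points untouched and makes every infeasible point infinitely expensive.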

**We can now proceed to apply the ADMM!**
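
For reference, this is how the reformulated QP can be matched against the general ADMM template $f(x) + g(z)$ subject to $Ax + Bz = c$. The cost is written with the factor $\frac{1}{2}$ of the original QP statement, and the constraint is written as $z - Ax = 0$, which is the sign convention used in the next slides. Note that the matrix $A$ of the QP enters with a minus sign:

$$\begin{align}
f(x) & = \frac{1}{2}x^TPx + q^Tx \\
g(z) & = \chi_{l \leq z \leq u}(z) \\
-Ax + z & = 0 \quad \text{(i.e. the template uses } -A \text{, } B = I \text{, and } c = 0 \text{)} \\
\mathcal{L}_{\rho}(x, z, \lambda) & = f(x) + g(z) + \lambda^T(z-Ax) + \frac{1}{2}\rho \|z-Ax\|_2^2
\end{align}$$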

## The ADMM Steps for QP

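**We need to start from an initial $x^0, z^0, \lambda^0$, with $z^0$ inside the bounds:**

* That's easy: we set $\lambda^0 = 0$ and $z^0 = l$; if $A x^0 = l$ has a solution we can use it as $x^0$, but the first $x$ step only depends on $z^0$ and $\lambda^0$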

**The $x$ minimization step for $\hat{z} = z^k$ becomes:**

$$
\text{argmin}_x \ \frac{1}{2}x^TPx + q^Tx + \chi_{l \leq z \leq u}(\hat{z}) + \lambda^T(\hat{z}-Ax) + \frac{1}{2}\rho \|\hat{z}-Ax\|_2^2
$$

And then, since $\hat{z}$ is fixed and feasible:
$$
\text{argmin}_x \ \frac{1}{2}x^TPx + q^Tx + \lambda^T(\hat{z}-Ax) + \frac{1}{2}\rho \|\hat{z}-Ax\|_2^2
$$

This is a convex, differentiable, quadratic minimization problem

* It can be tackled via gradient descent
* ...Or by solving a linear system of equations
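
The linear system comes from setting the gradient to zero. Assuming the cost $\frac{1}{2}x^TPx + q^Tx$ from the original QP statement, the gradient is $Px + q - A^T\lambda - \rho A^T(\hat{z} - Ax)$, which gives $(P + \rho A^TA)\,x = -q + A^T(\lambda + \rho \hat{z})$. A minimal NumPy sketch follows; the function name and the toy data are assumptions, and $P + \rho A^TA$ is assumed to be invertible.

```python
import numpy as np

def x_step(P, q, A, z_hat, lam, rho):
    """Minimize 1/2 x'Px + q'x + lam'(z_hat - Ax) + rho/2 ||z_hat - Ax||^2."""
    lhs = P + rho * A.T @ A
    rhs = -q + A.T @ (lam + rho * z_hat)
    return np.linalg.solve(lhs, rhs)

# Hypothetical toy data
P = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0], [1.0, 0.0]])
z_hat = np.array([1.0, 0.5])
lam = np.array([0.1, -0.2])
rho = 1.0

x_new = x_step(P, q, A, z_hat, lam, rho)
# Sanity check: the gradient of the objective should vanish at x_new
grad = P @ x_new + q - A.T @ lam - rho * A.T @ (z_hat - A @ x_new)
```

Since the left-hand side matrix does not change as long as $\rho$ is fixed, it could be factorized once and reused at every iteration.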


## The ADMM Steps for QP

**The $z$ minimization step for $\hat{x} = x^{k+1}$ becomes**

$$
\text{argmin}_z \ \frac{1}{2}\hat{x}^T P\hat{x} + q^T \hat{x} + \chi_{l \leq z \leq u}(z) + \lambda^T(z-A \hat{x}) + \frac{1}{2}\rho \|z-A \hat{x}\|_2^2
$$

Since $\hat{x}$ is fixed, this can be reformulated as:

$$\begin{align}
\text{argmin } & \lambda^T z + \frac{1}{2} \rho\|z-A\hat{x}\|_2^2 \\
\text{subject to: } & l \leq z \leq u
\end{align}$$

...And finally separated into independent problems (one per component $z_j$ of $z$) in the form:
$$
\text{argmin}_{z_j} \left\{ \lambda_j z_j + \frac{1}{2}\rho(z_j-A_j\hat{x})^2 \mid l_j \leq z_j \leq u_j \right\}
$$
$$
* These are all very easy to solve
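
Each of these is a one-dimensional convex quadratic over an interval. Setting the derivative $\lambda_j + \rho(z_j - A_j\hat{x})$ to zero gives $z_j = A_j\hat{x} - \lambda_j/\rho$, and if that point falls outside $[l_j, u_j]$ the minimum is at the nearest bound. A minimal sketch, with an explicit loop to mirror the one-problem-per-component view (the function name and the numbers are made up):

```python
def z_step(Ax_hat, lam, l, u, rho):
    """Solve the independent 1-D problems of the z step in closed form."""
    z = []
    for a_j, lam_j, l_j, u_j in zip(Ax_hat, lam, l, u):
        unconstrained = a_j - lam_j / rho               # stationary point
        z.append(min(max(unconstrained, l_j), u_j))     # clip to [l_j, u_j]
    return z

z_new = z_step(
    Ax_hat=[0.5, 3.0, -1.0],   # the components A_j x_hat
    lam=[0.0, 1.0, 0.0],
    l=[0.0, 0.0, 0.0],
    u=[1.0, 1.0, 1.0],
    rho=2.0,
)
# z_new == [0.5, 1.0, 0.0]: kept as is, clipped to u_j, clipped to l_j
```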

## Some Considerations

**We used the ADMM to break QP into a sequence of simpler problems**

The method can be used in other clever ways:

* Optimization with non-differentiable regularizers
* Parallel training, by splitting examples into multiple problems
* ...And using constraints to reach a consensus

**The ADMM is best used for convex problems**

* Classical results are for convex problems only
* There are some (local) results on non-convex problems (e.g. [this one](https://ieeexplore.ieee.org/abstract/document/7239586))
* ...But in practice it's less reliable

**About the convergence pace**

* It's very fast in the first iterations, but much slower later
* You can get high-quality solutions early, but reaching the optimum takes long
* All in all, it's best to use the ADMM as an approximate method
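
Using the method as an approximate one means stopping as soon as the solution is good enough. Below is a minimal NumPy sketch of the whole loop for our QP, assuming the cost $\frac{1}{2}x^TPx + q^Tx$ and the constraint written as $z - Ax = 0$. The function name, the fixed $\rho$, the starting point, and the stopping rule on primal and dual residuals are assumptions made for illustration; this is not how OSQP is actually implemented.

```python
import numpy as np

def admm_qp(P, q, A, l, u, rho=1.0, tol=1e-8, max_iters=5000):
    """Basic ADMM for: min 1/2 x'Px + q'x  s.t.  l <= Ax <= u."""
    m = A.shape[0]
    z = np.clip(np.zeros(m), l, u)      # any starting z inside the bounds
    lam = np.zeros(m)
    lhs = P + rho * A.T @ A             # constant, since rho is fixed
    history = []                        # primal residual at each iteration
    for _ in range(max_iters):
        # x step: solve the linear system from the zero-gradient condition
        x = np.linalg.solve(lhs, -q + A.T @ (lam + rho * z))
        # z step: closed-form solution, clipped to the bounds
        z_prev = z
        z = np.clip(A @ x - lam / rho, l, u)
        # multiplier update with the current constraint violation
        violation = z - A @ x
        lam = lam + rho * violation
        # stop when both the violation and the change in z are small
        primal = np.linalg.norm(violation)
        dual = rho * np.linalg.norm(A.T @ (z - z_prev))
        history.append(primal)
        if primal < tol and dual < tol:
            break
    return x, z, lam, history

# Hypothetical toy data; the exact optimum is x = [0.25, 1.75]
P = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0], [1.0, 0.0]])
l = np.array([0.0, 0.0])
u = np.array([2.0, 1.0])

x_opt, z_opt, lam_opt, history = admm_qp(P, q, A, l, u)
```

Checking the change in $z$ together with the constraint violation matters: on this toy problem the violation is exactly zero after the first iteration, long before the iterates have settled. Loosening `tol` trades accuracy for fewer iterations.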