#### `Equality constrained` optimization

For a twice-differentiable `convex` function $f: \mathbf{R}^n \rightarrow \mathbf{R}$, we want to solve

$$\min f(x), \text{s.t. } Ax=b$$

where $A\in \mathbf{R}^{p \times n}, \, \text{rank }A=p$ (`independent rows`), and $p^*$ denotes the optimal value

We can write out the necessary and sufficient (if and only if) `optimality condition`

Start with Lagrangian

$$L(x, \nu)=f(x)+\nu^T(Ax-b)$$

With vanishing gradient of Lagrangian w.r.t. $x$, we have

$$\nabla_xL=\nabla f(x)+A^T\nu=0$$

and together with primal feasibility, we have the optimality condition

$$\boxed{x\in \text{dom }f, \, Ax=b, \nabla f(x)+A^T\nu = 0}$$

for Lagrange multipliers $\nu \in \mathbf{R}^p$

#### `Quadratic` example

$$\min \frac{1}{2}x^TPx+q^Tx+r, \text{s.t. } Ax=b$$

where $P\in S^n_+, \text{rank }A=p$

We can write the optimality condition

$$Ax=b,\, Px+q+A^T\nu=0$$

which is a set of $n+p$ linear equations in $n+p$ variables

$$\begin{bmatrix}P & A^T \\
A & 0\end{bmatrix}\begin{bmatrix}x \\ \nu\end{bmatrix}=\begin{bmatrix}-q \\ b\end{bmatrix}$$

where the coefficient matrix is the `KKT matrix`
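As a minimal sketch of this linear system, the snippet below assembles the KKT matrix with NumPy and solves for $x$ and $\nu$ in one shot. The data `P`, `q`, `A`, `b` are made-up illustrative values (not from the text), chosen so that $P$ is PSD but singular while the KKT matrix is still invertible.

```python
import numpy as np

# Hypothetical problem data: n = 3 variables, p = 1 equality constraint
P = np.diag([2.0, 1.0, 0.0])          # PSD, but singular on its own
q = np.array([-1.0, 0.0, 1.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

n, p = P.shape[0], A.shape[0]

# KKT matrix [[P, A^T], [A, 0]] and right-hand side [-q, b]
kkt = np.block([[P, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-q, b])

sol = np.linalg.solve(kkt, rhs)
x_star, nu_star = sol[:n], sol[n:]

print("x* =", x_star)
print("nu* =", nu_star)
print("primal residual:", A @ x_star - b)
print("dual residual:  ", P @ x_star + q + A.T @ nu_star)
```

Both residuals should vanish up to rounding, which is exactly the optimality condition stated above.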

The equivalent condition for `nonsingularity` of KKT matrix is

* $\text{rank}(\begin{bmatrix}P \\ A\end{bmatrix})=n$

a) Clearly, KKT full rank $\Longrightarrow$ first $n$ columns, which are exactly $\begin{bmatrix}P \\ A\end{bmatrix}$, are independent

b) To show that first $n$ columns in KKT are independent $\Longrightarrow$ KKT invertible, we assume KKT is `not invertible`

Then, there exist $x, \nu$ that are `not both zero` such that

$$Px+A^T\nu=0, Ax=0$$

or (left multiply by $x^T$ in first equation, and transpose and right multiply $\nu$ in second equation)

$$x^TPx+x^TA^T\nu=0, x^TA^T\nu=0$$

Therefore, $x^TPx=0$

Since $P$ is PSD, the only way this happens is $Px=0$

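To see this, since $P$ is symmetric PSD, it has an eigendecomposition $P=Q\Lambda Q^T$ with orthogonal $Q$ (columns $q_i$) and eigenvalues $\lambda_i\geq 0$, so we can write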

$$x^TPx=(Q^Tx)^T\Lambda (Q^Tx)=\sum_{i}\lambda_i(q_i^Tx)^2$$

Because all $\lambda_i\geq 0$, therefore, the only way $x^TPx=0$ is $q_i^Tx=0$ for all $i$ corresponding to $\lambda_i > 0$

Therefore

$$Px=Q\Lambda Q^Tx = \sum_{i}\left[\lambda_i(q_i^Tx)\right]q_i=0$$

So, we have

$$\begin{bmatrix}P\\A\end{bmatrix}x=0$$

Since $\begin{bmatrix}P\\A\end{bmatrix}$ has independent columns, the only way this happens is that $x=0$

With $Px=0$, the first equation $Px+A^T\nu=0$ gives $A^T\nu=0$

Since $\text{rank }A=p$, then, all of its $p$ rows (or $p$ columns of $A^T$) are independent, therefore $\nu=0$  

As a result, we have $x=\nu=0$, which is a contradiction, thus, KKT is invertible
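The equivalence can also be checked numerically. The sketch below compares the rank of $\begin{bmatrix}P \\ A\end{bmatrix}$ with the rank of the KKT matrix for two hypothetical PSD matrices; the helper `kkt_matrix` and both test matrices are illustrative assumptions, and `np.linalg.matrix_rank` uses a numerical tolerance rather than an exact test.

```python
import numpy as np

def kkt_matrix(P, A):
    """Assemble [[P, A^T], [A, 0]]."""
    p = A.shape[0]
    return np.block([[P, A.T], [A, np.zeros((p, p))]])

A = np.array([[1.0, 1.0, 1.0]])       # rank p = 1
cases = {
    "rank([P; A]) = n": np.diag([2.0, 1.0, 0.0]),
    "rank([P; A]) < n": np.diag([2.0, 0.0, 0.0]),
}

results = {}
for name, P in cases.items():
    n, p = P.shape[0], A.shape[0]
    # Condition from the text: stacked matrix has n independent columns
    stacked_full = bool(np.linalg.matrix_rank(np.vstack([P, A])) == n)
    # KKT matrix is nonsingular iff its rank is n + p
    kkt_full = bool(np.linalg.matrix_rank(kkt_matrix(P, A)) == n + p)
    results[name] = (stacked_full, kkt_full)
    print(name, "| stacked full column rank:", stacked_full,
          "| KKT nonsingular:", kkt_full)
```

In both cases the two flags agree, as the argument above predicts.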

#### Equality constrained `Newton step`

Recall Newton step in unconstrained optimization is based on 2nd order Taylor approximation of the function at some $x$

$$\begin{align*}
\Delta x_{nt}&=\arg \min_v f(x+v) \\
&\approx \arg \min_v f(x)+\nabla f(x)^Tv+\frac{1}{2}v^T\nabla^2 f(x) v
\end{align*}$$

With equality constraint (assume $x\in \text{dom }f$ and $Ax=b$)

$$A(x+v)=b$$

we can write the `optimality conditions` ($x\in \text{dom }f, \, Ax=b, \nabla f(x)+A^T\nu = 0$) as (note, take derivative w.r.t. $v$, not $x$)

$$\boxed{A(x+v)=b ,\, \nabla f(x+v)+A^Tw \approx \nabla f(x) + \nabla^2f(x)v+A^Tw=0}$$

Using KKT matrix, and with $A(x+v)=b, Ax=b \Longrightarrow Av=0$, we can see that the solution of $v$ that is `Newton step` solves

$$\begin{bmatrix}
\nabla^2 f(x) & A^T \\ A & 0
\end{bmatrix}\begin{bmatrix}
v\\w
\end{bmatrix}=\begin{bmatrix}
-\nabla f(x)\\0
\end{bmatrix}$$

Since solution $v$ is in the `nullspace` of $A$, we know that applying Newton step would always keep $x+\Delta x_{nt}$ `feasible`
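A small sketch of this computation, assuming the test function $f(x)=\sum_i e^{x_i}$ (whose gradient is $e^{x}$ elementwise and whose Hessian is $\text{diag}(e^{x})$) and a single constraint $x_1+x_2+x_3=3$. The function name `newton_step` and the data are illustrative choices, not from the text.

```python
import numpy as np

def newton_step(g, H, A):
    """Solve the KKT system for the Newton step v and the dual variable w."""
    n, p = H.shape[0], A.shape[0]
    kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-g, np.zeros(p)])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n], sol[n:]

# Assumed example: f(x) = sum(exp(x_i)), constraint x_1 + x_2 + x_3 = 3
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])
x = np.array([3.0, 0.0, 0.0])         # feasible starting point: A @ x == b

g = np.exp(x)                         # gradient of f at x
H = np.diag(np.exp(x))                # Hessian of f at x
v, w = newton_step(g, H, A)

print("A @ v       =", A @ v)         # numerically zero: v lies in the nullspace of A
print("A @ (x + v) =", A @ (x + v))   # still equal to b
```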

#### Equality constrained `Newton decrement`

Similarly, we define Newton decrement as

$$\begin{align*}
\lambda(x)&= \left(\Delta x_{nt}^T \nabla^2 f(x) \Delta x_{nt}\right)^{1/2} \\
& = \left(-\nabla f(x)^T\Delta x_{nt}\right)^{1/2}
\end{align*}$$

which is still the `Newton step measured using quadratic norm under the Hessian`

The `second expression` of $\lambda(x)$ comes from the fact that Newton step satisfies KKT equations, so we have

$$\nabla^2 f(x)\Delta x_{nt}+A^Tw=-\nabla f(x)$$

Multiply $\Delta x_{nt}^T$ from the left on each side of first block equation

$$\Delta x_{nt}^T\nabla^2 f(x)\Delta x_{nt}+\Delta x_{nt}^TA^Tw=-\Delta x_{nt}^T\nabla f(x)$$

From second block equation we have the feasibility condition

$$A\Delta x_{nt}=0$$

Plug into the first block equation

$$\Delta x_{nt}^T\nabla^2 f(x)\Delta x_{nt}=-\Delta x_{nt}^T\nabla f(x)$$

Since right hand side is just a dot product, we have

$$\boxed{\Delta x_{nt}^T\nabla^2 f(x)\Delta x_{nt}=-\nabla f(x)^T \Delta x_{nt}}$$

`However`

$$\lambda(x)\neq\left(\nabla f(x)^T \left(\nabla^2f(x)\right)^{-1}\nabla f(x)\right)^{1/2}$$

since in general constrained case

$$\Delta x_{nt}\neq -\left(\nabla^2f(x)\right)^{-1}\nabla f(x)$$

Similarly, we can relate Newton decrement to 2nd order approximation of $f$ at $x$ for equality-constrained case

$$\begin{align*}
f(x)-\inf_{u} \hat{f}(x+u) &= f(x)-\hat{f}(x+\Delta x_{nt}) \\
& = f(x) - \left(f(x)+\nabla f(x)^T\Delta x_{nt}+\frac{1}{2}\Delta x_{nt}^T\nabla^2f(x)\Delta x_{nt}\right) \\
& \qquad \text{(using } \Delta x_{nt}^T\nabla^2 f(x)\Delta x_{nt}=-\nabla f(x)^T \Delta x_{nt}\text{)}\\
& = \Delta x_{nt}^T\nabla^2 f(x)\Delta x_{nt}-\frac{1}{2}\Delta x_{nt}^T\nabla^2f(x)\Delta x_{nt} \\
&=\boxed{\frac{1}{2}\lambda(x)^2}
\end{align*}$$

We see that $\frac{1}{2}\lambda(x)^2$ again provides an estimate of $f(x)-p^*$ `based on 2nd order approximation` of $f$ at $x$
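These identities are easy to verify numerically. The sketch below again assumes the illustrative function $f(x)=\sum_i e^{x_i}$ with one constraint row; the variable names are hypothetical. It compares the two expressions for $\lambda(x)^2$, the unconstrained formula, and the decrease predicted by the quadratic model.

```python
import numpy as np

# Assumed example: f(x) = sum(exp(x_i)) with constraint row A = [1, 1, 1]
A = np.array([[1.0, 1.0, 1.0]])
x = np.array([3.0, 0.0, 0.0])
g, H = np.exp(x), np.diag(np.exp(x))

# Newton step from the KKT system
p = A.shape[0]
kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
dx = np.linalg.solve(kkt, np.concatenate([-g, np.zeros(p)]))[:x.size]

lam_sq_hessian = dx @ H @ dx                        # first expression
lam_sq_gradient = -(g @ dx)                         # second expression
lam_sq_unconstrained = g @ np.linalg.solve(H, g)    # unconstrained formula

# Quadratic model of f evaluated at x + dx
f = lambda z: np.sum(np.exp(z))
f_hat = f(x) + g @ dx + 0.5 * dx @ H @ dx

print("dx^T H dx         :", lam_sq_hessian)
print("-grad^T dx        :", lam_sq_gradient)
print("grad^T H^-1 grad  :", lam_sq_unconstrained)
print("f(x) - f_hat      :", f(x) - f_hat, " vs lambda^2 / 2:", 0.5 * lam_sq_hessian)
```

The first two numbers coincide, the third differs, and the model decrease equals $\frac{1}{2}\lambda(x)^2$.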

#### `Newton's method` with equality constraints

We now have steps for equality-constrained Newton's method

Start with a `feasible point` $x\in \text{dom }f, Ax=b$
* compute Newton `step` $\Delta x_{nt}$ from KKT equation (i.e., solve for $v$) and Newton `decrement`
$$\lambda(x)=\left(\Delta x_{nt}^T \nabla^2 f(x) \Delta x_{nt}\right)^{1/2}$$
* stopping criterion
$$\frac{1}{2}\lambda(x)^2 \leq \epsilon$$
* line search for `step size`, starting at $t=1$, backtrack $t\leftarrow \beta t$ until
$$\begin{align*}f(x+t\Delta x_{nt})&<f(x)+\alpha t \nabla f(x)^T\Delta x_{nt} \\
& = f(x)-\alpha t \lambda(x)^2
\end{align*}$$
* update
$$x\leftarrow x+t\Delta x_{nt}$$
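The steps above can be sketched as a short routine. This is a minimal illustration rather than a robust solver: the name `newton_eq`, the default parameters, and the test function $f(x)=\sum_i e^{x_i}$ are assumptions, and the backtracking loop omits a domain check because this $f$ is defined on all of $\mathbf{R}^n$.

```python
import numpy as np

def newton_eq(f, grad, hess, A, x0, eps=1e-12, alpha=0.25, beta=0.5, max_iter=50):
    """Feasible-start Newton's method; x0 must satisfy A @ x0 == b."""
    x = np.asarray(x0, dtype=float)
    n, p = x.size, A.shape[0]
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        # Newton step from the KKT system (right-hand side [-g, 0])
        kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
        dx = np.linalg.solve(kkt, np.concatenate([-g, np.zeros(p)]))[:n]
        lam_sq = dx @ H @ dx
        # Stopping criterion: lambda^2 / 2 <= eps
        if lam_sq / 2 <= eps:
            break
        # Backtracking line search starting at t = 1
        t = 1.0
        while f(x + t * dx) > f(x) - alpha * t * lam_sq:
            t *= beta
        x = x + t * dx
    return x

# Assumed example: minimize sum(exp(x_i)) subject to x_1 + x_2 + x_3 = 3
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])
x_opt = newton_eq(lambda x: np.sum(np.exp(x)),
                  lambda x: np.exp(x),
                  lambda x: np.diag(np.exp(x)),
                  A, x0=np.array([3.0, 0.0, 0.0]))

print("x_opt =", x_opt)
print("A @ x_opt - b =", A @ x_opt - b)
```

By symmetry the minimizer of this example is $x=(1,1,1)$, and every iterate stays feasible because each step lies in the nullspace of $A$.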

#### Newton step at `infeasible points`

If we are currently at an infeasible point, meaning $Ax\neq b$, then, we have $Av=-(Ax-b)$, and the KKT equation becomes

$$\begin{bmatrix}
\nabla^2 f(x) & A^T \\ A & 0
\end{bmatrix}\begin{bmatrix}
v\\w
\end{bmatrix}=-\begin{bmatrix}
\nabla f(x)\\Ax-b
\end{bmatrix}$$

From the solution, we can see that if we take the `full` Newton step $\Delta x_{nt}$, then $x+\Delta x_{nt}$ will be `feasible`, since $A(x+\Delta x_{nt})=Ax-(Ax-b)=b$; the remaining iterations are then covered by the previous analysis of the Newton step at feasible points
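A quick numerical check of this claim, assuming once more the illustrative function $f(x)=\sum_i e^{x_i}$ and a hypothetical infeasible starting point. It also shows that a damped step with $t<1$ only shrinks the primal residual by the factor $(1-t)$.

```python
import numpy as np

# Assumed example: f(x) = sum(exp(x_i)), constraint x_1 + x_2 + x_3 = 3
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])
x = np.array([0.0, 1.0, -1.0])        # infeasible: A @ x = 0, not 3

g, H = np.exp(x), np.diag(np.exp(x))
p = A.shape[0]

# KKT system with the primal residual A x - b in the right-hand side
kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
rhs = -np.concatenate([g, A @ x - b])
dx = np.linalg.solve(kkt, rhs)[:x.size]

print("residual before:          ", A @ x - b)
print("residual after full step: ", A @ (x + dx) - b)
print("residual after half step: ", A @ (x + 0.5 * dx) - b)
```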

##### Primal-dual interpretation

We can also update both primal and dual variables explicitly (previously, we didn't really touch the dual variable)

If we write out `residual` from the optimality condition, we have

$$r(x, \nu)=\begin{bmatrix}\nabla f(x)+A^T\nu \\ Ax-b \end{bmatrix}$$

Linearize and set it to zero (since we want residual to be zero), we have

$$r(x, \nu)+Dr(x, \nu)\begin{bmatrix} \Delta x_{nt} \\ \Delta \nu_{nt} \end{bmatrix}=0$$

For the Jacobian, we have

$$Dr(x, \nu)=\begin{bmatrix}\frac{\partial r_1}{\partial x} & \frac{\partial r_1}{\partial \nu}\\ \frac{\partial r_2}{\partial x} & \frac{\partial r_2}{\partial \nu} \end{bmatrix}=\begin{bmatrix}\nabla^2 f(x) & A^T\\ A & 0 \end{bmatrix}$$

Rearrange

$$\begin{bmatrix}\nabla^2 f(x) & A^T\\ A & 0 \end{bmatrix}\begin{bmatrix} \Delta x_{nt} \\ \Delta \nu_{nt} \end{bmatrix}=-\begin{bmatrix}\nabla f(x)+A^T\nu \\ Ax-b \end{bmatrix}$$

Define the updated dual variable $w=\nu + \Delta \nu_{nt}$

Moving the $A^T\nu$ term to the left-hand side and absorbing it into $w$, the same system reads

$$\begin{bmatrix}
\nabla^2 f(x) & A^T \\ A & 0
\end{bmatrix}\begin{bmatrix}
\Delta x_{nt}\\ w
\end{bmatrix}=-\begin{bmatrix}
\nabla f(x)\\Ax-b
\end{bmatrix}$$

which is what we have previously

#### Newton's method with infeasible start

Start with a point $x\in \text{dom }f$ and $\nu$
* compute primal and dual Newton `step` $\Delta x_{nt}, \Delta \nu_{nt}$
* line search for `step size`, starting at $t=1$, backtrack $t\leftarrow \beta t$ until
$$\|r(x+t\Delta x_{nt}, \nu+t\Delta \nu_{nt})\|_2\leq (1-\alpha t)\|r(x, \nu)\|_2$$
* update
$$x\leftarrow x+t\Delta x_{nt}, \nu\leftarrow \nu+t\Delta \nu_{nt}$$
* terminate if $Ax=b$ and $\|r(x,\nu)\|_2\leq \epsilon$

The reason we cannot use function value for line search is that in order to get back to feasible set, function value may need to increase, so it is no longer a descent method
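The infeasible-start iteration can be sketched as follows. The name `newton_infeasible`, the default parameters, the test function $f(x)=\sum_i e^{x_i}$, and the use of `np.allclose` as a tolerance-based stand-in for the exact test $Ax=b$ are all assumptions made for illustration.

```python
import numpy as np

def newton_infeasible(grad, hess, A, b, x0, nu0,
                      eps=1e-10, alpha=0.25, beta=0.5, max_iter=50):
    """Infeasible-start Newton's method with line search on the residual norm."""
    x, nu = np.asarray(x0, dtype=float), np.asarray(nu0, dtype=float)
    n, p = x.size, A.shape[0]

    def residual(x, nu):
        # Stacked dual and primal residuals r(x, nu)
        return np.concatenate([grad(x) + A.T @ nu, A @ x - b])

    for _ in range(max_iter):
        r = residual(x, nu)
        # Primal and dual Newton steps from the linearized residual
        kkt = np.block([[hess(x), A.T], [A, np.zeros((p, p))]])
        step = np.linalg.solve(kkt, -r)
        dx, dnu = step[:n], step[n:]
        # Backtrack until the residual norm decreases sufficiently
        t = 1.0
        while (np.linalg.norm(residual(x + t * dx, nu + t * dnu))
               > (1 - alpha * t) * np.linalg.norm(r)):
            t *= beta
        x, nu = x + t * dx, nu + t * dnu
        # Terminate once feasible (up to tolerance) with a small residual
        if np.allclose(A @ x, b) and np.linalg.norm(residual(x, nu)) <= eps:
            break
    return x, nu

# Assumed example: minimize sum(exp(x_i)) subject to x_1 + x_2 + x_3 = 3
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])
x_opt, nu_opt = newton_infeasible(lambda x: np.exp(x),
                                  lambda x: np.diag(np.exp(x)),
                                  A, b,
                                  x0=np.array([0.0, 1.0, -1.0]),
                                  nu0=np.zeros(1))

print("x_opt  =", x_opt)
print("nu_opt =", nu_opt)
```

For this example the optimality condition $e^{x_i}+\nu=0$ with $x=(1,1,1)$ gives $\nu=-e$, which the iterates should approach.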

#### Solving KKT equations

We can use LDLT factorization to solve KKT system of equations

$$Kx=\begin{bmatrix}H &A^T \\A & 0\end{bmatrix}x=r$$

We do the following

* $K=LDL^T$
* Forward substitution for $Ly=r$
* Scaling for $Dz=y$
* Back substitution for $L^Tx=z$

Toy example

$$K = \begin{bmatrix}2 & 1 & 1 \\ 1 & 2 & 0 \\ 1 &0 & 0 \end{bmatrix}, r=-\begin{bmatrix}1 \\ 2 \\ 3\end{bmatrix}$$

In [None]:
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(formatter={'float': '{: 0.4f}'.format})

plt.style.use('dark_background')
# color: https://matplotlib.org/stable/gallery/color/named_colors.htm

In [None]:
def ldlt_factorization(A):
    # Assume A is symmetric
    m = A.shape[0]
    l_mat = A.copy().astype(float)
    d_mat = np.zeros(m)

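    for k in range(m):
        d_mat[k] = l_mat[k, k]
        if l_mat[k, k] == 0:
            # No pivoting: a zero pivot stops the factorization,
            # even though the matrix itself may be nonsingular
            raise ValueError('Zero pivot encountered (no pivoting)')

        # Schur complement update of the trailing block
        l_mat[k+1:, k+1:] -= np.outer(l_mat[k+1:, k], l_mat[k+1:, k]) / l_mat[k, k]
        # Scale column k by the pivot (the diagonal entry becomes 1)
        l_mat[k:, k] /= l_mat[k, k]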

    return np.tril(l_mat), np.diag(d_mat)

def forward_substitution(L, b):
    m, n = L.shape
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - np.dot(L[i, :i], x[:i])) / L[i, i]
    return x

def back_substitution(R, b):
    m, n = R.shape
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - np.dot(R[i, i + 1:], x[i + 1:])) / R[i, i]
    return x

In [None]:
k_mat = np.array([[2, 1, 1], [1, 2, 0], [1, 0, 0]])
r = - np.array([1, 2, 3])

# Factorize; a ValueError from a zero pivot propagates instead of
# being printed and followed by a NameError on l_mat below
l_mat, d_mat = ldlt_factorization(k_mat)

# Forward
y = forward_substitution(l_mat, r)

# Scaling
z = y / np.diag(d_mat)

# Back
x = back_substitution(l_mat.T, z)
print("Solution:\n", x)

# Compare to NumPy
print("\nNumPy solution:\n", np.linalg.solve(k_mat, r))

Solution:
 [-3.0000  0.5000  4.5000]

NumPy solution:
 [-3.0000  0.5000  4.5000]
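The hand-written factorization uses no pivoting, so it stops whenever a diagonal pivot is zero, which can happen for a perfectly valid KKT matrix. As a hedged alternative, the sketch below uses SciPy's `scipy.linalg.ldl`, which applies symmetric pivoting and may return 2x2 blocks in `d`. The matrix `K` here is a hypothetical KKT matrix with $H=\text{diag}(0, 2)$ and $A=\begin{bmatrix}1 & 1\end{bmatrix}$, chosen so that the leading entry is zero.

```python
import numpy as np
from scipy.linalg import ldl, solve_triangular

# Assumed example: nonsingular KKT matrix whose first pivot is zero
K = np.array([[0.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
r = -np.array([1.0, 2.0, 3.0])

# K = lu @ d @ lu.T, where lu[perm, :] is lower triangular
lu, d, perm = ldl(K, lower=True)
L = lu[perm, :]

# Solve the permuted system L d L^T x[perm] = r[perm]
y = solve_triangular(L, r[perm], lower=True)      # forward substitution
z = np.linalg.solve(d, y)                         # block-diagonal "scaling"
x = np.empty_like(r)
x[perm] = solve_triangular(L.T, z, lower=False)   # back substitution

print("Solution:", x)
print("Residual:", K @ x - r)
```

The three stages mirror the forward substitution, scaling, and back substitution steps listed earlier; the only differences are the permutation and the fact that the scaling step solves with a block-diagonal `d` instead of dividing elementwise.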
