# Lecture 8, Optimality conditions and SQP methods

We are still studying the full problem

$$
\begin{align} \
\min \quad &f(x)\\
\text{s.t.} \quad & g_j(x) \geq 0\text{ for all }j=1,\ldots,J\\
& h_k(x) = 0\text{ for all }k=1,\ldots,K\\
&x\in \mathbb R^n.
\end{align}
$$

## What aspects are important for optimality conditions?
* Think about this for a while
* You can first think about the case without constraints and, then, what should be added there


In order to identify which points are optimal, we want to define similar conditions as there are for unconstrained problems through the gradient:

>If $x$ is a  local optimum to function $f$, then $\nabla f(x)=0$.

## KKT conditions



**Theorem (First order Karush-Kuhn-Tucker (KKT) Necessary Conditions)** 

Let $x^*$ be a local minimum for problem
$$
\begin{align} \
\min \quad &f(x)\\
\text{s.t.} \quad & g_j(x) \geq 0\text{ for all }j=1,\ldots,J\\
& h_k(x) = 0\text{ for all }k=1,\ldots,K\\
&x\in \mathbb R^n.
\end{align}
$$

Let us assume that the objective and constraint functions are continuously differentiable at the point $x^*$ and that $x^*$ satisfies some regularity conditions (see e.g., https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions#Regularity_conditions_.28or_constraint_qualifications.29 ). Then there exist Lagrange multiplier vectors $\mu^*=(\mu_1^*,\ldots,\mu_J^*)$ and $\lambda^* = (\lambda^*_1,\ldots,\lambda_K^*)$ such that

$$
\begin{align}
&\nabla_xL(x^*,\mu^*,\lambda^*) = 0\\
&\mu_j^*\geq0,\text{ for all }j=1,\ldots,J\\
&\mu_j^*g_j(x^*)=0,\text{for all }j=1,\ldots,J,
\end{align}
$$

where $L$ is the *Lagrangian function* $$L(x,\mu,\lambda) = f(x)- \sum_{j=1}^J\mu_jg_j(x) -\sum_{k=1}^K\lambda_kh_k(x)$$.

## An example of constraint qualifications for inequality constraint problems


**Definition (regular point)**

A feasible point $x^*$ is *regular* if the set of gradients of the active inequality constraints 

$$
\{\nabla g_j(x^*) \mid \text{constraint } j \text{ is active, i.e., } g_j(x^*)=0\}
$$

is linearly independent. This means that none of them can be expressed as a linear combination of the others. (*In a simple language one might say that they point to different directions; as an example you can think of the basis vectors of $\mathbb R^n$*.)
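Regularity can be checked numerically by stacking the gradients of the active constraints as rows of a matrix and comparing its rank with the number of rows. The following is a minimal sketch; the helper name `is_regular` and the example gradients are assumptions made for illustration, not part of the lecture material.

```python
import numpy as np

def is_regular(active_gradients):
    """Return True if the given gradient vectors are linearly independent."""
    G = np.atleast_2d(np.asarray(active_gradients, dtype=float))
    if G.size == 0:
        # No active constraints: the (empty) set is trivially independent.
        return True
    # Full row rank <=> the rows (gradients) are linearly independent.
    return int(np.linalg.matrix_rank(G)) == G.shape[0]

# Two basis vectors of R^2: linearly independent.
independent = is_regular([[1.0, 0.0], [0.0, 1.0]])
# The second gradient is a multiple of the first: not regular.
dependent = is_regular([[1.0, 1.0], [2.0, 2.0]])
print(independent, dependent)
```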

KKT conditions were developed independently by 
* Karush: "Minima of Functions of Several Variables with Inequalities as Side Constraints". *M.Sc. Dissertation*, Dept. of Mathematics, Univ. of Chicago, 1939
* Kuhn & Tucker: "Nonlinear programming", In: *Proceedings of 2nd Berkeley Symposium*, pp. 481–492, 1951

The coefficients $\mu$ and $\lambda$ are called the *KKT multipliers*.

The first equality 

$$
\nabla_xL(x^*,\mu^*,\lambda^*) = 0
$$

is called the stationary rule and the requirement 

$$
\mu_j^*g_j(x^*)=0,\text{ for all }j=1,\ldots,J
$$

is called the complementarity rule.

## Example

Consider the optimization problem

$$
\begin{align}
\min &\qquad (x_1^2+x^2_2+x^2_3)\\
\text{s.t}&\qquad x_1+x_2+x_3-3\geq 0.
\end{align}
$$

Let us verify the KKT necessary conditions for the local optimum $x^*=(1,1,1)$.

We can see that

$$
L(x,\mu,\lambda) = (x_1^2+x_2^2+x_3^2)-\mu_1(x_1+x_2+x_3-3)
$$

and thus

$$
\nabla_x L(x,\mu,\lambda) = (2x_1-\mu_1,2x_2-\mu_1,2x_3-\mu_1)
$$

Requiring $\nabla_x L((1,1,1),\mu,\lambda)=0$ gives

$$
2-\mu_1=0,
$$

which holds when $\mu_1=2$.

In addition, $\mu_1=2\geq 0$ and $x^*_1+x^*_2+x^*_3-3= 0$, so $\mu_1(x^*_1+x^*_2+x^*_3-3)=0$. Thus, the complementarity rule holds even though $\mu_1\neq 0$.
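The same verification can be done numerically. The sketch below hard-codes the gradients of this particular example; the function name `kkt_check` and the tolerance are assumptions chosen for illustration.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = x_1^2 + x_2^2 + x_3^2
    return 2.0 * np.asarray(x, dtype=float)

def g(x):
    # Inequality constraint g(x) = x_1 + x_2 + x_3 - 3 >= 0
    return float(np.sum(x) - 3.0)

def grad_g(x):
    return np.ones(len(x))

def kkt_check(x, mu, tol=1e-9):
    """Evaluate each KKT condition of the example at the point x with multiplier mu."""
    x = np.asarray(x, dtype=float)
    return {
        "feasible": bool(g(x) >= -tol),
        "stationary": bool(np.linalg.norm(grad_f(x) - mu * grad_g(x)) <= tol),
        "mu_nonnegative": bool(mu >= -tol),
        "complementary": bool(abs(mu * g(x)) <= tol),
    }

result = kkt_check([1.0, 1.0, 1.0], mu=2.0)
print(result)
```

All four entries are true for $x^*=(1,1,1)$ with $\mu_1=2$.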

## Example 2

Let us check the KKT conditions for a solution that is not a local optimum. Let us have $x^*=(0,1,1)$. (Note that this point is not even feasible, since $0+1+1-3<0$.)

We can easily see that in this case, the conditions are 

$$\left\{
\begin{array}{c}
-\mu_1 = 0\\
2-\mu_1=0
\end{array}
\right.
$$

Clearly, there does not exist a $\mu_1\in \mathbb R$ such that these equalities would hold.

## Example 3

Let us check the KKT conditions for another solution that is not a local optimum. Let us have $x^*=(2,2,2)$.

We can easily see that in this case, the conditions are

$$
4-\mu_1 = 0
$$

Now, $\mu_1=4$ satisfies this equation. However, now

$$
\mu_1(x^*_1+x^*_2+x^*_3-3)=4(6-3) = 12 \neq 0.
$$

Thus, the complementarity rule fails and the KKT conditions are not satisfied.
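The three examples can be compared side by side. Since there is a single constraint, the only candidate multiplier at a point is the least-squares solution of $\nabla f(x)=\mu_1\nabla g_1(x)$. The sketch below computes that candidate and then tests each condition; the function name `classify` and the tolerance are assumptions made for illustration.

```python
import numpy as np

def classify(x, tol=1e-9):
    """Estimate mu by least squares and test the KKT conditions of the example."""
    x = np.asarray(x, dtype=float)
    grad_f = 2.0 * x                 # gradient of x_1^2 + x_2^2 + x_3^2
    grad_g = np.ones_like(x)         # gradient of x_1 + x_2 + x_3 - 3
    g_val = float(x.sum() - 3.0)
    # Best mu in the least-squares sense for grad_f = mu * grad_g
    mu = float(grad_g @ grad_f) / float(grad_g @ grad_g)
    residual = float(np.linalg.norm(grad_f - mu * grad_g))
    return {
        "mu": mu,
        "stationary": residual <= tol,
        "mu_nonnegative": mu >= -tol,
        "complementary": abs(mu * g_val) <= tol,
    }

for point in ([1.0, 1.0, 1.0], [0.0, 1.0, 1.0], [2.0, 2.0, 2.0]):
    print(point, classify(point))
```

Only $(1,1,1)$ passes every test: $(0,1,1)$ fails the stationary rule and $(2,2,2)$ fails the complementarity rule.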

## Geometric interpretation of the KKT conditions

## Stationary rule

The stationary rule can be written as: There exist $\mu,\lambda'$ so that

$$
-\nabla f(x) = -\sum_{j=1}^J\mu_j\nabla g_j(x) + \sum_{k=1}^K\lambda'_k\nabla h_k(x).
$$

Notice that we have slightly different $\lambda'$.

Now, remember that the $-\nabla v(x)$ gives us the direction of reduction for a function $v$.

Thus, the above equation means that the direction of reduction of the objective function, $-\nabla f(x)$, is countered by the directions of reduction of the inequality constraints, $-\nabla g_j(x)$ (weighted by $\mu_j\geq 0$), and by the directions of either growth or reduction (since $\lambda'_k$ can be negative) of the equality constraints, $\nabla h_k(x)$.

**This means that the function cannot get reduced without reducing the inequality constraints (making the solution infeasible, if already at the bound), or increasing or decreasing the equality constraints (making, thus, the solution again infeasible).**
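This statement can be tested empirically on the example problem from above ($f(x)=x_1^2+x_2^2+x_3^2$ with the constraint $x_1+x_2+x_3-3\geq 0$) at $x^*=(1,1,1)$. The sketch below samples random directions, keeps those that do not decrease the active constraint to first order, and counts how many of them are descent directions for $f$. The sample size and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

x_star = np.array([1.0, 1.0, 1.0])
grad_f = 2.0 * x_star      # gradient of f at x*
grad_g = np.ones(3)        # gradient of the (active) constraint at x*

directions = rng.normal(size=(1000, 3))

# Directions that keep the linear constraint satisfied: grad_g . d >= 0
feasible = directions[directions @ grad_g >= 0.0]

# Among those, count the directions along which f decreases to first order
descent_among_feasible = int(np.sum(feasible @ grad_f < -1e-12))

print(len(feasible), descent_among_feasible)
```

No feasible direction reduces $f$, because here $\nabla f(x^*)=2\nabla g_1(x^*)$.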



#### With just one inequality constraint this means that the negative gradients of $f$ and $g$ must point to the same direction.

![alt text](images/KKT_inequality_constraints.svg "KKT with inequality constraint")

#### With equality constraints this means that the negative gradient of the objective function and the gradient of the equality constraint must either point to the same or opposite directions

![alt text](images/KKT_equality_constraints.svg "KKT with equality constraint")

## Complementarity conditions
Another way of expressing the complementarity condition

$$
\mu_jg_j(x) = 0 \text{ for all } j=1,\ldots,J
$$

is to say that $\mu_j$ and $g_j(x)$ cannot both be positive at the same time. In particular, if $\mu_j>0$, then $g_j(x)=0$.

**This means that if we want to use the gradient of a constraint for countering the reduction of the function, then the constraint must be at the boundary.**
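A one-dimensional toy problem, which is not part of the lecture material, makes the two cases concrete: minimize $x^2$ subject to $x-c\geq 0$. The stationary rule reads $2x-\mu=0$, and complementarity leaves two cases: either $\mu=0$ (constraint not needed) or $x=c$ (constraint active). The function name `solve_kkt_1d` and the parameter `c` are hypothetical.

```python
def solve_kkt_1d(c):
    """Solve min x^2 s.t. x - c >= 0 by enumerating the complementarity cases."""
    # Case 1: mu = 0, so stationarity 2x - mu = 0 gives x = 0.
    x, mu = 0.0, 0.0
    if x - c >= 0.0:
        return x, mu          # the unconstrained minimum is feasible
    # Case 2: the constraint is active, x = c, and mu = 2x from stationarity.
    x = float(c)
    mu = 2.0 * x              # positive here, because c > 0 in this branch
    return x, mu

inactive = solve_kkt_1d(-1.0)   # constraint x + 1 >= 0 is inactive at the optimum
active = solve_kkt_1d(1.0)      # constraint x - 1 >= 0 is active at the optimum
print(inactive, active)
```

In the first case the multiplier is zero and the constraint has slack; in the second the multiplier is positive and the constraint holds with equality.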

## Sequential Quadratic Programming (SQP)

**The idea is to generate a sequence of quadratic optimization problems whose solutions approach the solution of the original problem**

Let us consider problem

$$
\begin{align}
\min \quad & f(x)\\
\text{s.t.} \quad & h_k(x) = 0\text{ for all }k=1,\ldots,K,
\end{align}
$$

where the objective function and the equality constraints are twice differentiable. 

Inequality constraints can be handled e.g. by using active set methods. See e.g. https://optimization.mccormick.northwestern.edu/index.php/Sequential_quadratic_programming

**Note that constraints can be nonlinear.**

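Because we know that the optimal solution of this problem satisfies the KKT conditions, we know that the following holds. (Here the Lagrangian is written as $L(x,\lambda)=f(x)+\lambda h(x)$; since the sign of $\lambda$ is unrestricted for equality constraints, this differs from the earlier definition only by replacing $\lambda$ with $-\lambda$.)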

$$
\left\{\begin{array}{l}
\nabla_xL(x,\lambda)=\nabla_x f(x) + \lambda\nabla_x h(x) = 0\\
h(x) = 0
\end{array}\right.
$$

Let us assume that we have a current estimate $(x^k,\lambda^k)$ of the solution of this system of equations. Then, according to Newton's method for root finding (see e.g., https://en.wikipedia.org/wiki/Newton%27s_method#Generalizations ), we obtain a new estimate $(x^k,\lambda^k)^T+(p,v)^T$ by solving the system of equations

$$
\nabla_{x,\lambda} S(x^k,\lambda^k)\left[\begin{array}{c}p^T\\v^T\end{array}\right] = -S(x^k,\lambda^k),
$$

where $S(x^k,\lambda^k)=\left[
\begin{array}{c}
\nabla_{x}L(x^k,\lambda^k)\\
h(x^k)
\end{array}
\right]
$. 


This can be written as

$$
\left[
\begin{array}{cc}
\nabla^2_{xx}L(x^k,\lambda^k)&\nabla_x h(x^k)\\
\nabla_x h(x^k)^T & 0
\end{array}
\right]
\left[\begin{array}{c}p^T\\v^T\end{array}\right] =
\left[
\begin{array}{c}
-\nabla_x L(x^k,\lambda^k)\\
-h(x^k)^T
\end{array}
\right].
$$
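A single Newton step of this form can be carried out by hand with NumPy. The sketch below uses a small illustrative problem that is an assumption of this example: minimize $x_1^2+x_2^2$ subject to $x_1+x_2-2=0$, with $L=f+\lambda h$. Because the objective is quadratic and the constraint is linear, one step from any starting point lands on the solution.

```python
import numpy as np

# Current estimates (arbitrary starting values)
x = np.array([5.0, -1.0])
lam = np.array([1.0])

# Derivatives for f(x) = x_1^2 + x_2^2 and h(x) = x_1 + x_2 - 2
hess_L = 2.0 * np.eye(2)              # Hessian of L with respect to x
jac_h = np.array([[1.0, 1.0]])        # rows are the constraint gradients
grad_L = 2.0 * x + jac_h.T @ lam      # gradient of L = f + lam * h
h_val = np.array([x.sum() - 2.0])

# Assemble the block system [[H, grad h], [grad h^T, 0]] [p; v] = [-grad L; -h]
kkt_matrix = np.block([[hess_L, jac_h.T],
                       [jac_h, np.zeros((1, 1))]])
rhs = -np.concatenate([grad_L, h_val])

step = np.linalg.solve(kkt_matrix, rhs)
p, v = step[:2], step[2:]

x_new, lam_new = x + p, lam + v
print(x_new, lam_new)
```

The step gives $x=(1,1)$ and $\lambda=-2$, which satisfy $2x+\lambda(1,1)^T=0$ and $x_1+x_2=2$.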


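However, the above system is exactly the first-order optimality (KKT) condition of the quadratic problem with linear equality constraints
$$
\begin{align}
\min_p \quad & \frac12 p^T\nabla^2_{xx}L(x^k,\lambda^k)p+\nabla_xL(x^k,\lambda^k)^Tp\\
\text{s.t.} \quad & h_j(x^k) + \nabla h_j(x^k)^Tp = 0\text{ for all }j=1,\ldots,K,
\end{align}
$$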
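The connection between the linear system and the quadratic problem can be verified by writing out the optimality conditions of the quadratic problem. The following is a sketch for column vectors $p$ and $v$, where the shorthands $H=\nabla^2_{xx}L(x^k,\lambda^k)$ and $A=\nabla_xh(x^k)$ (the matrix whose columns are the gradients $\nabla h_j(x^k)$) are introduced only for this derivation, and $v$ plays the role of the multiplier of the linearized constraints:

$$
\begin{align}
\mathcal L_{QP}(p,v) &= \frac12 p^THp + \nabla_xL(x^k,\lambda^k)^Tp + v^T\left(h(x^k)+A^Tp\right),\\
\nabla_p\mathcal L_{QP}(p,v) &= Hp + \nabla_xL(x^k,\lambda^k) + Av = 0,\\
\nabla_v\mathcal L_{QP}(p,v) &= h(x^k)+A^Tp = 0.
\end{align}
$$

Moving the terms that do not depend on $(p,v)$ to the right-hand side gives $Hp+Av=-\nabla_xL(x^k,\lambda^k)$ and $A^Tp=-h(x^k)$, which are the two block rows of the Newton system.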

## Intuitive interpretation

We are approximating the Lagrange function quadratically around the current solution and the constraints are approximated linearly. SQP methods are also referred to as *projected Lagrangian methods* 
* compare with projected gradient method from lecture 7

**Another viewpoint on building the approximation**: Assume that we have a current solution candidate $(x^k,\lambda^k)$. Using the Taylor series for the constraints ($d = x^*-x^k$) and including only the first-order term, we get

$h_i(x^*)=h_i(x^k+d)\approx h_i(x^k) + \nabla h_i(x^k)^Td$

Since $h_i(x^*)=0$ for all $i$, we require

$\nabla h(x^k)d = -h(x^k)$,

where $\nabla h(x^k)$ here denotes the matrix whose rows are $\nabla h_i(x^k)^T$.

For approximating the Lagrangian function, we use up to second order terms and get

$L(x^k+d,\lambda^*) \approx L(x^k,\lambda^*) + d^T\nabla_x L(x^k,\lambda^*) + \frac{1}{2}d^T\nabla_{xx}^2L(x^k,\lambda^*)d$

When combining both approximations, we get a quadratic subproblem

$$
\begin{align}
\underset{d}{\min} \quad & d^T\nabla_x L(x^k,\lambda^k) + \frac{1}{2}d^T \nabla_{xx}^2L(x^k,\lambda^k)d\\
\text{s.t.} \quad & \nabla h(x^k)d = -h(x^k)
\end{align}
$$

It can be shown (under some assumptions) that solutions of the quadratic subproblems approach $x^*$ and Lagrange multipliers approach $\lambda^*$.

Note that we can either use the exact Hessian of the Lagrange function (requires second derivatives) or some approximation of it (compare Newton's method vs. quasi-Newton ideas). 
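To see the iteration at work on a nonlinear constraint, the sketch below applies it with the exact Hessian and hand-coded derivatives to a small problem chosen for this illustration: minimize $x_1+x_2$ subject to $x_1^2+x_2^2-2=0$, with $L=f+\lambda h$. The function name `sqp_circle`, the starting point and the stopping rule (norm of the KKT residual) are assumptions of this example. The method is only locally convergent, so the start is taken fairly close to the solution $x^*=(-1,-1)$, $\lambda^*=1/2$.

```python
import numpy as np

def sqp_circle(x0, lam0, tol=1e-10, max_iter=50):
    """Newton/SQP iteration for min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0."""
    x = np.asarray(x0, dtype=float)
    lam = float(lam0)
    for _ in range(max_iter):
        grad_h = 2.0 * x                        # gradient of the constraint
        h_val = float(x @ x - 2.0)              # constraint value
        grad_L = np.ones(2) + lam * grad_h      # gradient of L = f + lam * h
        residual = np.append(grad_L, h_val)
        if np.linalg.norm(residual) < tol:
            break                               # KKT conditions hold
        hess_L = 2.0 * lam * np.eye(2)          # exact Hessian of L in x
        kkt = np.block([[hess_L, grad_h.reshape(2, 1)],
                        [grad_h.reshape(1, 2), np.zeros((1, 1))]])
        step = np.linalg.solve(kkt, -residual)  # solves the quadratic subproblem
        x = x + step[:2]
        lam = lam + step[2]
    return x, lam

x_opt, lam_opt = sqp_circle([-1.5, -0.5], lam0=1.0)
print(x_opt, lam_opt)
```

Replacing `hess_L` by a quasi-Newton approximation would avoid second derivatives, at the cost of slower convergence.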

## Implementation

Define an optimization problem, where
* $f(x) = \|x\|^2 = \sum_{i=1}^n x_i^2$
* $h(x) = \sum_{i=1}^nx_i-n = 0$

What is $x^*$?

In [None]:
def f_constrained(x):
    return sum([i**2 for i in x]),[],[sum(x)-len(x)]
#    return sum([i**2 for i in x]),[],[sum(x)-len(x),x[0]**2+x[1]-2]


In [None]:
print(f_constrained([1,0,1]))
print(f_constrained([1,2,3,4]))

In [None]:
import numpy as np
import ad


#if k=0, returns the gradient of lagrangian, if k=1, returns the hessian
def diff_L(f,x,l,k):
    #Define the lagrangian for given f and Lagrangian multiplier vector l
    L = lambda x_: f(x_)[0] + (np.matrix(f(x_)[2])*np.matrix(l).transpose())[0,0]
    return ad.gh(L)[k](x)

#Returns the gradients of the equality constraints
def grad_h(f,x):
    return  [ad.gh(lambda y:
                   f(y)[2][i])[0](x) for i in range(len(f(x)[2]))] 

#Solves the quadratic problem inside the SQP method
def solve_QP(f,x,l):
    left_side_first_row = np.concatenate((\
    np.matrix(diff_L(f,x,l,1)),\
    np.matrix(grad_h(f,x)).transpose()),axis=1)
    left_side_second_row = np.concatenate((\
    np.matrix(grad_h(f,x)),\
    np.matrix(np.zeros((len(f(x)[2]),len(f(x)[2]))))),axis=1)
    right_hand_side = np.concatenate((\
    -1*np.matrix(diff_L(f,x,l,0)).transpose(),
    -np.matrix(f(x)[2]).transpose()),axis = 0)
    left_hand_side = np.concatenate((\
                                    left_side_first_row,\
                                    left_side_second_row),axis = 0)
    temp = np.linalg.solve(left_hand_side,right_hand_side)
    return temp[:len(x)],temp[len(x):] # update for both x and l
    
    

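def SQP(f,start,precision):
    x = start
    l = np.ones(len(f(x)[2])) # initialize Lagrange multiplier vector l as a vector of 1s
    f_old = float('inf')
    f_new = f(x)[0]
    # stop when the objective value no longer changes by more than the given precision
    while abs(f_old-f_new)>precision:
        print(x)
        f_old = f_new
        (p,v) = solve_QP(f,x,l) # obtain updates for x and l by solving the quadratic subproblem
        x = x+np.array(p.transpose())[0] # update the solution x 
        # flatten v to a 1-D array first; adding the (K,1) matrix directly
        # would broadcast l to a (K,K) matrix when there are K>1 constraints
        l = l+np.array(v.transpose())[0] # update the Lagrange multipliers l
        f_new = f(x)[0]
    return x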

In [None]:
x0 = [2000.0,1000.0,3000.0,5000.0,6000.0]
SQP(f_constrained,x0,0.0001)