# Optimization

Before we get started, Professor Shi strongly recommends reading the following reference book:
  *非线性最优化基础 (Fundamentals of Nonlinear Optimization), Fukushima, 科学出版社 (Science Press)*

## 1. Gradient Descent Method

Gradient descent is an iterative optimization algorithm for finding the minimum of a function. It is very effective for solving convex optimization problems. A function $f(\bm{w})$ is convex if:
$$
f(t\bm{w}_1 + (1-t)\bm{w}_2) \leq tf(\bm{w}_1) + (1-t)f(\bm{w}_2), \quad \forall \bm{w}_1, \bm{w}_2 \in \mathbb{R}^n,\forall t\in [0, 1] \tag{1.1}
$$
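As a quick numerical sanity check of inequality (1.1), the sketch below samples random points and random values of $t$. The test functions, the dimension, and the helper name `satisfies_convexity` are illustrative choices, not part of the original notes. A sampled check can only refute convexity, never prove it.

```python
import numpy as np

def satisfies_convexity(f, dim, trials=1000, seed=0):
    """Test inequality (1.1) on random samples; False means a violation was found."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        w1, w2 = rng.normal(size=dim), rng.normal(size=dim)
        t = rng.uniform()
        lhs = f(t * w1 + (1 - t) * w2)
        rhs = t * f(w1) + (1 - t) * f(w2)
        if lhs > rhs + 1e-12:  # small tolerance for rounding errors
            return False
    return True

def squared_norm(w):
    return float(np.dot(w, w))       # convex

def neg_squared_norm(w):
    return -float(np.dot(w, w))      # concave, so (1.1) is violated

convex_ok = satisfies_convexity(squared_norm, dim=3)
concave_ok = satisfies_convexity(neg_squared_norm, dim=3)
```

Here `convex_ok` comes out `True` and `concave_ok` comes out `False`, since for the negated squared norm the left-hand side exceeds the right-hand side by $t(1-t)||\bm{w}_1-\bm{w}_2||^2$.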
To implement the (projected) gradient descent method, we proceed as follows:
- Step 1. Initialize the parameters $\bm{w}_0\in W$ and set the learning rate $\eta_{t}$.
- Step 2. Gradient descent: $\bm{w}_{t+1}' = \bm{w}_t - \eta_{t}\nabla f(\bm{w}_t)$
- Step 3. Projection: $\bm{w}_{t+1}=P_{W}(\bm{w}_{t+1}')$
- Repeat Step 2 and Step 3 for $T$ steps.
- Step 4. Return $\bar{\bm{w}}_{T}=\frac{1}{T} \sum_{t=1}^{T}\bm{w}_{t}$<br>
  
Step 3 guarantees that the parameters $\bm{w}$ stay in the feasible set $W$, where $P_{W}(\bm{z})= \argmin_{\bm{x}\in W}||\bm{x}-\bm{z}||$ is the Euclidean projection onto $W$.
If we take a constant learning rate $\eta_{t}=\eta$, the function $f(\bm{w})$ is convex and $l$-Lipschitz continuous, and the feasible set $W$ is bounded, then we have the following theorem:
>**Theorem 1.1**. $f(\bar{\bm{w}}_{T})-\min_{\bm{w}\in W}f(\bm{w})\leq O(\frac{1}{\sqrt{T}})$, where $\eta=\Gamma/(l\sqrt{T})$, $\Gamma$ is the diameter of $W$, and $l$ is an upper bound on the norm of the gradient, $||\nabla f||\leq l$.
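The steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the feasible set $W$ is taken to be a Euclidean ball of radius `radius` centered at the origin (so $\Gamma$ is twice the radius), the objective is $f(\bm{w})=\frac{1}{2}||\bm{w}-\bm{c}||^2$ with $\bm{c}$ outside $W$, and the step size is $\eta=\Gamma/(l\sqrt{T})$. The names `project_onto_ball` and `projected_gradient_descent` are hypothetical.

```python
import numpy as np

def project_onto_ball(z, radius):
    """Euclidean projection P_W(z) for W = {x : ||x|| <= radius}."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def projected_gradient_descent(grad, w0, radius, eta, T):
    """Run T projected steps and return the averaged iterate of Step 4."""
    w = np.asarray(w0, dtype=float)
    total = np.zeros_like(w)
    for _ in range(T):
        w = project_onto_ball(w - eta * grad(w), radius)  # Steps 2 and 3
        total += w                                        # accumulates w_1, ..., w_T
    return total / T

c = np.array([3.0, 4.0])   # unconstrained minimizer, lies outside W
radius = 1.0

def f(w):
    return 0.5 * np.dot(w - c, w - c)

def grad(w):
    return w - c

T = 10_000
Gamma = 2 * radius                        # diameter of the ball
l = radius + np.linalg.norm(c)            # bound on ||grad f|| over W
eta = Gamma / (l * np.sqrt(T))
w_bar = projected_gradient_descent(grad, np.zeros(2), radius, eta, T)
w_star = radius * c / np.linalg.norm(c)   # point of W closest to c
gap = f(w_bar) - f(w_star)
```

For this problem the constrained minimizer `w_star` is known in closed form, so `gap` can be compared directly with the $\Gamma l/\sqrt{T}$ scale that the theorem predicts.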

## 2. Conjugate Gradient Method
Sometimes a linear system $Ax=b$ can be transformed into a minimization problem $x=\argmin_{x}f(x)$, where $f(x)=\frac{1}{2}x^{T}Ax-b^{T}x$. <br>
For symmetric $A$, the gradient of $f$ is $\nabla f=Ax-b$ and the Hessian matrix is $A$.<br> Here we suppose that $A$ is a symmetric positive definite matrix, so $f$ is strictly convex and its unique minimizer satisfies $Ax=b$. We can then use the conjugate gradient method to solve the linear system $Ax=b$. <br>
Before we show the process, we first introduce the following definition:

>**Definition 2.1**. The vectors $p_{0}, p_{1}, \cdots, p_{k}$ are called conjugate with respect to the symmetric positive definite matrix $A$, or A-conjugate, if $p_{i}^{T}Ap_{j}=<p_{i},Ap_{j}>\equiv<p_{i},p_{j}>_{A}=0$ for $i\neq j$.
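The facts used so far can be checked numerically on a small example. The sketch below compares $\nabla f=Ax-b$ against central finite differences, confirms that the solution of $Ax=b$ is a stationary point of $f$, and shows a pair of vectors that is A-conjugate without being orthogonal. The matrix, the evaluation point, and the vectors `p0`, `p1` are illustrative choices.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([3.0, 4.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

# Central finite differences approximate the gradient at an arbitrary point.
x = np.array([0.3, -1.2])
h = 1e-6
fd_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])

# The solution of Ax = b is a stationary point of f.
x_star = np.linalg.solve(A, b)

# Definition 2.1: p0 and p1 are A-conjugate although they are not orthogonal.
p0 = np.array([1.0, 0.0])
p1 = np.array([-1.0, 2.0])
conjugacy = p0 @ A @ p1   # 0.0
plain_dot = p0 @ p1       # -1.0
```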

To implement the conjugate gradient method, we proceed as follows:
- Step 1. Initialize $x_{0}$, the residual $r_{0}=b-Ax_{0}$, the search direction $p_{0}=r_{0}$, and $k=0$.
- Step 2. Compute the step size $\alpha_{k}=\frac{< r_{k},p_{k}>}{<p_{k},p_{k}>_{A}}$. Update the variables as follows:
  $$ x_{k+1}=x_{k}+\alpha_{k}p_{k} $$   
  $$ r_{k+1}=r_{k}-\alpha_{k}Ap_{k}$$  
  $$ \beta_{k}=\frac{<r_{k+1},p_{k}>_{A}}{<p_{k},p_{k}>_{A}} $$
  $$ p_{k+1}=r_{k+1}-\beta_{k}p_{k} $$
  $$ k \leftarrow k+1 $$
- Step 3. Repeat Step 2 until $||r_{k}||<\epsilon$.

The details can be found in the `CGM.solve()` function in the code.
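The steps above can also be sketched as a standalone function that follows the formulas literally, including the A-inner-product form of $\beta_{k}$ and the minus sign in the direction update. The function name `conjugate_gradient`, the default tolerance, and the `max_iter` safeguard are assumptions added for illustration. The sketch assumes $A$ is symmetric positive definite.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A; returns (x, iterations)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                      # Step 1: residual r_0
    p = r.copy()                       #         direction p_0 = r_0
    if max_iter is None:
        max_iter = 10 * len(b)         # safeguard against non-convergence
    k = 0
    while np.linalg.norm(r) >= eps and k < max_iter:
        Ap = A @ p
        pAp = p @ Ap                   # <p_k, p_k>_A
        alpha = (r @ p) / pAp          # step size alpha_k
        x = x + alpha * p
        r = r - alpha * Ap
        beta = (r @ Ap) / pAp          # <r_{k+1}, p_k>_A / <p_k, p_k>_A
        p = r - beta * p               # next direction, A-conjugate to p_k
        k += 1
    return x, k

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
b = np.array([3.0, 4.0])
x, iterations = conjugate_gradient(A, b)
```

In exact arithmetic the method terminates in at most $n$ iterations for an $n\times n$ system, so `max_iter` only matters when rounding errors or an invalid matrix prevent convergence.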

#### Some explanations

If we can find a set of A-conjugate vectors $p_{1},\dots,p_{n}$, then they are linearly independent, so we can express the solution $x$ as a linear combination of them:
$$
x=\sum_{i=1}^{n}\alpha_{i}p_{i}
$$
Substituting this expansion into $Ax=b$ gives:
$$
\sum_{i=1}^{n}\alpha_{i}Ap_{i}=b
$$
Taking the ordinary inner product of both sides with $p_{j}$ and using A-conjugacy (every term with $i\neq j$ vanishes) gives the coefficient $\alpha_{j}$:
$$
\alpha_{j}=\frac{<b,p_{j}>}{<p_{j},p_{j}>_{A}}
$$
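The coefficient formula can be tried out on any known A-conjugate set. The sketch below assumes, purely for illustration, that the eigenvectors of $A$ are used as directions: the eigenvectors of a symmetric matrix returned by `np.linalg.eigh` are orthonormal, and therefore also A-conjugate.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
b = np.array([3.0, 4.0])

# Orthonormal eigenvectors of a symmetric matrix are A-conjugate:
# v_i^T A v_j = lambda_j * v_i^T v_j = 0 for i != j.
_, V = np.linalg.eigh(A)
directions = [V[:, j] for j in range(V.shape[1])]

x = np.zeros_like(b)
for p in directions:
    alpha = (b @ p) / (p @ A @ p)   # alpha_j = <b, p_j> / <p_j, p_j>_A
    x += alpha * p                  # accumulate x = sum_j alpha_j p_j
```

Computing eigenvectors is far more expensive than solving the system, so this only illustrates the expansion. The conjugate gradient method builds its directions on the fly instead.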

The question is: how do we find the A-conjugate vectors $p_{1},\dots,p_{n}$? <br>
We can answer this by extending the **Gram-Schmidt method** to the A-inner product. Suppose we have found $p_{1},\dots,p_{k}$, and $y_{k}$ is a vector that is not a linear combination of them (which guarantees $p_{k+1}\neq 0$). Then we can obtain $p_{k+1}$ from:
$$
p_{k+1}=y_{k}-\sum_{i=1}^{k}\frac{<y_{k},p_{i}>_{A}}{<p_{i},p_{i}>_{A}}p_{i}
$$
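A minimal sketch of this A-conjugate Gram-Schmidt process is given below. The function name `a_conjugate_gram_schmidt` and the choice of the standard basis as input vectors $y_{k}$ are assumptions for illustration. The columns of `Y` must be linearly independent.

```python
import numpy as np

def a_conjugate_gram_schmidt(A, Y):
    """Turn the linearly independent columns of Y into A-conjugate directions."""
    directions = []
    for y in Y.T:
        p = y.astype(float)
        for q in directions:
            # subtract the A-projection of y onto each earlier direction
            p = p - (y @ A @ q) / (q @ A @ q) * q
        directions.append(p)
    return np.column_stack(directions)

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
P = a_conjugate_gram_schmidt(A, np.eye(2))
gram = P.T @ A @ P   # diagonal exactly when the columns of P are A-conjugate
```

For this input the directions are $(1,0)^{T}$ and $(-\frac{1}{2},1)^{T}$, and `gram` is the diagonal matrix with entries $2$ and $1.5$.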


>**Theorem 2.1**. $r_{l}\perp span\{r_{0}, \dots,r_{l-1}\}$, $p_{l}\perp^{A}span\{p_{0},p_{1},\dots,p_{l-1}\}$, and $span\{p_{0},\dots,p_{l-1}\}=span\{ r_{0},\dots,r_{l-1}\}=span\{ r_{0},Ar_{0},\dots,A^{l-1}r_{0}\}$

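Proof (by mathematical induction):<br>
1. Base case $l=1$: obviously $span\{ p_{0}\}=span\{r_{0}\}$ because $p_{0}=r_{0}$.<br>
   $r_{1}\perp span\{r_{0}\} \Leftrightarrow <b-Ax_{1},r_{0}>=<r_{0}-\alpha_{0}Ap_{0},r_{0}>=<r_{0},r_{0}>-\alpha_{0}<p_{0},p_{0}>_{A}=0$, which holds because $p_{0}=r_{0}$ and $\alpha_{0}=\frac{<r_{0},p_{0}>}{<p_{0},p_{0}>_{A}}$.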
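Theorem 2.1 can also be checked numerically. The sketch below runs the conjugate gradient recurrences on a random symmetric positive definite matrix, stores the residuals and directions, and tests the three claims for $l=3$. The matrix construction, the problem size, and the helper `off_diagonal_max` are illustrative assumptions. A numerical check is evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)     # symmetric positive definite by construction
b = rng.normal(size=n)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
residuals, directions = [r], [p]
for _ in range(n - 1):
    Ap = A @ p
    alpha = (r @ p) / (p @ Ap)
    x = x + alpha * p
    r = r - alpha * Ap
    beta = (r @ Ap) / (p @ Ap)
    p = r - beta * p
    residuals.append(r)
    directions.append(p)

R = np.column_stack(residuals)
P = np.column_stack(directions)

def off_diagonal_max(G):
    """Largest absolute off-diagonal entry of a square matrix."""
    return np.max(np.abs(G - np.diag(np.diag(G))))

residual_overlap = off_diagonal_max(R.T @ R)       # ~0: residuals are orthogonal
direction_overlap = off_diagonal_max(P.T @ A @ P)  # ~0: directions are A-conjugate

# Span claim for l = 3: stacking the three families must not raise the rank above 3.
l = 3
K = np.column_stack([np.linalg.matrix_power(A, j) @ residuals[0] for j in range(l)])
stacked = np.column_stack([P[:, :l], R[:, :l], K])
stacked = stacked / np.linalg.norm(stacked, axis=0)   # normalize columns
joint_rank = np.linalg.matrix_rank(stacked, tol=1e-8)
```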

In [5]:
import numpy as np

In [60]:
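class CGM():
    """Conjugate gradient solver for Ax = b, with A symmetric positive definite."""
    def __init__(self,A,b):
        self.A = A
        self.b = b
        # start from x_0 = 0; use a float dtype so integer inputs are handled correctly
        self.x = np.zeros_like(b, dtype=float)
        self.steps = 0
    def solve(self,A=None,b=None):
        if A is not None:
            self.A = A
        if b is not None:
            self.b = b
        r = self.b - np.dot(self.A,self.x)  # residual r_0 = b - A x_0
        p = r                               # first search direction p_0 = r_0
        steps = 0
        while np.linalg.norm(r) > 1e-10:
            Ap = np.dot(self.A,p)
            rr_old = np.dot(r,r)            # <r_k, r_k>, needed for beta below
            alpha = np.dot(r,p)/np.dot(p,Ap)
            # the two alternatives below give the same alpha in exact arithmetic
            #alpha = np.dot(self.b,p)/np.dot(p,Ap) way2 to calculate alpha (requires x_0 = 0)
            #alpha = np.dot(r,r)/np.dot(p,Ap) way3 to calculate alpha
            print(f'dot(r,p) = {np.dot(r,p)}')
            print(f'dot(r,r) = {np.dot(r,r)}')
            print(f'dot(b,p) = {np.dot(self.b,p)}')
            self.x = self.x + alpha*p
            r = r - alpha*Ap
            # beta = <r_{k+1}, r_{k+1}> / <r_k, r_k>; this equals
            # -<r_{k+1}, p_k>_A / <p_k, p_k>_A, hence the plus sign in the update of p
            beta = np.dot(r,r)/rr_old
            p = r + beta*p
            steps+=1
        self.steps = steps
        return self.x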

In [61]:
A = np.array([[2,1],
              [1,2]])#A is symmetric and positive definite
b = np.array([3,4])
cgm=CGM(A,b) 

In [63]:
cgm.solve()

array([0.66666667, 1.66666667])

In [64]:
np.linalg.eig(A)

(array([3., 1.]),
 array([[ 0.70710678, -0.70710678],
        [ 0.70710678,  0.70710678]]))

In [41]:
cgm.steps

2

In [32]:
cgm.A=np.array([[1,1],[2,1]])