In [None]:
import numpy as np

### 1. Optimization Background

Let us start with a quick introduction to optimization. Assume $f: \mathbb{R}^n \to \mathbb{R}$ is a convex function and $\mathcal{F}$ is a convex set. A convex optimization problem is

\begin{align}\tag{convex opt}
\mathrm{minimize} \  f(x) \quad \mathrm{subject\;to} \  x \in \mathcal{F} \subseteq \mathbb{R}^n.
\end{align}

A feasible point $x^\star \in \mathcal{F}$ is called a *global minimizer* if for every $x \in \mathcal{F}$ we have
$$ f(x^\star) \leq f(x). $$

Well-known results guarantee that such a global minimizer exists when $\mathcal{F}$ is a nonempty, closed, and bounded set (a convex function on $\mathbb{R}^n$ is continuous, and a continuous function attains its minimum on such a set). Closed means, roughly, that the set includes its limit points (*e.g.*, $(0,1)$ is open but $[0,1]$ is closed). Bounded means that the set is not something like $[0, \infty)$: intuitively, if the set is bounded, you cannot pick a direction and keep moving in it forever without leaving the set.

Now assume $\mathcal{F} = \mathbb{R}^n$, which means we are solving an *unconstrained optimization problem*. Calculus tells us that, if $f$ is convex and its gradient $\nabla f$ exists and is continuous, then a point $x \in \mathbb{R}^n$ is a global minimizer if and only if $\nabla f(x) = 0$. In general, however, finding a point that satisfies this *first order condition* ($\nabla f(x) = 0$) is not immediate. For this reason, several algorithms have been proposed that iteratively update a candidate solution until this optimality condition is met.

The most widely used algorithm is the *gradient descent method*. The algorithm starts at iteration $k=0$ from a starting point $x_0 \in \mathbb{R}^n$. The next candidate solution is then constructed as $x_{1} = x_0 - \alpha_0 \cdot\nabla f(x_0)$, where $\alpha_0 > 0$ is a constant named the *step size*. In other words, $x_1$ is obtained by taking the previous iterate $x_0$ and moving in the direction $- \nabla f(x_0)$ with step size $\alpha_0$. The algorithm keeps iterating for $k= 1,2,\ldots$ by the same rule:
$$x_{k+1} = x_{k}-  \alpha_k \cdot \nabla f(x_{k}), \tag{gradient descent}$$
and stops when $\nabla{f}(x_k) = 0$.
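
As a minimal sketch of this update rule, the snippet below runs gradient descent on an illustrative one-dimensional convex function, $f(x) = (x-3)^2$. The function name `gradient_descent`, the tolerance `tol`, and the cap `max_iter` are assumptions made for this illustration; in floating-point arithmetic the exact condition $\nabla f(x_k) = 0$ is replaced by a small tolerance.

```python
def gradient_descent(grad, x0, alpha=0.1, tol=1e-6, max_iter=10_000):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) until |grad(x_k)| <= tol."""
    x = x0
    for k in range(max_iter):
        g = grad(x)
        if abs(g) <= tol:   # practical stand-in for "the gradient equals zero"
            return x, k
        x = x - alpha * g   # step against the gradient at the current point
    return x, max_iter

# Illustrative function f(x) = (x - 3)^2 with derivative f'(x) = 2 (x - 3).
x_min, n_steps = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min, n_steps)
```

The returned point should lie very close to the analytic minimizer $x^\star = 3$.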

**Exercise:**
Solve the following problem analytically.

\begin{align}
\mathrm{minimize} \ f(x) = (x_1 - 2)^2 + (3 \cdot x_2 - 4)^2 \quad \mathrm{subject\;to} \ x \in \mathbb{R}^2.
\end{align}

Then, solve the problem via gradient descent and report the number of iterations it takes for the algorithm to converge. Take the starting point $x_0 = (0, \ 0)$ and fix $\alpha_k = 0.01$ for all iterations. As a stopping condition, terminate once the 2-norm of the gradient, $||\nabla f(x_k)||_2$, is at most $10^{-6}$.

**Answer:**

Analytic: we can derive that $\nabla f(x) = \begin{bmatrix} 2\cdot x_1 - 4 \\ 18\cdot x_2 - 24  \end{bmatrix}$ and when we set this to 0 we will have $x^\star = ( 2, \ 4/3)$.
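
One way to sanity-check the analytic gradient is to compare it with central finite differences. The sketch below does this at an arbitrary test point and also evaluates the gradient at $x^\star$; the helper `numerical_grad`, the step `h`, and the test point are assumptions introduced only for this check.

```python
import numpy as np

def f(x):
    return (x[0] - 2) ** 2 + (3 * x[1] - 4) ** 2

def grad_f(x):
    # Analytic gradient derived above.
    return np.array([2 * x[0] - 4, 18 * x[1] - 24])

def numerical_grad(func, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

x_test = np.array([0.5, -1.0])        # arbitrary test point
x_star = np.array([2.0, 4.0 / 3.0])   # analytic minimizer

print(grad_f(x_test), numerical_grad(f, x_test))
print(grad_f(x_star))                 # should be (numerically) zero
```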

The gradient descent algorithm is implemented below.

In [None]:
def fn(x1, x2):
    return (x1 - 2)**2 + (3*x2 - 4)**2
def grad(x1, x2):
    return np.array([2*x1 - 4, 18*x2 - 24])

In [None]:
x0 = np.zeros(2)
alpha, iteration = 0.01, 0
tol = 1e-6  # stopping tolerance on the 2-norm of the gradient
xk = x0
converged = False
while not converged:  # iterate until the gradient norm is at most tol
    iteration += 1
    if iteration % 50 == 0:
        print(xk)  # progress report: the iterate before this step's update
    xk = xk - alpha * grad(xk[0], xk[1])  # gradient descent update
    converged = np.linalg.norm(grad(xk[0], xk[1])) <= tol

[1.25679657 1.33325357]
[1.72934785 1.33333333]
[1.90143669 1.33333333]
[1.96410623 1.33333333]
[1.98692858 1.33333333]
[1.99523978 1.33333333]
[1.99826647 1.33333333]
[1.9993687  1.33333333]
[1.9997701  1.33333333]
[1.99991628 1.33333333]
[1.99996951 1.33333333]
[1.9999889  1.33333333]
[1.99999596 1.33333333]
[1.99999853 1.33333333]
[1.99999946 1.33333333]


In [None]:
print("Optimal objective value of", round(fn(xk[0], xk[1]),4),\
      "with the solution", np.round(xk,4), "in",  iteration, "iterations.")

Optimal objective value of 0.0 with the solution [2.     1.3333] in 753 iterations.
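
The iteration count can also be cross-checked without running the algorithm. Because the gradient is linear in $x$, each gradient component shrinks by a fixed factor per step: the first by $1 - 2\alpha = 0.98$ and the second by $1 - 18\alpha = 0.82$, starting from $\nabla f(x_0) = (-4, \ -24)$. The sketch below uses this closed form; the helper name `grad_norm_after` is an assumption made for this illustration.

```python
import math

alpha, tol = 0.01, 1e-6

def grad_norm_after(k):
    """2-norm of the gradient after k gradient descent steps from x0 = (0, 0)."""
    g1 = -4 * (1 - 2 * alpha) ** k     # first component contracts by 0.98 per step
    g2 = -24 * (1 - 18 * alpha) ** k   # second component contracts by 0.82 per step
    return math.hypot(g1, g2)

k = 0
while grad_norm_after(k) > tol:
    k += 1

# The slowly contracting first component dominates, giving a direct estimate.
k_estimate = math.ceil(math.log(4 / tol) / -math.log(1 - 2 * alpha))
print(k, k_estimate)
```

Both values agree with the 753 iterations reported above, and the calculation shows that the count is driven almost entirely by the $x_1$ coordinate, which has the smaller curvature.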


### 2. Optimization in Neural Networks

Recall that in neural networks the optimization variables are the weights of the network. You may ask:
1. What is the optimization function we are interested in? Is it convex?
2. How can we use gradient descent for neural networks?
3. How do we compute the gradients in a complicated network?
4. How do we initialize weights in a neural network?

It turns out that, although the standard loss functions are convex in their inputs, they are not convex in the optimization variables in the context of neural networks. This is due to the compositions we apply in neural networks (recall from the previous notebook that the optimization variables are transformed by several compositions before the decision is evaluated).

For neural networks, using gradient descent is perfectly fine. However, we will see a variant of it, named the *stochastic gradient descent method*, which will improve the speed and performance of the algorithm.

Finally, to compute the gradient of the loss function with respect to the weights, we will learn a concept named *backpropagation*. We will concentrate on these topics more now.

#### Loss functions

The loss functions used in neural networks are the same as the ones we used in previous weeks. For example, for regression with neural networks we may use the loss $L = ||y - \hat{y}||^2_2$, where $y$ is the vector of true target values and $\hat{y}$ is the vector of estimates computed from the predictors. So, if we have training instances $i = 1,\ldots, n$ and define $L_i := (y_i - \hat{y}_i)^2$, then our loss function can be written as:
$$ L := L_1 + L_2 + \ldots + L_n = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \ldots +(y_n - \hat{y}_n)^2.$$
This function is convex in $\hat{y}_i$ for all $i = 1,\ldots,n$, but **this does not mean the optimization problem is convex**. The issue is that in neural networks we cannot optimize $\hat{y}_i$ directly. We need to learn a function that takes the predictors of the input and returns an estimate $\hat{y}_i$ by optimizing some network weights, and the loss function is typically **not** convex in these weights. Let us work on an example below.

**Question**
Find the prediction $\hat{y}$ for the input $x= ( x_1 = 2, \ x_2 = -3)$ of the following neural network with a single hidden layer.
<img src="forward.jpeg" alt="Drawing" style="width: 300px;"/>

**Answer**
We first compute the outputs of the neurons $r_1$ and $r_2$ on the hidden layer, and then proceed to the output $s$.

- The input to $r_1$ is
$(-1,3,-0.1)\cdot (1,2,-3)
= -1\cdot 1+3\cdot 2+(-0.1)\cdot(-3)
= 5.3$,
so its output is $\max(0,5.3)=5.3$.
- The input to $r_2$ is
$(0.2,-1,0.5)\cdot(1,2,-3) = -3.3$,
so its output is $\max(0,-3.3)=0$.
- The input to $s$ is
$(-0.2,0.4)\cdot(5.3,0) = -1.06$.
Its output, applying the sigmoid function $e^{-1.06}/(1+e^{-1.06})$, is $0.2573$.

If this is a binary classification setting this means that the neural network returns probability
$0.2573$ for class 1 and $0.7427$ for class 0.
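
The hand computation above can be reproduced with a few lines of NumPy. This is only a sketch: the array names `W_hidden` and `w_out` and the convention of prepending a constant 1 to the input for the bias are assumptions chosen to mirror the dot products written above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # same function as e^z / (1 + e^z)

x = np.array([1.0, 2.0, -3.0])            # leading 1 multiplies the bias weight
W_hidden = np.array([[-1.0, 3.0, -0.1],   # weights into r_1: (bias, x_1, x_2)
                     [0.2, -1.0, 0.5]])   # weights into r_2: (bias, x_1, x_2)
w_out = np.array([-0.2, 0.4])             # weights from (r_1, r_2) into s

hidden = relu(W_hidden @ x)               # outputs of r_1 and r_2
y_hat = sigmoid(w_out @ hidden)           # output of s
print(hidden, round(float(y_hat), 4))
```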


**Question** Use `torch` to answer the question above by using the scripts we derived in the previous notebook.

**Answer** Omitted for space purposes.

#### Convexity
You may have realized that, in the above example, the value we returned, $\hat{y}$, is the result of a complicated function $\hat{f}(x;W)$, where $W$ is the collection of weights and $x$ is the vector of predictors of the input. In the above example, our prediction function looks like the following:
$$ \hat{f}(x;W) = s\left[ -0.2 \cdot r( 3 \cdot x_1 - 0.1 \cdot x_2 - 1)  + 0.4 \cdot r(-x_1 + 0.5\cdot x_2 + 0.2) \right]$$
where $s[z]:= e^z / (1 + e^z)$ is the sigmoid function and $r(z) = \max\{0,z\}$ is the ReLU function. If we substitute these functions explicitly, the expression looks even more complicated. Remember also that this is a very small setting with a single hidden layer and a two-dimensional input, whereas in practice we have many hidden layers, many nodes, several activation functions, and high-dimensional inputs. This function is **not** convex in the weights: if we treat the weights of this network (written in grey, *e.g.*, $3, -1, -0.1, 0.5, \ldots$) as optimization variables, then the loss function $||\hat{f}(x;W) -  y||$ is no longer convex in the elements of $W$. As a result, gradient descent may return a solution that is not *globally optimal*; instead, we hope to obtain a "good enough" solution.
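
To see non-convexity concretely, consider a much smaller, hypothetical model than the network above: a single ReLU neuron with output $a \cdot \max\{0, w \cdot x\}$ and one training instance $x = 1$, $y = 1$. The sketch below (all names and numbers are assumptions chosen for illustration) picks two weight settings that both fit the data exactly and evaluates the squared loss at their midpoint. For a convex function, the loss at the midpoint could not exceed the average of the two endpoint losses.

```python
def loss(a, w, x=1.0, y=1.0):
    """Squared error of the one-neuron model a * max(0, w * x)."""
    return (y - a * max(0.0, w * x)) ** 2

W1 = (2.0, 0.5)   # a * w = 1, so the prediction equals y
W2 = (0.5, 2.0)   # a * w = 1 as well
mid = tuple(0.5 * (p + q) for p, q in zip(W1, W2))   # midpoint (1.25, 1.25)

print(loss(*W1), loss(*W2))   # both endpoint losses are zero
print(mid, loss(*mid))        # midpoint loss is strictly positive
```

Since the midpoint loss is larger than the average of the endpoint losses, the loss is not convex in $(a, w)$, even though it is convex in the prediction itself.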

#### Gradient Descent
Although, as discussed, the gradient descent method is not guaranteed to give a globally optimal solution, we are still interested in finding a good set of weights for the above network. In general, if we keep the weights as variables, we can represent the network as follows:
<img src="forward_weights.jpeg" alt="Drawing" style="width: 300px;"/>
Now our goal is to optimize the weights $w_{ji}$ by using a collection of training instances, $(x_1, x_2, y) \in X_{Tr}$.

We will need two things:
1. *Initialization of the weights:* It is common practice to assign the initial weights randomly. As an unlucky random start may result in a poor solution, in general we would try several sets of starting weights and run the optimization procedure from each of them.
2. *Computation of the gradients:* How do we, for example, compute $\dfrac{\partial (y - \hat{f}(x;W))^2}{\partial{w_{21}}}$, where $x = (x_1, \ x_2)$ and $y$ form a single training instance? For this, we will use the chain rule from calculus, which roughly states that for a composition $f(x) = g(h(x))$:
$$\dfrac{\partial f}{\partial x} = \dfrac{\partial g}{\partial h} \cdot \dfrac{\partial h}{\partial x}.$$
Using this iteratively is called the "backpropagation" step to compute the gradients.
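
As a small illustration of the chain rule at work, the sketch below differentiates the squared loss with respect to the output-layer weight that multiplies $r_1$ (value $-0.2$ in the worked example), and compares the result with a finite-difference estimate. The target `y = 1.0` and the names `v1`, `v2`, `h1`, `h2` are assumptions introduced for this illustration; the hidden outputs are taken from the forward pass computed earlier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h1, h2 = 5.3, 0.0     # hidden-layer outputs from the worked example
v1, v2 = -0.2, 0.4    # output-layer weights (hypothetical names)
y = 1.0               # assumed target, chosen only for illustration

def loss(v1_value):
    return (y - sigmoid(v1_value * h1 + v2 * h2)) ** 2

# Chain rule: dL/dv1 = dL/ds * ds/dz * dz/dv1, with z = v1*h1 + v2*h2 and s = sigmoid(z).
z = v1 * h1 + v2 * h2
s = sigmoid(z)
dL_ds = -2.0 * (y - s)
ds_dz = s * (1.0 - s)
dz_dv1 = h1
dL_dv1 = dL_ds * ds_dz * dz_dv1

# Independent check with a central finite difference.
eps = 1e-6
numeric = (loss(v1 + eps) - loss(v1 - eps)) / (2 * eps)
print(dL_dv1, numeric)
```

The two numbers should agree to several decimal places. The derivative is negative here, so increasing this weight would reduce the loss for the assumed target.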

In the next notebook, we will see how backpropagation works in practice to optimize the weights of a network. Furthermore, as we minimize the loss over a training set rather than for a single point, our loss will look like $$ \sum_{(x,y) \in X_{Tr}}(y - \hat{f}(x;W))^2. $$
Since it is costly to consider every instance $(x,y)\in X_{Tr}$ in each step of gradient descent, we will instead consider a random selection of them. This algorithm is named *stochastic gradient descent* and is the standard choice for optimization in neural networks.
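
To make the idea of using a random selection of instances concrete, here is a minimal sketch of mini-batch stochastic gradient descent. It is not applied to the network above: to stay self-contained it fits a simple linear model $\hat{y} = w \cdot x + b$ to synthetic, noise-free data. The data, the batch size, the step size, and the use of the mean (rather than the sum) of squared errors within a batch are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free training set generated from y = 3 x - 1.
x_train = rng.uniform(-1.0, 1.0, size=200)
y_train = 3.0 * x_train - 1.0

w, b = 0.0, 0.0
alpha, batch_size, epochs = 0.1, 10, 50

for _ in range(epochs):
    order = rng.permutation(x_train.size)   # reshuffle once per pass over the data
    for start in range(0, x_train.size, batch_size):
        idx = order[start:start + batch_size]
        residual = y_train[idx] - (w * x_train[idx] + b)
        # Gradient of the mean squared error over the sampled batch only.
        grad_w = -2.0 * np.mean(residual * x_train[idx])
        grad_b = -2.0 * np.mean(residual)
        w -= alpha * grad_w
        b -= alpha * grad_b

print(w, b)
```

Each update touches only 10 of the 200 instances, yet the weights should still approach the generating values $w = 3$ and $b = -1$.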