# **Lab 6: Optimization and learning**
**Matteus Berg**

# **Abstract**

In this lab we implement the gradient descent method for the unconstrained minimization problem. The accuracy and convergence of the method are verified by testing. The tests show that the implementation correctly approximates the minimum points for tolerances of $10^{-3}$ and $10^{-4}$, but diverges for a tolerance of $10^{-2}$.

# **About the code**

A short statement on who is the author of the file, and if the code is distributed under a certain license.

In [None]:
"""This program is a template for lab reports in the course"""
"""DD2363 Methods in Scientific Computing, """
"""KTH Royal Institute of Technology, Stockholm, Sweden."""

# Copyright (C) 2020 Johan Hoffman (jhoffman@kth.se)

# This file is part of the course DD2365 Advanced Computation in Fluid Mechanics
# KTH Royal Institute of Technology, Stockholm, Sweden
#
# This is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This template is maintained by Johan Hoffman
# Please report problems to jhoffman@kth.se

'KTH Royal Institute of Technology, Stockholm, Sweden.'

# **Set up environment**

To have access to the necessary modules you have to run this cell. If you need additional modules, this is where you add them.

In [None]:
# Load necessary modules.
from google.colab import files

import time
import numpy as np
import math

#try:
#    from dolfin import *; from mshr import *
#except ImportError as e:
#    !apt-get install -y -qq software-properties-common
#    !add-apt-repository -y ppa:fenics-packages/fenics
#    !apt-get update -qq
#    !apt install -y --no-install-recommends fenics
#    from dolfin import *; from mshr import *

#import dolfin.common.plotting as fenicsplot

from matplotlib import pyplot as plt
from matplotlib import tri
from matplotlib import axes
from mpl_toolkits.mplot3d import Axes3D

# **Introduction**

This lab report implements the gradient descent method for the unconstrained minimization problem. The minimization problem consists of finding a minimum point of a function $f$, known as the objective function. The problem is unconstrained when the search space is all of $R^n$ rather than a subset of $R^n$. Gradient descent is an iterative method. The iteration equation is described on page 327 of the book as

$x^{(k+1)} = x^{(k)} - \alpha^{(k)}\nabla f(x^{(k)})$

This report implements the gradient descent pseudocode that can be found on page 327 of the book. The implementation supports functions $f : D \rightarrow R$, where $D \subseteq R^n$. That is, the function may have a vector as input, but its output is always scalar. Gradient descent should stop iterating at a point $x^*$ whose gradient has a magnitude smaller than the tolerance. That is, $|| \nabla f(x^*) || < TOL$.
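The iteration and stopping criterion above can be illustrated with a minimal sketch. It is not the implementation used in this report: it assumes an analytic gradient and a fixed step length, and the names `gradient_descent_fixed_step`, `grad_f2`, `alpha=0.1` and `max_iter` are hypothetical choices made only for this example.

```python
import numpy as np

def gradient_descent_fixed_step(grad_f, x0, alpha=0.1, tol=1e-3, max_iter=10_000):
    """Iterate x <- x - alpha * grad_f(x) until ||grad_f(x)|| < tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        # stopping criterion: gradient magnitude below the tolerance
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# f(x, y) = x^2 + y^2 + 2x + 4y + 5 has the analytic gradient (2x + 2, 2y + 4)
grad_f2 = lambda x: np.array([2 * x[0] + 2, 2 * x[1] + 4])
x_star = gradient_descent_fixed_step(grad_f2, [0.0, 0.0])
print(x_star)  # close to the exact minimizer [-1, -2]
```

A fixed step length only works here because the objective function is a well-scaled quadratic. In general the step length has to be adapted to the objective function.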

$\alpha$ should be defined so that the right step length is taken. Taking too small steps makes the method slow, since many iterations are needed. Taking too large steps makes the method overshoot the critical point. Calculating the correct step length is a difficult problem, and in general the calculation needs to be adjusted depending on what the objective function looks like. For this implementation, $\alpha$ is calculated using an iterative line search method. The line search iterates until its search interval is smaller than the tolerance. The number of line search iterations determines the size of $\alpha$ for that gradient descent step.
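The effect of the step length can be seen in a small experiment on $f(x) = x^2$, where each step multiplies $x$ by $(1 - 2\alpha)$. The sketch below is illustrative only: the function name `iterate_quadratic` and the three $\alpha$ values are assumptions picked for the example, not values produced by the line search in this report.

```python
def iterate_quadratic(alpha, x0=15.0, steps=20):
    """Apply x <- x - alpha * f'(x) for f(x) = x^2, i.e. f'(x) = 2x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

# too small: barely moves; moderate: converges; too large: overshoots and grows
for alpha in (0.001, 0.25, 1.3):
    print(alpha, iterate_quadratic(alpha))
```

For this particular function the iteration shrinks $|x|$ only when $0 < \alpha < 1$.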


# **Method**


### Gradient calculation
The gradient $\nabla f(x)$ for each step is calculated by the compute_gradient function. Let $p_0 = (x_1, ..., x_n)$ be the point where we want the gradient. First we determine a step size $step$ for the finite difference, equal to $TOL \cdot 0.1$. For each dimension $d$ of the $n$ dimensions, we compute three points $p_1$, $p_2$, $p_3$: $p_1 = (x_1, ..., x_d - step, ..., x_n)$, $p_2 = (x_1, ..., x_d, ..., x_n)$, $p_3 = (x_1, ..., x_d + step, ..., x_n)$. Observe that $p_2$ is always equal to $p_0$. Using these three points we calculate the partial derivative with respect to $x_d$ at $p_0$ using the central difference method. The partial derivatives are then collected into the complete $n$-dimensional gradient array $\nabla f(x)$.
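The central difference for one dimension is $(f(p_3) - f(p_1)) / (2 \cdot step)$, so the midpoint value $f(p_2)$ does not enter the formula. A compact sketch of the same idea is given below. It is a simplified variant, not the compute_gradient function of this report, and the name `central_difference_gradient` and the default `step=1e-4` are assumptions for illustration.

```python
import numpy as np

def central_difference_gradient(f, x0, step=1e-4):
    """Approximate the gradient of a scalar function f at x0, one dimension at a time."""
    x0 = np.asarray(x0, dtype=float)
    grad = np.zeros(x0.size)
    for d in range(x0.size):
        e = np.zeros(x0.size)
        e[d] = step  # perturb only coordinate d
        grad[d] = (f(x0 + e) - f(x0 - e)) / (2 * step)
    return grad

f2 = lambda x: x[0]**2 + x[1]**2 + 2*x[0] + 4*x[1] + 5
print(central_difference_gradient(f2, [0.0, 0.0]))  # approximately [2, 4]
```

For quadratic functions such as `f2` the central difference is exact up to rounding errors, which makes them convenient test cases.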


### $\alpha$-calculation
$\alpha$ is calculated in the line_search function. We want to search for the critical point $p_c$ from the starting point $p_0$. In each iteration, three points are present:
1. $s_1$ the lower bound of the interval
2. $s_2$ the upper bound of the interval
3. $s_3$ the midpoint between $s_1$ and $s_2$

"The interval" refers to the search interval of the method. It is an $n$-dimensional array, where each element is the interval length for one of the input variables of $f$. Before the first iteration, we need a starting search interval that captures the critical point; the method only converges if it does. The starting interval is obtained by setting each interval element to its respective gradient value in $p_0$: $interval = \nabla f(p_0)$. Now the iterations can begin. For each iteration:
1. For every dimension $i$:
    1. if gradient is negative
        1. set $s_2[i]$ to $s_1[i] + interval[i]$
    2. if gradient is positive or zero
        1. set $s_2[i]$ to $s_1[i]$
        2. set $s_1[i]$ to $s_1[i] - interval[i]$
2. calculate $s_3$
3. Out of the three points, pick the two points with lowest function values and set them to $s_1$ and $s_2$:
```python
if f(s1) > f(s3):
  s1 = s3
if (not np.array_equal(s1, s3)) and (f(s2) > f(s3)):
  s2 = s3
```

4. update gradient to $\nabla f(s_1)$
5. update interval to $|s_2 - s_1|$
6. increment iterations variable

The algorithm stops iterating when the norm of the interval $s_2 - s_1$ is no longer larger than the tolerance. Whenever one of the end points is replaced by the midpoint, the search interval is halved. If this happens in every iteration, the number of iterations is $O(\log_2(L / TOL))$, where $L$ is the length of the starting interval. The returned value is $\alpha = 10 \cdot TOL \cdot numIterations$, so the higher the iteration count, the higher the $\alpha$ value returned. In short, the further away $p_0$ is from the critical point, the higher the $\alpha$ value will be.
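The logarithmic iteration count can be checked with a one-dimensional sketch. Note that this swaps in plain bisection on the sign of the derivative, which is simpler than the line_search function of this report. The name `halve_interval` and the interval $[-15, 15]$ are assumptions chosen for the example.

```python
import math

def halve_interval(df, lo, hi, tol=1e-3):
    """Bisect [lo, hi] on the sign of the derivative df until hi - lo <= tol."""
    iterations = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if df(mid) > 0:
            hi = mid  # the minimum lies to the left of mid
        else:
            lo = mid  # the minimum lies at or to the right of mid
        iterations += 1
    return 0.5 * (lo + hi), iterations

df = lambda x: 2 * x  # derivative of f(x) = x^2
x_min, iterations = halve_interval(df, -15.0, 15.0)
# compare the observed count with ceil(log2(L / TOL)) for L = 30
print(x_min, iterations, math.ceil(math.log2(30.0 / 1e-3)))
```

Because the interval length is exactly halved in each pass, the observed iteration count matches $\lceil \log_2(L / TOL) \rceil$ for this example.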

In [None]:
# tolerance
TOL = 0.001

# implemented pseudocode from page 327 of the book
def gradient_descent(f, x0):
  x = x0
  Df = compute_gradient(f, x)
  while np.linalg.norm(Df) > TOL:
    Df = compute_gradient(f, x)
    alpha = line_search(f, x)
    #print("x = x - alpha*Df :", x - alpha*Df, "=", x, "-", alpha, "*", Df)
    x = x - alpha*Df
  return x

# line interval search
# this algorithm only converges if it manages to
# capture a critical point in its starting interval
def line_search(f, x):
  n = x.size
  interval = np.zeros(n)
  numIterations = 1
  s1 = np.zeros(n)
  s2 = np.zeros(n)
  s3 = np.zeros(n)
  grad = compute_gradient(f, x)
  interval = grad

  for i in range(n):
    if(grad[i] < 0):
      s1[i] = x[i]
      s2[i] = x[i] + interval[i]
    else:
      s2[i] = x[i]
      s1[i] = x[i] - interval[i]

  while np.linalg.norm(s2 - s1) > TOL:
    numIterations = numIterations + 1
    grad = compute_gradient(f, s1)
    interval = abs(s2 - s1)
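    for i in range(n):
      if(grad[i] < 0):
        # negative gradient: keep s1 and place s2 one interval above it
        s2[i] = s1[i] + interval[i]
      else:
        # positive or zero gradient: shift the interval one length downwards
        s2[i] = s1[i]
        s1[i] = s1[i] - interval[i]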
    s3 = (s1+s2)/2

    if f(s1) > f(s3):
      s1 = s3
    if (not np.array_equal(s1,s3)) and (f(s2) > f(s3)):
      s2 = s3

  # numIterations starts at 1, so even if the while loop never runs
  # the returned step length is nonzero
  return numIterations*TOL*10


def compute_gradient(f, x0):
  alpha = TOL*0.1
  n = x0.size
  # n trios of points ; one out of three points
  f_values = np.zeros((n, 3))
  # (n trios of points ; one out of three points ; one coordinate in one point)
  x = np.zeros((n, 3, n))
  # n gradient values
  grad = np.zeros(n)
  diffx = np.array([x0-alpha, x0, x0+alpha])
  # generate x-values
  for i in range(n):
    for j in range(3):
      for k in range(n):
        if i == k:
          x[i][j][k] = diffx[j][i]
        else:
          x[i][j][k] = x0[k]

  #print(x)
  # generate function values
  for i in range(n):
    for j in range(3):
      f_values[i][j] = f(x[i][j])

  # calculate gradient with central difference
  for i in range(n):
    # creates approximated gradient for three points. We are only interested
    # in the middle point
    gradients = np.gradient(f_values[i], alpha)
    # give middle point (xi) as gradient value to grad
    grad[i] = gradients[1]

  # return gradient
  return grad



# **Results**

In [None]:
x1 = np.array([15])
f1 = lambda x : x*x
x2 = np.array([0, 0])
f2 = lambda x : x[0]*x[0] + x[1]*x[1] + 2*x[0] + 4*x[1] + 5

TOL = 0.01
print("Testing gradient descent for f in R1. Tolerance=0.01")
print("x0 = 15 ; f(x) = x^2")
print("Expected result:", 0)
print("Actual result:", gradient_descent(f1, x1))
TOL = 0.001
print(" ")
print("Testing gradient descent for f in R1. Tolerance=0.001")
print("x0 = 15 ; f(x) = x^2")
print("Expected result:", 0)
print("Actual result:", gradient_descent(f1, x1))
TOL = 0.0001
print(" ")
print("Testing gradient descent for f in R1. Tolerance=0.0001")
print("x0 = 15 ; f(x) = x^2")
print("Expected result:", 0)
print("Actual result:", gradient_descent(f1, x1))
TOL = 0.001
print(" ")
print("Testing gradient descent for f in R2. Tolerance=0.001")
print("x0 = [0, 0] ; f(x,y) = x^2 + y^2 + 2x + 4y + 5")
print("Expected result: [", -1, ", ", -2, "]")
print("Actual result:", gradient_descent(f2, x2))


Testing gradient descent for f in R1. Tolerance=0.01
x0 = 15 ; f(x) = x^2
Expected result: 0
Actual result: [-2.59929314e+14]
 
Testing gradient descent for f in R1. Tolerance=0.001
x0 = 15 ; f(x) = x^2
Expected result: 0
Actual result: [0.00048515]
 
Testing gradient descent for f in R1. Tolerance=0.0001
x0 = 15 ; f(x) = x^2
Expected result: 0
Actual result: [4.98484388e-05]
 
Testing gradient descent for f in R2. Tolerance=0.001
x0 = [0, 0] ; f(x,y) = x^2 + y^2 + 2x + 4y + 5
Expected result: [ -1 ,  -2 ]
Actual result: [-0.99978609 -1.99957217]


Running the results code, we see that for $TOL = 10^{-3}$ and $TOL = 10^{-4}$ the returned points lie within the tolerance of the exact minimum points. This holds both for a function of one variable and for a function of two variables. When the tolerance is raised to $10^{-2}$, the method diverges and returns a value of order $10^{14}$ instead of $0$.
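As a sanity check, the reported points can be compared against the stopping criterion $|| \nabla f(x^*) || < TOL$ using the analytic gradients of the two test functions. The sketch below only reuses the numbers printed above. The list name `reported` and its layout are assumptions for this example.

```python
import numpy as np

# (tolerance, reported point, analytic gradient of the test function)
reported = [
    (1e-3, np.array([0.00048515]), lambda x: 2 * x),
    (1e-4, np.array([4.98484388e-05]), lambda x: 2 * x),
    (1e-3, np.array([-0.99978609, -1.99957217]),
     lambda x: np.array([2 * x[0] + 2, 2 * x[1] + 4])),
]

for tol, x, grad in reported:
    norm = np.linalg.norm(grad(x))
    # the last column states whether the stopping criterion holds at x
    print(tol, norm, norm < tol)
```

All three converged results satisfy the stopping criterion, which is what the while loop in gradient_descent is meant to ensure.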

# **Discussion**

The algorithm behaved as expected and converged for both $D = R^1$ and $D = R^2$ at the tested tolerances $TOL \leq 10^{-3}$. However, when the tolerance is raised to $10^{-2}$ the method diverges. This is most probably because the line_search function calculates too high $\alpha$ values when $TOL$ is that high, since the returned $\alpha$ is proportional to $TOL$. A remedy would be to modify line_search so that $\alpha$ does not depend as much on the value of $TOL$.
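
A rough estimate for $f(x) = x^2$ supports this explanation. It is an estimate under assumptions, not a measured result: it assumes that the line search starts from an interval of length $|\nabla f(15)| = 30$ and halves it in every iteration, and it only considers the first gradient descent step. For this function one step gives

$x^{(k+1)} = x^{(k)} - \alpha \cdot 2x^{(k)} = (1 - 2\alpha) x^{(k)}$

so $|x|$ shrinks only if $|1 - 2\alpha| < 1$, that is $0 < \alpha < 1$. The code returns $\alpha = 10 \cdot TOL \cdot numIterations$, where $numIterations$ starts at 1, so approximately $numIterations \approx 1 + \lceil \log_2(30 / TOL) \rceil$:

$TOL = 10^{-2}: \quad numIterations \approx 1 + 12 = 13, \quad \alpha \approx 1.3 > 1$

$TOL = 10^{-3}: \quad numIterations \approx 1 + 15 = 16, \quad \alpha \approx 0.16 < 1$

Under these assumptions the first step with $TOL = 10^{-2}$ already overshoots the minimum and increases $|x|$, while the step with $TOL = 10^{-3}$ does not. This is consistent with the divergence seen in the results.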