# **Lab 7: Optimization and learning**
**Dániel Szabó**

# **Abstract**

This laboratory is about approximating the solutions of minimization problems. In a minimization problem, we are given a function $f: D\to\mathbb{R}$, where $D\subseteq\mathbb{R}^n$ is the search space. The goal is to find a point $x^*\in D$ that satisfies $f(x^*)\le f(x)$ for all $x\in D$. In this report, two methods are presented for approximating such a solution: the first one is gradient descent, and the other one is Newton's method.

# **About the code**

In [48]:
"""DD2363 Methods in Scientific Computing, """
"""KTH Royal Institute of Technology, Stockholm, Sweden."""

# Copyright (C) 2021 Dániel Szabó (dszabo@kth.se)

# This file is part of the course DD2363 Methods in Scientific Computing
# KTH Royal Institute of Technology, Stockholm, Sweden
#
# This is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This template is maintained by Johan Hoffman
# Please report problems to jhoffman@kth.se


# **Set up environment**

In [49]:
# Load necessary modules.
import numpy as np
import random

random.seed(0)

# **Introduction**

The first task is to implement the gradient descent method, and the second one is to implement Newton's method.

Gradient descent (Algorithm 15.1 in the lecture notes) is an iterative method for finding a local minimum of an input function $f$. In each iteration, the next iterate is obtained by moving from the current point in the direction opposite to the gradient of $f$ evaluated at that point:
$$x^{(k+1)}=x^{(k)}-\alpha\nabla f(x^{(k)})$$
where $\alpha$ is the step size that may depend on $f, \nabla f(x^{(k)})$ and $x^{(k)}$. Note that an initial guess $x^{(0)}$ is needed. The iterations stop when the norm of the gradient is small enough.
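
The update rule above can be sketched with a fixed step size, which is a simplification of the variable step size described in the text. The quadratic `q`, its hand-written gradient `grad_q`, the function name `gradient_descent_fixed_step` and the `max_iter` safeguard are assumptions made only for this illustration.

```python
import numpy as np

def gradient_descent_fixed_step(grad_f, x0, alpha=0.1, eps=1e-7, max_iter=10_000):
    """Iterate x <- x - alpha * grad_f(x) until the gradient norm is at most eps."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - alpha * g
    return x

# Hypothetical test function q(x) = (x_1 - 1)^2 + 2 (x_2 + 3)^2, minimum at (1, -3).
def grad_q(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x_min = gradient_descent_fixed_step(grad_q, [0.0, 0.0])
print(x_min)
```

With $\alpha=0.1$ each step shrinks the error in the two coordinates by the factors $0.8$ and $0.6$, so the iterates approach $(1,-3)$ geometrically.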

Newton's method (Algorithm 15.3 in the lecture notes) is also an iterative method, but in addition to the first order derivatives (i.e. the gradient) of $f$ it uses the second order derivatives, which are the elements of the Hessian matrix $Hf$. The iterations are as follows.
$$x^{(k+1)}=x^{(k)}-(Hf(x^{(k)}))^{-1}\nabla f(x^{(k)})$$
The iterations stop when the norm of the gradient is small enough.
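
As a minimal sketch of a single Newton iteration, the example below uses hand-written derivatives of a quadratic, for which one step lands on the minimum. The names `newton_step`, `grad_q` and `hess_q` and the quadratic itself are assumptions made for illustration; the step is obtained by solving a linear system rather than by forming the inverse of the Hessian.

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """Return x + dx, where dx solves Hf(x) dx = -grad f(x)."""
    dx = np.linalg.solve(hess_f(x), -grad_f(x))
    return x + dx

# Hypothetical test function q(x) = (x_1 - 1)^2 + 2 (x_2 + 3)^2, minimum at (1, -3).
def grad_q(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

def hess_q(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

x_new = newton_step(grad_q, hess_q, np.array([10.0, 10.0]))
print(x_new)
```

For a general function the Hessian changes from point to point, so several such steps are needed.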

# **Method**

Method "grad" calculates the approximate gradient $\nabla f$ of the given function $f:\mathbb{R}^n\to \mathbb{R}$ at the given point $x\in\mathbb{R}^n$. It does so using a forward difference, i.e. the definition of the derivative with a finite step: for a small positive real number $h$, the $i$'th element of the gradient (for all $i\in[n]$) is approximated by $\frac{f(x+he_i)-f(x)}{h}$, where $e_i$ is the $i$'th standard basis vector of $\mathbb{R}^n$.

Method "hessian" calculates the approximate Hessian of the given function $f:\mathbb{R}^n\to \mathbb{R}$ at the given point $x\in\mathbb{R}^n$. It applies the same forward difference to the gradient: for a small positive real number $h$, the $i$'th row of the Hessian (for all $i\in[n]$) is approximated by the vector $\frac{\nabla f(x+he_i)-\nabla f(x)}{h}$, where $e_i$ is the $i$'th standard basis vector of $\mathbb{R}^n$ and the gradients are themselves computed by method "grad".
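
For comparison, a central difference is a common alternative to the forward difference used in this report; its truncation error is of order $h^2$ instead of $h$. The sketch below is not part of the lab code: the name `grad_central`, the default `h=1e-6` and the test function `q` are assumptions chosen for illustration.

```python
import numpy as np

def grad_central(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    x = np.array(x, dtype=float)
    n = x.size
    ret = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h  # step of length h along the i-th coordinate
        ret[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return ret

# Hypothetical test function with a known gradient (3 x_1^2, cos(x_2)).
def q(x):
    return x[0]**3 + np.sin(x[1])

approx = grad_central(q, [1.0, 0.5])
exact = np.array([3.0, np.cos(0.5)])
print(np.abs(approx - exact))
```

The price is two function evaluations per coordinate instead of one (plus the shared $f(x)$) for the forward difference.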

In [50]:
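def grad(f, x, h=1e-8):
    """Forward-difference approximation of the gradient of f at x."""
    x = np.array(x, dtype=float)
    fx = f(x)  # value at the base point, reused for every coordinate
    if not np.isscalar(fx):
        raise Exception("The function should return a scalar value.")

    n = np.shape(x)[0]
    ret = np.zeros(n)
    for i in range(n):
        # Perturb only the i-th coordinate by h.
        x1 = x.copy()
        x1[i] += h
        ret[i] = (f(x1) - fx) / h
    return ret

def hessian(f, x, h=1e-4):
    """Forward-difference approximation of the Hessian of f at x."""
    x = np.array(x, dtype=float)
    if not np.isscalar(f(x)):
        raise Exception("The function should return a scalar value.")

    n = np.shape(x)[0]
    ret = np.zeros((n, n))
    gradx = grad(f, x)  # gradient at the base point, computed once
    for i in range(n):
        x1 = x.copy()
        x1[i] += h
        # Row i is the difference quotient of the gradient along e_i.
        ret[i, :] = (grad(f, x1) - gradx) / h
    return ret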

Method "get_alpha" calculates the step length for the gradient descent method. It is a simple grid search that receives $f$, $\nabla f(x)$ and $x$ as input, evaluates $f(x-\alpha\nabla f(x))$ for equally spaced candidate values $\alpha\in[0,1)$ (by default 100 values with spacing $0.01$), and returns the candidate that gives the smallest function value.

Method "gradient_descent" works as described in the Introduction section and in Algorithm 15.1 of the lecture notes. The gradient is calculated by our method "grad". In some cases, because of finite precision, the line search returns $\alpha=0$ (no candidate step decreases $f$) although the gradient is not small enough to stop the iterations. The iterate would then never change, and the algorithm would end up in an infinite loop. To avoid this, each iteration checks whether $\alpha$ is zero; if it is, a message is printed and the current iterate is returned.

In [51]:
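def get_alpha(f, gradf, x, prec=1e-2):
    """Grid search for the step length alpha in [0, 1) with spacing prec."""
    x = np.asarray(x, dtype=float)
    min_value = f(x)  # value for alpha = 0
    if not np.isscalar(min_value):
        raise Exception("The function should return a scalar value.")
    if np.shape(x) != np.shape(gradf):
        raise Exception("The gradient and x should have the same shape.")

    min_alpha = 0
    for alpha in np.arange(0, 1, prec):
        fx = f(x - alpha*gradf)
        # Keep the candidate with the smallest function value seen so far.
        if fx < min_value:
            min_alpha, min_value = alpha, fx
    return min_alpha

def gradient_descent(f, x0, eps=1e-7):
    """Gradient descent with the grid line search of get_alpha."""
    if not np.isscalar(f(x0)):
        raise Exception("The function should return a scalar value.")

    x = np.array(x0, dtype=float)
    gradf = grad(f, x)
    while np.linalg.norm(gradf) > eps:
        alpha = get_alpha(f, gradf, x)
        if alpha == 0:
            # No candidate step decreases f, so x would never change.
            print("The gradient is not small enough, but alpha is 0")
            return x
        x = x - alpha*gradf
        # Update the gradient so the loop test uses the current point.
        gradf = grad(f, x)
    return x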

Method "newton" works as described in the Introduction section and in Algorithm 15.3 of the lecture notes. The gradient is calculated by our method "grad" and the Hessian by method "hessian". In some cases, because of finite precision, the approximate Hessian is numerically singular although the gradient is not small enough to stop the iterations. The linear system for the Newton step then has no unique solution, so the step cannot be computed. To handle this, each iteration checks whether $Hf(x^{(k)})$ has full rank; if it does not, a message is printed and the current iterate is returned.

In [52]:
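def newton(f, x0, eps=1e-5):
    """Newton's method with finite-difference gradient and Hessian."""
    if not np.isscalar(f(x0)):
        raise Exception("The function should return a scalar value.")

    # Work on a float array copy: "x += dx" on a Python list would
    # extend the list with the entries of dx instead of adding them.
    x = np.array(x0, dtype=float)
    gradf = grad(f, x)
    while np.linalg.norm(gradf) > eps:
        hessf = hessian(f, x)
        if np.linalg.matrix_rank(hessf) < len(x):
            print("The gradient is not small enough, but the Hessian is singular")
            return x
        # Solve Hf(x) dx = -grad f(x) instead of forming the inverse.
        x = x + np.linalg.solve(hessf, -gradf)
        # Update the gradient so the loop test uses the current point.
        gradf = grad(f, x)
    return x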

# **Results**

For the verification of the tasks we need a function $f:\mathbb{R}^n\to\mathbb{R}$. Method "f" can implement an arbitrary function; here it is $f(x)=\sin(x_1)/x_1 + (x_2-c)^2$, where $c$ is a random number drawn from the standard normal distribution. The exact minimum point $x^*$ satisfies $\tan(x_1)=x_1$, so $x_1\approx\pm4.4934094579090641753$, and $x_2=c$. The tests start from $x_1=5$ and therefore compare against the positive value.
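
The quoted value of $x_1$ can be cross-checked independently of the two methods. The sketch below applies plain bisection to the numerator of $\frac{d}{dx}\frac{\sin x}{x}=\frac{x\cos x-\sin x}{x^2}$ on the interval $(\pi, 3\pi/2)$, where it changes sign exactly once. The helper names `stationarity` and `bisect` are assumptions made for this illustration.

```python
import math

def stationarity(x):
    """Numerator of the derivative of sin(x)/x; it vanishes where tan(x) = x."""
    return x * math.cos(x) - math.sin(x)

def bisect(func, a, b, tol=1e-12):
    """Bisection for a root of func on [a, b], assuming a sign change."""
    fa = func(a)
    if fa * func(b) > 0:
        raise ValueError("The function must change sign on [a, b].")
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = func(m)
        if fa * fm <= 0:
            b = m          # the root lies in [a, m]
        else:
            a, fa = m, fm  # the root lies in [m, b]
    return 0.5 * (a + b)

root = bisect(stationarity, math.pi, 1.5 * math.pi)
print(root)
```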

The other example, "g", is from Exercise 15.7 of the lecture notes: $g(x)=2+x_1^4+(1+x_2)^2$. The exact solution is $x_1=0$, $x_2=-1$.

A faulty function "f_fail" is also provided for testing the input validation: it returns a vector, not a scalar value.

The results show that the approximations are generally close to the exact solutions. The error is larger in the first coordinate of the solution to $g$, which is why the tests for $g$ only require agreement to 2 decimals. The reason is that the derivative of $x_1^4$, namely $4x_1^3$, is very small when $x_1$ is close to 0, so the stopping criterion on the gradient norm is already satisfied at some distance from the minimum (if we have $x_1^2$ instead of $x_1^4$ as the second term of $g(x)$, then the results are much better).
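
The effect can be made concrete by writing out the derivatives of $g$ by hand. The following is a sketch that assumes exact derivatives and ignores the finite-difference errors of "grad" and "hessian".
$$\nabla g(x)=\begin{pmatrix}4x_1^3\\ 2(1+x_2)\end{pmatrix},\qquad Hg(x)=\begin{pmatrix}12x_1^2 & 0\\ 0 & 2\end{pmatrix}$$
With $x_2=-1$, the stopping test $\|\nabla g(x)\|\le\varepsilon$ reduces to $4|x_1|^3\le\varepsilon$, that is
$$|x_1|\le\left(\frac{\varepsilon}{4}\right)^{1/3}\approx\begin{cases}2.9\cdot 10^{-3} & \text{for }\varepsilon=10^{-7},\\ 1.4\cdot 10^{-2} & \text{for }\varepsilon=10^{-5},\end{cases}$$
so under these assumptions the iterations may stop while $x_1$ is still that far from 0 (the two values of $\varepsilon$ are the defaults of "gradient_descent" and "newton"). For Newton's method, the first coordinate is updated as
$$x_1^{(k+1)}=x_1^{(k)}-\frac{4\big(x_1^{(k)}\big)^3}{12\big(x_1^{(k)}\big)^2}=\frac{2}{3}\,x_1^{(k)},$$
which is only linear convergence with factor $2/3$; moreover, $Hg$ is singular at $x_1=0$.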

In [53]:
c = random.gauss(0,1)

def f(x):
    return np.sin(x[0])/x[0] + (x[1]-c)**2

def g(x):
    return 2+x[0]**4+(1+x[1])*(1+x[1])

def f_fail(x):
    return [x[0],x[1]]


exactf = [4.4934094579090641753, c]
sol_f1 = gradient_descent(f, [5,1])
sol_f2 = newton(f,[5,1])
np.testing.assert_array_almost_equal(exactf, sol_f1, decimal=6)
np.testing.assert_array_almost_equal(exactf, sol_f2, decimal=6)
# print(exactf)
# print(sol_f1)
# print(sol_f2)

exactg = [0, -1]
sol_g1 = gradient_descent(g, [1,1])
sol_g2 = newton(g,[1,1])
np.testing.assert_array_almost_equal(exactg, sol_g1, decimal=2)
np.testing.assert_array_almost_equal(exactg, sol_g2, decimal=2)
# print(exactg)
# print(sol_g1)
# print(sol_g2)

# Invalid input test: wrong function
np.testing.assert_raises(Exception, gradient_descent, f_fail, [1,1])

# **Discussion**

The implemented methods solve the tasks, as confirmed by the test results.

It is possible to test the methods on arbitrary functions that return a scalar value, even $\mathbb{R}\to\mathbb{R}$ functions, but in that case the initial value must be given as an array of length 1.

The initial guess $x^{(0)}$ plays an important role. First, if it is chosen poorly, the algorithms may find only a local minimum or no minimum at all. Second, it tells the algorithms the dimension of the input space, which is the number of elements in $x^{(0)}$.

The precision of both methods depends on the function that is to be minimized. For example, when the derivative is very small, it may be difficult to use the gradient for finding the minimum exactly, as computations with tiny numbers are likely to be inaccurate. This is the reason why the default value of the small constant $h$ in method "hessian" was chosen to be relatively large: with a smaller value, the Hessian of $g$ became even less accurate.
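
The trade-off in the choice of $h$ can be illustrated on a one-dimensional forward difference: a large $h$ gives a large truncation error, while a very small $h$ amplifies rounding errors. The sketch below uses $\sin$ at $x=1$ as an assumed test case; the helper name `forward_diff` and the list of step sizes are chosen only for illustration, and the exact errors printed depend on the machine arithmetic.

```python
import numpy as np

def forward_diff(f, x, h):
    """Forward-difference approximation of f'(x) with step h."""
    return (f(x + h) - f(x)) / h

steps = [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]
# Absolute error against the exact derivative cos(1).
errors = {h: abs(forward_diff(np.sin, 1.0, h) - np.cos(1.0)) for h in steps}

for h, err in errors.items():
    print(f"h = {h:.0e}   error = {err:.2e}")
```

Since "hessian" differences gradients that already carry such an error, that error is divided by the outer $h$ once more, which is consistent with the larger default step chosen there.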

One could use an adaptive method for choosing $\alpha$: first search with larger steps, and then refine the intervals that are likely to contain the best value.
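
One possible form of such a refinement is a coarse-to-fine grid search, sketched below. It is not part of the lab code and assumes that $\varphi(\alpha)=f(x-\alpha\nabla f(x))$ has a single minimum on the search interval. The name `refine_alpha`, its parameters and the quadratic test case are assumptions made for illustration.

```python
import numpy as np

def refine_alpha(phi, lo=0.0, hi=1.0, points=11, levels=4):
    """Coarse-to-fine grid search for a minimizer of phi on [lo, hi]."""
    for _ in range(levels):
        grid = np.linspace(lo, hi, points)
        values = np.array([phi(a) for a in grid])
        k = int(np.argmin(values))
        # Keep the two grid cells around the best point for the next level.
        lo = grid[max(k - 1, 0)]
        hi = grid[min(k + 1, points - 1)]
    return grid[k]

# Hypothetical case: f(x) = x_1^2 + 10 x_2^2 at x = (1, 1), gradient (2, 20).
x = np.array([1.0, 1.0])
gradf = np.array([2.0, 20.0])

def phi(alpha):
    y = x - alpha * gradf
    return y[0]**2 + 10.0 * y[1]**2

alpha = refine_alpha(phi)
print(alpha)
```

With these defaults the final grid spacing is $8\cdot10^{-4}$ after $4\cdot 11=44$ evaluations of $\varphi$, whereas a single uniform grid with the same spacing on $[0,1]$ would need 1251 evaluations.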