Gradient Descent
================

The Gradient Descent (GD) algorithm is used to solve optimization problems of the form

\min_{x \in X} f(x)

where f is convex and (at least) once differentiable.

The updating rule for GD is described below. Let x^{(i)} denote the candidate solution vector at stage i of the algorithm.

  1. Compute the descent direction d^{(i)} using one of the methods described below.
  2. Update the candidate solution vector using:

    x^{(i+1)} = x^{(i)} - d^{(i)}
    

The algorithm stops when at least one of the following conditions is met:

  1. the norm of the gradient vector, \| \nabla f \|, is less than grad_err_tol;
  2. the relative change between x^{(i+1)} and x^{(i)} is less than rel_sol_change_tol;
  3. the total number of iterations exceeds iter_max.
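
As a rough illustration of this scheme, here is a minimal sketch (using Eigen types; this is not the library's implementation) of the loop and stopping checks:

.. code:: cpp

    #include <Eigen/Dense>
    #include <cstddef>
    #include <functional>

    // Hypothetical skeleton of the GD iteration described above. 'objfn' returns f(x)
    // and writes the gradient into its second argument; 'alpha' plays the role of par_step_size.
    inline
    bool
    gd_loop_sketch(Eigen::VectorXd& x,
                   const std::function<double (const Eigen::VectorXd&, Eigen::VectorXd*)>& objfn,
                   const double alpha, const double grad_err_tol,
                   const double rel_sol_change_tol, const size_t iter_max)
    {
        const double eps = 1e-08; // small constant for numerical stability

        for (size_t iter = 0; iter < iter_max; ++iter) {
            Eigen::VectorXd grad(x.size());
            objfn(x, &grad);

            // stopping condition 1: norm of the gradient vector
            if (grad.norm() < grad_err_tol) {
                return true;
            }

            // step 1: descent direction; this is the vanilla GD rule (method 0),
            // and the other rules listed below replace this line
            Eigen::VectorXd d = alpha * grad;

            // step 2: x^(i+1) = x^(i) - d^(i)
            Eigen::VectorXd x_next = x - d;

            // stopping condition 2: relative change between x^(i+1) and x^(i)
            const double rel_change = ((x_next - x).array() / (x.array().abs() + eps)).abs().sum();

            x = x_next;

            if (rel_change < rel_sol_change_tol) {
                return true;
            }
        }

        return false; // stopping condition 3: iter_max reached
    }

The available rules for computing the descent direction d^{(i)} are listed below; a code sketch of several of them follows the list.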

  • gd_settings.method = 0 Vanilla GD:

    d^{(i)} = \alpha \times [ \nabla_x f( x^{(i)} ) ]
    

    where \alpha, the step size (also known as the learning rate), is set by par_step_size.

  • gd_settings.method = 1 GD with momentum:

    d^{(i)} = \mu \times d^{(i-1)} + \alpha \times [ \nabla_x f( x^{(i)} ) ]
    

    where \mu, the momentum parameter, is set by par_momentum.

  • gd_settings.method = 2 Nesterov accelerated gradient descent (NAG):

    d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla f( x^{(i)} -  \mu \times d^{(i-1)})
    
  • gd_settings.method = 3 AdaGrad:

    \begin{aligned}
    d^{(i)} &= [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{1}{\sqrt{v^{(i)}} + \epsilon} \\
    v^{(i)} &= v^{(i-1)} + [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ]
    \end{aligned}
    
  • gd_settings.method = 4 RMSProp:

    \begin{aligned}
    d^{(i)} &= [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{1}{\sqrt{v^{(i)}} + \epsilon} \\
    v^{(i)} &= \rho \times v^{(i-1)} + (1-\rho) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ]
    \end{aligned}
    
  • gd_settings.method = 5 AdaDelta:

    \begin{aligned}
    d^{(i)} &= [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{\sqrt{m^{(i)}} + \epsilon}{\sqrt{v^{(i)}} + \epsilon} \\
    m^{(i)} &= \rho \times m^{(i-1)} + (1-\rho) \times [ d^{(i-1)} ] \odot [ d^{(i-1)} ] \\
    v^{(i)} &= \rho \times v^{(i-1)} + (1-\rho) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ]
    \end{aligned}
    
  • gd_settings.method = 6 Adam (adaptive moment estimation) and AdaMax:

    \begin{aligned}
    m^{(i)} &= \beta_1 \times m^{(i-1)} + (1-\beta_1) \times [ \nabla_x f( x^{(i)} ) ] \\
    v^{(i)} &= \beta_2 \times v^{(i-1)} + (1-\beta_2) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ] \\
    & \ \ \ \ \ \ \hat{m} = \dfrac{m^{(i)}}{1 - \beta_1^i}, \ \ \hat{v} = \dfrac{v^{(i)}}{1 - \beta_2^i}
    \end{aligned}
    

    where m^{(0)} = \mathbf{0}, and \beta_1 and \beta_2 are set by par_adam_beta_1 and par_adam_beta_2, respectively.

    • If ada_max = false, then the descent direction is computed as

      d^{(i)} = \alpha \times \dfrac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}
      
    • If ada_max = true, then the updating rule for v^{(i)} is no longer based on the L_2 norm; instead

      v^{(i)} = \max \left\{ \beta_2 \times v^{(i-1)}, | \nabla_x f( x^{(i)} ) | \right\}
      

      The descent direction is computed using

      d^{(i)} = \alpha \times \dfrac{\hat{m}}{ v^{(i)} + \epsilon}
      
  • gd_settings.method = 7 Nadam (Nesterov-accelerated adaptive moment estimation) and NadaMax:

    \begin{aligned}
    m^{(i)} &= \beta_1 \times m^{(i-1)} + (1-\beta_1) \times [ \nabla_x f( x^{(i)} ) ] \\
    v^{(i)} &= \beta_2 \times v^{(i-1)} + (1-\beta_2) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ] \\
    & \ \ \hat{m} = \dfrac{m^{(i)}}{1 - \beta_1^i}, \ \ \hat{v} = \dfrac{v^{(i)}}{1 - \beta_2^i}, \ \ \hat{g} = \dfrac{ \nabla_x f(x^{(i)}) }{1 - \beta_1^i}
    \end{aligned}
    

    where m^{(0)} = \mathbf{0}, and \beta_1 and \beta_2 are set by par_adam_beta_1 and par_adam_beta_2, respectively.

    • If ada_max = false, then the descent direction is computed as

      d^{(i)} = \alpha \times [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{\beta_1 \hat{m} + (1 - \beta_1) \hat{g} }{\sqrt{\hat{v}} + \epsilon}
      
    • If ada_max = true, then the updating rule for v^{(i)} is no longer based on the L_2 norm; instead

      v^{(i)} = \max \left\{ \beta_2 \times v^{(i-1)}, | \nabla_x f( x^{(i)} ) | \right\}
      

      The descent direction is computed using

      d^{(i)} = \alpha \times [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{\beta_1 \hat{m} + (1 - \beta_1) \hat{g} }{v^{(i)} + \epsilon}
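
To make a few of these rules concrete, here is a minimal sketch (assuming Eigen types and scalar parameters named after the settings above; this is not the library source) of how the momentum (method 1), RMSProp (method 4), and Adam with ada_max = false (method 6) directions could be computed for a single iteration:

.. code:: cpp

    #include <Eigen/Dense>
    #include <cmath>

    // 'grad' is the gradient of f at x^(i); 'd_prev', 'm', and 'v' are the state carried
    // over from iteration i-1 (initialized to zero vectors); alpha, mu, rho, beta_1, and
    // beta_2 correspond to par_step_size, par_momentum, and the Adam beta parameters.

    // Method 1: GD with momentum
    Eigen::VectorXd
    momentum_direction(const Eigen::VectorXd& grad, const Eigen::VectorXd& d_prev,
                       const double alpha, const double mu)
    {
        return mu * d_prev + alpha * grad;
    }

    // Method 4: RMSProp
    Eigen::VectorXd
    rmsprop_direction(const Eigen::VectorXd& grad, Eigen::VectorXd& v,
                      const double rho, const double eps)
    {
        v = rho * v + (1.0 - rho) * grad.array().square().matrix();
        return (grad.array() / (v.array().sqrt() + eps)).matrix();
    }

    // Method 6: Adam (ada_max = false), at iteration i (starting from 1)
    Eigen::VectorXd
    adam_direction(const Eigen::VectorXd& grad, Eigen::VectorXd& m, Eigen::VectorXd& v,
                   const double alpha, const double beta_1, const double beta_2,
                   const double eps, const int i)
    {
        m = beta_1 * m + (1.0 - beta_1) * grad;
        v = beta_2 * v + (1.0 - beta_2) * grad.array().square().matrix();

        const Eigen::ArrayXd m_hat = m.array() / (1.0 - std::pow(beta_1, i));
        const Eigen::ArrayXd v_hat = v.array() / (1.0 - std::pow(beta_2, i));

        return (alpha * m_hat / (v_hat.sqrt() + eps)).matrix();
    }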
      

.. doxygenfunction:: gd(ColVec_t& init_out_vals, std::function<fp_t (const ColVec_t& vals_inp, ColVec_t* grad_out, void* opt_data)> opt_objfn, void* opt_data)
   :project: optimlib

.. doxygenfunction:: gd(ColVec_t& init_out_vals, std::function<fp_t (const ColVec_t& vals_inp, ColVec_t* grad_out, void* opt_data)> opt_objfn, void* opt_data, algo_settings_t& settings)
   :project: optimlib


The basic control parameters are:

  • fp_t grad_err_tol: the error tolerance value controlling how small the L_2 norm of the gradient vector \| \nabla f \| should be before 'convergence' is declared.

  • fp_t rel_sol_change_tol: the error tolerance value controlling how small the proportional change in the solution vector should be before 'convergence' is declared.

    The relative change is computed using:

    \left\| \dfrac{x^{(i)} - x^{(i-1)}}{ |x^{(i-1)}| + \epsilon } \right\|_1
    

    where \epsilon is a small number added for numerical stability.

  • size_t iter_max: the maximum number of iterations/updates before the algorithm exits.

  • bool vals_bound: whether the search space of the algorithm is bounded. If true, then

    • ColVec_t lower_bounds: defines the lower bounds of the search space.
    • ColVec_t upper_bounds: defines the upper bounds of the search space.

In addition to these:

  • int print_level: Set the level of detail for printing updates on optimization progress.
    • Level 0: Nothing (default).
    • Level 1: Print the current iteration count and error values.
    • Level 2: Level 1 plus the current candidate solution values, x^{(i+1)}.
    • Level 3: Level 2 plus the direction vector, d^{(i)}, and the gradient vector, \nabla_x f(x^{(i+1)}).
    • Level 4: Level 3 plus information about the chosen gradient descent rule.
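
These settings are passed through the second gd(...) overload documented above. As a rough sketch (assuming the parameter names above are members of optim::algo_settings_t and its nested gd_settings struct, and using the sphere_fn objective from the example below):

.. code:: cpp

    optim::algo_settings_t settings;

    // top-level control parameters
    settings.iter_max = 2000;
    settings.grad_err_tol = 1e-08;
    settings.rel_sol_change_tol = 1e-14;
    settings.print_level = 1;

    // gradient descent rule and its parameters
    settings.gd_settings.method = 6;              // Adam
    settings.gd_settings.par_step_size = 0.1;     // alpha
    settings.gd_settings.par_adam_beta_1 = 0.9;   // beta_1
    settings.gd_settings.par_adam_beta_2 = 0.999; // beta_2

    bool success = optim::gd(x, sphere_fn, nullptr, settings);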

Code to run a first example, minimizing the sphere function, is given below.

.. toggle-header::
    :header: **Armadillo (Click to show/hide)**

    .. code:: cpp

        #define OPTIM_ENABLE_ARMA_WRAPPERS
        #include "optim.hpp"

        inline
        double
        sphere_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
        {
            double obj_val = arma::dot(vals_inp,vals_inp);

            if (grad_out) {
                *grad_out = 2.0*vals_inp;
            }

            return obj_val;
        }

        int main()
        {
            const int test_dim = 5;

            arma::vec x = arma::ones(test_dim,1); // initial values (1,1,...,1)

            bool success = optim::gd(x, sphere_fn, nullptr);

            if (success) {
                std::cout << "gd: sphere test completed successfully." << "\n";
            } else {
                std::cout << "gd: sphere test completed unsuccessfully." << "\n";
            }

            arma::cout << "gd: solution to sphere test:\n" << x << arma::endl;

            return 0;
        }

.. toggle-header::
    :header: **Eigen (Click to show/hide)**

    .. code:: cpp

        #define OPTIM_ENABLE_EIGEN_WRAPPERS
        #include "optim.hpp"

        inline
        double
        sphere_fn(const Eigen::VectorXd& vals_inp, Eigen::VectorXd* grad_out, void* opt_data)
        {
            double obj_val = vals_inp.dot(vals_inp);

            if (grad_out) {
                *grad_out = 2.0*vals_inp;
            }

            return obj_val;
        }

        int main()
        {
            const int test_dim = 5;

            Eigen::VectorXd x = Eigen::VectorXd::Ones(test_dim); // initial values (1,1,...,1)

            bool success = optim::gd(x, sphere_fn, nullptr);

            if (success) {
                std::cout << "gd: sphere test completed successfully." << "\n";
            } else {
                std::cout << "gd: sphere test completed unsuccessfully." << "\n";
            }

            std::cout << "gd: solution to sphere test:\n" << x << std::endl;

            return 0;
        }


Code to run a second example, minimizing Booth's function, is given below.

.. toggle-header::
    :header: **Armadillo Code (Click to show/hide)**

    .. code:: cpp

        #define OPTIM_ENABLE_ARMA_WRAPPERS
        #include "optim.hpp"

        inline
        double
        booth_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
        {
            double x_1 = vals_inp(0);
            double x_2 = vals_inp(1);

            double obj_val = std::pow(x_1 + 2*x_2 - 7.0,2) + std::pow(2*x_1 + x_2 - 5.0,2);

            if (grad_out) {
                (*grad_out)(0) = 2*(x_1 + 2*x_2 - 7.0) + 2*(2*x_1 + x_2 - 5.0)*2;
                (*grad_out)(1) = 2*(x_1 + 2*x_2 - 7.0)*2 + 2*(2*x_1 + x_2 - 5.0);
            }

            return obj_val;
        }

        int main()
        {
            arma::vec x_2 = arma::zeros(2,1); // initial values (0,0)

            bool success_2 = optim::gd(x_2, booth_fn, nullptr);

            if (success_2) {
                std::cout << "gd: Booth test completed successfully." << "\n";
            } else {
                std::cout << "gd: Booth test completed unsuccessfully." << "\n";
            }

            arma::cout << "gd: solution to Booth test:\n" << x_2 << arma::endl;

            return 0;
        }

.. toggle-header::
    :header: **Eigen Code (Click to show/hide)**

    .. code:: cpp

        #define OPTIM_ENABLE_EIGEN_WRAPPERS
        #include "optim.hpp"

        inline
        double
        booth_fn(const Eigen::VectorXd& vals_inp, Eigen::VectorXd* grad_out, void* opt_data)
        {
            double x_1 = vals_inp(0);
            double x_2 = vals_inp(1);

            double obj_val = std::pow(x_1 + 2*x_2 - 7.0,2) + std::pow(2*x_1 + x_2 - 5.0,2);

            if (grad_out) {
                (*grad_out)(0) = 2*(x_1 + 2*x_2 - 7.0) + 2*(2*x_1 + x_2 - 5.0)*2;
                (*grad_out)(1) = 2*(x_1 + 2*x_2 - 7.0)*2 + 2*(2*x_1 + x_2 - 5.0);
            }

            return obj_val;
        }

        int main()
        {
            Eigen::VectorXd x_2 = Eigen::VectorXd::Zero(2); // initial values (0,0)

            bool success_2 = optim::gd(x_2, booth_fn, nullptr);

            if (success_2) {
                std::cout << "gd: Booth test completed successfully." << "\n";
            } else {
                std::cout << "gd: Booth test completed unsuccessfully." << "\n";
            }

            std::cout << "gd: solution to Booth test:\n" << x_2 << std::endl;

            return 0;
        }