The Gradient Descent (GD) algorithm is used to solve optimization problems of the form
\min_{x \in X} f(x)
where f is convex and (at least) once differentiable.
The updating rule for GD is described below. Let x^{(i)} denote the candidate solution vector at stage i of the algorithm.
- Compute the descent direction d^{(i)} using one of the methods described below.
- Update the candidate solution vector using:
x^{(i+1)} = x^{(i)} - d^{(i)}
The algorithm stops when at least one of the following conditions is met:
- the norm of the gradient vector, \| \nabla f \|, is less than grad_err_tol;
- the relative change between x^{(i+1)} and x^{(i)} is less than rel_sol_change_tol;
- the total number of iterations exceeds iter_max.
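As a rough illustration of the overall iteration, the following sketch combines the update rule with the three stopping conditions, using the vanilla GD direction from method 0 below. The function gd_sketch and its signature are hypothetical, not part of the OptimLib API.

.. code:: cpp

    #include <cstddef>
    #include <functional>
    #include <armadillo>

    // Sketch of the GD loop with the three stopping rules. The objective
    // functor writes the gradient into its second argument.
    arma::vec gd_sketch(arma::vec x,
                        const std::function<double (const arma::vec&, arma::vec*)>& fn,
                        double alpha,
                        double grad_err_tol, double rel_sol_change_tol, size_t iter_max)
    {
        const double eps = 1e-8; // guards the division in the relative-change test
        arma::vec grad(x.n_elem);

        for (size_t iter = 1; iter <= iter_max; ++iter) {
            fn(x, &grad);

            arma::vec d = alpha * grad;   // vanilla GD direction (method = 0)
            arma::vec x_next = x - d;     // x^{(i+1)} = x^{(i)} - d^{(i)}

            double rel_change = arma::norm((x_next - x) / (arma::abs(x) + eps), 1);
            x = x_next;

            if (arma::norm(grad, 2) < grad_err_tol || rel_change < rel_sol_change_tol) {
                break; // gradient-norm or relative-change tolerance met
            }
        }

        return x;
    }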
gd_settings.method = 0
Vanilla GD:

d^{(i)} = \alpha \times [ \nabla_x f( x^{(i)} ) ]

where \alpha, the step size (also known as the learning rate), is set by par_step_size.
gd_settings.method = 1
GD with momentum:

d^{(i)} = \mu \times d^{(i-1)} + \alpha \times [ \nabla_x f( x^{(i)} ) ]

where \mu, the momentum parameter, is set by par_momentum.
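As an illustrative sketch (not OptimLib's internal code), one momentum step damps the previous direction by \mu and adds the current gradient scaled by \alpha; setting \mu = 0 recovers vanilla GD.

.. code:: cpp

    #include <armadillo>

    // One momentum step: damp the previous direction by mu and add the
    // current gradient scaled by the step size alpha.
    arma::vec momentum_direction(const arma::vec& d_prev, const arma::vec& grad,
                                 double mu, double alpha)
    {
        return mu * d_prev + alpha * grad;
    }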
gd_settings.method = 2
Nesterov accelerated gradient descent (NAG):

d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla_x f( x^{(i)} - \mu \times d^{(i-1)} )
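The distinguishing feature of NAG is that the gradient is evaluated at the look-ahead point x^{(i)} - \mu \times d^{(i-1)} rather than at x^{(i)} itself. A sketch (the grad_fn functor here is hypothetical):

.. code:: cpp

    #include <functional>
    #include <armadillo>

    // One NAG step: the gradient is evaluated at the look-ahead point
    // x - mu * d_prev rather than at x itself.
    arma::vec nag_direction(const arma::vec& x, const arma::vec& d_prev,
                            const std::function<arma::vec (const arma::vec&)>& grad_fn,
                            double mu, double alpha)
    {
        arma::vec look_ahead = x - mu * d_prev;
        return mu * d_prev + alpha * grad_fn(look_ahead);
    }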
gd_settings.method = 3
AdaGrad:

\begin{aligned} d^{(i)} &= [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{1}{\sqrt{v^{(i)}} + \epsilon} \\ v^{(i)} &= v^{(i-1)} + [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ] \end{aligned}
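In code, the accumulator update and the resulting direction might look like the following sketch (Armadillo's % operator is the element-wise product \odot):

.. code:: cpp

    #include <armadillo>

    // One AdaGrad step: v accumulates squared gradients, so coordinates with
    // a large gradient history receive smaller effective steps.
    arma::vec adagrad_direction(arma::vec& v, const arma::vec& grad,
                                double eps = 1e-8)
    {
        v += grad % grad;                    // v^{(i)} = v^{(i-1)} + g (.) g
        return grad / (arma::sqrt(v) + eps); // g (.) 1 / (sqrt(v) + eps)
    }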
gd_settings.method = 4
RMSProp:

\begin{aligned} d^{(i)} &= [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{1}{\sqrt{v^{(i)}} + \epsilon} \\ v^{(i)} &= \rho \times v^{(i-1)} + (1-\rho) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ] \end{aligned}
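A corresponding sketch, where \rho controls how quickly old squared gradients are forgotten:

.. code:: cpp

    #include <armadillo>

    // One RMSProp step: v is an exponential moving average of squared
    // gradients instead of AdaGrad's unbounded sum.
    arma::vec rmsprop_direction(arma::vec& v, const arma::vec& grad,
                                double rho, double eps = 1e-8)
    {
        v = rho * v + (1.0 - rho) * (grad % grad);
        return grad / (arma::sqrt(v) + eps);
    }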
gd_settings.method = 5
AdaDelta:

\begin{aligned} d^{(i)} &= [ \nabla_x f( x^{(i)} ) ] \odot \dfrac{\sqrt{m^{(i)}} + \epsilon}{\sqrt{v^{(i)}} + \epsilon} \\ m^{(i)} &= \rho \times m^{(i-1)} + (1-\rho) \times [ d^{(i-1)} ] \odot [ d^{(i-1)} ] \\ v^{(i)} &= \rho \times v^{(i-1)} + (1-\rho) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ] \end{aligned}
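A sketch of one AdaDelta step, tracking both squared past directions (m) and squared gradients (v):

.. code:: cpp

    #include <armadillo>

    // One AdaDelta step: m tracks squared past directions, v tracks squared
    // gradients; their ratio sets a per-coordinate step length.
    arma::vec adadelta_direction(arma::vec& m, arma::vec& v,
                                 const arma::vec& d_prev, const arma::vec& grad,
                                 double rho, double eps = 1e-8)
    {
        m = rho * m + (1.0 - rho) * (d_prev % d_prev);
        v = rho * v + (1.0 - rho) * (grad % grad);
        return grad % ((arma::sqrt(m) + eps) / (arma::sqrt(v) + eps));
    }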
gd_settings.method = 6
Adam (adaptive moment estimation) and AdaMax:

\begin{aligned} m^{(i)} &= \beta_1 \times m^{(i-1)} + (1-\beta_1) \times [ \nabla_x f( x^{(i)} ) ] \\ v^{(i)} &= \beta_2 \times v^{(i-1)} + (1-\beta_2) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ] \\ \hat{m} &= \dfrac{m^{(i)}}{1 - \beta_1^i}, \ \ \hat{v} = \dfrac{v^{(i)}}{1 - \beta_2^i} \end{aligned}
where m^{(0)} = \mathbf{0}, and \beta_1 and \beta_2 are set by par_adam_beta_1 and par_adam_beta_2, respectively.

If ada_max = false, then the descent direction is computed as

d^{(i)} = \alpha \times \dfrac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}

If ada_max = true, then the updating rule for v^{(i)} is no longer based on the L_2 norm; instead

v^{(i)} = \max \left\{ \beta_2 \times v^{(i-1)}, | \nabla_x f( x^{(i)} ) | \right\}

and the descent direction is computed using

d^{(i)} = \alpha \times \dfrac{\hat{m}}{ v^{(i)} + \epsilon}
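For concreteness, a single Adam/AdaMax step might be sketched as follows (an illustrative implementation, not OptimLib's internal code; iter is assumed to start at 1 so the bias corrections are well defined):

.. code:: cpp

    #include <cmath>
    #include <armadillo>

    // One Adam (or AdaMax) step. m and v are the first- and second-moment
    // estimates; iter is the 1-based iteration count for bias correction.
    arma::vec adam_direction(arma::vec& m, arma::vec& v, const arma::vec& grad,
                             size_t iter, double alpha,
                             double beta_1, double beta_2,
                             bool ada_max, double eps = 1e-8)
    {
        m = beta_1 * m + (1.0 - beta_1) * grad;
        arma::vec m_hat = m / (1.0 - std::pow(beta_1, static_cast<double>(iter)));

        if (!ada_max) {
            v = beta_2 * v + (1.0 - beta_2) * (grad % grad);
            arma::vec v_hat = v / (1.0 - std::pow(beta_2, static_cast<double>(iter)));
            return alpha * (m_hat / (arma::sqrt(v_hat) + eps));
        } else {
            v = arma::max(beta_2 * v, arma::abs(grad)); // infinity-norm style update
            return alpha * (m_hat / (v + eps));
        }
    }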
gd_settings.method = 7
Nadam (Nesterov-accelerated adaptive moment estimation) and NadaMax:

\begin{aligned} m^{(i)} &= \beta_1 \times m^{(i-1)} + (1-\beta_1) \times [ \nabla_x f( x^{(i)} ) ] \\ v^{(i)} &= \beta_2 \times v^{(i-1)} + (1-\beta_2) \times [ \nabla_x f( x^{(i)} ) ] \odot [ \nabla_x f( x^{(i)} ) ] \\ \hat{m} &= \dfrac{m^{(i)}}{1 - \beta_1^i}, \ \ \hat{v} = \dfrac{v^{(i)}}{1 - \beta_2^i}, \ \ \hat{g} = \dfrac{ \nabla_x f(x^{(i)}) }{1 - \beta_1^i} \end{aligned}
where m^{(0)} = \mathbf{0}, and \beta_1 and \beta_2 are set by par_adam_beta_1 and par_adam_beta_2, respectively.

If ada_max = false, then the descent direction is computed as

d^{(i)} = \alpha \times \dfrac{\beta_1 \hat{m} + (1 - \beta_1) \hat{g} }{\sqrt{\hat{v}} + \epsilon}

If ada_max = true, then the updating rule for v^{(i)} is no longer based on the L_2 norm; instead

v^{(i)} = \max \left\{ \beta_2 \times v^{(i-1)}, | \nabla_x f( x^{(i)} ) | \right\}

and the descent direction is computed using

d^{(i)} = \alpha \times \dfrac{\beta_1 \hat{m} + (1 - \beta_1) \hat{g} }{v^{(i)} + \epsilon}
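Analogous to the Adam sketch above, one Nadam/NadaMax step might be written as follows (again an illustration, with iter assumed to start at 1):

.. code:: cpp

    #include <cmath>
    #include <armadillo>

    // One Nadam (or NadaMax) step: like Adam, but the direction blends the
    // bias-corrected first moment with the bias-corrected current gradient.
    arma::vec nadam_direction(arma::vec& m, arma::vec& v, const arma::vec& grad,
                              size_t iter, double alpha,
                              double beta_1, double beta_2,
                              bool ada_max, double eps = 1e-8)
    {
        m = beta_1 * m + (1.0 - beta_1) * grad;

        const double bc_1 = 1.0 - std::pow(beta_1, static_cast<double>(iter));
        arma::vec m_hat = m / bc_1;
        arma::vec g_hat = grad / bc_1;
        arma::vec mix = beta_1 * m_hat + (1.0 - beta_1) * g_hat;

        if (!ada_max) {
            v = beta_2 * v + (1.0 - beta_2) * (grad % grad);
            arma::vec v_hat = v / (1.0 - std::pow(beta_2, static_cast<double>(iter)));
            return alpha * (mix / (arma::sqrt(v_hat) + eps));
        } else {
            v = arma::max(beta_2 * v, arma::abs(grad)); // infinity-norm style update
            return alpha * (mix / (v + eps));
        }
    }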
.. doxygenfunction:: gd(ColVec_t& init_out_vals, std::function<fp_t (const ColVec_t& vals_inp, ColVec_t* grad_out, void* opt_data)> opt_objfn, void* opt_data)
   :project: optimlib

.. doxygenfunction:: gd(ColVec_t& init_out_vals, std::function<fp_t (const ColVec_t& vals_inp, ColVec_t* grad_out, void* opt_data)> opt_objfn, void* opt_data, algo_settings_t& settings)
   :project: optimlib
The basic control parameters are:

- fp_t grad_err_tol: the error tolerance value controlling how small the L_2 norm of the gradient vector \| \nabla f \| should be before 'convergence' is declared.

- fp_t rel_sol_change_tol: the error tolerance value controlling how small the proportional change in the solution vector should be before 'convergence' is declared. The relative change is computed using (see the sketch after this list):

\left\| \dfrac{x^{(i)} - x^{(i-1)}}{ |x^{(i-1)}| + \epsilon } \right\|_1

where \epsilon is a small number added for numerical stability.
- size_t iter_max: the maximum number of iterations/updates before the algorithm exits.

- bool vals_bound: whether the search space of the algorithm is bounded. If true, then:

  - ColVec_t lower_bounds: defines the lower bounds of the search space.
  - ColVec_t upper_bounds: defines the upper bounds of the search space.
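For concreteness, the relative solution change used by rel_sol_change_tol might be computed as in this sketch:

.. code:: cpp

    #include <armadillo>

    // Relative change between successive candidate solutions, as used by the
    // rel_sol_change_tol stopping rule; eps guards against division by zero.
    double rel_sol_change(const arma::vec& x_new, const arma::vec& x_old,
                          double eps = 1e-8)
    {
        return arma::norm((x_new - x_old) / (arma::abs(x_old) + eps), 1);
    }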
In addition to these:

- int print_level: Set the level of detail for printing updates on optimization progress.

  - Level 0: Nothing (default).
  - Level 1: Print the current iteration count and error values.
  - Level 2: Level 1 plus the current candidate solution values, x^{(i+1)}.
  - Level 3: Level 2 plus the direction vector, d^{(i)}, and the gradient vector, \nabla_x f(x^{(i+1)}).
  - Level 4: Level 3 plus information about the chosen gradient descent rule.
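As a usage sketch (assuming the settings fields documented above), the following configures GD with Adam updates and a custom step size, then minimizes a simple quadratic:

.. code:: cpp

    #define OPTIM_ENABLE_ARMA_WRAPPERS
    #include "optim.hpp"

    int main()
    {
        optim::algo_settings_t settings;

        settings.iter_max = 2000;
        settings.grad_err_tol = 1e-8;
        settings.print_level = 1;

        settings.gd_settings.method = 6;          // Adam
        settings.gd_settings.par_step_size = 0.1; // alpha

        arma::vec x = arma::ones(5, 1);

        // minimize f(x) = x'x, whose gradient is 2x
        bool success = optim::gd(x,
            [](const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data) -> double {
                if (grad_out) { *grad_out = 2.0 * vals_inp; }
                return arma::dot(vals_inp, vals_inp);
            }, nullptr, settings);

        arma::cout << "solution:\n" << x << arma::endl;

        return success ? 0 : 1;
    }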
Example: the sphere function, f(x) = x^{\top} x, with gradient \nabla_x f(x) = 2x. Code to run this example is given below.
.. toggle-header::
   :header: **Armadillo (Click to show/hide)**

   .. code:: cpp

      #define OPTIM_ENABLE_ARMA_WRAPPERS
      #include "optim.hpp"
      #include <iostream>

      inline double sphere_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
      {
          double obj_val = arma::dot(vals_inp, vals_inp);

          if (grad_out) {
              *grad_out = 2.0 * vals_inp;
          }

          return obj_val;
      }

      int main()
      {
          const int test_dim = 5;

          arma::vec x = arma::ones(test_dim, 1); // initial values (1,1,...,1)

          bool success = optim::gd(x, sphere_fn, nullptr);

          if (success) {
              std::cout << "gd: sphere test completed successfully." << "\n";
          } else {
              std::cout << "gd: sphere test completed unsuccessfully." << "\n";
          }

          arma::cout << "gd: solution to sphere test:\n" << x << arma::endl;

          return 0;
      }
.. toggle-header::
   :header: **Eigen (Click to show/hide)**

   .. code:: cpp

      #define OPTIM_ENABLE_EIGEN_WRAPPERS
      #include "optim.hpp"
      #include <iostream>

      inline double sphere_fn(const Eigen::VectorXd& vals_inp, Eigen::VectorXd* grad_out, void* opt_data)
      {
          double obj_val = vals_inp.dot(vals_inp);

          if (grad_out) {
              *grad_out = 2.0 * vals_inp;
          }

          return obj_val;
      }

      int main()
      {
          const int test_dim = 5;

          Eigen::VectorXd x = Eigen::VectorXd::Ones(test_dim); // initial values (1,1,...,1)

          bool success = optim::gd(x, sphere_fn, nullptr);

          if (success) {
              std::cout << "gd: sphere test completed successfully." << "\n";
          } else {
              std::cout << "gd: sphere test completed unsuccessfully." << "\n";
          }

          std::cout << "gd: solution to sphere test:\n" << x << std::endl;

          return 0;
      }
Example: Booth's function, f(x_1, x_2) = (x_1 + 2 x_2 - 7)^2 + (2 x_1 + x_2 - 5)^2, whose minimum is at (1, 3). Code to run this example is given below.
.. toggle-header::
   :header: **Armadillo Code (Click to show/hide)**

   .. code:: cpp

      #define OPTIM_ENABLE_ARMA_WRAPPERS
      #include "optim.hpp"
      #include <iostream>

      inline double booth_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
      {
          double x_1 = vals_inp(0);
          double x_2 = vals_inp(1);

          double obj_val = std::pow(x_1 + 2*x_2 - 7.0, 2) + std::pow(2*x_1 + x_2 - 5.0, 2);

          if (grad_out) {
              (*grad_out)(0) = 2*(x_1 + 2*x_2 - 7.0) + 2*(2*x_1 + x_2 - 5.0)*2;
              (*grad_out)(1) = 2*(x_1 + 2*x_2 - 7.0)*2 + 2*(2*x_1 + x_2 - 5.0);
          }

          return obj_val;
      }

      int main()
      {
          arma::vec x_2 = arma::zeros(2,1); // initial values (0,0)

          bool success_2 = optim::gd(x_2, booth_fn, nullptr);

          if (success_2) {
              std::cout << "gd: Booth test completed successfully." << "\n";
          } else {
              std::cout << "gd: Booth test completed unsuccessfully." << "\n";
          }

          arma::cout << "gd: solution to Booth test:\n" << x_2 << arma::endl;

          return 0;
      }
.. toggle-header::
   :header: **Eigen Code (Click to show/hide)**

   .. code:: cpp

      #define OPTIM_ENABLE_EIGEN_WRAPPERS
      #include "optim.hpp"
      #include <iostream>

      inline double booth_fn(const Eigen::VectorXd& vals_inp, Eigen::VectorXd* grad_out, void* opt_data)
      {
          double x_1 = vals_inp(0);
          double x_2 = vals_inp(1);

          double obj_val = std::pow(x_1 + 2*x_2 - 7.0, 2) + std::pow(2*x_1 + x_2 - 5.0, 2);

          if (grad_out) {
              (*grad_out)(0) = 2*(x_1 + 2*x_2 - 7.0) + 2*(2*x_1 + x_2 - 5.0)*2;
              (*grad_out)(1) = 2*(x_1 + 2*x_2 - 7.0)*2 + 2*(2*x_1 + x_2 - 5.0);
          }

          return obj_val;
      }

      int main()
      {
          Eigen::VectorXd x_2 = Eigen::VectorXd::Zero(2); // initial values (0,0)

          bool success_2 = optim::gd(x_2, booth_fn, nullptr);

          if (success_2) {
              std::cout << "gd: Booth test completed successfully." << "\n";
          } else {
              std::cout << "gd: Booth test completed unsuccessfully." << "\n";
          }

          std::cout << "gd: solution to Booth test:\n" << x_2 << std::endl;

          return 0;
      }