A blisfully ignorant Julia package for gradient optimization.
A package for gradient optimization that tries to know or assume as little as possible about your optimization problem. So far gradient descent, conjugate gradient and the L-BFGS method have been implemented. One starts by defining the algorithm that one wants to use by creating an instance algorithm
of either
GradientDescent(; params...)
ConjugateGradient(; flavor = ..., params...)
LBFGS(m::Int; params...)
All of them take a number of parameters, namely
maxiter
: number of iterations (defaults totypemax(Int)
, so essentially unbounded)gradtol
: convergence criterion, stop when the 2-norm of the gradient is smaller thangradtol
(default value =gradtol = 1e-8
)linesearch
: which linesearch algorithm to be used, currently there is only one choice, namelyHagerZhangLineSearch(;...)
(see below).verbosity
: Verbosity level, the amount of information that will be printed, either<=0
(default value) for no information,1
for a single STDOUT output at the end of the algorithm, or>=2
for a one-line summary after every iteration step.
Furthermore, LBFGS
takes a single positional argument m::Int
, the number of previous steps to take into account in the construction of the approximate (inverse) Hessian. It also takes a keyword argument acceptfirst
, which determines whether the first guess for alpha
in the line search can be accepted. The default value is true
, which typically leads to less function evaluations (otherwise at least two function evaluations per iteration are required), despite a more erratic convergence of the gradient norm.
ConjugateGradient
also has one additional keyword argument, flavor
, which can be any of the following:
HagerZhang(; η::Real = 0.4, θ::Real = 1.0)
: defaultHestenesStiefel()
PolakRibiere()
DaiYuan()
The linesearch
argument currently takes the value
HagerZhangLineSearch(; δ::Real = .1, # parameter for first Wolfe condition
σ::Real = .9, # paremeter for second Wolfe condition
ϵ::Real = 1e-6, # parameter for approximate Wolfe condition, accept fluctation of ϵ on the function value
θ::Real = 1/2, # used in bisection
γ::Real = 2/3, # determines when a bisection step is performed
ρ::Real = 5., # expansion parameter for initial bracket interval
verbosity::Int = 0)
The linesearch has an independent verbosity
flag to control the output of information being printed to STDOUT
, but by default its value is equal to verbosity-2
of the optimization algorithm. So ConjugateGradient(; verbosity = 3)
is equivalent to
having verbosity=1
in the linesearch algorithm.
This optimization algorithm can then be applied by calling
x, fx, gx, numfg, normgradhistory = optimize(fg, x₀, algorithm; kwargs...)
Here, the optimization problem (objective function) is specified as a function fval, gval = fg(x)
that returns both the function value and its gradient at a given point x
. The function value fval
is assumed to be a real number of some type T<:Real
. Both x
and the gradient gval
can be of any type, including tuples and named tuples. As a user, you should then also specify the following functions via keyword arguments
Pη = precondition(x, η)
: apply a preconditioner to the current gradient or tangent vectorη
at the positionx
x, f, g = finalize!(x, f, g, numiter)
: after every step (i.e. upon completion of the linesearch), allows to modify the position and corresponding function value or gradient, or to do other things like printing out statistics. Note that this step happens before computing new directions in Conjugate Gradient and LBFGS, so iff
andg
are modified, this is at the user's own risk (e.g. Wolfe conditions might no longer be satisfied, ...).x, ξ = retract(x₀, η, α)
: take a step in directionη
(same type as gradients) starting from pointx₀
and with step lengthα
, returns the newx(α) = Rₓ₀(α * η)
and the local tangent to this path at that position, i.e.ξ = D Rₓ₀(α * η)[η]
(informally,ξ = dx(α) / dα
).s = inner(x, ξ1, ξ2)
: compute the inner product between two gradients or similar objects at positionx
. Thex
dependence is useful for optimization on manifolds, where this function represents the metric; in particular it should be symmetricinner(x, ξ1, ξ2) == inner(x, ξ2, ξ1)
and real-valued.η = scale!(η, β)
: compute the equivalent ofη*β
, possibly in place, but we always use the return value. This is mostly used asscale!(g, -1)
to compute the negative gradient as part of the step direction.η = add!(η, ξ, β)
: compute the equivalent ofη + ξ*β
, possibly overwritingη
in place, but we always use the return valueξ = transport!(ξ, x, η, α, x′)
: transport tangent vectorξ
along the retraction ofx
in the directionη
(same type as a gradient) with step lengthα
, can be in place but the return value is used. Transport also receivesx′ = retract(x, η, α)[1]
as final argument, which has been computed before and can contain useful data that does not need to be recomputed
Note that the gradient g
of the objective function should satisfy d f(x(α)) / d α = inner(x(α), ξ(α), g(x(α)))
. There is a utility function optimtest
to facilitate testing this compatibility relation for your given choice of fg
, retract
and inner
.
The GradientDescent
algorithm only requires the first three, ConjugateGradient
and LBFGS
require all five functions. Default values are provided to make the optimization algorithms work with standard optimization problems where x
is a vector or Array
, i.e. they are given by
_retract(x, η, α) = (x + α * η, η)
_inner(x, ξ1, ξ2) = ξ1 === ξ2 ? LinearAlgebra.norm(ξ1)^2 : LinearAlgebra.dot(ξ1, ξ2)
_transport!(ξ, x, η, α) = ξ
_add!(η, ξ, β) = LinearAlgebra.axpy!(β, ξ, η)
_scale!(η, β) = LinearAlgebra.rmul!(η, β)
Finally, there is one keyword argument isometrictransport::Bool
to indicate whether the transport of vectors preserves their inner product, i.e. whether
inner(x, ξ1, ξ2) == inner(retract(x, η, α), transport!(ξ1, x, η, α), transport!(ξ2, x, η, α))
The default value is false, unless the default transport (_transport!
) and inner product (_inner
) are used. However, convergence of conjugate gradient and LBFGS is more robust (or theoretically proven) in the case of isometric transport. Note that isometric transport might not be the same as retraction transport, and thus, in particular
ξ != transport(η, x, η, α, x′)
. However, when isometric transport is provided, we complement it with an isometric rotation such that ξ = D Rₓ₀(α * η)[η]
and transport(η, x, η, α)
are parallel in the case of LBFGS
. This is the so-called locking condition of Huang, Gallivan and Absil, and the approach is described in section 4.1.