Ideas to improve multistage stochastic integer programs #246
@haoxiangyang89 #265 had some of the ideas we discussed the other day.
@guyichen09 (cc @mortondp) a use-case for Bayesian Optimization/multi-armed bandit stuff: each iteration of SDDP.jl, we have three choices for our duality handler:
- ContinuousConicDuality (a.k.a. continuous Benders)
- StrengthenedConicDuality (a.k.a. strengthened Benders)
- LagrangianDuality (a.k.a. SDDiP)
Each duality handler will lead to cuts that improve our lower bound by some amount, but take a different amount of time. You could normalize each iteration to a reward function of (change in objective / time taken). This is stochastic, and it changes over time. I want to write a BanditDuality that intelligently picks which handler to run over the iterations. Seems like this is pretty do-able?
Hi Oscar,
I have an idea for this BanditDuality. I think that the multi-armed bandit algorithm might be a good method for this. We could use a modified version of Thompson Sampling. We would need to modify the reward function to take into account both the change in objective and the time taken.
What do you think?
Best,
Guyi
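[Editor's note: a minimal sketch of the Thompson-style sampler Guyi describes, assuming a simple Gaussian reward model where the reward is the bound improvement per unit time. All names (`GaussianArm`, `choose_arm`, `update_arm!`) are illustrative; none of this is SDDP.jl API.]

```julia
using Random

# One arm per duality handler; track the number of pulls and the running mean
# of the observed rewards (reward = change in objective / time taken).
mutable struct GaussianArm
    n::Int
    mean::Float64
end

# Thompson-style selection: draw a plausible mean reward for each arm from a
# normal distribution whose spread shrinks as the arm is pulled more often,
# then play the arm with the largest draw.
function choose_arm(arms::Vector{GaussianArm})
    draws = [arm.mean + randn() / sqrt(arm.n + 1) for arm in arms]
    return argmax(draws)
end

# Update the chosen arm with the observed reward.
function update_arm!(arm::GaussianArm, d_objective::Float64, dt::Float64)
    reward = d_objective / dt
    arm.n += 1
    arm.mean += (reward - arm.mean) / arm.n  # incremental running mean
    return arm
end

# One arm each for the continuous, strengthened, and Lagrangian handlers.
arms = [GaussianArm(0, 0.0) for _ in 1:3]
```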
Yeah, my suggestion for the reward would be:

```julia
function reward(log::Vector{Log})
    d_bound = abs(log[end].bound - log[end-1].bound)
    dt = log[end].time - log[end-1].time
    return d_bound / dt
end
```

and the … at line 938 in ad7a418.
There's a little bit of plumbing to do, but not too much work. We should pass … (line 472 in ad7a418), which gets updated here (src/plugins/duality_handlers.jl, lines 64 to 71 in ad7a418), here (src/plugins/duality_handlers.jl, lines 131 to 136 in ad7a418), and here (src/plugins/headers.jl, lines 185 to 192 in ad7a418).
Then we can write a new method for … It should probably start with a minimum of, say, 10 iterations of …
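[Editor's note: to make the plumbing concrete, here is a rough sketch of how the per-iteration selection might look once `reward(log)` is available. The `BanditDualitySketch` type, the warm-up of 10 iterations per handler, and the ε-greedy rule are all illustrative assumptions, not SDDP.jl API.]

```julia
# One slot per duality handler (e.g. continuous, strengthened, Lagrangian).
struct BanditDualitySketch
    handlers::Vector{Any}
    rewards::Vector{Vector{Float64}}  # observed rewards per handler
end

BanditDualitySketch(handlers) =
    BanditDualitySketch(handlers, [Float64[] for _ in handlers])

function pick_handler(b::BanditDualitySketch; warmup::Int = 10, ε::Float64 = 0.1)
    # Warm-up: make sure every handler has at least `warmup` observations.
    for (i, r) in enumerate(b.rewards)
        if length(r) < warmup
            return i
        end
    end
    # ε-greedy: occasionally explore, otherwise exploit the best average reward.
    if rand() < ε
        return rand(1:length(b.handlers))
    end
    return argmax([sum(r) / length(r) for r in b.rewards])
end

# After each iteration, record the reward for the handler `i` that was used:
#     push!(b.rewards[i], reward(log))
```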
We could use this Bayesian learning in a few other contexts:

Trajectory depth in cyclic graphs

The default sampling scheme for the forward pass has some tunable parameters (src/plugins/sampling_schemes.jl, lines 38 to 43 in 493b48e). In particular, for cyclic graphs, you might want to learn between:

```julia
A = InSampleMonteCarlo(
    max_depth = length(model.nodes),
    terminate_on_dummy_leaf = false,
)
B = InSampleMonteCarlo(
    max_depth = 5 * length(model.nodes),
    terminate_on_dummy_leaf = false,
)
C = InSampleMonteCarlo()
```

Revisiting forward passes

Rather than just looping through forward passes (src/plugins/forward_passes.jl, lines 136 to 158 in 493b48e), we should learn which ones to revisit according to some score. So you look at the delta achieved by each forward pass, as well as the expected delta of sampling a new trajectory, and then pick the one with the best score. I think this is both computationally efficient and helpful from a marketing aspect, because we can claim machine learning is being used to drive decision making within the algorithm.
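[Editor's note: a sketch of that scoring rule under hypothetical names; nothing here is SDDP.jl API. It compares the bound improvement each stored trajectory delivered last time against an estimate for a freshly sampled one, and picks whichever looks best.]

```julia
# `deltas[i]` is the bound improvement achieved the last time stored forward
# pass `i` was revisited; `expected_new_delta` is an estimate of the
# improvement from sampling a brand-new trajectory.
function pick_forward_pass(deltas::Vector{Float64}, expected_new_delta::Float64)
    if isempty(deltas)
        return :sample_new_trajectory
    end
    best_delta, best_index = findmax(deltas)
    if expected_new_delta > best_delta
        return :sample_new_trajectory
    end
    return best_index  # revisit the most promising stored trajectory
end
```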
@guyichen09 here's a potentially open question: is there any work on learning when the reward is non-stationary? In our case, we know that the reward for each arm will tend towards 0 as it gets pulled. The history is also action-dependent, which means assumptions about the distribution of the reward being independent of the algorithm are not satisfied. Some papers:
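[Editor's note: one common workaround for this kind of non-stationarity, used by discounted and sliding-window bandit variants, is to down-weight old observations so an arm's estimated reward decays as its bound improvements shrink. A minimal sketch, not tied to any particular paper or to SDDP.jl:]

```julia
# Exponentially discounted reward estimate for one arm.
mutable struct DiscountedArm
    value::Float64   # discounted sum of rewards
    weight::Float64  # discounted count of observations
end

DiscountedArm() = DiscountedArm(0.0, 0.0)

# γ < 1 forgets old pulls, so the estimate tracks a reward drifting to zero.
function observe!(arm::DiscountedArm, reward::Float64; γ::Float64 = 0.9)
    arm.value = γ * arm.value + reward
    arm.weight = γ * arm.weight + 1.0
    return arm.value / arm.weight  # current discounted-average reward
end
```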
I’m with you. Good idea.
When Daniel was looking at DRO / SDDP we looked at sampling forward paths according to the nominal distribution or according to the DRO distribution. In theory, the latter can get you in trouble with respect to not converging due to zero-probability branches. We did a convex combo / mixture. Lower bounds were significantly tighter when putting enough weight on DRO paths, meaning that I think your idea would make sense, at least according to some metrics. We also looked at out-of-sample quality of the resulting policy. That difference was less dramatic, but there may have been something there, too. Can’t remember for sure.
Dave
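[Editor's note: for reference, the convex combination Dave describes can be as simple as mixing the two probability vectors before sampling a child node on the forward pass. A sketch with hypothetical inputs (`p_nominal`, `p_dro`, mixing weight `λ`); this is not how SDDP.jl implements DRO.]

```julia
# Sample an index from a convex combination of the nominal and DRO
# (worst-case) probability distributions over the children of a node.
# Keeping λ < 1 leaves every nominal-probability branch reachable, which
# avoids the zero-probability convergence issue mentioned above.
function sample_mixed_child(p_nominal::Vector{Float64}, p_dro::Vector{Float64}, λ::Float64)
    p = λ .* p_dro .+ (1 - λ) .* p_nominal  # still a probability vector
    r, cumulative = rand(), 0.0
    for (i, p_i) in enumerate(p)
        cumulative += p_i
        if r <= cumulative
            return i
        end
    end
    return length(p)  # guard against floating-point round-off
end
```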
Hi Oscar,
I am not sure if there is any work about non-stationary rewards. During my internship, I found an algorithm using Thompson Sampling for the combinatorial multi-armed bandit problem. In this algorithm, the reward is deterministic in each pull during an iteration; however, from iteration to iteration of the overall algorithm, the reward changes and tends towards zero. They use a Bernoulli random variable to account for that. The specific algorithm is attached below.
[Screenshot: Screen Shot 2021-09-24 at 11.16.09 AM.png]
Lines 9-10 of the algorithm are where they deal with the reward. I think it is a possible option for what we are looking for. I will also research more on this topic.
Best,
Guyi
A few updates for this:
- src/plugins/duality_handlers.jl, lines 310 to 412 in 23fcaa9
- src/plugins/local_improvement_search.jl, lines 23 to 130 in 23fcaa9
Closing because I think this is a dead end. The cost of Lagrangian duality just doesn't improve things. I've seen a few papers recently claiming to "do SDDiP", but they have mixed integer and continuous state and control variables, and then sneakily say "we just use the continuous dual; it might give a worse bound, but it works well." And it does. So we don't need this.
Motivation
The implementation of SDDiP in SDDP.jl is bad. If you use it, you will probably run into convergence issues! No need to open an issue, I'm aware of the problem. (If you have challenging multistage stochastic integer problems, please send them to me!)
At this point it isn't apparent whether SDDiP is a bad algorithm or a poor implementation in SDDP.jl.
My guess is that it is a bad algorithm that almost results in complete enumeration for most problems, but in some cases we can do okay if we implement the following algorithmic improvements.
Clever switching between duality handlers
- Start with ContinuousConicDuality()
- Switch to StrengthenedConicDuality()
- Keep StrengthenedConicDuality() if the improvement was more than could be expected in the same time as ContinuousConicDuality()
This should mean that it can be the default for all cases. LP has no performance loss. MIP can only be better, and we don't waste time with the strengthened solve if the problem is naturally integer, or with the Lagrangian solve if it takes too long to solve.
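[Editor's note: a sketch of that switching rule, assuming we record the bound improvement and wall-clock time of each handler as it is used, and that ContinuousConicDuality() has already been run a few times. The handler symbols and the `stats` container are hypothetical, and the Lagrangian step is omitted for brevity.]

```julia
# stats[h] = (d_bound = ..., time = ...) accumulated for handler `h`,
# e.g. stats = Dict(:continuous => (d_bound = 1.2, time = 0.3)).
rate(stats, h) = stats[h].d_bound / stats[h].time

function choose_next_handler(stats)
    if !haskey(stats, :strengthened)
        return :strengthened  # try StrengthenedConicDuality() at least once
    elseif rate(stats, :strengthened) >= rate(stats, :continuous)
        return :strengthened  # the extra work pays for itself: keep it
    else
        return :continuous    # fall back to the cheap continuous dual
    end
end
```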
Bounded states
We need bounds on x.in to ensure the dual has a bounded solution, but it isn't obvious what bounds to use or how to add them.
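[Editor's note: for illustration only, a sketch using the public SDDP.jl modelling API that declares bounds on the state and copies them to the incoming variable explicitly. Whether these are the right bounds, and whether this is the right place to impose them, is exactly the open question; the assumption that bounds in `@variable` attach only to the outgoing copy is flagged in the comments.]

```julia
using SDDP, JuMP, HiGHS

model = SDDP.LinearPolicyGraph(
    stages = 3, sense = :Min, lower_bound = 0.0, optimizer = HiGHS.Optimizer,
) do sp, t
    # Assumption: bounds given in @variable apply to the outgoing copy x.out,
    # so we also bound the incoming copy x.in to help keep the dual bounded.
    @variable(sp, 0 <= x <= 100, SDDP.State, initial_value = 0)
    JuMP.set_lower_bound(x.in, 0)
    JuMP.set_upper_bound(x.in, 100)
    @variable(sp, 0 <= u <= 50, Int)
    @constraint(sp, x.out == x.in + u - 30)
    @stageobjective(sp, 2.0 * u)
end
```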
Advanced Lagrangian solves
The current implementation of the cutting-plane solve is very naive. From Jim's slides: https://www.ima.umn.edu/materials/2015-2016/ND8.1-12.16/25411/Luedtke-mip-decomp.pdf
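[Editor's note: for context, the naive approach is essentially Kelley's cutting-plane method on the dual: maximise an outer approximation of the concave dual function, query the oracle at the current multiplier, and add a cut. A self-contained sketch with a user-supplied oracle; all names are hypothetical and this is not the SDDP.jl implementation.]

```julia
using JuMP, HiGHS

# Kelley's cutting-plane method for maximising a concave dual function L(λ).
# `oracle(λ)` must return the value L(λ) and a subgradient of L at λ.
function kelley(oracle; n::Int, iteration_limit::Int = 100, atol::Float64 = 1e-6, M::Float64 = 1e6)
    approx = Model(HiGHS.Optimizer)
    set_silent(approx)
    @variable(approx, -M <= λ[1:n] <= M)
    @variable(approx, θ <= M)
    @objective(approx, Max, θ)
    best_value, best_λ = -Inf, zeros(n)
    for _ in 1:iteration_limit
        optimize!(approx)
        λ_k, upper = value.(λ), value(θ)
        L_k, subgrad = oracle(λ_k)
        if L_k > best_value
            best_value, best_λ = L_k, λ_k
        end
        # Outer approximation of the concave function: θ ≤ L(λᵏ) + g'(λ - λᵏ).
        @constraint(approx, θ <= L_k + subgrad' * (λ .- λ_k))
        if upper - best_value < atol
            break  # the outer approximation has converged
        end
    end
    return best_value, best_λ
end
```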
Other
Done
Write a theory tutorial
I should write a tutorial for the documentation.
I find how SDDiP is presented as a concept in the literature confusing.
We have these subproblems:
and we're trying to find a feasible dual solution and corresponding dual objective value for the slope and intercept of the cut respectively.
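[Editor's note: for concreteness, a sketch of the subproblem in roughly the policy-graph notation used in the SDDP.jl documentation; treat the details as an assumption rather than a quote.]

```math
\begin{aligned}
V_i(x, \omega) = \min_{\bar{x},\, x',\, u}\;\; & C_i(\bar{x}, u, \omega) + \mathbb{E}_{j \in i^+,\, \varphi}\bigl[V_j(x', \varphi)\bigr] \\
\text{s.t.}\;\; & \bar{x} = x \quad [\lambda] \\
& x' = T_i(\bar{x}, u, \omega) \\
& u \in U_i(\bar{x}, \omega).
\end{aligned}
```

Any feasible dual solution λ for the copy ("fishing") constraint, together with its dual objective value, gives the slope and intercept of a valid cut.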
You should probably be able to find a feasible solution with the default ContinuousConicDuality option as well. I changed how we handle integer variables in the forward pass during training.