Train model with Asynchronous mode #492

anavarro01 · 2021-11-25T20:44:17Z

Hi!
I'm trying to train my model with the Asynchronous mode, based on the docs.
In the first lines of the script, I'm running this:

dir_to_enviromentl = Base.active_project()
using Distributed
Distributed.addprocs(5)
@everywhere begin
    import Pkg
    Pkg.activate(dir_to_enviromentl)
    using Gurobi, SDDP
end

Note that I've added the statement "using Gurobi, SDDP" because I found it in another post.
Currently, I am creating my model this way:

graph = SDDP.LinearGraph(10)
gurobi_env = Gurobi.Env()
model = SDDP.PolicyGraph(
            graph,
            sense = :Min,
            optimizer = optimizer_with_attributes(() -> Gurobi.Optimizer(gurobi_env))) do sp, t
end

When I try to train my model:

SDDP.train(model; iteration_limit = 10, print_level = 1, add_to_existing_cuts = true, parallel_scheme = SDDP.Asynchronous())

I can only see output from "worker 1", and nothing from the other workers. Also, at the end of the training iterations, I get the following error:

ERROR: LoadError: On worker 2:
Gurobi Error 10002:
trap_error at /home/agnavarro/.julia/packages/SDDP/Cp4Bp/src/plugins/parallel_schemes.jl:159
slave_loop at /home/agnavarro/.julia/packages/SDDP/Cp4Bp/src/plugins/parallel_schemes.jl:155
#103 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:290
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:88
#96 at ./task.jl:356

I can't figure out where the problem is.

Thanks in advance!

odow · 2021-11-25T21:26:11Z

You needed to read a little further down the page to use Gurobi:
https://odow.github.io/SDDP.jl/latest/guides/improve_computational_performance/#Initialization-hooks
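For readers who do not follow the link, here is a minimal sketch of the initialization-hook pattern, using the same form that appears later in this thread. It assumes `model` is the policy graph built in the first post and that `set_optimizer` is available through SDDP's re-export of JuMP. The point is that the hook runs on each worker, so every worker creates its own `Gurobi.Env` instead of reusing the `gurobi_env` that was created on the master process.

```julia
# Sketch only: `model` is assumed to be the SDDP.PolicyGraph from the first post.
SDDP.train(
    model;
    iteration_limit = 10,
    parallel_scheme = SDDP.Asynchronous() do m
        # This block runs on each worker before training starts.
        env = Gurobi.Env()  # one environment per worker process
        for node in values(m.nodes)
            set_optimizer(node.subproblem, () -> Gurobi.Optimizer(env))
        end
    end,
)
```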

anavarro01 · 2021-11-26T12:49:31Z

Thanks, it worked perfectly.

anavarro01 · 2021-12-06T15:05:21Z

Hi,

Sorry for reopening this issue, but I have a question and opening a new issue might be unnecessary.
I'm training my model in asynchronous mode, and I get this output:

 Iteration    Simulation       Bound         Time (s)    Proc. ID   # Solves
        1    3.664092e+07   7.371414e+05   4.740712e+02          1       2484
        2    1.954982e+07   9.953921e+06   7.951662e+02          1       4680
        3    4.591550e+07   9.953921e+06   8.032407e+02          2       4740
        4    4.866310e+07   9.953921e+06   8.118478e+02          3       4800
        5    4.878851e+07   9.953921e+06   8.195227e+02          4       4860
        6    4.355507e+07   9.953921e+06   8.272206e+02          5       4920
        7    1.969249e+07   1.026331e+07   1.138915e+03          1       7116
        8    1.469963e+07   1.026350e+07   1.458276e+03          1       9312
        9    1.564037e+07   1.200186e+07   1.808019e+03          1      11508
       10    1.341304e+07   1.203708e+07   2.155772e+03          1      13704
       11    1.816808e+07   1.205720e+07   2.490398e+03          1      15900

It seems that it uses the 5 workers just once, and then trains with only one of them for the rest of the iterations. I've trained the model with 30 iterations, and only within the first few iterations did it use anything other than the first worker. The docs say "SDDP.jl will start in serial mode while the initialization takes place. Therefore, in the log you will see that the initial iterations take place on the master thread (Proc. ID = 1), and it is only after a while that the solve switches to full parallelism", but I'm getting something different.
Is this working OK, or is something strange going on?

Thanks in advance!

odow · 2021-12-06T19:29:26Z

Why does it take 474 seconds to do one iteration? What machine is this on? It looks like you're probably running out of RAM and so the other 4 workers are actually slower than just running in serial mode.

anavarro01 · 2021-12-06T19:40:39Z

I don't know why it takes 474 seconds to do one iteration. The problem is a big hydro-thermal scheduling model, with 56 stages (representing about 9 years), more than 200 buses, and 150 nodes in the water network.
I don't think that I'm running out of RAM. I'm running on an Intel Xeon E5-2630 with 40 cores and 64 GB of RAM. Looking at my dashboard, each worker is using no more than 5 GB of RAM, and the main process 8 GB.

odow · 2021-12-06T19:51:57Z

The parallel scheme is not well optimized, and it requires a lot of data movement between the processors. I assume for this model the set-up and data movement overhead outweighs the benefit of running in parallel. How do the times look if you run in serial?

anavarro01 · 2021-12-07T00:41:53Z

This is the output in serial mode:

 Iteration    Simulation       Bound         Time (s)    Proc. ID   # Solves
        1    4.379315e+07   6.238694e+05   5.137349e+02          1       2196
        2    2.398887e+07   1.024725e+07   9.844802e+02          1       4392
        3    1.365226e+07   1.025685e+07   1.464048e+03          1       6588
        4    2.005606e+07   1.188252e+07   1.904386e+03          1       8784
        5    1.751424e+07   1.188255e+07   2.471620e+03          1      10980
        6    1.362879e+07   1.259155e+07   2.880270e+03          1      13176
        7    1.396052e+07   1.261973e+07   3.248855e+03          1      15372
        8    1.366801e+07   1.261982e+07   3.658474e+03          1      17568
        9    1.412258e+07   1.262390e+07   4.027489e+03          1      19764
       10    1.476644e+07   1.262392e+07   4.426796e+03          1      21960
       11    1.387798e+07   1.298193e+07   4.769414e+03          1      24156
       12    1.344110e+07   1.299375e+07   5.248231e+03          1      26352
       13    1.444331e+07   1.299466e+07   5.842923e+03          1      28548
       14    1.375559e+07   1.336322e+07   6.328565e+03          1      30744
       15    1.335609e+07   1.336351e+07   6.852967e+03          1      32940
       16    1.357505e+07   1.340325e+07   7.329363e+03          1      35136
       17    1.329446e+07   1.340336e+07   7.866351e+03          1      37332
       18    1.822702e+07   1.340367e+07   8.441247e+03          1      39528
       19    1.401540e+07   1.340401e+07   8.839535e+03          1      41724
       20    1.349358e+07   1.340825e+07   9.145367e+03          1      43920

It converges in fewer iterations, as we would expect. Each iteration is a little slower (compared with the iterations run on process 1 in asynchronous mode).
Also, I've tested the asynchronous mode with 40 iterations, and it uses each worker just once and runs all the other iterations on process 1.

odow · 2021-12-07T00:46:39Z

There's something not quite right about your model. Does it really take 200ms per LP solve? What solver are you using? Is it an LP? Can you provide the full log from SDDP.jl (including all the stuff at the top)?
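The 200 ms figure can be checked from the serial-mode log posted above. The sketch below only divides the cumulative time by the cumulative number of solves in the last row of that log; the variable names are made up for illustration.

```julia
# Figures copied from the last row (iteration 20) of the serial-mode log.
total_time_s = 9.145367e+03   # "Time (s)" column
total_solves = 43920          # "# Solves" column
iterations = 20

seconds_per_solve = total_time_s / total_solves
solves_per_iteration = total_solves ÷ iterations

println("seconds per solve    : ", round(seconds_per_solve; digits = 3))
println("solves per iteration : ", solves_per_iteration)
```

This gives roughly 0.2 seconds per LP solve and 2196 solves per iteration, which is why each serial iteration takes several hundred seconds.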

anavarro01 · 2021-12-07T00:53:33Z

I'm using Gurobi 9.1 and it's an LP. The full SDDP log is this:

------------------------------------------------------------------------------
                      SDDP.jl (c) Oscar Dowson, 2017-21

Problem
  Nodes           : 36
  State variables : 25
  Scenarios       : 1.03144e+64
  Existing cuts   : false
  Subproblem structure                                              : (min, max)
    Variables                                                       : (21107, 21107)
    GenericAffExpr{Float64,VariableRef} in MOI.GreaterThan{Float64} : (3352, 3354)
    VariableRef in MOI.LessThan{Float64}                            : (1, 1)
    GenericAffExpr{Float64,VariableRef} in MOI.Interval{Float64}    : (5585, 5585)
    VariableRef in MOI.GreaterThan{Float64}                         : (19568, 19568)
    GenericAffExpr{Float64,VariableRef} in MOI.LessThan{Float64}    : (5050, 5051)
    GenericAffExpr{Float64,VariableRef} in MOI.EqualTo{Float64}     : (6506, 6516)
Options
  Solver          : serial mode
  Risk measure    : SDDP.Expectation()
  Sampling scheme : SDDP.InSampleMonteCarlo

Numerical stability report
  Non-zero Matrix range     [8e-05, 1e+03]
  Non-zero Objective range  [9e-02, 1e+07]
  Non-zero Bounds range     [5e+05, 5e+05]
  Non-zero RHS range        [1e-01, 4e+04]
WARNING: numerical stability issues detected
  - Matrix range contains small coefficients
Very large or small absolute values of coefficients
can cause numerical stability issues. Consider
reformulating the model.

 Iteration    Simulation       Bound         Time (s)    Proc. ID   # Solves
        1    4.379315e+07   6.238694e+05   5.137349e+02          1       2196
        2    2.398887e+07   1.024725e+07   9.844802e+02          1       4392
        3    1.365226e+07   1.025685e+07   1.464048e+03          1       6588
        4    2.005606e+07   1.188252e+07   1.904386e+03          1       8784
        5    1.751424e+07   1.188255e+07   2.471620e+03          1      10980
        6    1.362879e+07   1.259155e+07   2.880270e+03          1      13176
        7    1.396052e+07   1.261973e+07   3.248855e+03          1      15372
        8    1.366801e+07   1.261982e+07   3.658474e+03          1      17568
        9    1.412258e+07   1.262390e+07   4.027489e+03          1      19764
       10    1.476644e+07   1.262392e+07   4.426796e+03          1      21960
       11    1.387798e+07   1.298193e+07   4.769414e+03          1      24156
       12    1.344110e+07   1.299375e+07   5.248231e+03          1      26352
       13    1.444331e+07   1.299466e+07   5.842923e+03          1      28548
       14    1.375559e+07   1.336322e+07   6.328565e+03          1      30744
       15    1.335609e+07   1.336351e+07   6.852967e+03          1      32940
       16    1.357505e+07   1.340325e+07   7.329363e+03          1      35136
       17    1.329446e+07   1.340336e+07   7.866351e+03          1      37332
       18    1.822702e+07   1.340367e+07   8.441247e+03          1      39528
       19    1.401540e+07   1.340401e+07   8.839535e+03          1      41724
       20    1.349358e+07   1.340825e+07   9.145367e+03          1      43920

Terminating training
  Status         : iteration_limit
  Total time (s) : 9.145367e+03
  Total solves   : 43920
  Best bound     :  1.340825e+07
  Simulation CI  :  1.653153e+07 ± 3.064649e+06
------------------------------------------------------------------------------

I'm using the Gurobi parameter NumericFocus = 3.
I suspect that the warning may be the problem, but I'm not sure.

odow · 2021-12-07T01:53:59Z

So a few things:

- Setting the numeric focus will slow things down a lot.
- What are the 1e-5 terms in your constraints? You're running into numerical issues because you have big terms and small terms in the same problem. That can lead to all sorts of trouble. Consider reformulating your model to accept less accuracy in some variables (you don't need to measure water in a reservoir with m^3, for example; use million m^3 instead).
- You have a large number of variables in each subproblem. Are they contributing to the model in a useful way? Consider simplifications to the network.
- Consider adding realistic upper bounds to your variables. This can help quite a bit.
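As a rough illustration of the rescaling and bounds suggestions, here is a small JuMP sketch. The generator names and capacities are hypothetical, `sp` is a plain `Model()` standing in for an SDDP subproblem, and it is assumed, purely for illustration, that the smallest matrix coefficient in the log (8e-05) multiplies a volume measured in m^3.

```julia
using JuMP

# Hypothetical data, made up for illustration.
gens = ["g1", "g2"]
gen_max = Dict("g1" => 150.0, "g2" => 80.0)  # MW

# Stand-in for an SDDP subproblem; no solver is needed to declare variables.
sp = Model()

# Declare both bounds when the variable is created.
@variable(sp, 0 <= thermal_generation[g in gens] <= gen_max[g])

# Rescaling: if x is a volume in m^3 and y is the same volume in million m^3,
# then x = 1e6 * y, so a coefficient a on x becomes 1e6 * a on y.
a_m3 = 8e-5                  # assumed to be a coefficient on a volume in m^3
a_million_m3 = 1e6 * a_m3    # the same term expressed per million m^3

println(upper_bound(thermal_generation["g1"]))
println(a_million_m3)
```

Under that assumption, changing the unit moves the small coefficient from the 1e-5 range to the order of 1e+2, which narrows the matrix range reported in the numerical stability report.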

anavarro01 · 2021-12-07T21:32:12Z

Thanks for the answer.

- I've tested without the numeric focus and the time is just marginally better.
- I've fixed the 1e-5 terms in my constraints. They came from the water network of my model. The new model has matrix values between 1e-3 and 1e+3. I'm testing the times for this case.
- Yes, I have a large number of variables in each subproblem. I have another case with a simplified network, but for this case I need a detailed model of the power and hydro system.
- About the upper bounds of the variables, do you mean adding them like this? @variable(sp, thermal_generation[g in SetGens], lower_bound = 0, upper_bound = GenMax, start = 0) Currently I'm using just the lower bound. If I add the upper bound, can the model be faster?

Also, I've tested the asynchronous mode with a model slightly modified from the one at the beginning (note the matrix range). I know that I now have a model with a better range, but this run uses that earlier model.
The thing is that SDDP uses the 4 extra workers just once (with redundant cuts, I think) and never uses them again. I was watching the CPU usage of each worker, and the extra workers barely used more than 5% of a core after the 6th iteration (when the solver used them). In contrast, the main worker always used about 90 to 100% of a core. Also, in terms of RAM usage, the main worker used about 8 GB of RAM and the rest about 3 or 4 GB (there was about 30 GB of free RAM at that moment).
Here is the output:

                      SDDP.jl (c) Oscar Dowson, 2017-21

Problem
  Nodes           : 36
  State variables : 25
  Scenarios       : 1.03144e+64
  Existing cuts   : false
  Subproblem structure                                              : (min, max)
    Variables                                                       : (21107, 21107)
    GenericAffExpr{Float64,VariableRef} in MOI.GreaterThan{Float64} : (3352, 3354)
    VariableRef in MOI.LessThan{Float64}                            : (1, 1)
    GenericAffExpr{Float64,VariableRef} in MOI.Interval{Float64}    : (5585, 5585)
    VariableRef in MOI.GreaterThan{Float64}                         : (19568, 19568)
    GenericAffExpr{Float64,VariableRef} in MOI.LessThan{Float64}    : (5050, 5051)
    GenericAffExpr{Float64,VariableRef} in MOI.EqualTo{Float64}     : (6506, 6516)
Options
  Solver          : Asynchronous mode with 4 workers.
  Risk measure    : SDDP.Expectation()
  Sampling scheme : SDDP.InSampleMonteCarlo

Numerical stability report
  Non-zero Matrix range     [7e-04, 1e+04]
  Non-zero Objective range  [9e-02, 1e+06]
  Non-zero Bounds range     [5e+05, 5e+05]
  Non-zero RHS range        [1e-01, 4e+04]
No problems detected

 Iteration    Simulation       Bound         Time (s)    Proc. ID   # Solves
        1    5.423759e+07   8.267030e+05   4.416711e+02          1       2196
        2    2.940131e+07   8.782821e+06   7.309163e+02          1       4392
        3    5.217217e+07   8.782821e+06   7.387564e+02          2       4452
        4    4.352296e+07   8.782821e+06   7.465079e+02          3       4512
        5    5.251501e+07   8.782821e+06   7.541364e+02          5       4572
        6    4.686811e+07   8.782821e+06   7.621236e+02          4       4632
        7    1.719751e+07   1.037191e+07   1.096181e+03          1       6828
        8    1.844196e+07   1.037192e+07   1.422216e+03          1       9024
        9    1.547676e+07   1.118443e+07   1.725801e+03          1      11220
       10    1.574095e+07   1.118467e+07   2.099916e+03          1      13416
       11    1.661916e+07   1.118481e+07   2.460442e+03          1      15612
       12    1.545823e+07   1.118489e+07   2.782426e+03          1      17808
       13    1.357699e+07   1.271447e+07   3.092544e+03          1      20004
       14    1.477563e+07   1.271460e+07   3.466616e+03          1      22200
       15    1.953569e+07   1.271463e+07   3.803677e+03          1      24396
       16    1.790396e+07   1.274343e+07   4.153817e+03          1      26592
       17    1.501449e+07   1.274408e+07   4.489968e+03          1      28788
       18    1.861698e+07   1.274434e+07   4.840596e+03          1      30984
       19    1.214280e+07   1.284335e+07   5.198810e+03          1      33180
       20    1.743427e+07   1.284566e+07   5.553337e+03          1      35376
       21    1.358681e+07   1.287126e+07   5.906797e+03          1      37572
       22    1.456575e+07   1.287127e+07   6.252890e+03          1      39768
       23    1.869709e+07   1.298196e+07   6.575987e+03          1      41964
       24    1.438251e+07   1.298244e+07   6.954527e+03          1      44160
       25    1.394985e+07   1.298285e+07   7.338875e+03          1      46356
       26    1.427660e+07   1.298287e+07   7.682910e+03          1      48552
       27    1.350326e+07   1.298307e+07   8.120624e+03          1      50748
       28    1.367676e+07   1.308422e+07   8.451969e+03          1      52944
       29    1.396336e+07   1.308630e+07   8.860270e+03          1      55140
       30    1.664874e+07   1.308645e+07   9.281819e+03          1      57336

Terminating training
  Status         : iteration_limit
  Total time (s) : 9.281819e+03
  Total solves   : 57336
  Best bound     :  1.308645e+07
  Simulation CI  :  2.179678e+07 ± 4.737844e+06
------------------------------------------------------------------------------

I know that it is slow, but the strange thing is that it's not using the other workers again.

Thanks for everything!

odow · 2021-12-07T21:45:42Z

It probably has something to do with the channels to the remote processes disconnecting due to the time it takes to receive.

Unfortunately, I don't have the time to look into this in any detail. (What institution are you with?)

Here's the majority of the parallel code. It'd take some digging to find the problem:

SDDP.jl/src/plugins/parallel_schemes.jl

Lines 170 to 264 in 5ec8a2e

    
function master_loop(
    async::Asynchronous,
    model::PolicyGraph{T},
    options::Options,
) where {T}
    # Initialize the remote channels. There are two types:
    # 1) updates: master -> slaves[i]: a unique channel for each slave, which
    #       is used to distribute results found by other slaves.
    # 2) results: slaves -> master: a channel which slaves collectively push to
    #       to feed the master new results.
    updates = Dict(
        pid => Distributed.RemoteChannel(
            () -> Channel{IterationResult{T}}(Inf),
        ) for pid in async.slave_ids
    )
    results = Distributed.RemoteChannel(() -> Channel{IterationResult{T}}(Inf))
    futures = Distributed.Future[]
    _uninitialize_solver(model; throw_error = true)
    for pid in async.slave_ids
        let model_pid = model, options_pid = options
            f = Distributed.remotecall(
                slave_loop,
                pid,
                async,
                model_pid,
                options_pid,
                updates[pid],
                results,
            )
            push!(futures, f)
        end
    end
    _initialize_solver(model; throw_error = true)
    while true
        # Starting workers has a high overhead. We have to copy the models across, and then
        # precompile all the methods on every process :(. While that's happening, let's
        # start running iterations on master. It has the added benefit that if the master
        # is ever idle waiting for a result from a slave, it will do some useful work :).
        #
        # It also means that Asynchronous() can be our default setting, since if there are
        # no workers, ther should be no overhead, _and_ this inner loop is just the serial
        # implementation anyway.
        while async.use_master && !isready(results)
            result = iteration(model, options)
            for (_, ch) in updates
                put!(ch, result)
            end
            log_iteration(options)
            if result.has_converged
                close(results)
                wait.(futures)
                return result.status
            end
        end
        while !isready(results)
            sleep(1.0)
        end
        # We'll only reach here is isready(results) == true, so we won't hang waiting for a
        # new result on take!. After we receive a new result from a slave, there are a few
        # things to do:
        # 1) send the result to the other slaves
        # 2) update the master problem with the new cuts
        # 3) compute the revised bound, update the log, and print to screen
        # 4) test for convergence (e.g., bound stalling, time limit, iteration limit)
        # 5) Exit, killing the running task on the workers.
        result = take!(results)
        for pid in async.slave_ids
            if pid != result.pid
                put!(updates[pid], result)
            end
        end
        slave_update(model, result)
        bound = calculate_bound(model)
        push!(
            options.log,
            Log(
                length(options.log) + 1,
                bound,
                result.cumulative_value,
                time() - options.start_time,
                result.pid,
                model.ext[:total_solves],
                duality_log_key(options.duality_handler),
            ),
        )
        log_iteration(options)
        has_converged, status =
            convergence_test(model, options.log, options.stopping_rules)
        if has_converged
            close(results)
            wait.(futures)
            return status
        end
    end
    return

anavarro01 · 2021-12-09T14:48:03Z

Thanks for the answer.
I'm working with Pontificia Universidad Católica de Chile.
I will look at the parallel code to see what I find. I will check whether the processes get disconnected because of the time it takes to solve the subproblems.

Thanks!

odow · 2021-12-09T18:45:17Z

Ah so this is for the new Chilean model?

Try the undocumented option SDDP.Asynchronous(use_master = false)

parallel_scheme = SDDP.Asynchronous(use_master = false) do m
    env = Gurobi.Env()
    for node in values(m.nodes)
        set_optimizer(node.subproblem, () -> Gurobi.Optimizer(env))
        set_silent(node.subproblem)
    end
end

anavarro01 · 2021-12-09T20:27:39Z

Yes, I'm working with the Chilean model.
I've just tested the option with use_master = false and it worked! Now all the workers make new cuts during the iterations. Anyway, I think that every cut is not that effective as the past, but I will test the time and convergence to be sure about this.
The output of the training run is this:

------------------------------------------------------------------------------
                      SDDP.jl (c) Oscar Dowson, 2017-21

Problem
  Nodes           : 36
  State variables : 26
  Scenarios       : 1.03144e+64
  Existing cuts   : false
  Subproblem structure                                              : (min, max)
    Variables                                                       : (21109, 21109)
    GenericAffExpr{Float64,VariableRef} in MOI.GreaterThan{Float64} : (3352, 3354)
    VariableRef in MOI.LessThan{Float64}                            : (1, 1)
    GenericAffExpr{Float64,VariableRef} in MOI.Interval{Float64}    : (5585, 5585)
    VariableRef in MOI.GreaterThan{Float64}                         : (19568, 19568)
    GenericAffExpr{Float64,VariableRef} in MOI.LessThan{Float64}    : (5050, 5051)
    GenericAffExpr{Float64,VariableRef} in MOI.EqualTo{Float64}     : (6507, 6517)
Options
  Solver          : Asynchronous mode with 7 workers.
  Risk measure    : SDDP.Expectation()
  Sampling scheme : SDDP.InSampleMonteCarlo

Numerical stability report
  Non-zero Matrix range     [1e-03, 1e+03]
  Non-zero Objective range  [9e-02, 1e+06]
  Non-zero Bounds range     [5e+05, 5e+05]
  Non-zero RHS range        [1e-02, 4e+04]
No problems detected

 Iteration    Simulation       Bound         Time (s)    Proc. ID   # Solves
        1    1.104024e+08   6.981477e+05   5.127468e+02          3         60
        2    1.067315e+08   5.849218e+05   5.284406e+02          2        120
        3    9.880152e+07   6.981528e+05   5.483199e+02          4        180
        4    1.176214e+08   5.849218e+05   5.630537e+02          5        240
        5    1.101718e+08   5.849218e+05   5.796307e+02          6        300
        6    1.051674e+08   5.849218e+05   5.951814e+02          7        360
        7    1.037979e+08   6.981528e+05   6.077584e+02          8        420
        8    5.060291e+07   5.755976e+06   9.617168e+02          3        480
        9    5.258060e+07   5.788610e+06   9.809755e+02          2        540
       10    5.360229e+07   5.851016e+06   1.007858e+03          5        600
       11    5.022065e+07   5.851016e+06   1.019422e+03          6        660
       12    4.750682e+07   5.851016e+06   1.030757e+03          4        720
       13    5.085912e+07   5.851016e+06   1.042366e+03          7        780
       14    4.859930e+07   5.851016e+06   1.079381e+03          8        840
       15    1.957667e+07   5.851016e+06   1.392243e+03          3        900
       16    2.180245e+07   5.851016e+06   1.403444e+03          2        960
       17    2.287774e+07   5.912278e+06   1.415303e+03          5       1020
       18    1.647738e+07   5.912278e+06   1.426319e+03          6       1080
       19    2.357789e+07   6.048069e+06   1.436571e+03          7       1140
       20    2.668270e+07   6.048069e+06   1.447359e+03          4       1200
       21    2.818930e+07   6.048069e+06   1.490012e+03          8       1260
       22    1.360967e+07   6.048069e+06   1.774825e+03          3       1320
       23    1.069678e+07   6.048069e+06   1.801416e+03          2       1380
       24    1.635207e+07   7.168963e+06   1.819407e+03          6       1440
       25    1.477921e+07   7.168963e+06   1.837739e+03          5       1500
       26    2.053416e+07   7.168963e+06   1.850033e+03          7       1560
       27    2.138778e+07   7.168963e+06   1.867305e+03          4       1620
       28    1.942672e+07   7.168963e+06   1.909715e+03          8       1680
       29    1.574227e+07   7.168963e+06   2.190866e+03          3       1740
       30    1.549916e+07   7.168963e+06   2.224452e+03          2       1800

Terminating training
  Status         : iteration_limit
  Total time (s) : 2.224452e+03
  Total solves   : 1800
  Best bound     :  7.168963e+06
  Simulation CI  :  4.712925e+07 ± 1.306993e+07
------------------------------------------------------------------------------

Thanks!

odow · 2021-12-09T21:00:02Z

Okay. I obviously have some scheduling issues switching between the serial and parallel modes.

> Anyway, I think that every cut is not that effective as the past

Yes. This is true. You'll need more cuts to achieve the same bound compared with serial mode.

anavarro01 · 2021-12-10T15:40:09Z

I've been testing, and with use_master = false it's working really well.
I used a model with simplified transmission losses, and the Asynchronous mode (with 3 extra workers) needs about 55 iterations to pass the convergence test. The same model in Synchronous mode needs about 44 iterations, but it takes about twice as long as the asynchronous case.

Thanks!

odow · 2021-12-10T23:03:00Z

Okay, great that it's working. At some point I'll look into the master scheduling issue.

Be very careful with how you measure convergence. If you have 26 state variables and 36 stages, you're likely going to need hundreds or thousands of iterations. Read #178.

anavarro01 · 2021-12-11T19:57:51Z

Based on #178, I'm currently using SDDP.Statistical with a 2.5% confidence interval to define the convergence of the model. With 3 extra workers the model converges after about 50 or 55 training iterations, so I use 60 right now. By the way, it needs far fewer iterations when I use a reasonable lower bound for the objective function.

odow · 2021-12-11T21:20:45Z

That suggests the myopic policy (do the best now, ignore the future) is near optimal.

I'd still try a run with a lot more iterations (like 500) and compare plots of the two policies.
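One way to put numbers next to such a comparison is to summarize the simulated total cost of each policy. The sketch below assumes the simulations have the shape returned by `SDDP.simulate(model, N)`: a vector of replications, each a vector of per-stage dictionaries with a `:stage_objective` entry. The data here is stand-in data, not results from this thread, and the helper names are hypothetical.

```julia
using Statistics

# Total cost of each replication: the sum of its per-stage objectives.
total_costs(simulations) =
    [sum(stage[:stage_objective] for stage in sim) for sim in simulations]

# Sample mean and half-width of an approximate 95% confidence interval.
function mean_and_halfwidth(costs; z = 1.96)
    return mean(costs), z * std(costs) / sqrt(length(costs))
end

# Stand-in data with the assumed shape (2 replications, 2 stages each).
sims_60 = [
    [Dict(:stage_objective => 10.0), Dict(:stage_objective => 5.0)],
    [Dict(:stage_objective => 12.0), Dict(:stage_objective => 7.0)],
]
sims_500 = [
    [Dict(:stage_objective => 9.0), Dict(:stage_objective => 5.0)],
    [Dict(:stage_objective => 11.0), Dict(:stage_objective => 5.0)],
]

for (label, sims) in ["60 iterations" => sims_60, "500 iterations" => sims_500]
    m, hw = mean_and_halfwidth(total_costs(sims))
    println(label, ": ", m, " ± ", hw)
end
```

If the policy trained for many more iterations has a clearly lower simulated cost, the shorter run had probably not converged, whatever the stopping rule reported.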

odow · 2022-02-10T23:48:15Z

Closing because there doesn't seem to be anything actionable here. I'm aware that the parallel scheme needs work, but I don't have any concrete plans to work on it.

If, future reader, this is important for you, I'm available for paid consulting.

odow closed this as completed Nov 27, 2021

odow reopened this Dec 6, 2021

odow closed this as completed Feb 10, 2022
